COA Lecture Unit 3: Pipelining
7/28/2019 Coa Lecture Unit 3 Pipelining
Topics covered: Pipelining
Unit 3
Basic concepts
The speed of execution of programs can be improved in two ways:
- Use faster circuit technology to build the processor and the memory.
- Arrange the hardware so that a number of operations can be performed simultaneously. The number of operations performed per second is increased, although the elapsed time needed to perform any one operation is not changed.
Pipelining is an effective way of organizing concurrent activity in a computer system to improve the speed of execution of programs.
Basic concepts (contd..)
The processor executes a program by fetching and executing instructions one after the other. This is known as sequential execution.
If Fi refers to the fetch step, and Ei refers to the execution step of instruction Ii,
then sequential execution looks like:
[Figure: sequential execution timeline — F1, E1, F2, E2, F3, E3 follow one another along the time axis.]
What if the execution of one instruction is overlapped with the fetching of the
next one?
Basic concepts (contd..)
[Figure: an instruction fetch unit feeding an execution unit through interstage buffer B1.]
The computer has two separate hardware units, one for fetching instructions and one for executing them. An instruction is fetched by the instruction fetch unit and deposited in an intermediate buffer, B1. The buffer enables the instruction execution unit to execute the instruction while the fetch unit is fetching the next instruction. Results of the execution are deposited in the destination location specified by the instruction.
Basic concepts (contd..)
[Figure: two-stage pipelined execution — instructions I1, I2, I3 flow through F and E steps, overlapped across clock cycles 1-4.]
Computer is controlled by a clock whose period is such that the fetch and execute
steps of any instruction can be completed in one clock cycle.
First clock cycle:
- Fetch unit fetches an instruction I1 (F1) and stores it in B1.
Second clock cycle:
- Fetch unit fetches an instruction I2 (F2), and execution unit executes instruction I1 (E1).
Third clock cycle:
- Fetch unit fetches an instruction I3 (F3), and execution unit executes instruction I2 (E2).
Fourth clock cycle:
- Execution unit executes instruction I3 (E3).
Basic concepts (contd..)
In each clock cycle, the fetch unit fetches the next instruction while the execution unit executes the current instruction stored in the interstage buffer; both units can be kept busy all the time. If this pattern of fetch and execute can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by sequential operation.
The fetch and execute units constitute a two-stage pipeline. Each stage performs one step in processing an instruction. The interstage buffer holds the information that needs to be passed from the fetch stage to the execute stage. New information is loaded into the buffer every clock cycle.
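The two-stage overlap can be sketched as a toy schedule generator (not part of the original slides; the F/E step labels follow the figure's notation, and one cycle per step is assumed):

```python
def two_stage_schedule(n):
    """Map each clock cycle to the steps active in it, for n instructions
    flowing through a fetch/execute pipeline (one cycle per step)."""
    schedule = {}
    for i in range(1, n + 1):
        schedule.setdefault(i, []).append(f"F{i}")      # fetch in cycle i
        schedule.setdefault(i + 1, []).append(f"E{i}")  # execute one cycle later
    return schedule

for cycle, steps in sorted(two_stage_schedule(3).items()):
    print(cycle, steps)
```

Cycle 2 shows E1 and F2 proceeding together, matching the overlap described above.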
Basic concepts (contd..)
Suppose the processing of an instruction is divided into four steps:
F Fetch: Read the instruction from the memory.
D Decode: Decode the instruction and fetch the source operands.
E Execute: Perform the operation specified by the instruction.
W Write: Store the result in the destination location.
There is a distinct hardware unit for each of the four steps. Information is passed from one unit to the next through an interstage buffer; three interstage buffers connect the four units. As an instruction progresses through the pipeline, the information needed by the downstream units must be passed along.
[Figure: four-stage pipeline — F (fetch instruction) → B1 → D (decode instruction and fetch operands) → B2 → E (execute operation) → B3 → W (write results).]
Basic concepts (contd..)
[Figure: four-stage pipeline timing — instructions I1-I4 flow through F, D, E, W across clock cycles 1-7.]
Clock cycle 1: F1
Clock cycle 2: D1, F2
Clock cycle 3: E1, D2, F3
Clock cycle 4: W1, E2, D3, F4
Clock cycle 5: W2, E3, D4
Clock cycle 6: W3, E4
Clock cycle 7: W4
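This cycle-by-cycle listing generalizes: in an ideal k-stage pipeline, instruction i occupies stage s (counting from 0) in clock cycle i + s. A small sketch (illustrative only):

```python
STAGES = ["F", "D", "E", "W"]

def pipeline_schedule(n_instructions):
    """Map each clock cycle to the steps active in it (ideal 4-stage pipeline)."""
    schedule = {}
    for i in range(1, n_instructions + 1):
        for s, stage in enumerate(STAGES):
            cycle = i + s            # instruction i is in stage s at cycle i + s
            schedule.setdefault(cycle, []).append(f"{stage}{i}")
    return schedule

sched = pipeline_schedule(4)
print(sched[4])   # cycle 4: four different instructions active at once
```

Four instructions finish in cycle 7 here, exactly as in the listing above.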
Basic concepts (contd..)
[Figure: the same four-stage pipeline timing for instructions I1-I4 over clock cycles 1-7.]
During clock cycle 4:
- Buffer B1 holds instruction I3, which is being decoded by the instruction-decoding unit. Instruction I3 was fetched in cycle 3.
- Buffer B2 holds the source and destination operands for instruction I2. It also holds the information needed for the Write step (W2) of instruction I2; this information will be passed to stage W in the following clock cycle.
- Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.
Role of cache memory
Each stage in the pipeline is expected to complete its operation in one clock cycle:
- The clock period should be long enough to complete the longest task.
- Units that complete their tasks early remain idle for the remainder of the clock period.
Tasks performed in different stages should require about the same amount of time for pipelining to be effective.
If instructions were fetched from the main memory, the instruction fetch stage could take as much as ten times longer than the other stage operations inside the processor. However, if instructions are fetched from the cache memory on the processor chip, the time required to fetch an instruction is more or less similar to the time required for the other basic operations.
The pipeline completes an instruction each clock cycle, and is therefore four times as fast as execution without a pipeline — as long as nothing takes more than one cycle. But sometimes things take longer: for example, most Execute steps such as ADD take one clock cycle, but suppose DIVIDE takes three.
[Figure annotations: while the long Execute step is in progress, the other stages idle — Write has nothing to write, and Decode and Fetch cannot use their output buffers.]
Pipeline performance
The potential increase in performance achieved by using pipelining is proportional to the number of pipeline stages. For example, with four pipeline stages, the rate of instruction processing is four times that of sequential execution of instructions.
Pipelining does not cause a single instruction to be executed faster; it is the throughput that increases.
This rate can be achieved only if pipelined operation can be sustained without interruption throughout program execution. If a pipelined operation cannot be sustained without interruption, the pipeline is said to stall.
A condition that causes the pipeline to stall is called a hazard.
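The throughput claim can be made concrete with the standard ideal-pipeline arithmetic (a sketch; `stall_cycles` simply lumps all hazard-induced stalls together):

```python
def pipelined_cycles(n_instructions, n_stages, stall_cycles=0):
    """Cycles to finish n instructions on an ideal k-stage pipeline, plus stalls."""
    return n_stages + (n_instructions - 1) + stall_cycles

def speedup(n_instructions, n_stages, stall_cycles=0):
    sequential = n_instructions * n_stages          # one instruction at a time
    return sequential / pipelined_cycles(n_instructions, n_stages, stall_cycles)

print(speedup(1000, 4))       # approaches 4 for long instruction streams
print(speedup(1000, 4, 200))  # stalls pull the speedup below the ideal
```

The speedup approaches the stage count only for long, stall-free instruction streams, which is exactly the "sustained without interruption" condition above.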
Data hazard
A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline. Execution of an instruction occurs in the E stage of the pipeline. Most arithmetic and logic operations take only one clock cycle; however, some operations such as division take more time to complete. For example, the operation specified in instruction I2 below takes three cycles to complete, from cycle 4 to cycle 6.
[Figure: pipeline timing for I1-I4 with instruction I2's Execute step stretched over cycles 4-6.]
Data hazard (contd..)
[Figure: the same timing — E2 occupies cycles 4-6, delaying the steps of I3 and I4 behind it.]
In cycles 5 and 6, the Write stage is idle because it has no data to work with. The information in buffer B2 must be retained until the execution of instruction I2 is complete. Stage 2, and by extension stage 1, cannot accept new instructions because the information in B1 cannot be overwritten; steps D4 and F5 must be postponed.
A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline.
Data Hazards
Situations that cause the pipeline to stall because the data to be operated on are delayed — for example, when an Execute step takes extra cycles.
Control or instruction hazard
The pipeline may be stalled because an instruction is not available at the expected time. For example, a cache miss may occur while fetching an instruction, and the instruction then has to be fetched from the main memory. Fetching an instruction from the main memory takes much longer than fetching it from the cache, so the fetch cycle of the instruction cannot be completed in one clock cycle. In the example below, the fetch of instruction I2 results in a cache miss; thus, F2 takes 4 clock cycles instead of 1.
[Figure: pipeline timing over clock cycles 1-9 — F2 is extended over cycles 2-5 because of a cache miss, delaying every later step of I2 and I3.]
An instruction hazard (or control hazard) has caused the pipeline to stall: instruction I2 was not in the cache and required a main memory access.
Control or instruction hazard (contd..)
The fetch operation for instruction I2 results in a cache miss, and the instruction fetch unit must fetch this instruction from the main memory. Suppose fetching I2 from the main memory takes 4 clock cycles; I2 will then be available in buffer B1 at the end of clock cycle 5, and the pipeline resumes its normal operation at that point.
- The Decode unit is idle in cycles 3 through 5.
- The Execute unit is idle in cycles 4 through 6.
- The Write unit is idle in cycles 5 through 7.
Such idle periods are called stalls or bubbles. Once created in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.
[Figure: stage view of the same stall over clock cycles 1-9 — F2 repeats in cycles 2-5 while the D, E, and W stages each sit idle for three cycles before W3 completes in cycle 9.]
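The stall behavior can be modeled by charging extra cycles to one step and making every later instruction wait both for its own data and for a free stage unit (a toy in-order model, not real pipeline control logic):

```python
def schedule_with_latencies(latencies):
    """latencies: one dict per instruction, e.g. {"F": 4} to stretch its fetch.
    Returns, per instruction, the cycle in which each stage finishes,
    for a simple in-order 4-stage pipeline."""
    stages = ["F", "D", "E", "W"]
    stage_free = {s: 0 for s in stages}   # cycle after which each unit is free
    finish = []
    for inst in latencies:
        done, row = 0, {}
        for s in stages:
            lat = inst.get(s, 1)
            start = max(done, stage_free[s])  # wait for data and for the unit
            done = start + lat
            stage_free[s] = done
            row[s] = done
        finish.append(row)
    return finish

# I2 suffers a cache miss: its fetch takes 4 cycles instead of 1.
rows = schedule_with_latencies([{}, {"F": 4}, {}])
print([r["W"] for r in rows])   # write-back cycles of I1, I2, I3
```

With I2's fetch stretched to 4 cycles, W2 lands in cycle 8 and W3 in cycle 9, matching the figure's nine-cycle schedule.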
Structural hazard
Two instructions require the use of a hardware resource at the same time. The most common case is access to the memory:
- One instruction needs to access the memory as part of its Execute or Write stage while another instruction is being fetched.
- If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed.
Many processors have separate data and instruction caches to avoid this delay. In general, structural hazards can be avoided by providing sufficient resources on the processor chip.
Structural hazard (contd..)
[Figure: pipeline timing over clock cycles 1-7 for I2 = Load X(R1), R2 — E2 in cycle 4, memory access M2 in cycle 5, W2 in cycle 6; D4 and the fetch of I5 are delayed.]
The memory address X + [R1] is computed in step E2 in cycle 4, the memory access takes place in cycle 5, and the operand read from memory is written into register R2 in cycle 6. Execution of instruction I2 thus takes two clock cycles, 4 and 5. In cycle 6, both instructions I2 and I3 require access to the register file; the pipeline is stalled because the register file cannot handle two operations at once.
[Figure annotations: I2 calculates the address, then takes an extra cycle for cache access as part of execution; while I2 is writing to the register file, I3 must wait for the register file, and I5's fetch is delayed.]
Pipelining and performance
Pipelining does not cause an individual instruction to be executed faster; rather, it increases the throughput — the rate at which instruction execution is completed. When a hazard occurs, one of the stages in the pipeline cannot complete its operation in one clock cycle, and the pipeline stalls, causing a degradation in performance. A performance level of one instruction completion per clock cycle is the upper limit on the throughput that can be achieved in a pipelined processor.
Data hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed. Consider two instructions:
I1: A = 3 + A
I2: B = 4 x A
If A = 5 and I1 and I2 are executed sequentially, B = 32. In a pipelined processor, the execution of I2 can begin before the execution of I1 completes; the value of A used in the execution of I2 would then be the original value of 5, leading to an incorrect result. Thus, instructions I1 and I2 depend on each other: the data used by I2 depend on the results generated by I1.
Results obtained using sequential execution of instructions should be the same as the results obtained from pipelined execution. When two instructions depend on each other, they must be performed in the correct order.
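The incorrect result can be demonstrated directly by modeling an operand read that happens before the earlier write-back (a toy model of the hazard, not real hardware):

```python
# Sequential execution: I2 sees the value of A produced by I1.
A = 5
A = 3 + A           # I1
B = 4 * A           # I2 uses the new A (8)
assert B == 32

# A naive overlap that reads A before I1 has written it back.
A = 5
stale_A = A         # I2's operand fetch happens before I1's write-back
A = 3 + A           # I1 completes
B_wrong = 4 * stale_A
assert B_wrong == 20   # incorrect result caused by the data hazard
```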
Concurrency
A ← 3 + A
B ← 4 x A
Cannot be performed concurrently — the result is incorrect if the new value of A is not used.

A ← 5 x C
B ← 20 + C
Can be performed concurrently (or in either order) without affecting the result.
Concurrency
A ← 3 + A
B ← 4 x A
The second operation depends on completion of the first operation.

A ← 5 x C
B ← 20 + C
The two operations are independent.
MUL R2, R3, R4   (will write its result in R4)
ADD R5, R4, R6   (dependent on the result in R4 from the previous instruction — it cannot finish decoding until the result is in R4)
Data hazards (contd..)
[Figure: pipeline timing over clock cycles 1-9 for Mul R2, R3, R4 followed by Add R5, R4, R6 — the Add's decode begins as D2A in cycle 3 but cannot complete (D2) until cycle 5, after W1; the steps of I3 and I4 are stalled behind it.]
The Mul instruction places the result of the multiply operation in register R4 at the end of clock cycle 4. Register R4 is used as a source operand in the Add instruction, so the Decode unit decoding the Add instruction cannot proceed until the Write step of the first instruction is complete. The data dependency arises because the destination of one instruction is used as a source in the next instruction.
Operand forwarding
This data hazard occurs because the destination of one instruction is used as the source in the next instruction: instruction I2 has to wait for the data to be written into the register file by the Write stage at the end of step W1. However, these data are available at the output of the ALU once the Execute stage completes step E1. The delay can be reduced, or even eliminated, if the result of instruction I1 can be forwarded directly for use in step E2. This is called operand forwarding.
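Forwarding can be sketched as a bypass check in front of the ALU's operand inputs (a toy model; the register contents and the `exec_result` latch structure are invented for illustration, and Add R5, R4, R6 is read with the destination last, as on these slides):

```python
def read_operand(reg, regfile, exec_result):
    """Pick an operand for the E stage, bypassing the register file when the
    previous instruction's result (still at the ALU output, not yet written
    back) targets the same register — i.e., operand forwarding."""
    if exec_result is not None and exec_result["dest"] == reg:
        return exec_result["value"]        # forwarded from the ALU output
    return regfile[reg]                    # normal register-file read

regfile = {"R2": 3, "R3": 4, "R4": 0, "R5": 10, "R6": 0}

# I1: Mul R2, R3, R4 — the product sits at the ALU output, write-back pending.
i1_result = {"dest": "R4", "value": regfile["R2"] * regfile["R3"]}

# I2: Add R5, R4, R6 (R6 <- [R5] + [R4]); R4 is forwarded, R5 is read normally.
src1 = read_operand("R5", regfile, i1_result)
src2 = read_operand("R4", regfile, i1_result)
regfile["R6"] = src1 + src2
print(regfile["R6"])   # correct sum despite the pending write-back
```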
[Figure: the same instruction pair shown twice — once with a pipeline stall, and once with data forwarding eliminating it.]
[Figure: datapath values for MUL R2, R3, R4 followed by ADD R5, R4, R6 — the product R2 x R3 flows toward R4 while the ADD computes R4 + R5 into R6.]
If solved by software:
MUL R2, R3, R4
NOP
NOP
ADD R5, R4, R6
Otherwise a 2-cycle stall is introduced by the hardware (if there is no data forwarding).
[Figure: forwarding paths for MUL R2, R3, R4 — the operands come from R2 and R3, and the product R2 x R3 goes to R4 and is also routed directly to I2.]
Operand forwarding (contd..)
[Figure: forwarding hardware — the register file loads the SRC1 and SRC2 latches feeding the ALU; the RSLT latch at the ALU output has a forwarding connection back to the ALU inputs.]
I1: Mul R2, R3, R4
I2: Add R5, R4, R6
Clock cycle 3:
- Instruction I2 is decoded, and a data dependency is detected.
- The operand not involved in the dependency, register R5, is loaded into register SRC1.
Clock cycle 4:
- The product produced by I1 is available in register RSLT.
- The forwarding connection allows the result to be used in step E2.
Instruction I2 proceeds without interruption.
Handling data dependency in software
A data dependency may be detected by the hardware while decoding the instruction: the control hardware may delay reading a register by an appropriate number of clock cycles until its contents become available, and the pipeline stalls for that many clock cycles.
Detecting data dependencies and handling them can also be accomplished in software: the compiler can introduce the necessary delay by inserting an appropriate number of NOP instructions. For example, if a two-cycle delay is needed between two instructions, two NOP instructions can be inserted between them:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
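Such a compiler pass can be sketched as a single scan over the instruction list (a toy scheduler; the tuple encoding of instructions is invented for illustration, with the destination register last as on these slides):

```python
def insert_nops(program, delay=2):
    """program: list of (opcode, sources, dest) tuples.
    Insert `delay` NOPs whenever an instruction reads a register
    written by the immediately preceding instruction."""
    out = []
    prev_dest = None
    for op, srcs, dest in program:
        if prev_dest is not None and prev_dest in srcs:
            out.extend([("NOP", (), None)] * delay)
        out.append((op, srcs, dest))
        prev_dest = dest
    return out

prog = [("MUL", ("R2", "R3"), "R4"),
        ("ADD", ("R5", "R4"), "R6")]
for instr in insert_nops(prog):
    print(instr[0])
```

Independent instruction pairs pass through unchanged, so NOPs cost cycles only where a real dependency exists.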
Side effects
Data dependencies are explicit and easy to detect when a register specified as the destination in one instruction is used as a source in a subsequent instruction. However, some instructions also modify registers that are not specified as the destination. For example, in the autoincrement and autodecrement addressing modes, the source register is modified as well. When a location other than the one explicitly specified in the instruction as a destination is affected, the instruction is said to have a side effect. Another example of a side effect is condition code flags, which implicitly record the result of the previous instruction; these results may be used by a subsequent instruction.
Side effects (contd..)
I1: Add R3, R4
I2: AddWithCarry R2, R4
Instruction I1 sets the carry flag and instruction I2 uses it, leading to an implicit dependency between the two instructions.
Instructions with side effects can lead to multiple data dependencies, resulting in a significant increase in the complexity of the hardware or software needed to handle them. Side effects should therefore be kept to a minimum in instruction sets designed for execution on pipelined hardware.
Instruction hazards
The instruction fetch unit fetches instructions and supplies the execution units with a steady stream of instructions. If the stream is interrupted, the pipeline stalls. The stream of instructions may be interrupted by a cache miss or a branch instruction.
Instruction hazards (contd..)
Consider a two-stage pipeline: the first stage is instruction fetch and the second is instruction execute. Instructions I1, I2, and I3 are stored at successive memory locations. I2 is an unconditional branch instruction whose branch target is instruction Ik.
Clock cycle 3:
- The fetch unit is fetching instruction I3.
- The execute unit is decoding I2 and computing the branch target address.
Clock cycle 4:
- The processor must discard I3, which has been incorrectly fetched, and fetch Ik.
- The execution unit is idle, and the pipeline stalls for one clock cycle.
Instruction hazards (contd..)
[Figure: two-stage pipeline timing over clock cycles 1-6 — I2 is a branch; I3 is fetched and then discarded (X), Ik is fetched in cycle 4, and the execution unit is idle for one cycle.]
The pipeline stalls for one clock cycle. The time lost as a result of a branch instruction is called the branch penalty; here the branch penalty is one clock cycle.
The lost cycle is the branch penalty.
Instruction hazards (contd..)
The branch penalty depends on the length of the pipeline and may be higher for a longer pipeline. For a four-stage pipeline:
- The branch target address is computed in stage E2.
- Instructions I3 and I4 have to be discarded.
- The execution unit is idle for 2 clock cycles.
- The branch penalty is 2 clock cycles.
[Figure: four-stage pipeline timing over clock cycles 1-8 — I2 is a branch; I3 and I4 are fetched and discarded (X), and Ik enters the pipeline in cycle 5.]
In a four-stage pipeline, the penalty is two clock cycles.
Instruction hazards (contd..)
The branch penalty can be reduced by computing the branch target address earlier in the pipeline. The instruction fetch unit has special hardware to identify a branch instruction after it is fetched, so the branch target address can be computed in the Decode stage (D2) rather than in the Execute stage (E2). The branch penalty is then only one clock cycle.
[Figure: with the target computed in D2, only I3 is discarded (X); Ik enters the pipeline in cycle 4, so the penalty is one cycle.]
Instruction hazards (contd..)
[Figure: instruction fetch unit with an instruction queue feeding a dispatch/decode unit (D), followed by Execute (E) and Write (W) stages.]
The fetch unit fetches instructions before they are needed and stores them in a queue; the queue can hold several instructions. The dispatch unit takes instructions from the front of the queue and dispatches them to the execution unit; the dispatch unit also decodes the instruction.
Instruction hazards (contd..)
The fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions.
If the pipeline stalls because of a data hazard, the dispatch unit cannot issue instructions from the queue, but the fetch unit continues to fetch instructions and add them to the queue. Conversely, if fetching is delayed because of a cache miss or a branch, the dispatch unit continues to dispatch instructions from the instruction queue.
Instruction hazards (contd..)
[Figure: pipeline timing with an instruction queue over clock cycles 1-10 — I1's Execute step stretches over cycles 3-5; I5 is a branch whose target Ik is fetched in cycle 7 while I6 is discarded (X); the queue length per cycle is 1, 1, 1, 1, 2, 3, 2, 1, 1, 1.]
The initial length of the queue is 1. A fetch adds 1 to the queue length and a dispatch reduces it by 1, so the length remains the same for the first 4 clock cycles. I1 stalls the pipeline for 2 cycles; the queue has space, so the fetch unit continues, and the queue length rises to 3 in clock cycle 6.
I5 is a branch instruction with target instruction Ik. Ik is fetched in cycle 7 and I6 is discarded. However, this does not stall the pipeline, since I4 is dispatched instead: I2, I3, I4, and Ik are executed in successive clock cycles. The fetch unit computes the branch address concurrently with the execution of other instructions; this is called branch folding.
With the queue, the pipeline completes an instruction each clock cycle and incurs no branch penalty.
Instruction hazards (contd..)
Branch folding can occur only if there is at least one instruction in the queue other than the branch instruction, so the queue should ideally be full most of the time. This requires increasing the rate at which the fetch unit reads instructions from the cache; most processors allow more than one instruction to be fetched from the cache in one clock cycle. The fetch unit must also replenish the queue quickly after a branch has occurred.
The instruction queue also mitigates the impact of cache misses: in the event of a cache miss, the dispatch unit continues to send instructions to the execution unit as long as the queue is not empty, while the desired cache block is read. If the queue does not become empty, the cache miss has no effect on the rate of instruction execution.
Conditional branches and branch prediction
Conditional branch instructions depend on the result of a preceding instruction: the decision on whether to branch cannot be made until execution of that preceding instruction is complete.
Branch instructions represent about 20% of the dynamic instruction count of most programs (the dynamic count takes into consideration that some instructions are executed repeatedly). Branch instructions may therefore incur branch penalties that reduce the performance gains expected from pipelining. Several techniques exist to mitigate the negative impact of the branch penalty on performance.
Delayed branch
Suppose the branch target address is computed in stage E2, so instructions I3 and I4 have to be discarded. The locations following a branch instruction are called branch delay slots. There may be more than one branch delay slot, depending on the time it takes to determine whether the instruction is a branch; in this case there are two branch delay slots. The instructions in the delay slots are always fetched, and at least partially executed, before the branch decision is made and the branch address is computed.
[Figure: four-stage pipeline timing over clock cycles 1-8 — I3 and I4 occupy the two branch delay slots and are discarded (X) when the branch is taken.]
If the branch address is calculated in the Execute stage, the penalty is two clock cycles in a four-stage pipeline, giving two branch delay slots.
If the branch address is calculated in the Decode stage instead, the delay-slot instruction is fetched and decoded but discarded; the penalty is reduced to one cycle, giving one branch delay slot.
Delayed branch (contd..)
Delayed branching can minimize the penalty incurred as a result of conditional branch instructions. Since the instructions in the delay slots are always fetched and partially executed, it is better to arrange for them to be fully executed whether or not the branch is taken. If we can place useful instructions in these slots, they will always do useful work whether or not the branch is taken; if we cannot, we fill the slots with NOP instructions.
Delayed branch (contd..)
(a) Original program loop:
LOOP  Shift_left R1
      Decrement  R2
      Branch=0   LOOP
NEXT  Add        R1,R3

(b) Reordered instructions:
LOOP  Decrement  R2
      Branch=0   LOOP
      Shift_left R1
NEXT  Add        R1,R3

Register R2 is used as a counter to determine how many times R1 is to be shifted. The processor has a two-stage pipeline, i.e., one delay slot. The instructions can be reordered so that the Shift_left instruction appears in the delay slot; the Shift_left instruction is then always executed, whether the branch condition is true or false.
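The effect of the reordering can be checked with a tiny interpreter for the reordered loop (a toy model with one delay slot; the slide's Branch=0 test is modeled here as "branch back to LOOP while the decremented count is nonzero", which is an assumption about the intended semantics):

```python
def run_reordered_loop(r1, r2):
    """Model of the reordered loop:
    LOOP: Decrement R2 ; Branch (back to LOOP while R2 != 0) ; Shift_left R1.
    The Shift_left in the delay slot executes on every pass, including the
    final one where the branch falls through to NEXT."""
    while True:
        r2 -= 1                  # Decrement R2
        branch_taken = r2 != 0   # branch decision (assumed loop-while-nonzero)
        r1 <<= 1                 # delay-slot Shift_left: executes either way
        if not branch_taken:
            break
    return r1, r2

r1, r2 = run_reordered_loop(1, 4)
print(r1, r2)   # R1 shifted once per pass; R2 counted down to 0
```

Because the delay-slot shift runs on every pass, R1 is shifted exactly R2 times — the same result the original ordering produces.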
Delayed branch (contd..)
[Figure: two-stage (F, E) timing of the reordered loop over clock cycles 1-8 — Decrement, Branch, Shift (delay slot), then either Decrement again (branch taken) or Add (branch not taken).]
The Shift instruction is executed whether the branch is taken or not.
Delayed branch (contd..)
Logically, the program executes as if the branch instruction were placed after the Shift instruction: branching takes place one instruction later than where the branch instruction appears in the instruction sequence (with reference to the reordered instructions). Hence, this technique is called delayed branch.
Delayed branch requires reordering as many instructions as there are delay slots. It is usually possible to find one instruction to fill one delay slot, but difficult to find two or more instructions to fill two or more delay slots.
Branch prediction
To reduce the branch penalty associated with conditional branches, we can predict whether the branch will be taken. The simplest form of branch prediction is to assume that the branch will not take place and to continue fetching instructions in sequential execution order. Until the branch condition is evaluated, instruction execution along the predicted path must proceed on a speculative basis.
Speculative execution means that the processor is executing instructions before it is certain that they are in the correct sequence; processor registers and memory locations must not be updated until the sequence is confirmed. If the branch prediction turns out to be wrong, the instructions that were executed speculatively and their data must be purged, and the correct sequence of instructions must be fetched and executed.
Branch prediction (contd..)
[Figure: pipeline timing over clock cycles 1-6 — I1 (Compare), I2 (Branch>0) with a combined decode/predict step D2/P2 in cycle 3; I3 and I4 are fetched speculatively and discarded (X) on a misprediction, after which Ik is fetched.]
I1 is a compare instruction and I2 is a branch instruction. Branch prediction takes place in cycle 3, while I2 is being decoded and I3 is being fetched. The fetch unit predicts that the branch will not be taken and continues to fetch I4 in cycle 4, while I3 is being decoded. The results of I1 are available in cycle 3, and the fetch unit evaluates the branch condition in cycle 4. If the prediction is incorrect, the fetch unit realizes it at this point: I3 and I4 are discarded and Ik is fetched from the branch target address.
Branch prediction (contd..)
The decision about which way to predict the result of a branch instruction (taken or not taken) may be made in hardware, based on whether the target address of the branch is lower or higher than the address of the branch instruction itself:
- If the target address is lower, the branch is predicted as taken.
- If the target address is higher, the branch is predicted as not taken.
Branch prediction can also be handled by the compiler: the compiler can set a branch prediction bit to 0 or 1 to indicate the desired behavior, and the instruction fetch unit checks this bit to predict whether the branch will be taken.
Branch prediction (contd..)
If the branch prediction decision is the same every time a given instruction is executed, this is static branch prediction. If the prediction decision may change depending on the execution history, this is dynamic branch prediction.
Branch prediction (contd..)
A simple branch prediction algorithm may be described as a machine with two states:
LT: Branch is likely to be taken
LNT: Branch is likely not to be taken
Let the initial state of the machine be LNT. When the branch instruction is executed, if the branch is taken, the machine moves to state LT; if not, it remains in state LNT. When the same branch instruction is executed the next time, the branch is predicted as taken if the state is LT, and as not taken otherwise.
[State diagram: LNT stays in LNT on branch not taken (BNT) and moves to LT on branch taken (BT); LT stays in LT on BT and returns to LNT on BNT.]
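The two-state scheme fits in a few lines (a sketch; `outcomes` is a hypothetical sequence of actual branch results, True = taken):

```python
def one_bit_predictor(outcomes, state="LNT"):
    """Run the two-state (one history bit) predictor over a sequence of
    actual branch outcomes. Returns the prediction made before each one."""
    predictions = []
    for taken in outcomes:
        predictions.append(state == "LT")   # predict taken only in state LT
        state = "LT" if taken else "LNT"    # remember only the last outcome
    return predictions

# A branch closing a 4-pass loop: taken three times, then falls through.
outcomes = [True, True, True, False]
preds = one_bit_predictor(outcomes)
mispredictions = sum(p != o for p, o in zip(preds, outcomes))
print(preds, mispredictions)
```

The first pass and the last pass are mispredicted — the loop behavior discussed on the next slide.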
Branch prediction (contd..)
This scheme requires only one bit of history information for each branch instruction, and it works well inside loops: once a loop is entered, the branch instruction that controls the looping yields the same result on every pass until the last one. On the last pass the prediction turns out to be incorrect, and the branch history state machine changes to the opposite state. As a result, if the same loop is entered again and makes more than one pass, the first prediction will again be wrong. Better performance may be achieved by keeping more execution history.
Branch prediction (contd..)
[State diagram: four states with transitions on branch taken (BT) and branch not taken (BNT).]
ST: Strongly likely to be taken
LT: Likely to be taken
LNT: Likely not to be taken
SNT: Strongly likely not to be taken
Initial state of the algorithm is LNT.
After the branch instruction is executed, ifthe branch is taken, the state is changed to STFor a branch instruction, the fetch unit predicts
that the branch will be taken if the state is ST
or LT, else it predicts that the branch will not
be taken.
In state SNT:- The prediction is that the branch is not taken.
- If the branch is actually taken, the state
changes to LNT.
- Next time the branch is encountered, the
prediction again is that it is not taken.
- If the prediction is wrong the second time,the state changes to ST.
- After that, the branch is predicted as taken.
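The four-state machine can be sketched as follows. The transitions spelled out on these slides (LNT to ST on a taken branch, ST to LT on a not-taken branch, SNT to LNT on a taken branch) come from the text; the remaining transitions (out of LT, and LNT to SNT on a not-taken branch) follow the usual four-state figure and are assumptions here.

```python
# Sketch of the four-state branch predictor. Transitions marked
# "assumed" are not stated explicitly on the slides.

TRANSITIONS = {
    ("SNT", True):  "LNT",   # described on the slide
    ("SNT", False): "SNT",
    ("LNT", True):  "ST",    # described on the slide
    ("LNT", False): "SNT",   # assumed
    ("LT",  True):  "ST",    # assumed
    ("LT",  False): "SNT",   # assumed
    ("ST",  True):  "ST",
    ("ST",  False): "LT",    # described on the slide
}

class TwoBitPredictor:
    def __init__(self):
        self.state = "LNT"   # initial state, as on the slide

    def predict(self):
        """Predict taken when the state is ST or LT."""
        return self.state in ("ST", "LT")

    def update(self, taken):
        """Move to the next state based on the actual outcome."""
        self.state = TRANSITIONS[(self.state, taken)]
```

Tracing the loop example of the next slide (three taken passes, then one not-taken pass): the first entry mispredicts on the first and last passes, but the second entry, starting in LT, mispredicts only on the last pass.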
-
Branch prediction (contd..)
Consider a loop with a branch instruction at the end.
The initial state of the branch prediction algorithm is LNT.
In the first pass, the algorithm will predict that the branch is not taken. This prediction will be incorrect, and the state of the algorithm will change to ST.
In the subsequent passes, the algorithm will predict that the branch is taken. The prediction will be correct, except for the last pass.
In the last pass, the branch is not taken, and the state will change from ST to LT.
When the loop is entered the second time, the algorithm will predict that the branch is taken. This prediction will be correct.
-
Branch prediction (contd..)
The branch prediction algorithm mispredicts the outcome of the branch instruction in the first pass.
The prediction in the first pass depends on the initial state of the branch prediction algorithm.
If the initial state can be set correctly, the misprediction in the first pass can be avoided.
The information necessary to set the initial state can be provided by static prediction schemes:
Comparing the branch target address with the address of the branch instruction.
Checking a branch prediction bit set by the compiler.
For a branch instruction at the end of a loop, the initial state is set to LT; for a branch instruction at the start of a loop, it is set to LNT.
With this, the only misprediction that occurs is on the final pass through the loop. This misprediction is unavoidable.
-
Topics covered: Pipelining
Influence on Instruction Sets
-
Overview
Some instructions are much better suited to pipelined execution than others:
Addressing modes
Condition code flags
-
Addressing Modes
Addressing modes include simple ones and complex ones.
In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline:
Side effects
The extent to which complex addressing modes cause the pipeline to stall
Whether a given mode is likely to be used by compilers
-
Recall
Load X(R1), R2
Load (R1), R2
[Figure 8.5. Effect of a Load instruction on pipeline timing: instruction I2 (Load) needs an extra cycle (M2) for the memory access, so the instructions behind it (I3, I4, I5) are stalled for one cycle before completing their D, E, and W steps.]
-
Complex Addressing Mode
Load (X(R1)), R2
[Figure (a) Complex addressing mode: after F and D, the Load spends three cycles forming the operand — X + [R1], then [X + [R1]], then [[X + [R1]]] — before its write step W; the next instruction finishes F and D early and must wait, with forwarding used to supply the result.]
-
Simple Addressing Mode
The same operand can be fetched with a sequence of simple instructions:
Add #X, R1, R2
Load (R2), R2
Load (R2), R2
[Figure (b) Simple addressing mode: the Add computes X + [R1], the first Load fetches [X + [R1]], and the second Load fetches [[X + [R1]]]; each instruction uses the regular F, D, E, W stages.]
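As a sanity check, a toy model (not from the slides, with made-up register and memory contents) can confirm that the three simple instructions compute the same value as the complex-mode Load:

```python
# Toy model checking that the complex mode  Load (X(R1)), R2  -- i.e.
# R2 = M[M[X + R1]] -- is equivalent to the three simple instructions
# Add #X,R1,R2; Load (R2),R2; Load (R2),R2.

def load_complex(mem, regs, X):
    # One instruction performing a double memory indirection.
    regs["R2"] = mem[mem[X + regs["R1"]]]

def load_simple(mem, regs, X):
    regs["R2"] = X + regs["R1"]      # Add  #X, R1, R2
    regs["R2"] = mem[regs["R2"]]     # Load (R2), R2
    regs["R2"] = mem[regs["R2"]]     # Load (R2), R2

mem = {105: 200, 200: 42}            # M[105] = 200, M[200] = 42
r1 = {"R1": 5}
r2 = {"R1": 5}
load_complex(mem, r1, X=100)
load_simple(mem, r2, X=100)
assert r1["R2"] == r2["R2"] == 42
```

The equivalence in results is exactly why the next slide argues that complex modes buy little: the work is the same, but the simple sequence keeps the pipeline stages uniform.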
-
Addressing Modes
In a pipelined processor, complex addressing modes do not necessarily lead to faster execution.
Advantage: they reduce the number of instructions and the program space.
Disadvantages: they cause the pipeline to stall, require more hardware to decode, and are not convenient for the compiler to work with.
Conclusion: complex addressing modes are not suitable for pipelined execution.
-
Addressing Modes
Good addressing modes should have the following features:
Access to an operand does not require more than one access to the memory.
Only load and store instructions access memory operands.
The addressing modes used do not have side effects.
Register, register indirect, and index modes satisfy these conditions.
-
Condition Codes
If an optimizing compiler attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur, it must ensure that reordering does not cause a change in the outcome of a computation.
The dependency introduced by the condition-code flags reduces the flexibility available for the compiler to reorder instructions.
-
Condition Codes
(a) A program fragment:
    Add       R1, R2
    Compare   R3, R4
    Branch=0  . . .

(b) Instructions reordered:
    Compare   R3, R4
    Add       R1, R2
    Branch=0  . . .

Figure 8.17. Instruction reordering.
-
Condition Codes
Two conclusions:
To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.
The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.
-
Topics covered: Pipelining
Datapath and Control Considerations
-
[Figure 7.8. Three-bus organization of the datapath: the register file, PC with incrementer, IR, instruction decoder, MAR, MDR, and ALU (with input latches A and B, result register R, and a multiplexer selecting between register data and the constant 4) are interconnected by buses A, B, and C; the MAR drives the memory address lines and the MDR connects to the memory bus data lines.]
-
[Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU: separate instruction and data caches with their own address registers (IMAR for instruction fetches, DMAR for data access), separate MDR/Read and MDR/Write, an instruction queue feeding the instruction decoder, a control signal pipeline, and the PC with its incrementer, connected by buses A, B, and C.]
Pipelined Design
Key changes for pipelined execution:
- Separate instruction and data caches
- PC is connected to IMAR
- DMAR for data accesses
- Separate MDRs for reading and writing
- Buffers at the input and output of the ALU
- Instruction queue
- Instruction decoder output feeding the control signal pipeline
Operations that can proceed independently in the same cycle:
- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two registers
- Writing into one register in the register file
- Performing an ALU operation
-
Topics covered: Pipelining
Superscalar Operation
-
Overview
The maximum throughput of a pipelined processor is one instruction per clock cycle.
If we equip the processor with multiple processing units to handle several instructions in parallel in each processing stage, several instructions can start execution in the same clock cycle. This is called multiple issue.
Processors capable of achieving an instruction execution throughput of more than one instruction per cycle are known as superscalar processors.
Multiple issue requires a wider path to the cache and multiple execution units.
-
Superscalar
[Figure 8.19. A processor with two execution units: an instruction fetch unit (F) fills an instruction queue, a dispatch unit issues instructions to a floating-point unit and an integer unit, and results are written back in step W (write results).]
-
Timing
[Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19, assuming no hazards are encountered: over clock cycles 1-7, I1 (Fadd) and I3 (Fsub) are dispatched to the floating-point unit, where execution takes three cycles (E1A, E1B, E1C), while I2 (Add) and I4 (Sub) go to the integer unit and execute in one cycle each; the integer instructions therefore write their results (W2, W4) before the floating-point instructions that precede them in program order (W1, W3).]
-
Out-of-Order Execution
Issues raised by out-of-order completion:
Hazards
Exceptions: imprecise exceptions vs. precise exceptions
[Figure (a) Delayed write: the instruction flow of Figure 8.20, modified so that the integer unit's writes (W2, W4) are delayed until the instructions ahead of them in program order have written their results.]
-
Execution Completion
It is desirable to use out-of-order execution, so that an execution unit is freed to execute other instructions as soon as possible.
At the same time, instructions must be completed in program order to allow precise exceptions.
Two mechanisms make this possible: the use of temporary registers, and a commitment unit.
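A minimal sketch of how these two mechanisms combine: results produced out of order go into temporary registers, and a commitment unit copies them into the permanent register file strictly in program order. The class and method names are invented, and this is a simplified model rather than the textbook's design.

```python
# Sketch of in-order commitment: execution units may finish out of
# program order, but the permanent register file is updated in order.

class CommitmentUnit:
    def __init__(self):
        self.temps = {}             # results waiting in temporary registers
        self.next_to_commit = 0     # index of the next instruction in program order
        self.regfile = {}           # permanent register file

    def complete(self, index, dest, value):
        # An execution unit finishes instruction `index`, possibly out
        # of program order; the result lands in a temporary register.
        self.temps[index] = (dest, value)

    def commit(self):
        # Retire consecutive completed instructions in program order,
        # copying their results into the permanent register file.
        while self.next_to_commit in self.temps:
            dest, value = self.temps.pop(self.next_to_commit)
            self.regfile[dest] = value
            self.next_to_commit += 1

cu = CommitmentUnit()
cu.complete(1, "R4", 7)             # instruction 1 finishes first
cu.commit()
assert cu.regfile == {}             # cannot commit past unfinished instruction 0
cu.complete(0, "R2", 3)
cu.commit()                         # now both retire, in program order
assert cu.regfile == {"R2": 3, "R4": 7}
```

This mirrors the figure below: I2's result sits in a temporary register (TW2) until the earlier Fadd has written its result.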
[Figure (b) Using temporary registers: I2 (Add) and I4 (Sub) write their results into temporary registers (TW2, TW4) as soon as execution completes; the permanent register writes occur later, in program order.]
-
Topics covered: Pipelining
Performance Considerations
-
Overview
The execution time T of a program that has a dynamic instruction count N is given by:

T = (N × S) / R

where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate.
Instruction throughput is defined as the number of instructions executed per second:

Ps = R / S
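A small worked example of the formulas T = (N × S) / R and Ps = R / S, with made-up numbers: 10^6 instructions, S = 4 cycles per instruction on average, and a 500 MHz clock.

```python
# Worked example of the execution-time and throughput formulas.

def execution_time(N, S, R):
    """T = (N x S) / R, in seconds."""
    return (N * S) / R

def throughput(R, S):
    """Ps = R / S, in instructions per second."""
    return R / S

T = execution_time(N=1_000_000, S=4, R=500_000_000)
Ps = throughput(R=500_000_000, S=4)
assert T == 0.008               # 8 ms
assert Ps == 125_000_000        # 125 million instructions per second
```

Note that reducing S (for example, through pipelining) improves both measures, which is the thread picked up on the next slides.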
-
Overview
An n-stage pipeline has the potential to increase the throughput by n times.
However, the only real measure of performance is the total execution time of a program.
Higher instruction throughput will not necessarily lead to higher performance.
Two questions regarding pipelining:
How much of this potential increase in instruction throughput can be realized in practice?
What is a good value of n?
-
Number of Pipeline Stages
Since an n-stage pipeline has the potential to increase the throughput by n times, why not use a 10,000-stage pipeline?
As the number of stages increases:
The probability of the pipeline being stalled increases.
The inherent delay in the basic operations increases.
Hardware costs grow (area, power, complexity, ...).