Coa Lecture Unit 3 Pipelining


7/28/2019

Topics covered: Pipelining

Unit 3


    Basic concepts

Speed of execution of programs can be improved in two ways: use faster circuit technology to build the processor and the memory, or arrange the hardware so that a number of operations can be performed simultaneously. In the second approach, the number of operations performed per second is increased, although the elapsed time needed to perform any one operation is not changed.

Pipelining is an effective way of organizing concurrent activity in a computer system to improve the speed of execution of programs.


    Basic concepts (contd..)

A processor executes a program by fetching and executing instructions one after the other; this is known as sequential execution. If Fi refers to the fetch step and Ei refers to the execution step of instruction Ii, then sequential execution looks like:

Figure: sequential execution along the time axis; F1 E1, F2 E2, F3 E3, with each fetch/execute pair completing before the next begins.

    What if the execution of one instruction is overlapped with the fetching of the

    next one?


    Basic concepts (contd..)

Figure: an instruction fetch unit and an execution unit connected by interstage buffer B1.

The computer has two separate hardware units, one for fetching instructions and one for executing them. An instruction is fetched by the instruction fetch unit and deposited in an intermediate buffer B1. The buffer enables the execution unit to execute one instruction while the fetch unit is fetching the next instruction. Results of the execution are deposited in the destination location specified by the instruction.


    Basic concepts (contd..)

Figure: two-stage pipelined execution; instructions I1, I2, I3 flow through fetch (F) and execute (E) steps over clock cycles 1-4.

The computer is controlled by a clock whose period is such that either the fetch or the execute step of any instruction can be completed in one clock cycle.

    First clock cycle:

- Fetch unit fetches instruction I1 (F1) and stores it in B1.

Second clock cycle:

- Fetch unit fetches instruction I2 (F2), and the execution unit executes instruction I1 (E1).

Third clock cycle:

- Fetch unit fetches instruction I3 (F3), and the execution unit executes instruction I2 (E2).

Fourth clock cycle:

- Execution unit executes instruction I3 (E3).


    Basic concepts (contd..)

In each clock cycle, the fetch unit fetches the next instruction while the execution unit executes the current instruction stored in the interstage buffer, so both units can be kept busy all the time.

If this pattern of fetch and execute can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by sequential operation.

The fetch and execute units constitute a two-stage pipeline. Each stage performs one step in the processing of an instruction. The interstage buffer holds the information that needs to be passed from the fetch stage to the execute stage. New information is loaded into the buffer every clock cycle.


    Basic concepts (contd..)

Suppose the processing of an instruction is divided into four steps:

F  Fetch: Read the instruction from the memory.
D  Decode: Decode the instruction and fetch the source operands.
E  Execute: Perform the operation specified by the instruction.
W  Write: Store the result in the destination location.

There is a distinct hardware unit for each of the four steps. Information is passed from one unit to the next through an interstage buffer; three interstage buffers connect the four units. As an instruction progresses through the pipeline, the information needed by the downstream units must be passed along.

Figure: four-stage pipeline; F (fetch instruction), D (decode instruction and fetch operands), E (execute operation), W (write results), with interstage buffers B1, B2, B3 between the stages.


    Basic concepts (contd..)

Figure: four-stage pipelined execution of instructions I1-I4 over clock cycles 1-7; each instruction passes through F, D, E, W in successive cycles.

Clock cycle 1: F1
Clock cycle 2: D1, F2
Clock cycle 3: E1, D2, F3
Clock cycle 4: W1, E2, D3, F4
Clock cycle 5: W2, E3, D4
Clock cycle 6: W3, E4
Clock cycle 7: W4
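The table above follows a simple rule: in an ideal k-stage pipeline, instruction i (counting from 0) occupies stage s during clock cycle i + s + 1. A small Python sketch (illustrative, not from the lecture) reproduces the schedule:

```python
# Sketch: generate the ideal pipeline schedule for n instructions.
# Instruction i (0-based) is in stage s during clock cycle i + s + 1,
# assuming no hazards.
def pipeline_schedule(n_instructions, stages=("F", "D", "E", "W")):
    schedule = {}  # clock cycle -> list of "stage + instruction" steps
    for i in range(n_instructions):
        for s, stage in enumerate(stages):
            cycle = i + s + 1
            schedule.setdefault(cycle, []).append(f"{stage}{i + 1}")
    return schedule

sched = pipeline_schedule(4)
for cycle in sorted(sched):
    print(f"Clock cycle {cycle}: {', '.join(sched[cycle])}")
```

Running it for 4 instructions prints exactly the seven-cycle table above.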


    Basic concepts (contd..)

Figure: the same four-stage pipelined execution of I1-I4 over clock cycles 1-7.

During clock cycle 4:

Buffer B1 holds instruction I3, which is being decoded by the instruction-decoding unit. Instruction I3 was fetched in cycle 3.

Buffer B2 holds the source and destination operands for instruction I2. It also holds the information needed for the Write step (W2) of instruction I2; this information will be passed to stage W in the following clock cycle.

Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.


    Role of cache memory

Each stage in the pipeline is expected to complete its operation in one clock cycle, so the clock period should be long enough to complete the longest task. Units that complete their tasks early remain idle for the remainder of the clock period. For pipelining to be effective, the tasks performed in different stages should require about the same amount of time.

If instructions were fetched from the main memory, the instruction fetch stage could take as much as ten times longer than the other stage operations inside the processor. However, if instructions are fetched from the cache memory on the processor chip, the time required to fetch an instruction is more or less similar to the time required for the other basic operations.


The pipeline completes an instruction each clock cycle, and is therefore four times as fast as execution without a pipeline, as long as nothing takes more than one cycle. But sometimes things take longer: most Execute steps, such as ADD, take one clock, but suppose DIVIDE takes three.


While the long Execute step is in progress, the other stages are idle: Write has nothing to write, Decode can't use its output buffer, and Fetch can't use its output buffer.


    Pipeline performance

The potential increase in performance achieved by pipelining is proportional to the number of pipeline stages. For example, with 4 pipeline stages, the rate of instruction processing is 4 times that of sequential execution.

Pipelining does not cause a single instruction to be executed faster; it is the throughput that increases.

This rate can be achieved only if the pipelined operation can be sustained without interruption throughout program execution. If a pipelined operation cannot be sustained without interruption, the pipeline is said to stall. A condition that causes the pipeline to stall is called a hazard.
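The proportionality claim can be made concrete with the standard ideal-pipeline timing: n instructions with k one-cycle steps take n x k cycles sequentially but only k + (n - 1) cycles pipelined, so the speedup approaches k for long instruction streams. A quick sketch (assumed formula, not from the slides):

```python
# Sketch: ideal k-stage pipeline speedup over sequential execution.
# Sequential: n instructions x k cycles each. Pipelined: k cycles to
# fill the pipe, then one instruction completes per cycle.
def speedup(n, k):
    sequential = n * k
    pipelined = k + (n - 1)
    return sequential / pipelined

print(speedup(4, 4))      # short run: 16/7, well below 4
print(speedup(10000, 4))  # long run: approaches the stage count, 4
```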


    Data hazard

A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline. Execution of an instruction occurs in the E stage of the pipeline. Most arithmetic and logic operations take only one clock cycle to execute; however, some operations, such as division, take more time to complete. Suppose, for example, that the operation specified in instruction I2 takes three cycles to complete, from cycle 4 to cycle 6.

Figure: pipelined execution of I1-I4 in which E2 occupies cycles 4-6; W2 and the steps of the following instructions are delayed accordingly.


    Data hazard (contd..)

Figure: the same pipelined execution, highlighting the idle cycles caused by the three-cycle E2 step.

In cycles 5 and 6, the Write stage is idle because it has no data to work with. The information in buffer B2 must be retained until the execution of instruction I2 is complete. Stage 2, and by extension stage 1, cannot accept new instructions because the information in B1 cannot be overwritten, so steps D4 and F5 must be postponed.

A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline.


    Data Hazards

Situations that cause the pipeline to stall because the data to be operated on is delayed, for example when the Execute step takes extra cycles.


    Control or instruction hazard

The pipeline may be stalled because an instruction is not available at the expected time. For example, a cache miss may occur while fetching an instruction, in which case the instruction has to be fetched from the main memory. Fetching an instruction from the main memory takes much longer than fetching it from the cache, so the fetch step cannot be completed in one cycle. Suppose, for example, that fetching instruction I2 results in a cache miss, so that F2 takes 4 clock cycles instead of 1.

Figure: pipelined execution in which F2 occupies cycles 2-5 because of a cache miss; D2, E2, W2 and the steps of I3 are delayed accordingly (clock cycles 1-9).


An instruction hazard (or control hazard) has caused the pipeline to stall: instruction I2 was not in the cache and required a main memory access.


    Control or instruction hazard (contd..)

The fetch operation for instruction I2 results in a cache miss, and the instruction fetch unit must fetch this instruction from the main memory. Suppose fetching instruction I2 from the main memory takes 4 clock cycles; instruction I2 will then be available in buffer B1 at the end of clock cycle 5, and the pipeline resumes its normal operation at that point. The Decode unit is idle in cycles 3 through 5, the Execute unit is idle in cycles 4 through 6, and the Write unit is idle in cycles 5 through 7. Such idle periods are called stalls or bubbles. Once created in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.

Figure: the same delay viewed stage by stage over clock cycles 1-9; the F stage repeats F2 in cycles 2-5, while the D, E, and W stages are each idle for three cycles before resuming with D2, E2, and W2.


    Structural hazard

A structural hazard occurs when two instructions require the use of a hardware resource at the same time. The most common case is access to memory: one instruction needs to access memory as part of its Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed. Many processors have separate data and instruction caches to avoid this delay. In general, structural hazards can be avoided by providing sufficient resources on the processor chip.


    Structural hazard (contd..)

Figure: pipelined execution of I1-I5, where I2 is Load X(R1),R2; its execution occupies two cycles (E2 to compute the address, M2 for the memory access), delaying the steps of the following instructions (clock cycles 1-7).

The memory address X+R1 is computed in step E2 in cycle 4, the memory access takes place in cycle 5, and the operand read from memory is written into register R2 in cycle 6. Execution of instruction I2 thus takes two clock cycles, 4 and 5. In cycle 6, both instructions I2 and I3 require access to the register file, and the pipeline is stalled because the register file cannot handle two operations at once.


I2 takes an extra cycle for the cache access as part of its execution, after calculating the address. While I2 is writing to the register file, I3 must wait for the register file, and the fetch of I5 is delayed.


    Pipelining and performance

Pipelining does not cause an individual instruction to be executed faster; rather, it increases the throughput, defined as the rate at which instruction execution is completed. When a hazard occurs, one of the stages in the pipeline cannot complete its operation in one clock cycle, and the pipeline stalls, causing a degradation in performance. One instruction completion per clock cycle is the upper limit for the throughput achievable in a pipelined processor.


    Data hazards

A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed. Consider two instructions:

I1: A = 3 + A
I2: B = 4 x A

If A = 5 and I1 and I2 are executed sequentially, B = 32. In a pipelined processor, the execution of I2 can begin before the execution of I1 completes; the value of A used in the execution of I2 would then be the original value 5, leading to an incorrect result. Thus, instructions I1 and I2 depend on each other, because the data used by I2 depends on the result generated by I1.

The results obtained from pipelined execution must be the same as those obtained from sequential execution. When two instructions depend on each other, they must be performed in the correct order.
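The two outcomes can be checked directly in a few lines of Python (a behavioral illustration, not processor code):

```python
# Sketch: correct in-order result versus the stale-operand result.
A = 5

A_after_I1 = 3 + A           # I1: A = 3 + A, giving 8
B_correct = 4 * A_after_I1   # I2 uses the new value of A: B = 32

B_stale = 4 * A              # I2 reads A before I1 writes it back: B = 20

print(B_correct, B_stale)
```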


    Concurrency

A ← 3 + A        A ← 5 x C
B ← 4 x A        B ← 20 + C

The pair on the left can't be performed concurrently: the result is incorrect if the new value of A is not used. The pair on the right can be performed concurrently (or in either order) without affecting the result.


    Concurrency

A ← 3 + A        A ← 5 x C
B ← 4 x A        B ← 20 + C

In the pair on the left, the second operation depends on completion of the first. In the pair on the right, the two operations are independent.


MUL R2, R3, R4
ADD R5, R4, R6   (dependent on the result in R4 from the previous instruction)

MUL will write its result in R4; ADD can't finish decoding until that result is in R4.


    Data hazards (contd..)

Figure: pipelined execution of Mul R2,R3,R4 (I1) followed by Add R5,R4,R6 (I2); the Decode step of the Add is stalled (D2A) until the Mul's Write step completes, which delays E2, W2, and the steps of I3 and I4 (clock cycles 1-9).

The Mul instruction places the result of the multiply operation in register R4 at the end of clock cycle 4. Register R4 is used as a source operand in the Add instruction, so the Decode unit decoding the Add instruction cannot proceed until the Write step of the first instruction is complete. The data dependency arises because the destination of one instruction is used as a source in the next instruction.


    Operand forwarding

A data hazard occurs because the destination of one instruction is used as the source in the next instruction. Instruction I2 would therefore have to wait for the data to be written into the register file by the Write stage at the end of step W1. However, these data are available at the output of the ALU once the Execute stage completes step E1. The delay can be reduced, or even eliminated, if the result of instruction I1 can be forwarded directly for use in step E2. This is called operand forwarding.
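Counting cycles for the Mul/Add pair shows the saving. Without forwarding, the Add must wait until the Mul's result reaches the register file at the end of W1; with forwarding, it can consume the ALU output as soon as E1 completes. A sketch under the four-stage timing used in these slides (the cycle numbers are the assumption here):

```python
# Sketch: cycle in which the dependent Add can execute, with and
# without operand forwarding. Mul occupies F=1, D=2, E=3, W=4;
# the Add is fetched in cycle 2.
def add_execute_cycle(forwarding):
    mul_e_done = 3  # Mul's result is at the ALU output after E1
    mul_w_done = 4  # ...and in the register file after W1
    if forwarding:
        # The result is forwarded from the ALU output straight into E2.
        return mul_e_done + 1
    # Decode of Add must read R4 after W1 completes; E2 follows.
    return mul_w_done + 2

print(add_execute_cycle(True))   # cycle 4: no stall
print(add_execute_cycle(False))  # cycle 6: two stall cycles
```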


Figure: the same instruction pair executed with a pipeline stall (no forwarding) and with data forwarding.


Figure: datapath activity for MUL R2, R3, R4 followed by ADD R5, R4, R6; the product R2 x R3 is forwarded from the ALU output so the Add can use it as the value of R4 without waiting for it to be written into the register file.


If solved by software:

MUL R2, R3, R4
NOP
NOP
ADD R5, R4, R6

The two NOPs correspond to the 2-cycle stall that the hardware would introduce if there is no data forwarding.


Figure: datapath for MUL R2, R3, R4; the operands come from R2 and R3, the product R2 x R3 goes to R4, and a forwarding path carries it directly to I2.


    Operand forwarding (contd..)

Figure: register file feeding the ALU inputs through registers SRC1 and SRC2; the ALU result is held in RSLT along with its destination, and forwarding connections run from RSLT back to the ALU inputs.

    I1: Mul R2, R3, R4

    I2: Add R5, R4, R6

Clock cycle 3:

- Instruction I2 is decoded, and a data dependency is detected.
- The operand not involved in the dependency, register R5, is loaded into register SRC1.

Clock cycle 4:

- The product produced by I1 is available in register RSLT.
- The forwarding connection allows the result to be used in step E2.

Instruction I2 proceeds without interruption.


    Handling data dependency in software

A data dependency may be detected by the hardware while decoding the instruction: the control hardware can delay the reading of a register by an appropriate number of clock cycles until its contents become available, stalling the pipeline for that many clock cycles.

Detecting and handling data dependencies can also be accomplished in software. The compiler can introduce the necessary delay with an appropriate number of NOP instructions. For example, if a two-cycle delay is needed between two instructions, two NOP instructions can be placed between them:

I1: Mul R2, R3, R4
    NOP
    NOP
I2: Add R5, R4, R6
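A compiler pass implementing this idea only needs to compare each instruction's destination register with the next instruction's source registers. A sketch (hypothetical tuple encoding with the destination last, matching the lecture's Mul R2, R3, R4 convention; the two-cycle delay assumes no forwarding hardware):

```python
# Sketch: insert NOPs after any instruction whose destination register
# is used as a source by the next instruction. Instruction format is
# (opcode, src1, src2, dest), as in ("Mul", "R2", "R3", "R4").
DELAY = 2  # stall cycles needed without forwarding hardware

def insert_nops(program):
    out = []
    for i, instr in enumerate(program):
        out.append(instr)
        if i + 1 < len(program):
            dest = instr[3]
            next_sources = program[i + 1][1:3]
            if dest in next_sources:
                out.extend([("Nop",)] * DELAY)
    return out

prog = [("Mul", "R2", "R3", "R4"), ("Add", "R5", "R4", "R6")]
for instr in insert_nops(prog):
    print(instr)
```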


    Side effects

Data dependencies are explicit and easy to detect when a register specified as the destination in one instruction is used as a source in a subsequent instruction. However, some instructions also modify registers that are not specified as the destination. For example, in the autoincrement and autodecrement addressing modes, the source register is modified as well. When a location other than the one explicitly specified in the instruction as a destination is affected, the instruction is said to have a side effect. Another example of a side effect is the condition code flags, which implicitly record the result of the previous instruction; these results may be used by a subsequent instruction.


    Side effects (contd..)

I1: Add R3, R4
I2: AddWithCarry R2, R4

Instruction I1 sets the carry flag and instruction I2 uses it, leading to an implicit dependency between the two instructions. Instructions with side effects can lead to multiple data dependencies, resulting in a significant increase in the complexity of the hardware or software needed to handle them. Side effects should be kept to a minimum in instruction sets designed for execution on pipelined hardware.


    Instruction hazards

The instruction fetch unit fetches instructions and supplies the execution units with a steady stream of instructions. If this stream is interrupted, the pipeline stalls. The stream of instructions may be interrupted by a cache miss or by a branch instruction.


    Instruction hazards (contd..)

Consider a two-stage pipeline whose first stage is instruction fetch and whose second stage is instruction execute. Instructions I1, I2, and I3 are stored at successive memory locations. I2 is an unconditional branch instruction whose branch target is instruction Ik.

Clock cycle 3:

- The fetch unit is fetching instruction I3.
- The execute unit is decoding I2 and computing the branch target address.

Clock cycle 4:

- The processor must discard I3, which has been incorrectly fetched, and fetch Ik instead.
- The execution unit is idle, and the pipeline stalls for one clock cycle.


    Instruction hazards (contd..)

Figure: two-stage pipelined execution of I1, I2 (Branch), I3, Ik, Ik+1 over clock cycles 1-6; I3 is discarded (X) after E2 computes the target, leaving the execution unit idle in cycle 4.

The pipeline stalls for one clock cycle. The time lost as a result of a branch instruction is called the branch penalty; here, the branch penalty is one clock cycle.


    the lost cycle is the branch penalty


    Instruction hazards (contd..)

The branch penalty depends on the length of the pipeline and may be higher for a longer pipeline. For a four-stage pipeline:

- The branch target address is computed in stage E2.
- Instructions I3 and I4 have to be discarded.
- The execution unit is idle for 2 clock cycles.
- The branch penalty is 2 clock cycles.

Figure: four-stage pipelined execution where I2 is a branch; after E2 computes the target in cycle 4, the partially processed I3 and I4 are discarded (X) and Ik is fetched in cycle 5 (clock cycles 1-8).


In a four-stage pipeline, the penalty is two clock cycles.
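The aggregate cost of these penalties follows from a standard back-of-the-envelope formula (the numbers below are my example, not the lecture's): if a fraction p of executed instructions are branches and each pays a penalty of b cycles, the average number of cycles per instruction grows from 1 to 1 + p x b.

```python
# Sketch: average cycles per instruction when branches carry a penalty.
# p is the fraction of instructions that are branches; b is the penalty.
def cycles_per_instruction(p, b):
    return 1 + p * b

# Example: 20% branches (the dynamic frequency quoted later in these
# slides) and the 2-cycle penalty of the four-stage pipeline.
print(cycles_per_instruction(0.20, 2))  # about 1.4 cycles per instruction
```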


    Instruction hazards (contd..)

The branch penalty can be reduced by computing the branch target address earlier in the pipeline. The instruction fetch unit can have special hardware to identify a branch instruction as soon as it is fetched, so that the branch target address is computed in the Decode stage (D2) rather than in the Execute stage (E2). The branch penalty is then only one clock cycle.

Figure: the same four-stage pipeline with the branch target computed in D2; only F3 is discarded (X), and Ik is fetched in cycle 4 (clock cycles 1-7).


    Instruction hazards (contd..)

Figure: an instruction fetch unit (F) feeding an instruction queue; a dispatch/decode unit (D) sits between the queue and the execute (E) and write (W) stages.

The fetch unit fetches instructions before they are needed and stores them in a queue, which can hold several instructions. A dispatch unit takes instructions from the front of the queue and dispatches them to the execution unit; the dispatch unit also decodes the instructions.


    Instruction hazards (contd..)

The fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions.

If the pipeline stalls because of a data hazard, the dispatch unit cannot issue instructions from the queue, but the fetch unit continues to fetch instructions and add them to the queue. If fetching is delayed because of a cache miss or a branch, the dispatch unit continues to dispatch instructions from the instruction queue.


    Instruction hazards (contd..)

Figure: pipelined execution with an instruction queue over clock cycles 1-10; E1 takes three cycles (4-6), I5 is a branch whose target Ik is fetched in cycle 7 while I6 is discarded, and the queue length per cycle is 1, 1, 1, 1, 2, 3, 2, 1, 1, 1.

The initial length of the queue is 1. A fetch adds 1 to the queue length and a dispatch reduces it by 1, so the queue length remains the same for the first 4 clock cycles. I1 stalls the pipeline for 2 cycles, but the queue has space, so the fetch unit continues and the queue length rises to 3 in clock cycle 6. I5 is a branch instruction with target instruction Ik; Ik is fetched in cycle 7, and I6 is discarded. This does not stall the pipeline, since I4 is dispatched in the meantime, and I2, I3, I4, and Ik are executed in successive clock cycles. The fetch unit computes the branch address concurrently with the execution of other instructions; this is called branch folding.





The fetch unit keeps fetching (instructions 3, 4, and 5) despite the stall.


The fetch unit calculates the branch target concurrently (branch folding), discards I6, and fetches Ik.


The pipeline then completes an instruction each clock cycle, with no branch penalty.


    Instruction hazards (contd..)

Branch folding can occur only if at least one instruction other than the branch is available in the queue, so the queue should ideally be full most of the time. This requires increasing the rate at which the fetch unit reads instructions from the cache; most processors allow more than one instruction to be fetched from the cache in one clock cycle. The fetch unit must also replenish the queue quickly after a branch has occurred.

The instruction queue also mitigates the impact of cache misses: in the event of a cache miss, the dispatch unit continues to send instructions to the execution unit as long as the queue is not empty, while the desired cache block is read in the meantime. If the queue does not become empty, the cache miss has no effect on the rate of instruction execution.


    Conditional branches and branch prediction

Conditional branch instructions depend on the result of a preceding instruction: the decision on whether to branch cannot be made until the execution of that instruction is complete. Branch instructions represent about 20% of the dynamic instruction count of most programs (the dynamic count takes into consideration that some instructions are executed repeatedly). Branch instructions can therefore incur branch penalties that reduce the performance gains expected from pipelining. Several techniques exist to mitigate the negative impact of branch penalties on performance.


    Delayed branch

The branch target address is computed in stage E2, so instructions I3 and I4 have to be discarded. The location following a branch instruction is called a branch delay slot. There may be more than one branch delay slot, depending on the time it takes to determine whether the instruction is a branch; in this case, there are two branch delay slots. The instructions in the delay slots are always fetched and at least partially executed before the branch decision is made and the branch address is computed.

Figure: four-stage pipeline where I2 is a branch resolved in E2; the two delay-slot instructions I3 and I4 are fetched and partially processed, then discarded (X), before Ik is fetched (clock cycles 1-8).


The penalty is two clock cycles in a four-stage pipeline: there are two branch delay slots if the branch address is calculated in the Execute stage.


If the branch address is calculated in the Decode stage instead, the delay-slot instruction is fetched and decoded before being discarded; the penalty is reduced to one cycle, and there is one branch delay slot.


    Delayed branch (contd..)

Delayed branching can minimize the penalty incurred by conditional branch instructions. Since the instructions in the delay slots are always fetched and partially executed, it is better to arrange for them to be fully executed whether or not the branch is taken. If we can place useful instructions in these slots, they will always be executed whether or not the branch is taken; if we cannot, we fill the slots with NOP instructions.


    Delayed branch (contd..)

(a) Original program loop:

LOOP   Shift_left   R1
       Decrement    R2
       Branch=0     LOOP
NEXT   Add          R1,R3

(b) Reordered instructions:

LOOP   Decrement    R2
       Branch=0     LOOP
       Shift_left   R1
NEXT   Add          R1,R3

Register R2 is used as a counter to determine how many times R1 is to be shifted. The processor has a two-stage pipeline, and hence one delay slot. The instructions can be reordered so that the Shift_left instruction appears in the delay slot; the shift is then always executed, whether the branch condition is true or false.
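That the two versions compute the same result can be checked behaviorally, treating the instruction placed after the branch as a delay slot that executes whether or not the branch is taken (a sketch; the branch is modeled as "loop again while R2 has not reached zero", the evident intent of the counter):

```python
# Sketch: behavioral comparison of the original and reordered loops.
def original(r1, r2, r3):
    # LOOP: Shift_left R1; Decrement R2; branch back while R2 != 0;
    # NEXT: Add R1,R3.
    while True:
        r1 <<= 1
        r2 -= 1
        if r2 == 0:
            break
    return r1 + r3

def reordered_with_delay_slot(r1, r2, r3):
    # LOOP: Decrement R2; branch back while R2 != 0; Shift_left R1 in
    # the delay slot (executed on every iteration, taken or not);
    # NEXT: Add R1,R3.
    while True:
        r2 -= 1
        taken = r2 != 0
        r1 <<= 1  # delay-slot instruction: always executed
        if not taken:
            break
    return r1 + r3

print(original(1, 3, 10), reordered_with_delay_slot(1, 3, 10))
```

Both versions shift R1 exactly R2 times before the final add, so their results agree.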


    Delayed branch (contd..)

Figure: two-stage (F, E) execution trace of the reordered loop over clock cycles 1-8: Decrement, Branch, Shift (delay slot) for an iteration in which the branch is taken, followed by Decrement, Branch, Shift (delay slot), Add for the final iteration in which it is not. The Shift instruction is executed whether or not the branch is taken.


    Delayed branch (contd..)

Logically, the program executes as if the branch instruction were placed after the Shift instruction: branching takes place one instruction later than where the branch instruction appears in the reordered instruction sequence. Hence, this technique is termed delayed branch. Delayed branching requires reordering as many instructions as there are delay slots. It is usually possible to find one instruction to fill one delay slot, but difficult to find two or more instructions to fill two or more delay slots.


    Branch prediction

To reduce the branch penalty associated with conditional branches, we can predict whether the branch will be taken.

Simplest form of branch prediction: assume that the branch will not take place, and continue to fetch instructions in sequential execution order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis.

Speculative execution means that the processor is executing instructions before it is certain that they are in the correct sequence. Processor registers and memory locations should not be updated until the sequence is confirmed.

If the branch prediction turns out to be wrong, the instructions that were executed on a speculative basis and their data must be purged, and the correct sequence of instructions must be fetched and executed.


Branch prediction (contd..)

Clock cycle    1   2   3      4   5   6
I1 (Compare)   F1  D1  E1     W1
I2 (Branch>0)      F2  D2/P2  E2
I3                     F3     D3  X
I4                            F4  X
Ik                                Fk  Dk

I1 is a compare instruction and I2 is a branch instruction. Branch prediction takes place in cycle 3, when I2 is being decoded; I3 is being fetched at that time.

The fetch unit predicts that the branch will not be taken, and continues to fetch I4 in cycle 4 while I3 is being decoded.

Results of I1 are available in cycle 3, and the fetch unit evaluates the branch condition in cycle 4. If the branch prediction is incorrect, the fetch unit realizes it at this point: I3 and I4 are discarded (marked X above) and Ik is fetched from the branch target address.


    Branch prediction (contd..)

The decision about which way to predict the result of the branch instruction (taken or not taken) may be made in hardware, depending on whether the target address of the branch is lower or higher than the address of the branch instruction itself:

If the target address is lower, the branch is predicted as taken (a backward branch usually closes a loop).

If the target address is higher, the branch is predicted as not taken.

Branch prediction can also be handled by the compiler: the compiler sets a branch prediction bit to 0 or 1 to indicate the desired behavior, and the instruction fetch unit checks this bit to predict whether the branch will be taken.
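The hardware rule above ("backward taken, forward not taken") is a one-line predicate. A minimal sketch (the addresses are illustrative byte addresses, not taken from the slides):

```python
def predict_taken(branch_addr, target_addr):
    """Static prediction: a branch to a lower address (backward, e.g. the
    bottom of a loop) is predicted taken; a branch to a higher address
    (forward, e.g. skipping an if-body) is predicted not taken."""
    return target_addr < branch_addr

# A loop-closing branch at address 0x120 jumping back to 0x100:
assert predict_taken(0x120, 0x100)      # backward -> predicted taken
# A forward branch skipping ahead from 0x100 to 0x140:
assert not predict_taken(0x100, 0x140)  # forward -> predicted not taken
```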


    Branch prediction (contd..)

If the branch prediction decision is the same every time a given instruction is executed, it is static branch prediction.

If the prediction decision may change depending on the execution history, it is dynamic branch prediction.


    Branch prediction (contd..)

The branch prediction algorithm may be described as a machine with two states:

LT : Branch is likely to be taken
LNT: Branch is likely not to be taken

Let the initial state of the machine be LNT. When the branch instruction is executed and the branch is taken, the machine moves to state LT; if the branch is not taken, it remains in state LNT.

When the same branch instruction is executed the next time, the branch is predicted as taken if the state of the machine is LT, else it is predicted as not taken.

State transitions (BT = branch taken, BNT = branch not taken):

LNT --BT--> LT     LNT --BNT--> LNT
LT  --BT--> LT     LT  --BNT--> LNT
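The two-state machine translates directly into a one-bit predictor. The sketch below (class and function names are my own) also replays the loop scenario discussed on the following slides: for a loop-closing branch taken on every pass except the last, the predictor mispredicts the first and last pass on every entry into the loop.

```python
class OneBitPredictor:
    """Two-state (LNT/LT) branch predictor: one bit of history."""

    def __init__(self):
        self.state = "LNT"           # initial state: likely not taken

    def predict(self):
        return self.state == "LT"    # True means "predict taken"

    def update(self, taken):
        self.state = "LT" if taken else "LNT"

def mispredictions_per_loop_entry(passes, entries=2):
    """Branch at the bottom of a loop: taken on every pass but the last."""
    p = OneBitPredictor()
    counts = []
    for _ in range(entries):
        wrong = 0
        for i in range(passes):
            taken = (i < passes - 1)
            if p.predict() != taken:
                wrong += 1
            p.update(taken)
        counts.append(wrong)
    return counts

# Two mispredictions on every entry: the first pass (state still LNT)
# and the last pass (branch falls through while the state is LT).
```

Running `mispredictions_per_loop_entry(5)` gives `[2, 2]`, which is exactly the behavior described on the next slide.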


    Branch prediction (contd..)

This scheme requires only one bit of history information for each branch instruction.

It works well inside loops: once a loop is entered, the branch instruction that controls the looping yields the same result on every pass until the last one. On the last pass, the branch prediction turns out to be incorrect, and the branch history state machine changes to the opposite state.

Consequently, if the same loop is entered again and there is more than one pass, the state machine also mispredicts the first pass.

Better performance may be achieved by keeping more execution history.


    Branch prediction (contd..)

ST : Strongly likely to be taken
LT : Likely to be taken
LNT: Likely not to be taken
SNT: Strongly likely not to be taken

(State-transition diagram: each state has two outgoing edges, labeled BT for branch taken and BNT for branch not taken, as described below.)

The initial state of the algorithm is LNT. After the branch instruction is executed, if the branch is taken, the state is changed to ST.

For a branch instruction, the fetch unit predicts that the branch will be taken if the state is ST or LT; otherwise it predicts that the branch will not be taken.

In state SNT:
- The prediction is that the branch is not taken.
- If the branch is actually taken, the state changes to LNT.
- The next time the branch is encountered, the prediction again is that it is not taken.
- If the prediction is wrong the second time, the state changes to ST.
- After that, the branch is predicted as taken.


    Branch prediction (contd..)

Consider a loop with a branch instruction at the end, and let the initial state of the branch prediction algorithm be LNT.

In the first pass, the algorithm predicts that the branch is not taken. This prediction is incorrect, and the state changes to ST.

In the subsequent passes, the algorithm predicts that the branch is taken. This prediction is correct except in the last pass: there the branch is not taken, and the state changes from ST to LT.

When the loop is entered the second time, the algorithm predicts that the branch is taken, and this prediction is correct.
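The four-state algorithm can be simulated the same way. The slides specify only some transitions (LNT to ST on a taken branch, ST to LT on a not-taken branch, and SNT's behavior); the transitions marked "assumed" below are my own fill-ins following the usual convention. The simulation reproduces the walkthrough above: two mispredictions on the first execution of the loop, then only the unavoidable last-pass misprediction.

```python
class FourStatePredictor:
    """ST/LT/LNT/SNT predictor from the slides. Transitions marked
    'assumed' are not spelled out in the text."""

    TRANS = {
        ("ST",  True): "ST",  ("ST",  False): "LT",
        ("LT",  True): "ST",  ("LT",  False): "LNT",  # both assumed
        ("LNT", True): "ST",  ("LNT", False): "SNT",  # False case assumed
        ("SNT", True): "LNT", ("SNT", False): "SNT",
    }

    def __init__(self, state="LNT"):
        self.state = state

    def predict(self):
        return self.state in ("ST", "LT")   # predict taken

    def update(self, taken):
        self.state = self.TRANS[(self.state, taken)]

def loop_mispredictions(passes, entries=2):
    """Loop-closing branch: taken on every pass except the last."""
    p = FourStatePredictor()
    counts = []
    for _ in range(entries):
        wrong = 0
        for i in range(passes):
            taken = (i < passes - 1)
            wrong += (p.predict() != taken)
            p.update(taken)
        counts.append(wrong)
    return counts

# First entry: mispredicts the first pass (LNT) and the last pass (ST -> LT).
# Later entries: only the unavoidable last-pass misprediction remains.
```

`loop_mispredictions(5)` returns `[2, 1]`: compare this with the one-bit predictor, which mispredicts twice on every entry.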


    Branch prediction (contd..)

The branch prediction algorithm mispredicts the outcome of the branch instruction in the first pass. The prediction in the first pass depends on the initial state of the algorithm; if the initial state can be set correctly, this misprediction can be avoided.

The information necessary to set the initial state can be provided by the static prediction schemes: comparing the branch target address with the address of the branch instruction, or checking the branch prediction bit set by the compiler.

For a branch instruction at the end of a loop, the initial state should be LT; for a branch instruction at the start of a loop, the initial state should be LNT.

With this, the only misprediction that occurs is on the final pass through the loop, and this misprediction is unavoidable.


Topics covered: Pipelining

    Influence on Instruction Sets


    Overview

Some instructions are much better suited to pipelined execution than others. Two aspects of instruction-set design that affect this are:

Addressing modes

Condition code flags


    Addressing Modes

Addressing modes include simple ones and complex ones. In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline:

Side effects

The extent to which complex addressing modes cause the pipeline to stall

Whether a given mode is likely to be used by compilers


    Recall

Clock cycle  1   2   3   4   5   6   7
I1           F1  D1  E1  W1
I2 (Load)        F2  D2  E2  M2  W2
I3                   F3  D3  --  E3  W3
I4                       F4  --  D4  E4
I5                           F5  --  D5

Figure 8.5. Effect of a Load instruction on pipeline timing.

Load X(R1), R2
Load (R1), R2


    Complex Addressing Mode

Load (X(R1)), R2

Clock cycle        1  2  3       4         5           6  7
Load               F  D  X+[R1]  [X+[R1]]  [[X+[R1]]]  W
Next instruction      F  --      --        --          D  E ...

(a) Complex addressing mode

The operand address X+[R1] is computed in cycle 3, the pointer [X+[R1]] is fetched in cycle 4, and the operand [[X+[R1]]] is fetched in cycle 5 and forwarded; the next instruction stalls behind the Load.


    Simple Addressing Mode

The same access rewritten with simple addressing modes:

Add  #X, R1, R2
Load (R2), R2
Load (R2), R2

Clock cycle       1  2  3       4         5            6  7
Add               F  D  X+[R1]  W
Load                 F  D       [X+[R1]]  W
Load                    F       D         [[X+[R1]]]   W
Next instruction                F         D            E  W

(b) Simple addressing mode


    Addressing Modes

In a pipelined processor, complex addressing modes do not necessarily lead to faster execution.

Advantage: they reduce the number of instructions and save program space.

Disadvantages: they cause the pipeline to stall, require more hardware to decode, and are not convenient for the compiler to work with.

Conclusion: complex addressing modes are not suitable for pipelined execution.
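The two timing diagrams above can be reduced to a cycle count. The toy model below (entirely my own construction: a 4-stage F/D/E/W in-order pipeline where a multi-cycle E stage stalls everything behind it) shows why the complex mode buys no time once the stalls are accounted for.

```python
def completion_cycle(e_cycles):
    """Cycle in which the last of a sequence of instructions finishes in
    a 4-stage (F, D, E, W) in-order pipeline, where e_cycles[i] is the
    number of E cycles instruction i needs. Write-port conflicts are
    ignored; this is a back-of-the-envelope model, not from the slides."""
    done = 0                              # last cycle the E stage was busy
    for i, e in enumerate(e_cycles):
        start_e = max(done + 1, i + 3)    # E starts after F (i+1), D (i+2),
        done = start_e + e - 1            # and after the E stage frees up
    return done + 1                       # one more cycle for W

# One complex-mode Load needing 3 E cycles vs. Add + Load + Load with
# one E cycle each: the operand is ready in the same cycle either way.
cycles_complex = completion_cycle([3])       # -> 6
cycles_simple  = completion_cycle([1, 1, 1]) # -> 6
```

The pipelined sequence of three simple instructions finishes no later than the single complex instruction, which is the point of the conclusion above.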


    Addressing Modes

Good addressing modes should have these features:

Access to an operand does not require more than one access to memory.

Only load and store instructions access memory operands.

The addressing modes used do not have side effects.

Register, register indirect, and index modes satisfy these conditions.


Condition Codes

If an optimizing compiler attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur, it must ensure that the reordering does not change the outcome of the computation.

The dependency introduced by the condition-code flags reduces the flexibility available for the compiler to reorder instructions.


Condition Codes

Add       R1,R2
Compare   R3,R4
Branch=0  . . .

(a) A program fragment

Compare   R3,R4
Add       R1,R2
Branch=0  . . .

(b) Instructions reordered

Figure 8.17. Instruction reordering.


Condition Codes

Two conclusions:

To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.

The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.


Topics covered: Pipelining

    Datapath and Control Considerations

Figure 7.8. Three-bus organization of the datapath.

[Figure: the PC, instruction decoder, IR, register file, Constant 4 (selected via a MUX), ALU, MDR, and MAR, interconnected by three buses A, B, and C; an incrementer updates the PC, the MAR drives the address lines, and the MDR connects to the memory bus data lines.]


Pipelined Design

Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.

[Figure: separate instruction and data caches; the PC and an incrementer feed IMAR (memory address for instruction fetches); fetched instructions enter an instruction queue and the instruction decoder; DMAR supplies the memory address for data accesses, with separate MDR/Read and MDR/Write registers; interstage buffers A, B, and R surround the ALU on buses A, B, and C; control signals travel through a control signal pipeline.]

Changes made for pipelined execution:

- Separate instruction and data caches
- PC is connected to IMAR
- DMAR
- Separate MDRs for reading and writing
- Buffers for the ALU
- Instruction queue
- Buffered instruction decoder output

These allow the following operations to proceed in parallel:

- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two registers
- Writing into one register in the register file
- Performing an ALU operation


Topics covered: Pipelining

    Superscalar Operation



    Overview

The maximum throughput of a pipelined processor is one instruction per clock cycle.

If we equip the processor with multiple processing units, so that several instructions can be handled in parallel in each processing stage, then several instructions start execution in the same clock cycle. This is called multiple issue.

Processors capable of achieving an instruction execution throughput of more than one instruction per cycle are known as superscalar processors.

Multiple issue requires a wider path to the cache and multiple execution units.


Superscalar

Figure 8.19. A processor with two execution units.

[Figure: an instruction fetch unit (F) fills an instruction queue; a dispatch unit issues instructions to two execution units, an integer unit and a floating-point unit; completed results pass to a common write stage (W: write results).]


Timing

Clock cycle  1   2   3    4    5    6   7
I1 (Fadd)    F1  D1  E1A  E1B  E1C  W1
I2 (Add)     F2  D2  E2   W2
I3 (Fsub)        F3  D3   E3   E3   E3  W3
I4 (Sub)         F4  D4   E4   W4

Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19, assuming no hazards are encountered.


Out-of-Order Execution

Issues raised by out-of-order execution:

- Hazards
- Exceptions: imprecise exceptions vs. precise exceptions

[Figure (a) Delayed write: the instruction flow of Figure 8.20 (I1 Fadd, I2 Add, I3 Fsub, I4 Sub, clock cycles 1-7), with the register writes of the short integer instructions delayed so that all writes W1-W4 still occur in program order.]


Execution Completion

It is desirable to use out-of-order execution, so that an execution unit is freed to execute other instructions as soon as possible. At the same time, instructions must be completed in program order to allow precise exceptions. Two mechanisms support this:

- The use of temporary registers
- A commitment unit

[Figure (b) Using temporary registers: the instruction flow of Figure 8.20 over clock cycles 1-7; I2 and I4 write their results into temporary registers (TW2, TW4) as soon as their execution completes, and the permanent register writes W2 and W4 are performed later, in program order.]


Topics covered: Pipelining

    Performance Considerations



    Overview

The execution time T of a program that has a dynamic instruction count N is given by:

T = (N × S) / R

where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate.

Instruction throughput is defined as the number of instructions executed per second:

Ps = R / S
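The two formulas can be checked with a short calculation. The numbers below are illustrative only (they are not from the slides):

```python
def execution_time(n, s, r):
    """T = N * S / R: N dynamic instructions, S average cycles per
    instruction, R clock rate in Hz. Returns seconds."""
    return n * s / r

def throughput(r, s):
    """Ps = R / S: instructions executed per second."""
    return r / s

# Example: 1e9 dynamic instructions, S = 1.25 cycles per instruction on
# average, and a 500 MHz clock.
t = execution_time(1e9, 1.25, 500e6)    # -> 2.5 seconds
p = throughput(500e6, 1.25)             # -> 4e8 instructions per second
```

Note that reducing S (e.g. through pipelining) improves both T and Ps for a fixed clock rate R.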


    Overview

An n-stage pipeline has the potential to increase the throughput by n times. However, the only real measure of performance is the total execution time of a program; higher instruction throughput will not necessarily lead to higher performance.

Two questions regarding pipelining:

How much of this potential increase in instruction throughput can be realized in practice?

What is a good value of n?


    Number of Pipeline Stages

Since an n-stage pipeline has the potential to increase the throughput by n times, why not use a 10,000-stage pipeline?

As the number of stages increases, the probability of the pipeline being stalled increases, and the inherent delay in the basic operations increases.

There are also hardware considerations (area, power, complexity, ...).
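The diminishing returns can be illustrated with a toy model (entirely my own construction, not from the slides): assume a fraction p of instructions stalls the pipeline for a full refill of n - 1 cycles, so CPI = 1 + p(n - 1), and the speedup over an unpipelined design that takes n cycles per instruction is n / (1 + p(n - 1)).

```python
def speedup(n, p):
    """Speedup of an n-stage pipeline over unpipelined execution, in a
    toy model where a fraction p of instructions stalls for a full
    refill of n - 1 cycles: speedup = n / (1 + p * (n - 1))."""
    return n / (1 + p * (n - 1))

# With p = 0.05, deeper pipelines help less and less:
# speedup(5, 0.05)     ~ 4.17
# speedup(20, 0.05)    ~ 10.26
# speedup(10000, 0.05) ~ 19.96  -- bounded by 1/p = 20, far from 10000x
```

As n grows the speedup saturates near 1/p, which is why a 10,000-stage pipeline gains almost nothing over a much shallower one in this model, even before latch overhead and the hardware considerations above are counted.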