Pipelining (I). Pipelining Example Laundry Example Four students have one load of clothes each to...

Pipelining (I)

Pipelining Example: Laundry

Four students each have one load of clothes to wash, dry, fold, and put away

  Washer takes 30 minutes
  Dryer takes 30 minutes
  Folding clothes takes 30 minutes
  Roommate takes 30 minutes to put clothes away

Sequential Laundry

Sequential laundry takes 8 hours for 4 loads (4 loads x 4 steps x 30 minutes)

If they learned pipelining, how long would the laundry take?

[Figure: sequential laundry timeline, loads listed in task order]

Pipelined Laundry: Start Work ASAP

Pipelined laundry takes 3.5 hours for 4 loads!
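The two totals follow from simple counting. Here is a minimal sketch, assuming four equal 30-minute steps (wash, dry, fold, put away); the function names are chosen here for illustration only and do not come from the slides.

```python
def sequential_minutes(loads: int, steps: int, step_minutes: int) -> int:
    # Each load runs through every step before the next load starts.
    return loads * steps * step_minutes


def pipelined_minutes(loads: int, steps: int, step_minutes: int) -> int:
    # The first load needs `steps` time slots; each later load finishes
    # one slot after the load in front of it.
    return (steps + loads - 1) * step_minutes


print(sequential_minutes(4, 4, 30) / 60, "hours sequential")
print(pipelined_minutes(4, 4, 30) / 60, "hours pipelined")
```

With 4 loads this gives 480 minutes against 210 minutes, which matches the 8 hours and 3.5 hours on the slides.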

[Figure: pipelined laundry timeline, loads listed in task order]

Real-World Pipelines: Car Washes

Idea
  Divide process into independent stages
  Move objects through stages in sequence
  At any given time, multiple objects are being processed

[Figure: three car wash arrangements: sequential, parallel, and pipelined]

Computational Example

System
  Computation requires a total of 300 picoseconds
  Additional 20 picoseconds to save the result in a register
  Must have a clock cycle of at least 320 ps

[Figure: combinational logic (300 ps) followed by a register (20 ps), driven by a clock]

Delay = 320 ps, Throughput ≈ 3.12 GOPS (one operation per 320 ps)

3-Way Pipelined Version

System
  Divide combinational logic into 3 blocks of 100 ps each
  Can begin a new operation as soon as the previous one passes through stage A
    Begin a new operation every 120 ps
  Overall latency increases to 360 ps from start to finish

[Figure: three stages in series, each with 100 ps of combinational logic (A, B, C) followed by a 20 ps register, all driven by a common clock]
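As a rough check of the numbers for the unpipelined and 3-way pipelined systems, the sketch below computes clock period, latency, and throughput from the stage delays. The name `pipeline_metrics` and its parameters are illustrative assumptions; the sketch assumes each stage is followed by one 20 ps register and that the slowest stage sets the clock.

```python
def pipeline_metrics(stage_delays_ps, reg_delay_ps=20):
    """Return (clock period in ps, latency in ps, throughput in GOPS)."""
    # The clock must be long enough for the slowest stage plus its register.
    cycle_ps = max(stage_delays_ps) + reg_delay_ps
    # One operation visits every stage, one clock cycle per stage.
    latency_ps = cycle_ps * len(stage_delays_ps)
    # One operation completes per cycle; 1 operation per ps equals 1000 GOPS.
    throughput_gops = 1000.0 / cycle_ps
    return cycle_ps, latency_ps, throughput_gops


unpipelined = pipeline_metrics([300])
three_way = pipeline_metrics([100, 100, 100])
print("unpipelined:", unpipelined)
print("3-way:      ", three_way)
```

Under these assumptions the 3-way version reaches roughly 8.33 GOPS, about 2.67 times the unpipelined throughput, while latency grows from 320 ps to 360 ps.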

Pipeline Diagrams

Unpipelined
  Cannot start a new operation until the previous one completes

[Figure: timeline with OP1, OP2, and OP3 executing one after another]

3-Way Pipelined
  Up to 3 operations in process simultaneously

  OP1   A B C
  OP2     A B C
  OP3       A B C
        Time ->

Operating a Pipeline

[Figure: timeline of OP1, OP2, and OP3 passing through stages A, B, and C, with time in ps, above snapshots of the three-stage pipeline (100 ps of logic plus a 20 ps register per stage) at times 239, 241, 300, and 359 ps]
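The snapshot times in the figure (239, 241, 300, and 359 ps) can be reproduced with a small model. This is only a sketch: it assumes the clock rises at multiples of 120 ps, that OP1 enters stage A at time 0, and that a new operation enters every cycle. The names `stage_at` and `snapshot` are made up for this example.

```python
STAGES = "ABC"
CYCLE_PS = 120  # 100 ps of logic + 20 ps register load per stage


def stage_at(op_index, time_ps):
    """Stage occupied by operation op_index (0 means OP1) at time_ps, or None."""
    start_ps = op_index * CYCLE_PS            # a new operation enters every cycle
    slot = (time_ps - start_ps) // CYCLE_PS   # whole cycles since the operation entered
    return STAGES[slot] if 0 <= slot < len(STAGES) else None


def snapshot(time_ps, n_ops=3):
    """Map each operation name to the stage it occupies at time_ps."""
    return {f"OP{i + 1}": stage_at(i, time_ps) for i in range(n_ops)}


for t in (239, 241, 300, 359):
    print(t, snapshot(t))
```

Just before the clock edge at 240 ps, OP1 is in stage B and OP2 in stage A; just after it, every operation has moved one stage along and OP3 has entered stage A.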

Limitations: Nonuniform Delays

  Throughput limited by slowest stage
  Other stages sit idle for much of the time
  Challenging to partition system into balanced stages

[Figure: three-stage pipeline with unequal logic delays (A = 50 ps, B = 150 ps, C = 100 ps), each followed by a 20 ps register and driven by a common clock, above a timeline of OP1, OP2, and OP3 passing through stages A, B, and C]
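To make the idle time concrete, the sketch below takes the stage delays from the figure (50, 150, and 100 ps, each with a 20 ps register) and reports how long each stage waits per clock cycle. The function name `stage_idle_ps` is a hypothetical helper, and the model assumes the clock period is set by the slowest stage plus its register.

```python
def stage_idle_ps(stage_delays_ps, reg_delay_ps=20):
    """Return (clock period in ps, list of idle ps per cycle for each stage)."""
    cycle_ps = max(stage_delays_ps) + reg_delay_ps
    # Time in each cycle that a stage's logic has nothing left to do.
    idle = [cycle_ps - reg_delay_ps - delay for delay in stage_delays_ps]
    return cycle_ps, idle


cycle_ps, idle = stage_idle_ps([50, 150, 100])
print("clock period:", cycle_ps, "ps")
print("idle per cycle (A, B, C):", idle)
print("latency:", 3 * cycle_ps, "ps")
print("throughput:", round(1000 / cycle_ps, 2), "GOPS")
```

In this model stage A waits 100 ps out of every 170 ps cycle, and throughput drops to about 5.88 GOPS even though the total logic delay is still 300 ps.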

Limitations: Register Overhead

As we try to deepen the pipeline, the overhead of loading registers becomes more significant

Percentage of clock cycle spent loading register:
  1-stage pipeline: 6.25% (= 20/320)
  3-stage pipeline: 16.67% (= 60/360)
  6-stage pipeline: 28.57% (= 120/420)

High speeds of modern processor designs are obtained through very deep pipelining

[Figure: six stages in series, each with 50 ps of combinational logic followed by a 20 ps register, driven by a common clock]

Delay = 420 ps, Throughput = 14.29 GOPS
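The percentages above can be recomputed with a few lines. This sketch assumes the 300 ps of logic splits evenly across the stages and that every stage adds one 20 ps register; `register_overhead` is an illustrative name, not something defined in the slides.

```python
def register_overhead(n_stages, logic_ps=300, reg_delay_ps=20):
    """Return (clock period in ps, register overhead in percent, throughput in GOPS)."""
    cycle_ps = logic_ps / n_stages + reg_delay_ps
    overhead_pct = 100 * reg_delay_ps / cycle_ps
    throughput_gops = 1000 / cycle_ps
    return cycle_ps, overhead_pct, throughput_gops


for n in (1, 3, 6):
    cycle_ps, pct, gops = register_overhead(n)
    print(f"{n}-stage: cycle {cycle_ps:.0f} ps, overhead {pct:.2f}%, {gops:.2f} GOPS")
```

Doubling the depth from 3 to 6 stages shortens the cycle from 120 ps to 70 ps rather than to 60 ps, because the 20 ps register cost does not shrink with the logic.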

Pipelining Lessons

  Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
  Multiple tasks operate simultaneously using different resources
  Potential speedup = number of pipeline stages
  Pipeline rate is limited by the slowest pipeline stage
  Unbalanced lengths of pipe stages reduce speedup
  Time to fill the pipeline and time to drain it reduce speedup
  Stall for dependences
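The effect of fill and drain time on speedup can be sketched as follows, assuming perfectly balanced stages and no stalls. The function `pipeline_speedup` is a hypothetical helper written for this note.

```python
def pipeline_speedup(n_tasks, n_stages):
    """Speedup of a balanced, stall-free pipeline over sequential execution."""
    sequential_steps = n_tasks * n_stages
    # The first task takes n_stages steps; each later task adds one step.
    pipelined_steps = n_stages + n_tasks - 1
    return sequential_steps / pipelined_steps


for n in (4, 100, 1_000_000):
    print(n, "tasks:", round(pipeline_speedup(n, 4), 3))
```

With only 4 tasks on 4 stages the speedup is 16/7, well short of 4; it approaches the number of stages only when the workload is long enough to hide the fill and drain time.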

The Five Stages of Load

I-fetch: Instruction Fetch
  Fetch the instruction from the Instruction Memory
Dec/Reg: Instruction Decode and Register Fetch
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
WB: Write the data back to the register file

  Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
  I-fetch   Dec/Reg   Exec      Mem       WB

Pipelining

Improve performance by increasing instruction throughput

[Figure, nonpipelined: program execution order (in instructions) against time in ps (200 to 1800). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg, and a new instruction starts every 800 ps.]

  lw $1, 100($0)
  lw $2, 200($0)
  lw $3, 300($0)

[Figure, pipelined: the same three instructions against time in ps (200 to 1400). Each stage takes 200 ps, and a new instruction starts every 200 ps.]

Ideal Speedup

Q: Ideal speedup is the number of stages in the pipeline. Do we achieve this?
  Not quite: imperfect balance and pipeline overhead

With perfectly balanced stages:

\[
\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}
\]

For 1,000,003 instructions
  Nonpipelined: 800,002,400 ps
  Pipelined: 200,001,400 ps

\[
\frac{800{,}002{,}400\ \text{ps}}{200{,}001{,}400\ \text{ps}} \approx 4
\]

Pipelining improves performance
  By increasing instruction throughput
  Not by decreasing the execution time of an individual instruction
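The totals for 1,000,003 instructions can be checked directly. This sketch assumes 800 ps per instruction without pipelining and a five-stage pipeline with a 200 ps stage time, as in the lw example; the function names are illustrative.

```python
def nonpipelined_ps(n_instructions, instr_ps=800):
    # Each instruction runs to completion before the next one starts.
    return n_instructions * instr_ps


def pipelined_ps(n_instructions, n_stages=5, stage_ps=200):
    # The first instruction takes n_stages cycles; each later one adds a cycle.
    return (n_stages + n_instructions - 1) * stage_ps


n = 1_000_003
print(nonpipelined_ps(3), pipelined_ps(3))
print(nonpipelined_ps(n), pipelined_ps(n))
print(round(nonpipelined_ps(n) / pipelined_ps(n), 2))
```

The ratio approaches 800/200 = 4 rather than 5, because the clock period is set by the slowest stage, so the five stages are not perfectly balanced.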

Instruction Set for Pipelining

MIPS is made for pipelining!
  All instructions are the same length
    Helps I-fetch & Decode stage
  A few instruction formats & fixed source operand fields
    Can read operands & decode opcode at the same time
  Memory operands only appear in loads/stores
    Calculate the memory address in Execute stage
  Operands must be aligned in memory
    One data memory access for a single data transfer instruction

Pipeline Hazards

When the next instruction cannot execute in the following clock cycle

Three Types of Hazards
  Structural hazards
    HW cannot support this combination of instructions due to lack of HW capacity (or resource conflict)
  Data hazards
    Instruction depends on result of prior instruction still in the pipeline
  Control hazards
    Pipelining of branches and other instructions that change the PC

Structural Hazard

Resource shortage
  Multiple instructions try to use the same resource in the same cycle
Delay younger instruction
  Detect hazard situation dynamically
  Stall younger instruction to allow older instruction to use the resource
Eliminate conflicts
  Reorganize pipeline to avoid accessing resource in 2 stages (if possible; may require ISA change)
  Provide separate copies or ports to resource per stage (may be too expensive)

Structural Hazard Example

[Figure: five instructions in instruction order, each passing through IM, Reg, ALU, DM, Reg, against time in clock cycles CC1 to CC9]

Resource conflict: two instructions access memory in the same cycle

Pipeline Stall Due to Structural Hazard

[Figure: five instructions in instruction order, each passing through IM, Reg, ALU, DM, Reg, against time in clock cycles, with three bubbles inserted]

Insert three bubbles

Structural Hazard Solution

[Figure: five instructions in instruction order, each passing through IM, Reg, ALU, DM, Reg, against time in clock cycles]

Duplicate resource: use dual-port memory to support two simultaneous accesses
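A structural hazard on a shared memory can be detected by counting accesses per cycle. The sketch below is a simplified model under stated assumptions: one instruction is fetched per cycle starting at CC1, the fetch uses the memory in the first stage, and loads and stores use it again three cycles later in the DM stage. The function `memory_conflicts` and its parameters are hypothetical.

```python
from collections import defaultdict


def memory_conflicts(uses_data_memory, memory_ports=1):
    """Return {cycle: [instruction indices]} for cycles that need too many ports.

    uses_data_memory[i] is True when instruction i is a load or store.
    """
    accesses = defaultdict(list)
    for i, uses_dm in enumerate(uses_data_memory):
        fetch_cycle = i + 1                      # instruction i is fetched in CC(i+1)
        accesses[fetch_cycle].append(i)          # instruction fetch
        if uses_dm:
            accesses[fetch_cycle + 3].append(i)  # data access in the 4th stage
    return {
        cycle: instrs
        for cycle, instrs in sorted(accesses.items())
        if len(instrs) > memory_ports
    }


program = [True, False, False, False, False]     # a load followed by four ALU ops
print(memory_conflicts(program, memory_ports=1))
print(memory_conflicts(program, memory_ports=2))
```

With a single port, the load's data access in CC4 collides with the fetch of the fourth instruction; with two ports (the dual-port memory of the slide) the conflict disappears.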

Data Hazard

Data dependence
  An instruction depends on the result of a previous one
  Data hazard occurs due to data dependences
    Read after Write (RAW): flow dependence (true dependence)
    Write after Read (WAR): anti-dependence (false dependence)
    Write after Write (WAW): output dependence (false dependence)
Need to maintain illusion of in-order, sequential execution
  An instruction is fully executed before any following instruction begins

  add $t1, $t2, $t3
  sub $t4, $t1, $t5      # RAW (Read after Write): reads $t1 written by add
  and $t5, $t6, $t7      # WAR (Write after Read): writes $t5 read by sub
  xor $t5, $t6, $t7      # WAW (Write after Write): writes $t5 written by and
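The three dependence types can be told apart by comparing register names. A minimal sketch, assuming each instruction is written as a tuple (destination, source 1, source 2); the `classify` helper is invented for this example.

```python
def classify(first, second):
    """Dependences of `second` on `first`, where `first` is earlier in program order.

    Each instruction is a tuple (dest, src1, src2) of register names.
    """
    dest1, *srcs1 = first
    dest2, *srcs2 = second
    kinds = []
    if dest1 in srcs2:
        kinds.append("RAW")   # second reads what first writes
    if dest2 in srcs1:
        kinds.append("WAR")   # second overwrites what first reads
    if dest1 == dest2:
        kinds.append("WAW")   # both write the same register
    return kinds


add = ("$t1", "$t2", "$t3")
sub = ("$t4", "$t1", "$t5")
and_ = ("$t5", "$t6", "$t7")
xor = ("$t5", "$t6", "$t7")

print(classify(add, sub))
print(classify(sub, and_))
print(classify(and_, xor))
```

Only the RAW case carries a value from one instruction to the next; WAR and WAW arise from reusing a register name, which is why they are called false dependences.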

Data Hazard Example

[Figure: five instructions, each passing through IM, Reg, ALU, DM, Reg, against time in clock cycles. A clock callout on Reg shows that, within one cycle, the register file is written (store into Reg) before it is read (read from Reg).]

  add $t1,$t2,$t3
  sub $t4,$t1,$t3
  and $t5,$t1,$t3
  or  $t6,$t1,$t3
  xor $t7,$t2,$t1

Pipeline Stall Due to Data Hazard

[Figure: five instructions, each passing through IM, Reg, ALU, DM, Reg, against time in clock cycles, with two bubbles inserted]

  add $t1,$t2,$t3
  sub $t4,$t1,$t3
  and $t5,$t1,$t3
  or  $t6,$t1,$t3
  xor $t7,$t2,$t1

Forwarding (1)

Forwarding = Bypassing = Short-Circuiting
Observation:
  Don’t need to wait for the instruction to complete
  As soon as the ALU computes the sum for the add, we can supply it as an input for the next instruction
No stalling after forwarding

[Figure: two instructions, each passing through IF, ID, EX, MEM, WB, against time in ps (200 to 1000)]

  add $s0,$t0,$t1
  add $t2,$s0,$t3

Forwarding (2)

Load-Use Data Hazard
  Stall even with forwarding

[Figure: lw followed by a dependent sub, each passing through IF, ID, EX, MEM, WB, against time in ps (200 to 1200), with one bubble (drawn in all five stages) between them]

  lw  $s0,20($t1)
  sub $t2,$s0,$t3
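The stall counts from the data hazard and forwarding slides can be summarized in one small function. This is a sketch under assumptions: a five-stage pipeline, a register file written before it is read within a cycle, and forwarding paths into the EX stage only. The name `stall_cycles` and its parameters are hypothetical.

```python
def stall_cycles(producer_is_load, distance, forwarding):
    """Bubbles needed before a consumer that is `distance` instructions after its producer.

    distance = 1 means the consumer is the very next instruction.
    """
    if forwarding:
        # An ALU result is ready after EX; a loaded value only after MEM.
        needed_gap = 1 if producer_is_load else 0
    else:
        # The consumer's register read must wait for the producer's write-back.
        needed_gap = 2
    # Independent instructions already sitting in between cover part of the gap.
    return max(0, needed_gap - (distance - 1))


print(stall_cycles(producer_is_load=False, distance=1, forwarding=False))
print(stall_cycles(producer_is_load=False, distance=1, forwarding=True))
print(stall_cycles(producer_is_load=True, distance=1, forwarding=True))
```

In this model an add followed directly by a dependent instruction needs two bubbles without forwarding and none with it, while a load followed directly by a use still needs one bubble.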

Another Forwarding Example

[Figure: five instructions, each passing through IM, Reg, ALU, DM, Reg, against time in clock cycles, with one bubble inserted]

  add $t1,$t2,$t3
  sub $t4,$t1,$t3
  lw  $t6,20($t1)
  or  $t7,$t6,$t3
  xor $t5,$t2,$t6

Control Hazard

Need to change execution flow (change PC)
  Branch, jump, call, return, etc.
  Also called branch hazard
After instruction fetch at cycle 1, still need to figure out
  Whether the current instruction is a branch
  Condition check (for conditional branches)
  Target address calculation
Even if we do all of these within one cycle with enough extra HW, a one-cycle stall is still unavoidable

[Figure: add, beq, and or, each passing through IF, ID, EXE, MEM, WB. The beq starts 200 ps after the add; a bubble follows the beq, so the or starts 400 ps after it.]

  add $4, $5, $6
  beq $1, $2, 40
  or  $7, $8, $9

Branch Prediction for Control Hazard

Simple branch prediction
  Always predict branches will be untaken
  Pipeline stall only when prediction is incorrect

[Figure, prediction correct: instructions start every 200 ps, each passing through IF, ID, EXE, MEM, WB]

  add $4, $5, $6
  beq $1, $2, 40
  lw  $3, 300($0)

[Figure, prediction incorrect: a bubble follows the beq, so the or starts 400 ps after it]

  add $4, $5, $6
  beq $1, $2, 40
  or  $7, $8, $9
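The cost of predicting branches as untaken can be estimated with the usual average-CPI calculation. This sketch is not from the slides: the function `effective_cpi` and the workload numbers below (20% branches, 60% of them taken, one-cycle penalty) are hypothetical values chosen only to show the arithmetic.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when only taken branches pay a penalty."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Hypothetical workload: 20% branches, 60% of them taken, 1 bubble per taken branch.
predict_untaken = effective_cpi(0.20, 0.60, 1)
# Stalling on every branch is the same as treating all branches as mispredicted.
always_stall = effective_cpi(0.20, 1.00, 1)
print(round(predict_untaken, 2), round(always_stall, 2))
```

Under these made-up numbers, prediction lowers the average from 1.20 to 1.12 cycles per instruction; the benefit grows as fewer branches are taken.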

Control Hazard Example (original datapath)

2 cycles wasted for taken branches (in our multicycle datapath)

[Figure: four instructions, each passing through IM, Reg, ALU, DM, Reg, against time in clock cycles, with two bubbles inserted. Callouts:
  Now we know the instruction being executed is a branch
  Now the target address is available
  Branch condition is available]

  bne $t5,0,0x5F
  add $t4,$t1,$t2
  lw  $t6,20($t4)
  or  $t7,$t2,$t3

Delayed Branch for Control Hazard

Delayed branch
  Solution used in MIPS
  If you run SPIM in bare mode, you must pay attention to delayed branches
  Compiler fills delayed branch slots with instructions from multiple places

[Figure: three instructions start every 200 ps, each passing through IF, ID, EXE, MEM, WB]

  beq $1, $2, L1
  add $4, $5, $6      # delayed branch slot
  or  $7, $8, $9

Filling the delay slot from the same basic block

  Before:
        add $4,$5,$6
        beq $1,$2,L1
        (delay slot)
    L1: or  $7,$8,$9

  After:
        beq $1,$2,L1
        add $4,$5,$6      # delay slot
    L1: or  $7,$8,$9

Filling the delay slot from the target basic block

  Before:
    L1: sub $7,$8,$9
        add $4,$5,$6
        beq $1,$2,L1
        (delay slot)

  After:
        sub $7,$8,$9
    L1: add $4,$5,$6
        beq $1,$2,L1
        sub $7,$8,$9      # delay slot

Filling the delay slot from the fall-thru basic block

  Before:
        beq $1,$2,L1
        (delay slot)
        sub $7,$8,$9
    L1:

  After:
        beq $1,$2,L1
        sub $7,$8,$9      # delay slot
    L1:

Instruction Scheduling

Original schedule requires a one-cycle stall

  lw $t1, 0($t0)
  lw $t2, 4($t0)
  sw $t2, 0($t0)
  sw $t1, 4($t0)

After rescheduling (by HW or compiler), the stall is removed

  lw $t1, 0($t0)
  lw $t2, 4($t0)
  sw $t1, 4($t0)
  sw $t2, 0($t0)

[Figure: pipeline diagrams (IF, ID, EX, MEM, WB) for the two schedules]
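The effect of rescheduling can be checked by counting load-use pairs. This is a simplified sketch: it assumes a stall occurs whenever an instruction reads a register loaded by the instruction immediately before it, including a store that uses the loaded value as its data, which matches the one-cycle stall stated on the slide. The tuple encoding and the `load_use_stalls` helper are made up for this example.

```python
def load_use_stalls(program):
    """Count stalls in a list of (opcode, dest or None, [source registers])."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        # A load followed directly by a reader of its destination needs one bubble.
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


original = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("sw", None, ["$t2", "$t0"]),
    ("sw", None, ["$t1", "$t0"]),
]
rescheduled = [original[0], original[1], original[3], original[2]]

print(load_use_stalls(original), load_use_stalls(rescheduled))
```

Swapping the two stores is safe here because they write different addresses (0 and 4 off `$t0`), and it puts one independent instruction between `lw $t2` and the store that uses `$t2`.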

Summary

Pipelining
  Executing multiple instructions at different steps
  Improves instruction execution throughput
    By overlapping different phases of multiple instructions
Structural / data / control hazards exist
  Stalls are needed to make it work correctly
  Several techniques to reduce stalls
    Capacity enhancement
    Two-phase register accesses
    Forwarding
    Prediction
    Delayed branch slot
    Instruction scheduling
