Pipelining (I). Pipelining Example Laundry Example Four students have one load of clothes each to...

Pipelining (I)


Pipelining Example

Laundry Example

Four students have one load of clothes each to wash, dry, fold, and put away

Washer takes 30 minutes

Dryer takes 30 minutes

Folding clothes takes 30 minutes

Roommate takes 30 minutes to put clothes away


Sequential Laundry

Sequential laundry takes 8 hours for 4 loads

If they learned pipelining, how long would laundry take?

[Figure: sequential laundry timeline, loads listed in task order]


Pipelined Laundry: Start Work ASAP

Pipelined laundry takes 3.5 hours for 4 loads!
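
The 8-hour and 3.5-hour figures follow from simple arithmetic on four 30-minute stages. The sketch below reproduces them; the function names are hypothetical (not from the slides), and the pipelined formula assumes all loads advance in lock step at the pace of the slowest stage.

```python
def sequential_minutes(loads, stage_minutes):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(loads, stage_minutes):
    """The first load fills the pipeline; after that, one load completes
    per stage time (assumes lock-step stages paced by the slowest one)."""
    stage = max(stage_minutes)
    return (len(stage_minutes) + loads - 1) * stage


stages = [30, 30, 30, 30]  # wash, dry, fold, put away
print(sequential_minutes(4, stages) / 60)  # 8.0 hours
print(pipelined_minutes(4, stages) / 60)   # 3.5 hours
```

With more loads the ratio of the two times approaches the number of stages, which is the "potential speedup" the later slides refer to.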

[Figure: pipelined laundry timeline, loads listed in task order]


Real-World Pipelines: Car Washes

Idea

Divide process into independent stages

Move objects through stages in sequence

At any given time, multiple objects are being processed

[Figure: car washes arranged as Sequential, Parallel, and Pipelined]


Computational Example

System

Computation requires a total of 300 picoseconds

Additional 20 picoseconds to save the result in a register

Must have a clock cycle of at least 320 ps

[Figure: combinational logic (300 ps) feeding a register (20 ps), driven by Clock]

Delay = 320 ps, Throughput = 3.12 GOPS


3-Way Pipelined Version

System

Divide combinational logic into 3 blocks of 100 ps each

Can begin a new operation as soon as the previous one passes through stage A

Begin a new operation every 120 ps

Overall latency increases: 360 ps from start to finish

[Figure: three stages of combinational logic A, B, C (100 ps each), each followed by a register (20 ps), driven by Clock]

Delay = 360 ps, Throughput = 8.33 GOPS
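
The delay and throughput numbers for the unpipelined and 3-way pipelined versions can be checked with a few lines of arithmetic. This is a minimal sketch; `pipeline_timing` is a hypothetical helper, and it assumes one operation completes per clock once the pipeline is full.

```python
def pipeline_timing(stage_logic_ps, reg_ps=20):
    """Return (clock_ps, latency_ps, throughput_gops) for a pipeline whose
    stages have the given combinational delays, each followed by a register."""
    clock_ps = max(stage_logic_ps) + reg_ps      # slowest stage sets the clock
    latency_ps = clock_ps * len(stage_logic_ps)  # one clock per stage
    throughput_gops = 1000.0 / clock_ps          # one op per clock; 1000 ps = 1 ns
    return clock_ps, latency_ps, throughput_gops


# Unpipelined: 320 ps clock, 320 ps latency, about 3.12 GOPS
print(pipeline_timing([300]))
# 3-way pipelined: 120 ps clock, 360 ps latency, about 8.33 GOPS
print(pipeline_timing([100, 100, 100]))
```

Pipelining raises throughput by a factor of about 2.67 here, not 3, because every stage still pays the 20 ps register delay.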


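Pipeline Diagrams

Unpipelined

Cannot start a new operation until the previous one completes

[Figure: OP1, OP2, OP3 executed one after another along the Time axis]

3-Way Pipelined

Up to 3 operations in process simultaneously

[Figure: OP1, OP2, OP3 each passing through stages A, B, C, overlapped along the Time axis]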


Operating a Pipeline

[Figure: pipeline diagram of OP1, OP2, OP3 through stages A, B, C along the Time axis, with Clock ticks labeled 0, 120, 240, 360, 480, 640; snapshots of the three-stage circuit (combinational logic A, B, C at 100 ps each, registers at 20 ps each) at times 239, 241, 300, and 359]


Limitations: Nonuniform Delays

Throughput limited by slowest stage

Other stages sit idle for much of the time

Challenging to partition system into balanced stages

[Figure: three stages with combinational logic A (50 ps), B (150 ps), C (100 ps), each followed by a register (20 ps), driven by Clock; pipeline diagram of OP1, OP2, OP3 through A, B, C along the Time axis]

Delay = 510 ps, Throughput = 5.88 GOPS
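
To see why the slowest stage dominates, the sketch below computes the clock period for the 50/150/100 ps partition and the fraction of each cycle that every stage is actually busy. `stage_utilization` is a hypothetical helper, and "busy" here is an assumed definition: a stage's logic delay plus its register delay, divided by the clock period.

```python
def stage_utilization(stage_logic_ps, reg_ps=20):
    """Return (clock_ps, busy) where busy maps each stage name to the
    fraction of the clock period it spends doing work."""
    clock_ps = max(stage_logic_ps.values()) + reg_ps  # paced by the slowest stage
    busy = {name: (delay + reg_ps) / clock_ps
            for name, delay in stage_logic_ps.items()}
    return clock_ps, busy


stages = {"A": 50, "B": 150, "C": 100}
clock_ps, busy = stage_utilization(stages)
latency_ps = clock_ps * len(stages)

# 170 ps clock, 510 ps latency, about 5.88 GOPS
print(clock_ps, latency_ps, round(1000 / clock_ps, 2))
for name, fraction in busy.items():
    print(f"stage {name} is busy {fraction:.0%} of each cycle")
```

Only stage B is fully used; A and C wait for it every cycle, which is the idle time the slide mentions.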


Limitations: Register Overhead

As we try to deepen the pipeline, the overhead of loading registers becomes more significant

Percentage of clock cycle spent loading register:

1-stage pipeline: 6.25% (= 20/320)

3-stage pipeline: 16.67% (= 60/360)

6-stage pipeline: 28.57% (= 120/420)

High speeds of modern processor designs are obtained through very deep pipelining

Delay = 420 ps, Throughput = 14.29 GOPS

[Figure: six stages, each with combinational logic (50 ps) followed by a register (20 ps), driven by Clock]
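
The overhead percentages above can be reproduced by splitting the 300 ps of logic evenly across the stages. A minimal sketch, assuming a perfectly even split and a fixed 20 ps register delay; `register_overhead` is a hypothetical name.

```python
def register_overhead(total_logic_ps, stages, reg_ps=20):
    """Split total_logic_ps evenly over `stages` and return
    (latency_ps, overhead_fraction, throughput_gops)."""
    clock_ps = total_logic_ps / stages + reg_ps
    latency_ps = clock_ps * stages
    overhead = reg_ps / clock_ps  # same as stages * reg_ps / latency_ps
    return latency_ps, overhead, 1000.0 / clock_ps


for n in (1, 3, 6):
    latency, overhead, gops = register_overhead(300, n)
    print(f"{n}-stage: latency {latency:.0f} ps, "
          f"overhead {overhead:.2%}, {gops:.2f} GOPS")
```

Doubling the depth from 3 to 6 stages raises throughput from about 8.33 to about 14.29 GOPS, less than 2x, because the register share of each cycle grows.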


Pipelining Lessons

Pipelining doesn’t help latency of single task, it helps throughput of entire workload

Multiple tasks operating simultaneously using different resources

Potential speedup = Number of pipeline stages

Pipeline rate limited by slowest pipeline stage

Unbalanced lengths of pipe stages reduce speedup

Time to fill pipeline and time to drain it reduce speedup

Stall for dependences


The Five Stages of Load

I-fetch: Instruction Fetch. Fetch the instruction from the Instruction Memory

Dec/Reg: Instruction Decode and Register Fetch

Exec: Calculate the memory address

Mem: Read the data from the Data Memory

WB: Write the data back to the register file

Cycle 1: I-fetch, Cycle 2: Dec/Reg, Cycle 3: Exec, Cycle 4: Mem, Cycle 5: WB


Pipelining

Improve performance by increasing instruction throughput

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

[Figure, non-pipelined: program execution order (in instructions) vs. time (200 to 1800 ps); each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg, and a new instruction starts every 800 ps]

[Figure, pipelined: program execution order (in instructions) vs. time (200 to 1400 ps); each of the five stages takes 200 ps, and a new instruction starts every 200 ps]


Ideal Speedup

Q: Ideal speedup is the number of stages in the pipeline. Do we achieve this?

$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}$$

Imperfect balance & pipeline overhead

For 1,000,003 instructions

Non-pipelined: 800,002,400 ps

Pipelined: 200,001,400 ps

$$\frac{800{,}002{,}400}{200{,}001{,}400} \approx 4$$

Pipelining improves performance

By increasing instruction throughput

Not by decreasing the execution time of an individual instruction
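
The two totals for 1,000,003 instructions can be derived from the per-instruction times on the previous slide (800 ps non-pipelined, 200 ps per stage with 5 stages pipelined). The function names in this sketch are hypothetical, and it assumes no stalls.

```python
def nonpipelined_ps(n, instr_ps=800):
    """Each instruction runs to completion before the next one starts."""
    return n * instr_ps


def pipelined_ps(n, stage_ps=200, stages=5):
    """The first instruction takes `stages` cycles; each later instruction
    completes one cycle after its predecessor (no stalls assumed)."""
    return (stages + n - 1) * stage_ps


n = 1_000_003
t_np = nonpipelined_ps(n)  # 800,002,400 ps
t_p = pipelined_ps(n)      # 200,001,400 ps
print(t_np, t_p, t_np / t_p)
```

The ratio approaches 800/200 = 4 rather than the 5 stages, because the 200 ps cycle is set by the slowest stage while the non-pipelined instruction needs only 800 ps in total.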


Instruction Set for Pipelining

MIPS is made for pipelining!

All instructions are the same length

Helps I-fetch & Decode stage

A few instruction formats & fixed source operand fields

Can read operands & decode opcode at the same time

Memory operands only appear in loads/stores

Calculate the memory address in Execute stage

Operands must be aligned in memory

One data memory access for a single data transfer instruction


Pipeline Hazards

When the next instruction cannot execute in the following clock cycle

Three Types of Hazards

Structural hazards: HW cannot support this combination of instructions due to lack of HW capacity (or resource conflict)

Data hazards: Instruction depends on result of prior instruction still in the pipeline

Control hazards: Pipelining of branches and other instructions that change the PC


Structural Hazard

Resource shortage

Multiple instructions try to use the same resource on the same cycle

Delay younger instruction

Detect hazard situation dynamically

Stall younger instruction to allow older instruction to use the resource

Eliminate conflicts

Reorganize pipeline to avoid accessing resource in 2 stages (if possible; may require ISA change)

Provide separate copies or ports to resource per stage (may be too expensive)


Structural Hazard Example

Resource conflict: two instructions access memory in the same cycle

[Figure: five instructions in instruction order over time (clock cycles CC1 to CC9), each passing through IM, Reg, ALU, DM, Reg]


Pipeline Stall Due to Structural Hazard

Insert three bubbles

[Figure: five instructions in instruction order over time (clock cycles CC1 to CC9), each passing through IM, Reg, ALU, DM, Reg, with three bubbles inserted]


Structural Hazard Solution

Duplicate resource: use dual-port memory to support two simultaneous accesses

[Figure: five instructions in instruction order over time (clock cycles CC1 to CC9), each passing through IM, Reg, ALU, DM, Reg]


Data Hazard

Data dependence

An instruction depends on the result of a previous one

Data hazard occurs due to data dependences

Read after Write (RAW): flow dependence (true dependence)

Write after Read (WAR): anti-dependence

Write after Write (WAW): output dependence

Need to maintain illusion of in-order, sequential execution

An instruction is fully executed before any following instruction begins

add $t1, $t2, $t3

sub $t4, $t1, $t5

and $t5, $t6, $t7

xor $t5, $t6, $t7

RAW (Read after Write): add writes $t1, then sub reads $t1

WAR (Write after Read): sub reads $t5, then and writes $t5

WAW (Write after Write): and writes $t5, then xor writes $t5

WAR and WAW are false dependences
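
The three dependence types can be told apart mechanically by comparing the destination and source registers of an earlier and a later instruction. The sketch below does this for the four instructions on this slide; `parse` and `dependences` are hypothetical helpers that handle only the three-register form `op rd, rs, rt`.

```python
def parse(instr):
    """Split 'op rd, rs, rt' into (destination, [sources])."""
    _, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return regs[0], regs[1:]


def dependences(earlier, later):
    """List the dependence types from `earlier` to `later`."""
    d1, s1 = parse(earlier)
    d2, s2 = parse(later)
    found = []
    if d1 in s2:
        found.append("RAW")  # later reads what earlier writes
    if d2 in s1:
        found.append("WAR")  # later writes what earlier reads
    if d1 == d2:
        found.append("WAW")  # both write the same register
    return found


print(dependences("add $t1, $t2, $t3", "sub $t4, $t1, $t5"))  # ['RAW']
print(dependences("sub $t4, $t1, $t5", "and $t5, $t6, $t7"))  # ['WAR']
print(dependences("and $t5, $t6, $t7", "xor $t5, $t6, $t7"))  # ['WAW']
```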


Data Hazard Example

add $t1,$t2,$t3

sub $t4,$t1,$t3

and $t5,$t1,$t3

or $t6,$t1,$t3

xor $t7,$t2,$t1

[Figure: the five instructions over time (clock cycles CC1 to CC9), each passing through IM, Reg, ALU, DM, Reg. Inset: Clock and Reg, labeled "Store into Reg" and "Read from Reg"]


Pipeline Stall Due to Data Hazard

add $t1,$t2,$t3

sub $t4,$t1,$t3

and $t5,$t1,$t3

or $t6,$t1,$t3

xor $t7,$t2,$t1

[Figure: the five instructions over time (clock cycles CC1 to CC9), each passing through IM, Reg, ALU, DM, Reg, with two bubbles inserted]
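
The two bubbles can be derived by tracking when each register value becomes readable. The sketch below is a simplified model, not the datapath from the slides: it assumes no forwarding, a five-stage pipeline, and a register file that is written in the first half of a cycle and read in the second half, so a consumer may read a register in the same cycle its producer is in WB. `count_stalls` is a hypothetical helper.

```python
def count_stalls(program):
    """program: list of (dest, [sources]) in program order.
    Returns the total number of bubble cycles without forwarding."""
    ready = {}    # register -> cycle in which its producer is in WB
    id_cycle = 1  # so that the first instruction decodes in cycle 2
    stalls = 0
    for dest, sources in program:
        earliest = id_cycle + 1  # one cycle after the previous decode
        needed = max([ready.get(r, 0) for r in sources] + [earliest])
        stalls += needed - earliest
        id_cycle = needed
        ready[dest] = id_cycle + 3  # ID -> EX -> MEM -> WB
    return stalls


program = [
    ("$t1", ["$t2", "$t3"]),  # add $t1,$t2,$t3
    ("$t4", ["$t1", "$t3"]),  # sub $t4,$t1,$t3
    ("$t5", ["$t1", "$t3"]),  # and $t5,$t1,$t3
    ("$t6", ["$t1", "$t3"]),  # or  $t6,$t1,$t3
    ("$t7", ["$t2", "$t1"]),  # xor $t7,$t2,$t1
]
print(count_stalls(program))  # 2
```

Only sub stalls: once it has waited two cycles for $t1, the instructions behind it read $t1 after it has been written.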


Forwarding (1)

Forwarding = Bypassing = Short-Circuiting

Observation:

Don't need to wait for the instruction to complete

As soon as the ALU computes the sum for the add, we can supply it as an input for the dependent instruction

No stalling after forwarding

add $s0,$t0,$t1

add $t2,$s0,$t3

[Figure: program execution order (in instructions) vs. time (200 to 1000 ps); both instructions pass through IF, ID, EX, MEM, WB]


Forwarding (2)

Load-Use Data Hazard

Stall even with forwarding

lw $s0,20($t1)

sub $t2,$s0,$t3

[Figure: program execution order (in instructions) vs. time (200 to 1200 ps); both instructions pass through IF, ID, EX, MEM, WB, with a bubble inserted between them]


Another Forwarding Example

add $t1,$t2,$t3

sub $t4,$t1,$t3

lw $t6,20($t1)

or $t7,$t6,$t3

xor $t5,$t2,$t6

[Figure: the five instructions over time (clock cycles CC1 to CC9), each passing through IM, Reg, ALU, DM, Reg, with one bubble inserted]


Control Hazard

Need to change execution flow (change PC)

Branch, jump, call, return, etc.

Also called branch hazard

After instruction fetch at cycle 1, still need to figure out

Current instruction is branch

Condition check (for conditional branches)

Target address calculation

Even if we do all these within one cycle with enough extra HW, one cycle stall is still unavoidable

add $4, $5, $6

beq $1, $2, 40

or $7, $8, $9

[Figure: the three instructions pass through IF, ID, EXE, MEM, WB; beq starts 200 ps after add, and a bubble after beq means or starts 400 ps after beq]


Branch Prediction for Control Hazard

Simple branch prediction

Always predict that branches will be untaken

Pipeline stalls only when the prediction is incorrect

[Figure, prediction incorrect: add $4, $5, $6; beq $1, $2, 40; or $7, $8, $9 pass through IF, ID, EXE, MEM, WB; a bubble follows beq, so or starts 400 ps after beq]

[Figure, prediction correct: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) pass through IF, ID, EXE, MEM, WB, each starting 200 ps after the previous one]


Control Hazard Example (original datapath)

2 cycles wasted for taken branches (in our multicycle datapath)

bne $t5,0,0x5F

add $t4,$t1,$t2

lw $t6,20($t4)

or $t7,$t2,$t3

[Figure: the instructions over time (clock cycles CC1 to CC9), each passing through IM, Reg, ALU, DM, Reg, with two bubbles. Annotations: "Now we know the instruction to be executed is a branch", "Now target address is available", "Branch condition is available"]
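
The cost of the 2 wasted cycles depends on how often branches occur and how often they are taken. The sketch below estimates the effect on cycles per instruction under predict-not-taken. The 2-cycle penalty is from the slide; the branch frequency (20%) and taken rate (50%) are made-up illustrative values, and `effective_cpi` is a hypothetical helper that ignores all other stalls.

```python
def effective_cpi(branch_fraction, taken_fraction, taken_penalty=2, base_cpi=1.0):
    """Average cycles per instruction when only taken branches pay
    `taken_penalty` stall cycles (predict-not-taken, no other stalls)."""
    stall_cycles_per_instr = branch_fraction * taken_fraction * taken_penalty
    return base_cpi + stall_cycles_per_instr


# Illustrative workload (assumed numbers, not from the slides):
# 20% of instructions are branches and half of them are taken.
print(effective_cpi(branch_fraction=0.20, taken_fraction=0.50))
```

With these assumed numbers the taken branches add 0.2 cycles per instruction on average; a smaller penalty or a better predictor shrinks that term.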


Delayed Branch for Control Hazard

Delayed branch

Solution used in MIPS

If you run SPIM in bare mode, you must pay attention to delayed branches

Compiler fills delayed branch slots with instructions from multiple places

beq $1, $2, L1

add $4, $5, $6 (delayed branch slot)

or $7, $8, $9

[Figure: the three instructions pass through IF, ID, EXE, MEM, WB, each starting 200 ps after the previous one]

Filling the delay slot from the same basic block

Before: add $4,$5,$6; beq $1,$2,L1; (delay slot); L1: or $7,$8,$9

After: beq $1,$2,L1; add $4,$5,$6; L1: or $7,$8,$9

Filling the delay slot from the target basic block

Before: L1: sub $7,$8,$9; add $4,$5,$6; beq $1,$2,L1; (delay slot)

After: sub $7,$8,$9; L1: add $4,$5,$6; beq $1,$2,L1; sub $7,$8,$9

Filling the delay slot from the fall-thru basic block

Before: beq $1,$2,L1; (delay slot); sub $7,$8,$9; L1:

After: beq $1,$2,L1; sub $7,$8,$9; L1:


Instruction Scheduling

Original schedule requires one cycle stall

lw $t1, 0($t0)

lw $t2, 4($t0)

sw $t2, 0($t0)

sw $t1, 4($t0)

After rescheduling (by HW or compiler), a stall is removed

lw $t1, 0($t0)

lw $t2, 4($t0)

sw $t1, 4($t0)

sw $t2, 0($t0)

[Figure: IF, ID, EX, MEM, WB pipeline diagrams for both schedules]
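
The difference between the two schedules can be checked by looking for an instruction that uses a register loaded by the instruction immediately before it. The sketch below is a simplified model matching the slide's one-cycle load-use stall; `parse` and `load_use_stalls` are hypothetical helpers that handle only `lw` and `sw` in the `op $r, offset($base)` form.

```python
import re


def parse(instr):
    """Return (op, register, base) for 'lw/sw $r, off($base)'."""
    m = re.fullmatch(r"\s*(lw|sw)\s+(\$\w+),\s*-?\d+\((\$\w+)\)\s*", instr)
    if m is None:
        raise ValueError(f"unsupported instruction: {instr!r}")
    return m.group(1), m.group(2), m.group(3)


def load_use_stalls(program):
    """Count one stall for each instruction that reads a register loaded
    by the instruction immediately before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        p_op, p_reg, _ = parse(prev)
        c_op, c_reg, c_base = parse(cur)
        if p_op != "lw":
            continue
        # Both lw and sw read the base register; sw also reads its data register.
        reads = {c_base} | ({c_reg} if c_op == "sw" else set())
        if p_reg in reads:
            stalls += 1
    return stalls


original = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "sw $t2, 0($t0)", "sw $t1, 4($t0)"]
rescheduled = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "sw $t1, 4($t0)", "sw $t2, 0($t0)"]
print(load_use_stalls(original), load_use_stalls(rescheduled))  # 1 0
```

Swapping the two stores is safe here because they write different addresses, and it moves the use of $t2 one instruction further from its load.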


Summary

Pipelining

Executing multiple instructions at different steps

Improves instruction execution throughput

By overlapping different phases of multiple instructions

Structural / data / control hazards exist

Stalls are needed to make it work correctly

Several techniques to reduce stalls

Capacity enhancement

Two-phase register accesses

Forwarding

Prediction

Delayed branch slot

Instruction scheduling