Pipelining (I)
Pipelining Example Laundry Example
Four students have one load of clothes each to wash, dry, fold, and put away
Washer takes 30 minutes
Dryer takes 30 minutes
Folding clothes takes 30 minutes
Roommate takes 30 minutes to put clothes away
Sequential Laundry
Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry take?
[Figure: sequential laundry timeline; vertical axis: task order]
Pipelined Laundry: Start Work ASAP
Pipelined laundry takes 3.5 hours for 4 loads!
[Figure: pipelined laundry timeline; vertical axis: task order]
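The two laundry totals can be checked with a few lines of arithmetic. This is a minimal sketch, assuming all four stages take the same 30 minutes as in the example; the function names are illustrative and not part of the slides.

```python
def sequential_minutes(loads, stage_minutes):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(loads, stage_minutes):
    """Assumes equal-length stages: the first load takes the full path,
    then one more load completes after every additional stage time."""
    assert len(set(stage_minutes)) == 1, "sketch assumes equal-length stages"
    return sum(stage_minutes) + (loads - 1) * stage_minutes[0]


stages = [30, 30, 30, 30]  # wash, dry, fold, put away (minutes)
print(sequential_minutes(4, stages) / 60)  # 8.0 hours
print(pipelined_minutes(4, stages) / 60)   # 3.5 hours
```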
Real-World Pipelines: Car Washes
Idea
  Divide process into independent stages
  Move objects through stages in sequence
  At any given time, multiple objects are being processed
[Figure: car washes arranged as Sequential, Parallel, and Pipelined]
Computational Example
System
  Computation requires total of 300 picoseconds
  Additional 20 picoseconds to save result in register
  Must have clock cycle of at least 320 ps
[Figure: combinational logic (300 ps) feeding a register (20 ps), driven by a clock]
Delay = 320 ps, Throughput = 3.12 GOPS
3-Way Pipelined Version
System
  Divide combinational logic into 3 blocks of 100 ps each
  Can begin new operation as soon as previous one passes through stage A
    Begin new operation every 120 ps
  Overall latency increases
    360 ps from start to finish
[Figure: three stages A, B, C, each 100 ps of combinational logic followed by a 20 ps register, driven by a common clock]
Delay = 360 ps, Throughput = 8.33 GOPS
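The delay and throughput figures for the unpipelined and 3-way pipelined systems follow from the stage delays. The sketch below assumes the clock period is the slowest stage's logic delay plus the 20 ps register delay; `pipeline_metrics` is a hypothetical helper name.

```python
def pipeline_metrics(logic_ps, reg_ps=20):
    """logic_ps: combinational delay of each stage, in picoseconds.

    Returns (clock period in ps, latency in ps, throughput in GOPS).
    """
    cycle_ps = max(logic_ps) + reg_ps        # clock must fit the slowest stage
    latency_ps = cycle_ps * len(logic_ps)    # one cycle per stage
    throughput_gops = 1000.0 / cycle_ps      # one operation per cycle
    return cycle_ps, latency_ps, throughput_gops


print(pipeline_metrics([300]))            # (320, 320, 3.125)
print(pipeline_metrics([100, 100, 100]))  # 120 ps cycle, 360 ps latency, about 8.33 GOPS
```

Pipelining raises throughput by a factor of about 2.67 here, not 3, because each added stage also adds a register delay.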
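Pipeline Diagrams
Unpipelined
  Cannot start new operation until previous one completes
  [Figure: OP1, OP2, OP3 executed one after another along the time axis]
3-Way Pipelined
  Up to 3 operations in process simultaneously
  [Figure: OP1, OP2, OP3 each passing through stages A, B, C, overlapped along the time axis]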
Operating a Pipeline
[Figure: timing diagram of OP1, OP2, OP3 passing through stages A, B, C, with snapshots of the pipeline circuit (three 100 ps logic blocks, each followed by a 20 ps register) at times 239, 241, 300, and 359 ps]
Limitations: Nonuniform Delays
Throughput limited by slowest stage
Other stages sit idle for much of the time
Challenging to partition system into balanced stages
[Figure: stages A (50 ps), B (150 ps), C (100 ps), each followed by a 20 ps register; timing diagram of OP1, OP2, OP3 through A, B, C]
Delay = 510 ps, Throughput = 5.88 GOPS
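To see where the 510 ps and 5.88 GOPS come from, and how long each stage sits idle, the per-stage slack can be computed directly. This is a sketch under the assumption that every stage is clocked at the slowest stage's period; `stage_idle_ps` is an illustrative name.

```python
def stage_idle_ps(logic_ps, reg_ps=20):
    """Return the clock period and the idle time of each stage per cycle."""
    cycle_ps = max(logic_ps) + reg_ps
    idle = [cycle_ps - (d + reg_ps) for d in logic_ps]
    return cycle_ps, idle


cycle_ps, idle = stage_idle_ps([50, 150, 100])  # stages A, B, C
print(cycle_ps)                   # 170
print(idle)                       # [100, 0, 50]
print(3 * cycle_ps)               # 510 (total delay in ps)
print(round(1000 / cycle_ps, 2))  # 5.88 (GOPS)
```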
Limitations: Register Overhead
As we try to deepen the pipeline, the overhead of loading registers becomes more significant
Percentage of clock cycle spent loading register:
  1-stage pipeline: 6.25% (= 20/320)
  3-stage pipeline: 16.67% (= 60/360)
  6-stage pipeline: 28.57% (= 120/420)
High speeds of modern processor designs are obtained through very deep pipelining
[Figure: six stages, each 50 ps of combinational logic followed by a 20 ps register, driven by a common clock]
Delay = 420 ps, Throughput = 14.29 GOPS
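The overhead percentages can be reproduced by dividing the register delay by the clock period at each depth. The sketch assumes the 300 ps of logic splits evenly across stages; the function name is hypothetical.

```python
def register_overhead_percent(stages, total_logic_ps=300, reg_ps=20):
    """Share of each clock cycle spent loading the pipeline register."""
    cycle_ps = total_logic_ps / stages + reg_ps
    return 100.0 * reg_ps / cycle_ps


for stages in (1, 3, 6):
    print(stages, round(register_overhead_percent(stages), 2))
# 1 6.25
# 3 16.67
# 6 28.57
```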
Pipelining Lessons
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
Multiple tasks operate simultaneously using different resources
Potential speedup = number of pipeline stages
Pipeline rate is limited by the slowest pipeline stage
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
Stall for dependences
The Five Stages of Load
I-fetch: Instruction Fetch
  Fetch the instruction from the Instruction Memory
Dec/Reg: Instruction Decode and Register Fetch
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
WB: Write the data back to the register file
[Figure: I-fetch, Dec/Reg, Exec, Mem, WB occupying cycles 1 through 5]
Pipelining
Improve performance by increasing instruction throughput
[Figure: non-pipelined execution of the three loads below in program order, each passing through Instruction fetch, Reg, ALU, Data access, Reg; a new instruction starts every 800 ps]
[Figure: pipelined execution of the same three loads; a new instruction starts every 200 ps, with 200 ps per stage]
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
Ideal Speedup
Q: Ideal speedup is the number of stages in the pipeline. Do we achieve this?
  Not in general, because of imperfect balance and pipeline overhead
For 1,000,003 instructions
  Non-pipelined: 800,002,400 ps
  Pipelined: 200,001,400 ps
  $$\frac{800{,}002{,}400}{200{,}001{,}400} \approx 4$$
Pipelining improves performance
  By increasing instruction throughput
  Not by decreasing the execution time of an individual instruction
$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}$$
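The two totals for 1,000,003 instructions can be reproduced as follows. This is a sketch assuming the figures used earlier in the deck: 800 ps per instruction without pipelining, and a 5-stage pipeline with a 200 ps cycle that needs `stages - 1` extra cycles to fill. The function names are illustrative.

```python
def nonpipelined_ps(n, instr_ps=800):
    """Each instruction runs to completion before the next one starts."""
    return n * instr_ps


def pipelined_ps(n, stages=5, cycle_ps=200):
    """The first instruction takes `stages` cycles; each later one adds a cycle."""
    return (stages + n - 1) * cycle_ps


n = 1_000_003
print(nonpipelined_ps(n))                    # 800002400
print(pipelined_ps(n))                       # 200001400
print(nonpipelined_ps(n) / pipelined_ps(n))  # just under 4
```

The speedup approaches 4 rather than 5 because the stages are unbalanced: the slowest stage sets the 200 ps cycle, so five stages cost 1000 ps instead of 800 ps.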
Instruction Set for Pipelining
MIPS is made for pipelining!
All instructions are the same length
  Helps the I-fetch and Decode stages
A few instruction formats with fixed source operand fields
  Can read operands and decode opcode at the same time
Memory operands only appear in loads/stores
  Calculate the memory address in the Execute stage
Operands must be aligned in memory
  One data memory access for a single data transfer instruction
Pipeline Hazards
When the next instruction cannot execute in the following clock cycle
Three Types of Hazards
  Structural hazards
    HW cannot support this combination of instructions due to lack of HW capacity (or resource conflict)
  Data hazards
    Instruction depends on result of prior instruction still in the pipeline
  Control hazards
    Pipelining of branches and other instructions that change the PC
Structural Hazard
Resource shortage (resource conflict)
  Multiple instructions try to use the same resource in the same cycle
Delay younger instruction
  Detect hazard situation dynamically
  Stall younger instruction to allow older instruction to use the resource
Eliminate conflicts
  Reorganize pipeline to avoid accessing resource in 2 stages (if possible; may require ISA change)
  Provide separate copies or ports to resource per stage (may be too expensive)
Structural Hazard Example
[Figure: five instructions in program order, each passing through IM, Reg, ALU, DM, Reg over clock cycles CC1 to CC9]
Access memory from two instructions at the same cycle
Insert three bubbles
Pipeline Stall Due to Structural Hazard
[Figure: the five-instruction pipeline diagram (IM, Reg, ALU, DM, Reg over CC1 to CC9) with three bubbles inserted]
Duplicate resource
Structural Hazard Solution
[Figure: the five-instruction pipeline diagram (IM, Reg, ALU, DM, Reg over CC1 to CC9)]
Use dual-port memory to support two simultaneous accesses
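The memory conflict in the structural hazard example can be located mechanically. The sketch below makes several assumptions that are not in the slides: a single-ported memory shared by instruction fetch and data access, one instruction issued per cycle with no stalls, and a list of booleans saying which instructions are loads or stores. Instruction `i` (0-based) fetches in cycle `i + 1` and reaches its memory stage in cycle `i + 4`.

```python
def memory_conflicts(uses_data_memory):
    """Return (cycle, older_index, younger_index) for every cycle in which
    an older instruction's data access collides with a younger one's fetch."""
    conflicts = []
    n = len(uses_data_memory)
    for i, uses in enumerate(uses_data_memory):
        younger = i + 3  # instruction fetched while instruction i is in its memory stage
        if uses and younger < n:
            conflicts.append((i + 4, i, younger))
    return conflicts


# Five instructions; only the first one is a load.
print(memory_conflicts([True, False, False, False, False]))  # [(4, 0, 3)]
```

With a dual-port memory, or separate instruction and data memories, the same accesses no longer collide and the list of conflicts stops mattering.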
Data Hazard
Data dependence
  An instruction depends on the result of a previous one
  Data hazard occurs due to data dependences
    Read after Write (RAW): flow dependence (true dependence)
    Write after Read (WAR): anti-dependence (false dependence)
    Write after Write (WAW): output dependence (false dependence)
Need to maintain illusion of in-order, sequential execution
  An instruction is fully executed before any following instruction begins
add $t1, $t2, $t3
sub $t4, $t1, $t5
and $t5, $t6, $t7
xor $t5, $t6, $t7
  add -> sub: RAW (Read after Write) on $t1
  sub -> and: WAR (Write after Read) on $t5
  and -> xor: WAW (Write after Write) on $t5
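The three dependence types can be told apart by comparing destination and source registers of an instruction pair. This is a minimal sketch; the tuple layout `(dest, src1, src2)` and the function name `classify` are assumptions made for illustration.

```python
def classify(first, second):
    """Each instruction is (dest, src1, src2); `first` precedes `second`."""
    d1, *srcs1 = first
    d2, *srcs2 = second
    deps = set()
    if d1 in srcs2:
        deps.add("RAW")  # second reads what first writes
    if d2 in srcs1:
        deps.add("WAR")  # second overwrites what first reads
    if d1 == d2:
        deps.add("WAW")  # both write the same register
    return deps


add = ("$t1", "$t2", "$t3")
sub = ("$t4", "$t1", "$t5")
and_ = ("$t5", "$t6", "$t7")
xor = ("$t5", "$t6", "$t7")

print(classify(add, sub))   # {'RAW'}
print(classify(sub, and_))  # {'WAR'}
print(classify(and_, xor))  # {'WAW'}
```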
Data Hazard Example
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg over CC1 to CC9) for the instructions below; a clock inset shows the register file being stored into and then read from within one cycle]
add $t1,$t2,$t3
sub $t4,$t1,$t3
and $t5,$t1,$t3
or $t6,$t1,$t3
xor $t7,$t2,$t1
Pipeline Stall Due to Data Hazard
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg over CC1 to CC9) for the instructions below, with two bubbles inserted]
add $t1,$t2,$t3
sub $t4,$t1,$t3
and $t5,$t1,$t3
or $t6,$t1,$t3
xor $t7,$t2,$t1
Forwarding (1)
Forwarding = Bypassing = Short-Circuiting
Observation:
  Don't need to wait for the instruction to complete
  As soon as the ALU computes the sum for the add, we can supply it as an input for the dependent instruction
No stalling after forwarding
[Figure: the two instructions below in program order, each passing through IF, ID, EX, MEM, WB; time axis 200 to 1000 ps]
add $s0,$t0,$t1
add $t2,$s0,$t3
Forwarding (2)
Load-Use Data Hazard
  Stall even with forwarding
[Figure: the two instructions below in program order, each passing through IF, ID, EX, MEM, WB, with one row of bubbles between them; time axis 200 to 1200 ps]
lw $s0,20($t1)
sub $t2,$s0,$t3
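The stall counts seen so far (two bubbles without forwarding, none with forwarding, one for a load-use pair) fit a small rule. The sketch assumes a 5-stage pipeline whose register file can be written and then read in the same cycle; `distance` is how many instructions after the producer the consumer appears, and the function name is hypothetical.

```python
def stall_cycles(producer_is_load, distance, forwarding):
    """Bubbles needed before a consumer that reads the producer's result."""
    if forwarding:
        # ALU results are forwarded in time; loaded data arrives one cycle later.
        return 1 if (producer_is_load and distance == 1) else 0
    # Without forwarding the consumer's register read must wait for write-back.
    return max(0, 3 - distance)


print(stall_cycles(False, 1, forwarding=False))  # 2
print(stall_cycles(False, 1, forwarding=True))   # 0
print(stall_cycles(True, 1, forwarding=True))    # 1
```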
Another Forwarding Example
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg over CC1 to CC9) for the instructions below, with one bubble]
add $t1,$t2,$t3
sub $t4,$t1,$t3
lw $t6,20($t1)
or $t7,$t6,$t3
xor $t5,$t2,$t6
Control Hazard
Need to change execution flow (change PC)
  Branch, jump, call, return, etc.
  Also called branch hazard
After instruction fetch at cycle 1, still need to figure out
  Whether the current instruction is a branch
  Condition check (for conditional branches)
  Target address calculation
Even if we do all these within one cycle with enough extra HW, a one-cycle stall is still unavoidable
[Figure: the three instructions below, each passing through IF, ID, EXE, MEM, WB, with one row of bubbles; timing labels 200 ps and 400 ps]
add $4, $5, $6
beq $1, $2, 40
or $7, $8, $9
Branch Prediction for Control Hazard
Simple branch prediction
  Always predict branches will be untaken
  Pipeline stall only when prediction is incorrect
[Figure: add, beq, or, each passing through IF, ID, EXE, MEM, WB, with one row of bubbles; timing labels 200 ps and 400 ps]
add $4, $5, $6
beq $1, $2, 40
or $7, $8, $9
[Figure: add, beq, lw, each passing through IF, ID, EXE, MEM, WB, with no bubbles; timing labels 200 ps and 200 ps]
add $4, $5, $6
beq $1, $2, 40
lw $3, 300($0)
Control Hazard Example (original datapath)
2 cycles wasted for taken branches (in our multicycle datapath)
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg over CC1 to CC9) for the instructions below, with two bubbles; annotations mark the points where:]
  Now we know the instruction being executed is a branch
  Now the target address is available
  Branch condition is available
bne $t5,0,0x5F
add $t4,$t1,$t2
lw $t6,20($t4)
or $t7,$t2,$t3
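The cost of the 2 wasted cycles per taken branch can be folded into an average cycles-per-instruction estimate. This is a sketch only: the branch fraction and taken fraction below are hypothetical inputs chosen for illustration, not measurements from the slides, and the model assumes a predict-untaken scheme with a base CPI of 1.

```python
def effective_cpi(branch_fraction, taken_fraction,
                  taken_penalty_cycles=2, base_cpi=1.0):
    """Average CPI when only taken branches pay the stall penalty."""
    return base_cpi + branch_fraction * taken_fraction * taken_penalty_cycles


# Hypothetical workload: 20% branches, half of them taken.
print(effective_cpi(branch_fraction=0.2, taken_fraction=0.5))
```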
Delayed Branch for Control Hazard
Delayed branch
  Solution used in MIPS
  If you run SPIM in bare mode, you must pay attention to delayed branches
  Compiler fills delayed branch slots with instructions from multiple places
[Figure: the three instructions below, each passing through IF, ID, EXE, MEM, WB, 200 ps apart]
beq $1, $2, L1
add $4, $5, $6    (delayed branch slot)
or $7, $8, $9
Filling the delay slot from the same basic block
  Before:
    add $4,$5,$6
    beq $1,$2,L1
    (delay slot)
    L1: or $7,$8,$9
  After:
    beq $1,$2,L1
    add $4,$5,$6
    L1: or $7,$8,$9
Filling the delay slot from the target basic block
  Before:
    L1: sub $7,$8,$9
    add $4,$5,$6
    beq $1,$2,L1
    (delay slot)
  After:
    sub $7,$8,$9
    L1: add $4,$5,$6
    beq $1,$2,L1
    sub $7,$8,$9
Filling the delay slot from the fall-thru basic block
  Before:
    beq $1,$2,L1
    (delay slot)
    sub $7,$8,$9
    L1:
  After:
    beq $1,$2,L1
    sub $7,$8,$9
    L1:
Instruction Scheduling
Original schedule requires one cycle stall
  lw $t1, 0($t0)
  lw $t2, 4($t0)
  sw $t2, 0($t0)
  sw $t1, 4($t0)
After rescheduling (by HW or compiler), the stall is removed
  lw $t1, 0($t0)
  lw $t2, 4($t0)
  sw $t1, 4($t0)
  sw $t2, 0($t0)
[Figure: IF, ID, EX, MEM, WB pipeline diagrams for both schedules]
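The effect of rescheduling can be checked by counting load-use pairs in each ordering. The sketch assumes forwarding is in place, so the only stall counted is a load immediately followed by an instruction that reads the loaded register. The `(opcode, dest, sources)` encoding and the function name are illustrative choices.

```python
def load_use_stalls(program):
    """Count adjacent pairs where a load's result is read by the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls


original = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("sw", None, ["$t2", "$t0"]),  # reads $t2 right after it is loaded
    ("sw", None, ["$t1", "$t0"]),
]
rescheduled = [original[0], original[1], original[3], original[2]]

print(load_use_stalls(original))     # 1
print(load_use_stalls(rescheduled))  # 0
```

Swapping the two stores is safe here because they write different addresses and neither depends on the other.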
Summary
Pipelining
  Executing multiple instructions at different steps
  Improves instruction execution throughput
    By overlapping different phases of multiple instructions
Structural, data, and control hazards exist
  Stalls are needed to make it work correctly
  Several techniques to reduce stalls
    Capacity enhancement
    Two-phase register accesses
    Forwarding
    Prediction
    Delayed branch slot
    Instruction scheduling