Post on 14-Dec-2015
Speculative Execution CS510 Computer Architectures Lecture 11 - 1
Lecture 11: Trace Scheduling, Conditional Execution, Speculation, Limits of ILP
Trace Scheduling
• Parallelism across IF branches vs. LOOP branches
  – Trace scheduling works when the behavior of the branches is fairly predictable at compile time
• Two steps:
  – Trace Selection
    • Find a likely sequence of basic blocks (a trace) that forms a long, statically predicted sequence of straight-line code
  – Trace Compaction
    • Squeeze the trace into a few VLIW instructions
    • Need bookkeeping code in case the prediction is wrong
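The trace selection step can be illustrated with a minimal sketch. It is a simplified greedy version that only grows the trace forward, not the full algorithm used by a real trace-scheduling compiler. The function name `select_trace`, the block names, and the branch probabilities are all hypothetical.

```python
def select_trace(cfg, entry):
    """Greedily follow the most likely successor from `entry`.

    cfg maps a basic-block name to a list of (successor, probability)
    pairs; the probabilities stand in for static branch predictions.
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        # Pick the statically most likely successor.
        succ, _prob = max(cfg[block], key=lambda edge: edge[1])
        if succ in seen:  # stop at a loop back edge
            break
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace


# Hypothetical control-flow graph: A branches to B (likely) or X (unlikely),
# and both paths rejoin at C.
cfg = {
    "A": [("B", 0.9), ("X", 0.1)],
    "B": [("C", 1.0)],
    "X": [("C", 1.0)],
    "C": [],
}
trace = select_trace(cfg, "A")
```

With these assumed probabilities the selected trace is A, B, C. Block X stays off the trace, and that is where bookkeeping code would be needed if the prediction is wrong.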
Trace Scheduling
Trace compaction by speculation
  – Move the code associated with B and C ahead of the branch to fill VLIW word(s)
  – This may cause exceptions when executed
  – Speculation should not introduce any new exception*

[Figure: control-flow graph]
    A[i] = A[i] + B[i]
    A[i] = 0 ?
      T: B[i] =
      F: X
    C[i] =
Select this trace (the T path) if the True branch is taken more frequently.

* See the kinds of exceptions on page 179
HW Support for More ILP: Conditional Instructions
• Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
  – If false, the instruction neither stores a result nor causes an exception*
  – Expanded ISAs of Alpha, MIPS, and SPARC have conditional move; PA-RISC can annul any following instruction
• Drawbacks to conditional instructions
  – Still takes a clock even if “annulled”
  – Stall if the condition is evaluated late
  – Complex conditions reduce effectiveness; the condition becomes known late in the pipeline
* See the kinds of exceptions on page 179
HW Support for More ILP: Conditional Instructions
LWC must have no effect if the condition is not satisfied: it can neither write its result nor cause any exception.

Example: conditional move

    With a branch          With a conditional move
        BNZ  R1,L          CMOVZ R2,R3,R1
        MOV  R2,R3
    L:

Two-issue superscalar: each cycle can issue one memory reference and one ALU (or branch) operation.

    First instruction slot    Second instruction slot
    LW   R1,40(R2)            ADD  R3,R4,R5
                              ADD  R6,R3,R7
    BEQZ R10,L
    LW   R8,20(R10)
    LW   R9,0(R8)

The memory slot in the second cycle is wasted (green on the slide), and LW R9 has a data dependence on LW R8 (red on the slide).

With a conditional load:

    First instruction slot    Second instruction slot
    LW   R1,40(R2)            ADD  R3,R4,R5
    LWC  R8,20(R10),R10       ADD  R6,R3,R7
    BEQZ R10,L
    LW   R9,0(R8)

The load executes only when [R10] ≠ 0; i.e., LWC is the same as LW unless its third operand is 0.
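The semantics of the conditional instructions can be sketched in a few lines. This is a behavioral model under stated assumptions, not a description of any real ISA encoding. The function names `with_branch`, `cmovz`, and `lwc` are hypothetical, registers are plain integers, and memory is a dict in which a missing address stands for a faulting access.

```python
def with_branch(r1, r2, r3):
    """BNZ R1,L ; MOV R2,R3 ; L:   returns the final value of R2."""
    if r1 != 0:
        return r2  # branch taken: the move is skipped
    return r3      # fall through: R2 <- R3


def cmovz(r2, r3, r1):
    """CMOVZ R2,R3,R1: R2 <- R3 only if R1 == 0; returns the final R2."""
    return r3 if r1 == 0 else r2


def lwc(memory, old_dest, addr, cond):
    """LWC dest,addr,cond: behaves like LW unless cond is 0.

    When cond is 0 the destination keeps its old value and memory is
    not touched, so a bad address cannot raise an exception.
    """
    if cond == 0:
        return old_dest
    return memory[addr]


# The branch sequence and the conditional move agree for any R1.
same = all(with_branch(r1, 10, 20) == cmovz(10, 20, r1) for r1 in (0, 1, -3))

# A squashed LWC does not fault even though address 20 is not mapped.
kept = lwc({}, old_dest=99, addr=20, cond=0)
```

The second call mirrors `LWC R8,20(R10),R10` with R10 = 0: the address 20 + R10 is never dereferenced, so the load can safely sit above the branch.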
HW Support for More ILP: Speculation
• Speculation: allow an instruction that depends on a branch (predicted to be taken) to issue, with no consequences (including exceptions) if the branch is not actually taken (“HW undo”)
  – Allows the execution of an instruction before the processor knows that the instruction should execute (i.e., it avoids a control dependence stall)
• Often combined with dynamic scheduling
• Tomasulo: separate speculative bypassing of results from real bypassing of results
  – When an instruction is no longer speculative, write its results (instruction commit)
  – Execute out of order but commit in order
Compiler Speculation with HW Support: (1) HW-SW Cooperation for Speculation
• HW undo for misprediction
  – Simply handle all resumable exceptions when the exception occurs
  – Simply return an undefined value for any exception that would cause termination

Source code:
    if (A==0) A = B; else A = A + 4;

Compiled code:
        LW   R1, 0(R3)     ; load A
        BNEZ R1, L1        ; test A
        LW   R1, 0(R2)     ; if clause
        J    L2            ; skip else
    L1: ADD  R1, R1, 4     ; else clause
    L2: SW   0(R3), R1     ; store A

Compiled code using compiler-based speculation*:
        LW   R1, 0(R3)     ; load A
        LW   R14, 0(R2)    ; speculative load B
        BEQZ R1, L3        ; other branch of the if
        ADD  R14, R1, 4    ; the else clause
    L3: SW   0(R3), R14    ; nonspeculative store

* Assume the then clause is almost always executed. Register renaming: an extra register (R14) is needed.
Compiler Speculation with HW Support: (2) Speculation with Poison Bits
• Speculation with Poison Bits
  – Allows compiler speculation with less change to the exception behavior
  – A poison bit is added to every register
  – Another bit is added to every instruction to indicate whether the instruction is speculative

        LW   R1, 0(R3)     ; load A
        LW*  R14, 0(R2)    ; speculative load B
        BEQZ R1, L3        ; other branch of the if
        ADD  R14, R1, 4    ; the else clause
    L3: SW   0(R3), R14    ; nonspeculative store

If the speculative LW* generates a terminating exception, the poison bit of R14 will be set. When the nonspeculative SW instruction occurs, it will raise an exception if the poison bit for R14 is on.
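The poison-bit behavior can be modeled with a small simulation of the five-instruction sequence. This is a sketch under stated assumptions, not a hardware description. The class `PoisonMachine`, the exception `TerminatingFault`, and the addresses 100 and 200 are hypothetical, a missing memory address stands for a terminating fault, and the rule that a normal write clears the poison bit is an assumption of this model.

```python
class TerminatingFault(Exception):
    """A fault that would normally terminate the program."""


class PoisonMachine:
    def __init__(self, memory):
        self.mem = dict(memory)  # valid addresses only; others fault
        self.reg = {}
        self.poison = set()      # registers whose poison bit is set

    def load(self, dst, addr, speculative=False):
        if addr in self.mem:
            self.reg[dst] = self.mem[addr]
            self.poison.discard(dst)
        elif speculative:
            self.reg[dst] = 0     # undefined value; 0 is arbitrary
            self.poison.add(dst)  # defer the fault instead of trapping
        else:
            raise TerminatingFault(f"bad load address {addr}")

    def add_imm(self, dst, src, imm):
        if src in self.poison:
            raise TerminatingFault(f"{src} is poisoned")
        self.reg[dst] = self.reg[src] + imm
        self.poison.discard(dst)  # assumed: a normal write clears the bit

    def store(self, addr, src):
        if src in self.poison:    # nonspeculative use of a poisoned value
            raise TerminatingFault(f"{src} is poisoned")
        self.mem[addr] = self.reg[src]


def run_example(memory, a_addr, b_addr):
    m = PoisonMachine(memory)
    m.load("R1", a_addr)                     # LW   R1, 0(R3)
    m.load("R14", b_addr, speculative=True)  # LW*  R14, 0(R2)
    if m.reg["R1"] != 0:                     # BEQZ R1, L3
        m.add_imm("R14", "R1", 4)            # ADD  R14, R1, 4
    m.store(a_addr, "R14")                   # L3: SW 0(R3), R14
    return m.mem[a_addr]
```

Calling `run_example({100: 5}, 100, 200)` takes the else path, so the faulting speculative load of B is harmless. Calling `run_example({100: 0}, 100, 200)` reaches the store with R14 still poisoned and raises the deferred fault.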
Compiler Speculation with HW Support
• The main disadvantages of the two previous schemes
  – The need to introduce copies to deal with register renaming
  – The possibility of exhausting the registers
• Speculative Instructions with Renaming (Boosting)
  – Flagging the instructions that are moved past branches as speculative
  – Providing renaming and buffering in the HW
Compiler Speculation with HW Support: (3) Speculative Instructions with Renaming
• Extra register is no longer necessary
• Result of the boosted instruction is not written into R1 until after the branch
• Other boosted instructions could use the results of the boosted load

        LW   R1, 0(R3)     ; load A
        LW+  R1, 0(R2)     ; boosted load B
        BEQZ R1, L3        ; other branch of the if
        ADD  R1, R1, 4     ; the else clause
    L3: SW   0(R3), R1     ; nonspeculative store

If the branch is taken, the boosted load's result is written to R1; if it is not taken, the result is never written to R1.
Hardware-based Speculation
• Hardware-based Speculation
  – Dynamic branch prediction
  – Speculation to allow the execution of instructions before the control dependencies are resolved
  – Dynamic scheduling to deal with the scheduling of different combinations of basic blocks
• Advantages
  – Dynamic runtime disambiguation of memory addresses
  – Hardware-based branch prediction
  – A completely precise exception model
  – Does not require compensation or bookkeeping code
  – Does not require different code sequences to achieve good performance for different implementations
HW-based Speculation
Need HW buffer for results of uncommitted instructions: reorder buffer
  – Reorder buffer can be an operand source
  – Once an instruction commits, its result is found in the register
  – 3 fields: instr. type, destination, value
  – Use reorder buffer number instead of reservation station number to tag results
  – Instructions commit in order
  – As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure: block diagram showing the FP Op Queue, Reorder Buffer, FP Regs, loads from memory (LD), Res Stations, and FP Adders]
4 Steps of Speculative Tomasulo Algorithm
1. Issue: Get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number to the reservation station.
2. Execution: Operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute.
3. Write result: Finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.
4. Commit: Update register with reorder result
   When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer.
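The commit step is the part that makes speculation safe to undo, and a minimal sketch can show it in isolation. Reservation stations, the CDB, and operand forwarding are left out. The names `RobEntry` and `ReorderBuffer`, the `mispredicted` flag, and the instruction kinds are hypothetical, and flushing every younger entry on a mispredicted branch is a simplifying assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    tag: int                     # reorder buffer number
    kind: str                    # "alu", "store", or "branch"
    dest: Optional[str] = None   # register name or memory address key
    value: Optional[int] = None
    ready: bool = False
    mispredicted: bool = False


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # head of the deque = oldest instruction
        self.next_tag = 0
        self.regs = {}
        self.mem = {}

    def issue(self, kind, dest=None):
        """Allocate a slot in program order; None means stall (buffer full)."""
        if len(self.entries) == self.size:
            return None
        entry = RobEntry(self.next_tag, kind, dest)
        self.next_tag += 1
        self.entries.append(entry)
        return entry.tag

    def write_result(self, tag, value=None, mispredicted=False):
        """Record a finished result; execution may complete out of order."""
        for e in self.entries:
            if e.tag == tag:
                e.value, e.ready, e.mispredicted = value, True, mispredicted
                return
        raise KeyError(tag)

    def commit(self):
        """Retire ready instructions from the head, strictly in order."""
        committed = []
        while self.entries and self.entries[0].ready:
            e = self.entries.popleft()
            committed.append(e.tag)
            if e.kind == "branch":
                if e.mispredicted:
                    self.entries.clear()  # squash younger speculative work
                    break
            elif e.kind == "store":
                self.mem[e.dest] = e.value
            else:
                self.regs[e.dest] = e.value
        return committed
```

In this model a result that finishes early simply waits in the buffer. It reaches `regs` or `mem` only when everything older has committed, so a mispredicted branch can discard it without any other undo work.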
Limits to ILP
Conflicting studies of the amount of parallelism available in the late 1980s and early 1990s. Different assumptions about:
– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication
Limits to ILP
HW model for ultimate issue performance; MIPS compilers
1. Register renaming: infinite virtual registers, so all WAW and WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation and an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known, and a store can be moved before a load provided the addresses are not equal
1 cycle latency for all instructions
Upper Limit to ILP
[Figure: instruction issues per cycle for each program on the ideal machine]
Program    Type      Instruction issues per cycle
gcc        Integer    54.8
espresso   Integer    62.6
li         Integer    17.9
fpppp      FP         75.2
doduc      FP        118.7
tomcatv    FP        150.1
Limitations on Window Size and Maximum Issue Count
• Window: the set of instructions examined for simultaneous execution
  – n instructions: comparisons needed to determine whether they have any register dependencies among them
      $(2n-2) + (2n-4) + \cdots + 2 = n^2 - n$
    • 2000 instructions -- 4 million comparisons
    • 50 instructions -- 2450 comparisons
  – Current technology: window size 4 to 32
    • requires about 900 comparisons
• Multiple issue -- lengthens the clock cycle
  – typically have clock cycles that are 1.5 to 3 times longer
  – typically have CPIs that are 2 to 3 times lower
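The comparison count can be checked with a short calculation. The sketch assumes that each instruction has two source registers and is compared against the destination of every earlier instruction in the window. The function names `register_comparisons` and `closed_form` are hypothetical.

```python
def register_comparisons(n):
    """Sum 2 + 4 + ... + (2n - 2): two source operands checked against
    the destination of every earlier instruction in an n-entry window."""
    return sum(2 * k for k in range(1, n))


def closed_form(n):
    """The same count written as n^2 - n."""
    return n * n - n


counts = {n: register_comparisons(n) for n in (32, 50, 2000)}
```

For a window of 50 this gives 2450, for 2000 it gives 3,998,000 (about 4 million), and for 32 it gives 992, consistent with the rounded figures on the slide.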
Window Size Impact
[Figure: instruction issues per cycle vs. window size]
Window size   gcc   espresso   li   fpppp   doduc   tomcatv
infinite       55      63      18     75     119      150
2k             35      40      17     60      60       60
512            10      15      12     49      16       45
128            10      13      11     35      15       34
32              8       8       9     14       9       14
8               4       4       4      5       4        6
4               3       3       3      3       3        3
Integer programs: gcc, espresso, li. FP programs: fpppp, doduc, tomcatv.
More Realistic HW: Branch Impact
window of 2000 and maximum issue of 64 instructions/clock cycle
[Figure: instruction issues per cycle vs. branch prediction scheme]
Program    Perfect   Selective predictor   Standard 2-bit   Static      None
                     (correlation + BHT)   (BHT, 512)       (profile)
gcc          35             9                   6              6          2
espresso     41            12                   7              6          2
li           16            10                   6              7          2
fpppp        61            48                  46             45         29
doduc        58            15                  13             14          4
tomcatv      60            46                  45             45         19
Selective History Predictor
[Figure: selective history predictor]
  – Non-correlating predictor: 8192 x 2 bits, indexed by branch address
  – Correlating predictor: 2048 x 4 x 2 bits, indexed by branch address and 2 bits of global history
  – Selector: 8K x 2 bits, chooses between the correlator and the non-correlator
  – 2-bit counter encoding: 11 and 10 = Taken; 01 and 00 = Not Taken
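A selective predictor of this shape can be sketched as three tables of 2-bit saturating counters. The table sizes follow the slide, but several details are assumptions of this sketch: the class name `SelectivePredictor`, the indexing by `pc >> 2`, the initial counter values, and the selector encoding (values 2 and 3 choose the correlator).

```python
def _bump(counter, up):
    """Move a 2-bit saturating counter up or down within 0..3."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class SelectivePredictor:
    def __init__(self, simple_entries=8192, corr_entries=2048,
                 history_bits=2, selector_entries=8192):
        self.simple = [1] * simple_entries  # non-correlating 2-bit counters
        self.corr = [[1] * (1 << history_bits) for _ in range(corr_entries)]
        self.selector = [1] * selector_entries  # >= 2: choose correlator
        self.history = 0                        # global branch history
        self.mask = (1 << history_bits) - 1

    def _lookup(self, pc):
        word = pc >> 2
        simple = self.simple[word % len(self.simple)] >= 2
        corr = self.corr[word % len(self.corr)][self.history] >= 2
        use_corr = self.selector[word % len(self.selector)] >= 2
        return simple, corr, use_corr

    def predict(self, pc):
        simple, corr, use_corr = self._lookup(pc)
        return corr if use_corr else simple

    def update(self, pc, taken):
        simple, corr, _ = self._lookup(pc)
        word = pc >> 2
        i = word % len(self.simple)
        self.simple[i] = _bump(self.simple[i], taken)
        row = self.corr[word % len(self.corr)]
        row[self.history] = _bump(row[self.history], taken)
        if simple != corr:
            # Train the selector toward whichever predictor was right.
            j = word % len(self.selector)
            self.selector[j] = _bump(self.selector[j], corr == taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def run_alternating(n=100, pc=0x40):
    """Feed one branch that alternates taken / not taken; return hit flags."""
    predictor = SelectivePredictor()
    hits = []
    for i in range(n):
        taken = (i % 2 == 0)
        hits.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return hits
```

On the alternating pattern the plain 2-bit counter keeps mispredicting, while the history-indexed table learns both cases. The selector therefore drifts toward the correlator and the combined prediction becomes correct after a short warm-up.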
More Realistic HW: Register Impact
2000 instr window, 64 instr issue, 8K 2-level Prediction
[Figure: instruction issues per cycle vs. number of renaming registers]
Program    Infinite   256   128   64   32   None*
gcc           11       10    10    9    5     4
espresso      15       15    13   10    5     4
li            12       12    12   11    6     5
fpppp         59       49    35   20    5     4
doduc         29       16    15   11    5     5
tomcatv       54       45    44   28    7     5
* DLX: 31 integer registers / 16 FP registers
More Realistic HW: Alias Impact
2000 instr window, 64 instr issue, 8K 2-level Prediction, 256 renaming registers
[Figure: instruction issues per cycle vs. memory alias analysis scheme]
Program    Perfect   Global/stack perfect+
gcc          10             7
espresso     15             7
li           12             9
fpppp        49            49
doduc        16            16
tomcatv      45            45
Inspection# and None*: between 3 and 6 instruction issues per cycle for every program.
* All memory accesses are assumed to conflict
+ Ongoing research
# Most commercial compilers
Realistic HW for 90s: Window Impact
Realistic HW in 90s:
Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window
[Figure: instruction issues per cycle vs. window size]
Program    Infinite   256   128   64   32   16   8   4
gcc           10       10    10    9    8    6   4   3
espresso      15       15    13   10    8    6   4   2
li            12       12    11   11    9    6   4   3
fpppp         52       47    35   22   14    8   5   3
doduc         17       16    15   12    9    7   4   3
tomcatv       56       45    34   22   14    9   6   3
Fallacies and Pitfalls
Fallacy: Processors with lower CPIs will always be faster.
  – Sophisticated pipelines typically have slower clock rates than processors with simple pipelines
  – Example:
    • IBM Power-2 (low CPI): two FP and two load-store, clock rate 71.5 MHz (slower clock rate)
    • DEC Alpha 21064 (high CPI): dual-issue with one load-store and one FP, 200 MHz (faster clock rate)
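The reason a lower CPI does not guarantee a faster processor follows from CPU time = instruction count x CPI / clock rate. The sketch below uses the two clock rates from the example, but the CPI values (1.0 and 2.0) and the instruction count are hypothetical numbers chosen only for illustration; they are not measurements of either processor.

```python
def cpu_time_seconds(instruction_count, cpi, clock_mhz):
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / (clock_mhz * 1e6)


INSTRUCTIONS = 1e9  # hypothetical program length

# Hypothetical CPIs: the low-CPI machine needs half as many cycles
# per instruction but runs at the slower 71.5 MHz clock.
low_cpi_time = cpu_time_seconds(INSTRUCTIONS, cpi=1.0, clock_mhz=71.5)
high_cpi_time = cpu_time_seconds(INSTRUCTIONS, cpi=2.0, clock_mhz=200.0)
```

Under these assumed numbers the high-CPI machine finishes in 10 seconds and the low-CPI machine needs about 14 seconds, so the clock rate advantage outweighs the CPI disadvantage.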
Brainiac vs. Speed Demon
[Figure: SPEC ratio (0 to 900) per benchmark for the 6-scalar IBM Power-2 @ 71.5 MHz (5-stage pipe) vs. the 2-scalar Alpha @ 200 MHz (7-stage pipe)]
Benchmarks: espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp
Recent High Performance Processors
                                                                        Issue capability
Processor         Year shipped   Initial clock   Issue       Scheduling   Maximum   Load-   Integer   FP   Branch   SPEC (measure
                  in systems     rate (MHz)      structure                          store   ALU                     or estimate)
IBM Power-2       1994            67             Dynamic     Static       6         2       2         2    2        95 int, 270 FP
Intel Pentium     1994            66             Dynamic     Static       2         2       2         1    1        65 int, 65 FP
DEC Alpha 21164   1995           300             Static      Static       4         2       2         2    1        330 int, 500 FP
Sun Ultra-        1995           167             Dynamic     Static       4         1       1         1    1        275 int, 305 FP
Intel P6          1995           150             Dynamic     Dynamic      3         1       2         1    1        >200 int
PowerPC 620       1995           133             Dynamic     Dynamic      4         1       1         1    2        25 int, 300 FP
MIPS R10000       1996           200             Dynamic     Dynamic      4         1       2         2    1        300 int, 600 FP
HP 8000           1996           200             Dynamic     Static       4         2       2         2    1        >360 int, >550 FP