Real-time Signal Processing on Embedded Systems
Advanced Cutting-edge Research Seminar I&III
Advances in Microprocessor Technology
Architectural improvements of microprocessors
Pipelining
Parallel processing exploiting ILP: Superscalar, VLIW
SIMD
Procedure of instruction execution on a processor
Instruction Fetch (IF): fetches an instruction from main memory.
Instruction Decode (ID): decodes the fetched instruction.
Execution (EX): executes the decoded instruction.
Memory Access (MA): accesses main memory.
Write Back (WB): writes data back to registers.
Operation cycles on a processor
Single-cycle machine: This kind of machine executes all procedures from IF to WB in one cycle. Operation speed is determined by the slowest instruction, because every instruction must complete within a single cycle.
Multi-cycle machine: This kind of machine executes an instruction in several cycles (IF, ID, EX, MA, WB).
Pipelined operation can improve the throughput of instructions.
[Figure: pipeline timing diagrams in which successive instructions each pass through IF, ID, EX, MA, and WB, with each instruction starting one cycle after the previous one.]
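The throughput gain can be illustrated with a small cycle count. The following is a minimal sketch, assuming an ideal pipeline with no hazards and one cycle per stage; the function names are hypothetical and not part of the lecture.

```c
#include <assert.h>

/* Cycles needed when every instruction finishes all stages
 * before the next instruction starts (multi-cycle machine). */
unsigned long multicycle_cycles(unsigned long instructions, unsigned long stages)
{
    return instructions * stages;
}

/* Cycles needed on an ideal pipeline without hazards: the first
 * instruction needs `stages` cycles, and each later instruction
 * completes one cycle after its predecessor. */
unsigned long pipelined_cycles(unsigned long instructions, unsigned long stages)
{
    assert(stages > 0);
    if (instructions == 0) {
        return 0;
    }
    return stages + (instructions - 1);
}
```

With 5 instructions and 5 stages, `multicycle_cycles(5, 5)` gives 25 cycles and `pipelined_cycles(5, 5)` gives 9.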
To realize pipelined operation, several techniques are required.
Causes of pipeline hazards
Structural hazard: The hardware cannot support the combination of instructions that are in the pipeline at the same time.
Data hazard: A later instruction must wait for an earlier instruction to complete because it uses the result of the earlier one.
Control hazard: Whether an instruction is executed or not depends on the result of an earlier instruction.
Structural hazard
[Figure: four overlapped instructions (IF ID EX MA WB) running on a CPU that contains the PC, instruction register, instruction decoder, ALU, and registers, connected to a single memory.]
MA/IF conflict: the MA stage of one instruction and the IF stage of a later instruction access the memory in the same cycle.
Resolve 1: stall the next instruction.
Resolve 2: add another bus to access the instruction memory, so that the instruction memory (Inst Mem) and the data memory (Data Mem) are accessed separately. This is the Harvard Architecture.
Data hazard
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
sub $t2,$s0,$t3    ($t2 = $s0 - $t3)
Registers (initial values): t0=5, t1=4, t2=3, t3=2, t4=1; s0=s1=s2=s3=s4=0
[Figure: the two instructions overlapped in the pipeline (IF ID EX MA WB) on a CPU with PC, memory, instruction register, instruction decoder, ALU, and registers.]
add computes $s0 = $t0 + $t1, but sub reads $s0 before the result has been written back, so $t2 = $s0 - $t3 is computed as -2 = 0 - 2.
Waiting by stalls: consuming 3 cycles
Resolve: forwarding
The result is forwarded to the ALU, so $t2 = $s0 - $t3 is computed as 7 = 9 - 2.
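The effect of forwarding on this example can be reproduced in a few lines. This is only a sketch: the structure `regs`, the function `run_add_sub`, and the way the pipeline is collapsed into plain assignments are assumptions made for illustration, using the register values shown on the slide (t0=5, t1=4, t3=2, s0=0).

```c
#include <assert.h>

/* Register file of the example: t0..t4 and s0..s4. */
struct regs {
    int t[5];
    int s[5];
};

/* Runs  add $s0,$t0,$t1  followed by  sub $t2,$s0,$t3.
 * forwarding == 0: sub reads the stale $s0 from the register file.
 * forwarding != 0: sub receives the add result from the ALU output.
 * Returns the value written to $t2. */
int run_add_sub(struct regs *r, int forwarding)
{
    int stale_s0 = r->s[0];              /* $s0 before add writes back */
    int add_result = r->t[0] + r->t[1];  /* EX stage of add */
    int sub_operand = forwarding ? add_result : stale_s0;
    int sub_result = sub_operand - r->t[3]; /* EX stage of sub */

    assert(r != 0);
    r->s[0] = add_result;                /* WB stage of add */
    r->t[2] = sub_result;                /* WB stage of sub */
    return sub_result;
}
```

Without forwarding the call returns -2 (the wrong value from the slide), and with forwarding it returns 7.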
Control hazard
An instruction sequence including a branch:
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2, 40    (if ($s1 == $s2) { goto 40 })
or $s3,$s4,$t2     ($s3 = $s4 | $t2)
[Figure: the three instructions overlapped in the pipeline (IF ID EX MA WB) while the PC advances through 10, 11, and 12.]
※ In this explanation, the PC uses word addresses for simplification.
The PC value of the next instruction depends on the branch condition. Branch is taken: PC=40. Not taken: PC=12.
Control hazard Resolve 1: stall
2-cycle stall: the instruction after the branch is stalled. The number of required stall cycles is determined by the architecture.
1-cycle stall: possible if the processor can calculate the branch target address at the ID stage.
Control hazard Resolve 2: branch prediction
In this example, the next PC is predicted as if the branch is always untaken, so fetching continues with PC 10, 11, 12.
If the prediction is missed, in other words, if the branch is taken, a stall occurs and the PC becomes 40.
Control hazard More practical scheme: dynamic branch prediction
n-bit counter-based prediction: the lower i bits of the address of a branch instruction select an entry of the Branch History Table, and each entry is an n-bit saturating up/down counter.
1-bit counter-based prediction
State 1: predict the branch will be taken. State 0: predict the branch will be untaken.
The state becomes 1 when the branch is taken and 0 when the branch is untaken.
2-bit counter-based prediction
States 11 and 10: predict the branch will be taken. States 01 and 00: predict the branch will be untaken.
The counter counts up when the branch is taken and counts down when the branch is untaken, saturating at 11 and 00.
This scheme is adopted in the Intel Pentium, Sun UltraSPARC, MIPS R10000, etc.
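A 2-bit counter-based predictor can be sketched as follows. The table size (`BHT_BITS` = 4, i.e. the lower 4 address bits), the type and function names, and the convention that counter values 2 and 3 predict taken are assumptions for illustration, not details taken from any of the processors named above.

```c
#include <assert.h>
#include <stdint.h>

#define BHT_BITS 4                    /* assumed: lower 4 address bits index the table */
#define BHT_SIZE (1u << BHT_BITS)

/* Branch History Table: one 2-bit saturating counter (0..3) per entry. */
struct bht {
    uint8_t counter[BHT_SIZE];
};

static unsigned bht_index(uint32_t branch_addr)
{
    return branch_addr & (BHT_SIZE - 1u);
}

/* Returns 1 if the branch is predicted taken (counter is 2 or 3). */
int bht_predict(const struct bht *t, uint32_t branch_addr)
{
    assert(t != 0);
    return t->counter[bht_index(branch_addr)] >= 2;
}

/* Updates the counter with the actual outcome: count up when taken,
 * count down when untaken, saturating at 3 and 0. */
void bht_update(struct bht *t, uint32_t branch_addr, int taken)
{
    uint8_t *c = &t->counter[bht_index(branch_addr)];

    if (taken) {
        if (*c < 3) {
            (*c)++;
        }
    } else {
        if (*c > 0) {
            (*c)--;
        }
    }
}
```

Because of the saturation, a single untaken outcome after a long run of taken outcomes does not flip the prediction, which is the practical advantage over the 1-bit scheme. Branches whose addresses share the same lower bits share one counter.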
Control hazard Resolve 3: delayed branch
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2, 40    (if ($s1 == $s2) { goto 40 })
Inserted instruction
or $s3,$s4,$t2     ($s3 = $s4 | $t2)
[Figure: the four instructions overlapped in the pipeline (IF ID EX MA WB) while the PC advances through 11, 12, and then 13 or 40.]
An instruction that has no dependency is inserted after the branch.
After the inserted instruction, the instruction at the determined address (PC=13 or PC=40) is executed.
Exploiting ILP (Instruction Level Parallelism)
Superscalar: issues multiple instructions per cycle with hardware support. Advantage: binary compatibility.
VLIW: issues multiple instructions per cycle with compiler support. Advantage: simple hardware.
Types of data dependence
True data dependence (RAW: Read After Write): difficult to remove.
i1: r2=r1+r3
i2: r4=r2+1
Anti-dependence (WAR: Write After Read) and output dependence (WAW: Write After Write): can be removed by register renaming. They are called artificial dependences.
i1: r1=r2+r3
i2: r2=r4+1    (anti-dependence on i1 through r2)
i3: r1=r4+2    (output dependence on i1 through r1)
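The three dependence types can be detected by comparing register numbers. The following is a small sketch; the `instr` layout, the `DEP_*` flags, and the use of -1 for "no register" (for example an immediate operand) are assumptions introduced for illustration.

```c
#include <assert.h>

/* A three-operand instruction: dst = src1 op src2.
 * A source of -1 means "no register" (e.g. an immediate). */
struct instr {
    int dst;
    int src1;
    int src2;
};

enum {
    DEP_RAW = 1,   /* true dependence   */
    DEP_WAR = 2,   /* anti-dependence   */
    DEP_WAW = 4    /* output dependence */
};

/* Returns a bitmask of the dependences that `later` has on `earlier`. */
int classify_dependence(struct instr earlier, struct instr later)
{
    int mask = 0;

    assert(earlier.dst >= 0 && later.dst >= 0);
    if (later.src1 == earlier.dst || later.src2 == earlier.dst) {
        mask |= DEP_RAW;   /* later reads what earlier writes */
    }
    if (later.dst == earlier.src1 || later.dst == earlier.src2) {
        mask |= DEP_WAR;   /* later overwrites what earlier reads */
    }
    if (later.dst == earlier.dst) {
        mask |= DEP_WAW;   /* both write the same register */
    }
    return mask;
}
```

For the slide's examples, `r2=r1+r3` followed by `r4=r2+1` yields `DEP_RAW`, while `r1=r2+r3` followed by `r2=r4+1` yields `DEP_WAR` and followed by `r1=r4+2` yields `DEP_WAW`.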
Basic Architecture of Superscalar Processor
[Figure: block diagram of a superscalar processor divided into Frontend, Ex-core, and Backend. Blocks: instruction cache, branch prediction, instruction decode and register renaming, instruction window, function units, data cache, registers, reorder buffer. The transitions between blocks are labeled dispatch, issue, and commit.]
Basic function of Frontend
provides enough instructions.
predicts the next instruction address if a branch instruction appears.
resolves artificial dependences by register renaming.
analyzes true data dependences after register renaming.
transfers instructions after the above operations. This operation is called “dispatch”.
Basic function of Ex-core
finds as many independent instructions as possible among those stored in the “instruction window”. In this operation, dynamic scheduling is performed to satisfy several restrictions: data dependence, resources, predefined priority, etc.
executes independent instructions in parallel. An operation that transfers an instruction to a function unit is called “issue”.
Basic function of Backend
updates the processor state.
Results obtained out of order are reordered into program order.
The processor state is updated precisely.
Updating the processor state based on the execution result is called “commit”. The removal of an instruction from the pipeline is called “retire”.
Dynamic instruction scheduling
Instruction scheduling means to determine issuing order of instructions and when the instructions are issued.
In superscalar processors, dynamic instruction scheduling is performed using instructions stored in the instruction buffer.
In the following slides, dynamic scheduling will be explained using several types of processors: a 1-way in-order processor, an i-way in-order processor, and an i-way out-of-order processor.
1-way in-order issue
The number of instructions issued per cycle is at most 1.
The size of the instruction window is 1, because no subsequent instruction can be issued if an instruction cannot be issued.
Only true and output dependences need to be checked, because anti-dependence is always resolved.
Control by R flag
The R flag is used to check true and output dependences.
[Figure: an instruction (op, dst, src1, src2) whose register numbers select entries of the register file; each register entry holds an R flag and a value.]
The instruction is issued only when R(dst) == true && R(src1) == true && R(src2) == true. (This condition is called “ready”.)
R == false means the register is reserved but the result has not been stored yet. In this case, the operand is not available.
Update sequence of the R flag
The R bit of the destination becomes false when an instruction is issued.
The R bit of the destination becomes true when a result is stored in the destination.
By the above update,
• Instructions using an unavailable register as a source register are not issued; true dependence is resolved.
• Instructions using an unavailable register as a destination register are not issued; output dependence is resolved.
In practice, resource restrictions must also be satisfied to issue instructions, in addition to the dependency check. In this lecture, only the restriction on function units is considered, to simplify the discussion.
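The R flag rules above can be sketched as follows. The register count `NUM_REGS`, the type names, and the functions `is_ready`, `issue`, and `store_result` are hypothetical; function unit availability is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 8   /* assumed register count for the sketch */

struct regfile {
    bool ready[NUM_REGS];   /* R flag of each register */
    int value[NUM_REGS];
};

/* dst = src1 op src2; an instruction with one source repeats it in src2. */
struct instr {
    int dst;
    int src1;
    int src2;
};

/* "Ready" condition: R(dst) && R(src1) && R(src2). */
bool is_ready(const struct regfile *rf, struct instr in)
{
    return rf->ready[in.dst] && rf->ready[in.src1] && rf->ready[in.src2];
}

/* Issues the instruction if it is ready; R(dst) becomes false on issue.
 * Returns false and changes nothing when the instruction is not ready. */
bool issue(struct regfile *rf, struct instr in)
{
    if (!is_ready(rf, in)) {
        return false;
    }
    rf->ready[in.dst] = false;
    return true;
}

/* Stores a result; R(dst) becomes true again. */
void store_result(struct regfile *rf, int dst, int result)
{
    assert(dst >= 0 && dst < NUM_REGS);
    rf->value[dst] = result;
    rf->ready[dst] = true;
}
```

After `r1 = r5` has been issued, both `r2 = r1 + 1` (true dependence) and another write to r1 (output dependence) are refused by `issue` until `store_result` has been called for r1.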
i-way in-order issue
We consider how the following 4 instructions are executed on this processor.
i1: r1 = r5
i2: r2 = r1 + 1
i3: r3 = r6
i4: r4 = r3 + 1
In-order scheduling
Cycle   Function Unit 0     Function Unit 1
0       i1: r1 = r5
1       i2: r2 = r1 + 1     i3: r3 = r6
2       i4: r4 = r3 + 1
IPC becomes 1.3 (4 instructions / 3 cycles).
How to check dependency of instructions?
True and output dependence must be checked.
[Figure: an instruction window holding Instruction 0 to Instruction i-1, connected to the register file (R flag and value per register) through 3 × i register-number lines.]
How to allocate resources (function units)?
Allocation of $R_0$ is performed as follows.
Check whether any of the preceding ready instructions refers to $R_0$ or not. If no instruction refers to $R_0$, the function unit is available.
Repeat the above procedure from $R_1$ to $R_{r-1}$, where $r$ means the number of function units.
Complexity of i-way in-order issue
Ready detection
$3i$ ports are required.
$\sum_{k=1}^{i} 3(k-1) = \frac{3}{2} i(i-1)$ comparators are required for the check of operand dependency.
Resource allocation
An $(i-1)$-input NOR gate is required.
Complexity increases by $O(i) \sim O(i^2)$.
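The comparator count for the i-way in-order case can be checked with a short derivation. This is a sketch under an assumption that is not spelled out on the slide: the k-th instruction in the window is compared with each of its k-1 preceding instructions, and each pair needs 3 comparisons (two sources and one destination against the preceding destination, for true and output dependence).

```latex
\sum_{k=1}^{i} 3(k-1) \;=\; 3\sum_{j=0}^{i-1} j \;=\; 3\cdot\frac{(i-1)\,i}{2} \;=\; \frac{3}{2}\,i(i-1)
```

For example, $i = 4$ gives 18 comparators, so the comparator count grows as $O(i^2)$ while the port count $3i$ grows as $O(i)$.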
i-way out-of-order issue
Out-of-order scheduling of the same code used in the previous i-way in-order case.
i1: r1 = r5
i2: r2 = r1 + 1
i3: r3 = r6
i4: r4 = r3 + 1
Out-of-order scheduling
Cycle   Function Unit 0     Function Unit 1
0       i1: r1 = r5         i3: r3 = r6
1       i2: r2 = r1 + 1     i4: r4 = r3 + 1
IPC becomes 2.0 (4 instructions / 2 cycles).
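The difference between the two schedules can be reproduced with a tiny scheduler model. This is a sketch under simplifying assumptions that are not from the lecture: every instruction has a latency of one cycle, each instruction depends on at most one earlier instruction, and the whole code fits in the instruction window. The function name and the `dep` encoding are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INSTR 16

/* dep[k] is the index of the earlier instruction whose result
 * instruction k needs, or -1 if it needs none.
 * Returns the number of cycles used to issue all n instructions
 * on `units` function units. */
int schedule_cycles(const int *dep, int n, int units, bool out_of_order)
{
    int issue_cycle[MAX_INSTR];   /* cycle of issue, -1 if not yet issued */
    int issued = 0;
    int cycle = 0;

    assert(n > 0 && n <= MAX_INSTR && units > 0);
    for (int k = 0; k < n; k++) {
        assert(dep[k] < k);       /* may only depend on an earlier instruction */
        issue_cycle[k] = -1;
    }

    while (issued < n) {
        int used = 0;             /* function units used in this cycle */
        for (int k = 0; k < n && used < units; k++) {
            if (issue_cycle[k] >= 0) {
                continue;         /* already issued */
            }
            /* Ready if the producer was issued in an earlier cycle. */
            bool ready = dep[k] < 0 ||
                         (issue_cycle[dep[k]] >= 0 && issue_cycle[dep[k]] < cycle);
            if (ready) {
                issue_cycle[k] = cycle;
                used++;
                issued++;
            } else if (!out_of_order) {
                break;            /* in-order: a blocked instruction blocks all later ones */
            }
        }
        cycle++;
    }
    return cycle;
}
```

For the code i1 to i4 (`dep = {-1, 0, -1, 2}`) with 2 function units, the in-order mode needs 3 cycles and the out-of-order mode needs 2 cycles, matching the two tables.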
Architectural requirements for out-of-order execution
The depth of the instruction window should be increased to $n$.
The number of register ports must be $3n$ for the check of dependence.
Anti-dependence must be checked, in addition to the checks of the i-way in-order case.
Resource allocation can be performed in the same way as in the i-way in-order case.
Complexity of i-way out-of-order issue
Ready detection
$3n$ ports are required.
$\sum_{k=1}^{n} 5(k-1) = \frac{5}{2} n(n-1)$ comparators are required for the check of operand dependency.
Resource allocation
An $(n-1)$-input NOR gate is required.
Complexity increases by $O(n) \sim O(n^2)$.
The increase of hardware complexity is more significant than in the in-order case because n >> i in general.
Tomasulo’s Algorithm
was proposed by R. M. Tomasulo in 1967.
was originally adopted in the floating-point unit of the IBM 360/91.
Performance was drastically improved.
Similar algorithms are used in the latest microprocessors.
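Superscalar arch using Tomasulo
[Figure: block diagram divided into Frontend and Ex-core. Blocks: instruction cache, branch prediction, instruction decode and tag allocation, reservation stations, function units, registers, data cache. Instructions are dispatched to the reservation stations and issued from them to the function units.]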
Contents of reservation station and register
Register
The tag is used for register renaming.
Fields: R | tag | value
Reservation station
op: opcode
dtag: destination tag
stag: source tag
R: ready flag
value: operand’s value
Fields: op | dtag | Source 1 (R, stag, value) | Source 2 (R, stag, value)
Operation on the arch
Dispatch
Issue
Execution
Finalization
Operation on the arch Dispatch
A dtag is assigned to the destination operand from the tag pool, which holds unassigned tags.
Source operands are obtained by reading the registers using each register number. If R is true, the value is read; otherwise, the tag is read from the register.
Then, the instruction is stored in the reservation station corresponding to the function unit used by the instruction.
Operation on the arch Issue
A ready instruction in a reservation station is issued to the corresponding function unit, if the function unit is available.
The issued instruction is deleted from the reservation station.
Operation on the arch Execution
Issued instructions are executed on the corresponding function units.
Operation on the arch Finalize
Based on the result of execution, the dtag and the result value are broadcast on the result bus.
If there is an instruction that holds the broadcast dtag as an stag, the R flag and the value of that operand are replaced by true and the broadcast result value, respectively.
Only when there is a register holding a tag corresponding to the broadcast dtag, the broadcast result is stored in the register.
Finally, the broadcast tag is returned to the tag pool.
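The Finalize step can be sketched as a broadcast over the reservation stations and the registers. The structure layouts follow the fields listed in the slides (R, tag, value), but the sizes `NUM_RS` and `NUM_REGS`, the `busy` flag, and the function name `finalize` are assumptions; returning the tag to the tag pool is left out of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 4   /* assumed sizes for the sketch */
#define NUM_RS   4

struct operand {
    bool ready;      /* R flag */
    int stag;        /* source tag, meaningful while ready == false */
    int value;
};

struct rs_entry {
    bool busy;       /* entry holds an instruction */
    int dtag;
    struct operand src[2];
};

struct reg {
    bool ready;      /* R flag */
    int tag;         /* tag of the pending producer while ready == false */
    int value;
};

/* Broadcasts (dtag, result) on the result bus.
 * Returns the number of source operands that became ready. */
int finalize(struct rs_entry rs[NUM_RS], struct reg regs[NUM_REGS],
             int dtag, int result)
{
    int woken = 0;

    assert(dtag >= 0);
    /* Waiting operands whose stag matches capture the value. */
    for (int e = 0; e < NUM_RS; e++) {
        if (!rs[e].busy) {
            continue;
        }
        for (int s = 0; s < 2; s++) {
            struct operand *op = &rs[e].src[s];
            if (!op->ready && op->stag == dtag) {
                op->ready = true;
                op->value = result;
                woken++;
            }
        }
    }
    /* Only a register still holding this tag receives the result. */
    for (int r = 0; r < NUM_REGS; r++) {
        if (!regs[r].ready && regs[r].tag == dtag) {
            regs[r].ready = true;
            regs[r].value = result;
        }
    }
    return woken;
}
```

A register that has been renamed to a newer tag in the meantime is not updated by an older broadcast, which is how the output dependence is resolved.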
An example of Tomasulo
A superscalar processor used in this example has the following 5-stage pipeline, and the number of ways is 2.
IF: fetches 2 instructions.
ID: decodes, allocates tags, and dispatches.
RS: waits for operands until an instruction becomes ready.
EX: executes an instruction.
WB: writes a result.
Code traced in the following cycles:
i1: r1 = load A
i2: r2 = r1 + r3
i3: r4 = r2 + 1
i4: r2 = load B
# A and B are const
Cycle 0
Reservation stations (op | Destination: R dtag val | Source 1: R stag val | Source 2: R stag val)
(no entries)
State of instructions
Instruction          Stage
i1: r1 = load A      -
i2: r2 = r1 + r3     -
i3: r4 = r2 + 1      -
i4: r2 = load B      -
Registers
#  R  tag  Val
1  1  X    2
2  1  X    4
3  1  X    7
4  1  X    9
Tag pool (next tag to allocate first): 50, 51, 52, 53, 54, ・・・, 30
Cycle 1
Reservation stations (op | Destination: R dtag val | Source 1: R stag val | Source 2: R stag val)
(no entries)
State of instructions
Instruction          Stage
i1: r1 = load A      IF
i2: r2 = r1 + r3     IF
i3: r4 = r2 + 1      -
i4: r2 = load B      -
Registers
#  R  tag  Val
1  1  X    2
2  1  X    4
3  1  X    7
4  1  X    9
Tag pool (next tag to allocate first): 50, 51, 52, 53, 54, ・・・, 30
Cycle 2
Reservation stations
op     Destination      Source 1         Source 2
       R  dtag  val     R  stag  val     R  stag  val
load   0  50    X       1  X     A       1  X     0
add    0  51    X       0  50    X       1  X     7
State of instructions
Instruction          Stage
i1: r1 = load A      ID
i2: r2 = r1 + r3     ID
i3: r4 = r2 + 1      IF
i4: r2 = load B      IF
Registers
#  R  tag  Val
1  0  50   X
2  0  51   X
3  1  X    7
4  1  X    9
Tag pool (next tag to allocate first): 52, 53, 54, ・・・, 30
Cycle 3
Reservation stations
op     Destination      Source 1         Source 2
       R  dtag  val     R  stag  val     R  stag  val
load   1  50    15      1  X     A       1  X     0
add    0  51    X       0  50    X       1  X     7
add    0  52    X
load   0  53    X
State of instructions
Instruction          Stage
i1: r1 = load A      EX
i2: r2 = r1 + r3     RS
i3: r4 = r2 + 1      ID
i4: r2 = load B      ID
Registers
#  R  tag  Val
1  0  50   X
2  0  53   X
3  1  X    7
4  0  52   X
Tag pool (next tag to allocate first): 54, ・・・, 30
Cycle 4
Reservation stations
op     Destination      Source 1         Source 2
       R  dtag  val     R  stag  val     R  stag  val
load   1  50    15      1  X     A       1  X     0
add    1  51    22      1  50    15      1  X     7
add    0  52    X       0  51    X       1  X     1
load   1  53    16      1  X     B       1  X     0
State of instructions
Instruction          Stage
i1: r1 = load A      WB
i2: r2 = r1 + r3     EX
i3: r4 = r2 + 1      RS
i4: r2 = load B      EX
Registers
#  R  tag  Val
1  1  X    15
2  0  53   X
3  1  X    7
4  0  52   X
Tag pool (next tag to allocate first): 54, ・・・, 30, 50
Cycle 5
Reservation stations
op     Destination      Source 1         Source 2
       R  dtag  val     R  stag  val     R  stag  val
add    1  51    22      1  50    15      1  X     7
add    1  52    23      1  51    22      1  X     1
load   1  53    16      1  X     B       1  X     0
State of instructions
Instruction          Stage
i1: r1 = load A      -
i2: r2 = r1 + r3     WB
i3: r4 = r2 + 1      EX
i4: r2 = load B      WB
Registers
#  R  tag  Val
1  1  X    15
2  1  X    16
3  1  X    7
4  0  52   X
Tag pool (next tag to allocate first): 54, ・・・, 30, 50, 51, 53
Cycle 6
Reservation stations
op     Destination      Source 1         Source 2
       R  dtag  val     R  stag  val     R  stag  val
add    1  52    23      1  51    22      1  X     1
State of instructions
Instruction          Stage
i1: r1 = load A      -
i2: r2 = r1 + r3     -
i3: r4 = r2 + 1      WB
i4: r2 = load B      -
Registers
#  R  tag  Val
1  1  X    15
2  1  X    16
3  1  X    7
4  1  X    23
Tag pool (next tag to allocate first): 54, ・・・, 30, 50, 51, 53, 52
Problem of out-of-order execution
It is difficult to update the processor state precisely if an exception occurs.
In the listings below, "Fin" marks a finished instruction and "E" marks the instruction that raised the exception.
In-order execution
Fin  i0: ・・・・
Fin  i1: ・・・・
Fin  i2: r1=load r1
E    i3: r2=load r3
     i4: ・・・・
     i5: r3 = r4 << r2
     i6: ・・・・
Out-of-order execution
Fin  i0: ・・・・
     i1: ・・・・
Fin  i2: r1=load r1
E    i3: r2=load r3
     i4: ・・・・
Fin  i5: r3 = r4 << r2
     i6: ・・・・
Flow of exception handling
Unfinished instructions, including the instruction that caused the exception, are invalidated.
Control is transferred to the OS, which saves the current state to main memory and handles the exception.
After the exception has been handled, the CPU executes the instruction that caused the exception again.
In-order execution
• Save the current state.
• OS handles the exception.
• CPU restarts from i3.
Out-of-order execution
• Save the current state.
• i5 has finished before i3.
• i1 has not finished.
• The data of r3 has been lost.
• OS handles the exception.
CPU cannot restart from i3.
Reorder buffer is used for precise exception handling.
Reorder buffer
Updates the CPU’s state in the original program order by reordering results.
Handles exceptions at the state update.
[Figure: results and information about exceptions enter the Reorder Buffer. At commit, results are stored to the Registers in the original program order and exceptions are detected.]
Superscalar arch using Tomasulo and reorder buffer
[Figure: block diagram divided into Frontend, Ex-core, and Backend. Blocks: instruction cache, branch prediction, instruction decode and tag allocation, reservation stations, function units, data cache, reorder buffer, registers. The transitions between blocks are labeled dispatch, issue, and commit.]
Behaviour of reorder buffer
If there is a result without an exception, it is stored to a register and the entry corresponding to it is removed.
If there is a result with an exception, the pipeline and the reorder buffer are cleared.
If a result is not stored yet, the reorder buffer waits until the result is obtained.
Contents of reorder buffer
PC: instruction address
R: ready flag
dreg: register number of destination
dtag: operand tag of destination
E: exception flag
result: result
Fields: PC | R | dreg | dtag | E | result
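The commit behaviour of the reorder buffer can be sketched with the fields listed above. The circular-buffer layout (`head`, `count`), the sizes, the status codes, and the function name `rob_commit` are assumptions for illustration; flushing the pipeline on an exception is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_SIZE 8   /* assumed sizes for the sketch */
#define NUM_REGS 8

struct rob_entry {
    unsigned pc;      /* PC: instruction address */
    bool ready;       /* R: result has been written */
    int dreg;         /* destination register number */
    int dtag;         /* destination operand tag */
    bool exception;   /* E: exception flag */
    int result;
};

struct rob {
    struct rob_entry e[ROB_SIZE];
    int head;         /* oldest entry in program order */
    int count;        /* number of valid entries */
};

enum commit_status { COMMIT_DONE, COMMIT_WAIT, COMMIT_EXCEPTION, COMMIT_EMPTY };

/* Tries to commit the oldest entry.
 * - result without exception: store to the register, remove the entry
 * - result with exception: report the PC and clear the buffer
 * - result not yet available: wait */
enum commit_status rob_commit(struct rob *rb, int regs[NUM_REGS], unsigned *fault_pc)
{
    struct rob_entry *h;

    if (rb->count == 0) {
        return COMMIT_EMPTY;
    }
    h = &rb->e[rb->head];
    if (!h->ready) {
        return COMMIT_WAIT;
    }
    if (h->exception) {
        *fault_pc = h->pc;
        rb->count = 0;            /* discard all younger entries */
        return COMMIT_EXCEPTION;
    }
    assert(h->dreg >= 0 && h->dreg < NUM_REGS);
    regs[h->dreg] = h->result;
    rb->head = (rb->head + 1) % ROB_SIZE;
    rb->count--;
    return COMMIT_DONE;
}
```

Because only the oldest entry is ever committed, a younger instruction that finished early cannot change the registers before an older instruction raises its exception.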
Operand bypass and supply of source operand tag
Tomasulo: operand values are obtained from the registers, which have the latest values.
Reorder buffer: the latest values are stored in the reorder buffer (not in the registers).
Procedure of obtaining operands:
Check the dependency on instructions decoded concurrently. If there is a dependency, stag becomes the dtag of the instruction depended on.
Otherwise, the reorder buffer is searched by source register number to obtain the value (when R=1) or the tag (when R=0). If the reorder buffer has neither a value nor a tag corresponding to the register number, the value is obtained from the registers.
An example of reorder buffer
A superscalar processor used in this example has the following 6-stage pipeline, and the number of ways is 2.
IF: fetches 2 instructions.
ID: decodes, allocates tags, and dispatches.
RS: waits for operands until an instruction becomes ready.
EX: executes an instruction.
WB: writes results to the reorder buffer.
RT: writes results to registers.
A code used in the example (the hexadecimal number is the address of the instruction)
i1: 0x40: r1 = load A (r0)
i2: 0x44: r2 = r1 + r3
i3: 0x48: r2 = r2 + 16
i4: 0x4C: r5 = load 0 (r1)
i5: 0x50: r1 = r1 + 1
i6: 0x54: r2 = load 0 (r2)
Cycle 0
Reservation stations (op | Destination: E dtag val | Source 1: R stag val | Source 2: R stag val)
(no entries)
State of instructions
Instruction            Stage
i1: r1=load A(r0)      -
i2: r2=r1+r3           -
i3: r2=r2+16           -
i4: r5=load 0(r1)      -
i5: r1=r1+1            -
i6: r2=load 0(r2)      -
Reorder buffer
pointer    entry  PC  R  dreg  dtag  E  result
Head/Tail  20
           21
           22
           23
           24
           25
Cycle 1
Reservation stations (op | Destination: E dtag val | Source 1: R stag val | Source 2: R stag val)
(no entries)
State of instructions
Instruction            Stage
i1: r1=load A(r0)      IF
i2: r2=r1+r3           IF
i3: r2=r2+16           -
i4: r5=load 0(r1)      -
i5: r1=r1+1            -
i6: r2=load 0(r2)      -
Reorder buffer
pointer    entry  PC  R  dreg  dtag  E  result
Head/Tail  20
           21
           22
           23
           24
           25
Cycle 2
Reservation stations
op     Destination      Source 1         Source 2
       E  dtag  val     R  stag  val     R  stag  val
load   X  20    X       1  X     A       1  X     0
add    X  21    X       0  20    X       1  X     7
State of instructions
Instruction            Stage
i1: r1=load A(r0)      ID
i2: r2=r1+r3           ID
i3: r2=r2+16           IF
i4: r5=load 0(r1)      IF
i5: r1=r1+1            -
i6: r2=load 0(r2)      -
Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  0  1     20    X  X
         21     44  0  2     21    X  X
Tail     22
         23
         24
         25
Cycle 3
Reservation stations
op     Destination      Source 1         Source 2
       E  dtag  val     R  stag  val     R  stag  val
load   0  20    15      1  X     A       1  X     0
add    X  21    X       0  20    X       1  X     7
add    X  22    X       0  21    X       1  X     16
load   X  23    X       1  X     0       0  20    X
State of instructions
Instruction            Stage
i1: r1=load A(r0)      EX
i2: r2=r1+r3           RS
i3: r2=r2+16           ID
i4: r5=load 0(r1)      ID
i5: r1=r1+1            IF
i6: r2=load 0(r2)      IF
Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  0  1     20    X  X
         21     44  0  2     21    X  X
         22     48  0  2     22    X  X
         23     4C  0  5     23    X  X
Tail     24
         25
Cycle 4
Reservation stations
op     Destination      Source 1         Source 2
       E  dtag  val     R  stag  val     R  stag  val
load   0  20    15      1  X     A       1  X     0
add    0  21    22      1  20    15      1  X     7
add    X  22    X       0  21    X       1  X     16
load   1  23    ?       1  X     0       1  20    15
add    X  24    X       1  X     15      1  X     1
load   X  25    X       1  X     0       0  22    X
State of instructions
Instruction            Stage
i1: r1=load A(r0)      WB
i2: r2=r1+r3           EX
i3: r2=r2+16           RS
i4: r5=load 0(r1)      EX
i5: r1=r1+1            ID
i6: r2=load 0(r2)      ID
Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  1  1     20    0  15
         21     44  0  2     21    X  X
         22     48  0  2     22    X  X
         23     4C  0  5     23    X  X
         24     50  0  1     24    X  X
         25     54  0  2     25    X  X
Tail     (after entry 25)
Cycle 5
Reservation stations
op     Destination      Source 1         Source 2
       E  dtag  val     R  stag  val     R  stag  val
add    0  21    22      1  20    15      1  X     7
add    0  22    38      1  21    22      1  X     16
load   1  23    ?       1  X     0       1  20    15
add    0  24    16      1  X     15      1  X     1
load   X  25    X       1  X     0       0  22    X
State of instructions
Instruction            Stage
i1: r1=load A(r0)      RT
i2: r2=r1+r3           WB
i3: r2=r2+16           EX
i4: r5=load 0(r1)      WB
i5: r1=r1+1            EX
i6: r2=load 0(r2)      RS
Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
         20     40  1  1     20    0  15
Head     21     44  1  2     21    0  22
         22     48  0  2     22    X  X
         23     4C  1  5     23    1  ?
         24     50  0  1     24    X  X
         25     54  0  2     25    X  X
Tail     (after entry 25)
Cycle 6
Reservation stations
op     Destination      Source 1         Source 2
       E  dtag  val     R  stag  val     R  stag  val
add    0  22    38      1  21    22      1  X     16
add    0  24    16      1  X     15      1  X     1
load   0  25    4       1  X     0       1  22    38
State of instructions
Instruction            Stage
i1: r1=load A(r0)      -
i2: r2=r1+r3           RT
i3: r2=r2+16           WB
i4: r5=load 0(r1)      RT
i5: r1=r1+1            WB
i6: r2=load 0(r2)      EX
Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
         20     40  1  1     20    0  15
         21     44  1  2     21    0  22
Head     22     48  1  2     22    0  38
         23     4C  1  5     23    1  ?
         24     50  1  1     24    0  16
         25     54  0  2     25    X  X
Tail     (after entry 25)
Cycle 7
Reservation stations
op     Destination      Source 1         Source 2
       E  dtag  val     R  stag  val     R  stag  val
load   0  25    4       1  X     0       1  22    38
State of instructions
Instruction            Stage
i1: r1=load A(r0)      -
i2: r2=r1+r3           -
i3: r2=r2+16           RT
i4: r5=load 0(r1)      RT
i5: r1=r1+1            RT
i6: r2=load 0(r2)      WB
Reorder buffer
pointer    entry  PC  R  dreg  dtag  E  result
           20     40  1  1     20    0  15
           21     44  1  2     21    0  22
           22     48  1  2     22    0  38
Head/Tail  23     4C  1  5     23    1  ?
           24     50  1  1     24    0  16
           25     54  0  2     25    X  X
The exception is detected at entry 23.
VLIW (Very Long Instruction Word)
In the VLIW processor, the compiler extracts the parallelism in a code. Therefore, the special hardware support used in the superscalar processor becomes unnecessary.
Superscalar: dynamic scheduling by hardware support
VLIW: static scheduling by compiler
Overview of VLIW
[Figure: two flows from source code main(){ ・・・ } to execution. Superscalar: the compiler performs code generation (add, sub, ・・・), and the processor performs scheduling and execution. VLIW: the compiler performs code generation and scheduling (add sub load / add mul load ・・・), and the processor performs execution only.]
VLIW code
Sequential code
i1: r3=r4+1
i2: r1=load(r2)
i3: r1=r1<<r3
i4: r5=r2+r6
i5: beq r5,L
VLIW code
ALU              ALU              MEM                Branch
i1: r3=r4+1      i4: r5=r2+r6     i2: r1=load(r2)    nop
i3: r1=r1<<r3    nop              nop                i5: beq r5,L
・・・・・
Hardware organization of VLIW
[Figure: block diagram with an instruction cache, four function units (ALU, ALU, MEM, Branch), registers, and a data cache.]
VLIW vs Superscalar
                        Superscalar    VLIW
Hardware size           Large          Small
Hardware complexity     Large          Small
Scheduling algorithm    Poor           Rich
Instruction window      Small          Large
Binary compatibility    Compatible     Not compatible
Dynamic vs Static scheduling
Sample code
i1: r1=load A
i2: r2=load(r1)
i3: r3=load B
i4: r4=r3<<r2
i5: r5=r4+1
i6: r6=r2+r5
[Figure: data dependency graph of the code, with nodes i1 to i6.]
Dynamic scheduling
Cycle   ALU   MEM
0             i1
1             i2
2             i3
3       i4
4       i5
5       i6
Optimal scheduling
Cycle   ALU   MEM
0             i3
1       i4    i1
2       i5    i2
3       i6
Advantage of dynamic scheduling
Scheduling based on information that can only be obtained at run time. For example, a cache miss can be concealed.
Scheduling based on accurate memory dependences. Data addresses, which can be obtained only at run time, improve scheduling performance.
Taxonomy of scheduling algorithm
Local scheduling / Global scheduling
Cyclic scheduling / Acyclic scheduling
Trace-based scheduling / DAG-based (directed acyclic graph) scheduling
VLIW-based commercial processors
Transmeta Crusoe: aimed at mobile computing
Texas Instruments TMS320C6x series: embedded applications
Intel Itanium
Parallel operation by SIMD
What is SIMD?: SIMD (Single Instruction Multiple Data) means that the same operation is applied to several operands at once.
Ex: Addition
int i;
int a[4]={1,2,3,4};
int b[4]={5,6,7,8};
int c[4];
for (i=0;i<4;i++){
    c[i]=a[i]+b[i];
}
Sequential operation
c[0]=a[0]+b[0]
c[1]=a[1]+b[1]
c[2]=a[2]+b[2]
c[3]=a[3]+b[3]
SIMD
[Figure: the elements a[0] to a[3] and b[0] to b[3] are added element by element into c[0] to c[3] in a single operation.]
SIMD data types (Cell/B.E.)
vector unsigned char         16 unsigned 8-bit values
vector signed char           16 signed 8-bit values
vector unsigned short        8 unsigned 16-bit values
vector signed short          8 signed 16-bit values
vector unsigned int          4 unsigned 32-bit values
vector signed int            4 signed 32-bit values
vector unsigned long long    2 unsigned 64-bit values
vector signed long long      2 signed 64-bit values
vector float                 4 32-bit floating-point values
vector double                2 64-bit double (floating-point) values
Allocation of vector values
Vector values are allocated to memory in the big-endian style as shown in the following figure.
*This figure is adapted from cell.fixstars.com
How to access vector type via normal pointer
vector signed int va = (vector signed int) { 1, 2, 3, 4 };
int *a = (int *) &va;
*This figure is adapted from cell.fixstars.com
How to access a normal array from vector type
int a[8] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8 };
vector signed int *va = (vector signed int *) a;
*This figure is adapted from cell.fixstars.com
__attribute__((aligned(16))) forces scalar data to be 16-byte aligned.
SIMD operation on PPE
int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
int c[4] __attribute__((aligned(16)));
vector signed int *va = (vector signed int *) a;
vector signed int *vb = (vector signed int *) b;
vector signed int *vc = (vector signed int *) c;
*vc = vec_add(*va, *vb);
[Figure: the elements a[0] to a[3] and b[0] to b[3] are added element by element into c[0] to c[3] in a single operation.]
vec_add is a SIMD function provided by VMX (Vector Multimedia Extension) proposed by IBM and Motorola.
Entire code for vector addition
#include <stdio.h>
#include <altivec.h>

int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
int c[4] __attribute__((aligned(16)));

int main(int argc, char **argv)
{
    /* View the 16-byte aligned arrays as vectors of four 32-bit integers. */
    vector signed int *va = (vector signed int *) a;
    vector signed int *vb = (vector signed int *) b;
    vector signed int *vc = (vector signed int *) c;

    /* Add the four elements in a single operation. */
    *vc = vec_add(*va, *vb);

    printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
A part of VMX function
Arithmetic operation
vec_add(a,b)       a+b
vec_sub(a,b)       a-b
vec_madd(a,b,c)    a*b+c
Logical operation
vec_and(a,b)       Logical and
vec_or(a,b)        Logical or
Bit operation
vec_perm(a,b,c)    creating new vector from a[i] and b[i] based on c[i]
vec_sel(a,b,c)     selecting a[i] or b[i] based on c[i]
Branch
vec_cmpeq(a, b)    a[i]==b[i]
vec_cmpgt(a, b)    a[i]>b[i]
Type conversion
vec_ctf(a, b)      (float)a[i]/(2^b)
vec_ctu(a, b)      (unsigned int)(a[i]*(2^b))
Generating constant
vec_splat(a, b)    a[b]
vec_splat_s32(a)   signed a[i]
How to create dense vector data
In general, vector data is not densely stored. Therefore, dense vector data must be created before a vector operation.
vc = vec_perm(va, vb, vpat);
*This figure is adapted from cell.fixstars.com
Ex of vec_perm: Transpose
*These figures are adapted from cell.fixstars.com
Branch on SIMD
*These figures are adapted from cell.fixstars.com
Procedure of SIMD Branch
*These figures are adapted from cell.fixstars.com
Detail of SIMD Branch
vec_cmpgt()
vec_sel() *These figures are adapted from cell.fixstars.com
Ex of SIMD Branch
int a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
int b[16] = { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
int c[16];
int i;
for (i = 0; i < 16; i++) {
    if (a[i] > b[i]) {
        c[i] = a[i] - b[i];
    } else {
        c[i] = b[i] - a[i];
    }
}
Ex of SIMD Branch
int a[16] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
int b[16] __attribute__((aligned(16))) = { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
int c[16] __attribute__((aligned(16)));
vector signed int *va = (vector signed int *) a;
vector signed int *vb = (vector signed int *) b;
vector signed int *vc = (vector signed int *) c;
vector signed int vc_true, vc_false;
vector unsigned int vpat;
int i;
/* Each iteration processes four elements, so 16 elements need 4 iterations. */
for (i = 0; i < 4; i++) {
    /* All-ones mask in each element where a > b, zero elsewhere. */
    vpat = (vector unsigned int) vec_cmpgt(va[i], vb[i]);
    /* Compute both sides of the branch for every element. */
    vc_true = vec_sub(va[i], vb[i]);
    vc_false = vec_sub(vb[i], va[i]);
    /* Pick vc_true where the mask is set, vc_false otherwise. */
    vc[i] = vec_sel(vc_false, vc_true, vpat);
}
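The compare-and-select idea can also be tried in portable C without VMX. The following is a scalar emulation for illustration only, not VMX code: `cmpgt_mask` and `select_bits` are hypothetical helpers that mimic, for one element, what vec_cmpgt and vec_sel do for every element of a vector. It assumes inputs small enough that the subtractions do not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones mask when a > b, zero otherwise
 * (the per-element result of a "greater than" compare). */
static uint32_t cmpgt_mask(int32_t a, int32_t b)
{
    return a > b ? 0xFFFFFFFFu : 0u;
}

/* Takes each bit from on_true where the mask bit is 1,
 * and from on_false where the mask bit is 0. */
static uint32_t select_bits(uint32_t on_false, uint32_t on_true, uint32_t mask)
{
    return (on_false & ~mask) | (on_true & mask);
}

/* Computes c[i] = |a[i] - b[i]| without an if/else on the data:
 * both results are computed and the mask selects one of them. */
void abs_diff(const int32_t *a, const int32_t *b, int32_t *c, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        uint32_t mask = cmpgt_mask(a[i], b[i]);
        uint32_t when_true = (uint32_t)(a[i] - b[i]);
        uint32_t when_false = (uint32_t)(b[i] - a[i]);
        c[i] = (int32_t)select_bits(when_false, when_true, mask);
    }
}
```

The cost of this style is that both sides of the branch are always computed, which pays off only because the SIMD version handles several elements per operation.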