Real-time Signal Processing on Embedded Systems


Page 1: Real-time Signal Processing on Embedded Systems

Real-time Signal Processing on Embedded Systems

Advanced Cutting-edge Research Seminar I&III

Page 2: Real-time Signal Processing on Embedded Systems

Advances in Microprocessor Technology

Page 3: Real-time Signal Processing on Embedded Systems

Architectural improvements of microprocessors

• Pipelining
• Parallel processing exploiting ILP
  • Superscalar
  • VLIW
• SIMD

Page 4: Real-time Signal Processing on Embedded Systems

Procedure of instruction execution on a processor

• Instruction Fetch (IF): fetches an instruction from main memory.
• Instruction Decode (ID): decodes the fetched instruction.
• Execution (EX): executes the decoded instruction.
• Memory Access (MA): accesses main memory.
• Write Back (WB): writes data back to registers.

Page 5: Real-time Signal Processing on Embedded Systems

Operation cycles on a processor

Single-cycle machine: This kind of machine executes all procedures from IF to WB in one cycle. Operation speed is determined by the slowest instruction, because every instruction must be executed within a cycle.

Multi-cycle machine: This kind of machine executes an instruction over several cycles.

[Figure: one instruction passing through IF, ID, EX, MA, WB.]

Page 6: Real-time Signal Processing on Embedded Systems

Pipelined operation can improve the throughput of instructions.

[Figure: timing diagrams of instructions overlapped in the pipeline, each passing through IF, ID, EX, MA, WB.]

To realize pipelined operation, several techniques are required.
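The throughput gain can be illustrated with a simple cycle count. The following is a minimal sketch, assuming an ideal 5-stage pipeline with one cycle per stage and no hazards; the function names are hypothetical and not part of the lecture.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction occupies the processor for all stages before the next starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int = 5) -> int:
    """The first instruction needs n_stages cycles; each later one finishes one cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    for n in (1, 5, 100):
        print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

Under these assumptions the pipelined machine approaches one completed instruction per cycle as the instruction count grows.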

Page 7: Real-time Signal Processing on Embedded Systems

Causes of pipeline hazards

Structural hazard: The hardware cannot support the combination of instructions that are in the pipeline at the same time.

Data hazard: A later instruction must wait for an earlier instruction to complete because the later one uses the result of the earlier one.

Control hazard: Whether an instruction is executed or not depends on the result of an earlier instruction.

Page 8: Real-time Signal Processing on Embedded Systems

Structural hazard

[Figure: four instructions overlapped in the pipeline (IF ID EX MA WB), shown with a block diagram of the CPU (PC, instruction decoder, instruction register, ALU, registers) and memory.]

Page 12: Real-time Signal Processing on Embedded Systems

Structural hazard

[Figure: the pipeline and CPU diagram, annotated with an MA/IF conflict.]

MA/IF conflict: an instruction in the MA stage and a later instruction in the IF stage both need to access memory in the same cycle.

Page 13: Real-time Signal Processing on Embedded Systems

Structural hazard

[Figure: the pipeline and CPU diagram, with the last instruction shifted later.]

Resolution 1: stall the next instruction.

Page 15: Real-time Signal Processing on Embedded Systems

Structural hazard

[Figure: the pipeline and CPU diagram, annotated with the MA/IF conflict.]

Resolution 2: add another bus for accessing the instruction memory.

Page 16: Real-time Signal Processing on Embedded Systems

Structural hazard

[Figure: the pipeline and CPU diagram, now with a separate instruction memory (Inst Mem) and data memory (Data Mem).]

Resolution 2: add another bus for accessing the instruction memory. With separate instruction and data memories, this is the Harvard architecture.

Page 17: Real-time Signal Processing on Embedded Systems

Data hazard

add $s0,$t0,$t1    ($s0 = $t0 + $t1)
sub $t2,$s0,$t3    ($t2 = $s0 - $t3)

Registers (initial values)
t0 = 5   t1 = 4   t2 = 3   t3 = 2   t4 = 1
s0 = 0   s1 = 0   s2 = 0   s3 = 0   s4 = 0

[Figure: the two instructions overlapped in the pipeline (IF ID EX MA WB), shown with a block diagram of the CPU (PC, instruction decoder, instruction register, ALU, registers) and memory.]

The add computes $s0 = $t0 + $t1, but the sub reads $s0 before that result has been written back, so it uses the old value of $s0:
$t2 = $s0 - $t3
 -2 =  0  -  2

Waiting by stalls: consumes 3 cycles.

Resolution: forwarding. The result of the add is forwarded to the ALU:
$t2 = 9 - $t3
  7 = 9 - 2
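The effect of forwarding on the add/sub pair can be reproduced numerically. This is a minimal sketch, not a cycle-accurate model: it only assumes that, without forwarding or stalls, the sub reads $s0 before the add has written it back. The function name `run` and the dictionary layout are hypothetical.

```python
def run(forwarding: bool) -> dict:
    """Execute add $s0,$t0,$t1 followed by sub $t2,$s0,$t3 with the slide's register values."""
    regs = {"t0": 5, "t1": 4, "t2": 3, "t3": 2, "t4": 1,
            "s0": 0, "s1": 0, "s2": 0, "s3": 0, "s4": 0}

    # EX stage of add: the sum exists in the ALU but is not yet in the register file.
    add_result = regs["t0"] + regs["t1"]

    # EX stage of sub: with forwarding it receives the ALU result directly,
    # otherwise it sees the stale value still stored in $s0.
    s0_seen = add_result if forwarding else regs["s0"]
    sub_result = s0_seen - regs["t3"]

    # WB stages: both results finally reach the register file.
    regs["s0"] = add_result
    regs["t2"] = sub_result
    return regs


if __name__ == "__main__":
    print("without forwarding: t2 =", run(False)["t2"])
    print("with forwarding:    t2 =", run(True)["t2"])
```

The two results correspond to the wrong value (0 - 2) and the forwarded value (9 - 2) shown in the slides.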

Page 25: Real-time Signal Processing on Embedded Systems

Control hazard

An instruction sequence including a branch:

add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)

[Figure: the three instructions overlapped in the pipeline (IF ID EX MA WB), shown with the CPU block diagram (PC, instruction decoder, instruction register, ALU, registers). The PC advances from 10 to 11 to 12 as the instructions are fetched.]

Note: In this explanation, the PC holds a word address for simplicity.

The PC value for the next instruction depends on the branch condition.
• Branch is taken: PC = 40
• Branch is not taken: PC = 12

Page 29: Real-time Signal Processing on Embedded Systems

Control hazard
Resolution 1: stall

add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)

[Figure: pipeline diagram in which the instruction after the branch is delayed by a 2-cycle stall.]

The number of required stall cycles is determined by the architecture.

[Figure: the same pipeline diagram with a 1-cycle stall.]

The stall is reduced to 1 cycle if the processor can calculate the branch target address at the ID stage.

Page 31: Real-time Signal Processing on Embedded Systems

Control hazard
Resolution 2: branch prediction

add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)

[Figure: the three instructions overlapped in the pipeline, shown with the CPU block diagram. The PC advances from 10 to 11 to 12.]

In this example, the next PC is predicted as if the branch is never taken.

If the prediction misses, in other words if the branch is taken, a stall occurs and the PC is set to 40.

Page 35: Real-time Signal Processing on Embedded Systems

Control hazard
More practical scheme: dynamic branch prediction

n-bit counter-based prediction: the lower i bits of the address of a branch instruction select an entry of the Branch History Table, and each entry is an n-bit saturating up/down counter.

Page 36: Real-time Signal Processing on Embedded Systems

1-bit counter-based prediction

• State 1: predict branch will be taken
• State 0: predict branch will be untaken

The state becomes 1 when the branch is taken and 0 when the branch is untaken.

Page 37: Real-time Signal Processing on Embedded Systems

2-bit counter-based prediction

• States 11 and 10: predict branch will be taken
• States 01 and 00: predict branch will be untaken

The counter counts up when the branch is taken and down when the branch is untaken, saturating at 11 and 00.

This scheme is adopted in the Intel Pentium, Sun UltraSPARC, MIPS R10000, etc.
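The counter-based scheme can be sketched in a few lines. This is an illustrative model only: it assumes the usual convention that counter values 2 and 3 predict "taken", that all counters start at 0, and that the table is indexed by the lower bits of the branch address. The class name and the `index_bits` parameter are hypothetical.

```python
class TwoBitPredictor:
    """Branch History Table of 2-bit saturating up/down counters."""

    def __init__(self, index_bits: int = 4):
        self.mask = (1 << index_bits) - 1       # keeps the lower index_bits of the address
        self.table = [0] * (1 << index_bits)    # assumption: every counter starts at 00

    def predict(self, branch_address: int) -> bool:
        """Return True when the branch is predicted taken (counter is 10 or 11)."""
        return self.table[branch_address & self.mask] >= 2

    def update(self, branch_address: int, taken: bool) -> None:
        """Count up on a taken branch, down on an untaken one, saturating at 3 and 0."""
        i = branch_address & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


if __name__ == "__main__":
    predictor = TwoBitPredictor()
    for outcome in (True, True, True, False, True):
        print(predictor.predict(0x40), outcome)
        predictor.update(0x40, outcome)
```

Compared with a 1-bit counter, a single untaken outcome in a mostly taken branch does not flip the prediction, because the counter has to pass through two states.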

Page 38: Real-time Signal Processing on Embedded Systems

Control hazard
Resolution 3: delayed branch

add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
(inserted instruction)
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)

[Figure: the instructions overlapped in the pipeline, shown with the CPU block diagram. The PC advances from 11 to 12, and then to 13 or 40.]

An instruction that has no dependency is inserted after the branch. After the inserted instruction, the instruction at the determined address (PC = 13 or 40) is executed.

Page 41: Real-time Signal Processing on Embedded Systems

Exploiting ILP (Instruction Level Parallelism)

Superscalar: issues multiple instructions per cycle with hardware support. Advantage: binary compatibility.

VLIW: issues multiple instructions per cycle with compiler support. Advantage: simple hardware.

Page 42: Real-time Signal Processing on Embedded Systems

Types of data dependence

True data dependence (RAW: Read After Write): difficult to remove.
  i1: r2 = r1 + r3
  i2: r4 = r2 + 1

Anti-dependence (WAR: Write After Read) and output dependence (WAW: Write After Write): can be removed by register renaming. They are called artificial dependences.
  i1: r1 = r2 + r3
  i2: r2 = r4 + 1    (anti-dependence on i1 through r2)
  i3: r1 = r4 + 2    (output dependence on i1 through r1)
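The three dependence types can be checked mechanically from the destination and source registers of two instructions. The following is a small sketch; the tuple representation `(dst, srcs)` and the function name are assumptions made for illustration.

```python
def dependences(first, second):
    """Classify dependences from an earlier instruction to a later one.

    Each instruction is a (dst, srcs) pair: a destination register name
    and a set of source register names.
    """
    dst1, srcs1 = first
    dst2, srcs2 = second
    kinds = set()
    if dst1 in srcs2:
        kinds.add("RAW")   # true dependence: later reads what earlier writes
    if dst2 in srcs1:
        kinds.add("WAR")   # anti-dependence: later writes what earlier reads
    if dst1 == dst2:
        kinds.add("WAW")   # output dependence: both write the same register
    return kinds


if __name__ == "__main__":
    # i1: r1 = r2 + r3, i2: r2 = r4 + 1, i3: r1 = r4 + 2
    i1, i2, i3 = ("r1", {"r2", "r3"}), ("r2", {"r4"}), ("r1", {"r4"})
    print(dependences(i1, i2), dependences(i1, i3))
```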

Page 43: Real-time Signal Processing on Embedded Systems

Basic Architecture of Superscalar Processor

[Figure: block diagram of a superscalar processor divided into Frontend, Ex-core, and Backend. Blocks: instruction cache, instruction decode / register renaming, branch prediction, instruction window, function units, registers, data cache, reorder buffer. Labeled operations: dispatch, issue, commit.]

Page 44: Real-time Signal Processing on Embedded Systems

Basic function of Frontend

• provides enough instructions.
• predicts the next instruction address if a branch instruction appears.
• resolves artificial dependences by register renaming.
• analyzes true data dependence after register renaming.
• transfers instructions after the above operations. This operation is called “dispatch”.

Page 45: Real-time Signal Processing on Embedded Systems

Basic function of Ex-core

• finds as many independent instructions as possible among those stored in the “instruction window”. In this operation, dynamic scheduling is performed to satisfy several restrictions: data dependence, resources, predefined priority, etc.
• executes independent instructions in parallel. An operation that transfers an instruction to a function unit is called “issue”.

Page 46: Real-time Signal Processing on Embedded Systems

Basic function of Backend

• updates the processor state.
• Results obtained out of order are reordered to program order.
• The update of the processor state is performed precisely.
• Updating the processor state based on an execution result is called “commit”. Removal of an instruction from the processor is called “retire”.

Page 47: Real-time Signal Processing on Embedded Systems

Dynamic instruction scheduling

Instruction scheduling means determining the order in which instructions are issued and when they are issued.

In superscalar processors, dynamic instruction scheduling is performed using instructions stored in the instruction buffer.

In the following slides, dynamic scheduling will be explained using several types of processors: a 1-way in-order processor, an i-way in-order processor, and an i-way out-of-order processor.

Page 48: Real-time Signal Processing on Embedded Systems

1-way in-order issue

• The number of instructions issued per cycle is at most 1.
• The size of the instruction window is 1, because no subsequent instruction can be issued if an instruction cannot be issued.
• Only true and output dependences need to be checked, because anti-dependence is always resolved.

Page 49: Real-time Signal Processing on Embedded Systems

Control by R flag

The R flag is used to check true and output dependences.

[Figure: an instruction with fields op, dst, src1, src2; each register number selects a register entry holding an R flag and a value.]

The instruction is issued only when R(dst) == true && R(src1) == true && R(src2) == true. (This condition is called “ready”.)

R == false means the register is reserved but the result has not been stored yet. In this case, the operand is not available.

Page 50: Real-time Signal Processing on Embedded Systems

Update sequence of the R flag

• The R bit of the destination becomes false when an instruction is issued.
• The R bit of the destination becomes true when a result is stored in the destination.

By the above update,
• Instructions using unavailable registers as source registers are not issued; true dependence is resolved.
• Instructions using an unavailable register as a destination register are not issued; output dependence is resolved.

In practice, resource restrictions must also be satisfied to issue instructions, in addition to the dependency check. In this lecture, only the restriction on function units is considered, to simplify the discussion.
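The R-flag rule can be written as a tiny scoreboard. This is a sketch under the assumptions stated in the slides (one R bit per register, cleared at issue and set at write-back); the class and method names are hypothetical, and function-unit availability is not modeled.

```python
class RFlags:
    """One ready (R) bit per register, used to decide whether an instruction may issue."""

    def __init__(self, n_registers: int):
        self.ready = [True] * n_registers

    def can_issue(self, dst: int, srcs) -> bool:
        """Ready condition: R(dst) and R(src) must be true for every source register."""
        return self.ready[dst] and all(self.ready[s] for s in srcs)

    def issue(self, dst: int) -> None:
        """At issue, the destination is reserved: its R bit becomes false."""
        self.ready[dst] = False

    def write_back(self, dst: int) -> None:
        """When the result is stored, the R bit of the destination becomes true again."""
        self.ready[dst] = True


if __name__ == "__main__":
    flags = RFlags(8)
    flags.issue(1)                      # i1: r1 = r5 has been issued
    print(flags.can_issue(2, [1]))      # i2: r2 = r1 + 1 must wait (true dependence)
    flags.write_back(1)
    print(flags.can_issue(2, [1]))
```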

Page 51: Real-time Signal Processing on Embedded Systems

i-way in-order issue

Consider how the following 4 instructions are executed on this processor.

i1: r1 = r5
i2: r2 = r1 + 1
i3: r3 = r6
i4: r4 = r3 + 1

In-order scheduling

Cycle   Function Unit 0     Function Unit 1
0       i1: r1 = r5
1       i2: r2 = r1 + 1     i3: r3 = r6
2       i4: r4 = r3 + 1

IPC becomes 1.3 (4 instructions / 3 cycles).

Page 52: Real-time Signal Processing on Embedded Systems

How to check dependency of instructions?

True and output dependences must be checked.

[Figure: an instruction window holding instruction 0 through instruction i-1 accesses the registers (each holding an R flag and a value) by register number; 3 × i register numbers are checked.]

Page 53: Real-time Signal Processing on Embedded Systems

How to allocate resources (function units)?

Allocation of $R_0$ is performed as follows.
• Check whether any of the preceding ready instructions refers to $R_0$.
• If no instruction refers to $R_0$, the function unit is available.

Repeat the above procedure from $R_1$ to $R_{r-1}$, where $r$ means the number of function units.

Page 54: Real-time Signal Processing on Embedded Systems

Complexity of i-way in-order issue

Ready detection
• $3i$ ports are required.
• $\sum_{k=1}^{i} 3(k-1) = \frac{3}{2}\, i(i-1)$ comparators are required for the check of operand dependency.

Resource allocation
• An $(i-1)$-input NOR gate is required.

Complexity increases by $O(i) \sim O(i^2)$.

Page 55: Real-time Signal Processing on Embedded Systems

i-way out-of-order issue

Out-of-order scheduling of the same code used in the previous i-way in-order case.

i1: r1 = r5
i2: r2 = r1 + 1
i3: r3 = r6
i4: r4 = r3 + 1

Out-of-order scheduling

Cycle   Function Unit 0     Function Unit 1
0       i1: r1 = r5         i3: r3 = r6
1       i2: r2 = r1 + 1     i4: r4 = r3 + 1

IPC becomes 2.0 (4 instructions / 2 cycles).
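The difference between in-order and out-of-order issue for this four-instruction example can be reproduced with a short scheduler. This is a simplified sketch: it assumes two function units, a one-cycle latency for every instruction, and checks only true (RAW) dependences, which is sufficient for this code. The function name and the instruction tuples are hypothetical.

```python
def schedule(program, width=2, in_order=True):
    """Return, per cycle, the names of the instructions issued in that cycle.

    program is a list of (name, dst, srcs) tuples in program order.
    """
    done_at = {}    # name -> first cycle in which its result is available
    issued = set()
    cycles = []
    cycle = 0
    while len(issued) < len(program):
        slot = []
        for idx, (name, dst, srcs) in enumerate(program):
            if name in issued:
                continue
            # Ready when every earlier producer of a source register has finished.
            ready = all(
                done_at.get(p_name, float("inf")) <= cycle
                for p_name, p_dst, _ in program[:idx]
                if p_dst in srcs
            )
            if ready and len(slot) < width:
                slot.append(name)
            elif in_order:
                break   # in-order issue: nothing behind a blocked instruction may issue
        for name in slot:
            issued.add(name)
            done_at[name] = cycle + 1   # assumption: one-cycle execution latency
        cycles.append(slot)
        cycle += 1
    return cycles


PROGRAM = [
    ("i1", "r1", {"r5"}),
    ("i2", "r2", {"r1"}),
    ("i3", "r3", {"r6"}),
    ("i4", "r4", {"r3"}),
]

if __name__ == "__main__":
    print(schedule(PROGRAM, in_order=True))
    print(schedule(PROGRAM, in_order=False))
```

Under these assumptions the in-order schedule takes 3 cycles and the out-of-order schedule takes 2, matching the two tables.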

Page 56: Real-time Signal Processing on Embedded Systems

Architectural requirements for out-of-order execution

• The depth of the instruction window should be increased to $n$.
• The number of register ports must be $3n$ for the dependence check.
• Anti-dependence must be checked, in addition to the checks of the i-way in-order case.
• Resource allocation can be performed in the same way as in the i-way in-order case.

Page 57: Real-time Signal Processing on Embedded Systems

Complexity of i-way out-of-order issue

Ready detection
• $3n$ ports are required.
• $\sum_{k=1}^{n} 5(k-1) = \frac{5}{2}\, n(n-1)$ comparators are required for the check of operand dependency.

Resource allocation
• An $(n-1)$-input NOR gate is required.

Complexity increases by $O(n) \sim O(n^2)$.

The increase of hardware complexity is more significant than in the in-order case because $n \gg i$ in general.

Page 58: Real-time Signal Processing on Embedded Systems

Tomasulo’s Algorithm

• was proposed by R. M. Tomasulo in 1967.
• was originally adopted in the floating-point unit of the IBM 360/91.
• Performance was drastically improved.
• Similar algorithms are used in the latest microprocessors.

Page 59: Real-time Signal Processing on Embedded Systems

Superscalar architecture using Tomasulo

[Figure: block diagram divided into Frontend and Ex-core. Blocks: instruction cache, instruction decode / tag allocation, branch prediction, reservation stations, function units, registers, data cache. Labeled operations: dispatch, issue.]

Page 60: Real-time Signal Processing on Embedded Systems

Contents of reservation station and register

Register
R | tag | value
The tag is used for register renaming.

Reservation station
op | dtag | Source 1: R, stag, value | Source 2: R, stag, value

• op: opcode
• dtag: destination tag
• stag: source tag
• R: ready flag
• value: operand’s value

Page 61: Real-time Signal Processing on Embedded Systems

Operation on the architecture

• Dispatch
• Issue
• Execution
• Finalization

Page 62: Real-time Signal Processing on Embedded Systems

Operation on the architecture: Dispatch

• A dtag is assigned to the destination operand from the tag pool, which holds unassigned tags.
• Source operands are obtained by reading the registers using each register number. If R is true, the value is read; otherwise the tag is read from the register.
• Then, the instruction is stored in the reservation station corresponding to the function unit used by the instruction.

Page 63: Real-time Signal Processing on Embedded Systems

Operation on the architecture: Issue

• A ready instruction in a reservation station is sent to the corresponding function unit if the function unit is available.
• The issued instruction is deleted from the reservation station.

Execution

• Issued instructions are executed on the corresponding function units.

Page 64: Real-time Signal Processing on Embedded Systems

Operation on the architecture: Finalize

• When execution produces a result, the dtag and the result value are broadcast on the result bus.
• If there is an instruction that holds the broadcast dtag as an stag, the R flag and value of that operand are replaced by true and the broadcast result value, respectively.
• Only when there is a register holding a tag equal to the broadcast dtag is the broadcast result stored in that register.
• Finally, the broadcast tag is returned to the tag pool.
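The finalize step can be sketched as a broadcast over waiting operands and registers. This is an illustrative model only: the `Slot` class (an R/tag/value triple used for both registers and source operands) and the `broadcast` function are hypothetical names, and the result bus is reduced to a function call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """An R / tag / value triple, used here for registers and for source operands."""
    ready: bool
    tag: Optional[int]
    value: Optional[int]


def broadcast(dtag: int, result: int, sources: List[Slot],
              registers: List[Slot], tag_pool: List[int]) -> None:
    """Deliver (dtag, result) to every waiting operand and register with a matching tag."""
    for operand in sources:
        if not operand.ready and operand.tag == dtag:
            operand.ready, operand.tag, operand.value = True, None, result
    for register in registers:
        # A register is updated only if it still holds this tag; if it was
        # renamed again by a later instruction, the result is not stored.
        if not register.ready and register.tag == dtag:
            register.ready, register.tag, register.value = True, None, result
    tag_pool.append(dtag)   # the tag becomes available for reuse


if __name__ == "__main__":
    r1, r2 = Slot(False, 50, None), Slot(False, 53, None)
    waiting = Slot(False, 50, None)
    pool = [54]
    broadcast(50, 15, [waiting], [r1, r2], pool)
    print(waiting, r1, r2, pool)
```

The second loop reflects the rule that a result is written to a register only when the register still holds the matching tag.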

Page 65: Real-time Signal Processing on Embedded Systems

An example of Tomasulo

The superscalar processor used in this example has the following 5-stage pipeline, and the number of ways is 2.
• IF: fetches 2 instructions.
• ID: decodes, allocates tags, and dispatches.
• RS: waits for operands until an instruction becomes ready.
• EX: executes an instruction.
• WB: writes a result.

i1: r1 = load A
i2: r2 = r1 + r3
i3: r4 = r2 + 1
i4: r2 = load B
# A and B are constants

Page 66: Real-time Signal Processing on Embedded Systems

Cycle 0

Reservation station (columns: op; Destination R, dtag, val; Source 1 R, stag, val; Source 2 R, stag, val): empty

State of instructions
Instruction           Stage
i1: r1 = load A
i2: r2 = r1 + r3
i3: r4 = r2 + 1
i4: r2 = load B

Registers
#   R   tag   val
1   1   X     2
2   1   X     4
3   1   X     7
4   1   X     9

Tag pool (next tag first): 50, 51, 52, 53, 54, ・・・, 30

Page 67: Real-time Signal Processing on Embedded Systems

Cycle 1

Reservation station: empty

State of instructions
Instruction           Stage
i1: r1 = load A       IF
i2: r2 = r1 + r3      IF
i3: r4 = r2 + 1
i4: r2 = load B

Registers
#   R   tag   val
1   1   X     2
2   1   X     4
3   1   X     7
4   1   X     9

Tag pool (next tag first): 50, 51, 52, 53, 54, ・・・, 30

Page 68: Real-time Signal Processing on Embedded Systems

Cycle 2

Reservation station
         Destination        Source 1           Source 2
op       R   dtag  val      R   stag  val      R   stag  val
load     0   50    X        1   X     A        1   X     0
add      0   51    X        0   50    X        1   X     7

State of instructions
Instruction           Stage
i1: r1 = load A       ID
i2: r2 = r1 + r3      ID
i3: r4 = r2 + 1       IF
i4: r2 = load B       IF

Registers
#   R   tag   val
1   0   50    X
2   0   51    X
3   1   X     7
4   1   X     9

Tag pool (next tag first): 52, 53, 54, ・・・, 30

Page 69: Real-time Signal Processing on Embedded Systems

Cycle 3

Reservation station
         Destination        Source 1           Source 2
op       R   dtag  val      R   stag  val      R   stag  val
load     1   50    15       1   X     A        1   X     0
add      0   51    X        0   50    X        1   X     7
add      0   52    X
load     0   53    X

State of instructions
Instruction           Stage
i1: r1 = load A       EX
i2: r2 = r1 + r3      RS
i3: r4 = r2 + 1       ID
i4: r2 = load B       ID

Registers
#   R   tag   val
1   0   50    X
2   0   53    X
3   1   X     7
4   0   52    X

Tag pool (next tag first): 54, ・・・, 30

Page 70: Real-time Signal Processing on Embedded Systems

Cycle 4

Reservation station
         Destination        Source 1           Source 2
op       R   dtag  val      R   stag  val      R   stag  val
load     1   50    15       1   X     A        1   X     0
add      1   51    22       1   50    15       1   X     7
add      0   52    X        0   51    X        1   X     1
load     1   53    16       1   X     B        1   X     0

State of instructions
Instruction           Stage
i1: r1 = load A       WB
i2: r2 = r1 + r3      EX
i3: r4 = r2 + 1       RS
i4: r2 = load B       EX

Registers
#   R   tag   val
1   1   X     15
2   0   53    X
3   1   X     7
4   0   52    X

Tag pool (next tag first): 54, ・・・, 30, 50

Page 71: Real-time Signal Processing on Embedded Systems

Cycle 5

Reservation station
         Destination        Source 1           Source 2
op       R   dtag  val      R   stag  val      R   stag  val
add      1   51    22       1   50    15       1   X     7
add      1   52    23       1   51    22       1   X     1
load     1   53    16       1   X     B        1   X     0

State of instructions
Instruction           Stage
i1: r1 = load A
i2: r2 = r1 + r3      WB
i3: r4 = r2 + 1       EX
i4: r2 = load B       WB

Registers
#   R   tag   val
1   1   X     15
2   1   X     16
3   1   X     7
4   0   52    X

Tag pool (next tag first): 54, ・・・, 30, 50, 51, 53

Page 72: Real-time Signal Processing on Embedded Systems

Cycle 6

Reservation station
         Destination        Source 1           Source 2
op       R   dtag  val      R   stag  val      R   stag  val
add      1   52    23       1   51    22       1   X     1

State of instructions
Instruction           Stage
i1: r1 = load A
i2: r2 = r1 + r3
i3: r4 = r2 + 1       WB
i4: r2 = load B

Registers
#   R   tag   val
1   1   X     15
2   1   X     16
3   1   X     7
4   1   X     23

Tag pool (next tag first): 54, ・・・, 30, 50, 51, 53, 52

Page 73: Real-time Signal Processing on Embedded Systems

Problem of out-of-order execution

It is difficult to update the processor state precisely if an exception occurs.

Instruction               In-order execution   Out-of-order execution
i0: ・・・・                  Fin                  Fin
i1: ・・・・                  Fin
i2: r1 = load r1          Fin                  Fin
i3: r2 = load r3          E                    E
i4: ・・・・
i5: r3 = r4 << r2                              Fin
i6: ・・・・

(Fin: finished, E: exception)

Page 74: Real-time Signal Processing on Embedded Systems

Flow of exception handling

• Unfinished instructions, including the instruction that caused the exception, are invalidated.
• Control is transferred to the OS, which saves the current state to main memory and handles the exception.
• After the exception has been handled, the CPU restarts execution from the instruction that caused the exception.

Page 75: Real-time Signal Processing on Embedded Systems

Problem of out-of-order execution

It is difficult to update the processor state precisely if an exception occurs.

In-order execution
Fin  i0: ・・・・
Fin  i1: ・・・・
Fin  i2: r1 = load r1
E    i3: r2 = load r3
     i4: ・・・・
     i5: r3 = r4 << r2
     i6: ・・・・

• Save the current state.
• OS handles the exception.
• CPU restarts from i3.

Page 76: Real-time Signal Processing on Embedded Systems

Problem of out-of-order execution

It is difficult to update the processor state precisely if an exception occurs.

Out-of-order execution
Fin  i0: ・・・・
     i1: ・・・・
Fin  i2: r1 = load r1
E    i3: r2 = load r3
     i4: ・・・・
Fin  i5: r3 = r4 << r2
     i6: ・・・・

• Save the current state.
  • i5 has finished before i3.
  • i1 has not finished.
  • The data of r3 has been lost.
• OS handles the exception.
• CPU cannot restart from i3.

A reorder buffer is used for precise exception handling.

Page 77: Real-time Signal Processing on Embedded Systems

Reorder buffer

• Updates the CPU’s state in the original program order by reordering results.
• Handles exceptions at the state update.

[Figure: results and information about exceptions enter the reorder buffer; at commit, results are stored to the registers in the original program order and exceptions are detected.]

Page 78: Real-time Signal Processing on Embedded Systems

Superscalar architecture using Tomasulo and reorder buffer

[Figure: block diagram divided into Frontend, Ex-core, and Backend. Blocks: instruction cache, instruction decode / tag allocation, branch prediction, reservation stations, function units, registers, data cache, reorder buffer. Labeled operations: dispatch, issue, commit.]

Page 79: Real-time Signal Processing on Embedded Systems

Behaviour of reorder buffer

• If there is a result without an exception, it is stored to a register and the entry corresponding to it is removed.
• If there is a result with an exception, the pipeline and the reorder buffer are cleared.
• If a result has not been stored yet, the reorder buffer waits until the result is obtained.
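The three cases above can be sketched as a commit loop over the head of the buffer. This is a simplified model: entries are plain dictionaries with the fields R, dreg, E, and result, the pipeline flush is reduced to clearing the buffer, and the function name `commit` and its return strings are hypothetical.

```python
from collections import deque


def commit(rob: deque, registers: dict) -> str:
    """Retire entries from the head of the reorder buffer in program order.

    Returns "exception" if a result with an exception reached the head,
    "waiting" if the head has no result yet, and "empty" otherwise.
    """
    while rob:
        head = rob[0]
        if not head["R"]:
            return "waiting"        # result not stored yet: wait
        if head["E"]:
            rob.clear()             # exception: discard all unretired entries
            return "exception"
        registers[head["dreg"]] = head["result"]   # normal result: update the register
        rob.popleft()
    return "empty"


if __name__ == "__main__":
    regs = {1: 2, 2: 4, 5: 0}
    rob = deque([
        {"R": True, "dreg": 1, "E": False, "result": 15},
        {"R": True, "dreg": 5, "E": True, "result": None},
        {"R": True, "dreg": 1, "E": False, "result": 16},
    ])
    print(commit(rob, regs), regs)
```

Because entries retire only from the head, a later instruction that finished early (here the write of 16 to r1) never updates the registers once an earlier instruction raises an exception.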

Page 80: Real-time Signal Processing on Embedded Systems

Contents of reorder buffer

PC | R | dreg | dtag | E | result

• PC: instruction address
• R: ready flag
• dreg: register number of destination
• dtag: operand tag of destination
• E: exception flag
• result: result

Page 81: Real-time Signal Processing on Embedded Systems

Operand bypass and supply of source operand tag

Tomasulo: operand values are obtained from the registers, which hold the latest values.

Reorder buffer: the latest values are stored in the reorder buffer (not in the registers).

Procedure of obtaining operands:
• Check for dependency on instructions decoded concurrently. If there is a dependency, the stag becomes the dtag of the instruction depended on.
• Otherwise, the reorder buffer is searched by source register number to obtain the value (when R = 1) or the tag (when R = 0). If the reorder buffer has neither a value nor a tag corresponding to the register number, the value is obtained from the registers.

Page 82: Real-time Signal Processing on Embedded Systems

An example of reorder buffer

The superscalar processor used in this example has the following 6-stage pipeline, and the number of ways is 2.
• IF: fetches 2 instructions.
• ID: decodes, allocates tags, and dispatches.
• RS: waits for operands until an instruction becomes ready.
• EX: executes an instruction.
• WB: writes results to the reorder buffer.
• RT: writes results to registers.

Page 83: Real-time Signal Processing on Embedded Systems

A code used in the example

      Address of instruction
i1:   0x40:   r1 = load A(r0)
i2:   0x44:   r2 = r1 + r3
i3:   0x48:   r2 = r2 + 16
i4:   0x4C:   r5 = load 0(r1)
i5:   0x50:   r1 = r1 + 1
i6:   0x54:   r2 = load 0(r2)

Page 84: Real-time Signal Processing on Embedded Systems

Cycle 0

Reservation station (columns: op; Destination E, dtag, val; Source 1 R, stag, val; Source 2 R, stag, val): empty

State of instructions
Instruction               Stage
i1: r1 = load A(r0)
i2: r2 = r1 + r3
i3: r2 = r2 + 16
i4: r5 = load 0(r1)
i5: r1 = r1 + 1
i6: r2 = load 0(r2)

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
H/T      20
         21
         22
         23
         24
         25

Page 85: Real-time Signal Processing on Embedded Systems

Cycle 1

Reservation station: empty

State of instructions
Instruction               Stage
i1: r1 = load A(r0)       IF
i2: r2 = r1 + r3          IF
i3: r2 = r2 + 16
i4: r5 = load 0(r1)
i5: r1 = r1 + 1
i6: r2 = load 0(r2)

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
H/T      20
         21
         22
         23
         24
         25

Page 86: Real-time Signal Processing on Embedded Systems

Cycle 2

Reservation station
         Destination        Source 1           Source 2
op       E   dtag  val      R   stag  val      R   stag  val
load     X   20    X        1   X     A        1   X     0
add      X   21    X        0   20    X        1   X     7

State of instructions
Instruction               Stage
i1: r1 = load A(r0)       ID
i2: r2 = r1 + r3          ID
i3: r2 = r2 + 16          IF
i4: r5 = load 0(r1)       IF
i5: r1 = r1 + 1
i6: r2 = load 0(r2)

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  0  1     20    X  X
         21     44  0  2     21    X  X
Tail     22
         23
         24
         25

Page 87: Real-time Signal Processing on Embedded Systems

Cycle 3

Reservation station
         Destination        Source 1           Source 2
op       E   dtag  val      R   stag  val      R   stag  val
load     0   20    15       1   X     A        1   X     0
add      X   21    X        0   20    X        1   X     7
add      X   22    X        0   21    X        1   X     16
load     X   23    X        1   X     0        0   20    X

State of instructions
Instruction               Stage
i1: r1 = load A(r0)       EX
i2: r2 = r1 + r3          RS
i3: r2 = r2 + 16          ID
i4: r5 = load 0(r1)       ID
i5: r1 = r1 + 1           IF
i6: r2 = load 0(r2)       IF

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  0  1     20    X  X
         21     44  0  2     21    X  X
         22     48  0  2     22    X  X
         23     4C  0  5     23    X  X
Tail     24
         25

Page 88: Real-time Signal Processing on Embedded Systems

Cycle 4

Reservation station
         Destination        Source 1           Source 2
op       E   dtag  val      R   stag  val      R   stag  val
load     0   20    15       1   X     A        1   X     0
add      0   21    22       1   20    15       1   X     7
add      X   22    X        0   21    X        1   X     16
load     1   23    ?        1   X     0        1   20    15
add      X   24    X        1   X     15       1   X     1
load     X   25    X        1   X     0        0   22    X

State of instructions
Instruction               Stage
i1: r1 = load A(r0)       WB
i2: r2 = r1 + r3          EX
i3: r2 = r2 + 16          RS
i4: r5 = load 0(r1)       EX
i5: r1 = r1 + 1           ID
i6: r2 = load 0(r2)       ID

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  1  1     20    0  15
         21     44  0  2     21    X  X
         22     48  0  2     22    X  X
         23     4C  0  5     23    X  X
         24     50  0  1     24    X  X
         25     54  0  2     25    X  X
Tail

Page 89: Real-time Signal Processing on Embedded Systems

Cycle 5

Reservation station
         Destination        Source 1           Source 2
op       E   dtag  val      R   stag  val      R   stag  val
add      0   21    22       1   20    15       1   X     7
add      0   22    38       1   21    22       1   X     16
load     1   23    ?        1   X     0        1   20    15
add      0   24    16       1   X     15       1   X     1
load     X   25    X        1   X     0        0   22    X

State of instructions
Instruction               Stage
i1: r1 = load A(r0)       RT
i2: r2 = r1 + r3          WB
i3: r2 = r2 + 16          EX
i4: r5 = load 0(r1)       WB
i5: r1 = r1 + 1           EX
i6: r2 = load 0(r2)       RS

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
         20     40  1  1     20    0  15
Head     21     44  1  2     21    0  22
         22     48  0  2     22    X  X
         23     4C  1  5     23    1  ?
         24     50  0  1     24    X  X
         25     54  0  2     25    X  X
Tail

Page 90: Real-time Signal Processing on Embedded Systems

Cycle 6

op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
add   0  22    38      1  21    22      1  X     16
add   0  24    16      1  X     15      1  X     1
load  0  25    4       1  X     0       1  22    38

State of instructions

Instruction          Stage
i1: r1=load A(r0)
i2: r2=r1+r3         RT
i3: r2=r2+16         WB
i4: r5=load 0(r1)    RT
i5: r1=r1+1          WB
i6: r2=load 0(r2)    EX

Reorder buffer

pointer  entry  PC  R  dreg  dtag  E  result
         20     40  1  1     20    0  15
         21     44  1  2     21    0  22
Head     22     48  1  2     22    0  38
         23     4C  1  5     23    1  ?
         24     50  1  1     24    0  16
         25     54  0  2     25    X  X
Tail

Page 91: Real-time Signal Processing on Embedded Systems

Cycle 7

op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
load  0  25    4       1  X     0       1  22    38

State of instructions

Instruction          Stage
i1: r1=load A(r0)
i2: r2=r1+r3
i3: r2=r2+16         RT
i4: r5=load 0(r1)    RT
i5: r1=r1+1          RT
i6: r2=load 0(r2)    WB

Reorder buffer

pointer  entry  PC  R  dreg  dtag  E  result
         20     40  1  1     20    0  15
         21     44  1  2     21    0  22
         22     48  1  2     22    0  38
H/T      23     4C  1  5     23    1  ?
         24     50  1  1     24    0  16
         25     54  0  2     25    X  X

Exception is detected.

Page 92: Real-time Signal Processing on Embedded Systems

VLIW (Very Long Instruction Word)

In a VLIW processor, the compiler extracts the parallelism in the code. Therefore, the special hardware support used in a superscalar processor becomes unnecessary.

Superscalar: dynamic scheduling by hardware support
VLIW: static scheduling by the compiler

Page 93: Real-time Signal Processing on Embedded Systems

Overview of VLIW

[Figure: division of work between compiler and processor]

Superscalar
  compiler:  main(){ ・・・ } -> code gen -> add sub ・・・
  processor: scheduling -> execution

VLIW
  compiler:  main(){ ・・・ } -> code gen -> add sub ・・・ -> scheduling -> add sub load add mul load ・・・
  processor: execution

Page 94: Real-time Signal Processing on Embedded Systems

VLIW code

Sequential code

i1: r3=r4+1
i2: r1=load(r2)
i3: r1=r1<<r3
i4: r5=r2+r6
i5: beq r5,L

VLIW code

ALU              ALU             MEM                Branch
i1: r3=r4+1      i4: r5=r2+r6    i2: r1=load(r2)    nop
i3: r1=r1<<r3    nop             nop                i5: beq r5,L

Page 95: Real-time Signal Processing on Embedded Systems

Hardware organization of VLIW

[Figure: instruction cache; functional units ALU, ALU, ・・・, MEM, Branch; registers; data cache]

Page 96: Real-time Signal Processing on Embedded Systems

VLIW vs Superscalar

                        Superscalar    VLIW
Hardware size           Large          Small
Hardware complexity     Large          Small
Scheduling algorithm    Poor           Rich
Instruction window      Small          Large
Binary compatibility    Compatible     Not compatible

Page 97: Real-time Signal Processing on Embedded Systems

Dynamic vs Static scheduling

Sample code

i1: r1=load A
i2: r2=load(r1)
i3: r3=load B
i4: r4=r3<<r2
i5: r5=r4+1
i6: r6=r2+r5

[Figure: data dependency graph of the code, with nodes i1 to i6]

Dynamic scheduling

Cycle  ALU  MEM
0           i1
1           i2
2           i3
3      i4
4      i5
5      i6

Optimal scheduling

Cycle  ALU  MEM
0           i3
1      i4   i1
2      i5   i2
3      i6

Page 98: Real-time Signal Processing on Embedded Systems

Advantage of dynamic scheduling

Scheduling can use information that is available only at run time.
  For example, the latency of a cache miss can be hidden.

Scheduling can use accurate memory dependencies.
  Data addresses, which are known only at run time, improve scheduling performance.

Page 99: Real-time Signal Processing on Embedded Systems

Taxonomy of scheduling algorithm

Local scheduling
Global scheduling

Cyclic scheduling
Acyclic scheduling

Trace-based scheduling
DAG-based (directed acyclic graph) scheduling

Page 100: Real-time Signal Processing on Embedded Systems

VLIW-based commercial processors

Transmeta Crusoe: aimed at mobile computing

Texas Instruments TMS320C6x series: embedded applications

Intel Itanium

Page 101: Real-time Signal Processing on Embedded Systems

Parallel operation by SIMD

What is SIMD? SIMD (Single Instruction, Multiple Data) means that one instruction applies the same operation to several operands. Ex: Addition

int i;
int a[4]={1,2,3,4};
int b[4]={5,6,7,8};
int c[4];
for (i=0;i<4;i++){
    c[i]=a[i]+b[i];
}

Sequential operation

c[0]=a[0]+b[0]
c[1]=a[1]+b[1]
c[2]=a[2]+b[2]
c[3]=a[3]+b[3]

SIMD

[Figure: the vectors (a[3], a[2], a[1], a[0]) and (b[3], b[2], b[1], b[0]) are added element by element into (c[3], c[2], c[1], c[0])]

Page 102: Real-time Signal Processing on Embedded Systems

SIMD data types (Cell/B.E.)

Type                         Contents
vector unsigned char         16 unsigned 8-bit values
vector signed char           16 signed 8-bit values
vector unsigned short        8 unsigned 16-bit values
vector signed short          8 signed 16-bit values
vector unsigned int          4 unsigned 32-bit values
vector signed int            4 signed 32-bit values
vector unsigned long long    2 unsigned 64-bit values
vector signed long long      2 signed 64-bit values
vector float                 4 32-bit floating-point values
vector double                2 64-bit double-precision floating-point values

Page 103: Real-time Signal Processing on Embedded Systems

Allocation of vector values

Vector values are allocated to memory in the big-endian style as shown in the following figure.

*This figure is adapted from cell.fixstars.com

Page 104: Real-time Signal Processing on Embedded Systems

How to access vector type via normal pointer

vector signed int va = (vector signed int) { 1, 2, 3, 4 };
int *a = (int *) &va;

*This figure is adapted from cell.fixstars.com

Page 105: Real-time Signal Processing on Embedded Systems

How to access a normal array from vector type

int a[8] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8 };
vector signed int *va = (vector signed int *) a;

*This figure is adapted from cell.fixstars.com

__attribute__((aligned(16))) forces the scalar array to start on a 16-byte boundary, which is required before it can be accessed through a vector pointer.

Page 106: Real-time Signal Processing on Embedded Systems

SIMD operation on PPE

int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
int c[4] __attribute__((aligned(16)));
vector signed int *va = (vector signed int *) a;
vector signed int *vb = (vector signed int *) b;
vector signed int *vc = (vector signed int *) c;
*vc = vec_add(*va, *vb);

[Figure: the vectors (a[3], a[2], a[1], a[0]) and (b[3], b[2], b[1], b[0]) are added element by element into (c[3], c[2], c[1], c[0])]

vec_add is a SIMD function provided by VMX (Vector Multimedia Extension), proposed by IBM and Motorola.

Page 107: Real-time Signal Processing on Embedded Systems

Entire code for vector addition

#include <stdio.h>
#include <altivec.h>

int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
int c[4] __attribute__((aligned(16)));

int main(int argc, char **argv)
{
    vector signed int *va = (vector signed int *) a;
    vector signed int *vb = (vector signed int *) b;
    vector signed int *vc = (vector signed int *) c;

    *vc = vec_add(*va, *vb);   /* four additions in one operation */

    printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);
    return 0;
}

Page 108: Real-time Signal Processing on Embedded Systems

A part of VMX functions

Arithmetic operation
  vec_add(a,b)       a+b
  vec_sub(a,b)       a-b
  vec_madd(a,b,c)    a*b+c

Logical operation
  vec_and(a,b)       Logical and
  vec_or(a,b)        Logical or

Bit operation
  vec_perm(a,b,c)    creating a new vector from a[i] and b[i] based on c[i]
  vec_sel(a,b,c)     selecting a[i] or b[i] based on c[i]

Branch
  vec_cmpeq(a,b)     a[i]==b[i]
  vec_cmpgt(a,b)     a[i]>b[i]

Type conversion
  vec_ctf(a,b)       (float)a[i]/(2^b)
  vec_ctu(a,b)       (unsigned int)(a[i]*(2^b))

Generating constant
  vec_splat(a,b)     every element set to a[b]
  vec_splat_s32(a)   every element set to the signed constant a

Page 109: Real-time Signal Processing on Embedded Systems

How to create dense vector data

In general, the data needed for a vector operation is not stored densely in memory. Therefore, a dense vector must be created before the vector operation.

vc = vec_perm(va, vb, vpat);

*This figure is adapted from cell.fixstars.com

Page 110: Real-time Signal Processing on Embedded Systems

Ex of vec_perm: Transpose

*These figures are adapted from cell.fixstars.com

Page 111: Real-time Signal Processing on Embedded Systems

Branch on SIMD

*These figures are adapted from cell.fixstars.com

Page 112: Real-time Signal Processing on Embedded Systems

Procedure of SIMD Branch

*These figures are adapted from cell.fixstars.com

Page 113: Real-time Signal Processing on Embedded Systems

Detail of SIMD Branch

vec_cmpgt()

vec_sel()

*These figures are adapted from cell.fixstars.com

Page 114: Real-time Signal Processing on Embedded Systems

Ex of SIMD Branch (scalar code)

int a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
int b[16] = { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
int c[16];
int i;

for (i = 0; i < 16; i++) {
    if (a[i] > b[i]) {
        c[i] = a[i] - b[i];
    } else {
        c[i] = b[i] - a[i];
    }
}

Page 115: Real-time Signal Processing on Embedded Systems

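Ex of SIMD Branch (SIMD code)

#include <altivec.h>

int a[16] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8,
                                           9, 10, 11, 12, 13, 14, 15, 16 };
int b[16] __attribute__((aligned(16))) = { 16, 15, 14, 13, 12, 11, 10, 9,
                                           8, 7, 6, 5, 4, 3, 2, 1 };
int c[16] __attribute__((aligned(16)));

int main(int argc, char **argv)
{
    vector signed int *va = (vector signed int *) a;
    vector signed int *vb = (vector signed int *) b;
    vector signed int *vc = (vector signed int *) c;
    vector signed int vc_true, vc_false;
    vector unsigned int vpat;
    int i;

    /* Each iteration handles four elements, so 16 elements need 4 iterations. */
    for (i = 0; i < 4; i++) {
        /* Mask: all ones where a > b, all zeros elsewhere. */
        vpat = (vector unsigned int) vec_cmpgt(va[i], vb[i]);
        vc_true = vec_sub(va[i], vb[i]);     /* result for a > b */
        vc_false = vec_sub(vb[i], va[i]);    /* result otherwise */
        /* Take vc_true where the mask is set, vc_false elsewhere. */
        vc[i] = vec_sel(vc_false, vc_true, vpat);
    }
    return 0;
}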