Real-time Signal Processing on Embedded Systems

Post on 23-Feb-2016

Advanced Cutting-edge Research Seminar I&III

Advances in Microprocessor Technology

Architectural improvements of microprocessors

• Pipelining
• Parallel processing exploiting ILP: Superscalar, VLIW
• SIMD

Procedure of instruction execution on a processor

• Instruction Fetch (IF): fetches an instruction from main memory.
• Instruction Decode (ID): decodes the fetched instruction.
• Execution (EX): executes the decoded instruction.
• Memory Access (MA): accesses main memory.
• Write Back (WB): writes data back to registers.

Operation cycles on a processor

• Single-cycle machine: this kind of machine executes all procedures from IF to WB in one cycle. Operation speed is determined by the slowest instruction, because every instruction must complete within a single cycle.
• Multi-cycle machine: this kind of machine executes an instruction over several cycles (IF, ID, EX, MA, WB).

Pipelined operation can improve the throughput of instructions.

[Figure: timing diagrams of successive instructions, each passing through the IF, ID, EX, MA, and WB stages]

To realize pipelined operation, several techniques are required.
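The throughput gain can be illustrated with a small calculation. The following is a minimal sketch in C, assuming an ideal pipeline with no stalls; the function names are hypothetical and are not part of the lecture.

```c
/* Cycles needed when each instruction runs through all stages
 * before the next instruction starts (multi-cycle machine). */
unsigned long multicycle_cycles(unsigned long n_instr, unsigned stages)
{
    return n_instr * stages;
}

/* Cycles needed on an ideal pipeline: the first instruction takes
 * `stages` cycles, and each later instruction completes one cycle
 * after its predecessor. Assumes no hazards, hence no stalls. */
unsigned long pipelined_cycles(unsigned long n_instr, unsigned stages)
{
    return n_instr == 0 ? 0 : stages + (n_instr - 1);
}
```

For 5 instructions on a 5-stage machine this gives 25 cycles without pipelining and 9 cycles with it. The hazards described next are what keep a real pipeline from reaching this ideal figure.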

Causes of pipeline hazards

• Structural hazard: the hardware cannot cope with the combination of issued instructions.
• Data hazard: a later instruction must wait for the completion of an earlier instruction because it uses the result of the earlier one.
• Control hazard: whether an instruction is executed or not depends on the result of an earlier instruction.

Structural hazard

[Figure: four instructions overlapped in the pipeline, each starting one cycle after the previous one (IF, ID, EX, MA, WB). The CPU contains the PC, instruction register, instruction decoder, ALU, and registers, and is connected to a single memory.]

Structural hazard

[Figure: four overlapped instructions on a CPU with a single memory. The MA stage of the first instruction and the IF stage of the fourth instruction fall in the same cycle.]

MA/IF conflict

Structural hazard

Resolve 1: stall the next instruction.

[Figure: the fourth instruction is delayed so that its IF stage no longer overlaps the MA stage of the first instruction]

Structural hazard

Resolve 2: add another data bus to access the instruction memory.

[Figure: Harvard Architecture. The CPU (PC, instruction register, instruction decoder, ALU, registers) is connected to a separate instruction memory (Inst Mem) and data memory (Data Mem), which removes the MA/IF conflict.]

Data hazard

add $s0,$t0,$t1    ($s0=$t0+$t1)
sub $t2,$s0,$t3    ($t2=$s0-$t3)

[Figure: the two instructions overlapped in the pipeline (IF, ID, EX, MA, WB) on a CPU with PC, instruction register, instruction decoder, ALU, and registers]

Initial register values: $t0=5, $t1=4, $t2=3, $t3=2, $t4=1; $s0 to $s4 are all 0.

The add instruction computes $s0=$t0+$t1, but sub reads $s0 before that result has been written back. It therefore computes $t2=$s0-$t3 with the old value of $s0: -2=0-2.

Data hazard

Waiting by stalls: the sub instruction is delayed until the result of add ($s0=$t0+$t1) is available, consuming 3 cycles.

Data hazard

Resolve: forwarding

The result of add ($s0=$t0+$t1=9) is forwarded to the ALU, so sub uses the new value without waiting for the write back: $t2=9-$t3, that is, 7=9-2.
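The effect of forwarding on this instruction pair can be sketched in C. This is only a behavioural model under simplifying assumptions (no pipeline timing is simulated); the function name run_add_sub and the register indices are hypothetical.

```c
/* Register indices for this sketch (names follow the slide). */
enum { T0, T1, T2, T3, T4, S0, NUM_REGS };

/* Runs "add $s0,$t0,$t1" then "sub $t2,$s0,$t3" back to back and
 * returns the final value of $t2. The add result sits in a pipeline
 * latch and has not been written back when sub reads its operands. */
int run_add_sub(int forwarding)
{
    int reg[NUM_REGS] = { 5, 4, 3, 2, 1, 0 };   /* $t0..$t4, $s0 */

    int add_result = reg[T0] + reg[T1];         /* EX stage of add */

    /* Operand read of sub: stale register value, or forwarded latch. */
    int s0_operand = forwarding ? add_result : reg[S0];
    int sub_result = s0_operand - reg[T3];

    reg[S0] = add_result;                       /* WB of add */
    reg[T2] = sub_result;                       /* WB of sub */
    return reg[T2];
}
```

With the register values of the example, run_add_sub(0) reproduces the wrong result -2 and run_add_sub(1) gives the correct result 7.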

Control hazard

An instruction sequence including a branch:

add $s0,$t0,$t1    ($s0=$t0+$t1)
beq $s1,$s2,40     (if($s1==$s2){goto 40})
or  $s3,$s4,$t2    ($s3=$s4|$t2)

[Figure: the three instructions overlapped in the pipeline (IF, ID, EX, MA, WB); the PC advances from 10 to 11 to 12 as they are fetched]

※ In this explanation, the PC uses word addresses for simplification.

The PC value of the instruction after beq depends on the branch condition. Branch is taken: PC=40. Not taken: PC=12.

Control hazard
Resolve 1: stall

add $s0,$t0,$t1    ($s0=$t0+$t1)
beq $s1,$s2,40     (if($s1==$s2){goto 40})
or  $s3,$s4,$t2    ($s3=$s4|$t2)

[Figure: the instruction after beq is delayed by a 2-cycle stall]

The number of required stall cycles is determined by the architecture. If the processor can calculate the branch target address at the ID stage, a 1-cycle stall is sufficient.

Control hazard
Resolve 2: branch prediction

add $s0,$t0,$t1    ($s0=$t0+$t1)
beq $s1,$s2,40     (if($s1==$s2){goto 40})
or  $s3,$s4,$t2    ($s3=$s4|$t2)

In this example, the next PC is predicted as if the branch is always untaken, so fetching continues with PC=10, 11, 12 without stalling.

If the prediction is missed, in other words if the branch is taken, the PC is set to 40 and a stall occurs.

Control hazard
More practical scheme: dynamic branch prediction

n-bit counter-based prediction: the lower i bits of the address of a branch instruction select an entry of the Branch History Table, and each entry is an n-bit saturating up/down counter.

1-bit counter-based prediction
• State 1: predict branch will be taken.
• State 0: predict branch will be untaken.
• The state becomes 1 when the branch is taken and 0 when the branch is untaken.

2-bit counter-based prediction
• States 11 and 10: predict branch will be taken.
• States 01 and 00: predict branch will be untaken.
• The counter counts up (saturating at 11) when the branch is taken and counts down (saturating at 00) when the branch is untaken.

This scheme is adopted in the Intel Pentium, Sun UltraSPARC, MIPS R10000, etc.
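The 2-bit scheme can be sketched in C as follows. This is a minimal sketch, not a model of any specific processor: the table size (i = 4), the initial counter value 00, and the function names are assumptions made for illustration.

```c
#include <stdint.h>

#define BHT_BITS 4                       /* i = 4 is an arbitrary choice */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];            /* 2-bit counters, all start at 00 */

/* The lower i bits of the branch address select the table entry. */
static unsigned bht_index(uint32_t branch_addr)
{
    return branch_addr & (BHT_SIZE - 1);
}

/* Returns 1 if the branch is predicted taken (counter is 10 or 11). */
int predict_taken(uint32_t branch_addr)
{
    return bht[bht_index(branch_addr)] >= 2;
}

/* Saturating update of the counter with the actual branch outcome. */
void update_predictor(uint32_t branch_addr, int taken)
{
    uint8_t *c = &bht[bht_index(branch_addr)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

Because the counter has two bits, a single untaken outcome in a long run of taken branches (for example a loop exit) does not flip the prediction, which is the practical advantage over the 1-bit scheme.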

Control hazard
Resolve 3: delayed branch

add $s0,$t0,$t1    ($s0=$t0+$t1)
beq $s1,$s2,40     (if($s1==$s2){goto 40})
Inserted instruction
or  $s3,$s4,$t2    ($s3=$s4|$t2)

An instruction that has no dependency is inserted after beq and is fetched at PC=12. After it, the instruction at the determined address (PC=13 or 40) is executed.

[Figure: the four instructions overlapped in the pipeline (IF, ID, EX, MA, WB) without a stall]

Exploiting ILP (Instruction Level Parallelism)

• Superscalar: issuing multiple instructions per cycle with hardware support. Advantage: binary compatibility.
• VLIW: issuing multiple instructions per cycle with compiler support. Advantage: simple hardware.

Types of data dependence

• True data dependence (RAW: Read After Write): difficult to remove.
    i1: r2=r1+r3
    i2: r4=r2+1
• Anti-dependence (WAR: Write After Read) and output dependence (WAW: Write After Write): can be removed by register renaming. They are called artificial dependences.
    i1: r1=r2+r3
    i2: r2=r4+1    (anti-dependence on i1 through r2)
    i3: r1=r4+2    (output dependence on i1 through r1)
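The three kinds of dependence can be detected by comparing register numbers. The following C sketch assumes a simple three-operand instruction format; the struct, the constants, and the function name are hypothetical.

```c
/* A three-operand instruction: dst = src1 op src2 (register numbers).
 * A negative source means "no register operand" (e.g. an immediate). */
struct instr { int dst, src1, src2; };

enum { DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* Returns a bit mask of the dependences of `later` on `earlier`. */
int classify_dependence(struct instr earlier, struct instr later)
{
    int mask = 0;
    if (later.src1 == earlier.dst || later.src2 == earlier.dst)
        mask |= DEP_RAW;      /* later reads what earlier writes */
    if (later.dst == earlier.src1 || later.dst == earlier.src2)
        mask |= DEP_WAR;      /* later overwrites what earlier reads */
    if (later.dst == earlier.dst)
        mask |= DEP_WAW;      /* both write the same register */
    return mask;
}
```

Applied to the examples above, the pair r2=r1+r3 and r4=r2+1 yields DEP_RAW, while r1=r2+r3 followed by r2=r4+1 yields DEP_WAR and r1=r2+r3 followed by r1=r4+2 yields DEP_WAW.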

Basic Architecture of Superscalar Processor

[Figure: block diagram. Frontend: instruction cache, branch prediction, instruction decode and register renaming, followed by dispatch. Ex-core: instruction window, issue, function units, registers, and data cache. Backend: reorder buffer and commit.]

Basic function of Frontend

• Provides enough instructions.
• Predicts the next instruction address if a branch instruction appears.
• Resolves artificial dependences by register renaming.
• Analyzes true data dependence after register renaming.
• Transfers instructions after the above operations. This operation is called “dispatch”.

Basic function of Ex-core

• Finds as many independent instructions as possible among those stored in the “instruction window”. In this operation, dynamic scheduling is performed to satisfy several restrictions: data dependence, resources, predefined priority, etc.
• Executes independent instructions in parallel. The operation that transfers an instruction to a function unit is called “issue”.

Basic function of Backend

• Updates the processor state. Results obtained out of order are reordered into program order.
• The update of the processor state is performed precisely.
• The update of the processor state based on the execution result is called “commit”. The removal of an instruction is called “retire”.

Dynamic instruction scheduling

Instruction scheduling means determining the order in which instructions are issued and when each instruction is issued.

In superscalar processors, dynamic instruction scheduling is performed using the instructions stored in the instruction buffer.

In the following slides, dynamic scheduling is explained using several types of processors: a 1-way in-order processor, an i-way in-order processor, and an i-way out-of-order processor.

1-way in-order issue

• The number of instructions issued per cycle is at most 1.
• The size of the instruction window is 1, because no subsequent instruction can be issued if an instruction cannot be issued.
• Only true and output dependences need to be checked, because anti-dependence is always resolved.

Control by R flag

The R flag is used to check true and output dependences.

[Figure: an instruction with the fields op, dst, src1, and src2; each register number selects a register entry consisting of an R flag and a value]

Only when R(dst) == true && R(src1) == true && R(src2) == true is the instruction issued. (This condition is called “ready”.)

R == false means that the register is reserved but the result has not been stored yet. In this case, the operand is not available.

Update sequence of the R flag

• The R bit of the destination becomes false when an instruction is issued.
• The R bit of the destination becomes true when a result is stored in the destination.

By the above update,
• instructions using unavailable registers as source registers are not issued; true dependence is resolved.
• instructions using an unavailable register as a destination register are not issued; output dependence is resolved.

In practice, resource restrictions must also be satisfied to issue instructions, in addition to the dependency check. In this lecture, only the restriction on function units is considered, to simplify the discussion.
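The R-flag control can be sketched in C as below. The register count, the struct layout, and the function names are assumptions made for illustration, and function-unit availability is ignored.

```c
#include <stdbool.h>

#define NUM_REGS 8

struct instr { int dst, src1, src2; };   /* register numbers */

/* R flag of each register; true means the value is available. */
static bool R[NUM_REGS] = { true, true, true, true, true, true, true, true };

/* Ready condition: destination and both sources must have R == true. */
bool is_ready(struct instr in)
{
    return R[in.dst] && R[in.src1] && R[in.src2];
}

/* Issue: reserve the destination register.
 * Returns false if the instruction is not ready and must wait. */
bool issue(struct instr in)
{
    if (!is_ready(in))
        return false;
    R[in.dst] = false;
    return true;
}

/* Called when the result has been stored in the destination register. */
void write_back(struct instr in)
{
    R[in.dst] = true;
}
```

After issuing an instruction that writes r1, a following instruction that reads r1 (true dependence) or writes r1 (output dependence) is refused by issue() until write_back() has been called for the first one.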

i-way in-order issue

We consider how the following 4 instructions are executed on this processor.

i1: r1 = r5
i2: r2 = r1 + 1
i3: r3 = r6
i4: r4 = r3 + 1

In-order scheduling

Cycle  Function Unit 0    Function Unit 1
0      i1: r1 = r5
1      i2: r2 = r1 + 1    i3: r3 = r6
2      i4: r4 = r3 + 1

IPC becomes 1.3 (4 instructions / 3 cycles).

How to check dependency of instructions?

True and output dependence must be checked.

[Figure: an instruction window holding instruction 0 to instruction i-1 is connected to the registers (an R flag and a value for each register number); the connections are labeled 3 × i]

How to allocate resources (function units)?

Allocation of $R_0$ is performed as follows.
• Check whether any of the preceding ready instructions refers to $R_0$ or not. If there is no instruction referring to $R_0$, the function unit is available.
• Repeat the above procedure from $R_1$ to $R_{r-1}$, where $r$ means the number of function units.

Complexity of i-way in-order issue

Ready detection
• $3i$ ports are required.
• $\sum_{k=1}^{i} 3(k-1) = \frac{3}{2} i(i-1)$ comparators are required for the check of operand dependency.

Resource allocation
• An $(i-1)$-input NOR gate is required.

Complexity increases by $O(i) \sim O(i^2)$.

i-way out-of-order issue

Out-of-order scheduling of the same code used in the previous i-way in-order case.

i1: r1 = r5
i2: r2 = r1 + 1
i3: r3 = r6
i4: r4 = r3 + 1

Out-of-order scheduling

Cycle  Function Unit 0    Function Unit 1
0      i1: r1 = r5        i3: r3 = r6
1      i2: r2 = r1 + 1    i4: r4 = r3 + 1

IPC becomes 2.0 (4 instructions / 2 cycles).
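The two schedules for this code can be reproduced with a small simulation. This is a sketch under stated assumptions: two function units, a latency of one cycle for every instruction, and only true (RAW) dependences tracked, which is sufficient for this code sequence. The names are hypothetical.

```c
#include <stdbool.h>

#define N_INSTR 4
#define N_UNITS 2

/* dst = f(src), given as register numbers. */
struct instr { int dst, src; };

static const struct instr code[N_INSTR] = {
    { 1, 5 },   /* i1: r1 = r5     */
    { 2, 1 },   /* i2: r2 = r1 + 1 */
    { 3, 6 },   /* i3: r3 = r6     */
    { 4, 3 },   /* i4: r4 = r3 + 1 */
};

/* Returns the number of cycles needed to issue all instructions. */
int schedule(bool out_of_order)
{
    int issue_cycle[N_INSTR];
    int issued = 0, cycle = 0;

    for (int k = 0; k < N_INSTR; k++)
        issue_cycle[k] = -1;                       /* not issued yet */

    while (issued < N_INSTR) {
        int used = 0;                              /* units used this cycle */
        for (int k = 0; k < N_INSTR && used < N_UNITS; k++) {
            if (issue_cycle[k] >= 0)
                continue;                          /* already issued */
            bool ready = true;
            for (int p = 0; p < k; p++)            /* earlier producers */
                if (code[p].dst == code[k].src &&
                    (issue_cycle[p] < 0 || issue_cycle[p] == cycle))
                    ready = false;                 /* result not available */
            if (ready) {
                issue_cycle[k] = cycle;
                used++;
                issued++;
            } else if (!out_of_order) {
                break;                             /* in-order: block here */
            }
        }
        cycle++;
    }
    return cycle;
}
```

schedule(false) needs 3 cycles and schedule(true) needs 2 cycles, matching the two tables. The only difference between the modes is whether a non-ready instruction blocks the instructions behind it.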

Architectural requirements for out-of-order execution

• The depth of the instruction window should be increased to $n$.
• The number of register ports must be $3n$ for the check of dependence.
• Anti-dependence must be checked, in addition to the checks of the i-way in-order case.
• Resource allocation can be performed in the same way as in the i-way in-order case.

Complexity of i-way out-of-order issue

Ready detection
• $3n$ ports are required.
• $\sum_{k=1}^{n} 5(k-1) = \frac{5}{2} n(n-1)$ comparators are required for the check of operand dependency.

Resource allocation
• An $(n-1)$-input NOR gate is required.

Complexity increases by $O(n) \sim O(n^2)$.

The increase of hardware complexity is more significant than in the in-order case because $n \gg i$ in general.

Tomasulo’s Algorithm

• Was proposed by R.M. Tomasulo in 1967.
• Was originally adopted in the floating point unit of the IBM 360/91. Performance was drastically improved.
• Similar algorithms are used in the latest microprocessors.

Superscalar arch using Tomasulo

[Figure: block diagram. Frontend: instruction cache, branch prediction, instruction decode and tag allocation, followed by dispatch. Ex-core: reservation stations, issue, function units, registers, and data cache.]

Contents of reservation station and register

Register: R | tag | value
The tag is used for register renaming.

Reservation station: op | dtag | Source 1 (R, stag, value) | Source 2 (R, stag, value)

• op: opcode
• dtag: destination tag
• stag: source tag
• R: ready flag
• value: operand’s value

Operation on the arch

• Dispatch
• Issue
• Execution
• Finalization

Dispatch
• A dtag is assigned to the destination operand from the tag pool, which holds unassigned tags.
• Source operands are obtained by reading the registers using each register number. If R is true, the value is read; otherwise the tag is read from the register.
• Then the instruction is stored in the reservation station corresponding to the function unit used by the instruction.

Issue
• A ready instruction in a reservation station is executed on the corresponding function unit, if the function unit is available.
• The issued instruction is deleted from the reservation station.

Execution
• Issued instructions are executed on the corresponding function units.

Finalization
• Based on the result of execution, the dtag and the result value are broadcast on the result bus.
• If an instruction holds the broadcast dtag as an stag, the R flag and the value of that operand are replaced by true and the broadcast result value, respectively.
• Only when a register holds a tag corresponding to the broadcast dtag is the broadcast result stored in that register.
• Finally, the broadcast tag is returned to the tag pool.
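The finalization step can be sketched in C as follows. The data layout, the sizes, and the function name broadcast are hypothetical, and returning the tag to the tag pool is omitted to keep the sketch short.

```c
#include <stdbool.h>

#define N_RS   4
#define N_REGS 8

struct operand  { bool R; int stag; int value; };
struct rs_entry { bool busy; int dtag; struct operand src[2]; };
struct reg      { bool R; int tag; int value; };

/* Finalization: broadcast (dtag, result) on the result bus. */
void broadcast(struct rs_entry rs[N_RS], struct reg regs[N_REGS],
               int dtag, int result)
{
    /* Waiting operands whose stag matches capture the value. */
    for (int e = 0; e < N_RS; e++) {
        if (!rs[e].busy)
            continue;
        for (int s = 0; s < 2; s++) {
            if (!rs[e].src[s].R && rs[e].src[s].stag == dtag) {
                rs[e].src[s].R = true;
                rs[e].src[s].value = result;
            }
        }
    }
    /* A register is updated only if it still holds this tag. */
    for (int r = 0; r < N_REGS; r++) {
        if (!regs[r].R && regs[r].tag == dtag) {
            regs[r].R = true;
            regs[r].value = result;
        }
    }
}
```

The tag comparison on the register side is what makes renaming work: if a later instruction has already been assigned the same destination register with a newer tag, the older result is passed to waiting instructions but is not written to the register.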

An example of Tomasulo

The superscalar processor used in this example has the following 5-stage pipeline, and the number of ways is 2.
• IF: fetches 2 instructions.
• ID: decodes, allocates tags, and dispatches.
• RS: waits for operands until an instruction becomes ready.
• EX: executes an instruction.
• WB: writes a result.

i1: r1 = load A
i2: r2 = r1 + r3
i3: r4 = r2 + 1
i4: r2 = load B
# A and B are constants

Cycle 0

Reservation stations
op    Destination      Source 1         Source 2
      R  dtag  val     R  stag  val     R  stag  val
(empty)

State of instructions
Instruction          Stage
i1: r1 = load A
i2: r2 = r1 + r3
i3: r4 = r2 + 1
i4: r2 = load B

Registers
#  R  tag  Val
1  1  X    2
2  1  X    4
3  1  X    7
4  1  X    9

Tag pool: 30, ・・・, 54, 53, 52, 51, 50

Cycle 1

Reservation stations
op    Destination      Source 1         Source 2
      R  dtag  val     R  stag  val     R  stag  val
(empty)

State of instructions
Instruction          Stage
i1: r1 = load A      IF
i2: r2 = r1 + r3     IF
i3: r4 = r2 + 1
i4: r2 = load B

Registers
#  R  tag  Val
1  1  X    2
2  1  X    4
3  1  X    7
4  1  X    9

Tag pool: 30, ・・・, 54, 53, 52, 51, 50

Cycle 2

Reservation stations
op    Destination      Source 1         Source 2
      R  dtag  val     R  stag  val     R  stag  val
load  0  50    X       1  X     A       1  X     0
add   0  51    X       0  50    X       1  X     7

State of instructions
Instruction          Stage
i1: r1 = load A      ID
i2: r2 = r1 + r3     ID
i3: r4 = r2 + 1      IF
i4: r2 = load B      IF

Registers
#  R  tag  Val
1  0  50   X
2  0  51   X
3  1  X    7
4  1  X    9

Tag pool: 30, ・・・, 54, 53, 52

Cycle 3

Reservation stations
op    Destination      Source 1         Source 2
      R  dtag  val     R  stag  val     R  stag  val
load  1  50    15      1  X     A       1  X     0
add   0  51    X       0  50    X       1  X     7
add   0  52    X
load  0  53    X

State of instructions
Instruction          Stage
i1: r1 = load A      EX
i2: r2 = r1 + r3     RS
i3: r4 = r2 + 1      ID
i4: r2 = load B      ID

Registers
#  R  tag  Val
1  0  50   X
2  0  53   X
3  1  X    7
4  0  52   X

Tag pool: 30, ・・・, 54

Cycle 4

Reservation stations
op    Destination      Source 1         Source 2
      R  dtag  val     R  stag  val     R  stag  val
load  1  50    15      1  X     A       1  X     0
add   1  51    22      1  50    15      1  X     7
add   0  52    X       0  51    X       1  X     1
load  1  53    16      1  X     B       1  X     0

State of instructions
Instruction          Stage
i1: r1 = load A      WB
i2: r2 = r1 + r3     EX
i3: r4 = r2 + 1      RS
i4: r2 = load B      EX

Registers
#  R  tag  Val
1  1  X    15
2  0  53   X
3  1  X    7
4  0  52   X

Tag pool: 50, 30, ・・・, 54

Cycle 5

Reservation stations
op    Destination      Source 1         Source 2
      R  dtag  val     R  stag  val     R  stag  val
add   1  51    22      1  50    15      1  X     7
add   1  52    23      1  51    22      1  X     1
load  1  53    16      1  X     B       1  X     0

State of instructions
Instruction          Stage
i1: r1 = load A
i2: r2 = r1 + r3     WB
i3: r4 = r2 + 1      EX
i4: r2 = load B      WB

Registers
#  R  tag  Val
1  1  X    15
2  1  X    16
3  1  X    7
4  0  52   X

Tag pool: 53, 51, 50, 30, ・・・, 54

Cycle 6

Reservation stations
op    Destination      Source 1         Source 2
      R  dtag  val     R  stag  val     R  stag  val
add   1  52    23      1  51    22      1  X     1

State of instructions
Instruction          Stage
i1: r1 = load A
i2: r2 = r1 + r3
i3: r4 = r2 + 1      WB
i4: r2 = load B

Registers
#  R  tag  Val
1  1  X    15
2  1  X    16
3  1  X    7
4  1  X    23

Tag pool: 52, 53, 51, 50, 30, ・・・, 54

Problem of out-of-order execution

It is difficult to update the processor state precisely if an exception occurs. In the listings below, “Fin” marks an instruction that has finished and “E” marks the instruction that caused the exception.

Flow of exception handling
• Unfinished instructions, including the instruction that caused the exception, are invalidated.
• Control is transferred to the OS, which saves the current state to main memory and handles the exception.
• After the exception has been processed, the CPU executes the instruction that caused the exception again.

In-order execution

Fin  i0: ・・・・
Fin  i1: ・・・・
Fin  i2: r1 = load r1
E    i3: r2 = load r3
     i4: ・・・・
     i5: r3 = r4 << r2
     i6: ・・・・

• Save the current state.
• OS handles the exception.
• CPU restarts from i3.

Out-of-order execution

Fin  i0: ・・・・
     i1: ・・・・
Fin  i2: r1 = load r1
E    i3: r2 = load r3
     i4: ・・・・
Fin  i5: r3 = r4 << r2
     i6: ・・・・

• Save the current state. However, i5 has finished before i3, i1 has not finished, and the data of r3 has been lost.
• OS handles the exception.
• CPU cannot restart from i3.

A reorder buffer is used for precise exception handling.

Reorder buffer

• Updates the CPU’s state in the original program order by reordering results.
• Handles exceptions at the state update.

[Figure: results and information about exceptions enter the reorder buffer, which stores results in the original program order and detects exceptions; results are then committed to the registers]

Superscalar arch using Tomasulo and reorder buffer

[Figure: block diagram. Frontend: instruction cache, branch prediction, instruction decode and tag allocation, followed by dispatch. Ex-core: reservation stations, issue, function units, registers, and data cache. Backend: reorder buffer and commit.]

Behaviour of reorder buffer

• If there is a result without an exception, it is stored in a register and the corresponding entry is removed.
• If there is a result with an exception, the pipeline and the reorder buffer are cleared.
• If a result has not been stored yet, the reorder buffer waits until the result is obtained.
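The three cases can be sketched in C as one commit step applied to the oldest entry. This is a minimal sketch: the circular-buffer layout, the sizes, and all names are assumptions, and clearing the rest of the pipeline is not modelled.

```c
#include <stdbool.h>

#define ROB_SIZE 6
#define N_REGS   8

struct rob_entry { bool valid, R, E; int dreg; int result; };

struct rob {
    struct rob_entry entry[ROB_SIZE];
    int head;                     /* oldest instruction in program order */
};

enum commit_status { COMMITTED, WAITING, EXCEPTION };

/* Tries to commit the entry at the head of the reorder buffer. */
enum commit_status commit_head(struct rob *rob, int regs[N_REGS])
{
    struct rob_entry *h = &rob->entry[rob->head];

    if (!h->valid || !h->R)
        return WAITING;                       /* result not obtained yet */

    if (h->E) {                               /* exception: clear the buffer */
        for (int k = 0; k < ROB_SIZE; k++)
            rob->entry[k].valid = false;
        return EXCEPTION;
    }

    regs[h->dreg] = h->result;                /* update state in order */
    h->valid = false;                         /* remove the entry */
    rob->head = (rob->head + 1) % ROB_SIZE;
    return COMMITTED;
}
```

Because only the head entry is ever committed, a younger instruction that finished early cannot change a register before all older instructions have committed, which is what makes the exception precise.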

Contents of reorder buffer

Entry layout: PC | R | dreg | dtag | E | result

• PC: instruction address
• R: ready flag
• dreg: register number of destination
• dtag: operand tag of destination
• E: exception flag
• result: result

Operand bypass and supply of source operand tag

• Tomasulo: operand values are obtained from the registers, which hold the latest values.
• Reorder buffer: the latest values are stored in the reorder buffer (not in the registers).

Procedure of obtaining operands:
• Check the dependency on instructions decoded concurrently. If there is a dependency, stag becomes the dtag of the instruction depended on.
• Otherwise, the reorder buffer is searched by source register number to obtain the value (when R=1) or the tag (when R=0). If the reorder buffer has neither a value nor a tag corresponding to the register number, the value is obtained from the registers.

An example of reorder buffer

The superscalar processor used in this example has the following 6-stage pipeline, and the number of ways is 2.
• IF: fetches 2 instructions.
• ID: decodes, allocates tags, and dispatches.
• RS: waits for operands until an instruction becomes ready.
• EX: executes an instruction.
• WB: writes results to the reorder buffer.
• RT: writes results to registers.

A code used in the example

     Address of instruction
i1:  0x40:  r1 = load A (r0)
i2:  0x44:  r2 = r1 + r3
i3:  0x48:  r2 = r2 + 16
i4:  0x4C:  r5 = load 0 (r1)
i5:  0x50:  r1 = r1 + 1
i6:  0x54:  r2 = load 0 (r2)

Cycle 0

Reservation stations
op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
(empty)

State of instructions
Instruction            Stage
i1: r1=load A(r0)
i2: r2=r1+r3
i3: r2=r2+16
i4: r5=load 0(r1)
i5: r1=r1+1
i6: r2=load 0(r2)

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
H/T      20
         21
         22
         23
         24
         25

Cycle 1

Reservation stations
op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
(empty)

State of instructions
Instruction            Stage
i1: r1=load A(r0)      IF
i2: r2=r1+r3           IF
i3: r2=r2+16
i4: r5=load 0(r1)
i5: r1=r1+1
i6: r2=load 0(r2)

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
H/T      20
         21
         22
         23
         24
         25

Cycle 2

Reservation stations
op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
load  X  20    X       1  X     A       1  X     0
add   X  21    X       0  20    X       1  X     7

State of instructions
Instruction            Stage
i1: r1=load A(r0)      ID
i2: r2=r1+r3           ID
i3: r2=r2+16           IF
i4: r5=load 0(r1)      IF
i5: r1=r1+1
i6: r2=load 0(r2)

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  0  1     20    X  X
         21     44  0  2     21    X  X
Tail     22
         23
         24
         25

Cycle 3

Reservation stations
op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
load  0  20    15      1  X     A       1  X     0
add   X  21    X       0  20    X       1  X     7
add   X  22    X       0  21    X       1  X     16
load  X  23    X       1  X     0       0  20    X

State of instructions
Instruction            Stage
i1: r1=load A(r0)      EX
i2: r2=r1+r3           RS
i3: r2=r2+16           ID
i4: r5=load 0(r1)      ID
i5: r1=r1+1            IF
i6: r2=load 0(r2)      IF

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  0  1     20    X  X
         21     44  0  2     21    X  X
         22     48  0  2     22    X  X
         23     4C  0  5     23    X  X
Tail     24
         25

Cycle 4

Reservation stations
op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
load  0  20    15      1  X     A       1  X     0
add   0  21    22      1  20    15      1  X     7
add   X  22    X       0  21    X       1  X     16
load  1  23    ?       1  X     0       1  20    15
add   X  24    X       1  X     15      1  X     1
load  X  25    X       1  X     0       0  22    X

State of instructions
Instruction            Stage
i1: r1=load A(r0)      WB
i2: r2=r1+r3           EX
i3: r2=r2+16           RS
i4: r5=load 0(r1)      EX
i5: r1=r1+1            ID
i6: r2=load 0(r2)      ID

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
Head     20     40  1  1     20    0  15
         21     44  0  2     21    X  X
         22     48  0  2     22    X  X
         23     4C  0  5     23    X  X
         24     50  0  1     24    X  X
         25     54  0  2     25    X  X
Tail     (after entry 25)

Cycle 5

Reservation stations
op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
add   0  21    22      1  20    15      1  X     7
add   0  22    38      1  21    22      1  X     16
load  1  23    ?       1  X     0       1  20    15
add   0  24    16      1  X     15      1  X     1
load  X  25    X       1  X     0       0  22    X

State of instructions
Instruction            Stage
i1: r1=load A(r0)      RT
i2: r2=r1+r3           WB
i3: r2=r2+16           EX
i4: r5=load 0(r1)      WB
i5: r1=r1+1            EX
i6: r2=load 0(r2)      RS

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
         20     40  1  1     20    0  15
Head     21     44  1  2     21    0  22
         22     48  0  2     22    X  X
         23     4C  1  5     23    1  ?
         24     50  0  1     24    X  X
         25     54  0  2     25    X  X
Tail     (after entry 25)

Cycle 6

Reservation stations
op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
add   0  22    38      1  21    22      1  X     16
add   0  24    16      1  X     15      1  X     1
load  0  25    4       1  X     0       1  22    38

State of instructions
Instruction            Stage
i1: r1=load A(r0)
i2: r2=r1+r3           RT
i3: r2=r2+16           WB
i4: r5=load 0(r1)      RT
i5: r1=r1+1            WB
i6: r2=load 0(r2)      EX

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
         20     40  1  1     20    0  15
         21     44  1  2     21    0  22
Head     22     48  1  2     22    0  38
         23     4C  1  5     23    1  ?
         24     50  1  1     24    0  16
         25     54  0  2     25    X  X
Tail     (after entry 25)

Cycle 7

Reservation stations
op    Destination      Source 1         Source 2
      E  dtag  val     R  stag  val     R  stag  val
load  0  25    4       1  X     0       1  22    38

State of instructions
Instruction            Stage
i1: r1=load A(r0)
i2: r2=r1+r3
i3: r2=r2+16           RT
i4: r5=load 0(r1)      RT
i5: r1=r1+1            RT
i6: r2=load 0(r2)      WB

Reorder buffer
pointer  entry  PC  R  dreg  dtag  E  result
         20     40  1  1     20    0  15
         21     44  1  2     21    0  22
         22     48  1  2     22    0  38
H/T      23     4C  1  5     23    1  ?
         24     50  1  1     24    0  16
         25     54  0  2     25    X  X

Exception is detected.

VLIW (Very Long Instruction Word)

In a VLIW processor, the compiler extracts the parallelism in the code. Therefore, the special hardware support used in a superscalar processor becomes unnecessary.
• Superscalar: dynamic scheduling by hardware support
• VLIW: static scheduling by compiler

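Overview of VLIW compiler

[Figure: comparison of the two flows. Superscalar: the compiler performs code generation (from main(){ ・・・ } to add, sub, ・・・), and the processor performs scheduling and execution. VLIW: the compiler performs both code generation and scheduling (add sub load add mul load ・・・), and the processor performs only execution.]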

VLIW code

Sequential code

i1: r3=r4+1
i2: r1=load(r2)
i3: r1=r1<<r3
i4: r5=r2+r6
i5: beq r5,L

VLIW code

ALU             ALU             MEM               Branch
i1: r3=r4+1     i4: r5=r2+r6    i2: r1=load(r2)   nop
i3: r1=r1<<r3   nop             nop               i5: beq r5,L

Hardware organization of VLIW

[Figure: block diagram with the instruction cache, the ALU, ALU, MEM, and Branch units, the registers, and the data cache]

VLIW vs Superscalar

                       Superscalar   VLIW
Hardware size          Large         Small
Hardware complexity    Large         Small
Scheduling algorithm   Poor          Rich
Instruction window     Small         Large
Binary compatibility   Compatible    Not compatible

Dynamic vs Static scheduling

Sample code

i1: r1=load A
i2: r2=load(r1)
i3: r3=load B
i4: r4=r3<<r2
i5: r5=r4+1
i6: r6=r2+r5

[Figure: data dependency graph of the code over i1 to i6]

Dynamic scheduling

Cycle  ALU  MEM
0           i1
1           i2
2           i3
3      i4
4      i5
5      i6

Optimal scheduling

Cycle  ALU  MEM
0           i3
1      i4   i1
2      i5   i2
3      i6

Advantage of dynamic scheduling

• Scheduling based on information that can only be obtained at run time. For example, a cache miss can be concealed.
• Scheduling based on accurate memory dependences. Data addresses that can be obtained only at run time improve scheduling performance.

Taxonomy of scheduling algorithm

• Local scheduling / Global scheduling
• Cyclic scheduling / Acyclic scheduling
• Trace-based scheduling / DAG-based (directed acyclic graph) scheduling

VLIW-based commercial processors

• Transmeta Crusoe: aimed at mobile computing
• Texas Instruments TMS320C6x series: embedded applications
• Intel Itanium

Parallel operation by SIMD

What is SIMD?: SIMD (Single Instruction Multiple Data) means that the same operation is applied to several operands.

Ex: Addition

int i;
int a[4]={1,2,3,4};
int b[4]={5,6,7,8};
int c[4];
for (i=0;i<4;i++){
    c[i]=a[i]+b[i];
}

Sequential operation: the four additions are executed one after another.

c[0]=a[0]+b[0]
c[1]=a[1]+b[1]
c[2]=a[2]+b[2]
c[3]=a[3]+b[3]

SIMD: a[0] to a[3] and b[0] to b[3] are added by a single operation, producing c[0] to c[3].

SIMD data types (Cell/B.E.)

Type                         Contents
vector unsigned char         16 unsigned 8-bit values
vector signed char           16 signed 8-bit values
vector unsigned short        8 unsigned 16-bit values
vector signed short          8 signed 16-bit values
vector unsigned int          4 unsigned 32-bit values
vector signed int            4 signed 32-bit values
vector unsigned long long    2 unsigned 64-bit values
vector signed long long      2 signed 64-bit values
vector float                 4 32-bit floating-point values
vector double                2 64-bit double (floating-point) values

Allocation of vector values

Vector values are allocated to memory in the big-endian style as shown in the following figure.

*This figure is adapted from cell.fixstars.com

How to access vector type via normal pointer

vector signed int va = (vector signed int) { 1, 2, 3, 4 };
int *a = (int *) &va;

*This figure is adapted from cell.fixstars.com

How to access a normal array from vector type

int a[8] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8 };
vector signed int *va = (vector signed int *) a;

*This figure is adapted from cell.fixstars.com

__attribute__((aligned(16))) forces scalar data to be 16 byte-aligned

SIMD operation on PPE

int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
int c[4] __attribute__((aligned(16)));
vector signed int *va = (vector signed int *) a;
vector signed int *vb = (vector signed int *) b;
vector signed int *vc = (vector signed int *) c;
*vc = vec_add(*va, *vb);

[Figure: a[0] to a[3] and b[0] to b[3] are added element by element in one operation, producing c[0] to c[3]]

vec_add is a SIMD function provided by VMX (Vector Multimedia Extension) proposed by IBM and Motorola.

Entire code for vector addition

#include <stdio.h>
#include <altivec.h>

int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
int c[4] __attribute__((aligned(16)));

int main(int argc, char **argv)
{
    vector signed int *va = (vector signed int *) a;
    vector signed int *vb = (vector signed int *) b;
    vector signed int *vc = (vector signed int *) c;

    *vc = vec_add(*va, *vb);

    printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);
    return 0;
}

A part of VMX functions

Category               Function           Meaning
Arithmetic operation   vec_add(a,b)       a+b
                       vec_sub(a,b)       a-b
                       vec_madd(a,b,c)    a*b+c
Logical operation      vec_and(a,b)       Logical and
                       vec_or(a,b)        Logical or
Bit operation          vec_perm(a,b,c)    creating a new vector from a[i] and b[i] based on c[i]
                       vec_sel(a,b,c)     selecting a[i] or b[i] based on c[i]
Branch                 vec_cmpeq(a,b)     a[i]==b[i]
                       vec_cmpgt(a,b)     a[i]>b[i]
Type conversion        vec_ctf(a,b)       (float)a[i]/(2^b)
                       vec_ctu(a,b)       (unsigned int)(a[i]*(2^b))
Generating constant    vec_splat(a,b)     vector with every element set to a[b]
                       vec_splat_s32(a)   vector with every element set to the signed constant a

How to create dense vector data

In general, vector data is not densely stored. Therefore, dense vector data must be created before a vector operation.

vc = vec_perm(va, vb, vpat);

*This figure is adapted from cell.fixstars.com

Ex of vec_perm: Transpose

*These figures are adapted from cell.fixstars.com

Branch on SIMD

*These figures are adapted from cell.fixstars.com

Procedure of SIMD Branch

*These figures are adapted from cell.fixstars.com

Detail of SIMD Branch

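[Figures: vec_cmpgt() creates a selection pattern from the comparison of each element; vec_sel() uses the pattern to select each element of the result]

*These figures are adapted from cell.fixstars.com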

Ex of SIMD Branch

int a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
int b[16] = { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
int c[16];
int i;

/* Scalar version: c[i] = |a[i] - b[i]| using a branch per element. */
for (i = 0; i < 16; i++) {
    if (a[i] > b[i]) {
        c[i] = a[i] - b[i];
    } else {
        c[i] = b[i] - a[i];
    }
}

Ex of SIMD Branch

int a[16] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8,
                                           9, 10, 11, 12, 13, 14, 15, 16 };
int b[16] __attribute__((aligned(16))) = { 16, 15, 14, 13, 12, 11, 10, 9,
                                           8, 7, 6, 5, 4, 3, 2, 1 };
int c[16] __attribute__((aligned(16)));

vector signed int *va = (vector signed int *) a;
vector signed int *vb = (vector signed int *) b;
vector signed int *vc = (vector signed int *) c;
vector signed int vc_true, vc_false;
vector unsigned int vpat;
int i;

/* Each iteration processes 4 elements: both results are computed,
   then vec_sel picks one per element according to the pattern. */
for (i = 0; i < 4; i++) {
    vpat = vec_cmpgt(va[i], vb[i]);
    vc_true = vec_sub(va[i], vb[i]);
    vc_false = vec_sub(vb[i], va[i]);
    vc[i] = vec_sel(vc_false, vc_true, vpat);
}
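The compare-and-select idea above can also be tried without VMX hardware. The following is a scalar C model of one 32-bit lane, written only to illustrate the semantics; cmpgt_mask, sel, and abs_diff are hypothetical helper names and are not VMX functions.

```c
#include <stdint.h>

/* Scalar model of one lane of vec_cmpgt:
 * all ones if a > b, otherwise all zeros. */
static uint32_t cmpgt_mask(int32_t a, int32_t b)
{
    return (a > b) ? 0xFFFFFFFFu : 0u;
}

/* Scalar model of one lane of vec_sel(x, y, mask): takes the bits
 * of y where the mask bit is 1 and the bits of x where it is 0. */
static int32_t sel(int32_t x, int32_t y, uint32_t mask)
{
    return (int32_t)(((uint32_t)x & ~mask) | ((uint32_t)y & mask));
}

/* Branch-free |a[i] - b[i]| for n elements, one lane at a time. */
void abs_diff(const int32_t *a, const int32_t *b, int32_t *c, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t pat     = cmpgt_mask(a[i], b[i]);
        int32_t  c_true  = a[i] - b[i];
        int32_t  c_false = b[i] - a[i];
        c[i] = sel(c_false, c_true, pat);
    }
}
```

Both candidate results are always computed and the mask only decides which one is kept, so no conditional branch is needed inside the loop. The SIMD version does the same for four lanes at once.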