Tomasulo Algorithm

ASSIGNMENT # 1

Subject

“COMPUTER ARCHITECTURE”

Teacher

“Ma’am Aden Iqbal”

“Farwa Abdul Hannan”

(12-CS-13)

Monday, 28 March, 2016

NFC – INSTITUTE OF ENGINEERING AND FERTILIZER RESEARCH, FSD

Tomasulo Algorithm

1) Consider the code sequence shown below.

LD F6, 12(R2)

LD F2, 16(R3)

ADDD F0, F2, F4

DIVD F10, F0, F6

SUBD F8, F6, F2

ADDI R2, R2, 8

ADDI R3, R3, 16

ADDD F6, F8, F2

a) Identify all WAR, WAW, and RAW dependencies in the instruction stream.

WAR                  WAW                  RAW
SUBD F8, F6, F2      LD F6, 12(R2)        LD F2, 16(R3)
ADDD F6, F8, F2      ADDD F6, F8, F2      ADDD F0, F2, F4

NIL                  NIL                  ADDD F0, F2, F4
                                          DIVD F10, F0, F6

NIL                  NIL                  LD F6, 12(R2)
                                          SUBD F8, F6, F2
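The dependency table above can be cross-checked mechanically. The following is a minimal sketch, not part of the original assignment: the names `PROGRAM`, `last_writer` and `find_dependencies` are hypothetical, and it assumes that the base register of a load counts as a source operand. Under those assumptions it may report pairs beyond the ones tabulated (for example, pairs through the integer registers R2 and R3).

```python
# Question 1 sequence as (opcode, destination, sources).
# Assumption: the base register of a load is treated as a source.
PROGRAM = [
    ("LD",   "F6",  ["R2"]),
    ("LD",   "F2",  ["R3"]),
    ("ADDD", "F0",  ["F2", "F4"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("SUBD", "F8",  ["F6", "F2"]),
    ("ADDI", "R2",  ["R2"]),
    ("ADDI", "R3",  ["R3"]),
    ("ADDD", "F6",  ["F8", "F2"]),
]


def last_writer(program, reg, before):
    """Index of the latest instruction before `before` that writes `reg`, or None."""
    for i in range(before - 1, -1, -1):
        if program[i][1] == reg:
            return i
    return None


def find_dependencies(program):
    """Return RAW/WAR/WAW pairs as (earlier_index, later_index, register)."""
    deps = {"RAW": [], "WAR": [], "WAW": []}
    for j, (_, dest, srcs) in enumerate(program):
        # RAW: each source depends on the most recent earlier writer.
        for reg in dict.fromkeys(srcs):
            i = last_writer(program, reg, j)
            if i is not None:
                deps["RAW"].append((i, j, reg))
        # WAW: this write follows the most recent earlier write of the same register.
        prev = last_writer(program, dest, j)
        if prev is not None:
            deps["WAW"].append((prev, j, dest))
        # WAR: earlier readers of `dest` since that previous write.
        start = 0 if prev is None else prev + 1
        for i in range(start, j):
            if dest in program[i][2]:
                deps["WAR"].append((i, j, dest))
    return deps


if __name__ == "__main__":
    for kind, pairs in find_dependencies(PROGRAM).items():
        for i, j, reg in pairs:
            print(f"{kind}: {PROGRAM[i][0]} (#{i + 1}) -> {PROGRAM[j][0]} (#{j + 1}) on {reg}")
```

Indices in the returned triples are zero-based positions in the sequence; the printout converts them to one-based instruction numbers.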

b) Draw a pipeline diagram of how instructions would issue in a machine using Tomasulo's algorithm as discussed in class. Assume that the FP Add unit has 4 EX phases, the FP Multiply unit has 7 EX phases, and divide has 24 EX phases. FP Adds, Subtracts, and Multiplies are fully pipelined, while divide operations are NOT pipelined.

Cycles 1, 2, 3

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3               Load1   Yes   12+R2
LD   F2   16+R3       2                      Load2   Yes   16+R3
ADDD F0   F2   F4     3                      Load3   No
DIVD F10  F0   F6
SUBD F8   F6   F2
ADDI R2   R2   8
ADDI R3   R3   16
ADDD F6   F8   F2

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD1   Yes   ADDD          R(F4)   Load2
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
3      FU  ADD1    Load2       Load1
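The way the ADD1 entry and the register result status are filled in at cycle 3 can be illustrated with a small sketch of the issue step. This is a simplified model under stated assumptions, not the full algorithm: the function `issue` and the dictionaries `stations` and `reg_status` are hypothetical names, and the state after the two loads is entered by hand.

```python
def issue(op, dest, src_j, src_k, station, stations, reg_status):
    """Place one instruction in a reservation station and rename its destination."""
    entry = {"busy": True, "op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        producer = reg_status.get(src)
        if producer is None:
            # No pending writer: the operand can be read from the register file.
            entry[v_field] = f"R({src})"
        else:
            # A pending writer exists: record its tag and wait for the broadcast.
            entry[q_field] = producer
    stations[station] = entry
    # Later readers of `dest` must now wait for this station.
    reg_status[dest] = station


# State after LD F6 (cycle 1) and LD F2 (cycle 2) have issued.
stations = {}
reg_status = {"F6": "Load1", "F2": "Load2"}

# Cycle 3: ADDD F0, F2, F4 issues to ADD1.
issue("ADDD", "F0", "F2", "F4", "ADD1", stations, reg_status)

print(stations["ADD1"])
print(reg_status)
```

With this input the ADD1 entry ends up with Vk = R(F4) and Qj = Load2, and F0 is marked as produced by ADD1, which agrees with the cycle 3 tables.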

Cycle 4

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4               Load2   Yes   16+R3
DIVD F10  F0   F6     4
SUBD F8   F6   F2
ADDI R2   R2   8
ADDI R3   R3   16
ADDD F6   F8   F2

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD1   Yes   ADDD          R(F4)   Load2
      ADD2   No
      ADD3   No
      MULT1  Yes   DIVD          M(A1)   ADD1
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
4      FU  ADD1    Load2       M(A1)          MULT1

Cycle 5

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5
ADDI R2   R2   8
ADDI R3   R3   16
ADDD F6   F8   F2

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
4     ADD1   Yes   ADDD  M(A2)   R(F4)
4     ADD2   Yes   SUBD  M(A1)   M(A2)
      ADD3   No
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
5      FU  ADD1    M(A2)       M(A1)   ADD2   MULT1
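The change between the cycle 4 and cycle 5 tables (ADD1 swapping its Load2 tag for the value M(A2)) is the write-result broadcast on the CDB. A minimal sketch of that step follows; `broadcast`, `stations`, `reg_status` and `reg_file` are hypothetical names, and the starting state is copied by hand from the cycle 4 snapshot.

```python
def broadcast(tag, value, stations, reg_status, reg_file):
    """Deliver `value`, produced by the unit named `tag`, to every waiting consumer."""
    # Reservation stations waiting on `tag` capture the value and clear the tag.
    for entry in stations.values():
        if entry["Qj"] == tag:
            entry["Vj"], entry["Qj"] = value, None
        if entry["Qk"] == tag:
            entry["Vk"], entry["Qk"] = value, None
    # Registers still renamed to `tag` receive the value and stop waiting.
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            reg_file[reg] = value
            reg_status[reg] = None


# State taken from the cycle 4 snapshot.
stations = {
    "ADD1":  {"op": "ADDD", "Vj": None, "Vk": "R(F4)", "Qj": "Load2", "Qk": None},
    "MULT1": {"op": "DIVD", "Vj": None, "Vk": "M(A1)", "Qj": "ADD1",  "Qk": None},
}
reg_status = {"F0": "ADD1", "F2": "Load2", "F10": "MULT1"}
reg_file = {"F6": "M(A1)"}

# Cycle 5: Load2 writes its result M(A2) on the CDB.
broadcast("Load2", "M(A2)", stations, reg_status, reg_file)

print(stations["ADD1"])
print(reg_file)
```

MULT1 is untouched by this broadcast because it waits on ADD1, not on Load2, which is why DIVD still cannot start in cycle 5.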

Cycle 6

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5
ADDI R2   R2   8      6      6
ADDI R3   R3   16
ADDD F6   F8   F2

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD3   No
      MULT2  No

Cycle 7

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7
ADDD F6   F8   F2

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD3   No
      MULT2  No

Cycle 8

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD3   Yes   ADDD          M(A2)   ADD2
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
8      FU  ADD1    M(A2)       ADD3    ADD2   MULT1

Cycle 9

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
ADDD F0   F2   F4     3      9               Load3   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5      9
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      MULT2  No

Cycle 10

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
ADDD F0   F2   F4     3      9        10     Load3   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5      9
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD1   No
      ADD2   Yes   SUBD  M(A1)   M(A2)
4     ADD3   Yes   ADDD  M-M     M(A2)
24    MULT1  Yes   DIVD  M+R4    M(A1)
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
10     FU  M+R4    M(A2)       ADD3    ADD2   MULT1

Cycle 11

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5      9        11
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD1   No
      ADD2   No
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
11     FU  M+R4    M(A2)       ADD3    M-M    MULT1

Cycle 14

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5      8        9
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8      14

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD1   No
      ADD2   No
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
14     FU  M+R4    M(A2)       ADD3    M-M    MULT1

Cycle 15

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5      8
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8      14       15

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
15     FU  M+R4    M(A2)       M-M+M   M-M    MULT1

Cycle 35

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4      35
SUBD F8   F6   F2     5      8        9
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8      14       15

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
35     FU  M+R4    M(A2)       M-M+M   M-M    MULT1

Cycle 36

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4      35       36
SUBD F8   F6   F2     5      8        9
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8      14       15

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register result status

Clock      F0      F2      F4  F6      F8     F10        F12  F14
36     FU  M+R4    M(A2)       M-M+M   M-M    (M+R4)/M

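c) Tomasulo’s algorithm has a disadvantage: only one result can complete per clock, per CDB. Using the same latencies as above, find a code sequence of no more than 12 instructions where Tomasulo’s algorithm must stall due to CDB contention. Indicate where this occurs in your sequence.

It occurs in the sequence from part (b). ADDD F0, F2, F4 and SUBD F8, F6, F2 both finish executing in cycle 9, so they contend for the single CDB: only one of them can write its result in cycle 10, and the other must wait one more cycle. The state at cycle 9 is shown below.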

Cycle 9

Instruction status and load buffers

Instruction           Issue  Execute  Write  Buffer  Busy  Address
LD   F6   12+R2       1      3        4      Load1   No
LD   F2   16+R3       2      4        5      Load2   No
DIVD F10  F0   F6     4
SUBD F8   F6   F2     5      9
ADDI R2   R2   8      6      6        7
ADDI R3   R3   16     7      7        8
ADDD F6   F8   F2     8

Reservation stations

Time  Name   Busy  Op    Vj      Vk      Qj     Qk
      MULT2  No
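The effect of CDB contention on write-back cycles can be sketched as a simple arbitration: each finished instruction takes the first free CDB slot after its execution completes. This is an illustrative model only. The function name `schedule_cdb` is hypothetical, and the oldest-instruction-first tie-break is an assumption, since the assignment does not state the arbitration policy.

```python
def schedule_cdb(finished):
    """Assign write-result cycles on a single CDB.

    `finished` is a list of (name, exec_complete_cycle) pairs in program order.
    Returns {name: write_cycle} with at most one broadcast per cycle.
    Assumption: when two instructions finish together, the older one goes first.
    """
    write_cycle = {}
    taken = set()
    # sorted() is stable, so equal completion cycles keep program order.
    for name, done in sorted(finished, key=lambda item: item[1]):
        cycle = done + 1            # earliest possible write is the next cycle
        while cycle in taken:       # CDB already used: stall one more cycle
            cycle += 1
        taken.add(cycle)
        write_cycle[name] = cycle
    return write_cycle


# ADDD F0 and SUBD F8 both complete execution in cycle 9 (part b, cycle 9 table).
print(schedule_cdb([("ADDD F0", 9), ("SUBD F8", 9)]))
# prints {'ADDD F0': 10, 'SUBD F8': 11}
```

Under this policy ADDD F0 writes in cycle 10 and SUBD F8 is pushed to cycle 11, matching the write cycles recorded in the cycle 10 and cycle 11 tables of part (b).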

2) Evaluate the performance of several implementation options for the following workload:

L.D    F3, R4(R6)      # F3 = MEM[R4+R6]
MUL.D  F4, F3, F2      # F4 = F3*F2
S.D    F4, R3(R6)      # MEM[R3+R6] = F4
A.D    F4, F3, F3      # F4 = F3+F3
A.D    F10, F10, F4    # F10 = F10 + F4
DSUBUI R6, R6, #4      # R6 = R6 - 4
BNEQ   R6, loop        # if R6 != 0, jump to loop

Only one instruction can complete per cycle, per CDB.

Assume the processor implements Tomasulo’s algorithm (with reservation stations and no reorder buffer), as well as the following:

A single instruction is issued per cycle.

No functional units are pipelined.

There is no forwarding between or within functional units; results are communicated via the single CDB.

The memory execution unit takes three cycles for a load and two cycles for a store. Loads and stores have separate reservation stations, but only one load or store can execute at any one time, since they share the memory port.

Issue and write-result stages require one cycle each. Address generation is performed separately from the ALU, in the load and store buffers.

Branches execute in the integer unit, and instructions issued after a branch wait until the branch has been resolved and broadcast on the CDB.

Functional Unit Queues and Latencies:

Functional Unit   # of Functional Units   Latency (cycles in EX)   # of Reservation Stations
Memory – Load     1                       3                        2
Memory – Store    1                       2                        2
Integer           1                       1                        5
FP – Add          1                       4                        3
FP – Multiply     1                       2                        2

a) Perform a simulation of the first two iterations for a single-issue architecture. Create the table below.

Iteration 1

Instruction   j     k     Issue  Execute  Cycle
L.D    F3     R4    R6    1      2        6
MUL.D  F4     F3    F2    2      6        17
S.D    F4     R3    R6    3      17       21
A.D    F4     F3    F3    4      21       26
A.D    F10    F10   F4    5      26       31
DSUBUI R6     R6    #4    6      31       36
BNEQ   R6     loop        7      37

Iteration 2

Instruction   j     k     Issue  Execute  Cycle
L.D    F3     R4    R6    8      38       42
MUL.D  F4     F3    F2    9      42       53
S.D    F4     R3    R6    10     53       56
A.D    F4     F3    F3    11     56       61
A.D    F10    F10   F4    12     61       66
DSUBUI R6     R6    #4    13     66       71
BNEQ   R6     loop        14     71

b) What is the performance bottleneck?

A bottleneck is a delay in the transmission of data through the circuits of a computer's microprocessor or over a TCP/IP network. It typically occurs when a system's bandwidth cannot support the amount of information being relayed at the speed at which it is being processed.

c) What is the “steady state” of this loop – that is how many cycles will an average

loop iteration take if loop startup and shutdown effects are ignored?

The steady state of the loop occurs when R6 equals zero: at that point the loop stops iterating and remains in a steady state.

d) Where will the first issue stall occur?

The first stall occurs at the second instruction, MUL.D F4, F3, F2, because its execution depends on F3, which is produced by the preceding L.D. This is a RAW hazard, so the MUL.D must wait until the load writes its result.
