Post on 15-Feb-2017
ASSIGNMENT # 1
Subject
“COMPUTER ARCHITECTURE”
Teacher
“Ma’am Aden Iqbal”
By
“Farwa Abdul Hannan”
(12-CS-13)
Monday, 28 March, 2016
NFC – INSTITUTE OF ENGINEERING AND
FERTILIZER RESEARCH, FSD
Tomasulo Algorithm
1) Consider the code sequence shown below.
LD F6, 12(R2)
LD F2, 16(R3)
ADDD F0, F2, F4
DIVD F10, F0, F6
SUBD F8, F6, F2
ADDI R2, R2, 8
ADDI R3, R3, 16
ADDD F6, F8, F2
a) Identify all WAR, WAW, and RAW dependencies in the instruction stream.
Type  Register  Earlier instruction  Later instruction
WAR   F6        SUBD F8, F6, F2      ADDD F6, F8, F2
WAW   F6        LD F6, 12(R2)        ADDD F6, F8, F2
RAW   F2        LD F2, 16(R3)        ADDD F0, F2, F4
RAW   F0        ADDD F0, F2, F4      DIVD F10, F0, F6
RAW   F6        LD F6, 12(R2)        SUBD F8, F6, F2
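As a cross-check on the table above, the pairwise definitions of the three hazard types can be applied mechanically. The sketch below is a simplification, not the method used in class: the `PROGRAM` encoding and the `find_dependencies` helper are assumptions made for illustration, and the check compares every pair of instructions without filtering pairs separated by an intervening write to the same register (no such case arises in this sequence). It may report pairs that the table does not list, for example the WAR on R2 between the first LD and the first ADDI.

```python
from itertools import combinations

# (text, destination register, source registers) for the sequence in question 1.
# This encoding is an illustrative assumption, not part of the assignment.
PROGRAM = [
    ("LD F6, 12(R2)",    "F6",  ["R2"]),
    ("LD F2, 16(R3)",    "F2",  ["R3"]),
    ("ADDD F0, F2, F4",  "F0",  ["F2", "F4"]),
    ("DIVD F10, F0, F6", "F10", ["F0", "F6"]),
    ("SUBD F8, F6, F2",  "F8",  ["F6", "F2"]),
    ("ADDI R2, R2, 8",   "R2",  ["R2"]),
    ("ADDI R3, R3, 16",  "R3",  ["R3"]),
    ("ADDD F6, F8, F2",  "F6",  ["F8", "F2"]),
]


def find_dependencies(program):
    """Return (kind, register, earlier_index, later_index) for each pair."""
    found = []
    for (i, (_, dst_i, src_i)), (j, (_, dst_j, src_j)) in combinations(
        enumerate(program), 2
    ):
        if dst_i in src_j:       # later instruction reads what the earlier writes
            found.append(("RAW", dst_i, i, j))
        if dst_j in src_i:       # later instruction writes what the earlier reads
            found.append(("WAR", dst_j, i, j))
        if dst_i == dst_j:       # both instructions write the same register
            found.append(("WAW", dst_i, i, j))
    return found


deps = find_dependencies(PROGRAM)
for kind, reg, i, j in deps:
    print(f"{kind} on {reg}: {PROGRAM[i][0]}  ->  {PROGRAM[j][0]}")
```

Every pair listed in the table appears in the output; the extra pairs are candidates to review against the definitions rather than confirmed additions to the answer.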
b) Draw a pipeline diagram of how instructions would issue in a machine using
Tomasulo’s algorithm as discussed in class. Assume that the FP Add unit has 4
EX phases, the FP Multiply unit has 7 EX phases, and divide has 24 EX phases.
FP Adds, Subtracts, and Multiplies are fully-pipelined, while divide operations
are NOT pipelined.
Cycle 1, 2, 3
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3               Load1  Yes   12+R2
LD F2        16+  R3   2                      Load2  Yes   16+R3
ADDD F0      F2   F4   3                      Load3  No
DIVD F10     F0   F6
SUBD F8      F6   F2
ADDI R2      R2   8
ADDI R3      R3   16
ADDD F6      F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   Yes   ADDD         R(F4)  Load2
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
3      FU  ADD1   Load2         Load1
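The ADD1 row above follows from the issue rule: a source register that is still waiting on a producer is recorded by tag in Qj/Qk, while a ready source is copied into Vj/Vk. The following is a minimal sketch of that rule only; the `issue` function, the dictionary layout, and the `R(...)` value naming are assumptions made for illustration, and structural checks (free station, issue order) are left out.

```python
def issue(instr, station, stations, reg_status):
    """Place one instruction in a reservation station (simplified sketch).

    instr is (op, dest, src_j, src_k). reg_status maps a register name to the
    station or load buffer that will produce it; absent means the value is ready.
    """
    op, dest, src_j, src_k = instr
    entry = {"Busy": True, "Op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        producer = reg_status.get(src)
        if producer is None:
            entry[v_field] = f"R({src})"   # operand is ready: read the register
        else:
            entry[q_field] = producer      # operand pending: remember the tag
    stations[station] = entry
    reg_status[dest] = station             # rename the destination to this station


# State after the two loads have issued (cycles 1 and 2).
stations = {}
reg_status = {"F6": "Load1", "F2": "Load2"}

# Cycle 3: ADDD F0, F2, F4 issues into ADD1.
issue(("ADDD", "F0", "F2", "F4"), "ADD1", stations, reg_status)
print(stations["ADD1"])
print(reg_status)
```

Under these assumptions ADD1 ends up with Vk = R(F4) and Qj = Load2, and F0 is marked as produced by ADD1, matching the cycle 3 tables.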
Cycle 4
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4               Load2  Yes   16+R3
ADDD F0      F2   F4   3                      Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2
ADDI R2      R2   8
ADDI R3      R3   16
ADDD F6      F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   Yes   ADDD         R(F4)  Load2
      ADD2   No
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
4      FU  ADD1   Load2         M(A1)         MULT1
Cycle 5
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3                      Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5
ADDI R2      R2   8
ADDI R3      R3   16
ADDD F6      F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
4     ADD1   Yes   ADDD  M(A2)  R(F4)
4     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
5      FU  ADD1   M(A2)         M(A1)  ADD2   MULT1
Cycle 6
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3                      Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5
ADDI R2      R2   8    6      6
ADDI R3      R3   16
ADDD F6      F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
3     ADD1   Yes   ADDD  M(A2)  R(F4)
3     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
6      FU  ADD1   M(A2)         M(A1)  ADD2   MULT1
Cycle 7
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3                      Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7
ADDD F6      F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
2     ADD1   Yes   ADDD  M(A2)  R(F4)
2     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
7      FU  ADD1   M(A2)         M(A1)  ADD2   MULT1
Cycle 8
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3                      Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
1     ADD1   Yes   ADDD  M(A2)  R(F4)
1     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   Yes   ADDD         M(A2)  ADD2
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
8      FU  ADD1   M(A2)         ADD3   ADD2   MULT1
Cycle 9
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3      9               Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5      9
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
0     ADD1   Yes   ADDD  M(A2)  R(F4)
0     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   Yes   ADDD         M(A2)  ADD2
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
9      FU  ADD1   M(A2)         ADD3   ADD2   MULT1
Cycle 10
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3      9        10     Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5      9
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   Yes   SUBD  M(A1)  M(A2)
4     ADD3   Yes   ADDD  M-M    M(A2)
24    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
10     FU  M+R4   M(A2)         ADD3   ADD2   MULT1
Cycle 11
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3      9        10     Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5      9        11
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
3     ADD3   Yes   ADDD  M-M    M(A2)
23    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
11     FU  M+R4   M(A2)         ADD3   M-M    MULT1
Cycle 14
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3      10       11     Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5      8        9
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8      14

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
0     ADD3   Yes   ADDD  M-M    M(A2)
20    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
14     FU  M+R4   M(A2)         ADD3   M-M    MULT1
Cycle 15
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3      10              Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5      8
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
20    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
15     FU  M+R4   M(A2)         M-M+M  M-M    MULT1
Cycle 35
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3      10       11     Load3  No
DIVD F10     F0   F6   4      35
SUBD F8      F6   F2   5      8        9
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
0     MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
35     FU  M+R4   M(A2)         M-M+M  M-M    MULT1
Cycle 36
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3      10       11     Load3  No
DIVD F10     F0   F6   4      35       36
SUBD F8      F6   F2   5      8        9
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
36     FU  M+R4   M(A2)         M-M+M  M-M    (M+R4)/M
c) Tomasulo’s algorithm has a disadvantage. Only one result can complete per
clock, per CDB. Using the same latencies as above, find a code sequence of no
more than 12 instructions where Tomasulo’s algorithm must stall due to CDB
contention. Indicate where this occurs in your sequence.
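CDB contention occurs in cycle 9 of the sequence from part (b), shown again below:
ADDD F0, F2, F4 and SUBD F8, F6, F2 both finish executing in that cycle, but only one
of them can write its result in cycle 10, so the other must wait until cycle 11.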
Cycle 9
Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD F6        12+  R2   1      3        4      Load1  No
LD F2        16+  R3   2      4        5      Load2  No
ADDD F0      F2   F4   3      9               Load3  No
DIVD F10     F0   F6   4
SUBD F8      F6   F2   5      9
ADDI R2      R2   8    6      6        7
ADDI R3      R3   16   7      7        8
ADDD F6      F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
0     ADD1   Yes   ADDD  M(A2)  R(F4)
0     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   Yes   ADDD         M(A2)  ADD2
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0     F2     F4     F6     F8     F10    F12    F14
9      FU  ADD1   M(A2)         ADD3   ADD2   MULT1
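The effect of a single CDB can be sketched as a small arbitration step: each finished instruction asks to write in the cycle after it completes execution, and a request is pushed back a cycle whenever the bus is already taken. The `schedule_writes` helper below is a hypothetical illustration, and giving priority in program order is an assumption; the assignment does not say how ties are broken.

```python
def schedule_writes(exec_done):
    """Assign write-result cycles with at most one write per cycle.

    exec_done maps an instruction to the cycle in which it finishes executing,
    listed in program order. Earlier entries win ties (assumed priority rule).
    """
    taken = set()
    write_cycle = {}
    for name, done in exec_done.items():
        cycle = done + 1          # earliest possible write cycle
        while cycle in taken:     # CDB already used in that cycle: stall
            cycle += 1
        taken.add(cycle)
        write_cycle[name] = cycle
    return write_cycle


# Both FP add-unit operations finish executing in cycle 9.
writes = schedule_writes({"ADDD F0, F2, F4": 9, "SUBD F8, F6, F2": 9})
print(writes)
```

With these inputs ADDD writes in cycle 10 and SUBD is delayed to cycle 11, which agrees with the Write column in the cycle 10 and cycle 11 tables of part (b).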
2) Evaluate the performance of several implementation options for
the following workload:
LOOP:
L.D    F3, R4(R6)    # F3 = MEM[R4+R6]
MUL.D  F4, F3, F2    # F4 = F3 * F2
S.D    F4, R3(R6)    # MEM[R3+R6] = F4
A.D    F4, F3, F3    # F4 = F3 + F3
A.D    F10, F10, F4  # F10 = F10 + F4
DSUBUI R6, R6, #4    # R6 = R6 - 4
BNEQ   R6, LOOP      # if R6 != 0, jump to LOOP
Assume the processor implements Tomasulo’s algorithm (with reservation stations and no reorder
buffer), as well as the following:
A single instruction is issued per cycle.
Functional units are not pipelined.
There is no forwarding between or within functional units; results are communicated via the
single CDB.
The memory execution unit takes 3 cycles for a load and 2 cycles for a store. Loads and stores
have separate reservation stations, but only one load or store can execute at any one time
since they share the memory port.
The issue and write-result stages require one cycle each. Address generation is performed
separately from the ALU, in the load and store buffers.
Branches execute in the integer unit, and instructions issued after a branch wait until the
branch has been resolved and broadcast on the CDB.
Functional Unit Queues and Latencies:
Functional Unit   # of Functional Units   Latency (cycles in EX)   # of Reservation Stations
Memory – Load     1                       3                        2
Memory – Store    1                       2                        2
Integer           1                       1                        5
FP – Add          1                       4                        3
FP – Multiply     1                       2                        2
a) Perform a simulation of the first two iterations for a single issue architecture.
Create the table below
Iteration 1
Instruction  j    k    Issue Cycle  Execute Cycle  Write Cycle
L.D F3       R4   R6   1            2              6
MUL.D F4     F3   F2   2            6              17
S.D F4       R3   R6   3            17             21
A.D F4       F3   F3   4            21             26
A.D F10      F10  F4   5            26             31
DSUBUI R6    R6   #4   6            31             36
BNEQ R6      LOOP      7            37
Iteration 2
Instruction  j    k    Issue Cycle  Execute Cycle  Write Cycle
L.D F3       R4   R6   8            38             42
MUL.D F4     F3   F2   9            42             53
S.D F4       R3   R6   10           53             56
A.D F4       F3   F3   11           56             61
A.D F10      F10  F4   12           61             66
DSUBUI R6    R6   #4   13           66             71
BNEQ R6      LOOP      14           71
b) What is the performance bottleneck?
A performance bottleneck is the point at which a resource cannot keep up with the rate at
which work arrives, so everything behind it is delayed. In the tables above, each instruction
begins execution only in the cycle in which the previous instruction writes its result, so the
instructions of the loop execute one after another instead of overlapping.
c) What is the “steady state” of this loop – that is how many cycles will an average
loop iteration take if loop startup and shutdown effects are ignored?
The loop reaches its steady state when R6 becomes equal to zero: at that point the branch is
no longer taken, so the loop stops iterating and remains in a steady state.
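Since the question asks for a cycle count, one way to estimate it is to difference the Execute Cycle columns of the two iterations in part (a). The sketch below takes those table values as given without re-deriving them; the dictionary names are introduced here for illustration only.

```python
# Execute Cycle values copied from the part (a) tables (taken as given).
execute_iter1 = {"L.D": 2, "MUL.D": 6, "S.D": 17, "A.D F4": 21,
                 "A.D F10": 26, "DSUBUI": 31, "BNEQ": 37}
execute_iter2 = {"L.D": 38, "MUL.D": 42, "S.D": 53, "A.D F4": 56,
                 "A.D F10": 61, "DSUBUI": 66, "BNEQ": 71}

# Cycles between the same instruction in consecutive iterations.
gaps = {name: execute_iter2[name] - execute_iter1[name] for name in execute_iter1}
for name, gap in gaps.items():
    print(f"{name}: {gap} cycles between iterations")
```

The gaps range from 34 to 36 cycles depending on the instruction, so the tables as written point to an iteration time in that range rather than to a single figure.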
d) Where will the first issue stall occur?
The first stall occurs at the second instruction, MUL.D F4, F3, F2: it cannot begin execution
until the L.D has produced F3, so a RAW delay occurs.