Team9 presentation

Walk-time Address Adjustment For Improving The Accuracy of Dynamic Branch Prediction Team #9 Sindhuja Nandikonda Atrayee Bhadra Sharmila Kannan Under the guidance of Prof. Gita Alaghband

Transcript of Team9 presentation

Page 1: Team9 presentation

Walk-time Address Adjustment For Improving The Accuracy of Dynamic Branch Prediction

Team #9: Sindhuja Nandikonda, Atrayee Bhadra, Sharmila Kannan

Under the guidance of Prof. Gita Alaghband

Page 2: Team9 presentation

Agenda

• Introduction
• Motivation
• Walk Time Techniques
• Link Time
• Branch Interference
• NOP instructions
• Address adjustment Techniques
    • Constrained Address Adjustment
    • Relaxed Address Adjustment
    • Branch Classification
• Performance Metric and Analysis
• Conclusion
• Our Project

Page 3: Team9 presentation

Introduction

• Dynamic branch prediction can deliver accurate branch prediction without changes to the instruction set architecture.

• It has been an effective technique for boosting the performance of modern high performance microprocessors.

Limitation of hardware predictors: they have a limited number of 2-bit counters, which leads to branch interference.

Page 4: Team9 presentation

Motivation

• To reduce branch interference.
• Instead of designing more accurate predictors with large predictor tables, we can use the same branch predictors (with a limited number of 2-bit counters) combined with address adjustment techniques.
• This provides accuracy comparable to that of predictors with large predictor tables.

Page 5: Team9 presentation

Role Of A 2-bit counter in Branch Prediction

[Figure: the PC selects entry K in a cache of target addresses (HIT?) and in the pattern history table, in which each entry is a 2-bit counter (TAKEN?). The next fetch address is either PC + instruction size or the cached target address. Ref: Prof. Onur Mutlu lectures]

Page 6: Team9 presentation

Walk time techniques

• Walk-time techniques are software translation techniques in which the changes to the program are made at link time.
• Here, walk time is the same as link time.
• The changes are architecturally visible.

Page 7: Team9 presentation

Link Time

• Link time comes after compile time and before run time.
• It refers to the operations performed by the linker.
• These operations generally include fixing up the addresses of externally referenced objects, relocating the machine code, and so on.
• Link time is when the object files produced by the compiler are combined with libraries to form one executable file.

Page 8: Team9 presentation

Outline of source code execution

The compiler takes source code as input and generates an object file. The linker takes the object file and generates an executable file. The loader loads this executable file into memory, and then code execution takes place.

[Figure: Input → Compiler → Linker → Loader → Run-time → Output]

Page 9: Team9 presentation

Branch Interference

• Branch behavior is recorded in the pattern history table.
• The predictors use the pattern history table information to make their predictions.
• Conditional branches are mapped to table entries by hashing with the low-order bits of the branch addresses.
• Branch interference occurs when the 2-bit counter that tracks the history of a particular branch is altered by the history of another branch (i.e., several conditional branches are mapped to the same 2-bit counter).

Page 10: Team9 presentation

Example: Branches i and k are mapped to the same 2-bit counter of the pattern table.

Address/4    Profiling    Instructions
480005ce     9433/9436    bnez i
480005cf     1/1          bnez j
. . .
480005ee     0/6788       bnez k
. . .
4800061f     5702/5792    bnez n

[Figure: pattern table with entries 0 to 16; each entry is a 2-bit counter]
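The mapping in this example can be checked with a short sketch. It assumes the pattern table is indexed by the word address modulo the number of entries, and that the table has 32 entries (the later figures show entries 0 to 31). The function name `pattern_table_index` is hypothetical.

```python
def pattern_table_index(word_address: int, num_entries: int = 32) -> int:
    """Index the pattern table with the low-order bits of a branch address.

    Assumption: num_entries is 32, matching the 'Entry 0 ... Entry 31' figures.
    """
    return word_address % num_entries


# Word addresses (Address/4) taken from the example table.
branches = {"i": 0x480005CE, "k": 0x480005EE, "n": 0x4800061F}

for name, address in branches.items():
    print(f"bnez {name} -> entry {pattern_table_index(address)}")
```

Under these assumptions, branches i and k land in the same entry while n lands elsewhere, which is the collision the slide describes.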

Page 11: Team9 presentation

Effects of branch interference

Prediction accuracy is reduced.

Explanation:
• 480005ce 9433/9436 bnez i
• 4800061f 5702/5792 bnez n
• 480005cf 1/1 bnez j
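A minimal sketch of why sharing a counter hurts accuracy. It models one 2-bit saturating counter (values 0 to 3, predict taken when the value is 2 or 3) and compares two branches with opposite behavior when they own separate counters versus when their outcomes are interleaved on one shared counter. The initial counter value of 2 and the eight-iteration traces are illustrative assumptions, not profile data from the slides.

```python
def mispredictions(outcomes, counter=2):
    """Count mispredictions of one 2-bit saturating counter.

    outcomes: iterable of booleans, True meaning the branch was taken.
    counter:  initial counter value (assumed 2, i.e. weakly taken).
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Saturating update: move toward 3 on taken, toward 0 on not taken.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


always_taken = [True] * 8    # behaves like a mostly-taken branch
never_taken = [False] * 8    # behaves like a never-taken branch

# Each branch has its own counter.
separate = mispredictions(always_taken) + mispredictions(never_taken)

# Both branches share one counter and execute alternately.
interleaved = [o for pair in zip(always_taken, never_taken) for o in pair]
shared = mispredictions(interleaved)

print("separate counters:", separate)
print("shared counter:   ", shared)
```

In this toy trace the shared counter never settles on the not-taken branch, so every one of its executions is mispredicted.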

Page 12: Team9 presentation

How to reduce the interference

•Address adjustment techniques applied at link time (walk time) can help reduce branch interference.

•The techniques focus on where to insert NOP instructions and how many to insert.

Page 13: Team9 presentation

NOP instruction

•An instruction that does nothing is called a NOP (no operation).

Example:

    lw v0, 4(v1)    // leads to a bug
    jr v0

    lw v0, 4(v1)
    NOP
    jr v0

Page 14: Team9 presentation

Few Interesting Points

• In MIPS, add $0, $0, $0 is a NOP instruction, since it does not affect any architectural state.
• A NOP should advance the PC by only one instruction.

Page 15: Team9 presentation

Principal Uses Of NOP

• To allow future modification of the code, for example by reserving space in code memory
• To add a known delay
• To deal with hazards and sequencing problems in pipelining
• To synchronize events

Page 16: Team9 presentation

•Two important issues when inserting NOP instructions are:

    • The right location to insert NOP instructions, to avoid wasting CPU cycles.
    • The proper number of NOP instructions, to avoid code expansion.

Page 17: Team9 presentation

Address Adjustment

• Logically adjust the static address of the branch that causes the branch interference.

• Insert NOPs in such a way that the branch is mapped to another entry in the table, avoiding interference.

• This task is done by the linker at link time.

• Any address adjustment scheme must trade off among code expansion, CPU overhead, and branch prediction accuracy.

Page 18: Team9 presentation

Address Adjustment Techniques

• Constrained Address Adjustment
• Relaxed Address Adjustment
• Branch Classification

Page 19: Team9 presentation

Approach

• All three algorithms follow a greedy approach.

Greedy approach: at each step the algorithm makes the choice that looks best at that point, instead of searching for a globally optimal placement.

Page 20: Team9 presentation

How do the algorithms work using the greedy approach?

• Each algorithm scans the instructions only once.
• It inserts NOP instructions appropriately.
• Once NOP instructions are inserted, they are not altered by the adjustment of the following instructions.

Page 21: Team9 presentation

Constrained Address Adjustment

Concept: The method adjusts the addresses of instructions by inserting NOP instructions only after unconditional branches. NOP instructions inserted right after unconditional branches are never executed, so the dynamic instruction count remains the same and CPU cycles do not increase.

Page 22: Team9 presentation

Terms to concentrate

a) Unconditional branch
b) Conditional branch
c) Reference count of a branch
d) Reference count of 2-bit counters
e) Profiling
f) Maximum Motion Distance (MMD)

Page 23: Team9 presentation

Unconditional Branch

• A branch instruction that always transfers control.
• Branch instructions are used to implement the control flow of a program.

Examples:
• j: jump to an address.
• jr: jump to an address stored in a register.
• jal: jump to an address and store the return address in a register.
• jalr: jump to an address stored in a register and store the return address in a register.

Page 24: Team9 presentation

Conditional Branch

• A branch that may or may not be taken, depending on a condition.

Examples:
• beq: branch if two registers are equal.
• bne: branch if two registers are not equal.
• bgtz: branch if the value in a register is greater than zero.

Page 25: Team9 presentation

Reference count of a branch
• Records how many times the branch is executed, together with whether it was taken or not taken.

Reference count of a 2-bit counter
• The number of times the branches mapped to a given counter in the pattern table reference that counter.

Profiling
• Reveals the interaction between the software and the underlying machine architecture.
• Indicates areas for improvement, which enables performance tuning.

Page 26: Team9 presentation

Maximum Motion Distance (MMD)

• Based on the reference counts of the selected branches, the algorithm determines how many NOPs should be inserted after the first unconditional branch. This number is the motion distance.

• The motion distance is chosen so that the references to the pattern table entries are spread as evenly as possible, which reduces branch interference.

• MMD sets an upper bound on the motion distance for an unconditional branch in order to avoid excessive code expansion.

Page 27: Team9 presentation

Algorithm

Assumptions made:
• U: number of unconditional branches. There is a pseudo unconditional branch at the top and at the end of the program.
• C: number of conditional branches between two unconditional branches.
• E: number of entries in the pattern table.
• RCE[j]: reference count of entry j in the pattern table, 1 <= j <= E. RCE[j] is initially set to zero for each j.
• RCC[q]: reference count of the selected conditional branch q.
• ADDRESS[q]: address of the conditional branch q.

Page 28: Team9 presentation

Steps

Step 1: i = 1
Step 2: Select the conditional branches between unconditional branches i and i+1. /* C is the number of these conditional branches */
Step 3: Read the profiling information of the selected conditional branches. /* RCC[q] */
Step 4:
    For k = 0 to MMD do
        For j = 1 to E do
            P_RCE[j] = RCE[j];
        Endfor
        M[k] = 0; /* M[k] is the highest reference count among the indexed entries */
        For q = 1 to C do
            Use (ADDRESS[q] + k) to index an entry, say n, in the pattern table;
            P_RCE[n] = P_RCE[n] + RCC[q];
            if (M[k] < P_RCE[n]) M[k] = P_RCE[n];
        Endfor
    Endfor

Page 29: Team9 presentation

Step 5: Find P such that M[P] = min{M[k], 0 <= k <= MMD}; /* If there is more than one minimum, select the one with the smallest subscript. */

Step 6: Insert P NOPs right after unconditional branch i;

Step 7: Update the addresses of the instructions after unconditional branch i, and update RCE[j], 1 <= j <= E; /* RCE[j] contains the reference counts before unconditional branch i+1 is considered */

Step 8: i++; repeat the steps until i = U
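Steps 4 and 5 can be sketched for a single region between two unconditional branches. This is an illustrative sketch, not the authors' implementation: the function name `choose_nop_count` is hypothetical, table entries are numbered from 0 rather than 1, and an address is assumed to map to an entry by taking it modulo the table size.

```python
def choose_nop_count(rce, branches, mmd):
    """Pick the motion distance P for one region (Steps 4 and 5).

    rce:      list of current reference counts, one per pattern-table entry.
    branches: list of (address, rcc) pairs for the conditional branches
              between unconditional branches i and i+1.
    mmd:      maximum motion distance.
    Returns (P, updated reference counts after inserting P NOPs).
    """
    num_entries = len(rce)
    best_p, best_max, best_counts = 0, None, None
    for k in range(mmd + 1):
        p_rce = list(rce)            # trial copy of the reference counts
        m_k = 0                      # highest count among the indexed entries
        for address, rcc in branches:
            n = (address + k) % num_entries
            p_rce[n] += rcc
            m_k = max(m_k, p_rce[n])
        # Strict '<' keeps the smallest k when several minima exist.
        if best_max is None or m_k < best_max:
            best_p, best_max, best_counts = k, m_k, p_rce
    return best_p, best_counts


# Hypothetical region: entry 6 is already heavily used, and one branch with
# reference count 50 currently maps to it.
counts = [0] * 8
counts[6] = 100
p, counts = choose_nop_count(counts, [(6, 50)], mmd=2)
print("insert", p, "NOPs; reference counts:", counts)
```

Returning the updated counts corresponds to Step 7, so the next region starts from the reference counts that include the branches already placed.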

Page 30: Team9 presentation

Example 1

Address/4    Profiling    Instructions
480005ce     9433/9436    jmp i
480005cf     1/1          bnez j
. . .
480005ee     0/6788       jmp k
. . .
480005ff     0/5792       jmp m
. . .
4800061f     5702/5792    jmp n

[Figure: pattern table with entries 0 to 31]

Page 31: Team9 presentation

After applying the constrained address adjustment algorithm:

Address/4    Profiling    Instructions
480005ce     9433/9436    jmp i
480005cf                  NOP
480005d0     1/1          bnez j
480005d1     . . .
480005ef     0/6788       jmp k
. . .
480005ff     0/5792       jmp m
. . .
4800061f     5702/5792    jmp n

[Figure: pattern table with entries 0 to 31]

Now branches j and k share the same entry. Branch j is executed only once, so the interference between j and k is lower than the earlier interference between i and k.

Page 32: Team9 presentation

Example 2

Before applying the algorithm

[Figure: two Jump instructions in the code, shown against pattern table entries 0 to 31]

Page 33: Team9 presentation

After applying the algorithm

[Figure: the code sequence JUMP, NOP, NOP, JUMP, shown against pattern table entries 0 to 31]

Page 34: Team9 presentation

[Figure: (a) Constrained address adjustment; (b) Relaxed address adjustment]

Page 35: Team9 presentation

Pros and Cons of constrained address adjustment:

Pros:
• Accuracy increases.
• Interference is lowered because the method considers the conditional branches between unconditional branches.
• The CPU does not execute the inserted NOP instructions.

Cons:
• It does not reduce the interference among the conditional branches that lie between the same pair of unconditional branches.

Page 36: Team9 presentation

Relaxed Address Adjustment

CONCEPT:
• It extends the constrained address adjustment technique by inserting NOP instructions after both conditional and unconditional branches.

Page 37: Team9 presentation

Why move from constrained address adjustment to relaxed address adjustment?

• To reduce interference between conditional branches.

Example:
• Let i and m be unconditional branches, and let j and k be conditional branches.
• The constrained technique can only insert NOPs after i and m, so it will not eliminate interference between j and k, which are enclosed by i and m.

Page 38: Team9 presentation

Considering interference in conditional branches…

[Figure: the code sequence jump i, bnez j, bgtz k, jump m, shown against a pattern history table with entries 1 to 8; each entry is a 2-bit counter]

Page 39: Team9 presentation

Algorithm

Assumptions made:
• C: the number of conditional branches
• E: the number of entries in the pattern table
• RCE[e]: the reference count of entry e
• RCE[j]: the reference count of entry j before a conditional branch is considered, 1 <= j <= E
• RCC[i]: the reference count of the conditional branch i, 1 <= i <= C
• ADDRESS[i]: the address of the conditional branch i

Also assume that the first conditional branch is mapped to entry e, that is, RCE[e] = RCC[1].

Page 40: Team9 presentation

Steps

STEP 1: Read the profiling information of the branch i.
STEP 2:
    For k = 0 to MMD do
        Use (ADDRESS[i] + k) to index an entry, say n, in the pattern table;
        M[k] = RCE[n] + RCC[i];
    Endfor
STEP 3: Select P such that M[P] = min{M[k], 0 ≤ k ≤ MMD}; /* Take the minimum value of M[k] */
STEP 4: If there are no unconditional branches between conditional branches i-1 and i, then insert P NOPs right after branch i-1;
        otherwise, insert P NOPs after an unconditional branch between i-1 and i.
STEP 5: Update the addresses of the instructions after branch i-1 and RCE[j], 1 ≤ j ≤ E.
STEP 6: i++; repeat steps 2 to 6 until i = C.
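The per-branch loop can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, entries are numbered from 0, addresses map to entries by modulo, ties are broken toward the smallest k (the slides do not state a tie rule for this algorithm), and the choice of where the NOPs physically go (Step 4) is reduced to shifting every later address by P.

```python
def relaxed_nop_count(rce, address, rcc, mmd):
    """Steps 2 and 3 for one conditional branch; also updates rce in place."""
    num_entries = len(rce)
    m = [rce[(address + k) % num_entries] + rcc for k in range(mmd + 1)]
    p = m.index(min(m))              # smallest k among equal minima (assumed)
    rce[(address + p) % num_entries] += rcc
    return p


def relaxed_adjust(branches, num_entries, mmd):
    """Walk the conditional branches once, in program order.

    branches: list of (address, rcc) pairs before any NOP is inserted.
    Returns (list of NOP counts per branch, final reference counts).
    """
    rce = [0] * num_entries
    shift = 0                        # NOPs inserted so far move later code down
    inserted = []
    for address, rcc in branches:
        p = relaxed_nop_count(rce, address + shift, rcc, mmd)
        shift += p
        inserted.append(p)
    return inserted, rce


# Hypothetical input: two heavily executed branches that both map to entry 14
# of a 32-entry table.
nops, counts = relaxed_adjust([(14, 9436), (46, 6788)], num_entries=32, mmd=8)
print("NOPs before each branch:", nops)
```

With an empty table the first branch always gets P = 0, which matches the assumption that the first conditional branch stays in its original entry.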

Page 41: Team9 presentation

Step 3 Explanation

Assume branch i is a conditional branch with profiling information 3434/4456.

    for (int k = 0; k < 8; k++)
    {
        ADDRESS[i] + k -> e[n];
        M[k] = RCE[n] + RCC[i];
    }

The loop collects M[k] for all values of k.

Page 42: Team9 presentation

Example 1

Address/4    Profiling    Instructions
480005ce     9433/9436    bnez i
480005cf     1/1          bnez j
480005d0                  jmp
. . .
480005ee     0/6788       bnez k
. . .
480005ff     0/5792       bnez m
. . .
4800061f     5702/5792    bnez n

[Figure: pattern table with entries 0 to 31]

Page 43: Team9 presentation

After applying the relaxed address adjustment algorithm:

Address/4    Profiling    Instructions
480005ce     9433/9436    bnez i
480005cf     1/1          bnez j
480005d0                  jmp
480005d1                  NOP
. . .
480005ef     0/6788       bnez k
. . .
48000600     0/5792       bnez m
48000601                  NOP
. . .
4800061f     5702/5792    bnez n

[Figure: pattern table with entries 0 to 31]

Page 44: Team9 presentation

Effects of MMD using 2bC predictor

[Figure: two graphs, one for constrained address adjustment and one for relaxed address adjustment]

• A larger MMD results in lower misprediction ratios but more NOP insertion (increased code size).

• The graph shows a balance between misprediction ratio and code expansion when there are 8 2-bit counters.

Page 45: Team9 presentation

Pros and Cons of relaxed address adjustment:

Pros:
• Considers interference between both conditional and unconditional branches.
• More accurate than the constrained algorithm.

Cons:
• NOPs will now be executed by the CPU, which increases CPU overhead (but considerable).

Page 46: Team9 presentation

Branch Classification

CONCEPT

• Examine the interference behavior of branches.
• Map branches with similar behavior to the same 2-bit counter of the pattern table.

Page 47: Team9 presentation

Interference Problem

[Figure: branch interference divided into constructive interference, neutral interference, and destructive interference]

Page 48: Team9 presentation

Constructive interference: causes a conditional branch that would otherwise be mispredicted to be predicted correctly.

Neutral interference: has no effect on misprediction.

Destructive interference: causes a conditional branch that would otherwise be predicted correctly to be mispredicted.

Page 49: Team9 presentation

Destructive interference

• Branches with different branching behaviour are mapped to the same 2-bit counter.

• This decreases branch prediction accuracy.

• The branch classification technique aims at reducing destructive interference among branches.

Page 50: Team9 presentation

Part 1 – Examine the branch behavior and classify the branches

• A branch is classified based on its branch direction.
• Profiling information is used to find the taken probability of the branch.
• For a branch i, the taken probability is TP(i).
• Once the branch direction is known, the branch is assigned to the appropriate class.
• All conditional branches are classified into N classes, where N is the number of entries in the pattern table.
• The class values therefore range from 0 to N-1. For example, branch i is mapped to the nth 2-bit counter of the pattern table, where n modulo N = Class(i).

$$
\mathrm{Class}(i) =
\begin{cases}
0, & \text{if } \frac{N-1}{N} \le TP(i) \le 1 \\
1, & \text{if } \frac{N-2}{N} \le TP(i) < \frac{N-1}{N} \\
\vdots \\
N-2, & \text{if } \frac{1}{N} \le TP(i) < \frac{2}{N} \\
N-1, & \text{if } 0 \le TP(i) < \frac{1}{N}
\end{cases}
$$
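The class definition can be sketched in a few lines. The sketch assumes that a profile entry such as 9433/9436 means taken count over execution count, and that a branch with TP(i) = 1 belongs to class 0; the function name `branch_class` is hypothetical. Integer arithmetic is used so that boundary values such as TP(i) = 1/N are not affected by floating-point rounding.

```python
def branch_class(taken: int, executed: int, num_classes: int) -> int:
    """Return Class(i) for a branch with TP(i) = taken / executed.

    Class 0 holds the most frequently taken branches and class
    num_classes - 1 the least frequently taken ones.
    """
    # band = floor(TP * N), capped so that TP = 1 falls in the top band.
    band = min(taken * num_classes // executed, num_classes - 1)
    return num_classes - 1 - band


# Profile data from the example tables, with N = 2 classes.
profile = {"i": (9433, 9436), "j": (1, 1), "k": (0, 6788),
           "m": (0, 5792), "n": (5702, 5792)}
for name, (taken, executed) in profile.items():
    print(f"bnez {name}: class {branch_class(taken, executed, 2)}")
```

With two classes this reproduces the grouping used in the branch classification example: i, j and n in class 0, and k and m in class 1.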

Page 51: Team9 presentation

How Branch Classification Works

The branch classification works in two parts:

Part 1 – Examine the branch behavior and Classify them

Part 2 – Address adjustment of the branch by NOP insertion

Page 52: Team9 presentation

Part 2 – Address adjustment of the branch by NOP insertion

• After the mapping is decided, the address of the branch is adjusted by inserting the proper number of NOP instructions at the right locations.
• Finally, the addresses of the branches and of the following instructions are updated after the alteration.

Page 53: Team9 presentation

Algorithm

Assumptions made:
• C: number of conditional branches
• N: number of branch classes
• ADDRESS[i]: the address of branch i, where 1 <= i <= C

Step 1: i = 1 /* Start from the very first conditional branch */
Step 2: Read the profiling information of conditional branch i and compute TP(i).
Step 3: Compute Class(i) and find the smallest positive integer P such that
        (ADDRESS[i] + P) modulo N = Class(i)
        /* P is the least number of NOPs to be inserted, to avoid code expansion */
Step 4: If (i = 1)
            then insert P NOPs at the beginning of the program;
        If (i > 1 and there is no unconditional branch between branches i-1 and i)
            then insert P NOPs after conditional branch i-1;
        If (i > 1 and there is an unconditional branch between branches i-1 and i)
            then insert P NOPs after the unconditional branch;
Step 5: Update the addresses of the instructions after conditional branch i-1;
Step 6: i++; repeat steps 2 to 6 until i = C;
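The algorithm can be sketched as a single pass over the profiled branches. Several things here are assumptions rather than statements from the slides: the function names are hypothetical, P is allowed to be 0 when a branch already maps to its class (the adjusted listing in the example leaves branch i where it is), branch j is taken to start at 480005cf as the adjusted listing implies, and the placement rule of Step 4 is reduced to shifting all later addresses by P.

```python
def branch_class(taken, executed, num_classes):
    """Class 0 = most frequently taken band, num_classes - 1 = least."""
    band = min(taken * num_classes // executed, num_classes - 1)
    return num_classes - 1 - band


def nops_needed(address, target_class, num_classes):
    """Least P >= 0 with (address + P) modulo num_classes == target_class."""
    return (target_class - address) % num_classes


def classify_and_adjust(branches, num_classes):
    """branches: list of (address, taken, executed) in program order.

    Returns a list of (new_address, nops_inserted_before_branch).
    """
    shift = 0                       # total NOPs inserted so far
    result = []
    for address, taken, executed in branches:
        current = address + shift   # address after earlier insertions
        p = nops_needed(current, branch_class(taken, executed, num_classes),
                        num_classes)
        shift += p
        result.append((current + p, p))
    return result


example = [
    (0x480005CE, 9433, 9436),   # bnez i
    (0x480005CF, 1, 1),         # bnez j (assumed original address)
    (0x480005EE, 0, 6788),      # bnez k
    (0x480005FF, 0, 5792),      # bnez m
    (0x4800061F, 5702, 5792),   # bnez n
]
for new_address, p in classify_and_adjust(example, num_classes=2):
    print(f"{new_address:08x}  ({p} NOPs inserted before this branch)")
```

Under these assumptions the final addresses agree with the adjusted listing in the example that follows (j at 480005d0, k at 480005ef, m at 48000601, n at 48000622).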

Page 54: Team9 presentation

Example

Address/4    Profiling    Instructions
480005ce     9433/9436    bnez i
480005cf     1/1          bnez j
. . .
480005ee     0/6788       bnez k
. . .
480005ff     0/5792       bnez m
. . .
4800061f     5702/5792    bnez n

[Figure: pattern table with entries 0 to 31]

Assume branches i, j, and n belong to Class 0 and map to even entries in the pattern table. Branches k and m belong to Class 1 and map to odd entries in the pattern table.

Page 55: Team9 presentation

After applying the branch classification algorithm:

Address/4    Profiling    Instructions
480005ce     9433/9436    bnez i
480005cf                  NOP
480005d0     1/1          bnez j
. . .
480005ef     0/6788       bnez k
480005f0                  NOP
. . .
48000601     0/5792       bnez m
48000602                  NOP
. . .
48000622     5702/5792    bnez n

[Figure: pattern table with entries 0 to 31]

Branch k is now mapped to entry 15, which predicts for branches that belong to Class 1. There is no interference after the branch classification technique is applied.

Page 56: Team9 presentation

Relationship between the branch misprediction ratio and the number of classes

[Figure: graph of misprediction ratio against the number of classes]

A large number of classes results in code expansion, and beyond a certain value the misprediction ratio sometimes stays the same.

The graph shows that 4 classes is an appropriate value, because it gives lower misprediction ratios than 2 classes.

Page 57: Team9 presentation

Performance Metric

The performance metric used is the misprediction ratio, because it is:
– a reasonable metric for dynamic branch predictors.
– independent of the other parts of the processor organisation.

$$
\text{Misprediction ratio} = 1 - \frac{\text{number of correct predictions}}{\text{number of conditional branches executed}}
$$

When the misprediction ratio is reduced, the prediction accuracy increases.
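As a small worked example of the metric, the sketch below evaluates the formula for made-up counts. The function name and the numbers are illustrative assumptions and are not results from the presentation.

```python
def misprediction_ratio(correct_predictions: int, branches_executed: int) -> float:
    """1 - (correct predictions / conditional branches executed)."""
    return 1 - correct_predictions / branches_executed


# Hypothetical run: 1000 conditional branches executed, 920 predicted correctly.
ratio = misprediction_ratio(920, 1000)
print(f"misprediction ratio = {ratio:.3f}")
```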

Page 58: Team9 presentation

Tools used

Machine: DEC 3000 workstations running OSF/1 version 3.2.

Benchmarks:
– SPECint92 suite
– SPECint95 suite

ATOM: used to collect statistics about branches by inserting analysis procedures that implement the predictors.

Page 59: Team9 presentation

Analysis

Analysis of:

• Address adjustment techniques: Constrained, Relaxed, Branch classification
• Predictors: 2bC, GAs, PAs, GShare
• Reduction in misprediction ratio after applying the three techniques.
• For the analysis, MMD is set to 8 and the number of classes is set to 4, since they were found to be the optimal values.

Page 60: Team9 presentation

Misprediction ratios for 4 predictors

[Figure: four panels showing the 2bC predictor, the Gshare predictor, the PA predictor, and the GA predictor]

Page 61: Team9 presentation

Limitation of Address Adjustment- Code Expansion

Page 62: Team9 presentation

Best Among Three…

• Among the three address adjustment algorithms, branch classification is the best, because it directly targets destructive interference.

• Any address adjustment scheme must trade off among code expansion, CPU overhead, and branch prediction accuracy.

• Branch classification gives the best trade-off among code expansion, CPU overhead, and branch prediction accuracy.

Page 63: Team9 presentation

Conclusion

• Predictors with large tables might not be necessary.
• Address adjustment with smaller predictors can deliver comparable performance.
• Address adjustment is very effective in reducing the misprediction ratios for 2bC, GAs, and PAs predictors.
• Limitation: address adjustment cannot help predictors such as Gshare, which map the branches to the prediction table at run time.

Page 64: Team9 presentation

Our Project

• We are going to work on the delays that occur during the execution of conditional branches.
• Find the reason behind the delays.
• Implement re-timing techniques.
• Introduce parallelism into the conditional branches.
• Analyze the reduction in the number of CPU cycles taken, misprediction ratios, and so on.


Page 66: Team9 presentation

QUESTIONS??