Coa Lecture Unit 3 Pipelining


7/28/2019

Topics covered: Pipelining

Unit 3


    Basic concepts

Speed of execution of programs can be improved in two ways: use faster circuit technology to build the processor and the memory, or arrange the hardware so that a number of operations can be performed simultaneously. In the second approach, the number of operations performed per second is increased, although the elapsed time needed to perform any one operation is not changed.

Pipelining is an effective way of organizing concurrent activity in a computer system to improve the speed of execution of programs.


    Basic concepts (contd..)

A processor executes a program by fetching and executing instructions one after the other; this is known as sequential execution. If Fi refers to the fetch step and Ei refers to the execution step of instruction Ii, then sequential execution looks like:

Figure: sequential execution along the time axis; F1 E1, F2 E2, F3 E3, with each fetch/execute pair completing before the next begins.

    What if the execution of one instruction is overlapped with the fetching of the

    next one?


    Basic concepts (contd..)

Figure: an instruction fetch unit and an execution unit connected by interstage buffer B1.

The computer has two separate hardware units, one for fetching instructions and one for executing them. An instruction is fetched by the instruction fetch unit and deposited in an intermediate buffer B1. The buffer enables the execution unit to execute one instruction while the fetch unit is fetching the next instruction. Results of the execution are deposited in the destination location specified by the instruction.


    Basic concepts (contd..)

Figure: two-stage pipelined execution; instructions I1, I2, I3 flow through fetch (F) and execute (E) steps over clock cycles 1-4.

The computer is controlled by a clock whose period is such that either the fetch or the execute step of any instruction can be completed in one clock cycle.

    First clock cycle:

- Fetch unit fetches instruction I1 (F1) and stores it in B1.

Second clock cycle:

- Fetch unit fetches instruction I2 (F2), and the execution unit executes instruction I1 (E1).

Third clock cycle:

- Fetch unit fetches instruction I3 (F3), and the execution unit executes instruction I2 (E2).

Fourth clock cycle:

- Execution unit executes instruction I3 (E3).


    Basic concepts (contd..)

In each clock cycle, the fetch unit fetches the next instruction while the execution unit executes the current instruction stored in the interstage buffer, so both units can be kept busy all the time.

If this pattern of fetch and execute can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by sequential operation.

The fetch and execute units constitute a two-stage pipeline. Each stage performs one step in the processing of an instruction. The interstage buffer holds the information that needs to be passed from the fetch stage to the execute stage. New information is loaded into the buffer every clock cycle.


    Basic concepts (contd..)

Suppose the processing of an instruction is divided into four steps:

F  Fetch: Read the instruction from the memory.
D  Decode: Decode the instruction and fetch the source operands.
E  Execute: Perform the operation specified by the instruction.
W  Write: Store the result in the destination location.

There is a distinct hardware unit for each of the four steps. Information is passed from one unit to the next through an interstage buffer; three interstage buffers connect the four units. As an instruction progresses through the pipeline, the information needed by the downstream units must be passed along.

Figure: four-stage pipeline; F (fetch instruction), D (decode instruction and fetch operands), E (execute operation), W (write results), with interstage buffers B1, B2, B3 between the stages.


    Basic concepts (contd..)

Figure: four-stage pipelined execution of instructions I1-I4 over clock cycles 1-7; each instruction passes through F, D, E, W in successive cycles.

Clock cycle 1: F1
Clock cycle 2: D1, F2
Clock cycle 3: E1, D2, F3
Clock cycle 4: W1, E2, D3, F4
Clock cycle 5: W2, E3, D4
Clock cycle 6: W3, E4
Clock cycle 7: W4
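The table above follows a simple rule: in an ideal k-stage pipeline, instruction i (counting from 0) occupies stage s during clock cycle i + s + 1. A small Python sketch (illustrative, not from the lecture) reproduces the schedule:

```python
# Sketch: generate the ideal pipeline schedule for n instructions.
# Instruction i (0-based) is in stage s during clock cycle i + s + 1,
# assuming no hazards.
def pipeline_schedule(n_instructions, stages=("F", "D", "E", "W")):
    schedule = {}  # clock cycle -> list of "stage + instruction" steps
    for i in range(n_instructions):
        for s, stage in enumerate(stages):
            cycle = i + s + 1
            schedule.setdefault(cycle, []).append(f"{stage}{i + 1}")
    return schedule

sched = pipeline_schedule(4)
for cycle in sorted(sched):
    print(f"Clock cycle {cycle}: {', '.join(sched[cycle])}")
```

Running it for 4 instructions prints exactly the seven-cycle table above.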


    Basic concepts (contd..)

Figure: the same four-stage pipelined execution of I1-I4 over clock cycles 1-7.

During clock cycle 4:

Buffer B1 holds instruction I3, which is being decoded by the instruction-decoding unit. Instruction I3 was fetched in cycle 3.

Buffer B2 holds the source and destination operands for instruction I2. It also holds the information needed for the Write step (W2) of instruction I2; this information will be passed to stage W in the following clock cycle.

Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.


    Role of cache memory

Each stage in the pipeline is expected to complete its operation in one clock cycle, so the clock period should be long enough to complete the longest task. Units that complete their tasks early remain idle for the remainder of the clock period. For pipelining to be effective, the tasks performed in different stages should require about the same amount of time.

If instructions were fetched from the main memory, the instruction fetch stage could take as much as ten times longer than the other stage operations inside the processor. However, if instructions are fetched from the cache memory on the processor chip, the time required to fetch an instruction is more or less similar to the time required for the other basic operations.


The pipeline completes an instruction each clock cycle, and is therefore four times as fast as execution without a pipeline, as long as nothing takes more than one cycle. But sometimes things take longer: most Execute steps, such as ADD, take one clock, but suppose DIVIDE takes three.


While the long Execute step is in progress, the other stages are idle: Write has nothing to write, Decode can't use its output buffer, and Fetch can't use its output buffer.


    Pipeline performance

The potential increase in performance achieved by pipelining is proportional to the number of pipeline stages. For example, with 4 pipeline stages, the rate of instruction processing is 4 times that of sequential execution.

Pipelining does not cause a single instruction to be executed faster; it is the throughput that increases.

This rate can be achieved only if the pipelined operation can be sustained without interruption throughout program execution. If a pipelined operation cannot be sustained without interruption, the pipeline is said to stall. A condition that causes the pipeline to stall is called a hazard.
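The proportionality claim can be made concrete with the standard ideal-pipeline timing: n instructions with k one-cycle steps take n x k cycles sequentially but only k + (n - 1) cycles pipelined, so the speedup approaches k for long instruction streams. A quick sketch (assumed formula, not from the slides):

```python
# Sketch: ideal k-stage pipeline speedup over sequential execution.
# Sequential: n instructions x k cycles each. Pipelined: k cycles to
# fill the pipe, then one instruction completes per cycle.
def speedup(n, k):
    sequential = n * k
    pipelined = k + (n - 1)
    return sequential / pipelined

print(speedup(4, 4))      # short run: 16/7, well below 4
print(speedup(10000, 4))  # long run: approaches the stage count, 4
```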


    Data hazard

A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline. Execution of an instruction occurs in the E stage of the pipeline. Most arithmetic and logic operations take only one clock cycle to execute; however, some operations, such as division, take more time to complete. Suppose, for example, that the operation specified in instruction I2 takes three cycles to complete, from cycle 4 to cycle 6.

Figure: pipelined execution of I1-I4 in which E2 occupies cycles 4-6; W2 and the steps of the following instructions are delayed accordingly.


    Data hazard (contd..)

Figure: the same pipelined execution, highlighting the idle cycles caused by the three-cycle E2 step.

In cycles 5 and 6, the Write stage is idle because it has no data to work with. The information in buffer B2 must be retained until the execution of instruction I2 is complete. Stage 2, and by extension stage 1, cannot accept new instructions because the information in B1 cannot be overwritten, so steps D4 and F5 must be postponed.

A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline.


    Data Hazards

Situations that cause the pipeline to stall because the data to be operated on is delayed, for example when the Execute step takes extra cycles.


    Control or instruction hazard

The pipeline may be stalled because an instruction is not available at the expected time. For example, a cache miss may occur while fetching an instruction, in which case the instruction has to be fetched from the main memory. Fetching an instruction from the main memory takes much longer than fetching it from the cache, so the fetch step cannot be completed in one cycle. Suppose, for example, that fetching instruction I2 results in a cache miss, so that F2 takes 4 clock cycles instead of 1.

Figure: pipelined execution in which F2 occupies cycles 2-5 because of a cache miss; D2, E2, W2 and the steps of I3 are delayed accordingly (clock cycles 1-9).


An instruction hazard (or control hazard) has caused the pipeline to stall: instruction I2 was not in the cache and required a main memory access.


    Control or instruction hazard (contd..)

The fetch operation for instruction I2 results in a cache miss, and the instruction fetch unit must fetch this instruction from the main memory. Suppose fetching instruction I2 from the main memory takes 4 clock cycles; instruction I2 will then be available in buffer B1 at the end of clock cycle 5, and the pipeline resumes its normal operation at that point. The Decode unit is idle in cycles 3 through 5, the Execute unit is idle in cycles 4 through 6, and the Write unit is idle in cycles 5 through 7. Such idle periods are called stalls or bubbles. Once created in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.

Figure: the same delay viewed stage by stage over clock cycles 1-9; the F stage repeats F2 in cycles 2-5, while the D, E, and W stages are each idle for three cycles before resuming with D2, E2, and W2.


    Structural hazard

A structural hazard occurs when two instructions require the use of a hardware resource at the same time. The most common case is access to memory: one instruction needs to access memory as part of its Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed. Many processors have separate data and instruction caches to avoid this delay. In general, structural hazards can be avoided by providing sufficient resources on the processor chip.


    Structural hazard (contd..)

Figure: pipelined execution of I1-I5, where I2 is Load X(R1),R2; its execution occupies two cycles (E2 to compute the address, M2 for the memory access), delaying the steps of the following instructions (clock cycles 1-7).

The memory address X+R1 is computed in step E2 in cycle 4, the memory access takes place in cycle 5, and the operand read from memory is written into register R2 in cycle 6. Execution of instruction I2 thus takes two clock cycles, 4 and 5. In cycle 6, both instructions I2 and I3 require access to the register file, and the pipeline is stalled because the register file cannot handle two operations at once.


I2 takes an extra cycle for the cache access as part of its execution, after calculating the address. While I2 is writing to the register file, I3 must wait for the register file, and the fetch of I5 is delayed.


    Pipelining and performance

Pipelining does not cause an individual instruction to be executed faster; rather, it increases the throughput, defined as the rate at which instruction execution is completed. When a hazard occurs, one of the stages in the pipeline cannot complete its operation in one clock cycle, and the pipeline stalls, causing a degradation in performance. One instruction completion per clock cycle is the upper limit for the throughput achievable in a pipelined processor.


    Data hazards

A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed. Consider two instructions:

I1: A = 3 + A
I2: B = 4 x A

If A = 5 and I1 and I2 are executed sequentially, B = 32. In a pipelined processor, the execution of I2 can begin before the execution of I1 completes; the value of A used in the execution of I2 would then be the original value 5, leading to an incorrect result. Thus, instructions I1 and I2 depend on each other, because the data used by I2 depends on the result generated by I1.

The results obtained from pipelined execution must be the same as those obtained from sequential execution. When two instructions depend on each other, they must be performed in the correct order.
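The two outcomes can be checked directly in a few lines of Python (a behavioral illustration, not processor code):

```python
# Sketch: correct in-order result versus the stale-operand result.
A = 5

A_after_I1 = 3 + A           # I1: A = 3 + A, giving 8
B_correct = 4 * A_after_I1   # I2 uses the new value of A: B = 32

B_stale = 4 * A              # I2 reads A before I1 writes it back: B = 20

print(B_correct, B_stale)
```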


    Concurrency

A ← 3 + A        A ← 5 x C
B ← 4 x A        B ← 20 + C

The pair on the left can't be performed concurrently: the result is incorrect if the new value of A is not used. The pair on the right can be performed concurrently (or in either order) without affecting the result.


    Concurrency

A ← 3 + A        A ← 5 x C
B ← 4 x A        B ← 20 + C

In the pair on the left, the second operation depends on completion of the first. In the pair on the right, the two operations are independent.


MUL R2, R3, R4
ADD R5, R4, R6   (dependent on the result in R4 from the previous instruction)

MUL will write its result in R4; ADD can't finish decoding until that result is in R4.


    Data hazards (contd..)

Figure: pipelined execution of Mul R2,R3,R4 (I1) followed by Add R5,R4,R6 (I2); the Decode step of the Add is stalled (D2A) until the Mul's Write step completes, which delays E2, W2, and the steps of I3 and I4 (clock cycles 1-9).

The Mul instruction places the result of the multiply operation in register R4 at the end of clock cycle 4. Register R4 is used as a source operand in the Add instruction, so the Decode unit decoding the Add instruction cannot proceed until the Write step of the first instruction is complete. The data dependency arises because the destination of one instruction is used as a source in the next instruction.


    Operand forwarding

A data hazard occurs because the destination of one instruction is used as the source in the next instruction. Instruction I2 would therefore have to wait for the data to be written into the register file by the Write stage at the end of step W1. However, these data are available at the output of the ALU once the Execute stage completes step E1. The delay can be reduced, or even eliminated, if the result of instruction I1 can be forwarded directly for use in step E2. This is called operand forwarding.
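Counting cycles for the Mul/Add pair shows the saving. Without forwarding, the Add must wait until the Mul's result reaches the register file at the end of W1; with forwarding, it can consume the ALU output as soon as E1 completes. A sketch under the four-stage timing used in these slides (the cycle numbers are the assumption here):

```python
# Sketch: cycle in which the dependent Add can execute, with and
# without operand forwarding. Mul occupies F=1, D=2, E=3, W=4;
# the Add is fetched in cycle 2.
def add_execute_cycle(forwarding):
    mul_e_done = 3  # Mul's result is at the ALU output after E1
    mul_w_done = 4  # ...and in the register file after W1
    if forwarding:
        # The result is forwarded from the ALU output straight into E2.
        return mul_e_done + 1
    # Decode of Add must read R4 after W1 completes; E2 follows.
    return mul_w_done + 2

print(add_execute_cycle(True))   # cycle 4: no stall
print(add_execute_cycle(False))  # cycle 6: two stall cycles
```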


Figure: the same instruction pair executed with a pipeline stall (no forwarding) and with data forwarding.


Figure: datapath activity for MUL R2, R3, R4 followed by ADD R5, R4, R6; the product R2 x R3 is forwarded from the ALU output so the Add can use it as the value of R4 without waiting for it to be written into the register file.


If solved by software:

MUL R2, R3, R4
NOP
NOP
ADD R5, R4, R6

The two NOPs correspond to the 2-cycle stall that the hardware would introduce if there is no data forwarding.


Figure: datapath for MUL R2, R3, R4; the operands come from R2 and R3, the product R2 x R3 goes to R4, and a forwarding path carries it directly to I2.


    Operand forwarding (contd..)

Figure: register file feeding the ALU inputs through registers SRC1 and SRC2; the ALU result is held in RSLT along with its destination, and forwarding connections run from RSLT back to the ALU inputs.

    I1: Mul R2, R3, R4

    I2: Add R5, R4, R6

Clock cycle 3:

- Instruction I2 is decoded, and a data dependency is detected.
- The operand not involved in the dependency, register R5, is loaded into register SRC1.

Clock cycle 4:

- The product produced by I1 is available in register RSLT.
- The forwarding connection allows the result to be used in step E2.

Instruction I2 proceeds without interruption.


    Handling data dependency in software

A data dependency may be detected by the hardware while decoding the instruction: the control hardware can delay the reading of a register by an appropriate number of clock cycles until its contents become available, stalling the pipeline for that many clock cycles.

Detecting and handling data dependencies can also be accomplished in software. The compiler can introduce the necessary delay with an appropriate number of NOP instructions. For example, if a two-cycle delay is needed between two instructions, two NOP instructions can be placed between them:

I1: Mul R2, R3, R4
    NOP
    NOP
I2: Add R5, R4, R6
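A compiler pass implementing this idea only needs to compare each instruction's destination register with the next instruction's source registers. A sketch (hypothetical tuple encoding with the destination last, matching the lecture's Mul R2, R3, R4 convention; the two-cycle delay assumes no forwarding hardware):

```python
# Sketch: insert NOPs after any instruction whose destination register
# is used as a source by the next instruction. Instruction format is
# (opcode, src1, src2, dest), as in ("Mul", "R2", "R3", "R4").
DELAY = 2  # stall cycles needed without forwarding hardware

def insert_nops(program):
    out = []
    for i, instr in enumerate(program):
        out.append(instr)
        if i + 1 < len(program):
            dest = instr[3]
            next_sources = program[i + 1][1:3]
            if dest in next_sources:
                out.extend([("Nop",)] * DELAY)
    return out

prog = [("Mul", "R2", "R3", "R4"), ("Add", "R5", "R4", "R6")]
for instr in insert_nops(prog):
    print(instr)
```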


    Side effects

Data dependencies are explicit and easy to detect when a register specified as the destination in one instruction is used as a source in a subsequent instruction. However, some instructions also modify registers that are not specified as the destination. For example, in the autoincrement and autodecrement addressing modes, the source register is modified as well. When a location other than the one explicitly specified in the instruction as a destination is affected, the instruction is said to have a side effect. Another example of a side effect is the condition code flags, which implicitly record the result of the previous instruction; these results may be used by a subsequent instruction.


    Side effects (contd..)

I1: Add R3, R4
I2: AddWithCarry R2, R4

Instruction I1 sets the carry flag and instruction I2 uses it, leading to an implicit dependency between the two instructions. Instructions with side effects can lead to multiple data dependencies, resulting in a significant increase in the complexity of the hardware or software needed to handle them. Side effects should be kept to a minimum in instruction sets designed for execution on pipelined hardware.


    Instruction hazards

The instruction fetch unit fetches instructions and supplies the execution units with a steady stream of instructions. If this stream is interrupted, the pipeline stalls. The stream of instructions may be interrupted by a cache miss or by a branch instruction.


    Instruction hazards (contd..)

Consider a two-stage pipeline whose first stage is instruction fetch and whose second stage is instruction execute. Instructions I1, I2, and I3 are stored at successive memory locations. I2 is an unconditional branch instruction whose branch target is instruction Ik.

Clock cycle 3:

- The fetch unit is fetching instruction I3.
- The execute unit is decoding I2 and computing the branch target address.

Clock cycle 4:

- The processor must discard I3, which has been incorrectly fetched, and fetch Ik instead.
- The execution unit is idle, and the pipeline stalls for one clock cycle.


    Instruction hazards (contd..)

Figure: two-stage pipelined execution of I1, I2 (Branch), I3, Ik, Ik+1 over clock cycles 1-6; I3 is discarded (X) after E2 computes the target, leaving the execution unit idle in cycle 4.

The pipeline stalls for one clock cycle. The time lost as a result of a branch instruction is called the branch penalty; here, the branch penalty is one clock cycle.


    the lost cycle is the branch penalty


    Instruction hazards (contd..)

The branch penalty depends on the length of the pipeline and may be higher for a longer pipeline. For a four-stage pipeline:

- The branch target address is computed in stage E2.
- Instructions I3 and I4 have to be discarded.
- The execution unit is idle for 2 clock cycles.
- The branch penalty is 2 clock cycles.

Figure: four-stage pipelined execution where I2 is a branch; after E2 computes the target in cycle 4, the partially processed I3 and I4 are discarded (X) and Ik is fetched in cycle 5 (clock cycles 1-8).


In a four-stage pipeline, the penalty is two clock cycles.
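The aggregate cost of these penalties follows from a standard back-of-the-envelope formula (the numbers below are my example, not the lecture's): if a fraction p of executed instructions are branches and each pays a penalty of b cycles, the average number of cycles per instruction grows from 1 to 1 + p x b.

```python
# Sketch: average cycles per instruction when branches carry a penalty.
# p is the fraction of instructions that are branches; b is the penalty.
def cycles_per_instruction(p, b):
    return 1 + p * b

# Example: 20% branches (the dynamic frequency quoted later in these
# slides) and the 2-cycle penalty of the four-stage pipeline.
print(cycles_per_instruction(0.20, 2))  # about 1.4 cycles per instruction
```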


    Instruction hazards (contd..)

The branch penalty can be reduced by computing the branch target address earlier in the pipeline. The instruction fetch unit can have special hardware to identify a branch instruction as soon as it is fetched, so that the branch target address is computed in the Decode stage (D2) rather than in the Execute stage (E2). The branch penalty is then only one clock cycle.

Figure: the same four-stage pipeline with the branch target computed in D2; only F3 is discarded (X), and Ik is fetched in cycle 4 (clock cycles 1-7).


    Instruction hazards (contd..)

Figure: an instruction fetch unit (F) feeding an instruction queue; a dispatch/decode unit (D) sits between the queue and the execute (E) and write (W) stages.

The fetch unit fetches instructions before they are needed and stores them in a queue, which can hold several instructions. A dispatch unit takes instructions from the front of the queue and dispatches them to the execution unit; the dispatch unit also decodes the instructions.


    Instruction hazards (contd..)

The fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions.

If the pipeline stalls because of a data hazard, the dispatch unit cannot issue instructions from the queue, but the fetch unit continues to fetch instructions and add them to the queue. If fetching is delayed because of a cache miss or a branch, the dispatch unit continues to dispatch instructions from the instruction queue.


    Instruction hazards (contd..)

Figure: pipelined execution with an instruction queue over clock cycles 1-10; E1 takes three cycles (4-6), I5 is a branch whose target Ik is fetched in cycle 7 while I6 is discarded, and the queue length per cycle is 1, 1, 1, 1, 2, 3, 2, 1, 1, 1.

The initial length of the queue is 1. A fetch adds 1 to the queue length and a dispatch reduces it by 1, so the queue length remains the same for the first 4 clock cycles. I1 stalls the pipeline for 2 cycles, but the queue has space, so the fetch unit continues and the queue length rises to 3 in clock cycle 6. I5 is a branch instruction with target instruction Ik; Ik is fetched in cycle 7, and I6 is discarded. This does not stall the pipeline, since I4 is dispatched in the meantime, and I2, I3, I4, and Ik are executed in successive clock cycles. The fetch unit computes the branch address concurrently with the execution of other instructions; this is called branch folding.





The fetch unit keeps fetching (instructions 3, 4, and 5) despite the stall.


The fetch unit calculates the branch target concurrently (branch folding), discards I6, and fetches Ik.


The pipeline then completes an instruction each clock cycle, with no branch penalty.


    Instruction hazards (contd..)

Branch folding can occur only if at least one instruction other than the branch is available in the queue, so the queue should ideally be full most of the time. This requires increasing the rate at which the fetch unit reads instructions from the cache; most processors allow more than one instruction to be fetched from the cache in one clock cycle. The fetch unit must also replenish the queue quickly after a branch has occurred.

The instruction queue also mitigates the impact of cache misses: in the event of a cache miss, the dispatch unit continues to send instructions to the execution unit as long as the queue is not empty, while the desired cache block is read in the meantime. If the queue does not become empty, the cache miss has no effect on the rate of instruction execution.


    Conditional branches and branch prediction

Conditional branch instructions depend on the result of a preceding instruction: the decision on whether to branch cannot be made until the execution of that instruction is complete. Branch instructions represent about 20% of the dynamic instruction count of most programs (the dynamic count takes into consideration that some instructions are executed repeatedly). Branch instructions can therefore incur branch penalties that reduce the performance gains expected from pipelining. Several techniques exist to mitigate the negative impact of branch penalties on performance.


    Delayed branch

The branch target address is computed in stage E2, so instructions I3 and I4 have to be discarded. The location following a branch instruction is called a branch delay slot. There may be more than one branch delay slot, depending on the time it takes to determine whether the instruction is a branch; in this case, there are two branch delay slots. The instructions in the delay slots are always fetched and at least partially executed before the branch decision is made and the branch address is computed.

Figure: four-stage pipeline where I2 is a branch resolved in E2; the two delay-slot instructions I3 and I4 are fetched and partially processed, then discarded (X), before Ik is fetched (clock cycles 1-8).


The penalty is two clock cycles in a four-stage pipeline: there are two branch delay slots if the branch address is calculated in the Execute stage.


If the branch address is calculated in the Decode stage instead, the delay-slot instruction is fetched and decoded before being discarded; the penalty is reduced to one cycle, and there is one branch delay slot.


    Delayed branch (contd..)

Delayed branching can minimize the penalty incurred by conditional branch instructions. Since the instructions in the delay slots are always fetched and partially executed, it is better to arrange for them to be fully executed whether or not the branch is taken. If we can place useful instructions in these slots, they will always be executed whether or not the branch is taken; if we cannot, we fill the slots with NOP instructions.


    Delayed branch (contd..)

(a) Original program loop:

LOOP   Shift_left   R1
       Decrement    R2
       Branch=0     LOOP
NEXT   Add          R1,R3

(b) Reordered instructions:

LOOP   Decrement    R2
       Branch=0     LOOP
       Shift_left   R1
NEXT   Add          R1,R3

Register R2 is used as a counter to determine how many times R1 is to be shifted. The processor has a two-stage pipeline, and hence one delay slot. The instructions can be reordered so that the Shift_left instruction appears in the delay slot; the shift is then always executed, whether the branch condition is true or false.
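That the two versions compute the same result can be checked behaviorally, treating the instruction placed after the branch as a delay slot that executes whether or not the branch is taken (a sketch; the branch is modeled as "loop again while R2 has not reached zero", the evident intent of the counter):

```python
# Sketch: behavioral comparison of the original and reordered loops.
def original(r1, r2, r3):
    # LOOP: Shift_left R1; Decrement R2; branch back while R2 != 0;
    # NEXT: Add R1,R3.
    while True:
        r1 <<= 1
        r2 -= 1
        if r2 == 0:
            break
    return r1 + r3

def reordered_with_delay_slot(r1, r2, r3):
    # LOOP: Decrement R2; branch back while R2 != 0; Shift_left R1 in
    # the delay slot (executed on every iteration, taken or not);
    # NEXT: Add R1,R3.
    while True:
        r2 -= 1
        taken = r2 != 0
        r1 <<= 1  # delay-slot instruction: always executed
        if not taken:
            break
    return r1 + r3

print(original(1, 3, 10), reordered_with_delay_slot(1, 3, 10))
```

Both versions shift R1 exactly R2 times before the final add, so their results agree.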


    Delayed branch (contd..)

Figure: two-stage (F, E) execution trace of the reordered loop over clock cycles 1-8: Decrement, Branch, Shift (delay slot) for an iteration in which the branch is taken, followed by Decrement, Branch, Shift (delay slot), Add for the final iteration in which it is not. The Shift instruction is executed whether or not the branch is taken.


    Delayed branch (contd..)

Logically, the program executes as if the branch instruction were placed after the Shift instruction: branching takes place one instruction later than where the branch instruction appears in the reordered instruction sequence. Hence, this technique is termed delayed branch. Delayed branching requires reordering as many instructions as there are delay slots. It is usually possible to find one instruction to fill one delay slot, but difficult to find two or more instructions to fill two or more delay slots.


    Branch prediction

To reduce the branch penalty associated with conditional branches, we can predict whether the branch will be taken.

Simplest form of branch prediction: assume that the branch will not take place, and continue to fetch instructions in sequential execution order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis.

Speculative execution means that the processor is executing instructions before it is certain that they are in the correct sequence. Processor registers and memory locations should not be updated until the sequence is confirmed.

If the branch prediction turns out to be wrong, the instructions that were executed on a speculative basis and their data must be purged, and the correct sequence of instructions must be fetched and executed.


Branch prediction (contd..)

Clock cycle    1   2   3      4   5   6
I1 (Compare)   F1  D1  E1     W1
I2 (Branch>0)      F2  D2/P2  E2
I3                     F3     D3  X
I4                            F4  X
Ik                                Fk  Dk

I1 is a compare instruction and I2 is a branch instruction. Branch prediction takes place in cycle 3, when I2 is being decoded; I3 is being fetched at that time.

The fetch unit predicts that the branch will not be taken, and continues to fetch I4 in cycle 4 while I3 is being decoded.

Results of I1 are available in cycle 3, and the fetch unit evaluates the branch condition in cycle 4. If the branch prediction is incorrect, the fetch unit realizes it at this point: I3 and I4 are discarded (marked X above) and Ik is fetched from the branch target address.


    Branch prediction (contd..)

The decision about which way to predict the result of the branch instruction (taken or not taken) may be made in hardware, depending on whether the target address of the branch is lower or higher than the address of the branch instruction itself:

If the target address is lower, the branch is predicted as taken (a backward branch usually closes a loop).

If the target address is higher, the branch is predicted as not taken.

Branch prediction can also be handled by the compiler: the compiler sets a branch prediction bit to 0 or 1 to indicate the desired behavior, and the instruction fetch unit checks this bit to predict whether the branch will be taken.
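The hardware rule above ("backward taken, forward not taken") is a one-line predicate. A minimal sketch (the addresses are illustrative byte addresses, not taken from the slides):

```python
def predict_taken(branch_addr, target_addr):
    """Static prediction: a branch to a lower address (backward, e.g. the
    bottom of a loop) is predicted taken; a branch to a higher address
    (forward, e.g. skipping an if-body) is predicted not taken."""
    return target_addr < branch_addr

# A loop-closing branch at address 0x120 jumping back to 0x100:
assert predict_taken(0x120, 0x100)      # backward -> predicted taken
# A forward branch skipping ahead from 0x100 to 0x140:
assert not predict_taken(0x100, 0x140)  # forward -> predicted not taken
```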


    Branch prediction (contd..)

If the branch prediction decision is the same every time a given instruction is executed, it is static branch prediction.

If the prediction decision may change depending on the execution history, it is dynamic branch prediction.


    Branch prediction (contd..)

The branch prediction algorithm may be described as a machine with two states:

LT : Branch is likely to be taken
LNT: Branch is likely not to be taken

Let the initial state of the machine be LNT. When the branch instruction is executed and the branch is taken, the machine moves to state LT; if the branch is not taken, it remains in state LNT.

When the same branch instruction is executed the next time, the branch is predicted as taken if the state of the machine is LT, else it is predicted as not taken.

State transitions (BT = branch taken, BNT = branch not taken):

LNT --BT--> LT     LNT --BNT--> LNT
LT  --BT--> LT     LT  --BNT--> LNT
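The two-state machine translates directly into a one-bit predictor. The sketch below (class and function names are my own) also replays the loop scenario discussed on the following slides: for a loop-closing branch taken on every pass except the last, the predictor mispredicts the first and last pass on every entry into the loop.

```python
class OneBitPredictor:
    """Two-state (LNT/LT) branch predictor: one bit of history."""

    def __init__(self):
        self.state = "LNT"           # initial state: likely not taken

    def predict(self):
        return self.state == "LT"    # True means "predict taken"

    def update(self, taken):
        self.state = "LT" if taken else "LNT"

def mispredictions_per_loop_entry(passes, entries=2):
    """Branch at the bottom of a loop: taken on every pass but the last."""
    p = OneBitPredictor()
    counts = []
    for _ in range(entries):
        wrong = 0
        for i in range(passes):
            taken = (i < passes - 1)
            if p.predict() != taken:
                wrong += 1
            p.update(taken)
        counts.append(wrong)
    return counts

# Two mispredictions on every entry: the first pass (state still LNT)
# and the last pass (branch falls through while the state is LT).
```

Running `mispredictions_per_loop_entry(5)` gives `[2, 2]`, which is exactly the behavior described on the next slide.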


    Branch prediction (contd..)

This scheme requires only one bit of history information for each branch instruction.

It works well inside loops: once a loop is entered, the branch instruction that controls the looping yields the same result on every pass until the last one. On the last pass, the branch prediction turns out to be incorrect, and the branch history state machine changes to the opposite state.

Consequently, if the same loop is entered again and there is more than one pass, the state machine also mispredicts the first pass.

Better performance may be achieved by keeping more execution history.


    Branch prediction (contd..)

ST : Strongly likely to be taken
LT : Likely to be taken
LNT: Likely not to be taken
SNT: Strongly likely not to be taken

(State-transition diagram: each state has two outgoing edges, labeled BT for branch taken and BNT for branch not taken, as described below.)

The initial state of the algorithm is LNT. After the branch instruction is executed, if the branch is taken, the state is changed to ST.

For a branch instruction, the fetch unit predicts that the branch will be taken if the state is ST or LT; otherwise it predicts that the branch will not be taken.

In state SNT:
- The prediction is that the branch is not taken.
- If the branch is actually taken, the state changes to LNT.
- The next time the branch is encountered, the prediction again is that it is not taken.
- If the prediction is wrong the second time, the state changes to ST.
- After that, the branch is predicted as taken.


    Branch prediction (contd..)

Consider a loop with a branch instruction at the end, and let the initial state of the branch prediction algorithm be LNT.

In the first pass, the algorithm predicts that the branch is not taken. This prediction is incorrect, and the state changes to ST.

In the subsequent passes, the algorithm predicts that the branch is taken. This prediction is correct except in the last pass: there the branch is not taken, and the state changes from ST to LT.

When the loop is entered the second time, the algorithm predicts that the branch is taken, and this prediction is correct.
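The four-state algorithm can be simulated the same way. The slides specify only some transitions (LNT to ST on a taken branch, ST to LT on a not-taken branch, and SNT's behavior); the transitions marked "assumed" below are my own fill-ins following the usual convention. The simulation reproduces the walkthrough above: two mispredictions on the first execution of the loop, then only the unavoidable last-pass misprediction.

```python
class FourStatePredictor:
    """ST/LT/LNT/SNT predictor from the slides. Transitions marked
    'assumed' are not spelled out in the text."""

    TRANS = {
        ("ST",  True): "ST",  ("ST",  False): "LT",
        ("LT",  True): "ST",  ("LT",  False): "LNT",  # both assumed
        ("LNT", True): "ST",  ("LNT", False): "SNT",  # False case assumed
        ("SNT", True): "LNT", ("SNT", False): "SNT",
    }

    def __init__(self, state="LNT"):
        self.state = state

    def predict(self):
        return self.state in ("ST", "LT")   # predict taken

    def update(self, taken):
        self.state = self.TRANS[(self.state, taken)]

def loop_mispredictions(passes, entries=2):
    """Loop-closing branch: taken on every pass except the last."""
    p = FourStatePredictor()
    counts = []
    for _ in range(entries):
        wrong = 0
        for i in range(passes):
            taken = (i < passes - 1)
            wrong += (p.predict() != taken)
            p.update(taken)
        counts.append(wrong)
    return counts

# First entry: mispredicts the first pass (LNT) and the last pass (ST -> LT).
# Later entries: only the unavoidable last-pass misprediction remains.
```

`loop_mispredictions(5)` returns `[2, 1]`: compare this with the one-bit predictor, which mispredicts twice on every entry.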


    Branch prediction (contd..)

The branch prediction algorithm mispredicts the outcome of the branch instruction in the first pass. The prediction in the first pass depends on the initial state of the algorithm; if the initial state can be set correctly, this misprediction can be avoided.

The information necessary to set the initial state can be provided by the static prediction schemes: comparing the branch target address with the address of the branch instruction, or checking the branch prediction bit set by the compiler.

For a branch instruction at the end of a loop, the initial state should be LT; for a branch instruction at the start of a loop, the initial state should be LNT.

With this, the only misprediction that occurs is on the final pass through the loop, and this misprediction is unavoidable.


Topics covered: Pipelining

    Influence on Instruction Sets


    Overview

Some instructions are much better suited to pipelined execution than others. Two aspects of instruction-set design that affect this are:

Addressing modes

Condition code flags


    Addressing Modes

Addressing modes include simple ones and complex ones. In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline:

Side effects

The extent to which complex addressing modes cause the pipeline to stall

Whether a given mode is likely to be used by compilers


    Recall

Clock cycle  1   2   3   4   5   6   7
I1           F1  D1  E1  W1
I2 (Load)        F2  D2  E2  M2  W2
I3                   F3  D3  --  E3  W3
I4                       F4  --  D4  E4
I5                           F5  --  D5

Figure 8.5. Effect of a Load instruction on pipeline timing.

Load X(R1), R2
Load (R1), R2


    Complex Addressing Mode

Load (X(R1)), R2

Clock cycle        1  2  3       4         5           6  7
Load               F  D  X+[R1]  [X+[R1]]  [[X+[R1]]]  W
Next instruction      F  --      --        --          D  E ...

(a) Complex addressing mode

The operand address X+[R1] is computed in cycle 3, the pointer [X+[R1]] is fetched in cycle 4, and the operand [[X+[R1]]] is fetched in cycle 5 and forwarded; the next instruction stalls behind the Load.


    Simple Addressing Mode

The same access rewritten with simple addressing modes:

Add  #X, R1, R2
Load (R2), R2
Load (R2), R2

Clock cycle       1  2  3       4         5            6  7
Add               F  D  X+[R1]  W
Load                 F  D       [X+[R1]]  W
Load                    F       D         [[X+[R1]]]   W
Next instruction                F         D            E  W

(b) Simple addressing mode


    Addressing Modes

In a pipelined processor, complex addressing modes do not necessarily lead to faster execution.

Advantage: they reduce the number of instructions and save program space.

Disadvantages: they cause the pipeline to stall, require more hardware to decode, and are not convenient for the compiler to work with.

Conclusion: complex addressing modes are not suitable for pipelined execution.
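The two timing diagrams above can be reduced to a cycle count. The toy model below (entirely my own construction: a 4-stage F/D/E/W in-order pipeline where a multi-cycle E stage stalls everything behind it) shows why the complex mode buys no time once the stalls are accounted for.

```python
def completion_cycle(e_cycles):
    """Cycle in which the last of a sequence of instructions finishes in
    a 4-stage (F, D, E, W) in-order pipeline, where e_cycles[i] is the
    number of E cycles instruction i needs. Write-port conflicts are
    ignored; this is a back-of-the-envelope model, not from the slides."""
    done = 0                              # last cycle the E stage was busy
    for i, e in enumerate(e_cycles):
        start_e = max(done + 1, i + 3)    # E starts after F (i+1), D (i+2),
        done = start_e + e - 1            # and after the E stage frees up
    return done + 1                       # one more cycle for W

# One complex-mode Load needing 3 E cycles vs. Add + Load + Load with
# one E cycle each: the operand is ready in the same cycle either way.
cycles_complex = completion_cycle([3])       # -> 6
cycles_simple  = completion_cycle([1, 1, 1]) # -> 6
```

The pipelined sequence of three simple instructions finishes no later than the single complex instruction, which is the point of the conclusion above.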


    Addressing Modes

Good addressing modes should have these features:

Access to an operand does not require more than one access to memory.

Only load and store instructions access memory operands.

The addressing modes used do not have side effects.

Register, register indirect, and index modes satisfy these conditions.


Condition Codes

If an optimizing compiler attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur, it must ensure that the reordering does not change the outcome of the computation.

The dependency introduced by the condition-code flags reduces the flexibility available for the compiler to reorder instructions.


Condition Codes

Add       R1,R2
Compare   R3,R4
Branch=0  . . .

(a) A program fragment

Compare   R3,R4
Add       R1,R2
Branch=0  . . .

(b) Instructions reordered

Figure 8.17. Instruction reordering.


Condition Codes

Two conclusions:

To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.

The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.


Topics covered: Pipelining

    Datapath and Control Considerations

Figure 7.8. Three-bus organization of the datapath.

[Figure: the PC, instruction decoder, IR, register file, Constant 4 (selected via a MUX), ALU, MDR, and MAR, interconnected by three buses A, B, and C; an incrementer updates the PC, the MAR drives the address lines, and the MDR connects to the memory bus data lines.]


Pipelined Design

Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.

[Figure: separate instruction and data caches; the PC and an incrementer feed IMAR (memory address for instruction fetches); fetched instructions enter an instruction queue and the instruction decoder; DMAR supplies the memory address for data accesses, with separate MDR/Read and MDR/Write registers; interstage buffers A, B, and R surround the ALU on buses A, B, and C; control signals travel through a control signal pipeline.]

Changes made for pipelined execution:

- Separate instruction and data caches
- PC is connected to IMAR
- DMAR
- Separate MDRs for reading and writing
- Buffers for the ALU
- Instruction queue
- Buffered instruction decoder output

These allow the following operations to proceed in parallel:

- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two registers
- Writing into one register in the register file
- Performing an ALU operation


Topics covered: Pipelining

    Superscalar Operation



    Overview

The maximum throughput of a pipelined processor is one instruction per clock cycle.

If we equip the processor with multiple processing units, so that several instructions can be handled in parallel in each processing stage, then several instructions start execution in the same clock cycle. This is called multiple issue.

Processors capable of achieving an instruction execution throughput of more than one instruction per cycle are known as superscalar processors.

Multiple issue requires a wider path to the cache and multiple execution units.


Superscalar

Figure 8.19. A processor with two execution units.

[Figure: an instruction fetch unit (F) fills an instruction queue; a dispatch unit issues instructions to two execution units, an integer unit and a floating-point unit; completed results pass to a common write stage (W: write results).]


Timing

Clock cycle  1   2   3    4    5    6   7
I1 (Fadd)    F1  D1  E1A  E1B  E1C  W1
I2 (Add)     F2  D2  E2   W2
I3 (Fsub)        F3  D3   E3   E3   E3  W3
I4 (Sub)         F4  D4   E4   W4

Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19, assuming no hazards are encountered.


Out-of-Order Execution

Issues raised by out-of-order execution:

- Hazards
- Exceptions: imprecise exceptions vs. precise exceptions

[Figure (a) Delayed write: the instruction flow of Figure 8.20 (I1 Fadd, I2 Add, I3 Fsub, I4 Sub, clock cycles 1-7), with the register writes of the short integer instructions delayed so that all writes W1-W4 still occur in program order.]


Execution Completion

It is desirable to use out-of-order execution, so that an execution unit is freed to execute other instructions as soon as possible. At the same time, instructions must be completed in program order to allow precise exceptions. Two mechanisms support this:

- The use of temporary registers
- A commitment unit

[Figure (b) Using temporary registers: the instruction flow of Figure 8.20 over clock cycles 1-7; I2 and I4 write their results into temporary registers (TW2, TW4) as soon as their execution completes, and the permanent register writes W2 and W4 are performed later, in program order.]


Topics covered: Pipelining

    Performance Considerations



    Overview

The execution time T of a program that has a dynamic instruction count N is given by:

T = (N × S) / R

where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate.

Instruction throughput is defined as the number of instructions executed per second:

Ps = R / S
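The two formulas can be checked with a short calculation. The numbers below are illustrative only (they are not from the slides):

```python
def execution_time(n, s, r):
    """T = N * S / R: N dynamic instructions, S average cycles per
    instruction, R clock rate in Hz. Returns seconds."""
    return n * s / r

def throughput(r, s):
    """Ps = R / S: instructions executed per second."""
    return r / s

# Example: 1e9 dynamic instructions, S = 1.25 cycles per instruction on
# average, and a 500 MHz clock.
t = execution_time(1e9, 1.25, 500e6)    # -> 2.5 seconds
p = throughput(500e6, 1.25)             # -> 4e8 instructions per second
```

Note that reducing S (e.g. through pipelining) improves both T and Ps for a fixed clock rate R.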


    Overview

An n-stage pipeline has the potential to increase the throughput by n times. However, the only real measure of performance is the total execution time of a program; higher instruction throughput will not necessarily lead to higher performance.

Two questions regarding pipelining:

How much of this potential increase in instruction throughput can be realized in practice?

What is a good value of n?


    Number of Pipeline Stages

Since an n-stage pipeline has the potential to increase the throughput by n times, why not use a 10,000-stage pipeline?

As the number of stages increases, the probability of the pipeline being stalled increases, and the inherent delay in the basic operations increases.

There are also hardware considerations (area, power, complexity, ...).
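The diminishing returns can be illustrated with a toy model (entirely my own construction, not from the slides): assume a fraction p of instructions stalls the pipeline for a full refill of n - 1 cycles, so CPI = 1 + p(n - 1), and the speedup over an unpipelined design that takes n cycles per instruction is n / (1 + p(n - 1)).

```python
def speedup(n, p):
    """Speedup of an n-stage pipeline over unpipelined execution, in a
    toy model where a fraction p of instructions stalls for a full
    refill of n - 1 cycles: speedup = n / (1 + p * (n - 1))."""
    return n / (1 + p * (n - 1))

# With p = 0.05, deeper pipelines help less and less:
# speedup(5, 0.05)     ~ 4.17
# speedup(20, 0.05)    ~ 10.26
# speedup(10000, 0.05) ~ 19.96  -- bounded by 1/p = 20, far from 10000x
```

As n grows the speedup saturates near 1/p, which is why a 10,000-stage pipeline gains almost nothing over a much shallower one in this model, even before latch overhead and the hardware considerations above are counted.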