Lec5a Supp


Transcript of Lec5a Supp


  • Review: Major Components of a Computer

    [Figure: block diagram of a computer: a Processor (Control and Datapath, with a Cache) connected to Main Memory and Secondary Memory (Disk), plus Devices for Input and Output.]

  • Processor-Memory Performance Gap

    [Figure: log-scale performance vs. year, 1980 to 2004. Processor performance grows 55%/year (2X/1.5yr, "Moore's Law"); DRAM performance grows 7%/year (2X/10yrs). The processor-memory performance gap grows 50%/year.]

  • The Memory Wall

    Processor vs DRAM speed disparity continues to grow

    [Figure: log-scale clocks per instruction (core) and clocks per DRAM access (memory), from VAX/1980 through PPro/1996 to 2010+.]

    Good memory hierarchy (cache) design is increasingly important to overall performance

  • The Memory Hierarchy Goal

    Fact: Large memories are slow and fast memories are small

    How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
    - With hierarchy
    - With parallelism


  • Memory Hierarchy Technologies

    Caches use SRAM for speed and technology compatibility
    - Fast (typical access times of 0.5 to 2.5 nsec)
    - Low density (6 transistor cells), higher power, expensive ($2000 to $5000 per GB in 2008)
    - Static: content will last "forever" (as long as power is left on)

    Main memory uses DRAM for size (density)
    - Slower (typical access times of 50 to 70 nsec)
    - High density (1 transistor cells), lower power, cheaper ($20 to $75 per GB in 2008)
    - Dynamic: needs to be "refreshed" regularly (~ every 8 ms); refresh consumes 1% to 2% of the active cycles of the DRAM
    - Addresses divided into 2 halves (row and column)
      - RAS or Row Access Strobe triggering the row decoder
      - CAS or Column Access Strobe triggering the column selector

  • The Memory Hierarchy: Why Does it Work?

    Temporal Locality (locality in time)
    - If a memory location is referenced, then it will tend to be referenced again soon
    - So keep most recently accessed data items closer to the processor

    Spatial Locality (locality in space)
    - If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
    - So move blocks consisting of contiguous words closer to the processor (see the sketch below)
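    To make the two kinds of locality concrete, here is a minimal C sketch (the array size and function names are illustrative, not from the slides). The scalar sum is reused on every iteration (temporal locality), and the row-major sweep touches consecutive addresses, so each fetched cache block supplies several elements (spatial locality); swapping the loop order defeats the spatial part.

        #include <stddef.h>

        #define N 1024

        /* Temporal locality: "sum" is touched on every iteration.
           Spatial locality: a[i][j] walks consecutive addresses, so each
           cache block fetched supplies several consecutive elements. */
        long sum_row_major(int a[N][N]) {
            long sum = 0;
            for (size_t i = 0; i < N; i++)
                for (size_t j = 0; j < N; j++)
                    sum += a[i][j];      /* stride-1: cache friendly */
            return sum;
        }

        /* Swapping the loops gives a stride of N*sizeof(int) bytes,
           defeating spatial locality; it typically misses far more. */
        long sum_col_major(int a[N][N]) {
            long sum = 0;
            for (size_t j = 0; j < N; j++)
                for (size_t i = 0; i < N; i++)
                    sum += a[i][j];      /* large stride: cache hostile */
            return sum;
        }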

  • The Memory Hierarchy: Terminology

    Block (or line): the minimum unit of information that is present (or not) in a cache

    Hit Rate: the fraction of memory accesses found in a level of the memory hierarchy
    - Hit Time: time to access that level, which consists of RAM access time + time to determine hit/miss

    Miss Rate: the fraction of memory accesses not found in a level of the memory hierarchy, i.e., 1 - (Hit Rate)
    - Miss Penalty: time to replace a block in that level with the corresponding block from a lower level, which consists of: time to access the block in the lower level + time to transmit that block to the level that experienced the miss + time to insert the block in that level + time to pass the block to the requestor

    Hit Time << Miss Penalty

  • Characteristics of the Memory Hierarchy

    [Figure: memory hierarchy pyramid. Processor at the top; 4-8 bytes (word) transferred to/from L1$; 8-32 bytes (block) to/from L2$; 1 to 4 blocks to/from Main Memory (MM); 1,024+ bytes (disk sector = page) to/from Secondary Memory (SM). Distance from the processor in access time increases going down the pyramid, as does the (relative) size of the memory at each level.]

    Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM

  • How is the Hierarchy Managed?

    registers <-> memory
    - by compiler (programmer?)

    cache <-> main memory
    - by the cache controller hardware

    main memory <-> disks
    - by the operating system (virtual memory)
    - virtual to physical address mapping assisted by the hardware (TLB)
    - by the programmer (files)

  • Cache Basics

    Two questions to answer (in hardware):
    - Q1: How do we know if a data item is in the cache?
    - Q2: If it is, how do we find it?

    Direct mapped
    - Each memory block is mapped to exactly one block in the cache, so lots of lower level blocks must share blocks in the cache
    - Address mapping (to answer Q2): (block address) modulo (# of blocks in the cache)
    - Have a tag associated with each cache block that contains the address information (the upper portion of the address) required to identify the block (to answer Q1); a sketch of this logic follows below
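    As a sketch of the Q1/Q2 logic in C (the 1024-block size and all names here are hypothetical, chosen only for illustration):

        #include <stdbool.h>
        #include <stdint.h>

        #define NUM_BLOCKS  1024   /* 2^10 blocks, so a 10-bit index */
        #define INDEX_BITS  10
        #define OFFSET_BITS 2      /* byte within a 32-bit word */

        typedef struct {
            bool     valid;
            uint32_t tag;
            uint32_t data;
        } cache_line;

        /* Q2 (where to look): index = (block address) modulo (# blocks).
           With a power-of-two block count the modulo is just a bit mask. */
        static uint32_t index_of(uint32_t addr) {
            return (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
        }

        /* Q1 (is it there): the stored tag must match the upper address bits. */
        static bool is_hit(const cache_line cache[], uint32_t addr) {
            uint32_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
            const cache_line *line = &cache[index_of(addr)];
            return line->valid && line->tag == tag;
        }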

  • Caching: A Simple First Example

    [Figure: a 4-entry direct mapped cache (Valid, Tag, Data; indexes 00, 01, 10, 11) next to a 16-word Main Memory (addresses 0000xx through 1111xx). One word blocks; the two low order address bits define the byte in the word (32b words).]

    Q2: How do we find it? Use the next 2 low order memory address bits, the index, to determine which cache block (i.e., modulo the number of blocks in the cache)

    Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache

    (block address) modulo (# of blocks in the cache)



  • Direct Mapped Cache

    Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache - all blocks initially marked as not valid.

    0 miss, 1 miss, 2 miss, 3 miss: the first four references fill blocks 00 through 11 (tag 00, data Mem(0) through Mem(3))
    4 miss: address 4 maps to block 00, evicting Mem(0) (tag becomes 01, data Mem(4))
    3 hit, 4 hit
    15 miss: address 15 maps to block 11, evicting Mem(3) (tag becomes 11, data Mem(15))

    - 8 requests, 6 misses (a simulator sketch follows below)
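    The whole trace can be checked with a few lines of C; a sketch (the 4-block direct mapped cache mirrors the figure):

        #include <stdbool.h>
        #include <stdio.h>

        #define BLOCKS 4

        int main(void) {
            int  tags[BLOCKS];
            bool valid[BLOCKS] = { false };
            int  refs[] = { 0, 1, 2, 3, 4, 3, 4, 15 };  /* word addresses */
            int  misses = 0;

            for (int i = 0; i < 8; i++) {
                int idx = refs[i] % BLOCKS;  /* (block address) mod (# blocks) */
                int tag = refs[i] / BLOCKS;
                if (!valid[idx] || tags[idx] != tag) {
                    misses++;                /* miss: install the block */
                    valid[idx] = true;
                    tags[idx]  = tag;
                }
            }
            printf("8 requests, %d misses\n", misses);  /* prints 6 */
            return 0;
        }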


  • Multiword Block Direct Mapped Cache

    [Figure: a direct mapped cache with four words per block. The address 31 30 . . . 13 12 11 . . . 4 3 2 1 0 is split into a 2-bit byte offset (bits 1:0), a 2-bit block offset (bits 3:2), an 8-bit index (bits 11:4) selecting one of 256 entries (0, 1, 2, ..., 253, 254, 255), and a 20-bit tag compared against the stored tag; on a hit, the block offset selects one 32-bit data word.]

    What kind of locality are we taking advantage of?


  • Taking Advantage of Spatial Locality

    Same main memory word reference string, 0 1 2 3 4 3 4 15, now with two-word cache blocks. Start with an empty cache - all blocks initially marked as not valid.

    0 miss: fetches the two-word block Mem(1) Mem(0); 1 hit
    2 miss: fetches Mem(3) Mem(2); 3 hit
    4 miss: address 4 evicts Mem(1) Mem(0) for Mem(5) Mem(4); 3 hit, 4 hit
    15 miss: evicts Mem(3) Mem(2) for Mem(15) Mem(14)

    - 8 requests, 4 misses

  • Miss Rate vs Block Size vs Cache Size

    [Figure: miss rate (%) vs block size (16 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.]

    Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)

  • Cache Field Sizes

    The number of bits in a cache includes both the storage for data and for the tags
    - 32-bit byte address
    - For a direct mapped cache with 2^n blocks, n bits are used for the index
    - For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word
    - The tag field size is then 32 - (n + m + 2)

    The total number of bits in a direct-mapped cache is then 2^n x (block size + tag field size + valid field size)

    How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks assuming a 32-bit address? (A worked answer follows below.)
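    A worked answer (assuming, as is conventional, one valid bit per block):

        16KB of data = 2^12 words = 2^10 four-word blocks, so n = 10 and m = 2
        tag field = 32 - (10 + 2 + 2) = 18 bits
        total bits = 2^10 x (4 x 32 + 18 + 1) = 1024 x 147 = 147 Kbits

    or about 18.4 KB, roughly 1.15 times the 16KB of data alone.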

  • Handling Cache Hits

    Read hits (I$ and D$)
    - this is what we want!

    Write hits (D$ only)
    - require the cache and memory to be consistent
      - always write the data into both the cache block and the next level in the memory hierarchy (write-through)
      - writes run at the speed of the next level in the memory hierarchy, so slow! Or can use a write buffer and stall only if the write buffer is full
    - allow cache and memory to be inconsistent
      - write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is evicted)
      - need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted; can use a write buffer to help buffer write-backs of dirty blocks

  • Sources of Cache Misses

    Compulsory (cold start or process migration, first reference)
    - First access to a block; a "cold" fact of life, not a whole lot you can do about it. If you are going to run millions of instructions, compulsory misses are insignificant
    - Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)

    Capacity: the cache cannot contain all the blocks accessed by the program
    - Solution: increase cache size (may increase access time)

    Conflict (collision): multiple memory locations mapped to the same cache location
    - Solution 1: increase cache size
    - Solution 2: increase associativity (stay tuned) (may increase access time)

  • Handling Cache Misses (Single Word Blocks)

    Read misses (I$ and D$)
    - stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume

    Write misses (D$ only)
    1. stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume
    or
    2. Write allocate: just write the word into the cache updating both the tag and data; no need to check for cache hit, no need to stall
    or
    3. No-write allocate: skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full

  • Multiword Block Considerations

    Read misses (I$ and D$)
    - Processed the same as for single word blocks: a miss returns the entire block from memory
    - Miss penalty grows as block size grows
      - Early restart: processor resumes execution as soon as the requested word of the block is returned
      - Requested word first: requested word is transferred from the memory to the cache (and processor) first
    - Nonblocking cache: allows the processor to continue to access the cache while the cache is handling an earlier miss

    Write misses (D$)
    - If using write allocate, must first fetch the block from memory and then write the word to the block (or could end up with a garbled block in the cache, e.g., for 4 word blocks: a new tag, one word of data from the new block, and three words of data from the old block)

  • Memory Systems that Support Caches

    The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways

    [Figure: one word wide organization: on-chip CPU and Cache connected over a one-word-wide bus (32-bit address and data per cycle) to DRAM Memory.]

    Assume
    1. 1 memory bus clock cycle to send the address
    2. 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, 4th words (column access time)
    3. 1 memory bus clock cycle to return a word of data

    Memory-Bus to Cache bandwidth
    - number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle

  • Review: (DDR) SDRAM Operation

    After a row is read into the N x M SRAM register
    - Input CAS as the starting "burst" address along with a burst length
    - Transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
      - The memory bus clock controls transfer of successive words in the burst

    [Figure: DRAM array of N rows x N cols with a Row Address decoder, an N x M SRAM row buffer, and an M-bit output. Timing diagram: RAS with the Row Address, then CAS with the Col Address (Column, Column +1, ...), then the 1st M-bit access followed by the 2nd, 3rd, and 4th M-bit transfers within one cycle time.]

  • DRAM Size Increase

    Add a table like figure 5.12 to show DRAM growth since 1980


  • One Word Wide Bus, One Word Blocks

    [Figure: on-chip CPU and Cache connected by a one-word-wide bus to DRAM Memory.]

    If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:

    1  memory bus clock cycle to send the address
    15 memory bus clock cycles to read DRAM
    1  memory bus clock cycle to return the data
    17 total clock cycles miss penalty

    Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/17 = 0.235 bytes per memory bus clock cycle


  • One Word Wide Bus, Four Word Blocks

    What if the block size is four words and each word is in a different DRAM row?

    1  cycle to send the 1st address
    4 x 15 = 60 cycles to read DRAM
    1  cycle to return the last data word
    62 total clock cycles miss penalty

    [Figure: the 60 DRAM cycles are four back-to-back 15-cycle row accesses.]

    Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/62 = 0.258 bytes per memory bus clock cycle


  • One Word Wide Bus, Four Word Blocks

    What if the block size is four words and all the words are in the same DRAM row?

    1  cycle to send the 1st address
    15 + 3 x 5 = 30 cycles to read DRAM
    1  cycle to return the last data word
    32 total clock cycles miss penalty

    [Figure: one 15-cycle row access followed by three 5-cycle column accesses.]

    Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/32 = 0.5 bytes per memory bus clock cycle


  • Interleaved Memory, One Word Wide Bus

    For a block size of four words, with four interleaved memory banks:

    1  cycle to send the 1st address
    15 cycles to read the DRAM banks (the four 15-cycle reads overlap, staggered by one bus cycle)
    4 x 1 = 4 cycles to return the data words
    20 total clock cycles miss penalty

    [Figure: on-chip CPU and Cache on a one-word-wide bus to Memory banks 0 through 3; the four 15-cycle bank reads proceed in parallel.]

    Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/20 = 0.8 bytes per memory bus clock cycle (the arithmetic for all three organizations is sketched below)
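    The last three slides differ only in how the per-word DRAM accesses overlap; a C sketch of the arithmetic (cycle counts are the slides' assumed values) reproduces the 62-, 32-, and 20-cycle penalties and the 0.258, 0.5, and 0.8 bytes/cycle bandwidths:

        #include <stdio.h>

        int main(void) {
            int addr = 1, row = 15, col = 5, xfer = 1, words = 4;

            /* one-word-wide bus, each word in a different DRAM row */
            int diff_rows = addr + words * row + xfer;              /* 62 */

            /* same row: one row cycle, then three column accesses */
            int same_row  = addr + row + (words - 1) * col + xfer;  /* 32 */

            /* four interleaved banks: row reads overlap, then the
               words return over the bus one cycle apiece */
            int interleaved = addr + row + words * xfer;            /* 20 */

            printf("miss penalties: %d %d %d cycles\n",
                   diff_rows, same_row, interleaved);
            printf("bandwidth: %.3f %.3f %.3f bytes/cycle\n",
                   16.0 / diff_rows, 16.0 / same_row, 16.0 / interleaved);
            return 0;
        }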

  • DRAM Memory System Summary

    It's important to match the cache characteristics
    - caches access one block at a time (usually more than one word)

    with the DRAM characteristics
    - use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache

    and with the memory-bus characteristics
    - make sure the memory-bus can support the DRAM access rates and patterns
    - with the goal of increasing the Memory-Bus to Cache bandwidth

  • Measuring Cache Performance

    Assuming cache hit costs are included as part of the normal CPU execution cycle, then

    CPU time = IC x CPI x CC
             = IC x (CPI_ideal + Memory-stall cycles) x CC
             = IC x CPI_stall x CC

    Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls)

    Read-stall cycles = reads/program x read miss rate x read miss penalty

    Write-stall cycles = (writes/program x write miss rate x write miss penalty) + write buffer stalls

    For write-through caches, we can simplify this to

    Memory-stall cycles = accesses/program x miss rate x miss penalty

  • Impacts of Cache Performance

    Relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI)
    - The memory speed is unlikely to improve as fast as processor cycle time. When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss
    - The lower the CPI_ideal, the more pronounced the impact of stalls

    A processor with a CPI_ideal of 2, a 100 cycle miss penalty, 36% load/store instrs, and 2% I$ and 4% D$ miss rates:

    Memory-stall cycles = 2% x 100 + 36% x 4% x 100 = 3.44

    So CPI_stall = 2 + 3.44 = 5.44, more than twice the CPI_ideal!

    What if the CPI_ideal is reduced to 1? 0.5? 0.25?
    What if the D$ miss rate went up 1%? 2%?
    What if the processor clock rate is doubled (doubling the miss penalty)?
    (A sketch for trying these follows below.)
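    A sketch of this model in C (the miss rates, mix, and penalty are the slide's example values), which makes the "what if" questions easy to try:

        #include <stdio.h>

        /* CPI with memory stalls for split I$/D$: every instruction
           accesses the I$, and ld_st_frac of them also access the D$. */
        static double cpi_stall(double cpi_ideal, double penalty,
                                double i_miss, double d_miss,
                                double ld_st_frac) {
            double stalls = i_miss * penalty
                          + ld_st_frac * d_miss * penalty;
            return cpi_ideal + stalls;
        }

        int main(void) {
            printf("%.2f\n", cpi_stall(2.0, 100, 0.02, 0.04, 0.36)); /* 5.44 */
            printf("%.2f\n", cpi_stall(1.0, 100, 0.02, 0.04, 0.36)); /* 4.44 */
            return 0;
        }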

  • Average Memory Access Time (AMAT)

    A larger cache will have a longer access time. An increase in hit time will likely add another stage to the pipeline. At some point the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance.

    Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses

    AMAT = Time for a hit + Miss rate x Miss penalty

    What is the AMAT for a processor with a 20 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache access time of 1 clock cycle? (Worked below.)
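    A worked answer in the slide's own notation (treating the miss rate as misses per access):

        AMAT = 1 + 0.02 x 50 = 2 cycles = 2 x 20 psec = 40 psec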

  • Reducing Cache Miss Rates #1

    1. Allow more flexible block placement

    In a direct mapped cache a memory block maps to exactly one cache block

    At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache

    A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices), as sketched below:

    (block address) modulo (# sets in the cache)
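    The only changes from the direct mapped sketch earlier are that the modulo is taken over sets rather than blocks and that every way in the selected set is searched; a C sketch (sizes and names again illustrative):

        #include <stdbool.h>
        #include <stdint.h>

        #define WAYS        4
        #define SET_BITS    8
        #define SETS        (1u << SET_BITS)
        #define OFFSET_BITS 2

        typedef struct { bool valid; uint32_t tag; } way_t;

        /* Hit if any way in the selected set holds a valid, matching tag. */
        static bool set_assoc_hit(way_t cache[SETS][WAYS], uint32_t addr) {
            uint32_t set = (addr >> OFFSET_BITS) % SETS; /* (block addr) mod #sets */
            uint32_t tag = addr >> (OFFSET_BITS + SET_BITS);
            for (int w = 0; w < WAYS; w++)
                if (cache[set][w].valid && cache[set][w].tag == tag)
                    return true;
            return false;
        }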


  • Another Reference String Mapping

    Consider the main memory word reference string 0 4 0 4 0 4 0 4 in the direct mapped cache. Start with an empty cache - all blocks initially marked as not valid.

    0 and 4 both map to block 00, so every access evicts the other: miss, miss, miss, miss, miss, miss, miss, miss (the block alternates between 00 Mem(0) and 01 Mem(4))

    - 8 requests, 8 misses

    Ping pong effect due to conflict misses: two memory locations that map into the same cache block

  • Set Associative Cache Example

    [Figure: a 2-way set associative cache with two sets (Way 0 and Way 1, Sets 0 and 1; V, Tag, Data) next to a 16-word Main Memory (addresses 0000xx through 1111xx). One word blocks; the two low order address bits define the byte in the word (32b words).]

    Q2: How do we find it? Use the next 1 low order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)

    Q1: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache


  • Another Reference String Mapping

    Consider the same reference string, 0 4 0 4 0 4 0 4, in the 2-way set associative cache. Start with an empty cache - all blocks initially marked as not valid.

    0 miss (000 Mem(0) installed in one way), 4 miss (010 Mem(4) installed in the other way), then all six remaining accesses hit

    - 8 requests, 2 misses

    Solves the ping pong effect in a direct mapped cache due to conflict misses since now two memory locations that map into the same cache set can co-exist!

  • Four-Way Set Associative Cache

    2^8 = 256 sets, each with four ways (each with one block)

    [Figure: address bits 31 30 . . . 13 12 11 . . . 2 1 0 split into a 22-bit tag, an 8-bit index, and a byte offset. The index selects one of 256 sets (0, 1, 2, ..., 253, 254, 255) across four ways (each with V, Tag, and 32-bit Data); all four tags are compared in parallel, and a 4x1 select multiplexer steers the hit way's data out along with the Hit signal.]


  • Range of Set Associative Caches

    For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

    [Figure: address fields Tag | Index | Block offset | Byte offset. The tag is used for the compare, the index selects the set, and the block offset selects the word in the block. Moving index bits into the tag increases associativity; at one extreme is fully associative (only one set; the tag is all the bits except block and byte offset), at the other direct mapped (only one way; smaller tags, only a single comparator).]

  • Costs of Set Associative Caches

    When a miss occurs, which way's block do we pick for replacement?
    - Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time
      - Must have hardware to keep track of when each way's block was used relative to the other blocks in the set
      - For 2-way set associative, takes one bit per set: set the bit when a block is referenced (and reset the other way's bit); a sketch follows below

    N-way set associative cache costs
    - N comparators (delay and area)
    - MUX delay (set selection) before data is available
    - Data available after set selection (and Hit/Miss decision). In a direct mapped cache, the cache block is available before the Hit/Miss decision, so it is possible to just assume a hit, continue, and recover later if it was a miss
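    A sketch of that one-bit bookkeeping in C (the names are illustrative):

        #include <stdbool.h>

        /* For 2-way set associativity one bit per set suffices: touching
           way w marks it most recently used, which simultaneously marks
           the other way least recently used. */
        typedef struct { bool mru_way; } lru_state;

        static void touch(lru_state *s, int way) { s->mru_way = (way == 1); }

        /* On a miss, replace the way that was not used most recently. */
        static int victim(const lru_state *s) { return s->mru_way ? 0 : 1; }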

  • Benefits of Set Associative Caches

    The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation

    [Figure: miss rate vs associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB. Data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2003.]

    Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)

  • Reducing Cache Miss Rates #2

    2. Use multiple levels of caches

    With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of caches: normally a unified L2 cache (i.e., it holds both instructions and data) and in some cases even a unified L3 cache

    For our example: a CPI_ideal of 2, a 100 cycle miss penalty (to main memory), a 25 cycle miss penalty (to UL2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, and a 0.5% UL2$ miss rate

    CPI_stall = 2 + .02 x 25 + .36 x .04 x 25 + .005 x 100 + .36 x .005 x 100 = 3.54
    (as compared to 5.44 with no L2$; a sketch follows below)
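    Extending the earlier CPI sketch to two levels (the slide's values: L1 misses that hit in the UL2$ pay 25 cycles, UL2$ misses pay the full 100):

        #include <stdio.h>

        int main(void) {
            double l2_hit_pen = 25, mem_pen = 100, ld_st = 0.36;
            double i_miss = 0.02, d_miss = 0.04, l2_miss = 0.005;

            double stalls = i_miss * l2_hit_pen          /* I$ misses to UL2$ */
                          + ld_st * d_miss * l2_hit_pen  /* D$ misses to UL2$ */
                          + l2_miss * mem_pen            /* UL2$ misses, I side */
                          + ld_st * l2_miss * mem_pen;   /* UL2$ misses, D side */

            printf("CPI_stall = %.2f\n", 2.0 + stalls);  /* 3.54 */
            return 0;
        }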

  • Multilevel Cache Design Considerations

    Design considerations for L1 and L2 caches are very different
    - Primary cache should focus on minimizing hit time in support of a shorter clock cycle
      - Smaller with smaller block sizes
    - Secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times
      - Larger with larger block sizes
      - Higher levels of associativity

    The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so it can be smaller (i.e., faster) but have a higher miss rate

    For the L2 cache, hit time is less important than miss rate
    - The L2$ hit time determines L1$'s miss penalty
    - L2$ local miss rate >> the global miss rate

  • Using the Memory Hierarchy Well

    Include plots from Figure 5.18

  • Two Machines' Cache Parameters

                          Intel Nehalem                   AMD Barcelona
    L1 cache org. & size  Split I$ and D$; 32KB for       Split I$ and D$; 64KB for
                          each per core; 64B blocks      each per core; 64B blocks
    L1 associativity      4-way (I), 8-way (D) set       2-way set assoc.; LRU
                          assoc.; ~LRU replacement       replacement
    L1 write policy       write-back, write-allocate     write-back, write-allocate
    L2 cache org. & size  Unified; 256KB (0.25MB)        Unified; 512KB (0.5MB)
                          per core; 64B blocks           per core; 64B blocks
    L2 associativity      8-way set assoc.; ~LRU         16-way set assoc.; ~LRU
    L2 write policy       write-back, write-allocate     write-back, write-allocate
    L3 cache org. & size  shared by cores; 64B blocks    shared by cores; 64B blocks
    L3 associativity      16-way set assoc.              32-way set assoc.; evict block
    L3 write policy       write-back, write-allocate     write-back, write-allocate

  • Two Machines' Cache Parameters

                        Intel P4                    AMD Opteron
    L1 organization     Split I$ and D$             Split I$ and D$
    L1 cache size       8KB for D$, 96KB for        64KB for each of I$ and D$
                        trace cache (~I$)
    L1 block size       64 bytes                    64 bytes
    L1 associativity    4-way set assoc.            2-way set assoc.
    L1 replacement      ~LRU                        LRU
    L1 write policy     write-through               write-back
    L2 organization     Unified                     Unified
    L2 cache size       512KB                       1024KB (1MB)
    L2 associativity    8-way set assoc.            16-way set assoc.
    L2 replacement      ~LRU                        ~LRU
    L2 write policy     write-back                  write-back

  • FSM Cache Controller

    Key characteristics for a simple L1 cache
    - Direct mapped
    - Write-back using write allocate
    - Block size of 4 32-bit words (so 16B); cache size of 16KB (so 1024 blocks)
    - 18-bit tags, 10-bit index, 2-bit block offset, 2-bit byte offset, dirty bit, valid bit, LRU bits (if set associative)

    [Figure: the Cache & Cache Controller sits between the Processor and DDR DRAM. Processor side: 1-bit Read/Write, 1-bit Valid, 32-bit address, 32-bit data each way. Memory side: 1-bit Read/Write, 1-bit Valid, 32-bit address, 128-bit data each way.]

  • Four State Cache Controller

    [Figure: four-state FSM. Idle: on a valid CPU request go to Compare Tag. Compare Tag: if valid && hit, it is a cache hit (set valid, set tag, if write set dirty), mark cache ready and return to Idle; on a cache miss go to Allocate if the old block is clean, or to Write Back if the old block is dirty. Allocate: read the new block from memory; loop while memory is not ready, then go back to Compare Tag. Write Back: write the old block to memory; loop while memory is not ready, then go to Allocate.]
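    A sketch of the four-state controller in C (state names follow the figure; the boolean inputs stand in for the real datapath and memory handshake signals):

        #include <stdbool.h>

        typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } cc_state;

        static cc_state step(cc_state s, bool cpu_request, bool hit,
                             bool old_block_dirty, bool mem_ready) {
            switch (s) {
            case IDLE:                      /* wait for a valid CPU request */
                return cpu_request ? COMPARE_TAG : IDLE;
            case COMPARE_TAG:               /* hit: done; miss: maybe evict */
                if (hit) return IDLE;       /* mark cache ready */
                return old_block_dirty ? WRITE_BACK : ALLOCATE;
            case WRITE_BACK:                /* write the old block to memory */
                return mem_ready ? ALLOCATE : WRITE_BACK;
            case ALLOCATE:                  /* read the new block from memory */
                return mem_ready ? COMPARE_TAG : ALLOCATE;
            }
            return IDLE;
        }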


  • Cache Coherence in Multicores

    In future multicore processors it's likely that the cores will share a common physical address space, causing a cache coherence problem

    [Figure: Core 1 and Core 2, each with private L1 caches, above a unified (shared) L2. Both cores read X (both L1s hold X = 0); then Core 1 writes 1 to X, leaving X = 1 in Core 1's L1 and in the L2 while Core 2's L1 still holds the stale X = 0.]

  • A Coherent Memory System

    Any read of a data item should return the most recently written value of the data item
    - Coherence: defines what values can be returned by a read
      - Writes to the same location are serialized (two writes to the same location must be seen in the same order by all cores)
    - Consistency: determines when a written value will be returned by a read

    To enforce coherence, caches provide
    - Replication of shared data items in multiple cores' caches
      - Replication reduces both latency and contention for a read of a shared data item
    - Migration of shared data items to a core's local cache
      - Migration reduces the latency of the access to the data and the bandwidth demand on the shared memory (L2 in our example)

  • Cache Coherence Protocols

    Need a hardware protocol to ensure cache coherence, the most popular of which is snooping
    - The cache controllers monitor (snoop) on the broadcast medium (e.g., bus) with duplicate address tag hardware (so they don't interfere with the core's access to the cache) to determine if their cache has a copy of a block that is requested

    Write invalidate protocol: writes require exclusive access and invalidate all other copies
    - Exclusive access ensures that no other readable or writable copies of an item exist

    If two processors attempt to write the same data at the same time, one of them wins the race, causing the other core's copy to be invalidated. For the other core to complete, it must obtain a new copy of the data, which must now contain the updated value, thus enforcing write serialization

  • Handling Writes

    Ensuring that all other processors sharing data are informed of writes can be handled two ways:

    1. Write-update (write-broadcast): the writing core broadcasts new data over the bus and all copies are updated
    - All writes go to the bus: higher bus traffic
    - Since new values appear in caches sooner, can reduce latency

    2. Write-invalidate: the writing core issues an invalidation signal on the bus; cache snoops check to see if they have a copy of the data and, if so, invalidate their cache block (allowing multiple readers but only one writer)
    - Uses the bus only on the first write: lower bus traffic, so better use of bus bandwidth


  • Example of Snooping Invalidation

    [Figure: Core 1 and Core 2, each with private L1 I$ and D$, above a unified (shared) L2. Core 1 and Core 2 both read X (each L1 holds X = 0). Core 1 then writes 1 to X: the snoop invalidates Core 2's copy (X = I) while Core 1 holds X = 1. Core 2's next read of X misses.]

    When the second miss by Core 2 occurs, Core 1 responds with the value, canceling the response from the L2 cache (and also updating the L2 copy)


  • A Write-Invalidate CC Protocol

    [Figure: a three-state write-back, write-invalidate protocol with states Invalid, Shared (clean), and Modified (dirty). Signals from the core are in red; signals from the bus are in blue. Invalid goes to Shared on a core read (miss) and to Modified on a core write (miss). Shared stays in Shared on a read (hit or miss), goes to Modified on a core write (sending an invalidate), and goes to Invalid when it receives an invalidate (a write by another core to this block). Modified stays in Modified on a core read or write (hit or miss); on another core's read miss for this block it writes back the block and drops to Shared. A sketch of these transitions follows below.]
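    A sketch of those transitions as a C function (the event names follow the diagram's edges; the bus itself is abstracted away):

        typedef enum { INVALID, SHARED, MODIFIED } msi_state;
        typedef enum {
            CPU_READ, CPU_WRITE,            /* from this core */
            BUS_INVALIDATE, BUS_READ_MISS   /* snooped from other cores */
        } msi_event;

        static msi_state msi_next(msi_state s, msi_event e) {
            switch (e) {
            case CPU_READ:        /* a read miss from Invalid fetches the block */
                return (s == INVALID) ? SHARED : s;
            case CPU_WRITE:       /* sends an invalidate first if others share */
                return MODIFIED;
            case BUS_INVALIDATE:  /* another core is writing this block */
                return INVALID;
            case BUS_READ_MISS:   /* if dirty, write back and downgrade */
                return (s == MODIFIED) ? SHARED : s;
            }
            return s;
        }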


  • Write-Invalidate CC Examples

    I = invalid (many), S = shared (many), M = modified (only one)

    [Figure: four snooping scenarios with Core 1, Core 2, and Main Memory holding block A.
    1. Core 1 has A in S; Core 2 takes a read miss for A and puts a read request on the bus; Core 1's snoop sees the read request and lets main memory supply A; Core 2 gets A from main memory and sets its state to S.
    2. Core 1 has A in S; Core 2 takes a write miss for A, writes A and changes its state to M, sending an invalidate for A; Core 1 changes A's state to I.
    3. Core 1 has A in M; Core 2 takes a read miss for A and puts a read request on the bus; Core 1's snoop sees the read request, writes back A to main memory and changes its state to S; Core 2 gets A from main memory.
    4. Core 1 has A in M; Core 2 takes a write miss for A, writes A and changes its state to M, sending an invalidate for A; Core 1 changes A's state to I.]

  • Data Miss Rates

    Shared data misses often dominate cache behavior even though they may only be 10% to 40% of the data accesses

    [Figure: capacity miss rate and coherence miss rate vs number of cores (1, 2, 4, 8, 16) for the FFT and Ocean benchmarks, measured on a 64KB 2-way set associative data cache with 32B blocks. Data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach.]

  • Block Size Effects

    Writes to one word in a multi-word block mean that the full block is invalidated

    Multi-word blocks can also result in false sharing: when two cores are writing to two different variables that happen to fall in the same cache block
    - With write-invalidate, false sharing increases cache miss rates

    [Figure: Core1 writes A and Core2 writes B, where A and B lie in the same 4 word cache block.]

    Compilers can help reduce false sharing by allocating highly correlated data to the same cache block (a common programmer-side fix is sketched below)
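    One common programmer-side fix is padding (the 64-byte block size and the struct names here are assumptions for illustration):

        #define BLOCK 64

        /* a and b share one cache block: if Core1 updates a while Core2
           updates b, each write invalidates the other core's copy and the
           block ping-pongs between the two caches (false sharing). */
        struct counters_bad {
            long a;
            long b;
        };

        /* Padding pushes b into its own block, removing the false sharing
           at the cost of some space. */
        struct counters_good {
            long a;
            char pad[BLOCK - sizeof(long)];
            long b;
        };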

  • Other Coherence Protocols

    There are many variations on cache coherence protocols

    Another write-invalidate protocol used in the Pentium 4 (and many other processors) is MESI with four states:
    - Modified: same as before
    - Exclusive: only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
      - Since there is only one copy of the block, write hits don't need to send an invalidate signal
    - Shared: multiple copies of the shared data may be cached (i.e., data is permitted to be cached with more than one processor); memory has an up-to-date copy
    - Invalid: same as before

  • MESI Cache Coherency Protocol

    [Figure: MESI state diagram. Invalid goes to Shared (clean) on a processor shared read miss and to Exclusive (clean) on a processor exclusive read miss. Exclusive goes to Modified (dirty) on a processor write and to Shared on another processor's shared read. Shared goes to Modified on a processor write, sending an invalidate for the block. Modified stays in Modified on a processor write or read hit; when another processor has a read/write miss for this block, the block is written back [Write back block] and the state drops to Shared or Invalid. Any state moves to Invalid on an invalidate for this block.]

  • Summary: Improving Cache Performance

    Reduce the miss penalty
    - smaller blocks
    - use a write buffer to hold dirty blocks being replaced, so there is no need to wait for the write to complete before reading
    - check the write buffer (and/or victim cache) on a read miss; may get lucky
    - for large blocks, fetch the critical word first
    - use multiple cache levels; the L2 cache is not tied to the CPU clock rate
    - faster backing store / improved memory bandwidth
      - wider buses
      - memory interleaving, DDR SDRAMs

  • Summary: The Cache Design Space

    Several interacting dimensions
    - cache size
    - block size
    - associativity
    - replacement policy
    - write-through vs write-back
    - write allocation

    The optimal choice is a compromise
    - depends on access characteristics
      - workload
      - use (I-cache, D-cache, TLB)
    - depends on technology / cost

    [Figure: the cache design space sketched as a surface over Cache Size, Associativity, and Block Size, with "Good" and "Bad" regions depending on Factor A and Factor B.]