Lec5a Supp


Transcript of Lec5a Supp


  • Review: Major Components of a Computer

    [Figure: block diagram of a computer: a Processor (Control and Datapath, with a Cache) connected to Main Memory and Secondary Memory (Disk), plus Devices for Input and Output.]

  • Processor-Memory Performance Gap

    [Figure: log-scale performance vs. year, 1980 to 2004. Processor performance grows 55%/year (2X/1.5yr, "Moore's Law"); DRAM performance grows 7%/year (2X/10yrs). The processor-memory performance gap grows 50%/year.]

  • The Memory Wall

    Processor vs DRAM speed disparity continues to grow

    [Figure: log-scale clocks per instruction (core) and clocks per DRAM access (memory), from VAX/1980 through PPro/1996 to 2010+.]

    Good memory hierarchy (cache) design is increasingly important to overall performance

  • The Memory Hierarchy Goal

    Fact: Large memories are slow and fast memories are small

    How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
    - With hierarchy
    - With parallelism


  • Memory Hierarchy Technologies

    Caches use SRAM for speed and technology compatibility
    - Fast (typical access times of 0.5 to 2.5 nsec)
    - Low density (6 transistor cells), higher power, expensive ($2000 to $5000 per GB in 2008)
    - Static: content will last "forever" (as long as power is left on)

    Main memory uses DRAM for size (density)
    - Slower (typical access times of 50 to 70 nsec)
    - High density (1 transistor cells), lower power, cheaper ($20 to $75 per GB in 2008)
    - Dynamic: needs to be "refreshed" regularly (~ every 8 ms); refresh consumes 1% to 2% of the active cycles of the DRAM
    - Addresses divided into 2 halves (row and column)
      - RAS or Row Access Strobe triggering the row decoder
      - CAS or Column Access Strobe triggering the column selector

  • The Memory Hierarchy: Why Does it Work?

    Temporal Locality (locality in time)
    - If a memory location is referenced, then it will tend to be referenced again soon
    - So keep most recently accessed data items closer to the processor

    Spatial Locality (locality in space)
    - If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
    - So move blocks consisting of contiguous words closer to the processor (see the sketch below)
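    To make the two kinds of locality concrete, here is a minimal C sketch (the array size and function names are illustrative, not from the slides). The scalar sum is reused on every iteration (temporal locality), and the row-major sweep touches consecutive addresses, so each fetched cache block supplies several elements (spatial locality); swapping the loop order defeats the spatial part.

        #include <stddef.h>

        #define N 1024

        /* Temporal locality: "sum" is touched on every iteration.
           Spatial locality: a[i][j] walks consecutive addresses, so each
           cache block fetched supplies several consecutive elements. */
        long sum_row_major(int a[N][N]) {
            long sum = 0;
            for (size_t i = 0; i < N; i++)
                for (size_t j = 0; j < N; j++)
                    sum += a[i][j];      /* stride-1: cache friendly */
            return sum;
        }

        /* Swapping the loops gives a stride of N*sizeof(int) bytes,
           defeating spatial locality; it typically misses far more. */
        long sum_col_major(int a[N][N]) {
            long sum = 0;
            for (size_t j = 0; j < N; j++)
                for (size_t i = 0; i < N; i++)
                    sum += a[i][j];      /* large stride: cache hostile */
            return sum;
        }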

  • The Memory Hierarchy: Terminology

    Block (or line): the minimum unit of information that is present (or not) in a cache

    Hit Rate: the fraction of memory accesses found in a level of the memory hierarchy
    - Hit Time: time to access that level, which consists of RAM access time + time to determine hit/miss

    Miss Rate: the fraction of memory accesses not found in a level of the memory hierarchy, i.e., 1 - (Hit Rate)
    - Miss Penalty: time to replace a block in that level with the corresponding block from a lower level, which consists of: time to access the block in the lower level + time to transmit that block to the level that experienced the miss + time to insert the block in that level + time to pass the block to the requestor

    Hit Time << Miss Penalty

  • Characteristics of the Memory Hierarchy

    [Figure: memory hierarchy pyramid. Processor at the top; 4-8 bytes (word) transferred to/from L1$; 8-32 bytes (block) to/from L2$; 1 to 4 blocks to/from Main Memory (MM); 1,024+ bytes (disk sector = page) to/from Secondary Memory (SM). Distance from the processor in access time increases going down the pyramid, as does the (relative) size of the memory at each level.]

    Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM

  • How is the Hierarchy Managed?

    registers <-> memory
    - by compiler (programmer?)

    cache <-> main memory
    - by the cache controller hardware

    main memory <-> disks
    - by the operating system (virtual memory)
    - virtual to physical address mapping assisted by the hardware (TLB)
    - by the programmer (files)

  • Cache Basics

    Two questions to answer (in hardware):
    - Q1: How do we know if a data item is in the cache?
    - Q2: If it is, how do we find it?

    Direct mapped
    - Each memory block is mapped to exactly one block in the cache, so lots of lower level blocks must share blocks in the cache
    - Address mapping (to answer Q2): (block address) modulo (# of blocks in the cache)
    - Have a tag associated with each cache block that contains the address information (the upper portion of the address) required to identify the block (to answer Q1); a sketch of this logic follows below
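    As a sketch of the Q1/Q2 logic in C (the 1024-block size and all names here are hypothetical, chosen only for illustration):

        #include <stdbool.h>
        #include <stdint.h>

        #define NUM_BLOCKS  1024   /* 2^10 blocks, so a 10-bit index */
        #define INDEX_BITS  10
        #define OFFSET_BITS 2      /* byte within a 32-bit word */

        typedef struct {
            bool     valid;
            uint32_t tag;
            uint32_t data;
        } cache_line;

        /* Q2 (where to look): index = (block address) modulo (# blocks).
           With a power-of-two block count the modulo is just a bit mask. */
        static uint32_t index_of(uint32_t addr) {
            return (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
        }

        /* Q1 (is it there): the stored tag must match the upper address bits. */
        static bool is_hit(const cache_line cache[], uint32_t addr) {
            uint32_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
            const cache_line *line = &cache[index_of(addr)];
            return line->valid && line->tag == tag;
        }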

  • Caching: A Simple First Example

    [Figure: a 4-entry direct mapped cache (Valid, Tag, Data; indexes 00, 01, 10, 11) next to a 16-word Main Memory (addresses 0000xx through 1111xx). One word blocks; the two low order address bits define the byte in the word (32b words).]

    Q2: How do we find it? Use the next 2 low order memory address bits, the index, to determine which cache block (i.e., modulo the number of blocks in the cache)

    Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache

    (block address) modulo (# of blocks in the cache)



  • Direct Mapped Cache

    Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache - all blocks initially marked as not valid.

    0 miss, 1 miss, 2 miss, 3 miss: the first four references fill blocks 00 through 11 (tag 00, data Mem(0) through Mem(3))
    4 miss: address 4 maps to block 00, evicting Mem(0) (tag becomes 01, data Mem(4))
    3 hit, 4 hit
    15 miss: address 15 maps to block 11, evicting Mem(3) (tag becomes 11, data Mem(15))

    - 8 requests, 6 misses (a simulator sketch follows below)
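    The whole trace can be checked with a few lines of C; a sketch (the 4-block direct mapped cache mirrors the figure):

        #include <stdbool.h>
        #include <stdio.h>

        #define BLOCKS 4

        int main(void) {
            int  tags[BLOCKS];
            bool valid[BLOCKS] = { false };
            int  refs[] = { 0, 1, 2, 3, 4, 3, 4, 15 };  /* word addresses */
            int  misses = 0;

            for (int i = 0; i < 8; i++) {
                int idx = refs[i] % BLOCKS;  /* (block address) mod (# blocks) */
                int tag = refs[i] / BLOCKS;
                if (!valid[idx] || tags[idx] != tag) {
                    misses++;                /* miss: install the block */
                    valid[idx] = true;
                    tags[idx]  = tag;
                }
            }
            printf("8 requests, %d misses\n", misses);  /* prints 6 */
            return 0;
        }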


  • Multiword Block Direct Mapped Cache

    [Figure: a direct mapped cache with four words per block. The address 31 30 . . . 13 12 11 . . . 4 3 2 1 0 is split into a 2-bit byte offset (bits 1:0), a 2-bit block offset (bits 3:2), an 8-bit index (bits 11:4) selecting one of 256 entries (0, 1, 2, ..., 253, 254, 255), and a 20-bit tag compared against the stored tag; on a hit, the block offset selects one 32-bit data word.]

    What kind of locality are we taking advantage of?


  • Taking Advantage of Spatial Locality

    Same main memory word reference string, 0 1 2 3 4 3 4 15, now with two-word cache blocks. Start with an empty cache - all blocks initially marked as not valid.

    0 miss: fetches the two-word block Mem(1) Mem(0); 1 hit
    2 miss: fetches Mem(3) Mem(2); 3 hit
    4 miss: address 4 evicts Mem(1) Mem(0) for Mem(5) Mem(4); 3 hit, 4 hit
    15 miss: evicts Mem(3) Mem(2) for Mem(15) Mem(14)

    - 8 requests, 4 misses

  • Miss Rate vs Block Size vs Cache Size

    [Figure: miss rate (%) vs block size (16 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.]

    Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)

  • Cache Field Sizes

    The number of bits in a cache includes both the storage for data and for the tags
    - 32-bit byte address
    - For a direct mapped cache with 2^n blocks, n bits are used for the index
    - For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word
    - The tag field size is then 32 - (n + m + 2)

    The total number of bits in a direct-mapped cache is then 2^n x (block size + tag field size + valid field size)

    How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks assuming a 32-bit address? (A worked answer follows below.)
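    A worked answer (assuming, as is conventional, one valid bit per block):

        16KB of data = 2^12 words = 2^10 four-word blocks, so n = 10 and m = 2
        tag field = 32 - (10 + 2 + 2) = 18 bits
        total bits = 2^10 x (4 x 32 + 18 + 1) = 1024 x 147 = 147 Kbits

    or about 18.4 KB, roughly 1.15 times the 16KB of data alone.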

  • Handling Cache Hits

    Read hits (I$ and D$)
    - this is what we want!

    Write hits (D$ only)
    - require the cache and memory to be consistent
      - always write the data into both the cache block and the next level in the memory hierarchy (write-through)
      - writes run at the speed of the next level in the memory hierarchy, so slow! Or can use a write buffer and stall only if the write buffer is full
    - allow cache and memory to be inconsistent
      - write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is evicted)
      - need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted; can use a write buffer to help buffer write-backs of dirty blocks

  • Sources of Cache Misses

    Compulsory (cold start or process migration, first reference)
    - First access to a block; a "cold" fact of life, not a whole lot you can do about it. If you are going to run millions of instructions, compulsory misses are insignificant
    - Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)

    Capacity: the cache cannot contain all the blocks accessed by the program
    - Solution: increase cache size (may increase access time)

    Conflict (collision): multiple memory locations mapped to the same cache location
    - Solution 1: increase cache size
    - Solution 2: increase associativity (stay tuned) (may increase access time)

  • Handling Cache Misses (Single Word Blocks)

    Read misses (I$ and D$)
    - stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume

    Write misses (D$ only)
    1. stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume
    or
    2. Write allocate: just write the word into the cache updating both the tag and data; no need to check for cache hit, no need to stall
    or
    3. No-write allocate: skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full

  • Multiword Block Considerations

    Read misses (I$ and D$)
    - Processed the same as for single word blocks: a miss returns the entire block from memory
    - Miss penalty grows as block size grows
      - Early restart: processor resumes execution as soon as the requested word of the block is returned
      - Requested word first: requested word is transferred from the memory to the cache (and processor) first
    - Nonblocking cache: allows the processor to continue to access the cache while the cache is handling an earlier miss

    Write misses (D$)
    - If using write allocate, must first fetch the block from memory and then write the word to the block (or could end up with a garbled block in the cache, e.g., for 4 word blocks: a new tag, one word of data from the new block, and three words of data from the old block)

  • Memory Systems that Support Caches

    The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways

    [Figure: one word wide organization: on-chip CPU and Cache connected over a one-word-wide bus (32-bit address and data per cycle) to DRAM Memory.]

    Assume
    1. 1 memory bus clock cycle to send the address
    2. 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, 4th words (column access time)
    3. 1 memory bus clock cycle to return a word of data

    Memory-Bus to Cache bandwidth
    - number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle

  • Review: (DDR) SDRAM Operation

    After a row is read into the N x M SRAM register
    - Input CAS as the starting "burst" address along with a burst length
    - Transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
      - The memory bus clock controls transfer of successive words in the burst

    [Figure: DRAM array of N rows x N cols with a Row Address decoder, an N x M SRAM row buffer, and an M-bit output. Timing diagram: RAS with the Row Address, then CAS with the Col Address (Column, Column +1, ...), then the 1st M-bit access followed by the 2nd, 3rd, and 4th M-bit transfers within one cycle time.]

  • DRAM Size Increase

    Add a table like figure 5.12 to show DRAM growth since 1980


  • One Word Wide Bus, One Word Blocks

    [Figure: on-chip CPU and Cache connected by a one-word-wide bus to DRAM Memory.]

    If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:

    1  memory bus clock cycle to send the address
    15 memory bus clock cycles to read DRAM
    1  memory bus clock cycle to return the data
    17 total clock cycles miss penalty

    Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/17 = 0.235 bytes per memory bus clock cycle


  • One Word Wide Bus, Four Word Blocks

    What if the block size is four words and each word is in a different DRAM row?

    1  cycle to send the 1st address
    4 x 15 = 60 cycles to read DRAM
    1  cycle to return the last data word
    62 total clock cycles miss penalty

    [Figure: the 60 DRAM cycles are four back-to-back 15-cycle row accesses.]

    Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/62 = 0.258 bytes per memory bus clock cycle


  • One Word Wide Bus, Four Word Blocks

    What if the block size is four words and all the words are in the same DRAM row?

    1  cycle to send the 1st address
    15 + 3 x 5 = 30 cycles to read DRAM
    1  cycle to return the last data word
    32 total clock cycles miss penalty

    [Figure: one 15-cycle row access followed by three 5-cycle column accesses.]

    Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/32 = 0.5 bytes per memory bus clock cycle


  • Interleaved Memory, One Word Wide Bus

    For a block size of four words, with four interleaved memory banks:

    1  cycle to send the 1st address
    15 cycles to read the DRAM banks (the four 15-cycle reads overlap, staggered by one bus cycle)
    4 x 1 = 4 cycles to return the data words
    20 total clock cycles miss penalty

    [Figure: on-chip CPU and Cache on a one-word-wide bus to Memory banks 0 through 3; the four 15-cycle bank reads proceed in parallel.]

    Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/20 = 0.8 bytes per memory bus clock cycle (the arithmetic for all three organizations is sketched below)
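    The last three slides differ only in how the per-word DRAM accesses overlap; a C sketch of the arithmetic (cycle counts are the slides' assumed values) reproduces the 62-, 32-, and 20-cycle penalties and the 0.258, 0.5, and 0.8 bytes/cycle bandwidths:

        #include <stdio.h>

        int main(void) {
            int addr = 1, row = 15, col = 5, xfer = 1, words = 4;

            /* one-word-wide bus, each word in a different DRAM row */
            int diff_rows = addr + words * row + xfer;              /* 62 */

            /* same row: one row cycle, then three column accesses */
            int same_row  = addr + row + (words - 1) * col + xfer;  /* 32 */

            /* four interleaved banks: row reads overlap, then the
               words return over the bus one cycle apiece */
            int interleaved = addr + row + words * xfer;            /* 20 */

            printf("miss penalties: %d %d %d cycles\n",
                   diff_rows, same_row, interleaved);
            printf("bandwidth: %.3f %.3f %.3f bytes/cycle\n",
                   16.0 / diff_rows, 16.0 / same_row, 16.0 / interleaved);
            return 0;
        }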

  • DRAM Memory System Summary

    It's important to match the cache characteristics
    - caches access one block at a time (usually more than one word)

    with the DRAM characteristics
    - use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache

    and with the memory-bus characteristics
    - make sure the memory-bus can support the DRAM access rates and patterns
    - with the goal of increasing the Memory-Bus to Cache bandwidth

  • Measuring Cache Performance

    Assuming cache hit costs are included as part of the normal CPU execution cycle, then

    CPU time = IC x CPI x CC
             = IC x (CPI_ideal + Memory-stall cycles) x CC
             = IC x CPI_stall x CC

    Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls)

    Read-stall cycles = reads/program x read miss rate x read miss penalty

    Write-stall cycles = (writes/program x write miss rate x write miss penalty) + write buffer stalls

    For write-through caches, we can simplify this to

    Memory-stall cycles = accesses/program x miss rate x miss penalty

  • Impacts of Cache Performance

    Relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI)
    - The memory speed is unlikely to improve as fast as processor cycle time. When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss
    - The lower the CPI_ideal, the more pronounced the impact of stalls

    A processor with a CPI_ideal of 2, a 100 cycle miss penalty, 36% load/store instrs, and 2% I$ and 4% D$ miss rates:

    Memory-stall cycles = 2% x 100 + 36% x 4% x 100 = 3.44

    So CPI_stall = 2 + 3.44 = 5.44, more than twice the CPI_ideal!

    What if the CPI_ideal is reduced to 1? 0.5? 0.25?
    What if the D$ miss rate went up 1%? 2%?
    What if the processor clock rate is doubled (doubling the miss penalty)?
    (A sketch for trying these follows below.)
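    A sketch of this model in C (the miss rates, mix, and penalty are the slide's example values), which makes the "what if" questions easy to try:

        #include <stdio.h>

        /* CPI with memory stalls for split I$/D$: every instruction
           accesses the I$, and ld_st_frac of them also access the D$. */
        static double cpi_stall(double cpi_ideal, double penalty,
                                double i_miss, double d_miss,
                                double ld_st_frac) {
            double stalls = i_miss * penalty
                          + ld_st_frac * d_miss * penalty;
            return cpi_ideal + stalls;
        }

        int main(void) {
            printf("%.2f\n", cpi_stall(2.0, 100, 0.02, 0.04, 0.36)); /* 5.44 */
            printf("%.2f\n", cpi_stall(1.0, 100, 0.02, 0.04, 0.36)); /* 4.44 */
            return 0;
        }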

  • Average Memory Access Time (AMAT)

    A larger cache will have a longer access time. An increase in hit time will likely add another stage to the pipeline. At some point the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance.

    Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses

    AMAT = Time for a hit + Miss rate x Miss penalty

    What is the AMAT for a processor with a 20 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache access time of 1 clock cycle? (Worked below.)
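    A worked answer in the slide's own notation (treating the miss rate as misses per access):

        AMAT = 1 + 0.02 x 50 = 2 cycles = 2 x 20 psec = 40 psec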

  • Reducing Cache Miss Rates #1

    1. Allow more flexible block placement

    In a direct mapped cache a memory block maps to exactly one cache block

    At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache

    A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices), as sketched below:

    (block address) modulo (# sets in the cache)
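    The only changes from the direct mapped sketch earlier are that the modulo is taken over sets rather than blocks and that every way in the selected set is searched; a C sketch (sizes and names again illustrative):

        #include <stdbool.h>
        #include <stdint.h>

        #define WAYS        4
        #define SET_BITS    8
        #define SETS        (1u << SET_BITS)
        #define OFFSET_BITS 2

        typedef struct { bool valid; uint32_t tag; } way_t;

        /* Hit if any way in the selected set holds a valid, matching tag. */
        static bool set_assoc_hit(way_t cache[SETS][WAYS], uint32_t addr) {
            uint32_t set = (addr >> OFFSET_BITS) % SETS; /* (block addr) mod #sets */
            uint32_t tag = addr >> (OFFSET_BITS + SET_BITS);
            for (int w = 0; w < WAYS; w++)
                if (cache[set][w].valid && cache[set][w].tag == tag)
                    return true;
            return false;
        }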


  • Another Reference String Mapping

    Consider the main memory word reference string 0 4 0 4 0 4 0 4 in the direct mapped cache. Start with an empty cache - all blocks initially marked as not valid.

    0 and 4 both map to block 00, so every access evicts the other: miss, miss, miss, miss, miss, miss, miss, miss (the block alternates between 00 Mem(0) and 01 Mem(4))

    - 8 requests, 8 misses

    Ping pong effect due to conflict misses: two memory locations that map into the same cache block

  • Set Associative Cache Example

    [Figure: a 2-way set associative cache with two sets (Way 0 and Way 1, Sets 0 and 1; V, Tag, Data) next to a 16-word Main Memory (addresses 0000xx through 1111xx). One word blocks; the two low order address bits define the byte in the word (32b words).]

    Q2: How do we find it? Use the next 1 low order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)

    Q1: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache


  • Another Reference String Mapping

    Consider the same reference string, 0 4 0 4 0 4 0 4, in the 2-way set associative cache. Start with an empty cache - all blocks initially marked as not valid.

    0 miss (000 Mem(0) installed in one way), 4 miss (010 Mem(4) installed in the other way), then all six remaining accesses hit

    - 8 requests, 2 misses

    Solves the ping pong effect in a direct mapped cache due to conflict misses since now two memory locations that map into the same cache set can co-exist!

  • Four-Way Set Associative Cache

    2^8 = 256 sets, each with four ways (each with one block)

    [Figure: address bits 31 30 . . . 13 12 11 . . . 2 1 0 split into a 22-bit tag, an 8-bit index, and a byte offset. The index selects one of 256 sets (0, 1, 2, ..., 253, 254, 255) across four ways (each with V, Tag, and 32-bit Data); all four tags are compared in parallel, and a 4x1 select multiplexer steers the hit way's data out along with the Hit signal.]


  • Range of Set Associative Caches

    For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

    [Figure: address fields Tag | Index | Block offset | Byte offset. The tag is used for the compare, the index selects the set, and the block offset selects the word in the block. Moving index bits into the tag increases associativity; at one extreme is fully associative (only one set; the tag is all the bits except block and byte offset), at the other direct mapped (only one way; smaller tags, only a single comparator).]

  • Costs of Set Associative Caches

    When a miss occurs, which way's block do we pick for replacement?
    - Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time
      - Must have hardware to keep track of when each way's block was used relative to the other blocks in the set
      - For 2-way set associative, takes one bit per set: set the bit when a block is referenced (and reset the other way's bit); a sketch follows below

    N-way set associative cache costs
    - N comparators (delay and area)
    - MUX delay (set selection) before data is available
    - Data available after set selection (and Hit/Miss decision). In a direct mapped cache, the cache block is available before the Hit/Miss decision, so it is possible to just assume a hit, continue, and recover later if it was a miss
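    A sketch of that one-bit bookkeeping in C (the names are illustrative):

        #include <stdbool.h>

        /* For 2-way set associativity one bit per set suffices: touching
           way w marks it most recently used, which simultaneously marks
           the other way least recently used. */
        typedef struct { bool mru_way; } lru_state;

        static void touch(lru_state *s, int way) { s->mru_way = (way == 1); }

        /* On a miss, replace the way that was not used most recently. */
        static int victim(const lru_state *s) { return s->mru_way ? 0 : 1; }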

  • Benefits of Set Associative Caches

    The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation

    [Figure: miss rate vs associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB. Data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2003.]

    Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)

  • Reducing Cache Miss Rates #2

    2. Use multiple levels of caches

    With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of caches: normally a unified L2 cache (i.e., it holds both instructions and data) and in some cases even a unified L3 cache

    For our example: a CPI_ideal of 2, a 100 cycle miss penalty (to main memory), a 25 cycle miss penalty (to UL2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, and a 0.5% UL2$ miss rate

    CPI_stall = 2 + .02 x 25 + .36 x .04 x 25 + .005 x 100 + .36 x .005 x 100 = 3.54
    (as compared to 5.44 with no L2$; a sketch follows below)
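    Extending the earlier CPI sketch to two levels (the slide's values: L1 misses that hit in the UL2$ pay 25 cycles, UL2$ misses pay the full 100):

        #include <stdio.h>

        int main(void) {
            double l2_hit_pen = 25, mem_pen = 100, ld_st = 0.36;
            double i_miss = 0.02, d_miss = 0.04, l2_miss = 0.005;

            double stalls = i_miss * l2_hit_pen          /* I$ misses to UL2$ */
                          + ld_st * d_miss * l2_hit_pen  /* D$ misses to UL2$ */
                          + l2_miss * mem_pen            /* UL2$ misses, I side */
                          + ld_st * l2_miss * mem_pen;   /* UL2$ misses, D side */

            printf("CPI_stall = %.2f\n", 2.0 + stalls);  /* 3.54 */
            return 0;
        }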

  • Multilevel Cache Design Considerations

    Design considerations for L1 and L2 caches are very different
    - Primary cache should focus on minimizing hit time in support of a shorter clock cycle
      - Smaller with smaller block sizes
    - Secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times
      - Larger with larger block sizes
      - Higher levels of associativity

    The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so it can be smaller (i.e., faster) but have a higher miss rate

    For the L2 cache, hit time is less important than miss rate
    - The L2$ hit time determines L1$'s miss penalty
    - L2$ local miss rate >> the global miss rate

  • Using the Memory Hierarchy Well

    Include plots from Figure 5.18

  • Two Machines' Cache Parameters

                          Intel Nehalem                   AMD Barcelona
    L1 cache org. & size  Split I$ and D$; 32KB for       Split I$ and D$; 64KB for
                          each per core; 64B blocks      each per core; 64B blocks
    L1 associativity      4-way (I), 8-way (D) set       2-way set assoc.; LRU
                          assoc.; ~LRU replacement       replacement
    L1 write policy       write-back, write-allocate     write-back, write-allocate
    L2 cache org. & size  Unified; 256KB (0.25MB)        Unified; 512KB (0.5MB)
                          per core; 64B blocks           per core; 64B blocks
    L2 associativity      8-way set assoc.; ~LRU         16-way set assoc.; ~LRU
    L2 write policy       write-back, write-allocate     write-back, write-allocate
    L3 cache org. & size  shared by cores; 64B blocks    shared by cores; 64B blocks
    L3 associativity      16-way set assoc.              32-way set assoc.; evict block
    L3 write policy       write-back, write-allocate     write-back, write-allocate

  • Two Machines' Cache Parameters

                        Intel P4                    AMD Opteron
    L1 organization     Split I$ and D$             Split I$ and D$
    L1 cache size       8KB for D$, 96KB for        64KB for each of I$ and D$
                        trace cache (~I$)
    L1 block size       64 bytes                    64 bytes
    L1 associativity    4-way set assoc.            2-way set assoc.
    L1 replacement      ~LRU                        LRU
    L1 write policy     write-through               write-back
    L2 organization     Unified                     Unified
    L2 cache size       512KB                       1024KB (1MB)
    L2 associativity    8-way set assoc.            16-way set assoc.
    L2 replacement      ~LRU                        ~LRU
    L2 write policy     write-back                  write-back

  • FSM Cache Controller

    Key characteristics for a simple L1 cache
    - Direct mapped
    - Write-back using write allocate
    - Block size of 4 32-bit words (so 16B); cache size of 16KB (so 1024 blocks)
    - 18-bit tags, 10-bit index, 2-bit block offset, 2-bit byte offset, dirty bit, valid bit, LRU bits (if set associative)

    [Figure: the Cache & Cache Controller sits between the Processor and DDR DRAM. Processor side: 1-bit Read/Write, 1-bit Valid, 32-bit address, 32-bit data each way. Memory side: 1-bit Read/Write, 1-bit Valid, 32-bit address, 128-bit data each way.]

  • Four State Cache Controller

    [Figure: four-state FSM. Idle: on a valid CPU request go to Compare Tag. Compare Tag: if valid && hit, it is a cache hit (set valid, set tag, if write set dirty), mark cache ready and return to Idle; on a cache miss go to Allocate if the old block is clean, or to Write Back if the old block is dirty. Allocate: read the new block from memory; loop while memory is not ready, then go back to Compare Tag. Write Back: write the old block to memory; loop while memory is not ready, then go to Allocate.]
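    A sketch of the four-state controller in C (state names follow the figure; the boolean inputs stand in for the real datapath and memory handshake signals):

        #include <stdbool.h>

        typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } cc_state;

        static cc_state step(cc_state s, bool cpu_request, bool hit,
                             bool old_block_dirty, bool mem_ready) {
            switch (s) {
            case IDLE:                      /* wait for a valid CPU request */
                return cpu_request ? COMPARE_TAG : IDLE;
            case COMPARE_TAG:               /* hit: done; miss: maybe evict */
                if (hit) return IDLE;       /* mark cache ready */
                return old_block_dirty ? WRITE_BACK : ALLOCATE;
            case WRITE_BACK:                /* write the old block to memory */
                return mem_ready ? ALLOCATE : WRITE_BACK;
            case ALLOCATE:                  /* read the new block from memory */
                return mem_ready ? COMPARE_TAG : ALLOCATE;
            }
            return IDLE;
        }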


  • Cache Coherence in Multicores

    In future multicore processors it's likely that the cores will share a common physical address space, causing a cache coherence problem

    [Figure: Core 1 and Core 2, each with private L1 caches, above a unified (shared) L2. Both cores read X (both L1s hold X = 0); then Core 1 writes 1 to X, leaving X = 1 in Core 1's L1 and in the L2 while Core 2's L1 still holds the stale X = 0.]

  • A Coherent Memory System

    Any read of a data item should return the most recently written value of the data item
    - Coherence: defines what values can be returned by a read
      - Writes to the same location are serialized (two writes to the same location must be seen in the same order by all cores)
    - Consistency: determines when a written value will be returned by a read

    To enforce coherence, caches provide
    - Replication of shared data items in multiple cores' caches
      - Replication reduces both latency and contention for a read of a shared data item
    - Migration of shared data items to a core's local cache
      - Migration reduces the latency of the access to the data and the bandwidth demand on the shared memory (L2 in our example)

  • Cache Coherence Protocols

    Need a hardware protocol to ensure cache coherence, the most popular of which is snooping
    - The cache controllers monitor (snoop) on the broadcast medium (e.g., bus) with duplicate address tag hardware (so they don't interfere with the core's access to the cache) to determine if their cache has a copy of a block that is requested

    Write invalidate protocol: writes require exclusive access and invalidate all other copies
    - Exclusive access ensures that no other readable or writable copies of an item exist

    If two processors attempt to write the same data at the same time, one of them wins the race, causing the other core's copy to be invalidated. For the other core to complete, it must obtain a new copy of the data, which must now contain the updated value, thus enforcing write serialization

  • Handling Writes

    Ensuring that all other processors sharing data are informed of writes can be handled two ways:

    1. Write-update (write-broadcast): the writing core broadcasts new data over the bus and all copies are updated
    - All writes go to the bus: higher bus traffic
    - Since new values appear in caches sooner, can reduce latency

    2. Write-invalidate: the writing core issues an invalidation signal on the bus; cache snoops check to see if they have a copy of the data and, if so, invalidate their cache block (allowing multiple readers but only one writer)
    - Uses the bus only on the first write: lower bus traffic, so better use of bus bandwidth


  • Example of Snooping Invalidation

    [Figure: Core 1 and Core 2, each with private L1 I$ and D$, above a unified (shared) L2. Core 1 and Core 2 both read X (each L1 holds X = 0). Core 1 then writes 1 to X: the snoop invalidates Core 2's copy (X = I) while Core 1 holds X = 1. Core 2's next read of X misses.]

    When the second miss by Core 2 occurs, Core 1 responds with the value, canceling the response from the L2 cache (and also updating the L2 copy)


  • A Write-Invalidate CC Protocol

    [Figure: a three-state write-back, write-invalidate protocol with states Invalid, Shared (clean), and Modified (dirty). Signals from the core are in red; signals from the bus are in blue. Invalid goes to Shared on a core read (miss) and to Modified on a core write (miss). Shared stays in Shared on a read (hit or miss), goes to Modified on a core write (sending an invalidate), and goes to Invalid when it receives an invalidate (a write by another core to this block). Modified stays in Modified on a core read or write (hit or miss); on another core's read miss for this block it writes back the block and drops to Shared. A sketch of these transitions follows below.]
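    A sketch of those transitions as a C function (the event names follow the diagram's edges; the bus itself is abstracted away):

        typedef enum { INVALID, SHARED, MODIFIED } msi_state;
        typedef enum {
            CPU_READ, CPU_WRITE,            /* from this core */
            BUS_INVALIDATE, BUS_READ_MISS   /* snooped from other cores */
        } msi_event;

        static msi_state msi_next(msi_state s, msi_event e) {
            switch (e) {
            case CPU_READ:        /* a read miss from Invalid fetches the block */
                return (s == INVALID) ? SHARED : s;
            case CPU_WRITE:       /* sends an invalidate first if others share */
                return MODIFIED;
            case BUS_INVALIDATE:  /* another core is writing this block */
                return INVALID;
            case BUS_READ_MISS:   /* if dirty, write back and downgrade */
                return (s == MODIFIED) ? SHARED : s;
            }
            return s;
        }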


  • Write-Invalidate CC Examples

    I = invalid (many), S = shared (many), M = modified (only one)

    [Figure: four snooping scenarios with Core 1, Core 2, and Main Memory holding block A.
    1. Core 1 has A in S; Core 2 takes a read miss for A and puts a read request on the bus; Core 1's snoop sees the read request and lets main memory supply A; Core 2 gets A from main memory and sets its state to S.
    2. Core 1 has A in S; Core 2 takes a write miss for A, writes A and changes its state to M, sending an invalidate for A; Core 1 changes A's state to I.
    3. Core 1 has A in M; Core 2 takes a read miss for A and puts a read request on the bus; Core 1's snoop sees the read request, writes back A to main memory and changes its state to S; Core 2 gets A from main memory.
    4. Core 1 has A in M; Core 2 takes a write miss for A, writes A and changes its state to M, sending an invalidate for A; Core 1 changes A's state to I.]

  • Data Miss Rates

    Shared data misses often dominate cache behavior even though they may only be 10% to 40% of the data accesses

    [Figure: capacity miss rate and coherence miss rate vs number of cores (1, 2, 4, 8, 16) for the FFT and Ocean benchmarks, measured on a 64KB 2-way set associative data cache with 32B blocks. Data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach.]

  • Block Size Effects

    Writes to one word in a multi-word block mean that the full block is invalidated

    Multi-word blocks can also result in false sharing: when two cores are writing to two different variables that happen to fall in the same cache block
    - With write-invalidate, false sharing increases cache miss rates

    [Figure: Core1 writes A and Core2 writes B, where A and B lie in the same 4 word cache block.]

    Compilers can help reduce false sharing by allocating highly correlated data to the same cache block (a common programmer-side fix is sketched below)
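    One common programmer-side fix is padding (the 64-byte block size and the struct names here are assumptions for illustration):

        #define BLOCK 64

        /* a and b share one cache block: if Core1 updates a while Core2
           updates b, each write invalidates the other core's copy and the
           block ping-pongs between the two caches (false sharing). */
        struct counters_bad {
            long a;
            long b;
        };

        /* Padding pushes b into its own block, removing the false sharing
           at the cost of some space. */
        struct counters_good {
            long a;
            char pad[BLOCK - sizeof(long)];
            long b;
        };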

  • Other Coherence Protocols

    There are many variations on cache coherence protocols

    Another write-invalidate protocol used in the Pentium 4 (and many other processors) is MESI with four states:
    - Modified: same as before
    - Exclusive: only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
      - Since there is only one copy of the block, write hits don't need to send an invalidate signal
    - Shared: multiple copies of the shared data may be cached (i.e., data is permitted to be cached with more than one processor); memory has an up-to-date copy
    - Invalid: same as before

  • MESI Cache Coherency Protocol

    [Figure: MESI state diagram. Invalid goes to Shared (clean) on a processor shared read miss and to Exclusive (clean) on a processor exclusive read miss. Exclusive goes to Modified (dirty) on a processor write and to Shared on another processor's shared read. Shared goes to Modified on a processor write, sending an invalidate for the block. Modified stays in Modified on a processor write or read hit; when another processor has a read/write miss for this block, the block is written back [Write back block] and the state drops to Shared or Invalid. Any state moves to Invalid on an invalidate for this block.]

  • Summary: Improving Cache Performance

    Reduce the miss penalty
    - smaller blocks
    - use a write buffer to hold dirty blocks being replaced, so there is no need to wait for the write to complete before reading
    - check the write buffer (and/or victim cache) on a read miss; may get lucky
    - for large blocks, fetch the critical word first
    - use multiple cache levels; the L2 cache is not tied to the CPU clock rate
    - faster backing store / improved memory bandwidth
      - wider buses
      - memory interleaving, DDR SDRAMs

  • Summary: The Cache Design Space

    Several interacting dimensions
    - cache size
    - block size
    - associativity
    - replacement policy
    - write-through vs write-back
    - write allocation

    The optimal choice is a compromise
    - depends on access characteristics
      - workload
      - use (I-cache, D-cache, TLB)
    - depends on technology / cost

    [Figure: the cache design space sketched as a surface over Cache Size, Associativity, and Block Size, with "Good" and "Bad" regions depending on Factor A and Factor B.]