Chapter 5 Memory Hierarchy Design
EEF011 Computer Architecture (計算機結構)
December 2004
Chun-Hsin Wu (吳俊興), Department of Computer Science and Information Engineering, National University of Kaohsiung
2
Chapter 5 Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
3
5.1 Introduction
The five classic components of a computer:
• Control
• Datapath
• Memory
• Input
• Output
(Control and Datapath together form the Processor.)

Where do we fetch instructions to execute? We build a memory hierarchy that includes main memory and caches (internal memory) and hard disk (external memory). Instructions are first fetched from external storage such as the hard disk and kept in main memory; before they reach the CPU, they are typically copied into the caches.
4
Technology Trends

DRAM generations:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
2000   256 Mb   100 ns

Capacity vs. speed (latency) improvement:
CPU:   capacity 2x in 1.5 years;  speed 2x in 1.5 years
DRAM:  capacity 4x in 3 years;    speed 2x in 10 years
Disk:  capacity 4x in 3 years;    speed 2x in 10 years
Over 1980-2000, DRAM capacity grew 4000:1, but DRAM speed only 2.5:1!
5
Performance Gap between CPUs and Memory

[Figure: improvement ratio over time — CPU performance improved 1.35x/year, later 1.55x/year, while memory latency improved only about 7%/year.]
The gap (latency) grows about 50% per year!
6
Levels of the Memory Hierarchy

Level           Capacity    Access Time
CPU Registers   500 bytes   0.25 ns
Cache           64 KB       1 ns
Main Memory     512 MB      100 ns
Disk            100 GB      5 ms

Moving from the upper level (registers) down to the lower level (disk), each level is larger but slower. Data moves between levels in units that grow with distance from the CPU: blocks (between cache and memory), pages (between memory and disk), and files (between disk and I/O devices).
7
5.2 ABCs of Caches

• Cache:
– In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU.
– The term is also applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on.
• Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
8
Memory Hierarchy: Terminology

• Hit: the data appears in some block in the cache (example: Block X)
– Hit Rate: the fraction of cache accesses found in the cache
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: the data must be retrieved from a block in main memory (Block Y)
– Miss Rate = 1 - Hit Rate
– Miss Penalty: time to replace a block in the cache + time to deliver the block to the processor
• Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)
9
Cache Measures

CPU execution time incorporating cache performance:

CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.

Memory stall clock cycles
= Number of misses × Miss penalty
= IC × (Misses/Instruction) × Miss penalty
= IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
= IC × Reads per instruction × Read miss rate × Read miss penalty
 + IC × Writes per instruction × Write miss rate × Write miss penalty

Memory accesses consist of instruction fetches and data reads/writes.
10
Example: assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses hit in the cache?

Answer:
(A) If accesses always hit in the cache, CPI = 1.0 and there are no memory stalls, so
 CPU(A) = (IC × CPI + 0) × Clock cycle time = IC × Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
 Memory stalls = IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
 = IC × (1 + 50%) × 2% × 25 = IC × 0.75
 so CPU(B) = (IC + IC × 0.75) × Clock cycle time = 1.75 × IC × Clock cycle time

The performance ratio is the inverse of the ratio of the CPU execution times:
 CPU(B)/CPU(A) = 1.75
The computer with no cache misses is 1.75 times faster.
P.395 Example
11
Four Memory Hierarchy Questions
Q1 (block placement):
Where can a block be placed in the upper level?
Q2 (block identification):
How is a block found if it is in the upper level?
Q3 (block replacement):
Which block should be replaced on a miss?
Q4 (write strategy):
What happens on a write?
12
Q1 (block placement): Where can a block be placed?
• Direct mapped: (Block number) mod (Number of blocks in cache)
• Set associative: (Block number) mod (Number of sets in cache)
– # of sets = (# of blocks) / n
– n-way: n blocks in a set
– 1-way = direct mapped
• Fully associative: # of sets = 1

Example: block 12 placed in an 8-block cache
13
Simplest Cache: Direct Mapped (1-way)

[Figure: a 4-block direct-mapped cache; memory blocks 0 through F each map to cache index (block number) mod 4.]

A block has only one place it can appear in the cache. The mapping is usually
(Block address) MOD (Number of blocks in cache)
14
Example: 1 KB Direct Mapped Cache, 32-byte Blocks

For a 2^N-byte cache with 2^M-byte blocks:
– the uppermost (32 - N) bits of the address are always the Cache Tag
– the lowest M bits are the Byte Select

[Figure: the 32-bit address splits into Cache Tag (bits 31-10, e.g., 0x50), Cache Index (bits 9-5, e.g., 0x01), and Byte Select (bits 4-0, e.g., 0x00). Each cache entry stores a Valid bit and the Cache Tag as part of the cache "state", alongside its 32-byte data block (Bytes 0-31, 32-63, ..., 992-1023).]
15
Q2 (block identification): How is a block found?

Three portions of an address in a set-associative or direct-mapped cache:
Tag | Cache/Set Index | Block Offset
(the Block Address comprises the tag and index; the Block Offset width is set by the block size)

The block offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit.
• Use the Cache Index to select the cache set
• Check the Tag on each block in that set
– No need to check the index or block offset
– A valid bit is added to the tag to indicate whether or not the entry contains a valid address
• Select the desired bytes using the Block Offset

Increasing associativity shrinks the index and expands the tag.
16
Example: Two-Way Set Associative Cache

• The Cache Index selects a "set" from the cache
• The two tags in the set are compared in parallel
• Data is selected based on the tag comparison result

[Figure: the index selects one entry from each of the two ways; the address tag (e.g., 0x50) is compared against both stored tags, qualified by their valid bits. The two compare results are ORed to form Hit and also drive the select inputs (Sel1, Sel0) of a mux that picks the matching way's cache block.]
17
Disadvantage of Set Associative Cache

• N-way set associative cache vs. direct mapped cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
• In a direct mapped cache, the cache block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue; recover later if it was a miss.
18
Q3 (block replacement): Which block should be replaced on a cache miss?

Direct mapped: easy — hardware decisions are simplified. Only one block frame is checked, and only that block can be replaced.

Set associative or fully associative: there are many blocks to choose from on a miss.

Three primary strategies for selecting the block to be replaced:
• Random: a candidate block is selected at random
• LRU: the Least Recently Used block is removed
• FIFO: first in, first out

Data cache misses per 1000 instructions for the replacement strategies:

             2-way                  4-way                  8-way
Size    LRU   Random FIFO     LRU   Random FIFO     LRU   Random FIFO
16 KB   114.1 117.3  115.5    111.7 115.1  113.3    109.0 111.8  110.4
64 KB   103.4 104.3  103.9    102.4 102.3  103.1     99.7 100.5  100.3
256 KB   92.2  92.1   92.5     92.1  92.1   92.5     92.1  92.1   92.5

There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for the smaller caches. FIFO generally outperforms random at the smaller cache sizes.
19
Q4 (write strategy): What happens on a write?

Reads dominate processor cache accesses: e.g., writes are 7% of overall memory traffic but 21% of data cache accesses.

Two options when writing to the cache:
• Write through — the information is written both to the block in the cache and to the block in the lower-level memory.
• Write back — the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

To reduce the frequency of writing back blocks on replacement, a dirty bit indicates whether the block was modified in the cache (dirty) or not (clean). If clean, no write back is needed, since the lower level already holds the same information.

Pros and cons:
• Write through: simple to implement. The cache is always clean, so read misses never cause writes to the lower level.
• Write back: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.
20
Write Stall and Write Buffer

When the CPU must wait for writes to complete during write through, the CPU is said to write stall.

A common optimization to reduce write stalls is a write buffer, which lets the processor continue as soon as the data are written to the buffer, overlapping processor execution with memory updating:
• A write buffer is placed between the cache and memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO
– Typical number of entries: 4

[Figure: Processor → Cache; Processor → Write Buffer → DRAM]
21
Write-Miss Policy: Write Allocate vs. No-Write Allocate

Two options on a write miss:
• Write allocate — the block is allocated on a write miss, followed by the write-hit actions. Write misses act like read misses.
• No-write allocate — write misses do not affect the cache; the block is modified only in the lower-level memory.

With no-write allocate, blocks stay out of the cache until the program tries to read them; with write allocate, even blocks that are only written will still be in the cache.
22
Write-Miss Policy Example

Example: assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:

Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].

What are the numbers of hits and misses (reads and writes included) with no-write allocate versus write allocate?

Answer:
No-write allocate:               Write allocate:
Write Mem[100]; 1 write miss     Write Mem[100]; 1 write miss
Write Mem[100]; 1 write miss     Write Mem[100]; 1 write hit
Read Mem[200];  1 read miss      Read Mem[200];  1 read miss
Write Mem[200]; 1 write hit      Write Mem[200]; 1 write hit
Write Mem[100]; 1 write miss     Write Mem[100]; 1 write hit
4 misses; 1 hit                  2 misses; 3 hits
23
5.3 Cache Performance

Example: Split Cache vs. Unified Cache — which has the better average memory access time?
• A 16-KB instruction cache with a 16-KB data cache (split cache), or
• a 32-KB unified cache?

Miss rates:
Size    Instruction Cache   Data Cache   Unified Cache
16 KB   0.4%                11.4%        —
32 KB   —                   —            3.18%

Assume:
• A hit takes 1 clock cycle and the miss penalty is 100 cycles
• A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port
• 36% of the instructions are data transfer instructions
• About 74% of the memory accesses are instruction references

Answer:
Average memory access time (split)
= % instructions × (Hit time + Instruction miss rate × Miss penalty)
 + % data × (Hit time + Data miss rate × Miss penalty)
= 74% × (1 + 0.4% × 100) + 26% × (1 + 11.4% × 100) = 4.24

Average memory access time (unified)
= 74% × (1 + 3.18% × 100) + 26% × (1 + 1 + 3.18% × 100) = 4.44
24
Impact of Memory Access on CPU Performance

Example: suppose a processor has
– Ideal CPI = 1.0 (ignoring memory stalls)
– Avg. miss rate of 2%
– Avg. 1.5 memory references per instruction
– Miss penalty of 100 cycles
What is the impact on performance when the behavior of the cache is included?

Answer:
CPI = CPU execution cycles per instr. + Memory stall cycles per instr.
 = CPI execution + Miss rate × Memory accesses per instr. × Miss penalty
CPI with cache = 1.0 + 2% × 1.5 × 100 = 4
CPI without cache = 1.0 + 1.5 × 100 = 151

CPU time with cache = IC × CPI × Clock cycle time = IC × 4.0 × Clock cycle time
CPU time without cache = IC × 151 × Clock cycle time

• Without a cache, the CPI of the processor rises from 1 to 151!
• Even with the cache, the processor is stalled waiting for memory 75% of the time (CPI: 1 → 4).
25
Impact of Cache Organization on CPU Performance

Example: what is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
– Ideal CPI = 2.0 (ignoring memory stalls)
– Clock cycle time is 1.0 ns
– Avg. 1.5 memory references per instruction
– Cache size: 64 KB; block size: 64 bytes
– For the set-associative cache, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer
– Cache miss penalty is 75 ns
– Hit time is 1 clock cycle
– Miss rates: direct mapped 1.4%; 2-way set associative 1.0%

Answer:
• Avg. memory access time (1-way) = 1.0 + (0.014 × 75) = 2.05 ns
 Avg. memory access time (2-way) = 1.0 × 1.25 + (0.01 × 75) = 2.00 ns
• CPU time (1-way) = IC × (CPI execution × Clock cycle time + Miss rate × Memory accesses per instruction × Miss penalty)
 = IC × (2.0 × 1.0 + (1.5 × 0.014 × 75)) = 3.58 × IC
 CPU time (2-way) = IC × (2.0 × 1.25 + (1.5 × 0.01 × 75)) = 3.63 × IC

Although the 2-way cache has the lower average memory access time, it has the worse CPU time, because the stretched clock cycle penalizes every instruction.
26
Summary of Performance Equations

CPU execution time = IC × (CPI execution + (Memory accesses/Instruction) × Miss rate × Miss penalty) × Clock cycle time

Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty

Improving Cache Performance: the next few sections in the textbook look at ways to improve each term — the miss penalty (Section 5.4), the miss rate (Section 5.5), and the hit time (Section 5.7).
28
5.4 Reducing Cache Miss Penalty

Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty

The time to handle a miss is increasingly the controlling factor, because processor speed has improved far faster than memory speed.

Five optimizations:
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffers
5. Victim caches
29
O1: Multilevel Caches

• Approaches:
– Make the cache faster, to keep pace with the speed of CPUs
– Make the cache larger, to overcome the widening gap
– L1: fast hits; L2: fewer misses
• L2 equations:

Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) × Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2)
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) × (Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2))

Hit Time(L1) << Hit Time(L2) << … << Hit Time(Mem)
Miss Rate(L1) < Miss Rate(L2) < …

• Definitions:
– Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate(L1), Miss Rate(L2))
 • The L1 cache skims the cream of the memory accesses
– Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate(L1); Miss Rate(L1) × Miss Rate(L2))
 • Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
30
Design of the L2 Cache

• Size
– Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1.
• Whether data in L1 is also in L2
– Novice approach: design L1 and L2 independently
– Multilevel inclusion: L1 data are always present in L2
 • Advantage: consistency between I/O and the cache is easy to maintain (check L2 only)
 • Drawback: L2 must invalidate all L1 blocks that map onto a replaced second-level block => slightly higher first-level miss rate
 • e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
– Multilevel exclusion: L1 data are never found in L2
 • A cache miss in L1 results in a swap of blocks between L1 and L2
 • Advantage: prevents wasting space in L2
 • e.g., AMD Athlon: 64 KB L1 and 256 KB L2
31
O2: Critical Word First and Early Restart

Don't wait for the full block to be loaded before restarting the CPU:
• Critical word first — request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
• Early restart — fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
– Given spatial locality, the CPU tends to want the next sequential word, so the benefit of early restart is not clear-cut.

Generally useful only with large blocks.
32
O3: Giving Priority to Read Misses over Writes

• Serve reads before outstanding writes have completed.
• Write through with write buffers:

SW R3, 512(R0)  ; M[512] <- R3  (cache index 0)
LW R1, 1024(R0) ; R1 <- M[1024] (cache index 0)
LW R2, 512(R0)  ; R2 <- M[512]  (cache index 0)

Problem: write through with write buffers creates RAW conflicts between main memory reads on cache misses and writes still sitting in the buffer.
– Simply waiting for the write buffer to empty might increase the read miss penalty (by 50% on the old MIPS 1000).
– Instead, check the write buffer contents before a read; if there are no conflicts, let the memory access continue.
• Write back: suppose a read miss will replace a dirty block.
– Normal: write the dirty block to memory, and then do the read.
– Instead: copy the dirty block to a write buffer, do the read, and then do the write.
– The CPU stalls less, since it restarts as soon as the read is done.
33
O4: Merging Write Buffer

• If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective.
– Usually a write buffer entry holds multiple words.
• Write merging: the addresses of buffered writes are checked to see whether the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry.

[Figure: a write buffer with 4 entries, each holding four 64-bit words — without merging (left), four one-word writes occupy four entries; with merging (right), the four writes are combined into a single entry.]

• Writing multiple words at the same time is faster than writing them one at a time.
34
O5: Victim Caches

The idea is recycling: remember what was most recently discarded on a cache miss, in case it is needed again
– rather than simply discarding it or swapping it into L2.

A victim cache is a small, fully associative cache between a cache and its refill path:
– it contains only blocks that were discarded from the cache on a miss — the "victims"
– it is checked on a miss before going to the next lower-level memory
– victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
– AMD Athlon: 8 entries
35
5.5 Reducing Miss Rate

The 3 C's of cache misses:
• Compulsory — the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
• Capacity — if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
• Conflict — if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache that would hit in a fully associative cache of size X.)
36
3 C's of Cache Miss

[Figure: 3Cs absolute miss rate (SPEC92) — miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity. Compulsory misses are vanishingly small; conflict misses shrink as associativity rises; capacity misses dominate in small caches.]

2:1 Cache Rule:
miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2
37
3Cs Relative Miss Rate

[Figure: the same data normalized to 100% — the fraction of misses per type (conflict, capacity, compulsory) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity.]

Flaw of the 3Cs model: it assumes a fixed block size. Its virtue: the insight it provides leads to inventions.
38
Five Techniques to Reduce Miss Rate
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations
39
O1: Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes from 1 KB to 256 KB — the miss rate falls as blocks grow, then rises again for the smaller caches.]

• Takes advantage of spatial locality: the larger the block, the greater the chance that parts of it will be used again.
• The number of blocks is reduced for a cache of the same size, and the miss penalty increases.
• It may increase conflict misses, and even capacity misses if the cache is small.
• Usually, high latency and high bandwidth encourage large block sizes.
40
O2: Larger Caches

• Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15).
• Drawbacks: possibly longer hit time, and higher cost.
• Trend: larger L2 or L3 off-chip caches.
41
O3: Higher Associativity

• Figures 5.14 and 5.15 show how miss rates improve with higher associativity.
– 8-way set associative is as effective as fully associative for practical purposes.
– 2:1 Cache Rule:
 miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
• Tradeoff: a more associative cache complicates the circuit
– and may lengthen the clock cycle.
• Beware: execution time is the only final measure!
– Will the clock cycle time increase as a result of having a more complicated cache?
– Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache, +2% for an internal one.
42
O4: Way Prediction and Pseudoassociative Caches

Way prediction: extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
• Example: the 2-way I-cache of the Alpha 21264
– If the predictor is correct, the I-cache latency is 1 clock cycle
– If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles
– Prediction accuracy is in excess of 85%
• Goal: reduce conflict misses while maintaining the hit speed of a direct-mapped cache.

Pseudoassociative (or column associative) caches:
– On a miss, a second cache entry is checked before going to the next lower level
 • one fast hit time and one slow hit time
– The other block in the "pseudo-set" is found by inverting the most significant bit of the index
– The miss penalty may become slightly longer
43
O5: Compiler Optimizations

Improve the hit rate through compile-time optimization.
• Reordering instructions using profiling information (McFarling [1989])
– Reduced misses by 50% for a 2-KB direct-mapped I-cache with 4-byte blocks, and by 75% in an 8-KB cache
– Best performance was obtained when it was possible to prevent some instructions from entering the cache
• Aligning basic blocks: place the entry point at the beginning of a cache block
– Decreases the chance of a cache miss for sequential code
• Loop interchange: exchange the nesting of loops
– Improves spatial locality => reduces misses
– Makes data be accessed in the order it is stored
 => maximizes use of the data in a cache block before it is discarded

/* Before: skips through memory in strides of 100 words */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: accesses all words in a cache block before moving on */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];
44
Blocking: operate on submatrices (blocks)
– Maximize accesses to the data loaded into the cache before it is replaced
– Improves temporal locality

Matrix multiply X = Y × Z:

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

• The number of capacity misses depends on N and the cache size
• The total number of memory words accessed drops to 2N³/B + N²
• y benefits from spatial locality; z benefits from temporal locality
45
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

Three techniques that overlap the execution of instructions with memory accesses:
1. Nonblocking caches to reduce stalls on cache misses
 • to match out-of-order processors
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching
46
O1: Nonblocking Caches to Reduce Stalls on Cache Misses

For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss.
• Separate I-cache and D-cache
– Continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data
• Nonblocking (lockup-free) cache
– "Hit under miss": the D-cache continues to supply cache hits during a miss
– "Hit under multiple miss" or "miss under miss": overlap multiple misses

[Figure: ratio of the average memory stall time of a blocking cache to hit-under-miss schemes across 18 SPEC programs.]
• For the first 14 (FP) programs, the average ratio is 76% for 1 outstanding miss, 51% for 2, and 39% for 64
• For the final 4 (integer) programs, the averages are 81%, 78%, and 78%
47
O2: Hardware Prefetching of Instructions and Data

Prefetch instructions or data before they are requested by the CPU
– either directly into the caches or into an external buffer (faster to access than main memory)
• Instruction prefetch is frequently done in hardware outside the cache
– Fetch two blocks on a miss:
 • the requested block is placed in the I-cache when it returns
 • the prefetched block is placed in an instruction stream buffer (ISB)
 • a single ISB would catch 15% to 25% of the misses from a 4-KB direct-mapped I-cache with 16-byte blocks; 4 ISBs increased the hit rate to 43% (Jouppi [1990])
• UltraSPARC III: data prefetch
– If a load hits in the prefetch cache:
 • the block is read from the prefetch cache
 • the next prefetch request is issued, calculating the "stride" of the next prefetched block from the difference between the current address and the previous address
– Up to 8 simultaneous prefetches

Prefetching may interfere with demand misses, lowering performance.
48
O3: Compiler-Controlled Prefetching

• Two flavors of compiler-controlled prefetch:
– Register prefetch: load the value into a register
– Cache prefetch: load the data only into the cache (not a register)
• Faulting vs. nonfaulting: the prefetch address does or does not cause an exception for virtual address faults and protection violations
– a normal load instruction = a faulting register prefetch instruction
• The most effective prefetch is "semantically invisible" to a program:
– it doesn't change the contents of registers or memory, and
– it cannot cause virtual memory faults
• Nonbinding prefetch: a nonfaulting cache prefetch
– Overlapped execution: the CPU proceeds while the prefetched data are being fetched
– Advantage: the compiler can avoid the unnecessary prefetches that hardware might issue
– Drawback: prefetch instructions incur instruction overhead
49
5.7 Reducing Hit Time

• Importance of cache hit time:
– Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
– More importantly, cache access time limits the clock rate in many processors today!
• A fast hit time means we
– quickly and efficiently find out whether the data is in the cache, and
– if it is, get the data out of the cache.
• Four techniques:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
50
O1: Small and Simple Caches

• A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address.
• Guideline: smaller hardware is faster.
– Why does the Alpha 21164 have only an 8-KB instruction cache and an 8-KB data cache, plus a 96-KB second-level cache? A small data cache permits a fast clock rate.
• Guideline: simpler hardware is faster.
– e.g., direct mapped, on chip.
• General design:
– a small and simple first-level cache
– for second-level caches, keep the tags on chip and the data off chip.
• The recent emphasis is on fast clock time while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.
51
O2: Avoiding Address Translation During Cache Indexing

• Two tasks: indexing the cache and comparing addresses.
• Virtually vs. physically addressed caches:
– virtual cache: uses the virtual address (VA) for the cache
– physical cache: uses the physical address (PA) obtained by first translating the virtual address
• Challenges for virtual caches:
1. Protection: page-level protection (read-write/read-only/invalid) must be checked
 – it is normally checked as part of the virtual-to-physical address translation
 – solution: an additional field copies the protection information from the TLB and is checked on every access to the cache
2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed
 – solution: widen the cache address tag with a process-identifier tag (PID)
3. Synonyms or aliases: two different VAs for the same PA
 – inconsistency problem: two copies of the same data in a virtual cache
 – hardware antialiasing solution: guarantee every cache block a unique PA
 • Alpha 21264: checks all possible aliased locations; if one is found, it is invalidated
 – software page-coloring solution: force aliases to share some address bits
 • Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate PAs in the cache
4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12)
52
Virtually Indexed, Physically Tagged Cache

[Figure: three organizations.
– Conventional: CPU → TLB (VA→PA) → cache indexed and tagged with the PA → memory.
– Virtually addressed cache: CPU → cache accessed with the VA, translating only on a miss; suffers from the synonym problem.
– Virtually indexed, physically tagged: the cache (and L2) is indexed with the VA in parallel with the TLB translation, and the PA tags are compared afterward; overlapping cache access with translation requires the cache index to remain invariant across translation.]
53
O3: Pipelined Cache Access

Simply pipeline cache accesses
– a first-level cache hit then takes multiple clock cycles.
• Advantage: a fast cycle time despite slow hits.
 Example: clock cycles to access instructions in the I-cache
– Pentium: 1 clock cycle
– Pentium Pro through Pentium III: 2 clock cycles
– Pentium 4: 4 clock cycles
• Drawback: more pipeline stages lead to
– a greater penalty on mispredicted branches, and
– more clock cycles between the issue of a load and the use of its data.

Note that pipelining increases the bandwidth of instruction accesses rather than decreasing the actual latency of a cache hit.
54
O4: Trace Caches

A trace cache for instructions finds a dynamic sequence of instructions, including taken branches, to load into a cache block.
– The cache blocks contain
 • dynamic traces of executed instructions as determined by the CPU,
 • rather than static sequences of instructions as determined by memory layout.
– Branch prediction is folded into the cache: the predictions are validated along with the addresses to ensure a valid fetch.
– e.g., the Intel NetBurst microarchitecture.
• Advantage: better utilization of cache space
– Trace caches store instructions only from the branch entry point to the exit of the trace.
– In a conventional I-cache, the unused part of a long block that is entered or exited by a taken branch may never be fetched usefully.
• Downside: the same instructions may be stored multiple times.
55
Cache Optimization Summary

• 5.4 Reducing miss penalty
• 5.5 Reducing miss rate
• 5.6 Via parallelism
• 5.7 Reducing hit time
56
Summary

Chapter 5 Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty/Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory