Chapter 5 Memory Hierarchy Design
EEF011 Computer Architecture (計算機結構)
December 2004
Chun-Hsin Wu (吳俊興), Department of Computer Science and Information Engineering, National University of Kaohsiung
2
Chapter 5 Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
3
5.1 Introduction
The five classic components of a computer:
• Control
• Datapath
• Memory
• Input
• Output
(Control and Datapath together form the Processor.)

Where do we fetch instructions to execute? We build a memory hierarchy that includes main memory and caches (internal memory) and hard disk (external memory). Instructions are first fetched from external storage such as the hard disk and kept in main memory; before they reach the CPU, they are typically copied into the caches.
4
Technology Trends

DRAM generations:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
2000   256 Mb   100 ns

Capacity vs. speed (latency) improvement:
CPU:   capacity 2x in 1.5 years;  speed 2x in 1.5 years
DRAM:  capacity 4x in 3 years;    speed 2x in 10 years
Disk:  capacity 4x in 3 years;    speed 2x in 10 years
Over 1980-2000, DRAM capacity grew 4000:1, but DRAM speed only 2.5:1!
5
Performance Gap between CPUs and Memory

[Figure: improvement ratio over time — CPU performance improved 1.35x/year, later 1.55x/year, while memory latency improved only about 7%/year.]
The gap (latency) grows about 50% per year!
6
Levels of the Memory Hierarchy

Level           Capacity    Access Time
CPU Registers   500 bytes   0.25 ns
Cache           64 KB       1 ns
Main Memory     512 MB      100 ns
Disk            100 GB      5 ms

Moving from the upper level (registers) down to the lower level (disk), each level is larger but slower. Data moves between levels in units that grow with distance from the CPU: blocks (between cache and memory), pages (between memory and disk), and files (between disk and I/O devices).
7
5.2 ABCs of Caches

• Cache:
– In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU.
– The term is also applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on.
• Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
8
Memory Hierarchy: Terminology

• Hit: the data appears in some block in the cache (example: Block X)
– Hit Rate: the fraction of cache accesses found in the cache
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: the data must be retrieved from a block in main memory (Block Y)
– Miss Rate = 1 - Hit Rate
– Miss Penalty: time to replace a block in the cache + time to deliver the block to the processor
• Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)
9
Cache Measures

CPU execution time incorporating cache performance:

CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.

Memory stall clock cycles
= Number of misses × Miss penalty
= IC × (Misses/Instruction) × Miss penalty
= IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
= IC × Reads per instruction × Read miss rate × Read miss penalty
 + IC × Writes per instruction × Write miss rate × Write miss penalty

Memory accesses consist of instruction fetches and data reads/writes.
10
Example: assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses hit in the cache?

Answer:
(A) If accesses always hit in the cache, CPI = 1.0 and there are no memory stalls, so
 CPU(A) = (IC × CPI + 0) × Clock cycle time = IC × Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
 Memory stalls = IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
 = IC × (1 + 50%) × 2% × 25 = IC × 0.75
 so CPU(B) = (IC + IC × 0.75) × Clock cycle time = 1.75 × IC × Clock cycle time

The performance ratio is the inverse of the ratio of the CPU execution times:
 CPU(B)/CPU(A) = 1.75
The computer with no cache misses is 1.75 times faster.
P.395 Example
11
Four Memory Hierarchy Questions
Q1 (block placement):
Where can a block be placed in the upper level?
Q2 (block identification):
How is a block found if it is in the upper level?
Q3 (block replacement):
Which block should be replaced on a miss?
Q4 (write strategy):
What happens on a write?
12
Q1 (block placement): Where can a block be placed?
• Direct mapped: (Block number) mod (Number of blocks in cache)
• Set associative: (Block number) mod (Number of sets in cache)
– # of sets = (# of blocks) / n
– n-way: n blocks in a set
– 1-way = direct mapped
• Fully associative: # of sets = 1

Example: block 12 placed in an 8-block cache
13
Simplest Cache: Direct Mapped (1-way)

[Figure: a 4-block direct-mapped cache; memory blocks 0 through F each map to cache index (block number) mod 4.]

A block has only one place it can appear in the cache. The mapping is usually
(Block address) MOD (Number of blocks in cache)
14
Example: 1 KB Direct Mapped Cache, 32-byte Blocks

For a 2^N-byte cache with 2^M-byte blocks:
– the uppermost (32 - N) bits of the address are always the Cache Tag
– the lowest M bits are the Byte Select

[Figure: the 32-bit address splits into Cache Tag (bits 31-10, e.g., 0x50), Cache Index (bits 9-5, e.g., 0x01), and Byte Select (bits 4-0, e.g., 0x00). Each cache entry stores a Valid bit and the Cache Tag as part of the cache "state", alongside its 32-byte data block (Bytes 0-31, 32-63, ..., 992-1023).]
15
Q2 (block identification): How is a block found?

Three portions of an address in a set-associative or direct-mapped cache:
Tag | Cache/Set Index | Block Offset
(the Block Address comprises the tag and index; the Block Offset width is set by the block size)

The block offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit.
• Use the Cache Index to select the cache set
• Check the Tag on each block in that set
– No need to check the index or block offset
– A valid bit is added to the tag to indicate whether or not the entry contains a valid address
• Select the desired bytes using the Block Offset

Increasing associativity shrinks the index and expands the tag.
16
Example: Two-Way Set Associative Cache

• The Cache Index selects a "set" from the cache
• The two tags in the set are compared in parallel
• Data is selected based on the tag comparison result

[Figure: the index selects one entry from each of the two ways; the address tag (e.g., 0x50) is compared against both stored tags, qualified by their valid bits. The two compare results are ORed to form Hit and also drive the select inputs (Sel1, Sel0) of a mux that picks the matching way's cache block.]
17
Disadvantage of Set Associative Cache

• N-way set associative cache vs. direct mapped cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
• In a direct mapped cache, the cache block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue; recover later if it was a miss.
18
Q3 (block replacement): Which block should be replaced on a cache miss?

Direct mapped: easy — hardware decisions are simplified. Only one block frame is checked, and only that block can be replaced.

Set associative or fully associative: there are many blocks to choose from on a miss.

Three primary strategies for selecting the block to be replaced:
• Random: a candidate block is selected at random
• LRU: the Least Recently Used block is removed
• FIFO: first in, first out

Data cache misses per 1000 instructions for the replacement strategies:

             2-way                  4-way                  8-way
Size    LRU   Random FIFO     LRU   Random FIFO     LRU   Random FIFO
16 KB   114.1 117.3  115.5    111.7 115.1  113.3    109.0 111.8  110.4
64 KB   103.4 104.3  103.9    102.4 102.3  103.1     99.7 100.5  100.3
256 KB   92.2  92.1   92.5     92.1  92.1   92.5     92.1  92.1   92.5

There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for the smaller caches. FIFO generally outperforms random at the smaller cache sizes.
19
Q4 (write strategy): What happens on a write?

Reads dominate processor cache accesses: e.g., writes are 7% of overall memory traffic but 21% of data cache accesses.

Two options when writing to the cache:
• Write through — the information is written both to the block in the cache and to the block in the lower-level memory.
• Write back — the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

To reduce the frequency of writing back blocks on replacement, a dirty bit indicates whether the block was modified in the cache (dirty) or not (clean). If clean, no write back is needed, since the lower level already holds the same information.

Pros and cons:
• Write through: simple to implement. The cache is always clean, so read misses never cause writes to the lower level.
• Write back: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.
20
Write Stall and Write Buffer

When the CPU must wait for writes to complete during write through, the CPU is said to write stall.

A common optimization to reduce write stalls is a write buffer, which lets the processor continue as soon as the data are written to the buffer, overlapping processor execution with memory updating:
• A write buffer is placed between the cache and memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO
– Typical number of entries: 4

[Figure: Processor → Cache; Processor → Write Buffer → DRAM]
21
Write-Miss Policy: Write Allocate vs. No-Write Allocate

Two options on a write miss:
• Write allocate — the block is allocated on a write miss, followed by the write-hit actions. Write misses act like read misses.
• No-write allocate — write misses do not affect the cache; the block is modified only in the lower-level memory.

With no-write allocate, blocks stay out of the cache until the program tries to read them; with write allocate, even blocks that are only written will still be in the cache.
22
Write-Miss Policy Example

Example: assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:

Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].

What are the numbers of hits and misses (reads and writes included) with no-write allocate versus write allocate?

Answer:
No-write allocate:               Write allocate:
Write Mem[100]; 1 write miss     Write Mem[100]; 1 write miss
Write Mem[100]; 1 write miss     Write Mem[100]; 1 write hit
Read Mem[200];  1 read miss      Read Mem[200];  1 read miss
Write Mem[200]; 1 write hit      Write Mem[200]; 1 write hit
Write Mem[100]; 1 write miss     Write Mem[100]; 1 write hit
4 misses; 1 hit                  2 misses; 3 hits
23
5.3 Cache Performance

Example: Split Cache vs. Unified Cache — which has the better average memory access time?
• A 16-KB instruction cache with a 16-KB data cache (split cache), or
• a 32-KB unified cache?

Miss rates:
Size    Instruction Cache   Data Cache   Unified Cache
16 KB   0.4%                11.4%        —
32 KB   —                   —            3.18%

Assume:
• A hit takes 1 clock cycle and the miss penalty is 100 cycles
• A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port
• 36% of the instructions are data transfer instructions
• About 74% of the memory accesses are instruction references

Answer:
Average memory access time (split)
= % instructions × (Hit time + Instruction miss rate × Miss penalty)
 + % data × (Hit time + Data miss rate × Miss penalty)
= 74% × (1 + 0.4% × 100) + 26% × (1 + 11.4% × 100) = 4.24

Average memory access time (unified)
= 74% × (1 + 3.18% × 100) + 26% × (1 + 1 + 3.18% × 100) = 4.44
24
Impact of Memory Access on CPU Performance

Example: suppose a processor has
– Ideal CPI = 1.0 (ignoring memory stalls)
– Avg. miss rate of 2%
– Avg. 1.5 memory references per instruction
– Miss penalty of 100 cycles
What is the impact on performance when the behavior of the cache is included?

Answer:
CPI = CPU execution cycles per instr. + Memory stall cycles per instr.
 = CPI execution + Miss rate × Memory accesses per instr. × Miss penalty
CPI with cache = 1.0 + 2% × 1.5 × 100 = 4
CPI without cache = 1.0 + 1.5 × 100 = 151

CPU time with cache = IC × CPI × Clock cycle time = IC × 4.0 × Clock cycle time
CPU time without cache = IC × 151 × Clock cycle time

• Without a cache, the CPI of the processor rises from 1 to 151!
• Even with the cache, the processor is stalled waiting for memory 75% of the time (CPI: 1 → 4).
25
Impact of Cache Organization on CPU Performance

Example: what is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
– Ideal CPI = 2.0 (ignoring memory stalls)
– Clock cycle time is 1.0 ns
– Avg. 1.5 memory references per instruction
– Cache size: 64 KB; block size: 64 bytes
– For the set-associative cache, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer
– Cache miss penalty is 75 ns
– Hit time is 1 clock cycle
– Miss rates: direct mapped 1.4%; 2-way set associative 1.0%

Answer:
• Avg. memory access time (1-way) = 1.0 + (0.014 × 75) = 2.05 ns
 Avg. memory access time (2-way) = 1.0 × 1.25 + (0.01 × 75) = 2.00 ns
• CPU time (1-way) = IC × (CPI execution × Clock cycle time + Miss rate × Memory accesses per instruction × Miss penalty)
 = IC × (2.0 × 1.0 + (1.5 × 0.014 × 75)) = 3.58 × IC
 CPU time (2-way) = IC × (2.0 × 1.25 + (1.5 × 0.01 × 75)) = 3.63 × IC

Although the 2-way cache has the lower average memory access time, it has the worse CPU time, because the stretched clock cycle penalizes every instruction.
26
Summary of Performance Equations

CPU execution time = IC × (CPI execution + (Memory accesses/Instruction) × Miss rate × Miss penalty) × Clock cycle time

Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty

Improving Cache Performance: the next few sections in the textbook look at ways to improve each term — the miss penalty (Section 5.4), the miss rate (Section 5.5), and the hit time (Section 5.7).
28
5.4 Reducing Cache Miss Penalty

Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty

The time to handle a miss is increasingly the controlling factor, because processor speed has improved far faster than memory speed.

Five optimizations:
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffers
5. Victim caches
29
O1: Multilevel Caches

• Approaches:
– Make the cache faster, to keep pace with the speed of CPUs
– Make the cache larger, to overcome the widening gap
– L1: fast hits; L2: fewer misses
• L2 equations:

Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) × Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2)
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) × (Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2))

Hit Time(L1) << Hit Time(L2) << … << Hit Time(Mem)
Miss Rate(L1) < Miss Rate(L2) < …

• Definitions:
– Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate(L1), Miss Rate(L2))
 • The L1 cache skims the cream of the memory accesses
– Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate(L1); Miss Rate(L1) × Miss Rate(L2))
 • Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
30
Design of the L2 Cache

• Size
– Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1.
• Whether data in L1 is also in L2
– Novice approach: design L1 and L2 independently
– Multilevel inclusion: L1 data are always present in L2
 • Advantage: consistency between I/O and the cache is easy to maintain (check L2 only)
 • Drawback: L2 must invalidate all L1 blocks that map onto a replaced second-level block => slightly higher first-level miss rate
 • e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
– Multilevel exclusion: L1 data are never found in L2
 • A cache miss in L1 results in a swap of blocks between L1 and L2
 • Advantage: prevents wasting space in L2
 • e.g., AMD Athlon: 64 KB L1 and 256 KB L2
31
O2: Critical Word First and Early Restart

Don't wait for the full block to be loaded before restarting the CPU:
• Critical word first — request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
• Early restart — fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
– Given spatial locality, the CPU tends to want the next sequential word, so the benefit of early restart is not clear-cut.

Generally useful only with large blocks.
32
O3: Giving Priority to Read Misses over Writes

• Serve reads before outstanding writes have completed.
• Write through with write buffers:

SW R3, 512(R0)  ; M[512] <- R3  (cache index 0)
LW R1, 1024(R0) ; R1 <- M[1024] (cache index 0)
LW R2, 512(R0)  ; R2 <- M[512]  (cache index 0)

Problem: write through with write buffers creates RAW conflicts between main memory reads on cache misses and writes still sitting in the buffer.
– Simply waiting for the write buffer to empty might increase the read miss penalty (by 50% on the old MIPS 1000).
– Instead, check the write buffer contents before a read; if there are no conflicts, let the memory access continue.
• Write back: suppose a read miss will replace a dirty block.
– Normal: write the dirty block to memory, and then do the read.
– Instead: copy the dirty block to a write buffer, do the read, and then do the write.
– The CPU stalls less, since it restarts as soon as the read is done.
33
O4: Merging Write Buffer

• If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective.
– Usually a write buffer entry holds multiple words.
• Write merging: the addresses of buffered writes are checked to see whether the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry.

[Figure: a write buffer with 4 entries, each holding four 64-bit words — without merging (left), four one-word writes occupy four entries; with merging (right), the four writes are combined into a single entry.]

• Writing multiple words at the same time is faster than writing them one at a time.
34
O5: Victim Caches

The idea is recycling: remember what was most recently discarded on a cache miss, in case it is needed again
– rather than simply discarding it or swapping it into L2.

A victim cache is a small, fully associative cache between a cache and its refill path:
– it contains only blocks that were discarded from the cache on a miss — the "victims"
– it is checked on a miss before going to the next lower-level memory
– victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
– AMD Athlon: 8 entries
35
5.5 Reducing Miss Rate

The 3 C's of cache misses:
• Compulsory — the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
• Capacity — if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
• Conflict — if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache that would hit in a fully associative cache of size X.)
36
3 C's of Cache Miss

[Figure: 3Cs absolute miss rate (SPEC92) — miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity. Compulsory misses are vanishingly small; conflict misses shrink as associativity rises; capacity misses dominate in small caches.]

2:1 Cache Rule:
miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2
37
3Cs Relative Miss Rate

[Figure: the same data normalized to 100% — the fraction of misses per type (conflict, capacity, compulsory) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity.]

Flaw of the 3Cs model: it assumes a fixed block size. Its virtue: the insight it provides leads to inventions.
38
Five Techniques to Reduce Miss Rate
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations
39
O1: Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes from 1 KB to 256 KB — the miss rate falls as blocks grow, then rises again for the smaller caches.]

• Takes advantage of spatial locality: the larger the block, the greater the chance that parts of it will be used again.
• The number of blocks is reduced for a cache of the same size, and the miss penalty increases.
• It may increase conflict misses, and even capacity misses if the cache is small.
• Usually, high latency and high bandwidth encourage large block sizes.
40
O2: Larger Caches

• Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15).
• Drawbacks: possibly longer hit time, and higher cost.
• Trend: larger L2 or L3 off-chip caches.
41
O3: Higher Associativity

• Figures 5.14 and 5.15 show how miss rates improve with higher associativity.
– 8-way set associative is as effective as fully associative for practical purposes.
– 2:1 Cache Rule:
 miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
• Tradeoff: a more associative cache complicates the circuit
– and may lengthen the clock cycle.
• Beware: execution time is the only final measure!
– Will the clock cycle time increase as a result of having a more complicated cache?
– Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache, +2% for an internal one.
42
O4: Way Prediction and Pseudoassociative Caches

Way prediction: extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
• Example: the 2-way I-cache of the Alpha 21264
– If the predictor is correct, the I-cache latency is 1 clock cycle
– If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles
– Prediction accuracy is in excess of 85%
• Goal: reduce conflict misses while maintaining the hit speed of a direct-mapped cache.

Pseudoassociative (or column associative) caches:
– On a miss, a second cache entry is checked before going to the next lower level
 • one fast hit time and one slow hit time
– The other block in the "pseudo-set" is found by inverting the most significant bit of the index
– The miss penalty may become slightly longer
43
O5: Compiler Optimizations

Improve the hit rate through compile-time optimization.
• Reordering instructions using profiling information (McFarling [1989])
– Reduced misses by 50% for a 2-KB direct-mapped I-cache with 4-byte blocks, and by 75% in an 8-KB cache
– Best performance was obtained when it was possible to prevent some instructions from entering the cache
• Aligning basic blocks: place the entry point at the beginning of a cache block
– Decreases the chance of a cache miss for sequential code
• Loop interchange: exchange the nesting of loops
– Improves spatial locality => reduces misses
– Makes data be accessed in the order it is stored
 => maximizes use of the data in a cache block before it is discarded

/* Before: skips through memory in strides of 100 words */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: accesses all words in a cache block before moving on */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];
44
Blocking: operate on submatrices (blocks)
– Maximize accesses to the data loaded into the cache before it is replaced
– Improves temporal locality

Matrix multiply X = Y × Z:

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

• The number of capacity misses depends on N and the cache size
• The total number of memory words accessed drops to 2N³/B + N²
• y benefits from spatial locality; z benefits from temporal locality
45
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

Three techniques that overlap the execution of instructions with memory accesses:
1. Nonblocking caches to reduce stalls on cache misses
 • to match out-of-order processors
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching
46
O1: Nonblocking Caches to Reduce Stalls on Cache Misses

For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss.
• Separate I-cache and D-cache
– Continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data
• Nonblocking (lockup-free) cache
– "Hit under miss": the D-cache continues to supply cache hits during a miss
– "Hit under multiple miss" or "miss under miss": overlap multiple misses

[Figure: ratio of the average memory stall time of a blocking cache to hit-under-miss schemes across 18 SPEC programs.]
• For the first 14 (FP) programs, the average ratio is 76% for 1 outstanding miss, 51% for 2, and 39% for 64
• For the final 4 (integer) programs, the averages are 81%, 78%, and 78%
47
O2: Hardware Prefetching of Instructions and Data

Prefetch instructions or data before they are requested by the CPU
– either directly into the caches or into an external buffer (faster to access than main memory)
• Instruction prefetch is frequently done in hardware outside the cache
– Fetch two blocks on a miss:
 • the requested block is placed in the I-cache when it returns
 • the prefetched block is placed in an instruction stream buffer (ISB)
 • a single ISB would catch 15% to 25% of the misses from a 4-KB direct-mapped I-cache with 16-byte blocks; 4 ISBs increased the hit rate to 43% (Jouppi [1990])
• UltraSPARC III: data prefetch
– If a load hits in the prefetch cache:
 • the block is read from the prefetch cache
 • the next prefetch request is issued, calculating the "stride" of the next prefetched block from the difference between the current address and the previous address
– Up to 8 simultaneous prefetches

Prefetching may interfere with demand misses, lowering performance.
48
O3: Compiler-Controlled Prefetching

• Two flavors of compiler-controlled prefetch:
– Register prefetch: load the value into a register
– Cache prefetch: load the data only into the cache (not a register)
• Faulting vs. nonfaulting: the prefetch address does or does not cause an exception for virtual address faults and protection violations
– a normal load instruction = a faulting register prefetch instruction
• The most effective prefetch is "semantically invisible" to a program:
– it doesn't change the contents of registers or memory, and
– it cannot cause virtual memory faults
• Nonbinding prefetch: a nonfaulting cache prefetch
– Overlapped execution: the CPU proceeds while the prefetched data are being fetched
– Advantage: the compiler can avoid the unnecessary prefetches that hardware might issue
– Drawback: prefetch instructions incur instruction overhead
49
5.7 Reducing Hit Time

• Importance of cache hit time:
– Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
– More importantly, cache access time limits the clock rate in many processors today!
• A fast hit time means we
– quickly and efficiently find out whether the data is in the cache, and
– if it is, get the data out of the cache.
• Four techniques:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
50
O1: Small and Simple Caches

• A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address.
• Guideline: smaller hardware is faster.
– Why does the Alpha 21164 have only an 8-KB instruction cache and an 8-KB data cache, plus a 96-KB second-level cache? A small data cache permits a fast clock rate.
• Guideline: simpler hardware is faster.
– e.g., direct mapped, on chip.
• General design:
– a small and simple first-level cache
– for second-level caches, keep the tags on chip and the data off chip.
• The recent emphasis is on fast clock time while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.
51
O2: Avoiding Address Translation During Cache Indexing

• Two tasks: indexing the cache and comparing addresses.
• Virtually vs. physically addressed caches:
– virtual cache: uses the virtual address (VA) for the cache
– physical cache: uses the physical address (PA) obtained by first translating the virtual address
• Challenges for virtual caches:
1. Protection: page-level protection (read-write/read-only/invalid) must be checked
 – it is normally checked as part of the virtual-to-physical address translation
 – solution: an additional field copies the protection information from the TLB and is checked on every access to the cache
2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed
 – solution: widen the cache address tag with a process-identifier tag (PID)
3. Synonyms or aliases: two different VAs for the same PA
 – inconsistency problem: two copies of the same data in a virtual cache
 – hardware antialiasing solution: guarantee every cache block a unique PA
 • Alpha 21264: checks all possible aliased locations; if one is found, it is invalidated
 – software page-coloring solution: force aliases to share some address bits
 • Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate PAs in the cache
4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12)
52
Virtually Indexed, Physically Tagged Cache

[Figure: three organizations.
– Conventional: CPU → TLB (VA→PA) → cache indexed and tagged with the PA → memory.
– Virtually addressed cache: CPU → cache accessed with the VA, translating only on a miss; suffers from the synonym problem.
– Virtually indexed, physically tagged: the cache (and L2) is indexed with the VA in parallel with the TLB translation, and the PA tags are compared afterward; overlapping cache access with translation requires the cache index to remain invariant across translation.]
53
O3: Pipelined Cache Access

Simply pipeline cache accesses
– a first-level cache hit then takes multiple clock cycles.
• Advantage: a fast cycle time despite slow hits.
 Example: clock cycles to access instructions in the I-cache
– Pentium: 1 clock cycle
– Pentium Pro through Pentium III: 2 clock cycles
– Pentium 4: 4 clock cycles
• Drawback: more pipeline stages lead to
– a greater penalty on mispredicted branches, and
– more clock cycles between the issue of a load and the use of its data.

Note that pipelining increases the bandwidth of instruction accesses rather than decreasing the actual latency of a cache hit.
54
O4: Trace Caches

A trace cache for instructions finds a dynamic sequence of instructions, including taken branches, to load into a cache block.
– The cache blocks contain
 • dynamic traces of executed instructions as determined by the CPU,
 • rather than static sequences of instructions as determined by memory layout.
– Branch prediction is folded into the cache: the predictions are validated along with the addresses to ensure a valid fetch.
– e.g., the Intel NetBurst microarchitecture.
• Advantage: better utilization of cache space
– Trace caches store instructions only from the branch entry point to the exit of the trace.
– In a conventional I-cache, the unused part of a long block that is entered or exited by a taken branch may never be fetched usefully.
• Downside: the same instructions may be stored multiple times.
55
Cache Optimization Summary

• 5.4 Reducing miss penalty
• 5.5 Reducing miss rate
• 5.6 Via parallelism
• 5.7 Reducing hit time
56
Summary

Chapter 5 Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty/Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory