Lecture 11 - 國立臺灣大學 (National Taiwan University), yangc/temp3.pdf



Lecture 11

• Virtual Memory
• Review: Memory Hierarchy


Administration

• Homework 4 - Due 12/21
• HW 4: Use your favorite language to write a cache simulator.
  – Input: address trace, cache size, block size, associativity
  – Output: miss rate
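A minimal sketch of such a simulator, assuming an LRU set-associative cache and a trace given as a list of byte addresses (all names here are illustrative, not part of the assignment):

```python
def miss_rate(addresses, cache_size, block_size, assoc):
    """Simulate an LRU set-associative cache; return the miss rate."""
    num_sets = cache_size // (block_size * assoc)
    sets = [[] for _ in range(num_sets)]   # each set: list of tags, MRU last
    misses = 0
    for addr in addresses:
        block = addr // block_size
        idx = block % num_sets
        tag = block // num_sets
        s = sets[idx]
        if tag in s:
            s.remove(tag)                  # hit: move tag to MRU position
        else:
            misses += 1
            if len(s) == assoc:            # set full: evict LRU (front)
                s.pop(0)
        s.append(tag)
    return misses / len(addresses)
```

For the trace 0x0, 0x4, 0x8, 0xC on a 1 KB direct-mapped cache with 32-byte blocks, only the first access misses, giving a miss rate of 0.25.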


Review: Main Memory

• Simple, interleaved, and wider memory
• Interleaved memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM


Main Memory Background

• Performance of main memory:
  – Latency: time to finish a request
    • Access time: time between the request and the word's arrival
    • Cycle time: minimum time between requests
    • Cycle time > access time
  – Bandwidth: bytes per second
• Main memory is DRAM: Dynamic Random Access Memory
  – Dynamic, since it needs to be refreshed periodically (every 8 ms)
  – Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS, or Row Access Strobe
    • CAS, or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor/bit)
  – Address not divided
• Size: DRAM/SRAM is 4-8x; cost & cycle time: SRAM/DRAM is 8-16x


Virtual Memory: Motivation

• Permit applications to grow larger than the main memory size
  – 32- or 64-bit virtual addresses vs., e.g., 28 bits (256 MB) of physical memory
• Automatic management
• Multiple process management
  – Sharing
  – Protection
• Relocation

[Diagram: virtual pages A, B, C, D; some map to frames in physical memory, the rest reside on disk.]


Virtual Memory Terminology

• Page or segment
  – A block transferred from the disk to main memory
    • Page: fixed size
    • Segment: variable size
• Page fault
  – The requested page is not in main memory (a miss)
• Address translation or memory mapping
  – Virtual to physical address


Cache vs. VM Difference

• What controls replacement?
  – Cache miss: HW
  – Page fault: often handled by the OS
• Size
  – VM space is determined by the address size of the CPU
  – Cache size is independent of the CPU address size
• Lower level use
  – Cache: main memory is not shared by anything else
  – VM: most of the disk contains the file system


Virtual Memory

• 4 Qs for VM
  – Q1: Where can a block be placed in the upper level?
    • Fully associative
  – Q2: How is a block found if it is in the upper level?
    • Pages: use a page table
    • Segments: use a segment table
  – Q3: Which block should be replaced on a miss?
    • LRU
  – Q4: What happens on a write?
    • Write back


Page Table

• Virtual-to-physical address mapping via a page table

[Diagram: the virtual page number indexes the page table (held in main memory); the selected PTE supplies the physical page number, which is concatenated with the page offset.]

What is the size of the page table given a 28-bit virtual address, 4 KB pages, and 4 bytes per page table entry?

2^(28-12) entries x 2^2 bytes/entry = 2^18 bytes = 256 KB
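The arithmetic above can be checked mechanically (a sketch; the function name is mine):

```python
def page_table_bytes(va_bits, page_size, pte_size):
    # one PTE per virtual page: 2^(va_bits - offset_bits) entries
    offset_bits = page_size.bit_length() - 1   # log2, page_size a power of two
    return (1 << (va_bits - offset_bits)) * pte_size
```

For a 28-bit virtual address, 4 KB pages, and 4-byte PTEs this returns 256 KB; a 32-bit virtual address with the same parameters would need a 4 MB page table.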


Inverted Page Table (HP, IBM)

• One PTE per page frame
• Pros & cons:
  – The size of the table is the number of physical pages
  – Must search for the virtual address (using a hash)

[Diagram: the virtual page number is hashed into a Hash Anchor Table (HAT), which points into the inverted page table of (VA, PA) pairs; the page offset passes through unchanged.]


Fast Translation: Translation Look-aside Buffer (TLB)

• Cache of translated addresses
• Alpha 21064 TLB: 32-entry, fully associative

[Diagram: the 30-bit page frame address is compared against all 32 tags; each entry holds V/R/W bits, a <30> tag, and a <21> physical address. A 32:1 mux selects the matching physical page number, which is concatenated with the 13-bit page offset to form the 34-bit physical address.]

Problem: combine caches with virtual memory


TLB & Caches

[Diagram: three organizations]

• Conventional organization: CPU -(VA)-> TLB -(PA)-> cache -(PA)-> memory
• Virtually addressed cache, translate only on a miss: CPU -(VA)-> cache -(VA)-> TLB -(PA)-> memory
• Overlapped cache access with VA translation: the cache is indexed with the VA while the TLB translates in parallel; the cache holds PA tags, and the L2 $ is physically addressed. Requires the cache index to remain invariant across translation.


Virtual Cache

• Avoid address translation before accessing the cache
  – Faster hit time
• Context switch
  – Flush: time to flush + "compulsory misses" from an empty cache
  – Or add a process id (PID) to the TLB
• I/O (physical addresses) must interact with the cache
  – Requires physical -> virtual address translation
• Aliases (synonyms)
  – Two virtual addresses map to the same physical address
    • Two identical copies in the cache


Solutions for Aliases

• HW: anti-aliasing
  – Guarantee every cache block a unique physical address
• OS: page coloring
  – Guarantee that the virtual and physical addresses match in the last n bits
  – Avoids duplicate physical addresses for a block if a direct-mapped cache has size < 2^n

[Diagram: virtual addresses x and y alias, i.e. map to the same physical address; if they agree in the low n = 18 bits, which contain the index bits and block offset, then x and y map to the same cache set and cannot both be cached.]


Virtually indexed & physically tagged cache

• Use the part of the address that is not affected by address translation to index the cache
  – The page offset
  – Overlap reading the tags with address translation
• Limits cache size to page size for a direct-mapped cache - how to get a bigger cache:
  – Higher associativity
  – Page coloring

[Diagram: if the index and block offset fit within the page offset, indexing can start before translation; if the index spills into the page address bits, the trick fails.]
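The size limit can be written as a one-line formula: the index and block offset must fit within the page offset, so capacity <= page size x associativity. A sketch (the function name is an assumption):

```python
def max_vipt_cache_size(page_size, assoc):
    # one way may span at most one page, so C <= page_size * A
    # (index + block-offset bits must fit inside the page offset)
    return page_size * assoc
```

With 4 KB pages a direct-mapped cache is limited to 4 KB, while an 8-way set-associative cache can grow to 32 KB.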


Selecting a Page Size

• Reasons for a larger page size
  – Page table size is inversely proportional to the page size; memory is saved
  – A fast cache hit time is easy when cache <= page size (VA caches); a bigger page keeps this feasible as cache size grows
  – Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
  – The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size
  – Fragmentation: don't waste storage; data must be contiguous within a page
  – Quicker process start for small processes
• Hybrid solution: multiple page sizes
  – Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual address bits)


Virtual Memory Summary

• Why virtual memory?
• Fast address translation
  – Page table, TLB
• TLB & cache
  – Virtual cache vs. physical cache
  – Virtually indexed and physically tagged


Cross Cutting Issues

• Superscalar CPU & number of cache ports
• Speculative execution and a non-faulting option on memory
• Parallel execution vs. cache locality
  – Want far separation to find independent operations vs. want reuse of data accesses to avoid misses

Row order (dependence chain along each row):

  for (i = 0; i < 512; i = i + 1)
      for (j = 1; j < 512; j = j + 1)
          x[i][j] = 2 * x[i][j-1];

Row order, unrolled by 4 (still a serial chain within the row):

  for (i = 0; i < 512; i = i + 1)
      for (j = 1; j + 3 < 512; j = j + 4) {
          x[i][j]   = 2 * x[i][j-1];
          x[i][j+1] = 2 * x[i][j];
          x[i][j+2] = 2 * x[i][j+1];
          x[i][j+3] = 2 * x[i][j+2];
      }

Column order (iterations independent across i, but poor locality):

  for (j = 1; j < 512; j = j + 1)
      for (i = 0; i < 512; i = i + 1)
          x[i][j] = 2 * x[i][j-1];

Column order, unrolled by 4 independent statements:

  for (j = 1; j < 512; j = j + 1)
      for (i = 0; i + 3 < 512; i = i + 4) {
          x[i][j]   = 2 * x[i][j-1];
          x[i+1][j] = 2 * x[i+1][j-1];
          x[i+2][j] = 2 * x[i+2][j-1];
          x[i+3][j] = 2 * x[i+3][j-1];
      }


Cross Cutting Issues

• I/O and consistency of data between cache and memory

[Diagram: two placements of the I/O bridge. (a) I/O goes through the cache: I/O always sees the latest data, but interferes with the CPU. (b) I/O goes directly to main memory via DMA: no interference with the CPU, but I/O might see stale data.]

• Output: write-through
• Input:
  – Make the buffer noncacheable
  – SW: flush the cache
  – HW: check I/O addresses on input


Pitfall: Predicting Cache Performance from a Different Program (ISA, compiler, ...)

• 4 KB data cache: miss rate 8%, 12%, or 28%?
• 1 KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for 8 KB data: 17% vs. 10%

[Figure: miss rate (0% to 35%) vs. cache size (1 KB to 128 KB) for data (D:) and instruction (I:) caches running tomcatv, gcc, and espresso.]

Pitfall: Simulating Too Small an Address Trace

[Figure: cumulative average memory access time (roughly 1 to 4.5) vs. instructions executed (0 to 12 billion); the cumulative AMAT is still changing after billions of instructions, so a short trace gives a misleading answer.]


Review: Memory Hierarchy

• Definition & principle of memory hierarchy
• Cache
  – Organization: cache ABC
  – Improving cache performance
• Main memory
  – Organization
  – Improving memory bandwidth
• Virtual memory
  – Fast address translation
  – TLB & cache


Levels of the Memory Hierarchy

Level      Capacity          Access time            Cost         Staging/Xfer unit            Managed by
---------  ----------------  ---------------------  -----------  ---------------------------  ----------------
Registers  100s of bytes     < 1 ns                 -            instr. operands (1-8 bytes)  prog./compiler
Cache      10s-100s of KB    1-10 ns                $10/MB       blocks (8-128 bytes)         cache controller
Memory     MBs               100-300 ns             $1/MB        pages (512 B-4 KB)           OS
Disk       10s of GB         10 ms (10,000,000 ns)  $0.0031/MB   files (MBs)                  user/operator
Tape       infinite          sec-min                $0.0014/MB   -                            -

Upper levels are smaller and faster; lower levels are larger and slower.


General Principle

• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  – Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• ABC of cache:
  – Associativity
  – Block size
  – Capacity
• Cache organization
  – Direct-mapped cache: A = 1, S = C/B
  – N-way set-associative: A = N, S = C/(B*A)
  – Fully associative: S = 1
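The organization formulas can be evaluated directly (a small sketch; names are mine):

```python
def num_sets(capacity, block_size, assoc):
    # S = C / (B * A); direct mapped has A = 1, fully associative has A = C/B
    return capacity // (block_size * assoc)
```

A 32 KB direct-mapped cache with 64-byte blocks has 512 sets; making it fully associative (A = C/B = 512) collapses it to a single set.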


4 Questions for Memory Hierarchy

• Q1: Where can a block be placed in the upper level? (Block placement)

• Q2: How is a block found if it is in the upper level?(Block identification)

• Q3: Which block should be replaced on a miss? (Block replacement)

• Q4: What happens on a write? (Write strategy)


Q1: Where can a block be placed in the upper level?

• Block 12 placed in an 8-block cache:
  – Fully associative, direct mapped, 2-way set associative
  – S.A. mapping = block number modulo number of sets

[Diagram: memory blocks 0-31 below an 8-block cache. Fully mapped: block 12 can go anywhere. Direct mapped: only block (12 mod 8) = 4. 2-way set associative: anywhere in set (12 mod 4) = 0, out of sets 0-3.]


1 KB Direct Mapped Cache, 32B blocks

• For a 2^N byte cache:
  – The uppermost (32 - N) bits are always the cache tag
  – The lowest M bits are the byte select (block size = 2^M)

[Diagram: a 32-bit address splits into cache tag <31:10> (e.g. 0x50), cache index <9:5> (e.g. 0x01), and byte select <4:0> (e.g. 0x00); bits <31:5> form the block address. Each of the 32 cache entries stores a valid bit, the tag (stored as part of the cache "state"), and 32 bytes of data (Byte 0 ... Byte 31); in the example, entry 1 holds tag 0x50.]
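A sketch of the address split for this 1 KB, 32-byte-block direct-mapped cache (the function name and defaults are illustrative):

```python
def split_address(addr, cache_size=1024, block_size=32):
    """Split a 32-bit address for a direct-mapped cache into
    (tag, index, byte_select)."""
    offset_bits = block_size.bit_length() - 1                 # 5 for 32 B blocks
    index_bits = (cache_size // block_size).bit_length() - 1  # 5 for 32 blocks
    byte_select = addr & (block_size - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, byte_select
```

Address 0x14020 splits into tag 0x50, index 0x01, byte select 0x00, matching the slide's example.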


Q2: How is a block found if it is in the upper level?

• Tag on each block
  – No need to check the index or block offset

[Diagram: block address = tag + index; the block offset selects the word within the block.]


Q3: Which block should be replaced on a miss?

• Easy for direct mapped
• Set associative or fully associative:
  – Random
  – LRU (Least Recently Used)
    • Hardware keeps track of the access history
    • Replace the entry that has not been used for the longest time


Q4: What happens on a write?

• Write through - the information is written to both the block in the cache and the block in the lower-level memory.
• Write back - the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  – Is the block clean or dirty?
• Pros and cons of each?
  – WT:
    • Good: read misses cannot result in writes
    • Bad: write stall
  – WB:
    • Good: no repeated writes to the same location
    • Bad: read misses can result in writes
• WT is always combined with write buffers so that the CPU doesn't wait for the lower-level memory


Write Miss Policy

• Write allocate (fetch on write)
  – The block is loaded on a write miss
• No-write allocate (write-around)
  – The block is modified in the lower level and not loaded into the cache

                 Write back                           Write through
Write allocate   hit: write to cache, set dirty bit   hit: write to cache/memory
                 miss: load block into cache;         miss: load block into cache;
                       write to cache                       write to cache/memory
Write around     hit: write to cache, set dirty bit   hit: write to cache/memory
                 miss: write to memory                miss: write to memory


Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Average memory access time = hit time + miss rate x miss penalty
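A worked instance of the formula, with assumed numbers (1-cycle hit, 5% miss rate, 100-cycle miss penalty):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # average memory access time, in the same units as its inputs
    return hit_time + miss_rate * miss_penalty
```

With the numbers above, AMAT = 1 + 0.05 x 100 = 6 cycles.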


Reducing Misses

• Classifying misses: the 3 Cs
  – Compulsory - the first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (Misses in an infinite cache)
  – Capacity - if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a size X cache)
  – Conflict - if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in an N-way associative, size X cache)


1. Reduce Misses via Larger Block Size

• Larger block size reduces compulsory misses
  – Spatial locality
  – Example: access pattern 0x0000, 0x0004, 0x0008, 0x000C, ...

Block size = 2 words:  0x0000 (miss), 0x0004 (hit), 0x0008 (miss), 0x000C (hit)
Block size = 4 words:  0x0000 (miss), 0x0004 (hit), 0x0008 (hit), 0x000C (hit)
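The hit/miss annotations above can be reproduced with a tiny model that marks a reference as a miss only when its block has not been seen before (an infinite-cache sketch; names are mine):

```python
def first_touch_misses(addresses, block_bytes):
    """Mark each reference: True = miss (block not seen before), False = hit.
    Infinite-cache model: only compulsory misses, no evictions."""
    seen = set()
    out = []
    for a in addresses:
        b = a // block_bytes       # block number containing this address
        out.append(b not in seen)
        seen.add(b)
    return out
```

With 2-word (8-byte) blocks the pattern misses on 0x0000 and 0x0008; with 4-word (16-byte) blocks only on 0x0000.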


2. Reduce Misses via Higher Associativity

• 2:1 Cache Rule:
  – Miss rate of a direct-mapped cache of size N is about that of a 2-way set-associative cache of size N/2
• Beware: execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested hit time grows about +10% for an external cache and +2% for an internal cache, for 2-way vs. 1-way


3. Reducing Misses via Victim Cache

• How to combine fast hit time of Direct Mapped yet still avoid conflict misses?

• Add buffer to place data discarded from cache

• Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache

[Diagram: a direct-mapped cache (blocks 0-7) with a small fully associative victim cache beside it; block 12 evicts block 4 into the victim cache, and a later reference to block 4 hits in the victim cache and is swapped back.]
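A behavioral sketch of a direct-mapped cache backed by a small LRU victim cache (all names and the one-word block size are assumptions, not Jouppi's exact design):

```python
from collections import OrderedDict

def simulate_victim(trace, num_blocks, victim_entries, block_bytes=1):
    """Direct-mapped cache of num_blocks blocks, plus a small fully
    associative LRU victim cache. Returns (misses, hits_in_victim)."""
    main = {}               # index -> tag currently resident
    victim = OrderedDict()  # block number -> None, in LRU order
    misses = victim_hits = 0
    for addr in trace:
        block = addr // block_bytes
        idx, tag = block % num_blocks, block // num_blocks
        if main.get(idx) == tag:
            continue                      # hit in the direct-mapped cache
        evicted_tag = main.get(idx)
        if block in victim:               # conflict miss avoided: swap back
            victim_hits += 1
            del victim[block]
        else:
            misses += 1
        main[idx] = tag
        if evicted_tag is not None:       # displaced block enters victim cache
            victim[evicted_tag * num_blocks + idx] = None
            while len(victim) > victim_entries:
                victim.popitem(last=False)    # evict LRU victim entry
    return misses, victim_hits
```

Two blocks that ping-pong in the same direct-mapped slot (e.g. blocks 0 and 8 with an 8-block cache) miss only twice when a victim cache is present, versus on every access without one.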


4. Reducing Misses via Pseudo-Associativity

• How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
• Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2 cycles
  – Better for caches not tied directly to the processor

[Timeline: hit time < pseudo hit time < miss penalty]


5. Reducing Misses by HW Prefetching of Instruction & Data

• Bring a cache block up the memory hierarchy before it is requested by the processor
• Example: stream buffer for instruction prefetching (Alpha 21064)
  – The Alpha 21064 fetches 2 blocks on a miss
  – The extra block is placed in the stream buffer
  – On a miss, check the stream buffer

[Diagram: on a miss for block 4, block 4 fills the cache and block 5 goes to the stream buffer; the CPU issues a prefetch request for block 6 to memory.]


6. Reducing Misses by SW Prefetching Data

• Data prefetch
  – The compiler inserts prefetch instructions to request the data before it is needed
  – Register prefetch: load data into a register (HP PA-RISC loads)
  – Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
  – Needs a non-blocking cache
• Issuing prefetch instructions takes time
  – Is the cost of prefetch issues < the savings in reduced misses?


7. Reducing Misses by Compiler Optimizations

• Instructions
  – Reorder procedures in memory so as to reduce misses
  – Profiling to look at conflicts
  – McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks
• Data
  – Merging arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  – Loop interchange: change the nesting of loops to access data in the order stored in memory
  – Loop fusion: combine 2 independent loops that have the same looping and some variables overlap
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
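Loop interchange can be illustrated as follows; in Python the run-time effect is muted, so this sketch only shows the access-order change (function names are mine):

```python
def sum_row_major(x):
    # inner loop walks along a row: consecutive addresses, good spatial locality
    total = 0
    for i in range(len(x)):
        for j in range(len(x[0])):
            total += x[i][j]
    return total

def sum_col_major(x):
    # inner loop strides across rows: same result, poor spatial locality
    total = 0
    for j in range(len(x[0])):
        for i in range(len(x)):
            total += x[i][j]
    return total
```

Both produce the same sum; interchange is legal here because the iterations are independent, and the row-major version matches the storage order of C arrays.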


1. Reducing Miss Penalty: Read Priority over Write on Miss

• Write-through: check write buffer contents before a read; if no conflicts, let the memory access continue
• How to reduce the read-miss penalty for a write-back cache?
  – A read miss replacing a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – The CPU stalls less since it restarts as soon as the read is done

[Diagram: reads go from the cache directly to memory; evicted dirty blocks pass through the write buffer on their way to memory.]


2. Subblock Placement to Reduce Miss Penalty

• Don't have to load the full block on a miss
• Have valid bits per subblock to indicate validity
• (Originally invented to reduce tag storage)

[Diagram: one tag per block, one valid bit per subblock, e.g.:]

Tag   Valid bits (one per subblock)
100   1 1 1 1
300   0 0 1 1
200   1 0 1 0
204   0 0 0 0


3. Early Restart and Critical Word First

• Don't wait for the full block to be loaded before restarting the CPU
  – Early restart - as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Critical word first - request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only for large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear how much early restart helps


4. Non-blocking Caches to reduce stalls on misses

• A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
• "Hit under miss" reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses


5. Reducing Miss Penalty: Second-Level Cache

• L2 equations:

  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

• Definitions:
  – Local miss rate - misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  – Global miss rate - misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)

[Diagram: CPU -> L1 -> L2 -> main memory.]
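The L2 equations and the local/global distinction, as a sketch (function names and the sample numbers below are mine):

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, penalty_l2):
    # Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
    return hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_rate_l2 * penalty_l2)

def global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2):
    # fraction of all CPU accesses that miss in both levels
    return miss_rate_l1 * local_miss_rate_l2
```

With a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, and a 100-cycle memory penalty, AMAT = 1 + 0.05 x (10 + 0.5 x 100) = 4 cycles, and the global L2 miss rate is 2.5%.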


1. Fast Hit times via Small and Simple Caches

• Why Alpha 21164 has 8KB Instruction and 8KB data cache + 96KB second level cache

• Direct Mapped, on chip


2. Fast hits by Avoiding Address Translation

• Address translation - from virtual to physical addresses
• Physical cache:
  – 1. Translate virtual -> physical
  – 2. Use the physical address to index the cache => longer hit time
• Virtual cache:
  – Use the virtual address to index the cache => shorter hit time
  – Problem: aliasing

More on this after covering virtual memory issues!


2. Fast Hits by Avoiding Address Translation (cont.)

[Diagram repeated from the earlier "TLB & Caches" slide: conventional organization, virtually addressed cache (translate only on a miss), and overlapped cache access with VA translation (virtually indexed, physically tagged, with a physically addressed L2 $; requires the cache index to remain invariant across translation).]


Virtual Cache

• Avoid address translation before accessing the cache
  – Faster hit time
• Context switch
  – Flush: time to flush + "compulsory misses" from an empty cache
  – Or add a process id (PID) to the TLB
• I/O (physical addresses) must interact with the cache
  – Requires physical -> virtual address translation
• Aliases (synonyms)
  – Two virtual addresses map to the same physical address
    • Two identical copies in the cache
  – Solutions:
    • HW: anti-aliasing
    • SW: page coloring


Virtually indexed & physically tagged cache

• Use the part of the address that is not affected by address translation to index the cache
  – The page offset
  – Overlap reading the tags with address translation
• Limits cache size to page size for a direct-mapped cache - how to get a bigger cache:
  – Higher associativity
  – Page coloring

[Diagram: if the index and block offset fit within the page offset, indexing can start before translation; if the index spills into the page address bits, the trick fails.]


3. Fast Hit Times via Pipelined Writes

• Pipeline the tag check and the cache update as separate stages; the current write's tag check overlaps the previous write's cache update
• Only writes are in the pipeline; it is empty during a miss

write x1:  tag check x1 | write data
write x2:                 tag check x2 | write data


4. Fast Writes on Misses Via Small Subblocks

• If most writes are 1 word, the subblock size is 1 word, and the cache is write through, then always write the subblock & set the valid bit immediately
  – Tag match and valid bit already set: writing the block was proper, and nothing is lost by setting the valid bit on again.
  – Tag match and valid bit not set: the tag match means that this is the proper block; writing the data into the subblock makes it appropriate to turn the valid bit on.
  – Tag mismatch: this is a miss and will modify the data portion of the block. As this is a write-through cache, however, no harm was done; memory still has an up-to-date copy of the old value. Only the tag (to the address of the write) and the valid bits of the other subblocks need be changed, because the valid bit for this subblock has already been set.
• Doesn't work with write back, due to the last case


Main Memory Background

• Performance of main memory:
  – Latency: time to finish a request
    • Access time: time between the request and the word's arrival
    • Cycle time: minimum time between requests
    • Cycle time > access time
  – Bandwidth: bytes per second
• Main memory is DRAM: Dynamic Random Access Memory
  – Dynamic, since it needs to be refreshed periodically (every 8 ms)
  – Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS, or Row Access Strobe
    • CAS, or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor/bit)
  – Address not divided
• Size: DRAM/SRAM is 4-8x; cost & cycle time: SRAM/DRAM is 8-16x


Main Memory Performance

• Simple:
  – CPU, cache, bus, and memory all the same width (32 bits)
• Wide:
  – CPU/mux 1 word; mux/cache, bus, and memory N words (Alpha: 64 bits & 256 bits)
• Interleaved:
  – CPU, cache, and bus 1 word; memory N modules (e.g. 4 modules); the example is word-interleaved

[Diagram: three organizations - simple (one-word path from CPU through cache to memory), wide (a multiplexor between the CPU and a wide cache/memory path), and interleaved (one-word path to four memory banks, Bank 0 through Bank 3).]


Independent Memory Banks

• Memory banks for independent accesses vs. faster sequential accesses
  – Multiprocessor
  – I/O
  – Miss under miss, non-blocking cache

[Diagram: the address splits into superbank number, bank number, and bank offset; superbanks 0, 1, 2, ... each contain banks 0-3.]

Page 57

Avoiding Bank Conflicts

• Bank conflicts
  – Memory references map to the same bank
  – Problem: cannot take advantage of multiple banks (supporting multiple independent requests)
• Example: with 128 memory banks interleaved on a word basis, all elements of a column are in the same memory bank (the inner-loop stride, 512 words, is a multiple of 128)

  int x[256][512];
  for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
          x[i][j] = 2 * x[i][j];

• SW: loop interchange, or declare the array with a non-power-of-2 length
• HW: prime number of banks
  – Problem: more complex bank-index calculation per memory access

Page 58

Fast Memory Systems: DRAM Specific

• Multiple column (CAS) accesses per row access: several names (page mode)
  – 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns
• New DRAMs to address the gap; what will they cost, and will they survive?
  – Synchronous DRAM: provides a clock signal to the DRAM; transfers are synchronous to the system clock
  – RAMBUS: reinvents the DRAM interface
    • Each chip is a module, vs. a slice of memory
    • Short bus between CPU and chips
    • Does its own refresh
    • Variable amount of data returned
    • 1 byte / 2 ns (500 MB/s per chip)
• Niche memory: e.g., Video RAM for frame buffers (DRAM + fast serial output)

Page 59

Virtual Memory: Motivation

• Permit applications to grow larger than main memory
  – 32- or 64-bit virtual addresses vs. a 28-bit physical address (256 MB)
• Automatic management
• Multiple-process management
  – Sharing
  – Protection
• Relocation

[Figure: contiguous virtual-memory pages A, B, C, D map to scattered physical-memory frames, with pages that do not fit held on disk.]

Page 60

Virtual Memory

• The 4 questions for VM:
  – Q1: Where can a block be placed in the upper level?
    • Fully associative
  – Q2: How is a block found if it is in the upper level?
    • Pages: use a page table
    • Segments: use a segment table
  – Q3: Which block should be replaced on a miss?
    • LRU
  – Q4: What happens on a write?
    • Write back

Page 61

Page Table

• Virtual-to-physical address mapping via a page table

[Figure: the virtual address splits into virtual page number and page offset; the virtual page number indexes the page table in main memory, and the selected PTE supplies the physical page number.]

• What is the size of the page table, given a 28-bit virtual address, 4 KB pages, and 4 bytes per page table entry?
  – 2^(28-12) entries x 2^2 bytes = 2^18 bytes = 256 KB

Page 62

Inverted Page Table (HP, IBM)

• One PTE per page frame
• Pros & cons:
  – The size of the table is proportional to the number of physical pages
  – Must search for the virtual address (using a hash)

[Figure: the virtual page number is hashed, via the Hash Anchor Table (HAT), into the inverted page table, whose entries pair a VA with a PA.]

Page 64

Fast Translation: Translation Look-aside Buffer (TLB)

• Cache of translated addresses
• Alpha 21064 TLB: 32-entry, fully associative

[Figure: the 30-bit page frame address is compared, via a 32:1 mux, against the tags of all 32 entries; each entry holds V/R/W bits, a 30-bit tag, and a 21-bit physical address. On a hit, the 21-bit frame number is concatenated with the 13-bit page offset to form the 34-bit physical address.]

• Problem: combining caches with virtual memory