Charis Theocharides
ΗΜΥ 656 — Advanced Computer Architecture
Spring Semester 2007
LECTURE 9: Advanced Memory Hierarchy Design Issues
Ack: David Patterson, Berkeley
Joseph Manzano, Univ. of Delaware
Since 1980, CPU has outpaced DRAM ...
CPU: 60% per year (2x in 1.5 years)
DRAM: 9% per year (2x in 10 years)
[Chart: Performance (1/latency), log scale, 1980–2000 — the CPU–DRAM gap grew 50% per year.]
Q. How do architects address this gap? A. Put smaller, faster “cache” memories
between CPU and DRAM. Create a “memory hierarchy”.
1977: DRAM faster than microprocessors
Apple ][ (1977) — Steve Wozniak, Steve Jobs
CPU: 1000 ns; DRAM: 400 ns
Levels of the Memory Hierarchy

Level       | Capacity   | Access Time           | Cost                       | Staging/Xfer Unit          | Managed by
Registers   | 100s bytes | <10s ns               |                            | instr. operands, 1–8 bytes | prog./compiler
Cache       | K bytes    | 10–100 ns             | 0.1–1 cents/bit            | blocks, 8–128 bytes        | cache controller
Main Memory | M bytes    | 200–500 ns            | $.0001–.00001 cents/bit    | pages, 512–4K bytes        | OS
Disk        | G bytes    | 10 ms (10,000,000 ns) | 10^-5–10^-6 cents/bit      | files, Mbytes              | user/operator
Tape        | infinite   | sec–min               | 10^-8 cents/bit            |                            |

Upper level: faster. Lower level: larger.
Memory Hierarchy: Apple iMac G5 (1.6 GHz)

Level                  | Reg      | L1 Inst   | L1 Data   | L2         | DRAM      | Disk
Size                   | 1K       | 64K       | 32K       | 512K       | 256M      | 80G
Latency (cycles, time) | 1, 0.6 ns| 3, 1.9 ns | 3, 1.9 ns | 11, 6.9 ns | 88, 55 ns | 10^7, 12 ms

Managed by the compiler (registers), by hardware (caches), by OS/hardware/application (memory, disk).
Goal: the illusion of large, fast, cheap memory — let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
iMac’s PowerPC 970: all caches on-chip — registers (1K), L1 (64K instruction), L1 (32K data), L2 (512K).
The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
• For the last 15 years, hardware has relied on locality for speed.
Locality is a property of programs that is exploited in machine design.
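To see spatial locality in cache-line terms, the sketch below counts how many distinct cache lines two access patterns touch over the same number of accesses; the 64-byte line size and 4-byte int are assumptions for illustration, not values from the slides.

```python
LINE = 64   # assumed cache-line size in bytes
INT = 4     # assumed int size in bytes

def lines_touched(indices):
    """Distinct cache lines a sequence of int-array accesses pulls in."""
    return len({(i * INT) // LINE for i in indices})

sequential = range(1024)               # a[0], a[1], ..., a[1023]
strided    = range(0, 1024 * 16, 16)   # a[0], a[16], a[32], ... (same access count)

print(lines_touched(sequential))  # 64 lines: 16 useful ints per fetched line
print(lines_touched(strided))     # 1024 lines: one useful int per fetched line
```

Sequential access amortizes each line fill over 16 elements; the strided sweep pays a full line fill per element.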
Programming Shared Memory (review)
• A program is a collection of threads of control.
• Each thread has a set of private variables
– e.g., local variables on the stack.
• Threads also share a set of shared variables
– e.g., static variables, shared common blocks, the global heap.
• Communication and synchronization happen through shared variables.
[Figure: several processors share one address space, split into a shared region (e.g., x, y) and per-thread private regions.]
Outline
• Historical perspective
• Bus-based machines
– Pentium SMP
– IBM SP node
• Directory-based (CC-NUMA) machine
– Origin 2000
• Global address space machines
– Cray T3D and (sort of) T3E
60s Mainframe Multiprocessors
• Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices
• How do you enhance processing capacity?
– Add processors
• An interconnect is already needed between the slow memory banks and the processor + I/O channels
– cross-bar or multistage interconnection network
[Figure: processors and I/O channels attached to memory modules through a cross-bar / multistage interconnect.]
70s Breakthrough: Caches
• The memory system scaled by adding memory modules
– Both bandwidth and capacity
• Memory was still a bottleneck
– Enter… caches!
• A cache does two things:
– Reduces average access time (latency)
– Reduces bandwidth requirements to memory
[Figure: a cache sits between the fast processor and the slow memory/interconnect, holding recently used values (e.g., A: 17).]
Technology Perspective

DRAM generations:
Year | Size   | Cycle Time
1980 | 64 Kb  | 250 ns
1983 | 256 Kb | 220 ns
1986 | 1 Mb   | 190 ns
1989 | 4 Mb   | 165 ns
1992 | 16 Mb  | 145 ns
1995 | 64 Mb  | 120 ns
(1000:1 capacity growth vs. 2:1 speed growth)

      | Capacity      | Speed
Logic | 2x in 3 years | 2x in 3 years
DRAM  | 4x in 3 years | 1.4x in 10 years
Disk  | 2x in 3 years | 1.4x in 10 years

[Chart: SpecInt and SpecFP performance, 1986–1996.]
Approaches to Building Parallel Machines
[Figure: three organizations, in increasing order of scale —
(a) Shared Cache: P1…Pn through a switch to a shared first-level cache and interleaved main memory;
(b) Centralized Memory (“dance hall”, UMA): P1…Pn, each with a private cache, connected by an interconnection network to shared memory modules;
(c) Distributed Memory (NUMA): P1…Pn, each with a private cache and local memory, joined by an interconnection network.]
80s Shared Memory: Shared Cache
[Chart: transistor counts per chip, 1965–2005, for the i80x86 line (i4004 … Pentium), M68K, and MIPS (SU MIPS, R3010, R4400, R10000).]
• Alliant FX-8
– early 80’s
– eight 68020s with a crossbar to 512 KB of interleaved cache
• Encore & Sequent
– first 32-bit micros (NS32032)
– two to a board with a shared cache
[Figure: P1…Pn through a switch to an interleaved first-level cache and interleaved main memory.]
Shared Cache: Advantages and Disadvantages
Advantages
• Cache placement identical to a single cache
– only one copy of any cached block
• Fine-grain sharing is possible
• Interference (positive)
– One processor may prefetch data for another
– Data can be shared within a line without moving the line
Disadvantages
• Bandwidth limitation
• Interference (negative)
– One processor may flush another processor’s data
Limits of Shared Cache Approach
[Figure: processors with caches and memory/I/O modules on a shared interconnect.]
Assume a 1 GHz processor without a cache:
=> 4 GB/s instruction bandwidth per processor (32-bit)
=> 1.2 GB/s data bandwidth at a 30% load-store frequency
• Each processor needs 5.2 GB/s of bus bandwidth!
• Typical bus bandwidth is closer to 1 GB/s
Modern Multiprocessors: Common Extended Memory Hierarchies
[Figure: four organizations —
Shared Cache: P1…Pn behind one shared cache and main memory;
Bus-Based Shared Memory: per-processor caches on a bus with main memory and I/O devices;
Dance Hall: per-processor caches connected through an interconnect (IC) to memory modules;
Distributed Memory: each processor with its own cache and local memory, joined by an interconnect.]
Review
• Multiprocessors
– Multiple-CPU computer with shared memory
– Centralized multiprocessor
• A group of processors sharing a bus and the same physical memory
• Uniform Memory Access (UMA)
• Symmetric Multiprocessors (SMP)
– Distributed multiprocessors
• Memory is distributed across several processors
• Memory forms a single logical memory space
• Non-Uniform Memory Access (NUMA) multiprocessor
• Multicomputers
– Disjoint local address spaces for each processor
– Asymmetrical multicomputers
• Consist of a front end (user interaction and I/O devices) and a back end (parallel tasks)
– Symmetrical multicomputers
• All components (computers) have identical functionality
• Clusters and networks of workstations
Programming Execution Models
• A set of rules to create programs
• Message Passing Model
– De facto multicomputer programming model
– Multiple address spaces
– Explicit communication / implicit synchronization
• Shared Memory Models
– De facto multiprocessor programming model
– Single address space
– Implicit communication / explicit synchronization
Distributed Memory MIMD
• Advantages
– Less contention
– Highly scalable
– Simplified synchronization
– Message passing combines synchronization + communication
• Disadvantages
– Load balancing
– Deadlock / livelock prone
– Waste of bandwidth
– Overhead of small messages

Shared Memory MIMD
• Advantages
– No partitioning
– No (explicit) data movement
– Minor modifications (or none at all) to toolchains and compilers
• Disadvantages
– Synchronization
– Scalability
• Requires a high-throughput, low-latency network
• Memory hierarchies
• DSM
Shared Memory Execution Model (Thread Virtual Machine)
• Thread Model: a set of rules for thread creation, scheduling, and destruction
• Memory Model: a group of rules that deals with data replication, coherency, and memory ordering
• Synchronization Model: rules that deal with access to shared data
• Private data: data that is not visible to other threads
• Shared data: data that can be accessed by other threads

User-Level Shared Memory Support
• Shared address space support and management
• Access control and management
– Memory consistency model
– Cache management mechanism
Grand Challenge Problems
• Making shared memory multiprocessors effective at thousands of processing units
• Optimizing and compiling parallel applications
• Main areas — assumptions about:
– Memory coherency
– Memory consistency
Review
• Memory coherency
– Ensures that a memory op (a write) will become visible to all actors
– Doesn’t impose restrictions on when it becomes visible
– Per-location consistency
• Memory consistency
– Ensures that two or more memory ops have a certain order among them, even when those operations come from different actors
Memory [Cache] Coherency: The Problem
[Figure: P1, P2, P3 each cache location U with value 5; P3 then writes U = 7. What value will P1 and P2 read?]
Memory Consistency Problem

P1:            P2:
B = 0          A = 0
…              …
A = 1          B = 1
L1: print B    L2: print A

Assume that L1 and L2 are issued only after the other four instructions have completed. What are the possible values printed on the screen? Is 0, 0 a possible combination?
The MCM: A software and hardware contract
MCM Attributes
• Memory operations
• Location of access
– Near memory (cache, near memory modules, etc.) vs. far memory
• Direction of access
– Write or read
• Value transmitted in access
– Size
• Causality of access
– Check whether two accesses are causally related and, if so, in which order they complete
• Category of access
– Static property of accesses
MCM Category of Access
As presented in Mosberger ’93
Memory accesses divide into:
– Private vs. Shared
– Shared: Competing vs. Non-competing
– Competing: Synchronization vs. Non-synchronization
– Synchronization: Acquire vs. Release
– Exclusive vs. Non-exclusive
(Uniform vs. Hybrid models)
Conventional MCMs
As presented in Mosberger ’93
• Uniform models: Atomic Consistency, Sequential Consistency, Causal Consistency, Processor Consistency, Cache Consistency, PRAM, Slow Memory
• Hybrid models: Weak Consistency, Release Consistency, Entry Consistency
Conventional MCM
• Atomic Consistency
– Operation interval: memory ops happen only inside this interval
– Many operations are allowed in the same interval
• Static: reads happen at the beginning and writes happen at the end
• Dynamic: operations happen at any point, as long as the result is as if they were run in a serial execution
– “Any read to a memory location X returns the value stored by the most recent write operation to X”
Conventional MCM
• Sequential Consistency
– “… the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport 79]
– Weaker than Atomic Consistency
• Allows all interleavings of instructions from different processors
• Some of these combinations are not allowed under AC
– P1: W(x) = 1
– P2: R(x) = 0, R(x) = 1
– This is legal under SC but not under AC
Conventional MCM
• Causal Consistency
– Events (writes) that are causally related must be seen in the same order by all processors.
• Example: W1(x), R2(x), W2(y) are causally related because the value of y might depend on the value written to x
– Causally unrelated events can be seen in any order
Conventional MCM
• Cache Consistency
– Synonymous with cache coherence
– Sequential ordering on a per-location basis
• SC ensures sequential ordering for all memory locations
– Cache consistency is included in SC but not the other way around
• Pipelined RAM (PRAM)
– A single processor’s writes can be pipelined without stalling
– All writes from other processors are considered concurrent
• They can be seen in different orders
Conventional MCM
• Processor Consistency (Goodman ’89)
– PRAM and coherence united
– Processors agree on the order of writes from a single processor, but might disagree on the order of writes from different processors, as long as they are to different locations
– Stronger than coherence but weaker than sequential consistency
Conventional MCM
• Weak Consistency
– Restrictions:
• Any access to synchronization variables is SC
• No access to a synchronization variable is issued until all previous data accesses have performed
• No access is issued by a processor until a previous synchronization access has performed
– Synchronization access == fence
– A program behaves as SC under WC if:
• There are no data races
• Synchronization is visible to the memory system
Conventional MCM
• Release Consistency
– Refinement of WC
• Synchronization accesses split into acquire, release, and non-synchronization accesses
• Acquire: one-sided memory barrier; delays all future memory accesses
• Release: one-sided memory barrier; does not complete until all previous memory accesses have completed
• Non-synchronization access: competing accesses with no synchronization purpose
• Entry Consistency
– Similar to RC, but associates every shared variable with a synchronization variable (a lock or barrier)
– Allows concurrent access to different critical sections
– Refines acquire into exclusive and non-exclusive access
Memory Consistency Problem

P1:                P2:
B = 0   (1)        A = 0   (3)
…                  …
A = 1   (2)        B = 1   (4)
L1: print B        L2: print A

Assume that L1 and L2 are issued only after the other four instructions have completed. What are the possible values printed on the screen? Is 0, 0 a possible combination?
The answer: NO under SC — but under weaker models like PRAM it is possible.
Legal SC interleavings: (1, 2, 3, 4), (1, 3, 2, 4), (1, 3, 4, 2), (3, 4, 1, 2), (3, 1, 2, 4), (3, 1, 4, 2)
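The claim can be checked mechanically: the sketch below enumerates every interleaving of the four writes that respects each processor's program order (the six legal SC orders) and collects the values the two prints can observe. The encoding of the writes as Python tuples is illustrative.

```python
from itertools import permutations

# Program order: P1 issues (B=0) then (A=1); P2 issues (A=0) then (B=1).
ops = [("B", 0), ("A", 1), ("A", 0), ("B", 1)]   # indices 0,1 = P1; 2,3 = P2

def sc_outcomes():
    """All (print B, print A) pairs observable under sequential consistency,
    with both prints issued after all four writes have completed."""
    outcomes = set()
    for perm in permutations(range(4)):
        # keep each processor's two writes in program order
        if perm.index(0) < perm.index(1) and perm.index(2) < perm.index(3):
            mem = {}
            for i in perm:           # execute the writes in this global order
                var, val = ops[i]
                mem[var] = val
            outcomes.add((mem["B"], mem["A"]))
    return outcomes

print(sc_outcomes())  # (0, 0) never appears
```

Intuitively, printing 0, 0 would require write (4) before (1) and (2) before (3), which contradicts the two program orders; the enumeration confirms only (1, 0), (1, 1), and (0, 1) occur.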
Sufficient Conditions for SC
• Every processor issues memory operations in program order.
• After a write is issued, the processor waits for it to complete.
• After a read is issued, the processor waits for it to complete, and also waits for the write that produced the value returned by the read.
– i.e., reads have to wait for the writes they depend on to have propagated to all processors
One More Thing …
• Scratch Pad
– A section of memory private to each processing unit
– Non-coherent, and therefore non-consistent
– Its own read and write ports
– High-speed access
– A separate address space as seen by the processor
– Good or bad?
Snoopy Cache-Coherence Protocols
• The bus is a broadcast medium, and caches know what they have
• The cache controller “snoops” all transactions on the shared bus
– A transaction is relevant if it involves a cache block currently contained in this cache
– Take action to ensure coherence
» invalidate, update, or supply the value
– The action depends on the state of the block and the protocol
[Figure: P1…Pn, each with a cache (state, address, data), snooping a shared bus connected to memory and I/O devices.]
Basic Choices in Cache Coherence
• A cache may keep information such as:
– Valid/invalid
– Dirty (inconsistent with memory)
– Shared (present in other caches)
• When a processor executes a write to shared data, the basic design choices are:
– Write-through: do the write in memory as well as the cache
– Write-back: wait and do the write later, when the item is flushed
– Update: give all other processors the new value
– Invalidate: all other processors remove the block from their caches
Example: Write-Through Invalidate
• Update and write-through both use more memory bandwidth if there are writes to the same address
– Update: traffic to the other caches
– Write-through: traffic to memory
[Figure: P1, P2, P3 caches all hold u = 5; P3 writes u = 7 (write-through, invalidating the other copies); P1 and P2 then re-read u.]
Write-Back/Ownership Schemes
• When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.
– Reads by others cause the block to return to the “shared” state
• Most bus-based multiprocessors today use such schemes.
• There are many variants of ownership-based protocols.
Sharing: A Performance Problem
• True sharing
– Frequent writes to a variable can create a bottleneck
– OK for read-only or infrequently written data
– Technique: make copies of the value, one per processor, if the algorithm allows it
– Example problem: the data structure that stores the freelist/heap for malloc/free
• False sharing
– Cache blocks may also introduce artifacts
– Two distinct variables in the same cache block
– Technique: allocate data used by each processor contiguously, or at least avoid interleaving
– Example problem: an array of ints, one element written frequently by each processor
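A quick way to see false sharing is to compute which cache block each variable lands in. The sketch below assumes a 64-byte block and 4-byte ints (values not stated on the slide): four per-processor counters allocated contiguously all fall into one block, while padded counters get a block each.

```python
BLOCK = 64   # assumed cache-block size in bytes
INT = 4      # assumed int size in bytes

def block_of(byte_addr):
    """Cache block index that a byte address maps to."""
    return byte_addr // BLOCK

# One counter per processor, laid out contiguously: counters[p] at byte p*INT.
counters = [block_of(p * INT) for p in range(4)]
# Padded layout: each counter starts on its own 64-byte boundary.
padded   = [block_of(p * BLOCK) for p in range(4)]

print(counters)  # [0, 0, 0, 0] — all four counters share block 0: false sharing
print(padded)    # [0, 1, 2, 3] — one block per counter: no false sharing
```

With the contiguous layout, every write by one processor invalidates the block in the other three caches even though no data is logically shared; padding trades memory for the absence of that coherence traffic.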
Limits of Bus-Based Shared Memory
[Figure: processors with caches and memory/I/O modules on a shared bus.]
Assume a 1 GHz processor without a cache:
=> 4 GB/s instruction bandwidth per processor (32-bit)
=> 1.2 GB/s data bandwidth at a 30% load-store frequency
Suppose a 98% instruction hit rate and a 95% data hit rate:
=> 80 MB/s instruction bandwidth per processor
=> 60 MB/s data bandwidth per processor
=> 140 MB/s combined bandwidth per processor
Assuming 1 GB/s bus bandwidth, 8 processors will saturate the bus.
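The slide's arithmetic can be replayed directly. The sketch below assumes 4-byte instructions and data and decimal GB, as the slide does.

```python
# Per-processor bandwidth demand, 1 GHz processor, one instruction per cycle.
freq = 1e9
inst_bw = freq * 4              # 4 GB/s instruction-fetch bandwidth (32-bit)
data_bw = freq * 0.30 * 4       # 1.2 GB/s at a 30% load-store frequency

# Without caches, every access goes to the bus:
demand_no_cache = inst_bw + data_bw                       # 5.2 GB/s

# With caches, only misses reach the bus (98% inst, 95% data hit rates):
miss_bw = inst_bw * (1 - 0.98) + data_bw * (1 - 0.95)     # 80 + 60 = 140 MB/s

bus_bw = 1e9
procs_to_saturate = bus_bw / miss_bw                      # ~7.1, so 8 saturate

print(demand_no_cache / 1e9, miss_bw / 1e6, procs_to_saturate)
```

Caches cut each processor's bus demand by roughly 37x (5.2 GB/s down to 140 MB/s), which is exactly what makes a shared bus viable for a handful of processors and no more.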
Outline
• Review of cache architecture / organization
• The cache protocol
• Bus-based (snoopy) cache protocol: MESI
• Directory-based cache protocol
• DASH architecture
Cache Coherency
• The coherency problem
– A processor should have exclusive access to a shared variable when writing, and should get the “most recent” value when reading.
• Solution
– When writing, either:
• (1) Invalidate all copies, or
• (2) Broadcast the new copy to everyone
– When reading:
• Find the most recent copy
– Which can be tricky
Write Update vs. Write Invalidate
X is a shared variable with a copy in all caches; then a write occurs. Assume a write-through cache protocol.
[Figure: three bus-based snapshots of P1, P2, P3 with caches over shared memory —
Before: all three caches hold X.
Write invalidate: all cache copies are marked invalid except the most recent one (P3 holds X’; P1, P2 hold I).
Write update: all cache copies are updated with the most recent value (P1, P2, P3 all hold X’).]
Cache Coherence Protocols
• Directory based
– The info about one block of physical memory is kept in a single location
– The directory itself can be distributed
– Advantages: scalable
– Storage proportional to main memory size
• Bus-based snooping
– Uses the shared memory bus
– Every cache that has a copy of the data is responsible for maintaining its coherence
– Advantages: easily added onto existing buses
– Storage proportional to cache size
Snooping Protocols
• Write invalidate
– The writing processor sends an invalidation signal
– Caches listening on the bus invalidate their copy of the variable
– The writing processor writes to the variable
• Memory is updated according to the cache write policy
– Write-through or write-back
• Write update
– The writing processor broadcasts the new value to all caches
Important Design Issues
• Block size / line size
• False sharing
– Different words in the same line
• Compiler aspects
– Aggressive optimizations and reordering

Finite Automata for Cache Protocols
• State transitions
– Read misses, write hits, write misses
• Types of outputs
– Bus signals and CPU actions
[Figure: CPU and cache exchanging CPU signals and bus signals.]
Issues of Snoop-Based Cache Coherence
• Subtle issues:
– Correctness: atomicity issues, deadlock / livelock / starvation issues
– Performance: pipelining of memory ops
– Minimum hardware cost
Directory-Based Protocols
• Reason
– Bus-based protocols may generate too much traffic
– A multi-level interconnect may not have broadcasting capabilities as efficient as a system bus
• Directory-based system
– A directory with an entry per memory location
– Central vs. distributed
Requirements for Implementation
• Stores to each memory location occur in program order
• All processing elements see that order if they access the same memory location
• Only one processor has the write privilege at a time
Implementation
• On Read(X)
– Hit: just use the copy
– Miss: find the current residence of X
• Memory, another cache, elsewhere
• Receive a legal copy
• On Write(X)
– Acquire ownership and/or exclusivity
• Note: assume a write-back, write-invalidate cache protocol
Cache Protocols as FA
• Finite automaton
– A graph in which vertices are states and edges are transitions
– Usually a transition (an edge) is labeled with a 2-tuple a / b, where a is the input action that produced the state change and b is an action that results from it
• The input action may in fact be many actions; the same goes for the output action
• FA as cache protocols
– A vertex is the state of a given cache line
– Transitions may be produced by:
• Processor signals
• Bus actions
The MSI Protocol
• Similar to the protocol used by the Silicon Graphics 4D series of multiprocessor machines
• Three states to differentiate between clean and dirty
– Modified, Shared, and Invalid
• Two types of processor actions and three types of bus signals
– Processor writes and reads
– Bus read, bus read-exclusive, and bus write-back
MSI States
• Modified
– The cached copy is the only valid copy in the system.
– Memory is stale.
• Shared
– The cached copy is valid, and it may or may not be shared by other caches.
• Initial state after first being loaded.
– Memory is up to date.
• Invalid
– The cached copy does not exist.
MSI Protocol State Machine
[Diagram: states M, S, I (promotion toward M, demotion toward I); transitions labeled input / output —
M: PrRd / --, PrWr / --; BusRd / Flush (to S); BusRdX / Flush (to I)
S: PrRd / --, BusRd / --; PrWr / BusRdX (to M); BusRdX / -- (to I)
I: PrRd / BusRd (to S); PrWr / BusRdX (to M)]
Legend:
PrWr — processor write
PrRd — processor read
BusRd — bus read
BusRdX — bus read-exclusive (read to own)
Flush — flush to memory
-- — no action
MSI Example

Processor Action | State P1 | State P2 | State P3 | Bus Action | Data Supplied by
P1 loads u       | S        | –        | –        | BusRd      | Memory
P3 loads u       | S        | –        | S        | BusRd      | Memory
P3 stores u      | I        | –        | M        | BusRdX     | Memory
P1 loads u       | S        | –        | S        | BusRd      | P3’s cache
P2 loads u       | S        | S        | S        | BusRd      | Memory
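The table can be replayed with a small MSI simulator. This is an illustrative sketch, not the SGI implementation: it tracks one block's state in three caches, flushes a Modified copy to service a remote read (supplying the data cache-to-cache), and invalidates other copies on BusRdX.

```python
M, S, I = "M", "S", "I"

class Bus:
    """One block's MSI state in each of n snooping caches."""
    def __init__(self, n):
        self.state = [I] * n

    def read(self, p):                   # PrRd by processor p
        if self.state[p] == I:
            src = "mem"
            for q, st in enumerate(self.state):
                if st == M:              # snoop: M copy flushes, demotes to S
                    self.state[q] = S
                    src = f"P{q + 1} cache"
            self.state[p] = S
            return "BusRd", src
        return None, None                # hit: no bus traffic

    def write(self, p):                  # PrWr by processor p
        if self.state[p] != M:
            for q in range(len(self.state)):
                if q != p:
                    self.state[q] = I    # BusRdX invalidates all other copies
            self.state[p] = M
            return "BusRdX"
        return None                      # already Modified: silent

bus = Bus(3)
print(bus.read(0))   # P1 loads u  -> ('BusRd', 'mem');      states S _ _
print(bus.read(2))   # P3 loads u  -> ('BusRd', 'mem');      states S _ S
print(bus.write(2))  # P3 stores u -> 'BusRdX';              states I _ M
print(bus.read(0))   # P1 loads u  -> ('BusRd', 'P3 cache'); states S _ S
print(bus.read(1))   # P2 loads u  -> ('BusRd', 'mem');      states S S S
```

The five calls reproduce the table row by row, including the cache-to-cache supply on P1's second load.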
The MESI Protocol
• States:
– Modified, Exclusive, Shared, and Invalid
• Due to Papamarcos and Patel [ISCA ’84] (the Illinois protocol)
• State transitions are due to:
– Processor actions: writes or reads
– Bus operations caused by the former
• Implemented in the Intel Pentium Pro (in some modes)
MESI States
• Modified
– Main memory’s value is stale
– No other cache possesses a copy
• Exclusive
– Main memory’s value is up to date
– No other cache possesses a copy
• Shared
– Main memory’s value is up to date
– Other caches have a copy of the variable
• Invalid
– This cache has a stale copy of the variable
MESI States
[Figure: per-state snapshots of two caches A and B over memory —
M: A holds the only valid data; B and memory are stale.
E: A holds valid data and no other cache has a copy; memory is up to date.
S: both A and B hold valid data; memory is up to date.
I: A’s copy is invalid; the valid data is elsewhere.]
Example: Two-Processor System
1. P1 Load — load into Cache 1: I → E (memory transfer); Cache 2 stays I.
2. P2 Load — Cache 1 snoops the read: E → S; load into Cache 2: I → S (memory transfer).
3. P1 Store — Cache 1: S → M; Cache 2: S → I (invalidate).
4. P2 Load (first aborts, then tries again) — Cache 1: M → S (flush); load into Cache 2: I → S (memory transfer).
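The same trace can be replayed with a minimal MESI sketch (illustrative, not the Pentium Pro implementation). The difference from MSI shows up in step 1, where the first load takes state E instead of S, and in the silent E → M upgrade a store gets without any bus signal.

```python
M, E, S, I = "M", "E", "S", "I"

class MESIBus:
    """One block's MESI state in each of n snooping caches."""
    def __init__(self, n):
        self.state = [I] * n

    def load(self, p):
        if self.state[p] != I:
            return self.state[p]            # read hit: no bus traffic
        others = [q for q in range(len(self.state))
                  if q != p and self.state[q] != I]
        for q in others:                     # M or E copies demote to S on BusRd
            self.state[q] = S                # (an M copy flushes first)
        self.state[p] = S if others else E   # Exclusive if nobody else has it
        return self.state[p]

    def store(self, p):
        if self.state[p] in (M, E):
            self.state[p] = M                # silent E -> M upgrade, no bus signal
            return self.state[p]
        for q in range(len(self.state)):
            if q != p:
                self.state[q] = I            # BusRdX invalidates other copies
        self.state[p] = M
        return self.state[p]

bus = MESIBus(2)
print(bus.load(0), bus.state)   # P1 Load  -> E ['E', 'I']
print(bus.load(1), bus.state)   # P2 Load  -> S ['S', 'S']
print(bus.store(0), bus.state)  # P1 Store -> M ['M', 'I']
print(bus.load(1), bus.state)   # P2 Load  -> S ['S', 'S']
```

The four calls reproduce the slide's four steps; a store by P1 right after step 1 would have hit the E state and produced no bus traffic at all, which is the rationale for adding E in the first place.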
Processor Activities vs. Bus Signals
[Diagram: two copies of the M/E/S/I state diamond —
Processor activities: Load (shared) means other caches have copies of the loaded data; Load (exclusive) means this is the only copy; Stores promote toward M.
Snooping activities: Read and Write are operations seen on the bus by the snooping logic, demoting states toward S and I.]
MESI Protocol State Machine
[Diagram: states M, E, S, I (promotion toward M, demotion toward I); transitions labeled input / output —
M: PrRd, PrWr / --; BusRd / Flush (to S); BusRdX / Flush (to I)
E: PrRd / --; PrWr / -- (to M); BusRd / Flush’ (to S); BusRdX / Flush’ (to I)
S: PrRd / --; PrWr / BusRdX (to M); BusRdX / Flush’ (to I)
I: PrRd / BusRd(S) (to S); PrRd / BusRd(S’) (to E); PrWr / BusRdX (to M)]
Legend (X / Y: X is the input — a processor action or bus signal; Y is the output — a bus signal or memory change):
PrRd — a processor read
PrWr — a processor write
BusRdX — a bus read-exclusive: request the data exclusively, demoting other copies
BusRd(S) — a bus read when the element is shared by another processor
BusRd(S’) — a bus read when the element is not shared by another processor
Flush — flush to either memory or a requesting processor (according to the sharing scheme used)
-- — no action or signal produced
Extracted from “Parallel Computer Architecture: A Hardware/Software Approach” by Culler & Singh, page 301.
The MOESI Protocol
• A five-state protocol based on MESI
• Implemented in the AMD64 line of multi-core personal computers and servers
[Picture courtesy of “AMD64 Architecture Programmer’s Manual, Volume 2: System Programming,” AMD64 Technology, September 2006.]
MOESI States
• Modified
– This line has the only valid copy
– Memory is stale
• Owned
– This line has a valid copy; other caches may have a “shared” copy of the data
– Memory is stale
• Exclusive
– This line has a valid copy, and no other cache has one
– Memory is up to date
• Shared
– Data is replicated across many caches and memory
– Note: there may be many shared copies, but only one (or zero) owned copy
• Invalid
The Extra States: Rationale
• Exclusive (MSI → MESI)
– Reduces the number of bus transactions when a value is read exclusively and may be modified in the future
• Owned (MESI → MOESI)
– Reduces the number of bus transactions by delaying the update to memory
The DASH Architecture: Intro
• 1988: the beginning of DASH
• Objectives
– Enhance the scalability of cache coherence and shared memory
– Build a machine as fast and as simple as possible
– A great exploratory tool
• System organization
– Basic block (clusters)
• Four processors connected by a bus
• Added logic and controllers for cache coherence
– Basic blocks connected in a (dual) mesh-type interconnect
The DASH Architecture
• Inside the cluster
– 4 × 33 MHz R3000 or R3010
– 64 KB of I-cache and 64 KB of D-cache
– 256 KB L2 cache
– 256 MB total memory
• Up to 16 MB per cluster
• 16 clusters → 64 PEs
– MESI inside the cluster
The DASH Architecture
• Cache coherence
– Directory-based across clusters
– A bit per cluster in each directory entry
– Across clusters:
• Flat directory structure in special memory
• Implemented per cluster
– Write-exclusive invalidate in directory entries
– After all invalidates have been acknowledged, the state changes in the requesting node
DASH Cache Coherence
• Directory-based memory block states:
– Uncached remote
– Shared remote
– Modified remote
• Cache states:
– Invalid
– Shared
– Exclusive
– Dirty (modified)
RC in DASH: Requirements
• Explicit hardware primitives for ordering
– Full fence and write fence
• All coherence operations (including memory ops) must be ACKed
– All invalidation requests must be ACKed
• Write fence: blocks a processor’s access to its second-level cache and the bus until the write fence completes
• One counter per cache controller keeps track of outstanding writes
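The outstanding-write counter is simple enough to sketch directly; the class and method names below are illustrative, not DASH's.

```python
class WriteFenceCounter:
    """Per-cache-controller counter of outstanding (un-ACKed) writes.
    A write fence may retire only when the counter reaches zero."""
    def __init__(self):
        self.outstanding = 0

    def issue_write(self):
        self.outstanding += 1       # write/invalidation sent, awaiting ACK

    def ack(self):
        self.outstanding -= 1       # acknowledgement received

    def fence_complete(self):
        return self.outstanding == 0

ctl = WriteFenceCounter()
ctl.issue_write(); ctl.issue_write()
print(ctl.fence_complete())  # False: two writes still un-ACKed
ctl.ack(); ctl.ack()
print(ctl.fence_complete())  # True: the fence can now complete
```

This is exactly what makes a release cheap to enforce: the controller only needs one counter, not a record of which writes are outstanding.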
DASH Logical Memory Hierarchy
• Processor level — the processor cache
– When: data is in the local processor cache
– How: data fetched from the local cache
• Local cluster level — other processors’ caches in the local cluster
– When: data is in a processor that can be accessed through the bus
– How: fetch the data through the bus
• Directory home level — the directory and main memory associated with the memory address
– When: data is either in the local directory in a valid state, or doesn’t exist in any other cache
– How: get the data from the local directory or main memory
• Remote cluster level — processors’ caches in remote clusters
– When: data is in the local directory but its state is invalid
– How: ask the owner of the value to provide it (remotely)
DASH Release Consistency
[Diagram: a critical region bracketed by Acquire and Release —
• All loads and stores in the critical region cannot start until the Acquire is completed.
• Subsequent loads and stores might start before the Release is completed.
• Normal loads and stores that precede the Acquire do not need to wait.
• The Release cannot complete until all loads and stores in the critical region are finished.]
One-sided memory barriers.
DASH: Other Features
• Software-controlled, non-binding prefetch operations
• Atomic read-modify-write instructions
• Uncached locks, invalidating locks, granting locks
– Spinning on a cached copy
• Fetch-and-op
– Increment and decrement in memory
– Values are not cached
Global Address Space: Structured Memory
• The processor performs a load
• A pseudo-memory controller turns it into a message transaction with a remote controller, which performs the memory operation and replies with the data
• Examples: BBN Butterfly, Cray T3D
[Figure: a load (Ld R <- Addr) passes through the MMU and pseudo-memory controller, crosses a scalable network as a read request (addr, dest, tag, src), and returns as a read response (tag, data) through a pseudo-processor on the remote node.]
What to Take Away?
• Programming shared memory machines
– May allocate data in a large shared region without too many worries about placement
– The memory hierarchy is critical to performance
» Even more so than on uniprocessors, due to coherence traffic
– For performance tuning, watch sharing (both true and false)
• Semantics
– Need to lock access to shared variables for read-modify-write
– Sequential consistency is the natural semantics
– Architects worked hard to make this work
» Caches are coherent with buses or directories
» No caching of remote data on shared address space machines
– But the compiler and processor may still get in the way
» Non-blocking writes, read prefetching, code motion…
Where Are Things Going?
• High end
– Collections of almost-complete workstations/SMPs on a high-speed network (Millennium, IBM SP machines)
– With a specialized communication assist integrated with the memory system to provide global access to shared data (??)
• Mid end
– Almost all servers are bus-based CC SMPs
– High-end servers are replacing the bus with a network
» Sun Enterprise 10000, Cray SV1, HP/Convex SPP
» SGI Origin 2000
– The volume approach is the Pentium Pro quad-pack + SCI ring
» Sequent, Data General
• Low end
– The SMP desktop is here
• Major change ahead
– SMP on a chip as a building block
Caches and Scientific Computing
• Caches tend to perform worst on demanding applications that operate on large data sets
– Transaction processing
– Operating systems
– Sparse matrices
• Modern scientific codes use tiling/blocking to become cache friendly
– Easier for dense codes than for sparse
– Tiling and parallelism are similar transformations
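Tiling's effect can be sketched with a toy cache model: a small fully associative LRU cache counting line misses for a column-major sweep of a row-major matrix, versus an 8x8 tiled sweep of the same elements. All sizes here (64-byte lines, a 64-line cache, a 256x256 matrix of doubles) are assumptions for illustration.

```python
from collections import OrderedDict

LINE, DOUBLE, N, B = 64, 8, 256, 8   # line bytes, double bytes, matrix dim, tile

def misses(addresses, lines=64):
    """Count misses in a small fully associative LRU cache of `lines` lines."""
    cache, miss = OrderedDict(), 0
    for a in addresses:
        line = a // LINE
        if line in cache:
            cache.move_to_end(line)          # refresh LRU position
        else:
            miss += 1
            cache[line] = True
            if len(cache) > lines:
                cache.popitem(last=False)    # evict least recently used line
    return miss

def addr(i, j):
    return (i * N + j) * DOUBLE              # byte address of A[i][j], row-major

# Column-major sweep: stride of N*8 bytes, lines are evicted before reuse.
col_major = (addr(i, j) for j in range(N) for i in range(N))

# Tiled sweep: visit 8x8 blocks, so each line is reused while still cached.
tiled = (addr(i, j)
         for ii in range(0, N, B) for jj in range(0, N, B)
         for j in range(jj, jj + B) for i in range(ii, ii + B))

print(misses(col_major), misses(tiled))  # 65536 vs 8192: tiling cuts misses 8x
```

The untiled sweep misses on every one of the 65536 accesses because each column touches 256 distinct lines against a 64-line cache; the tiled sweep's working set is only 8 lines per tile, so only the first touch of each line misses.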
90’s: Scalable, Cache-Coherent Multiprocessors
[Figure: processors with caches on an interconnection network; each memory block carries a directory entry with presence bits (1…n) and a dirty bit.]
SGI Origin 2000
[Figure: two nodes on an interconnection network; each node has two processors with 1–4 MB L2 caches, a Hub, an Xbow, 1–4 GB of main memory, and a directory.]
• Single 16”-by-11” PCB
• Directory state in the same or separate DRAMs, accessed in parallel
• Up to 512 nodes (2 processors per node)
• With a 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per processor
• Peak SysAD bus bandwidth is 780 MB/s, and so also Hub–Mem
• Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)
Directory-Based Cache Coherence
• Why is snooping a bad idea?
– Broadcasting is expensive
• Directory
– Maintains the cache state explicitly
– A list of caches that have a copy: many read-only copies, but only one writable copy
[Figure: P1, P2 with caches and communication assists (CA) on a scalable interconnection network, with memory + directory at each node.]
Terminology
• Home node — the node in whose main memory the block is allocated
• Dirty node — the node that has a copy of the block in its cache in modified state
• Owner node — the node that currently holds the valid copy of the block
• Exclusive node — the node that has a copy in exclusive state
• Local node, or requesting node — the node containing the processor that issues a request for the block
• Local block — a block whose home is local to the issuing processor
Basic Operations
• Read miss to a block in modified state
[Figure: requestor, home, and owner nodes (each a processor + cache + CA + memory/directory) —
1. Read request (requestor → home)
2. Response with the owner identifier (home → requestor)
3. Read request (requestor → owner)
4a. Data reply (owner → requestor)
4b. Revision message (owner → home)]
Basic Operations (Cont’d)
• Write miss to a block with two sharers
[Figure: requestor, home, and two sharer nodes —
1. RdEx request (requestor → home)
2. Response with the sharers’ identifiers (home → requestor)
3. Invalidation requests (requestor → sharers)
4. Invalidation acknowledgements (sharers → requestor)]
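Both transactions can be sketched as operations on a flat, memory-based directory entry (a state plus a sharer set). The message names follow the two figures; everything else (the class layout, a single-block directory, integer node ids) is illustrative.

```python
class Directory:
    """Directory entry for one memory block at its home node."""
    def __init__(self):
        self.state = "uncached"     # uncached | shared | modified
        self.sharers = set()
        self.messages = []          # (src, dst, kind) network-traffic log

    def read(self, requestor):
        self.messages.append((requestor, "home", "Read request"))
        if self.state == "modified":
            (owner,) = self.sharers            # dirty block: 4-hop transaction
            self.messages.append(("home", requestor, "Response w/ owner id"))
            self.messages.append((requestor, owner, "Read request"))
            self.messages.append((owner, requestor, "Data reply"))
            self.messages.append((owner, "home", "Revision message"))
            self.sharers.add(requestor)
        else:                                   # clean: home supplies the data
            self.messages.append(("home", requestor, "Data reply"))
            self.sharers.add(requestor)
        self.state = "shared"

    def write(self, requestor):
        self.messages.append((requestor, "home", "RdEx request"))
        self.messages.append(("home", requestor, "Response w/ sharer ids"))
        for s in self.sharers - {requestor}:    # invalidate every other copy
            self.messages.append((requestor, s, "Invalidation request"))
            self.messages.append((s, requestor, "Invalidation ack"))
        self.sharers = {requestor}
        self.state = "modified"

d = Directory()
d.read(1); d.read(2)              # two sharers
d.write(3)                        # write miss: both copies invalidated
print(d.state, sorted(d.sharers)) # modified [3]
d.read(4)                         # read miss to a modified block
print(d.state, sorted(d.sharers)) # shared [3, 4]
```

Note how the requestor, not the home, sends the invalidations and collects the acks, matching the slide's figure; other protocols route them through the home instead.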
Alternatives for Organizing Directories
Directory storage schemes:
• Flat — directory information is in a fixed place: the home
– Memory-based: directory information co-located with the memory module that is the home (Stanford DASH/FLASH, SGI Origin, etc.)
– Cache-based: caches holding a copy of the memory block form a linked list (IEEE SCI, Sequent NUMA-Q)
• Centralized
• Hierarchical — a hierarchy of caches that guarantees the inclusion property
Key questions: how to find the source of directory information, and how to locate the copies?
Memory-Based Directory Schemes
• Full bit-vector (full-map) directory
– Most straightforward
– Low latency: parallel invalidation
– Main disadvantage: storage overhead of P bits × B blocks
• Increasing the cache block size reduces the overhead
– But access time and network traffic increase due to false sharing
• Use a hierarchical protocol (Stanford DASH):
– Each node is a bus-based 4-processor multiprocessor
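The P×B storage overhead is easy to quantify. The sketch below computes directory bits as a fraction of data bits per block; the processor counts and the 64-byte block size are example values, not from the slides.

```python
# Full-map directory: one presence bit per processor (plus any state bits)
# for every memory block, so overhead grows linearly with processor count.
def fullmap_overhead(P, block_bytes, state_bits=0):
    """Directory bits per block divided by data bits per block."""
    return (P + state_bits) / (block_bytes * 8)

print(f"{fullmap_overhead(64, 64):.1%}")    # 64 procs, 64 B blocks -> 12.5%
print(f"{fullmap_overhead(1024, 64):.1%}")  # 1024 procs -> 200.0% (!)
```

At 1024 processors the directory would be twice the size of memory itself, which is exactly why the width-reducing schemes on the next slides exist.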
Storage Reducing Optimization: Directory Width
• Directory width
– The number of bits per directory entry
• Motivation
– Usually only a few caches hold a copy of any given block
• Limited (pointer) directory: Dir_i X
– Storage overhead: k · log P bits (k = number of pointers)
– Overflow methods are needed when more than i caches share a block
– i indicates the number of pointers (i < P)
– X indicates the invalidation method on overflow: broadcast or non-broadcast
Overflow Methods for Limited Protocol
• Dir_i B (Broadcast)
– Set the broadcast bit in the entry on overflow
– On a write, broadcast invalidation messages to all nodes
– Simple, but increases write latency and wastes communication bandwidth
• Dir_i NB (No Broadcast)
– On overflow, invalidate the copy of one existing sharer to free a pointer
– Bad for widely shared read-mostly data: intensive sharing of read-only and read-mostly data degrades performance due to the increased miss ratio
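The two overflow policies can be contrasted in a few lines of Python. This is a sketch of the idea only; the class shape and method names are hypothetical:

```python
class LimitedDir:
    """Limited-pointer directory entry with i pointer slots."""
    def __init__(self, i, policy):
        self.i, self.policy = i, policy   # policy: "B" or "NB"
        self.ptrs = []                    # exact sharer node ids
        self.broadcast = False            # overflow bit (Dir_i B)

    def add_sharer(self, node):
        if node in self.ptrs:
            return
        if len(self.ptrs) < self.i:
            self.ptrs.append(node)
        elif self.policy == "B":
            self.broadcast = True         # writes now invalidate everyone
        else:                             # "NB": evict one existing sharer
            self.ptrs.pop(0)
            self.ptrs.append(node)

    def invalidation_targets(self, num_nodes):
        if self.broadcast:
            return list(range(num_nodes))
        return list(self.ptrs)

d = LimitedDir(i=2, policy="B")
for n in (3, 5, 7):                       # third sharer overflows
    d.add_sharer(n)
print(len(d.invalidation_targets(16)))    # 16: broadcast to all nodes

d2 = LimitedDir(i=2, policy="NB")
for n in (3, 5, 7):
    d2.add_sharer(n)
print(d2.invalidation_targets(16))        # [5, 7]: sharer 3 was invalidated
```

The trade-off is visible directly: B wastes bandwidth on a write after overflow, while NB forces an extra miss on the evicted sharer.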
Overflow Methods for Limited Protocol (Cont’d)
• Dir_i CV_r (Coarse Vector)
– On overflow, the representation changes to a coarse bit vector:
  • If the number of sharers ≤ i, keep i exact pointers
  • Otherwise, each bit stands for a region of r processors (coarse vector)
– Invalidations are sent to entire regions of caches
– Used in the SGI Origin
– Robust across different sharing patterns
– 70% less memory message traffic than broadcast, and at least 8% less than the other schemes
Coarse Bit Vector Scheme
[Figure: a 16-processor example (P0–P15). Before overflow, the entry holds two exact 4-bit pointers with the overflow bit at 0; after overflow, the overflow bit is set to 1 and the same bits become a coarse vector (1 1 1 1), each bit covering a region of 4 processors]
Overflow Methods for Limited Protocol (Cont’d)
• Dir_i SW (Software)
– On overflow, the current i pointers and a pointer to the new sharer are saved into a special portion of local main memory by software
– MIT Alewife: LimitLESS
– The cost of interrupts and software handling is high
• Dir_i DP (Dynamic Pointers)
– The directory entry contains a hardware pointer into the local memory
– Similar to the software mechanism, but without the software overhead: list manipulation is done by a special-purpose protocol processor rather than by the general-purpose processor
– Stanford FLASH
– Directory overhead: 7–9% of main memory
Stanford DASH Architecture
• DASH => Directory Architecture for SHared memory
– Nodes connected by a scalable interconnect
– Partitioned shared memory
– Processing nodes are themselves bus-based multiprocessors
– Distributed directory-based cache coherence
[Figure: two bus-based clusters, each with processors (P1) and caches ($), connected by the interconnection network; each cluster's memory holds a directory with a dirty bit and per-cluster presence bits for every block]
Conclusions
• A full-map directory is most appropriate up to a modest number of processors
• Dir_i CV_r and Dir_i DP are the most likely candidates beyond that
– Coarse vector: loses accuracy on overflow
– Dynamic pointers: processing cost due to the list manipulation in the protocol processor
Storage Reducing Optimization: Directory Height
• Directory height
– The total number of directory entries
• Motivation
– The total amount of cache memory is much less than the total main memory, so most directory entries describe uncached blocks
• Sparse directory
– Organize the directory as a cache; this cache needs no backing store
– When an entry is replaced, send invalidations to the nodes holding copies
– Spatial locality is not an issue: one entry per block
– The reference stream is heavily filtered, consisting only of references that were not satisfied in the processor caches
– With a directory size factor of 8, associativity of 4, and LRU replacement, performance is very close to that of a full-map directory
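A sparse directory's defining behavior, invalidating sharers when an entry is evicted, can be sketched like this (structure, eviction policy, and names are simplified placeholders, not a real implementation):

```python
class SparseDirectory:
    """Directory organized as a set-associative cache with no backing store."""
    def __init__(self, num_sets, assoc):
        self.sets = [dict() for _ in range(num_sets)]  # block -> sharer set
        self.num_sets, self.assoc = num_sets, assoc
        self.invalidated = []                          # (block, sharers) log

    def touch(self, block, sharer):
        s = self.sets[block % self.num_sets]
        if block not in s and len(s) == self.assoc:
            # No backing store: evict an entry and invalidate its copies.
            victim = next(iter(s))                     # stand-in for LRU
            self.invalidated.append((victim, s.pop(victim)))
        s.setdefault(block, set()).add(sharer)

d = SparseDirectory(num_sets=2, assoc=2)
for blk in (0, 2, 4):       # all map to set 0; the third entry evicts one
    d.touch(blk, sharer=1)
print(len(d.invalidated))   # 1 eviction -> invalidations must be sent
```

The eviction-triggered invalidations are the cost of dropping the backing store; the cited results suggest that with enough size and associativity this cost is small.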
Protocol Optimization
• Two major goals, plus one more
– Reduce the number of network transactions per memory operation
  • Reduces the bandwidth demand
– Reduce the number of actions on the critical path
  • Reduces the uncontended latency
– Reduce the endpoint assist occupancy per transaction
  • Reduces the uncontended latency as well as endpoint contention
Snoop-Snoop System
• The simplest way to build large-scale cache-coherent multiprocessors
• Coherence monitor
– Remote (access) cache
– Local state monitor: keeps state information on data that is locally allocated but remotely cached
• Remote cache
– Should be larger than the sum of the processor caches, and quite associative
– Should be lockup-free
– Issues an invalidation request on the local bus when a block is replaced
Snoop-Snoop with Global Memory
• First-level caches
– Highest-performance SRAM caches
– The B1 bus follows a standard snooping protocol
• Second-level cache
– Much larger than the L1 caches (set-associative)
– Must maintain inclusion
– Acts as a filter between the B1 bus and the L1 caches
– Can be DRAM-based, since fewer references reach it
[Figure: two clusters, each with processors/caches and a coherence monitor on a B1 bus, connected through a B2 bus to the global memory M]
Snoop-Snoop with Global Memory (Cont’d)
• Advantages
– Misses to main memory require just a single traversal to the root of the hierarchy
– Placement of shared data is not an issue
• Disadvantages
– Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency
– Memory at the global bus must be highly interleaved; otherwise bandwidth to it will not scale
Cluster-Based Hierarchies
• Key idea
– Main memory is distributed among the clusters
– Reduces global bus traffic (for local data and suitably placed shared data)
– Reduces latency (less contention, and local accesses are faster)
– Example machine: Encore Gigamax
• The L2 cache can be replaced by a tag-only router: a coherence switch
[Figure: two clusters, each with its own memory M, processors/caches, and a coherence monitor on a B1 bus, connected to each other by a B2 bus]
Summary
• Advantages:
– Conceptually simple to build (apply snooping recursively)
– Can get merging and combining of requests in hardware
• Disadvantages:
– Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-D and 3-D grid problems)
– Latencies are often larger than in direct networks
Hierarchical Directory Scheme
• The internal nodes contain only directory information
– An L1 directory tracks which of its children (processing nodes) have a copy of the memory block
– An L2 directory tracks which of its children (L1 directories) have a copy of the memory block
– It also tracks which local memory blocks are cached outside the subtree
– Inclusion is maintained between the processor caches and the L1 directory
• Logical trees may be embedded in any physical hierarchy
[Figure: a tree with processing nodes at the leaves, L1 directories above them, and an L2 directory at the root]
A Multi-rooted Hierarchical Directory
[Figure: a multi-rooted hierarchy over processing nodes p0–p7: each node's memory has its own directory tree (e.g., the tree for p0's memory) with internal directory nodes above the processing nodes at the leaves; the same processing node appears as a leaf in several trees]
Organization and Overhead
• Organization
– A separate directory structure for every block
• Storage overhead
– Each level has about the same amount of directory memory
– C: cache size, b: branch factor, M: memory size, B: block size
• Performance overhead
– Reduces the number of network hops
– But increases the number of end-to-end transactions, and hence latency
– The root becomes the bottleneck
Storage overhead ∝ (P · C / B) · log_b M
Performance Implication of Hierarchical Coherence
• Advantages
– Combining of requests for a block
  • Reduces traffic and contention
– Locality effect
  • Reduces transit latency and contention
• Disadvantages
– Long uncontended latency
– Bandwidth requirements near the root of the hierarchy
IBM PowerPC
• A weak-consistency memory model machine
• A family of machines that supports visible cache instructions and coherence/consistency checking
• Found in the IBM server family as well as the Apple family of computers (with some modifications)
• Harvard-style (split instruction/data) caches
PowerPC Consistency Instructions
• Data cache related instructions
– The dcb(f|st|i|t|tst|z)[e] instructions
• Instruction cache related instructions
– The icbi[e] and isync instructions
• Memory related instructions
– The mbar, msync, lwarx[e], and stwcx[e] instructions
• TLB related instructions
– The tlbsync instruction
Data Cache Related Instructions
• dcbf[e] RA,RB Data Cache Block Flush
– Invalidate the addressed block, first writing it back to main storage if it is modified
• dcbi[e] RA,RB Data Cache Block Invalidate
– Invalidate the addressed block without writing it back, discarding any modifications
• dcbst[e] RA,RB Data Cache Block Store
– Write the addressed block back to main storage if it is modified; the block remains valid
Data Cache Related Instructions
• dcbt[e] CT,RA,RB & dcbtst[e] CT,RA,RB Data Cache Block Touch (and Touch for Store)
– Provide a hint to the machine that loading this address into the cache might improve performance
• dcbz[e] RA,RB Data Cache Block Set to Zero
– Set a block that resides in the cache to zeroes
Memory Related Instructions
• msync Memory Synchronize
– No new instructions can be issued until the msync completes, and the msync does not complete until all preceding instructions have done so
• mbar MO Memory Barrier
– Imposes an ordering on the storage operations of the issuing processor
  • If MO == 0, all storage operations are ordered
  • Otherwise, a subset can be defined
• lwarx[e] RT,RA,RB Load Word And Reserve Indexed
– Important for atomic operations: establishes a reservation on the address
• stwcx[e] RS,RA,RB Store Word Conditional Indexed
– Important for atomic operations: the store succeeds only if the reservation still holds
– Must be used together with lwarx[e] for correctness
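The lwarx/stwcx pair behaves like load-linked/store-conditional. A toy Python model of the reservation mechanism (deliberately simplified: real PowerPC reservations cover a reservation granule, not a single word, and the details here are illustrative):

```python
class Reservations:
    """Single-granule toy model of lwarx/stwcx reservations."""
    def __init__(self):
        self.mem = {}
        self.resv = {}                 # cpu -> reserved address

    def lwarx(self, cpu, addr):
        self.resv[cpu] = addr          # load and set a reservation
        return self.mem.get(addr, 0)

    def stwcx(self, cpu, addr, value):
        if self.resv.get(cpu) != addr:
            return False               # reservation lost: store fails
        self.mem[addr] = value
        # Any store to the address clears every reservation on it.
        self.resv = {c: a for c, a in self.resv.items() if a != addr}
        return True

def atomic_add(r, cpu, addr, n):
    while True:                        # the classic lwarx/stwcx retry loop
        old = r.lwarx(cpu, addr)
        if r.stwcx(cpu, addr, old + n):
            return old + n

r = Reservations()
print(atomic_add(r, cpu=0, addr=0x100, n=5))   # 5

# An intervening store by another CPU makes the conditional store fail:
a = r.lwarx(cpu=1, addr=0x100)                 # cpu 1 reserves
r.lwarx(cpu=0, addr=0x100)
r.stwcx(cpu=0, addr=0x100, value=9)            # cpu 0 stores first
print(r.stwcx(cpu=1, addr=0x100, value=a + 1)) # False: cpu 1 must retry
```

The retry loop in `atomic_add` is the standard idiom the slide alludes to: the pair only guarantees atomicity when the failed stwcx branches back to the lwarx.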
Instruction Cache & TLB related Instructions
• icbi[e] RA,RB Instruction Cache Block Invalidate
– The same as its data cache counterpart, but for the instruction cache
• isync Instruction Synchronize
– Similar to mbar and msync, but for the instruction stream, with the following differences:
  • Prefetched instructions are discarded
  • The isync may complete before the preceding storage operations are performed, but those operations must complete before subsequent instructions execute
• tlbsync TLB Synchronize
– Imposes an ordering on invalidations of TLB entries
Side Note : The TLB
• The translation look-aside buffer
– Used to translate from virtual pages to real pages
• A page is a memory region whose size is determined by the OS
– In PowerPC, pages can be as small as 4 KB or as big as 64 KB
– The page is the granularity at which consistency and coherence are controlled
– The TLB is implemented as a CAM (content-addressable memory)
Side Note : The TLB
• A TLB entry describes a page that is a candidate for translation
• Four categories of fields in a TLB entry:
– Page identification fields
– Address translation fields
– Access control fields
– Storage attribute fields
The TLB Entry: Page Identification Fields
Field Description
V    Valid (1 bit). The entry is valid for translation
EPN  Effective Page Number (54 bits). Compared with the page-number bits of the effective address, bits 0 to 64 − log2(page size)
TS   Translation Address Space (1 bit). The address space this entry is associated with
SIZE (4 bits) The size of the page, computed as 4^SIZE KB
TID  Translation ID (implementation-dependent size). Indicates whether this page is shared or owned
The TLB Entry: Address Translation Fields
Field Description
RPN  Real Page Number (up to 53 bits). Similar in form to the EPN. It replaces bits 0 to n−1 of the effective address to produce the translation, where n is defined as before
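The EPN/RPN substitution described above is just bit splicing: keep the page-offset bits of the effective address and replace the page-number bits with the RPN. A minimal sketch, using conventional low-order-offset bit numbering rather than PowerPC's big-endian bit naming:

```python
def translate(ea, rpn, page_size):
    """Splice the real page number onto the effective address's page offset."""
    offset_bits = page_size.bit_length() - 1     # log2(page size)
    offset = ea & ((1 << offset_bits) - 1)       # offset passes through
    return (rpn << offset_bits) | offset         # page-number bits replaced

# 4 KB page: EA 0x12345678 (effective page 0x12345, offset 0x678)
# translated with real page 0xABCDE:
print(hex(translate(0x12345678, rpn=0xABCDE, page_size=4096)))  # 0xabcde678
```

The RPN and page size here are invented for illustration; the point is that only the high-order n bits change, so the offset within the page is identical in the effective and real addresses.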
The TLB Entry: Access Control Fields
Bit Description
UX  User State Execute Enable
    0: Fetch and execution of instructions from this page are NOT permitted while the Machine State Register PR bit (MSR[PR]) is set to 1
    1: Fetch and execution of instructions from this page are permitted while MSR[PR] is set to 1
SX  Supervisor State Execute Enable
    0: Fetch and execution of instructions from this page are NOT permitted while MSR[PR] is set to 0
    1: Fetch and execution of instructions from this page are permitted while MSR[PR] is set to 0
The TLB Entry: Access Control Fields
Bit Description
UW  User State Write Enable
    0: Store instructions to this page are NOT permitted while MSR[PR] is set to 1
    1: Store instructions to this page are permitted while MSR[PR] is set to 1
SW  Supervisor State Write Enable
    0: Store instructions to this page are NOT permitted while MSR[PR] is set to 0
    1: Store instructions to this page are permitted while MSR[PR] is set to 0
The TLB Entry: Access Control Fields
Bit Description
UR  User State Read Enable
    0: Load instructions from this page are NOT permitted while MSR[PR] is set to 1
    1: Load instructions from this page are permitted while MSR[PR] is set to 1
SR  Supervisor State Read Enable
    0: Load instructions from this page are NOT permitted while MSR[PR] is set to 0
    1: Load instructions from this page are permitted while MSR[PR] is set to 0
The TLB Entry: Storage Attributes Fields
Field Description
W  Write-Through Required
   0: Write-through not required
   1: Write-through required
I  Cache Inhibited
   0: Not cache-inhibited
   1: Cache-inhibited
M  Memory Coherence Required
   0: Memory coherence not required
   1: Memory coherence required
The TLB Entry: Storage Attributes Fields
Field Description
G  Guarded
   0: Not guarded
   1: Guarded
E  Endianness
   0: Big-endian
   1: Little-endian
U0–U3  Implementation-dependent enhancements
The TLB Entry: Storage Attributes Fields
Write Through     A store writes its value through every level of the memory hierarchy that holds a copy of the block.
Cache Inhibited   Memory operations are never cached; a load may cause other main-storage accesses (if the page is not also guarded). Mutually exclusive with Write Through.
Memory Coherency  Imposes an order on coherency blocks, where a coherency block is a group of stores to memory. If enabled, the hardware takes care of the ordering.
Guarded           Protects pages, both data and code, from speculative execution. Used to protect storage that is not well-behaved.
Bibliography
• Mosberger, David. "Memory Consistency Models." Department of Computer Science, University of Arizona, November 1993.
• Adve, Sarita V.; Gharachorloo, Kourosh. "Shared Memory Consistency Models: A Tutorial." Rice University and DEC. IEEE Computer, December 1996.
• Lamport, Leslie. "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs." IEEE Transactions on Computers, September 1979, pp. 690–691.
• "AMD64 Architecture Programmer's Manual, Volume 2: System Programming." AMD64 Technology, September 2006.
• Lenoski, Daniel, et al. "The Stanford DASH Multiprocessor." IEEE Computer, 25(3): 63–69, March 1992.