Programming Multi-Core Systems
周枫 (Feng Zhou), 网易公司 (NetEase)
2007-12-28, 清华大学 (Tsinghua University)
http://zhoufeng.net
Last Week
• A trend in software: “The Movement to Safe Languages”
• Improving dependability of system software without using safe languages
2
A Trend in Hardware
“The Movement to Multi-core Processors”
• Originates from inability to increase processor clock rate
• Profound impact on architecture, OS and applications
3
Topic today: How to program multi-core systems effectively?
Why Multi-Core?
• Areas of improving CPU performance in last 30 years
1. Clock speed
2. Execution optimization (Cycles-Per-Instruction)
3. Cache
• All 3 are concurrency-agnostic
• Now: 1 disappears, 2 slows down, 3 still good
– 1: No 4 GHz Intel CPUs
– 2: Improves with new micro-architecture (e.g. Core vs. NetBurst)
4
CPU Speed History
From “Computer Architecture: A Quantitative Approach”, 4th edition, 2007
5
Reality Today
• Near-term performance drivers
1. Multicore
2. Hardware threads (a.k.a. Simultaneous Multi-Threading)
3. Cache
• 1 and 2 useful only when software is concurrent
6
“Concurrency is the next revolution in how we write software” --- “The Free Lunch is Over”
Costs/Problems of Concurrency
• Overhead of locks, message passing…
• Not all programs are parallelizable
• Programming concurrently is HARD
– Complex concepts: mutex, read-write lock, queue…
– Correct synchronization: race, deadlocks…
– Getting speed-up
7
Our focus today
Potential Multi-Core Apps
8
Application category: Examples
– Server apps w/o shared state: Apache web server
– Server apps with shared state: MMORPG game server
– Stream-Sort data processing: MapReduce, Yahoo Pig
– Scientific computing (many different models): BLAS, Monte Carlo, N-Body, …
– Machine learning: HMM, EM algorithm
– Graphics and games: NVIDIA Cg, GPU computing
– …
Roadmap
• Overview of multi-core computing
• Overview of transactional programming
• Case: Transactions for safe/managed languages
• Case: Transactions for languages like C
9
Status Quo in Synchronization
• Current mechanism: manual locking
• Organization: a lock for each shared structure
• Usage: acquire (possibly blocking) → access → release
• Correctness issues
– Under-locking → data races
– Acquires in different orders → deadlock
• Performance issues
– Difficult to find the right granularity
– Overhead of acquiring vs. concurrency allowed
10
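The lock-per-structure organization and its ordering hazard can be sketched in a few lines (Python used purely for illustration; the account/transfer names are invented):

```python
import threading

# Two shared structures, each guarded by its own lock: the
# "lock for each shared structure" organization described above.
account_a = {"balance": 100}
account_b = {"balance": 100}
lock_a = threading.Lock()
lock_b = threading.Lock()

def transfer(src, src_lock, dst, dst_lock, amount):
    # Taking the two locks in argument order is the classic
    # "acquires in different orders -> deadlock" hazard: concurrent
    # transfer(a, .., b, ..) and transfer(b, .., a, ..) could each hold
    # one lock and wait forever for the other. The usual manual fix is
    # a global acquisition order, here approximated by sorting on id().
    first, second = sorted([src_lock, dst_lock], key=id)
    with first:
        with second:
            src["balance"] -= amount
            dst["balance"] += amount

transfer(account_a, lock_a, account_b, lock_b, 30)
print(account_a["balance"], account_b["balance"])  # 70 130
```

Note that this discipline is exactly what the programmer must maintain by hand today, and what the rest of the talk tries to automate.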
Transactions / Atomic Sections
• Databases have provided automatic concurrency control for 30 years: ACID transactions
• Vision:
– Atomicity
– Isolation
– Serialization only on conflicts
– (optional) Rollback/abort support
11
Question: Is it possible to provide database transaction semantics for general programming?
Transactions vs. Manual Locks
• Manual locking issues:
– Under-locking
– Acquires in different orders
– Blocking
– Conservative serialization
12
• How transactions help:
– No explicit locks
– No ordering
– Can cancel transactions
– Serialization only on conflicts
Transactions: Potentially simpler and more efficient
Design Space
• Hardware Transactional Memory vs. software TM
• Granularity: object, word, block
• Update method
– Deferred: discard private copy on abort
– Direct: control access to data; undo updates on abort
• Concurrency control
– Pessimistic: prevent conflicts by locking
– Optimistic: assume no conflicts and retry if one occurs
• …
13
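The optimistic style can be sketched as a version-check-and-retry loop (an illustrative Python toy, not any particular TM system; `VersionedCell` and `atomic_increment` are invented names):

```python
import threading

class VersionedCell:
    """One shared cell with a version number: read the version, compute,
    then commit only if the version is unchanged (otherwise retry)."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()  # held only briefly at commit

    def read(self):
        with self._lock:
            return self.value, self.version

    def try_commit(self, expected_version, new_value):
        with self._lock:
            if self.version != expected_version:
                return False           # conflict: someone committed first
            self.value = new_value
            self.version += 1
            return True

def atomic_increment(cell):
    while True:                        # optimistic: assume no conflict,
        value, version = cell.read()   # retry if one occurs
        if cell.try_commit(version, value + 1):
            return

cell = VersionedCell(0)
threads = [threading.Thread(target=lambda: [atomic_increment(cell) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.value)  # 4000
```

A pessimistic design would instead take the lock for the whole read-compute-write sequence, preventing the conflict rather than detecting it.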
Hard Issues
• Communications, or side-effects
– File I/O
– Database accesses
• Interaction with other abstractions
– Garbage collection
– Virtual memory
• Work with existing languages, synchronization primitives, …
14
Why software TM?
• More flexible
• Easier to modify and evolve
• Integrate better with existing systems and languages
• Not limited to fixed-size hardware structures, e.g. caches
15
Proposed Language Support
16

// Insert into a doubly-linked list atomically
atomic {
    newNode->prev = node;
    newNode->next = node->next;
    node->next->prev = newNode;
    node->next = newNode;
}

// Guard condition: block until queueSize > 0 holds
atomic (queueSize > 0) {
    // remove item from queue and use it
}
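The guard-condition form blocks until the condition holds and then runs the body atomically; a rough analogue with today's primitives is a lock plus condition variable (an illustrative Python sketch; `GuardedQueue` is an invented name):

```python
import threading

class GuardedQueue:
    """Sketch of how 'atomic (queueSize > 0) { ... }' maps onto a lock
    plus condition variable: take() runs its body atomically and blocks
    until the guard holds."""
    def __init__(self):
        self._items = []
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()        # the guard may have become true

    def take(self):
        with self._cond:               # the "atomic" region
            while not self._items:     # guard: queueSize > 0
                self._cond.wait()
            return self._items.pop(0)

q = GuardedQueue()
result = []
consumer = threading.Thread(target=lambda: result.append(q.take()))
consumer.start()           # blocks: the guard is false
q.put("job")               # guard becomes true, consumer proceeds
consumer.join()
print(result)  # ['job']
```

The proposed language support hides exactly this lock/wait/notify plumbing behind the guard expression.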
Transactions for Managed/Safe Languages
17
Intro
• Optimizing Memory Transactions, Harris et al., PLDI ’06
• STM system from MSR Cambridge, for MSIL (.Net)
• Design features
– Object-level granularity
– Direct update
– Version numbers to track reads
– Two-phase locking to track writes
– Compiler optimizations
18
First-gen STM
• Very expensive, e.g. Harris & Fraser (OOPSLA ’03):
– Every load and store instruction logs information into a thread-local log
– A store instruction writes the log only
– A load instruction consults the log first
– At the end of the block: validate the log and atomically commit it to shared memory
19
Direct Update STM
• Augment objects with (i) a lock, (ii) a version number
• Transactional write:
– Lock objects before they are written to (abort if another thread has that lock)
– Log the overwritten data – we need it to restore the heap in case of retry, transaction abort, or a conflict with a concurrent thread
– Make the update in place to the object
20
• Transactional read:
– Log the object’s version number
– Read from the object itself
• Commit:
– Check the version numbers of objects we’ve read
– Increment the version numbers of the objects we’ve written, unlocking them
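The read/write/commit rules above can be condensed into a toy sketch (Python for illustration; single-threaded, ignoring the atomicity of the lock and version operations that a real STM needs):

```python
class Conflict(Exception):
    pass

class TObj:
    """Transactional object as in the slides: a payload plus a version
    number and a lock word (the owning transaction, or None)."""
    def __init__(self, val):
        self.val = val
        self.version = 1
        self.owner = None

class Tx:
    def __init__(self):
        self.read_log = {}   # obj -> version number seen at first read
        self.undo_log = {}   # obj -> overwritten value, for abort

    def read(self, obj):
        self.read_log.setdefault(obj, obj.version)  # log the version
        return obj.val                              # read the object itself

    def write(self, obj, val):
        if obj.owner not in (None, self):
            raise Conflict()           # another transaction holds the lock
        obj.owner = self               # lock before the first write
        self.undo_log.setdefault(obj, obj.val)      # log overwritten data
        obj.val = val                  # direct update, in place

    def commit(self):
        for obj, seen in self.read_log.items():
            if obj.version != seen and obj.owner is not self:
                self.abort()
                return False
        for obj in self.undo_log:      # install new versions, unlock
            obj.version += 1
            obj.owner = None
        return True

    def abort(self):
        for obj, old in self.undo_log.items():
            obj.val = old              # restore the heap
            obj.owner = None

# The slides' scenario (versions start at 1 here rather than 100):
c1, c2 = TObj(10), TObj(40)
t1, t2 = Tx(), Tx()
t1.read(c1)                        # T1 logs c1's version
t2.write(c1, t2.read(c1) + 1)      # T2 reads, locks and updates c1 in place
t1.read(c2)                        # T1 logs c2's version
print(t2.commit(), t1.commit())    # True False: c1's version moved, T1 aborts
```

The same interleaving is walked through step by step on the following slides.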
21
Example: contention between transactions

Thread T1:
    int t = 0;
    atomic {
        t += c1.val;
        t += c2.val;
    }

Thread T2:
    atomic {
        t = c1.val;
        t++;
        c1.val = t;
    }

Initially c1 = {ver = 100, val = 10} and c2 = {ver = 200, val = 40}.

1. T1 reads from c1 and logs that it saw version 100.
2. T2 also reads from c1 and logs that it saw version 100.
3. T1 now reads from c2 and sees it at version 200.
4. Before updating c1, thread T2 must lock it, recording the old version number (100) in its log.
5. Before updating c1.val, T2 must log the data it is going to overwrite (c1.val = 10); after logging the old value, T2 makes its update in place to c1 (c1.val = 11).
6. T2’s transaction commits successfully: it checks that the version it locked matches the version it previously read, then unlocks the object, installing the new version number (101).
7. T1 attempts to commit and checks that the versions it read are still up to date. Object c1 was updated from version 100 to 101, so T1’s transaction is aborted and re-run.
Compiler integration
• We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL)
– OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples)
– OpenForUpdate – before the first time we update an object
– LogOldValue – before the first time we write to a given field

Source code:
    atomic {
        …
        t += n.value;
        n = n.next;
        …
    }

Basic intermediate code:
    OpenForRead(n);
    t = n.value;
    OpenForRead(n);
    n = n.next;

Optimized intermediate code:
    OpenForRead(n);
    t = n.value;
    n = n.next;
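The optimization shown above (dropping the second OpenForRead(n)) can be sketched as a simple forward pass over a hypothetical flat IR (op names invented for illustration; the real compiler works on its own intermediate code with full control-flow analysis):

```python
def remove_redundant_opens(ops):
    """Drop an ('open', x) when x was already opened on this path and x
    has not been reassigned since; a reassignment makes x name a
    (possibly) different object, so the next open must stay."""
    opened = set()
    out = []
    for op, var in ops:
        if op == 'open':
            if var in opened:
                continue               # already open: redundant
            opened.add(var)
        elif op == 'assign':
            opened.discard(var)        # var now names a different object
        out.append((op, var))
    return out

# The slide's basic IR: the second open of n precedes the reassignment,
# so it is redundant and is removed.
basic = [('open', 'n'), ('read', 'n'), ('open', 'n'), ('assign', 'n')]
print(remove_redundant_opens(basic))
# [('open', 'n'), ('read', 'n'), ('assign', 'n')]
```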
Runtime integration – garbage collection

    atomic {
        …
    }
    Log: obj1.field = old; obj2.ver = 100; obj3 locked @ 100

1. GC runs while some threads are in atomic blocks.
2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed.
3. GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back.
4. Discard log entries for unreachable objects: they’re dead whether or not the block succeeds.
Results: Against Previous Work

Normalised execution time (workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535):
– Sequential baseline: 1.00x
– Coarse-grained locking: 1.13x
– Fine-grained locking: 2.57x
– Direct-update STM: 2.04x
– Direct-update STM + compiler integration: 1.46x
– Harris & Fraser WSTM: 5.69x
Scalability (µ-benchmark)

[Chart: microseconds per operation vs. number of threads, for fine-grained locking, direct-update STM + compiler integration, WSTM (atomic blocks), coarse-grained locking, DSTM (API), and OSTM (API). The direct-update STM is scalable to multicore.]
Results: long-running tests

[Bar chart: normalised execution time on five long-running benchmarks (tree, go, skip, merge-sort, xlisp), with bars for Direct-update STM, run-time filtering, and compile-time optimizations, each normalised to the original application without transactions (1.00x).]
Summary
• A pure software implementation can perform well and scale to large transactions
– Direct update
– Pessimistic & optimistic CC
– Compiler support & optimizations
• Still need a better understanding of realistic workload distributions
38
Atomic Sections for C-like Languages
39
40
Intro
• Autolocker: Synchronization Inference for Atomic Sections, Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL 2006
• Question: Can we have language support for atomic sections for languages like C?
– No object meta-data
– No type safety
– No garbage collection
• Answer: Let the programmer help a bit (with annotations)
– And solve a simpler problem: NO aborts!
41
Autolocker: C + Atomic Sections
• Shared data is protected by annotated locks
• Threads access shared data in atomic sections:

    mutex m;
    int shared_var protected_by(m);
    atomic {
        ...
        x = shared_var;
        ...
    }

• Threads never deadlock (due to Autolocker)
• Threads never race for protected data
• Code runs as if a single lock protects all atomic sections
• How can we implement these semantics?
42
Autolocker Transformation
• Autolocker is a source-to-source transformation

Autolocker code:
    mutex m1, m2;
    int x protected_by(m1);
    int y protected_by(m2);

    atomic {
        x = 3;
        y = 2;
    }

Generated C code:
    int m1, m2;
    int x;
    int y;

    begin_atomic();
    acquire(m1);
    x = 3;
    acquire(m2);
    y = 2;
    end_atomic();

• Atomic sections can be nested arbitrarily; the nesting level is tracked at runtime
• Locks are acquired as needed; lock acquisitions are reentrant
• Locks are released when the outermost section ends; strict two-phase locking guarantees atomicity
• Locks are acquired in a global order; acquiring locks in order will never lead to deadlock
Outline
• Introduction
– semantics
– usage model and benefits
• Autolocker algorithm
– match locks to data
– order lock acquisitions
– insert lock acquisitions
• Related work
• Experimental evaluation
• Conclusion
48
Autolocker Usage Model
• Typical process for writing threaded software:
– start here: one coarse-grained lock, lots of contention, little parallelism
– finish here: many fine-grained locks, low contention, high parallelism
• The Linux kernel evolved to SMP this way
• Autolocker helps you program in this style
49
Granularity and Autolocker
• In Autolocker, annotations control performance:
• Simpler than fixing all uses of shared_var
• Changing annotations won’t introduce bugs
– no deadlocks
– no new race conditions
int shared_var protected_by(kernel_lock);
int shared_var protected_by(fs->lock);
50
Outline
• Introduction
– semantics
– usage model and benefits
• Autolocker algorithm
– match locks to data
– order lock acquisitions
– insert lock acquisitions
• Related work
• Experimental evaluation
• Conclusion
51
Algorithm Summary

C program with atomic sections
    → match locks to data, producing lock requirements (e.g. begin; acq L1, L2; end)
    → generate global lock order; if cyclic, fail and report a potential deadlock
    → insert ordered lock acquisitions
    → remove redundant acquisitions (e.g. begin; acq L2; end)
    → C program with acquire statements
Matching Locks to Data

    mutex m1, m2;
    int x protected_by(m1);
    int y protected_by(m2);

    void foo() {
      atomic {
        use m2;  y = 3;
        use m1;  x = 2;
      }
    }

use L ≡ the lock L must already be held when this statement is reached. The analysis walks every function of the C program, consulting the symbol table of protected_by annotations.
Acquisition Placement
• Assume there’s an acyclic order “<” on locks
• We need to turn uses into acquires. Global lock order: m1 < m2

    void foo() {
      atomic {
        use m2;  y = 3;
        use m1;  x = 2;
      }
    }

We must acquire m2, but since m1 is needed later and is ordered first, we acquire m1 first:

    void foo() {
      atomic {
        acquire m1;
        acquire m2;
        y = 3;
        acquire m1;   // redundant; eventually optimized away
        x = 2;
      }
    }

• Preemptive acquires: preemptively take locks that are used later but ordered before
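For straight-line code the placement rule can be sketched as a small helper (an invented illustration; the real algorithm also handles control flow and pointer-typed lock expressions):

```python
def place_acquisitions(uses, order):
    """Turn a sequence of 'use L' statements into ordered acquires: at
    each use, also preemptively take any lock that is used later but
    comes earlier in the global order. 'order' maps lock -> rank."""
    out, held = [], set()
    for i, lock in enumerate(uses):
        if lock in held:
            continue                   # redundant acquisition removed
        # this lock, plus locks needed later that are ordered before it
        needed = {l for l in uses[i:] if order[l] <= order[lock]} - held
        for l in sorted(needed, key=order.get):
            out.append(('acquire', l))
            held.add(l)
    return out

# The slide's example: uses are m2 then m1, with global order m1 < m2.
print(place_acquisitions(['m2', 'm1'], {'m1': 0, 'm2': 1}))
# [('acquire', 'm1'), ('acquire', 'm2')]
```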
What Is a Good Lock Order?
• Some lock orders break the insertion algorithm!

    use m;
    p = p->next;
    use p->m;

Assume p->m < m. The insertion algorithm then produces:

    acquire p->m;
    acquire m;
    p = p->next;
    acquire p->m;

If “p->m < m”, then m cannot be acquired before any lock called p->m. But because of the assignment, the final p->m names a lock that has never been acquired before, and it is acquired after m!
• We say that “p->m < m” is not a feasible order
• So how do we find a feasible lock order?
Finding a Feasible Lock Order
• We want only feasible orders
• We search for any code matching this pattern:

    use e1;
    …
    maykill e2;   // anything that may affect e2’s value
                  // (like “p = p->next” when e2 = “p->m”)
    …
    use e2;

Here e1 cannot be acquired after it is used, and e2 cannot be acquired before the update, since its value is different above that point.
• Any feasible order must ensure e1 < e2
• Scan through all code to find constraints
Finding a Feasible Lock Order

Searching all functions for infeasible patterns yields constraints:
    m1 < p->m’
    p->m < p->m’
    m1 < r->m
    m3 < p->m
    q->m < p->m

A topological sort of the constraints produces the global lock order:
    m1, r->m, q->m, p->m, p->m’
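The "cyclic?" check and the global-order generation amount to a topological sort of the constraint graph (illustrative Python; returning None is where Autolocker would fail and report a potential deadlock):

```python
from collections import defaultdict, deque

def order_locks(locks, constraints):
    """Topologically sort lock names under 'a < b' (a before b)
    constraints; return None if the constraint graph is cyclic."""
    succs = defaultdict(set)
    indeg = {l: 0 for l in locks}
    for a, b in constraints:           # a must be acquired before b
        if b not in succs[a]:
            succs[a].add(b)
            indeg[b] += 1
    ready = deque(l for l in locks if indeg[l] == 0)
    out = []
    while ready:
        l = ready.popleft()
        out.append(l)
        for s in succs[l]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    # If some locks never became ready, the constraints are cyclic.
    return out if len(out) == len(locks) else None

# The constraints collected above:
locks = ["m1", "m3", "r->m", "q->m", "p->m", "p->m'"]
cons = [("m1", "p->m'"), ("p->m", "p->m'"), ("m1", "r->m"),
        ("m3", "p->m"), ("q->m", "p->m")]
print(order_locks(locks, cons))
```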
Outline
• Introduction
– semantics
– usage model and benefits
• Autolocker algorithm
– computing protections
– lock ordering
– acquisition placement
• Related work
• Experimental evaluation
• Conclusion
66
Comparison
• Transactional memory with optimistic CC
– Threads work locally and commit when done
– If two threads conflict, one rolls back & restarts
– Benefit: no complex static analysis
• Drawbacks:
– software versions: lots of copying, can be slow
– hardware versions: need new hardware
– both: some operations cannot be rolled back (e.g., fire missile)
• How does this compare with Autolocker?
67
Experimental Questions
• Question 1:
– What is the performance cost of using Autolocker?
– How does it compare to other concurrency models?
68
Concurrent Hash Table
• Simple microbenchmark
• Goal: Same performance as hand-optimized
– low overhead
– high scalability
• Compared Autolocker to:
– manual locking
– Fraser’s object-based transactional memory mgr.
– Ennals’ revised transactional memory mgr.
69
Hash Table Benchmark
Machine: four processors, 1.9 GHz, HyperThreading, 1 GB RAM.
Each datapoint is the average of 4 runs after 1 warmup run.
70
Experimental Questions
• Question 1:
– What is the performance cost of using Autolocker?
– How does it compare to other concurrency models?
• Question 2:
– Autolocker may reject code for “potential deadlocks”
– Does it accept enough programs to be useful?
– Will it only accept easy, coarse-grained policies?
71
AOLServer
• Threaded web server using manual locking
• Goal: Preserve the original locking policy with AL
Size: 52,589 lines; 82 modules
Changes: 143 atomic sections added; 126 types annotated with protections
Problems: 175 “trusted” casts (23 worrisome)
78/82 modules used original locking policies
Performance: negligible impact (~3%)
72
Conclusion
• Contributions:
– a new model for programming parallel systems
– a transformation tool to implement it
• Benefits:
– performance close to well-written manual locking
– freedom from deadlocks
– freedom from races on protected data
– very good performance when “ops/sync” is high
– increase performance without increasing bugs!
References
• Optimizing Memory Transactions, Tim Harris, Mark Plesko, Avraham Shinnar, David Tarditi, PLDI ’06
• Autolocker: Synchronization Inference for Atomic Sections, Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL ’06
• The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Herb Sutter, Dr. Dobb’s Journal, 2005
• Unlocking Concurrency, Ali-Reza Adl-Tabatabai et al., ACM Queue, 2006
• Transactional Memory, James Larus, Ravi Rajwar, Morgan Kaufmann, 2006 (book)
Contact me: [email protected]
http://zhoufeng.net
73
Backup slides
74
A Lot of Questions to Answer
From “The Landscape of Parallel Computing Research: A View from Berkeley”