Paul E. McKenney , IBM Linux Technology Center Maged M. Michael, IBM TJ Watson Research

46
CS-510 Paul E. McKenney, IBM Linux Technology Center Maged M. Michael, IBM TJ Watson Research Jonathan Walpole, Portland State University Presented by Vidhya Priyadharshnee Palaniswamy Gnanam Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory

description

Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory. Paul E. McKenney , IBM Linux Technology Center Maged M. Michael, IBM TJ Watson Research Jonathan Walpole, Portland State University Presented by Vidhya Priyadharshnee Palaniswamy Gnanam. - PowerPoint PPT Presentation

Transcript of Paul E. McKenney , IBM Linux Technology Center Maged M. Michael, IBM TJ Watson Research

Page 1: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

CS-510

Paul E. McKenney, IBM Linux Technology CenterMaged M. Michael, IBM TJ Watson ResearchJonathan Walpole, Portland State University

Presented by Vidhya Priyadharshnee Palaniswamy Gnanam

Why The Grass May Not Be Greener On The Other Side:

A Comparison of Locking vs. Transactional Memory

Page 2: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

2CS-510

Outline Concurrency Control Techniques Review Objective Locking Critique TM Critique Where do Locking and TM fit in? Conclusion Recent Work Future Work

Page 3: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

3CS-510

CONCURRENCY CONTROL TECHNIQUES REVIEW

Page 4: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

4CS-510

Multicore Computing With the speed of individual cores no longer

increasing at the rate it used to, we started using increased number of CPU cores to increase the speed of our ever-more complicated applications.

To use these extra cores, programs must be parallelized.

Synchronization of shared data access is critical for correctness of these programs.

Page 5: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

5CS-510

Lock Based Synchronization “Traditional” pessimistic synchronization approach

Simple. Partition the shared data and protect each partition with separate a lock

Locks prevent concurrent access and enable sequential reasoning about critical section code.

Reader Writer Locking: Allows multiple readers to gain access concurrently. Improves scalability if used correctly.

Page 6: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

6CS-510

Lock Based Synchronization: Downsides

Lock based Synchronization open a whole new can of worms, though.

High Contention on non-partitionable data structures. – Coarse Grained locking limits concurrency. Lock Contention.

Poorly Scales.

Fine Grained locking is hard. Lock acquisition overhead affects performance.

Introduces dependencies among threads.– Propagation of thread failure– Affects fault tolerance of the system

Page 7: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

7CS-510

Non Blocking Synchronization Lock-free, “optimistic” synchronization.

Execute the critical section unconstrained, and check at the end to see if you were the only one

If so, continue. If not roll back and retry

Optimistic synchronization keep threads independent giving different levels of fault tolerant properties like Block Freedom, Wait Freedom and Obstruction Freedom based on implementation.

Page 8: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

8CS-510

Non Blocking Synchronization: Downsides

Difficult programming logic

Heavy use of atomic operations like CAS to do combination of verification and finalization (if passes).

Impact of contention can be quite severe. Increased number of retries causes heavy bus contention, cache contention and thus slows down progressive threads.

May not perform as well as a lock-based approach in non preemptible kernel.

Page 9: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

9CS-510

Objective Each technique has both green and dry areas. The goal of paper is to

– Spot green and dry areas of Lock Based Synchronization and Transactional memory (NBS)

– Constructively criticize to them to understand where each technique fit

Page 10: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

10CS-510

LOCKING CRITIQUE

Page 11: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

11CS-510

Locking Strengths Simple and elegant idea. Allow only one CPU to access a given

data at a time.

Provides Disjoint access parallelism but with more effort.

Does not require any specialized HW support. Can be used on existing commodity hardware.

Supported in multiple platforms as it is largely used and well-defined standardized locking APIs like POSIX pthread API exists. – Much of the legacy code use locking. – More experienced programmers

Contention effects are concentrated within locking primitives, allowing critical sections to run at full speed.

Page 12: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

12CS-510

Locking Strengths Degradation on performance can be minimized by reducing

the power consumption during waiting on lock.

Good for protecting non-idempotent operations such as I/O, thread creation, memory remapping and system rebooting.

Interacts naturally with other synchronization mechanisms, including reference counting, atomic operations, non-blocking synchronization, RCU

Interacts in a natural manner with debuggers

Page 13: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

13CS-510

Locking: Problems & ImprovementsProblem: Lock Contention Some data structures such as unstructured graphs and trees

are difficult to partition. May have to settle for coarse grained locking which leading to

high contention and reduced scalability

Solution Redesign algorithms to use partition-able data structures

– Replace trees and graphs with hash tables and radix trees.

Problem remains with non-partitionable data structures!

Page 14: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

14CS-510

Locking: Problems & ImprovementsProblem: Lock Overhead Lock granularity determines scalability.

– Can we partition the shared data as much as possible and protect each partition with separate lock?

Locking uses expensive instructions and creates high synchronization overhead.

Locking introduces communication related cache misses into read mostly workloads which would otherwise run entirely within the cpu cache.

Solution While lock overhead cannot be completely overcome, it can be

avoided. In read mostly situations, locked updates may be paired with read-

copy-update (RCU) or hazard pointers thus reducing lock overhead in common cases, increasing read side performance and scalability. Problem Remains in Update heavy workloads!

Page 15: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

15CS-510

Locking: Problems & Improvements

Performance Vs Scalability

Need right granularity of locks!

Page 16: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

16CS-510

Locking: Problems & ImprovementsProblem: Deadlock Multiple threads acquire the same set of locks in different

order. Self-deadlock: if interrupt occurs while a lock is held by a

thread and the interrupt handler also needs that lockSolution Require a clear locking hierarchy; multiple locks are acquired

in a pre-specified order If lock not free, thread surrenders conflicting locks and retries Detect deadlock; break cycle by terminating selected threads

based upon priority/ work done. Track lock acquisition, dynamically detect potential deadlock

and prevent before it occurs To avoid self deadlocks disable interrupts on entering CS/

avoid lock acquisition in handlers

Page 17: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

17CS-510

Locking: Problems & ImprovementsProblem: Priority Inversion Priority inversion can cause a high-priority thread to miss its

real-time scheduling deadline, which is unacceptable in safety-critical systems

Solution Low priority thread holding the lock temporarily inherits

priority of high priority blocked thread so that no medium priority thread can preempt it

Lock holder is assigned priority of the highest priority task that might acquire that lock

Preemption is disabled entirely while locks are held

Page 18: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

18CS-510

Locking: Problems & ImprovementsProblem: Convoying Preemption or blocking (due to I/O, page fault etc.) of the lock

holder can block other threads. Unrealistically increased critical section length. Non-deterministic lock acquisition latency May lead to starvation of large critical sections. Problem for real-time workloads.

Solution Use scheduler-conscious synchronization to avoid scheduler to

preempt the thread holding a lock. Use RCU for read side critical sections to avoid Non-

deterministic lock acquisition latency in read side. To avoid starvation use FCFS lock acquisition primitives with

limit on number of threads- e.g. Semaphores

Page 19: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

19CS-510

Locking: Problems & ImprovementsProblem: Lack of composability and Modularity Enabling atomic operations to be composed into larger

atomic operations is difficult. Leads to self deadlock if the inner critical section tries to

acquire same lock out critical section is holding

Solution Need to know what locks other modules use before

calling/composing them.

Abstraction is lost!

Page 20: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

20CS-510

Locking: Problems & ImprovementsProblems: Indefinite blocking Due to termination of the lock holder. Creates problems for fault tolerant software.

Solution Abort and restart entire application- Simple, reliable Identify the terminated lock holder and clean up its state-

extremely complex

Fault tolerance of the software is still affected!

Page 21: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

21CS-510

TRANSACTIONAL MEMORY CRITIQUE

Page 22: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

22CS-510

Composability In locking, operations may be thread safe individually, but not

composed together. Consider, pop from one stack and push into another.

T2

struct foo *push (struct foo_stack *dst){ struct foo *q; lock (dst);

get(q); q->next = dst; dst = q; unlock (dst);}

T1

struct foo *pop (struct foo_stack *src){ struct foo *q; lock (src);

q = src;src = q->next;

unlock (src);}

Intermediate state (item is in neither stacks) is visible!

Page 23: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

23CS-510

TM Approach

struct foo *pop_push(struct foo_stack *src, struct foo_stack *dst){ struct foo *q; begin_txn; q = src;

src = q->next;q->next = dst;

dst = q; end_txn;}

Let the TM system take care of the rest!

Page 24: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

24CS-510

Transactional Memory

Solution to the problem of consistency in the face of concurrency adopted from the database world - Transactions.

Simple, Composable, Scalable

Atomic Blocks == TransactionsAtomicity: All-or-nothing execution of a tx. Isolation: Partial results are invisible to other txs/

threads

Page 25: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

25CS-510

Transactional Memory TM is a non-blocking synchronization mechanism: at least one

thread will succeed

Can be constructed to be either as – Optimistic

– Speculate concurrency without waiting for permission (acquire no locks on reads/writes)

– Performs well when critical regions do not interfere with each other more often.

– Pessimistic – "Always ask for permission"- Acquire locks on read/ writes

(blocking) used in databases. – Good when conflicts are more

Page 26: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

26CS-510

HW Transactional Memory New instructions (LT, LTX, ST, Abort, Commit, Validate) Fully-associative transactional cache for buffering updates Piggy Backing on multi-processor cache coherence protocol to

detect transaction conflicts

Page 27: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

27CS-510

SW Transactional Memory Obstruction free

– Introduce level of indirection

– Log the modifications to memory locations in descriptors.

– Based on tx outcome, commit by writing the new values to memory locations atomically or abort by reverting to old values.

Non Obstruction free

– Revocable Two Phase Locking for Writes: A transaction locks all objects that it writes and does not release these locks until the transaction terminates. If deadlock occurs then one transaction aborts, releasing its locks and reverting its writes.

– Optimistic Concurrency Control for Reads: Whenever a transaction reads from an object, it logs the version it read. When the transaction commits, it verifies that these are still the current versions of the objects.

Page 28: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

28CS-510

TM Strengths Non-blocking: system as a whole makes progress

Familiar to large users in the context of database systems and trivial hardware implementation LL/SC

Scalable Allows multiple, non-interfering threads to concurrently

execute in a critical section.

Automatic Disjoint access parallelism Achieved automatically without having to design complex fine

grain locking solution.

Modular & Composable Transactions may be nested or composed

Page 29: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

29CS-510

TM StrengthsDeadlock Free Avoids common pitfalls of lock composition such as deadlock.

Fault tolerance Failure of one transaction will not affect others

Non Partitionable datastructures Can be used with difficult to partition data structures such as

unstructured graphs

Page 30: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

30CS-510

TM Problems & ImprovementsProblem: Portability in Hardware TM

Portability: need special hardware

Size of transaction limited by transaction cache.

Overflow of transaction cache addressed by virtualization in newer implementations.

Solution

Use HTM in case of small txs, but fall back to STM otherwise with language support.

Transparency to application requires semantics of HTM and STM to be identical.

Page 31: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

31CS-510

TM Problems & ImprovementsProblem: Performance in Software TM

Poor performance compared to locking even at low levels of contention

Atomic operations for acquiring shared object handles Cost of consistency validation Effect on cache of shared object metadata Dynamic allocation, data copying and memory reclamation

Solution:

STM performance can be improved by eliminating overheads of indirection, dynamic allocation, data copying, and memory reclamation by relaxing the non-blocking propertyReintroduce many of the problems of locking!

Page 32: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

32CS-510

Problem: Non Idempotent Operations: I/O Cannot perform any operation that cannot be undone like I/O,

memory remapping, thread creation and destruction It cannot be performed multiple times on tx retry as it will lead to

multiple send requests Common Solution Postpone I/O until outcome of tx is known to avoid I/O retries.

Problematic scenario I/O waits until commit And commit waits for I/O completion.

Self deadlock!

TM Problems & Improvements

Page 33: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

33CS-510

TM Problems & ImprovementsSolutions: Non Idempotent Operations: I/O

Buffered I/O might be addressed by including the buffering mechanism within the scope of the transactions doing I/O This cannot handle the scenario shown.

Can expand both sender and receiver in one tx. But Tx limited to single system currently.

Txs performing non idempotent operations can be executed in “inevitable” mode, where it is guaranteed to commit avoiding the irreversibility problem of I/O etc. But it does not scale, as at most only one transaction can be inevitable.

Page 34: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

34CS-510

TM Problems & ImprovementsProblem: Contention Management

When transactions collide, only one can proceed, others must be rolled back.– Starvation of large transactions by smaller ones– delay of a high-priority thread via rollback of its transactions due to

conflicts with those of a lower-priority thread

Solution

Communication b/w scheduler and tx contention manager. Carefully select the transactions to roll back based on priority,

amount of work done etc. Convert read only transactions to non-transactional form, in a

manner similar to the pairing of locking with RCU.– Writer should have necessary primitives to support non transactional

readers.– Eg, A Relativistic Enhancement to Software Transactional Memory," by

Philip Howard and Jonathan Walpole

Page 35: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

35CS-510

TM Problems- PrivatizationPrivatization Optimization technique that allows access to some data non -

transactionally.

Need To improve performance by temporarily exempting objects

from the overhead of transactional access. Trade Strong Isolation for performance

Problems Can break isolation guarantees causing inconsistent

concurrent access. Performance vs Correctness

Page 36: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

36CS-510

TM Problems- Privatization

T1 T2T1 intends to insert

A1 T2 intends to

Privatize listT1 read A

T1 locks A T2 locks Head

T2 Commit by local = head,

head=null. Unlock

T2 privatized,

perform operations

T1 Commit by A1->next=B, A->next =

A. Unlock

TIM

E

Certain STM optimizations can result in allowing concurrent access to privatized data!

Page 37: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

37CS-510

TM Problems & ImprovementsProblem: Ratio of data and control operation overheads• DBMS: Data operation usually includes reads/writes to mass storage

device. Tx overhead becomes negligible comparatively.• TM: Data operations almost always includes only reads/writes to

memory. Tx overhead seems large.

Solution• Use TM for heavy weight operations like grouping system calls.

Problem: Debugability Difficult debugability of Transactions- break points causes

unconditional aborting

Solution Debugging issue can be addressed by using STM- High degree of

compatibility between STM and HTM needed.

Page 38: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

38CS-510

TM Problems & ImprovementsOthers Problems:

Interaction with other systems is important. In practice it is complicated and expensive.

Conflict Prone Variables- inevitable data structures appearing in every CS causes excessive conflicts.

Performance overhead due to Conflict Resolution and excessive restarts in the face of High conflict rates.

Page 39: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

39CS-510

Where do Locking and TM fit in?Scenario Best Technique Why?Partitionable data structures Locking Disjoint Access Parallelism

Large Non Partitionable data structures TM Automatic Disjoint Access Parallelism

Read Mostly Situations Locking/TM with Hazard Pointers/ RCU Readers Scalable

Update Heavy Situations TM Writers ScalableComplex fine grain locking design, No clear lock hierarchy exists TM Deadlock Avoidance

Atomic operations spanning multiple independent data structures, eg pop from one stack and push to another

TM Composability

Single threaded software having embarrassingly parallel core containing only idempotent operations

TM Performance benefits without much programming effort

Non Idempotent Operations Locking Supportability of non idempotent operations. Large Critical Sections Locking Lock acquisition cost small compared to retry

Commodity Hardware LockingCommodity HW suffices. HTM requires specialized H/W and depends on cache geometry details. Else performance limited by STM

Page 40: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

40CS-510

Conclusion: Use the Right Tool For The Job!

There is no silver bullet: successful adoption of multithreaded/multi-core CPUs will require combination of techniques

Analogy with engineering: How many types of fasteners are there? How many subtypes? Nail, screw, clip, bolt, glue, joint, magnet...

Neither locking nor TM solve the fundamental performance and scalability problems

Combine strengths of various synchronization mechanisms according to the need

Integrate with other techniques: “use the right tool for the job”

TM's applicability may increase if STM performance improves Formalize and generalize existing techniques such as RCU

Page 41: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

41CS-510

Recent Work cx_spinlocks

– new hybrid TM and locking primitive

TxLinux: Using and Managing Hardware Transactional Memory in an Operating System by Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Aditya Bhandari, and Emmett Witchel

“Inevitable Transactions” special transactions containing non-idempotent operations (I/O).

Such transactions unconditionally abort any conflicting transactions, thus non-idempotence is OK.

Allowing more than one concurrent inevitable transaction is necessary to achieve reasonable I/O performance, but feasibility is an open question

Page 42: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

42CS-510

Recent Work

Glue together Relativistic programming and Transactional Memory to gain scalability of readers and writers

A Relativistic Enhancement to Software Transactional Memory by Philip Howard, Jonathan Walpole

Page 43: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

43CS-510

Future Work Expand the comparison to include other synchronization

mechanisms (message passing, deferred reclamation, RCU)

Investigate combining different mechanisms:– TM and locking (much work in this area)– RCU and locking (typical use of RCU)– TM and RCU (very little work done here)

There might still be hope for a “silver bullet”– But until then, it would be quite foolish to ignore combinations of

existing mechanisms

Page 44: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

44CS-510

Page 45: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

45CS-510

References Lecture Slides from Winter 2008 by the authors Parallel Programming with Transactional Memory by Ulrich

Drepper, Red Hat Software Transactional Memory why is it only a research toy?

by Calin Cascaval, Colin Blundell, Maged Michael, Harold W.Cain, Peng Wu, Stefane Chiras and Siddhartha Chatterjee

Privatization Techniques for Software Transactional Memory by Michael F. Spear, Virendra J. Marathe, Luke Dalessandro, and Michael L. Scott

Inevitability Mechanisms for Software Transactional Memory by Michael F. Spear, Maged M. Michael, Michael L. Scott

http://en.wikipedia.org/wiki/Software_transactional_memory

Page 46: Paul E.  McKenney , IBM Linux Technology Center Maged  M. Michael, IBM TJ Watson Research

46CS-510

THANK YOU!