1 Alexander A. Shvartsman University of Connecticut, USA Specifying, Reasoning About, Optimizing,...
-
Upload
destini-bach -
Category
Documents
-
view
216 -
download
2
Transcript of 1 Alexander A. Shvartsman University of Connecticut, USA Specifying, Reasoning About, Optimizing,...
1
Alexander A. Shvartsman
University of Connecticut, USA
Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Dynamic Distributed Systems
FRIDA 2014, Vienna, Austria
2
“Three R’s” (Lesen, Schreiben, Rechnen)
Reading, ’Riting, and ’Rithmetic Underlay much of human intellectual activity Venerable foundation of computing technology
With networking, communication became a major activity Email – electronic counterpart of postal service
Yet, it is natural to deal with reading, writing, and computing In the Web, a browser app may load (i.e., read) a page,
perform computation, and save (i.e., write) the results In distributed databases we retrieve and store data, and
rarely talk about sending and receiving data Arguably, it is also easier to develop distributed algorithms
with readable/writeable objects, than to use message passing
3
Sharing Memory in a Networked System
Let’s place a shareable object at a node in a network Not fault-tolerant – single point of failure Not efficient – performance bottleneck Not very available, does not provide longevity, etc…
4
Sharing Memory in a Networked System
So we replicate – we’d have to anyway, since redundancy is the only means for providing fault-tolerance
5
Sharing Memory in a Networked System
With replication come challenges: How to preserve consistency while managing replicas? What kind of consistency? How to provide it? How to use it? a
ab
6
Sharing Memory in Networked Systems
Let’s place a shareable object at a node in a network Not fault-tolerant – single point of failure Not efficient – performance bottleneck Not very available, does not provide longevity, etc…
So we replicate – we’d have to anyway, since redundancy is the only means for providing fault-tolerance
With replication come challenges: How to preserve consistency while managing replicas? What kind of consistency? How to provide it? How to use it?
7
Consistency
Easiest for users: a single copy view Sequence of operations; a read sees all previous writes Captured by atomicity [Lamport] / linearizability [Herlihy Wing]
Not cheap to implement even without general updates Cheapest to implement: a read sees a subset of prior writes
Not the most natural semantic for the users Additional complications in dynamic systems
Ever-changing sets of replicas and participants Crashes never stop, timing variations persist Evolving topology, ultimately mobility
8
Atomicity / Linearizability [Lamport / Herlihy Wing ]
“Shrink” the interval of each operation to a serialization point so that the behavior of the object is consistent with its sequential type
read(0)
write(8)
read(8)
Time
read(8)
read(0)
write(8)
read(8)
Time
read(0)
write(8)
read(8)
Time
9
Valid (atomic) executions:
Invalid executions:
More Atomicity Examples
read( )
write(8)
ret(0)
ack( )
read( )
write(8)
ret(0) read( ) ret(8)
TimeTime
read( )
write(8)
ret(0)
ack( )
read( )
write(8)
ret(8) read( ) ret(0)
TimeTime
read( ) ret(0)
10
Consistency Polemics
Parallel / distributed systems Performance Speed-up
Distributed theory focus Fault-tolerance Consistency
User:
Yes, mine is wrong…
But it is fast!
Yes, mine is slow…
But it is correct!
I wish they reconciled…
11
Using Majorities/Quorums for Consistency
Consistency of replicated data: using intersecting sets Starting with Gifford (79) and Thomas (79) Upfal and Wigderson (85)
Majority sets of readers and writers to emulate shared memory in a synchronous distributed setting
Vitanyi and Awerbuch (86)MW/MR registers using matrices of SW/SR registers
where rows and columns are read and written Attiya, Bar-Noy, and Dolev (91/95)
Atomic SW/MR objects in message passing systems, majorities of processors, minority may crash
Two-phase protocol (ABD)
12
Quorum Systems and Examples
A C
B
Majorities [Thomas79,Gifford79]
Matrix Quorums:Processor ids arranged in a matrix.
A quorum: Row U Column
Quorum system Q over P, a set of processor ids:Q = {Q1, Q2, … }• Qi P • Qi Qj ≠ Ø for all i, j
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Lemma: The join of quorum system Qa over Pa andsystem Qb over Pb ,Qa ⋈ Qb , is a quorum system over Pa U Pb .
13
Main Idea: (Logical) Timestamps and Quorums
An object is represented as a pair (value, timestamp) A write records (new-value, new-timestamp) in a quorum A read obtains (value, timestamp) pairs from a quorum,
then returns the value with the largest timestamp
read(vmax)
(vmax,tmax
)write(v1)
(v1,t1)
write read
Time
14
“Readers Must Write”
If operations are concurrent and a reader simply returns the latest value, then atomicity can be violated:
Solution: If readers first help the writer to record the value in a quorum, then it is safe to return the latest value
Write(v1) . . . (happened to be slow)
v0
v1
Read returns v1 Read returns v0
v0
v0
v0
v0
15
Lastly: “Writers Must Read”
I assume you’re being facetious, Josef,
I distinctly yelled ‘second’ before you did!
• Writers can “read” before writing (and riding) to obtain the latest timestamp in order to compute a new timestamp
16
The ABD Algorithm [Attiya Bar-Noy Dolev]
17
The Complete Algorithm [Attiya Bar-Noy Dolev]
• Replica hosts respond to Get and Put requests
• Read and write uses identical two-phase communication patterns:• Get phase queries and obtains values from a quorum, • Put phase propagates values to a quorum.• The only difference is in what is sent out in the Put phase.
18
New Frontiers: Dynamic Atomic Memory
Goal: Develop and analyze algorithms to solve problems of coherent data sharing in dynamic distributed environments
“Dynamic” encompasses Changing sets of participants:
nodes come and go as they please Wide range of failures Asynchrony, timing variations Crashes, message loss, weak delivery guarantees Changes in network topology Processor mobility
19
Approach: Dynamic Quorums [Lynch Shvartsman]
Objects are replicated at several network locations To accommodate small, transient changes:
Use quorum configurations: members, quorums Maintains atomicity during “normal operation” Allows concurrency
To handle larger, more permanent changes: Reconfigure Maintains atomicity across configuration changes Any configuration can be installed at any time Reconfigure concurrently with reads/writes --
no heavy-handed view change
20
Algorithmic Ideas
Reconfigurable quorum systems Quorums maintain consistency during modest and
transient changes Reconfigurations accommodate more drastic and
permanent changes Read/write operations are frequent
Use quorum access and allow concurrency Isolate from reconfiguration
Reconfigurations are infrequent Use consensus to impose total order (Paxos) Optimistic dissemination without formal installation Conservative garbage collection of obsolete config-s
21
Related Other Approaches
Using consensus to agree on each operation [Lamport]
Performance overhead for each reads and write op Termination of operations depends on consensus
Use group communication service [Birman 87] with TO bcast [Amir, Dolev, Melliar-Smith, Moser 94], [Keidar, Dolev 96]
View change delays reads/writes One change may trigger view formation
DynaStore: Reconfiguration without consensus[Aguilera, Keidar, Malkhi, Shraer 11]
Initial quorum system, incremental adds/removes Changes yield DAGs of possibilities Reads/writes use ABD-like phases, traverse DAGs Assuming finite reconfigurations guarantees termination
22
GCS: Group Communication Services (a Detour)
Pioneering work of Ken Birman & Co. made GCSs one of the most important distributed system building blocks GCS: group membership + multicast communication Isis, Transis, Consul, Ensemble, Totem, Horus, Newtop, Relacs …
Commercial variants of Birman’s Isis used in French air traffic control system Stock exchange on Wall Street Swiss Electronic Bourse
GCSs for years defied sufficiently formal specification that would both Detail the utility of the services and Enable compelling reasoning about correctness
23
View-Synchrony [Fekete Lynch Shvartsman 01]
A new, simple formal specification for a partitionable view-oriented group communication service.
To demonstrate the value of our specification, we used it to construct an algorithm for an ordered-broadcast application that reconciles information derived from different views. Atomic memory is now easily obtained
Our algorithm is based on algorithms of Amir, Dolev, Keidar, Melliar-Smith and Moser.
Our specification has a simple implementation, based on the membership algorithm of Cristian and Schmuck.
24
View-Synchrony [Fekete Lynch Shvartsman 01]
We proved the correctness and analyzed the performance and fault-tolerance of the algorithm
Our work on GCS was (we believe) the first successful formal specification + rigorous proofs Safety: Invariants and simulation Conditional performance analysis Myla Archer @ NRL checked several invariants using
PVS – invaluable contribution to perfecting the proofs Despite perhaps only modest algorithmic advances the
paper was/is highly cited -- the value is in the proofs!
25
Our Ground-Up Approach: RAMBO
Reconfigurable Atomic Memory for Basic Objects[Lynch, Shvartsman 02], [Gilbert, Lynch, Shvartsman 10]
For small, transient changes, use quorum configurations Maintain atomicity and allow concurrency, a la ABD
For larger, permanent changes, reconfigure: Dynamically install any quorum system Guarantee atomicity across configurations
Guarantees: Linearizability in all cases Good latency in common cases High availability
Everything proceeds concurrently Use low-level gossip communication
26
Methodology
Specify algorithm Interacting state machines Using non-deterministic “gossip”
Show correctness/safety for arbitrary patterns of asynchrony assuming arbitrary crash-failures and message loss
Analyze performance for a subset of timed executions Bounded message delay, 0-time local processing Some “gossip” becomes deliberate, some periodic Non-failure of certain quorums for certain periods Reason about operation latency Of course none of this impacts safety
27
Reconfigurable Atomic Memory for Basic Objects
Global service specification Implementation:
Main algorithm + “recon” service Recon service:
“Advance reconnaissance” Consistent sequence of configurations Loosely coupled
Main algorithm: Reading, writing Receives, disseminates new configuration
information; no explicit installation Reconfigures: upgrade to new and remove old Reads/writes may use several configurations
Net
RAMBO
Recon
RAMBO
28
Architectural View
Communication Network
Node i
Joineri
R/Wi
Node j
Recon
Joinerj
R/Wj
Application
RAMBO
29
High-Level Functions
Joiner Introduces new participants to the system
Reader-Writer Routine read and write operations Two-phased algorithm using all “known” configurations Using tags
Recon Chooses new next configuration, e.g., using Paxos Informs members of the previous configuration
Configuration upgrade (“packaged” with Reader-Writer) Identify and remove obsolete configurations
(garbage collection)
30
Configurations and Reconfiguration
Configuration: quorum system Collection of subsets of replica host ids
where any two subsets intersect Or collection of read-quorums and write-
quorums, where any read-quorum intersects with any write-quorum
Reconfiguration process involves two decoupled steps Recon: Emitting a new configuration, then, at
some later time Garbage-collecting obsolete configurations while
“upgrading” to the latest (locally) known configuration
No constraints on the membership of distinct quorum systems
Q1 Q2 Q3
…
31
Consensus
Recon
Net
Implementation of Recon
Uses distributed consensus to determine new configurations 1,2,3,… Note: when the universe
of configurations isfinite and known, then consensus is not neededeven with unboundedreconfiguration [GeoQuorums]
Members of old configuration may propose a new config.
Proposals reconciled using consensus Consensus is a fairly heavy mechanism, but it is
Used only for reconfigurations, which are infrequent Does not delay Read/Write operations
32
Configurations and Config Maps
Configuration c members(c) -- set of members of configuration c read-quorums(c), write-quorums(c) -- sets of quorums
Configuration map cm mapping from naturals to configurations
cm(k) is configuration k
Can be defined (c), undefined (), garbage-collected (±)
± ± c c c c ... ...
GC’d Defined Mixed Undefined
c
33
Configuration Upgrade [Gilbert, Lynch, S 10]
Reconfigure to last configuration in a contiguous segment
Phase 1: Informs write-quorum of cj … ck-1 about ck Collects (value,tag) from read-quorums of cj … ck-1
Phase 2: Propagates latest (value, tag) to a write-quorum of ck
Garbage-collect: Set cmap(j…k -1) to ± Constant-time upgrade regardless of the number of
obsolete configurations (conditioned on failures) Maintains good read/write latency during system instability
or frequent reconfigurations
± cj . . . ck . . .. . . . . .
34
Configuration Map Changes (Local View)
c0
c0 c1
c0 c1 c2 ck
± c1 c2 ck
± ± c2 ck
TIME
. . .
. . .
. . .
. . .
. . .
± ± ± c3 ck
. . .
± ± ± ± ± c c c c . . .
. . .
35
On to Reads and Writes: Values and Tags
Each value v has an associated tag t (logical timestamp) Tag is made up of the sequence-processor pair
Reads: a set of value-tag pairs is obtained the result is the value with the maximum tag
Writes: a set of value-tag pairs is obtained new-value is propagated with a new-tag that is a
lexicographic increment of tag :
new-tag := tag.seq + 1, pid
36
Dynamic Reader-Writer and Recon
The work is split between Reader-Writer and Recon Recon emits consistent configurations Reader-Writer processes run two-phase quorum-based
algorithm, with all “active” configurations Background “gossip” builds fixed-points If Recon emits new configuration, Reader-Writer continues
reads/writes in progress, until fixed-point is reached
Net
R/Wi Recon R/Wj
37
Processing Reads and Writes
Reads and Writes perform Query and Propagation phases using known configurations, stored in op.cmap. Query: Gets “fresh” value, tag, and cmap information from
read-quorums Propagation: Gives latest (value,tag) to write-quorums Both phases: Extend op.cmap with newly-discovered
configurations that now must also be involved. Each phase ends with a fixed point, involving all the
configurations currently in op.cmap
Read or Write
Propagation PhaseQuery Phase
StartQuery
StartProp
EndQuery
EndProp
38
Methodology
Algorithms are presented formally, using interacting state machine models: Input/Output automata) service specifications algorithm descriptions models for applications
Safety: rigorous proof of correctness (atomicity) for arbitrary patterns of asynchrony and change
Conditional performance analysis E.g., when message latency < d, quorum configurations
are “viable”, then read and write operations take time between 4d and 8d, under reasonable “steady-state” assumptions.
…
Net
…
3939
Input Output Automata (IOA)
Input Output Automata [Lynch & Tuttle] A language and formalism based on simple
mathematical foundation Elegant, suitable for analysis, well documented Supports composition, abstraction Model system components as a collection of
interacting state machines Allows rigorous reasoning about system properties Widely used in research – hundreds of algorithms
40
Example Spec’n: Asynchronous Lossy Channel
Input sendi,j(m)Output recvj,i(m)
Internal lose(m)
Channeli,j
41
Details: Reader-Writer: Send and Recv Code
Specification of gossip using Input/Output Automata of [Lynch Tuttle]
Send
Receive
42
Details: Reader-Writer Fixed Points
Phase 1 fixed point
Specification of fixed points using Input/Output Automata
Phase 2 fixed point
43
Showing Read/Write Atomicity
We show atomicity using a partial order For be any sequence of reads and writes
Let be an irreflexive PO of all op-s in To prove atomicity show that satisfies:
For any operation , finitely many op-s If precedes , then not If is a write then either or Any read returns value written by last write, per
(or init value if no such writes) [Lynch, Lemma 13.16]
Rigorous proof that in RAMBO the max-tag of reads and new-tag of writes, lexi-ordered, provide such a PO
44
Some Latency Analysis Details
Restrict attention to a subset of timed executions Reminder: Read and write operations are not affected by
Recon delays or non-termination Configuration upgrade (garbage collection) takes 4d If reconfigurations are “rare” -- operations take 4d If configurations are in “steady state” -- operations take 8d Logarithmic time “catch-up” after a bursty period of
“bad timing behavior” A recovering node joins quickly after a long absence
45
Implementation
Experimental system implementations [Musial 07] Platform for refinement, optimization, tuning Observe of algorithms in a local area setting Cluster with 16+/- Linux machines & fast switch
Developed by manually translating the Input/Output Automata specification to Java code Precise rules are followed to mitigate error introduction
during translation Rigorous proofs [Georgiou, Musial, Sh., Sonderegger 07, 11]
Next steps: Specification in Tempo [Lynch Michel S 08] (Timed IOA) Code generation ([Georgiou Lynch Mavrommatis Tauber 09])
46
Optimizations – Rigorously Proved
Parallel upgrade [Gilbert Lynch S 10]
Improving communication efficiency [Georgiou, Musial, S 06]
Explicit “leave” service Incremental gossip for long-lived data
Improving liveness & gossip efficiency [Gramoli, Musial, S 05]
Non-deterministic heuristic for prodding slow operations Constrained gossip patterns
Providing complete shared memory [Georgiou, Musial, S 05]
Domain-oriented Rambo Optimizing performance for groups of related objects
47
Optimization and Development Methodology
Atomicityproperties
AbstractRambo
Rambowith gracefuldepartures
Long-LivedRambo
Proof
DerivationRunningSystem
Simulation
SimulationManual derivation
[Musial 07]or
semi-automatedcode generation
[Georgiou + 09,10]
[Lynch, S 02][Gilbert, L, S 10]
[Georgiou,Musial, S 04, 06]
48
Forward Simulation
Let C (low-level, concrete) and A (high-level, abstract) be two systems.The goal is to show that “C implements A” by “trace inclusion”
An execution is a finite or infinite alternating sequence s0a1s1a2s2… of states and actions, where s0 is a start state
A trace is obtained by removing all states and internal actions Forward simulation from C to A
states(C) x states(A) If s start(C) then (s) start(A) If (s,a,s’) transitions(C), then there exists an execution fragment
= (s), …, (s’) such that trace(a) = trace()
Theorem: If there exists a simulation relation from C to A, then traces(C) traces(A).
This implies that any safety property of A is also satisfied by C
49
Simulation (or Abstraction) Function
Simulation function F maps each state of specification C (concrete) to a state of specification A (abstract)
F is required to satisfy the following two conditions1. If s is an initial state of C then F(s) is an initial state of A2. If s and F(s) are reachable states of C and A
respectively, and (s, p, s') is a step of C, then there is a step ’p of A from F(s) to F(s'), having the same trace
A:
C:
F(s) F(s')
s s'
F F
p
’p
50
Optimization: Improving performance
Long-Lived RAMBO: Graceful Leave + Incremental Gossip Rigorous proof of correctness by simulation Performance study
[Georgiou, Musial, Shvartsman 06]
51
Complete Shared Memory
Atomicity is compositional Naïve approach:
Implement a singleton memory location Get a complete shared memory by running several
implementations Not so fast! Literally.
52
Complete Shared Memory
Domain-oriented reconfigurable atomic memory Optimizing performance for groups of related objects
- Composition
- Domain
53
Atila: Atomicity through Indirect Learning
Abstract Rambo assumes all-to-all connectivity Atila: “Atomicity Through Indirect Learning Algorithm”
Indirect learning enables progress without any routing protocol or complete connecticivity[Konwar,Musial, Nicolaou, S 07]
Prove atomicity guarantees Conditional analysis of operation latency Identify deployment settings that
lead to reasonable communication cost
54
RDS: Rambo Paxos
Reconfigurable Distributed Storage for Dynamic Networks RDS [Chockler, Gilbert, Gramoli, Musial, Shvartsman 09]
How can liveness of Rambo get hurt? Reconfiguration bursts and Rapid rate of failures
Solution: Paxos-based procedure for installing new configurattons Integrate configuration upgrade with installation Overlap phases of Paxos with “garbage collection” Obsolete configuration are removed quicker Reads and writes use only 1 or 2 configurations
55
Federated Array of Bricks (FAB)
Storage system developed and evaluated at HP Labs [Saito Frølund Veitch Merchant Spence 05]
Distributes workload and handles failures and recoveries without disturbing client requests
Read or write protocol involves majority quorums of storage “bricks” following the Rambo algorithm
Evaluations of the implementation showed FAB performance is similar to centralized solutions, While offering at the same time continuous service and
high availability
56
DynaDisk Implementation
Data-center read/write storage system Allows add/remove of storage devices on-the-fly Based on DynaStore, but with and without consensus [Shraer Martin Malkhi Keidar 10]
57
GeoQuorums
Dynamic atomic read/write memory for mobile settings [Dolev, Gilbert, Lynch, Shvartsman, Welch 04, 05]
Use Rambo architecture over Virtual Node layer Nodes: fixed geographical locations called Focal Points
Centers of populated, compact geographical areas: Traffic intersections, buildings, bridges, points-of-interest
Continuously populated, thus able to maintain state Implementations:
Virtual Node layer over the physical mobile network Atomic read/write memory over the Virtual Node layer
58
GeoQuorums
Mobile nodes Focal points –
implemented asVirtual Nodes
Quorums are definedover focal points
Use GPS as timestamps Fast(er) read/write operations
Single phase writes One or two phase reads
Simplified, consensus-free, reconfiguration Two-phase algorithm using fixed configurations Can be motivated by performance: e.g., if writes are
frequent, install smaller write quorums
59
Read-Modify-Write Storage Systems
RMW is strictly stronger than atomic read and write Some storage systems implement atomic RMW
operations. This type of consistency is expensive, and requires at its core atomic updates Reduce parts of the system to a single-writer model
(e.g., Microsoft’s Azure) Depend on clock synchronization hardware (Google’s
Spanner) Rely on complex mechanisms for resolving event
ordering such as vector clocks (Amazon’s Dynamo)
60
Tempo @ www.veromodo.com
A toolkit for the Timed Input/Output Automata formalism Computer-aided design framework
Specification language Suite of tools used to specify systems and
reasons about them (interfaces to UPPAAL and PVS)
Eclipse GUI (and alternatively a command line interface)
Tempo is used to Specify distributed algorithms / protocols Verify their correctness / safety with theorem
provers / model checkers Simulate their behavior (native simulator)
Tempo availability: beta at www.veromodo.com
61
Tempo – bird’s eye view
SyntaxError Reports
SyntaxError Reports
Parser
VocabulariesAutomataFiles
Syntax TreeSemanticAnalysis
Back-endlibraries
Internalrepresentation
Back-endPlugin loader
Simulator(interactive!)
UppaalModel checker
ProbabilisticModel checker
PVSGeneration
CodeGeneration
DeploymentOptimization
Legend
Existing back-end tools:
Proposed back-end tools:
62
Tempo Availability (beta for research use)
• Pure Java
• Runs on the Eclipse Rich Client Platform
• Runs on multiple platforms
• Mac OS X
• Windows
• Linux
63
Problems & Futures
Use Tempo (www.veromodo.com) Complete specifications Code generation
“Weaker” consistency – crafting atomicity? Local optimizations that preserve global atomicity
Practical settings Recoveries, reconfiguration, and self-stabilization Indirect learning in mobile settings
Reconfiguration When to reconfigure? How to choose configurations? Optimization of deployment [Sonderegger 11]
Going beyond read/write objects
64
Thank You!
Questions and Discussion
?
65
Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Distributed Systems
Consistent shareable data services supporting atomic (linearizable) objects provide convenient building blocks for dynamic distributed systems. In general it is challenging to combine the guarantee of linearizability with efficiency in practical systems subject to perturbations in the computing medium. We describe our work on specification and implementation of consistent data services for network settings. We present a general framework that can be tailored to yield services for various target network settings. In these services, long-lived data availability is ensured through replication, while atomicity is ensured by performing reads and writes using evolvable quorum configurations. We describe on-the-fly reconfiguration that only modestly interferes with on-going operations, in particular, the termination of reads and writes does not depend on the termination of reconfiguration. Safety (atomicity) holds for arbitrary patterns of asynchrony, processor crashes, and link failures. We describe specific examples of specification, rigorous reasoning about correctness, provable optimizations, and methodical implementations of consistent data services in distributed systems.