10/30/2006 ELEG652-06F 1
Topic 6
Advance Topics
"Foolproof systems don't take into account the ingenuity of fools." Gene Brown.
“But I don’t want to go among mad people,” Alice remarked.
“Oh, you can’t help that,” said the Cat. “We’re all mad here. I’m mad. You’re mad.”
“How do you know I’m mad?” said Alice.
“You must be,” said the Cat, “or you wouldn’t have come here.”
Alice's Adventures in Wonderland
Reading List
• Slides: Topic6x
• Other papers as assigned in class or homework
Outline
• A review of Parallel Architecture topics
• Synchronization and Parallelism
  – Methods to alleviate and exploit
• Dataflow Model
  – Program Graphs
  – Static, dynamic and recursive models
• From Pure Dataflow to multithreading
• Transactional Memory
  – Lock Free Data Structures
  – Types
What have we learned?
• Terminology and interconnect networks
  – Their effects on communication
  – The different types of architectures
  – Classes of Applications
• Exploiting ILP
  – Methods of exploiting parallelism in different architectures
• Memory Models
  – Their impact on programming and hardware design
  – Different implementations of such models in the memory hierarchy
• Synchronization
  – Its cost and types
Exploiting Parallelism
• What is the factor that determines the parallelism of an application?
  – The dependencies
• How to extract the maximum possible parallelism?
  – Respect most dependencies and resolve the ones that can be resolved
  – Think of the Re-order Buffer
• A model in which data “fires” operations
  – Dataflow
Synchronization
• Cost
  – Lock acquisition and operations
    • On the order of a thousand cycles
  – Barriers
    • On the order of ten thousand cycles
• Problem
  – Lock access
    • Network and memory bandwidth and latency overhead
• Solution
  – Get rid of the locks
  – “Optimistic execution”
  – Lock Free Data structures
  – Transactional Memory
Topic 6a
Dataflow Model
An Execution Model for Parallel Computation
A Short Story
[Timeline figure; axis from 1960 to 2010]
• Carl Adam Petri defines Petri Nets
• Estrin and Turn propose an early dataflow model
• Karp and Miller analyze Computation Graphs w/o branches or merges
• Rodriguez proposes Dataflow Graphs
• Chamberlin proposes a Single Assignment language for dataflow
• Dennis proposes a dataflow language. Pure Dataflow is born
• Kahn proposes a simple parallel processing language with vertices as queues. Static Dataflow is born
• Dennis designs a dataflow architecture
• Arvind & Gostelow, and separately Gurd and Watson, create a tagged token dataflow model. Dynamic Dataflow is born
• Arvind, Nikhil, et al. design the Monsoon dataflow machine
Important Concepts / Properties
• Determinate
  – The execution of a concurrent program is determinate if the order in which the operations are performed does not affect the outcome of the computation.
• Non-determinate
  – The execution of a concurrent program is non-determinate if the order in which the operations are performed does affect the outcome of the computation.
Important Concepts / Properties
• Deterministic
  – The execution of a concurrent program is deterministic if the order in which the operations are performed remains the same each time the program is executed
• Non-deterministic
  – The execution of a concurrent program is non-deterministic if the order in which the operations are performed may vary each time the program is executed
• Determinate == Deterministic (?)
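One way to approach the question on this slide is to try both schedules of two operations and compare the outcomes. The sketch below is only an illustration under that assumption: `run_schedule`, `is_determinate` and the three sample operations are hypothetical names, not part of the course material. It suggests that an execution can be non-deterministic (the order varies) and still determinate (the result does not).

```c
#include <assert.h>

typedef int (*op_fn)(int);

/* Three sample operations on a shared value (hypothetical examples). */
int add3(int v) { return v + 3; }
int add5(int v) { return v + 5; }
int dbl(int v)  { return v * 2; }

/* Apply two operations to `start`; order 0 runs a then b, order 1 runs b then a. */
int run_schedule(op_fn a, op_fn b, int start, int order)
{
    assert(order == 0 || order == 1);
    return order == 0 ? b(a(start)) : a(b(start));
}

/* The pair is determinate for this input if both schedules give the same outcome. */
int is_determinate(op_fn a, op_fn b, int start)
{
    return run_schedule(a, b, start, 0) == run_schedule(a, b, start, 1);
}
```

With a starting value of 1, `add3` and `add5` agree in either order, while `add3` and `dbl` do not, so only the first pair is determinate.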
Deadlock
• A set of processes is “deadlocked” if each process in the set is waiting on events that only another process in the set can cause.
• Necessary conditions
  – Mutual Exclusion
  – Circular Wait
  – No pre-emption
• Difficulty
  – Programmability
  – Correctness
  – Avoidance, prevention and detection
• The case of LL and SC
A Deadlock Example
Process 1: Lock A; Lock B; … Unlock B; Unlock A
Process 2: Lock B; Lock A; … Unlock A; Unlock B
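The circular-wait condition can be checked mechanically. The sketch below assumes a simplified wait-for relation in which each process is blocked on at most one other process; `has_circular_wait` and `NONE` are hypothetical names chosen for illustration, and a real detector would work on a general wait-for graph.

```c
#include <assert.h>

#define NONE (-1)

/* waits_for[i] is the process that process i is blocked on, or NONE. */
int has_circular_wait(const int *waits_for, int n)
{
    for (int start = 0; start < n; ++start) {
        int p = start;
        /* Follow the chain for at most n steps; coming back to `start` is a cycle. */
        for (int step = 0; step < n && p != NONE; ++step) {
            assert(p >= 0 && p < n);
            p = waits_for[p];
            if (p == start)
                return 1;
        }
    }
    return 0;
}
```

In the two-process example, process 0 (holding A) waits for process 1 and process 1 (holding B) waits for process 0, which this check reports as a cycle. Acquiring the locks in the same order in both processes removes the cycle.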
The Dataflow Model
• Can we come up with a parallel program execution model and a base language such that parallelism is fully exposed while the determinacy and deadlock-free properties are ensured if the user is guided to write “well-structured” programs?
• The maximum parallelism in a given piece of code.
• Motivation
  – Parallelism by explicit data dependency
  – Determinacy
  – Deadlock free
  – Support high-performance architectures
Dataflow vs. Control Flow
• Dataflow
  – Program graph of operators
  – Operator consumes / produces tokens
  – All enabled operators can run concurrently
• Control Flow
  – Program sequence of ops
  – Operator reads and writes data from storage
  – Only one operation at a time
• Define Successor
Dataflow Concepts
• Tokens
  – Data value with “presence” indication
• Actor
  – Takes a set of n inputs and produces a set of m outputs
  – Only “enabled” when all n inputs are available
• Dataflow Graph
  – A group of operators / actors that represents a computational section
  – The relationship between each actor
  – Controlled by data presence
[Figure: a dataflow graph built from four + actors]
Dataflow: A Base Language
• More on Dataflow Graphs
  – To serve as an intermediate-level language for high-level languages (Jack B. Dennis)
  – To serve as a machine language for parallel machines (Jack B. Dennis)
  – G = ( A, E ) is a directed graph where A is a set of actors and E is a set of directed arcs
• A Proper Graph
  – All actors must have arcs of required types
  – All arcs must be connected at both ends
Dataflow Model
• Similarities to the DAG
• Dataflow graph can be constructed from the DAG in a systematic and concise manner
• Exploit dynamic ordering of data arrival
• Seen in aggressive control flow implementations
  – ROB and Tomasulo
• Add some other actors
Actors
1) Links  2) Operators  3) Switch & Control Actors  4) Merge
[Figure: symbols for the four actor types; switch and merge actors carry T / F ports]
Dataflow Model of Computation
ADD R0, R1, R2
SUB R3, R4, R5
MULT R6, R0, R3
[Figure: + and - actors feeding a * actor, shown in four snapshots]
1. Tokens 1 and 3 on the inputs of + (R1, R2); tokens 6 and 4 on the inputs of - (R4, R5)
2. - fires and produces 2; tokens 1 and 3 are still waiting at +
3. + fires and produces 4; * now holds tokens 4 and 2
4. * fires and produces 8
Operational Semantics: Firing Rule
• Tokens: Data
• Assignment: Placing a token in the output arc
• Snapshot / configuration: state
• Computation
  – The intermediate step between snapshots / configurations
• An actor of a dataflow graph is enabled if there is a token on each of its input arcs
• Any enabled actor may be fired to define the “next state” of the computation
• An actor is fired by removing a token from each of its input arcs and placing tokens on each of its output arcs.
• Computation: A Sequence of Snapshots
  – Many possible sequences as long as firing rules are obeyed
  – Determinacy
  – “Locality of effect”
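The firing rule can be tried on the ADD / SUB / MULT graph from the earlier slides. The following is a minimal sketch, not a definitive implementation: `Arc`, `Actor`, `enabled`, `fire` and `run_graph` are hypothetical names, each arc is assumed to hold at most one token, and the `reverse` flag exists only to show that a different firing order reaches the same result.

```c
#include <assert.h>

typedef struct { int present; int value; } Arc;   /* a token is a value plus a presence flag */
typedef struct { char op; Arc *in0, *in1, *out; } Actor;

/* Enabled: there is a token on each input arc. */
int enabled(const Actor *a) { return a->in0->present && a->in1->present; }

/* Fire: remove a token from each input arc and place the result on the output arc. */
void fire(Actor *a)
{
    assert(enabled(a));
    int x = a->in0->value, y = a->in1->value;
    a->in0->present = a->in1->present = 0;
    a->out->value = (a->op == '+') ? x + y : (a->op == '-') ? x - y : x * y;
    a->out->present = 1;
}

/* Fire enabled actors until none is left; `reverse` flips the scan order. */
int run_graph(int r1, int r2, int r4, int r5, int reverse)
{
    Arc R1 = {1, r1}, R2 = {1, r2}, R4 = {1, r4}, R5 = {1, r5};
    Arc R0 = {0, 0}, R3 = {0, 0}, R6 = {0, 0};
    Actor g[3] = { {'+', &R1, &R2, &R0}, {'-', &R4, &R5, &R3}, {'*', &R0, &R3, &R6} };
    int fired;
    do {
        fired = 0;
        for (int k = 0; k < 3; ++k) {
            Actor *a = &g[reverse ? 2 - k : k];
            if (enabled(a)) { fire(a); fired = 1; }
        }
    } while (fired);
    return R6.value;   /* value left on the output arc of * */
}
```

Each call is one sequence of snapshots; both scan orders obey the firing rule and end in the same final snapshot, which is the determinacy property named above.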
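Firing Rules
1) Links  2) Operators
[Figure: before/after firing diagrams; a link passes its input token v to its output arcs, an operator consumes inputs v1 … vn and produces outputs u1 … un]
Firing Rules
3) Switch & Control Actors  4) Merge
[Figure: before/after firing diagrams for switch and merge actors with T / F control tokens and data tokens v, v1, v2]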
General Firing Rules
• A switch actor is enabled if a token is available on its control input arc, as well as the corresponding data input arc.
  – The firing of a switch actor will remove the input tokens and deliver the input data value as an output token on the output arc.
• An (unconditional) merge actor is enabled if there is a token available on any of its input arcs.
  – An enabled (unconditional) merge actor may be fired and will (non-deterministically) put one of the input tokens on the output arc.
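The two rules above can be written down directly. This is a sketch under stated assumptions: the names `Arc`, `fire_switch`, `fire_merge` and `switch_then_merge` are hypothetical, and when both merge inputs hold a token this sketch simply takes the first one, whereas the model leaves the choice open.

```c
#include <assert.h>

typedef struct { int present; int value; } Arc;

/* Switch: needs a control token and a data token. The data value goes to the
   T output when the control value is nonzero, otherwise to the F output. */
int fire_switch(Arc *ctl, Arc *data, Arc *out_t, Arc *out_f)
{
    if (!ctl->present || !data->present)
        return 0;                          /* not enabled */
    Arc *dst = ctl->value ? out_t : out_f;
    dst->value = data->value;
    dst->present = 1;
    ctl->present = data->present = 0;      /* both input tokens are consumed */
    return 1;
}

/* Unconditional merge: enabled by a token on any input arc. */
int fire_merge(Arc *in0, Arc *in1, Arc *out)
{
    Arc *src = in0->present ? in0 : (in1->present ? in1 : 0);
    if (!src)
        return 0;                          /* not enabled */
    out->value = src->value;
    out->present = 1;
    src->present = 0;
    return 1;
}

/* Route one value through a switch and then a merge; returns the merged value. */
int switch_then_merge(int control, int value)
{
    Arc ctl = {1, control}, data = {1, value}, t = {0, 0}, f = {0, 0}, out = {0, 0};
    int ok = fire_switch(&ctl, &data, &t, &f);
    assert(ok && t.present != f.present);  /* exactly one branch received the token */
    ok = fire_merge(&t, &f, &out);
    assert(ok);
    return out.value;
}
```

A switch followed by a merge is the skeleton of the conditional schema: whichever branch receives the token, the merge forwards it to the single output.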
Conditional Expression
if (p(y)) {
    f(x, y);
} else {
    g(y);
}
[Figure: inputs x and y; predicate p on y drives T / F switches that route the operands to f or g]
A Conditional Schema
[Figure: a decider D(k,1) drives T / F switches that route the m inputs to schema P(m,n) or Q(m,n); the n outputs are merged]
A Loop Schema
[Figure: a merge with an initial F control token admits the initial loop value; COND drives T / F switches around “Loop op”]
Snapshots
[Figure: Initial Snapshot: an (m,n) schema without any enabled actors, with input tokens V1 … Vm. Final Snapshot: an (m,n) schema without any enabled actors, with output tokens U1 … Un]
Dataflow Graphs: Well Behaved Graphs
• Data flow graphs that produce exactly one set of result values at each output arc for each set of values presented at the input arcs
• Self Resetting Graphs
• Determinacy
Dataflow Graph: Well Formed Schemas
• Well Formed Dataflow Schemas (WFDS)
• An operator is a WFDS
• A Conditional Schema is a WFDS
• An Iterative Schema is a WFDS
• An Acyclic Composition of WFDS is a WFDS in itself
• Proposed by Jack B. Dennis and Fosseen in 1973
• Theorem: “A well-formed data flow graph is well-behaved”
Sick Formed Dataflow Graph
[Figure: four ill-formed graphs built from actors A through N, labeled Deadlock, Hangup, Conflict and Unclean]
Well Behaved Program
• Always determinate in the sense that a unique set of output values is determined by a set of input values
• References:
  – Rodriguez, J.E., “A Graph Model of Parallel Computation”, MIT, TR-64, 1966
  – Patil, S., “Closure Properties of Interconnections of Determinate Systems”, Record of the Project MAC Conf. on Concurrent Systems and Parallel Computation, ACM, 1970, pp. 107-116
  – Denning, P.J., “On the Determinacy of Schemata”, pp. 143-147
  – Karp, R.M. & Miller, R.E., “Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing”, Appl. Math, 14(6), Nov. 1966
Topic 6b
Types of Dataflow
Dataflow Models
• Static Dataflow Model
• Tagged Token Dataflow Model
  – Also known as dynamic
• Recursive Program Graphs
Static Dataflow Model
• “...for any actor to be enabled, there must be no tokens on any of its output arcs...”
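Compared with the basic firing rule, the static model adds one condition: the output arc must be empty. A minimal sketch of that check follows, assuming a two-input, one-output actor; `Arc`, `static_enabled` and `static_enabled_flags` are hypothetical names used only for illustration.

```c
#include <assert.h>

typedef struct { int present; int value; } Arc;

/* Static dataflow: enabled only if every input arc holds a token
   and the output arc holds none (one token per arc). */
int static_enabled(const Arc *in0, const Arc *in1, const Arc *out)
{
    return in0->present && in1->present && !out->present;
}

/* Helper for experiments: takes presence flags (0 or 1) instead of arcs. */
int static_enabled_flags(int in0, int in1, int out)
{
    assert(in0 >= 0 && in1 >= 0 && out >= 0 && (in0 | in1 | out) <= 1);
    Arc a = {in0, 0}, b = {in1, 0}, o = {out, 0};
    return static_enabled(&a, &b, &o);
}
```

An actor whose inputs are both present stays blocked while its previous result is still sitting on the output arc, which is why the consumer has to signal back to the producer in this model.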
Conditional Expression
if (p(y)) {
    f(x, y);
} else {
    g(y);
}
[Figure: the conditional graph with inputs x and y, predicate p, T / F switches routing to f or g, and a FIFO on the control path]
Example: Power Function
long power(int x, int n)
{
    long y = 1;                      /* long, to match the return type */
    for (int i = n; i > 0; --i)
        y *= x;
    return y;
}
y = x^n
Power Function
[Figure: static dataflow graph for y = x^n. Merge actors with initial f control tokens admit x, 1 and n as the loop values x, y and i; the i>0 test drives T / F switches; the loop body applies * to y and x and -1 to i; the F side delivers y to return]
Power Function: hand simulation of y = 2^3 (x = 2, n = 3)
Step  x  y  i  i>0
0     2  1  3  t
1     2  2  2  t
2     2  4  1  t
3     2  8  0  f
The f control tokens route y = 8 to return.
DFG: Vector Addition
for (i = 0; i < N; ++i)
    c[i] = a[i] + b[i];
[Figure: dataflow graph with inputs a, b, NULL, 0 and N; a >= test drives the T / F switches; +1 increments i; two Select actors read a[i] and b[i]; + and Assign build c]
Static Dataflow Model: Features
• One-token-per-arc
• Deterministic merge
• Conditional/iteration construction
• Consecutive iterations of a loop can only be pipelined.
• A dataflow graph is stored as activity templates
  – Opcode of the represented instruction
  – Operand slots for holding operand values
  – Destination address fields
• Token: value + destination
Static Dataflow Model: Features
• Deficiencies:
  – Due to acknowledgment tokens, the token traffic is doubled.
  – Lack of support for programming constructs that are essential to modern programming languages
    • no procedure calls
    • no recursion
• Advantage:
  – simple model
Activity Template
• Template fields: Opcode; Operand / Value; Next Operand / Result; Input Address; Signal Back
[Figure: graph x, y → * → sqrt → next, with its activity templates]
  – Template *: operands x = 2 and y = 3; result goes to sqrt; signals back to the producers of x and y
  – Template sqrt: result goes to next; signals back to *
• Legend
  – Token Arc / Communication Arc
  – Opx / Opy: the operations that produced x and y
  – next: the operation that will use the sqrt result
A More Complicated Example
a * c - b * d
a * d + b * c
[Figure: six activity templates. * (a, d) and * (b, c) send their results N1 and N2 to +; * (a, c) and * (b, d) send their results M1 and M2 to -; + and - send their results to next. Each template also signals back to the producers of its operands]
Recursive Program Graphs
• Outlaw iterations:
  – Graph must be acyclic
• One-token-per-arc-per-invocation
• Iteration is expressed in terms of a tail recursion
Tail Function Application
• Tail-procedure application
  – A procedure application that occurs as the last statement in another procedure
• Tail-function application is a function application (appearing in the body expression) whose result value is also returned as the value of the entire function
• Consider the role of the stack
Factorial
Normal Recursive
long fact(long n)
{
    if (n == 0) return 1;
    else return n * fact(n - 1);
}
Tail Recursive
long fact_1(long n, long p)
{
    if (n == 0) return p;
    else return fact_1(n - 1, n * p);
}
Factorial: The Normal Version
long fact(long n)
{
    if (n == 0) return 1;
    else return n * fact(n - 1);
}
[Figure: recursive program graph for fact with inputs n and 1; the n==0 test drives a T / F switch; the T side returns 1; the F side uses -1, Apply fact and *]
Hand Simulate fact(3)
fact(3)
= 3 * fact(2)
= 3 * (2 * fact(1))
= 3 * (2 * (1 * fact(0)))
= 3 * (2 * (1 * 1))
= 3 * (2 * 1)
= 3 * 2
= 6
Factorial: The Tail Recursion Version
long fact_1(long n, long p)
{
    if (n == 0) return p;
    else return fact_1(n - 1, n * p);
}
[Figure: recursive program graph for fact_1 with inputs n and p; the n==0 test drives F / T switches; the T side returns p; the F side computes n - 1 and n * p and feeds Apply fact]
Hand Simulate fact_1(3, 1)
fact_1(3, 1)
= fact_1(2, 3)
= fact_1(1, 6)
= fact_1(0, 6)
= 6
Recursive Program Graph: Features
• Acyclic
• One-token-per-link-in-lifetime
• Tags
• No deterministic merge needed
• Recursion is expressed by runtime copying
• No matching is needed (why?)
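Power function as a Recursive Graph
int rec(int x, int y, int n)
{
    if (n <= 0) return y;                 /* loop finished: y holds the result */
    else return rec(x, x * y, n - 1);     /* tail call */
}
[Figure: recursive program graph with inputs x, y and n; the n > 0 test drives T / F switches; the T side computes x * y and n - 1 and feeds apply (Rec); the F side returns y]
Power function as a Recursive Graph
int rec(int x, int n)
{
    if (n <= 0) return 1;
    else return x * rec(x, n - 1);        /* not a tail call: * runs after the call returns */
}
[Figure: recursive program graph with inputs x and n; the n > 0 test drives T / F switches; the T side feeds Apply rec with x and n - 1 and multiplies the result by x; the F side returns 1]
Note: Tail-recursive = Iterations, i.e. the states of the computation are captured explicitly by the set of iteration variables.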
Dynamic Dataflow
• Static Dataflow
  – Only one token per arc
  – Problems with function calls, nested loops and data structures
  – A signal is needed to allow the parent’s operator to fire
• Dynamic Dataflow
  – You can see them as replicating static dataflow machines
  – The MIT tagged token model
The Token in Dynamic Dataflow
[v, <u, s>, d]
v: value
u: activation instance
s: destination actor
d: operand slot
Token = Tag + Value
Unlike Static Dataflow, the token needs the tag
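The token layout maps naturally onto a small record, and the tag <u, s> is what a matching unit compares. The sketch below assumes two-input actors and uses a linear search as a stand-in for the associative memory or hashing a real machine would use; `Token`, `match_token`, `match_value` and `STORE_SIZE` are hypothetical names.

```c
#include <assert.h>

/* Token [v, <u, s>, d]: value, activation instance, destination actor, operand slot. */
typedef struct { int v; int u; int s; int d; } Token;

#define STORE_SIZE 16
static Token waiting[STORE_SIZE];   /* tokens still waiting for their partner */
static int n_waiting = 0;

/* Matching for a two-input actor: if a token with the same tag <u, s> is
   waiting, remove it, copy it to *partner and return 1 (the actor can fire).
   Otherwise park the new token and return 0. */
int match_token(Token t, Token *partner)
{
    for (int k = 0; k < n_waiting; ++k) {
        if (waiting[k].u == t.u && waiting[k].s == t.s) {
            *partner = waiting[k];
            waiting[k] = waiting[--n_waiting];
            return 1;
        }
    }
    assert(n_waiting < STORE_SIZE);
    waiting[n_waiting++] = t;
    return 0;
}

/* Convenience wrapper: returns the partner's value on a match, or -1 if the
   token had to wait (so -1 is not usable as a data value in this sketch). */
int match_value(int v, int u, int s, int d)
{
    Token in = {v, u, s, d}, partner;
    return match_token(in, &partner) ? partner.v : -1;
}
```

Tokens for actor 7 from activation 1 and activation 2 can sit in the store at the same time without being confused, which is what lets several instances of the same graph run concurrently.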
Dynamic Dataflow
• Loops and function calls
  – Should be executed in parallel as instances of the same graph
• Abstract the replication
• An arc is a container for tokens that have different tags
• A node can fire as soon as tokens with identical tags are present on all of its input arcs
Dynamic Dataflow
• Advantages
  – Better Performance
  – More Parallelism
• Disadvantages
  – Implementation of the matching unit
  – Associative Memory would be ideal
    • Not cost effective
    • Hashing is used
Dataflow Memory: The I Structures
• Single Assignment Rule and Complex Data structures
  – Consume the entire data structure after each access
• The Concept of the I Structure
  – Only consume the entry on a write
  – A data repository that obeys the Single Assignment Rule
  – Written only once, read many times
• Elements are associated with status bits and a queue of deferred reads
Dataflow: The I Structures
• An element becomes defined on a write, and this happens only once
  – At this moment all deferred reads will be satisfied
• Use a data structure before it is completely defined
• Incremental creation or reading of data structures
Dataflow: The I Structures
• Status
  – Present: The element can be read but not written
  – Absent: The element has not been written yet and no read is pending (initial state)
  – Waiting: At least one read request has been deferred
• State transitions (r = read, w = write)
  – Absent: r → Waiting, w → Present
  – Waiting: r → Waiting, w → Present
  – Present: r → Present, w → Error
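The three states and the deferred-read queue can be sketched as a small state machine. This is an illustration under assumptions, not the layout of any real machine: `IElement`, `i_init`, `i_fetch`, `i_store`, `i_demo` and the fixed `MAX_DEFERRED` queue are hypothetical, and a deferred read is modeled as a pointer to the place where the value should eventually be delivered.

```c
#include <assert.h>

typedef enum { ABSENT, WAITING, PRESENT } Status;

#define MAX_DEFERRED 8

typedef struct {
    Status status;
    int value;
    int *deferred[MAX_DEFERRED];   /* destinations of reads that arrived too early */
    int n_deferred;
} IElement;

void i_init(IElement *e) { e->status = ABSENT; e->value = 0; e->n_deferred = 0; }

/* Read: returns 1 and fills *dest if the element is present; otherwise
   queues dest as a deferred read, moves to WAITING and returns 0. */
int i_fetch(IElement *e, int *dest)
{
    if (e->status == PRESENT) { *dest = e->value; return 1; }
    assert(e->n_deferred < MAX_DEFERRED);
    e->deferred[e->n_deferred++] = dest;
    e->status = WAITING;
    return 0;
}

/* Write: allowed once; satisfies all deferred reads. A second write is the
   Error transition and returns 0 without changing the element. */
int i_store(IElement *e, int value)
{
    if (e->status == PRESENT) return 0;
    e->value = value;
    for (int k = 0; k < e->n_deferred; ++k) *e->deferred[k] = value;
    e->n_deferred = 0;
    e->status = PRESENT;
    return 1;
}

/* Scenario: read before write, write, read again, then an illegal second write. */
int i_demo(void)
{
    IElement e; int early = -1, late = -1;
    i_init(&e);
    int r1 = i_fetch(&e, &early);   /* deferred */
    int w1 = i_store(&e, 42);       /* satisfies the deferred read */
    int r2 = i_fetch(&e, &late);    /* immediate */
    int w2 = i_store(&e, 7);        /* rejected: single assignment */
    return r1 == 0 && w1 == 1 && r2 == 1 && w2 == 0 && early == 42 && late == 42;
}
```

The early read does not block the producer; it simply waits in the queue until the single write arrives, which is what allows a structure to be used before it is completely defined.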
Dataflow: The I Structures
• Elementary operations
  – Allocate: reserves space for the new I-Structure
  – I-fetch: gets the value of the specified I-structure element (may be deferred)
  – I-store: writes a value into the specified I-structure element
• Used to construct the nodes:
  – SELECT
  – ASSIGN
Dataflow: The I Structures
[Figure: SELECT takes A and j, computes the element address (Addr) in the I struct and issues an I Fetch; ASSIGN takes A, j and x, computes the element address (Addr) and issues an I Store]
Evolution from “Pure Dataflow” to
“Hybrid” and “Multithreading”
Topic 6b
Evolution of Multithreaded Execution and Architecture Models
[Figure: lineage chart; entries listed in the order they appear]
• Non-dataflow based: CDC 6600 (1964); MASA (Halstead, 1986); HEP (B. Smith, 1978); Cosmic Cube (Seitz, 1985); J-Machine (Dally, 1988-93); M-Machine (Dally, 1994-98)
• Dataflow model inspired: MIT TTDA (Arvind, 1980); Manchester (Gurd & Watson, 1982); *T/Start-NG (MIT/Motorola, 1991-); SIGMA-I (Shimada, 1988); Monsoon (Papadopoulos & Culler, 1988); P-RISC (Nikhil & Arvind, 1989); EM-5/4/X, RWC-1 (1992-97); Iannucci’s (1988-92)
• Others: Multiscalar (1994), SMT (1995), etc.
• Also on the chart: Flynn’s Processor (1969); CHoPP’77, CHoPP’87; TAM (Culler, 1990); Tera (B. Smith, 1990-); Alewife (Agarwal, 1989-96); Cilk (Leiserson); LAU (Syre, 1976); Eldorado; CASCADE; Static Dataflow (Dennis, 1972, MIT); Arg-Fetching Dataflow (Dennis, Gao, 1987-88); MDFA (Gao, 1989-93); MTA (Hum, Theobald, Gao, 94); EARTH (PACT95’, ISCA96, Theobald99); CARE (Marquez04)
Von Neumann-type Processing
[Figure: a Program (Begin for i = 1… endfor end.) is compiled into a Sequential Machine Representation and loaded into the Processor (CPU)]
A Multi-Threaded Architecture
[Figure: one PE, with links to other PEs]
McGill Data Flow Architecture Model (MDFA)
[Figure: the same three-node graph (n1, n2, n3) drawn under the argument-flow principle and under the argument-fetching principle, with fetch and store operations marked]
A Dataflow Program Tuple
Program Tuple ::= {P-Code, S-Code}
P-Code:
n1: x = a + b
n2: y = c - d
n3: z = x * y
S-Code:
[Figure: signal graph over nodes n1, n2 and n3 (each annotated “2 3”), with inputs a, b, c, d and output z; labels ISU and IPU]
The McGill Dataflow Architecture Model
[Figure: the PIPU and the DISU exchange Fire and Done signals; the DISU contains the Enable Memory and Controller and performs Signal Processing]
The McGill Dataflow Processor
[Figure: PIPU pipeline with instructions (X) in flight, connected to the DISU by Fire and Done; the DISU holds waiting instructions and enabled instructions; legend “= PC”]
Important Features:
• Pipeline can be kept fully utilized provided that the program has sufficient parallelism.
The Scheduling Memory (enable)
[Figure: the DISU enable memory as a bit matrix (rows 1 0 1 1, 0 1 0 0, 0 0 1 0, 0 1 0 1); the Controller performs Signal Processing on Count Signal(s), receives Done and issues Fire]
• 1: The instruction is enabled
• 0: The instruction is not enabled
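The split between P-Code and S-Code can be sketched for the three-instruction tuple above. This is a sketch under stated assumptions: the signal structure (n1 and n2 each send one signal to n3, which needs two) is inferred from the data dependencies in the P-Code rather than read from the figure, and `run_tuple`, `count` and `enable` are hypothetical names.

```c
#include <assert.h>

enum { N1, N2, N3, N_INSTR };

/* Run n1: x = a + b, n2: y = c - d, n3: z = x * y under signal-count scheduling. */
int run_tuple(int a, int b, int c, int d)
{
    int count[N_INSTR]  = {0, 0, 2};   /* signals still needed; n3 waits for n1 and n2 */
    int enable[N_INSTR] = {1, 1, 0};   /* enable memory: 1 = enabled, 0 = not enabled */
    int x = 0, y = 0, z = 0, fired = 0;

    while (fired < N_INSTR) {
        int i = 0;
        while (i < N_INSTR && !enable[i]) ++i;   /* controller picks an enabled instruction */
        assert(i < N_INSTR);
        enable[i] = 0;                            /* Fire */
        switch (i) {                              /* P-Code: operands are fetched from memory */
        case N1: x = a + b; break;
        case N2: y = c - d; break;
        case N3: z = x * y; break;
        }
        ++fired;                                  /* Done */
        if (i != N3 && --count[N3] == 0)          /* S-Code: n1 and n2 each signal n3 */
            enable[N3] = 1;
    }
    return z;
}
```

Only signals travel through the scheduler here; the operand values x and y stay in ordinary variables, which mirrors the argument-fetching idea of keeping scheduling separate from the datapath.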
Advantages of the McGill Dataflow Architecture Model
• Eliminate unnecessary token copying and transmission overhead
• Instruction scheduling is separated from the main datapath of the processor
Von Neumann Threads as Macro Dataflow Nodes
• A sequence of instructions is “packed” into a macro-dataflow node
• Synchronization is done at the macro-node level.
[Figure: a macro node containing instructions 1, 2, …, k]
Hybrid Evaluation: Von Neumann style Instruction Execution on the McGill Dataflow Architecture
• Group a “sequence” of dataflow instructions into a “thread” or a macro dataflow node.
• Data-driven synchronization among threads.
• “Von Neumann style sequencing” within a thread.
Advantage: Preserves the parallelism among threads but avoids unnecessary fine-grain synchronization between instructions within a sequential thread.
What Do We Get?
• A hybrid architecture model without sacrificing the advantage of fine-grain parallelism! (latency-hiding, pipelining support)
A Realization of the Hybrid Evaluation
[Figure: instructions 1, 2, …, k marked with a Von Neumann bit; the PIPU and DISU exchange Fire Signals and Done Signals, with a Short Cut path]
Topic 6c
Multithreaded Execution Model, Architecture and System
Challenges: The “Killer Latency Problem”
[Figure: two nodes, each with P, C, M and NI blocks, connected by a Network]
Latency due to:
- Communication
- Synchronization
Low “Round-trip” Latency
• Very important to many parallel applications
• Solutions?
  – Minimize communication and synchronization cost
  – Fully utilize available communication bandwidth to hide latency
• A good multithreaded execution and architecture model helps with both
Data Parallel Models
• Difficult to write unstructured programs
  – Convenient only for problems with regular structured parallelism
• Limited composability!
  – Inherent limitation of single threading
[Figure: alternating compute and communicate phases, ending in “?”]
10/30/2006 ELEG652-06F 156
Coarse-Grain vs. Fine-Grain Multithreading
[Figure: a CPU and memory under each scheme. Coarse-grain multithreading: thread unit, executor locus, a single thread. Fine-grain multithreading: thread unit, executor locus, a pool of threads.]
10/30/2006 ELEG652-06F 157
Evolution of Multithreaded Execution & Architecture Models Based on Dataflow
[Figure: timeline with marks at 1987, 1989, 1994, 1999, 2004, and 2005, rooted in MIT dataflow (1970s).]
• Arg-Fetching Dataflow (Dennis, Gao; McGill) [DennisGao88]
• MDFA/Super Actor (Gao; McGill) [Hum93]
• MTA (Hum et al.; McGill) [HumTheobaldGao94]
• EARTH (Theobald; Delaware/McGill) [Theobald99]
• CARE (Andres; Delaware) [Andres04]
• HTVM (Gao et al.; Delaware)
10/30/2006 ELEG652-06F 158
Topic 6c
EARTH:
An Efficient Architecture
for Running THreads
10/30/2006 ELEG652-06F 159
Open Issues
• Can a multithreaded program execution model support high scalability for large-scale parallel computing while maintaining uniformly high processing efficiency?
• If so, can this be achieved without exotic hardware support?
10/30/2006 ELEG652-06F 160
The EARTH Program Execution Model
• What is a thread?
• How is the state of a thread represented?
• How is a thread enabled?
10/30/2006 ELEG652-06F 161
What is a Thread?
• A parallel function invocation (threaded function invocation)
• A code sequence defined (by a user or a compiler) to be a thread
• Usually, a function body may be partitioned into several threads
10/30/2006 ELEG652-06F 162
The Fibonacci Example
    int fib (long n)
    {
        int sum1;
        int sum2;
        if (n < 2) {
            return (1);
        } else {
            sum1 = fib(n-1);
            sum2 = fib(n-2);
            return (sum1 + sum2);
        }
    }
[Figure: the stack during fib(4), holding the frames of fib(4), fib(3), and fib(2). Each frame stores n, sum1, sum2, and the PC.]
The state of a function invocation is <fp, ip>:
– fp: a frame pointer to its own frame
– ip: a program pointer to its own PC
10/30/2006 ELEG652-06F 163
Execution of Fibonacci: Exploitation of Parallelism
fib(6)
fib(5) fib(4)
fib(4) fib(3) fib(3) fib(2)
fib(3) fib(2) fib(2) fib(1) fib(2) fib(1) fib(1) fib(0)
fib(2) fib(1) fib(1) fib(0) fib(1) fib(0) fib(1) fib(0)
10/30/2006 ELEG652-06F 164
Parallel Function Invocation
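[Figure: tree of “activation frames”. The frame of fib n has children fib n-1 and fib n-2, which in turn have children such as fib n-2 and fib n-3. Each frame holds the caller’s <fp, ip>, local variables, and SYNC slots.]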
10/30/2006 ELEG652-06F 165
Stack and Activation Frames
[Figure: stack frames compared with a tree of frames; an activation frame holds synchronization slots and local variables.]
10/30/2006 ELEG652-06F 166
An Example
    int f(int *x, int i, int j)
    {
        int a, b, sum, prod, fact;
        int r1, r2, r3;
        a = x[i];
        fact = 1;
        fact = fact * a;
        b = x[j];
        sum = a + b;
        prod = a * b;
        r1 = g(sum);
        r2 = g(prod);
        r3 = g(fact);
        return(r1 + r2 + r3);
    }
10/30/2006 ELEG652-06F 167
The Example: Four Fibers
Fiber-0:
    a = x[i];
    fact = 1;
Fiber-1:
    fact = fact * a;
    b = x[j];
Fiber-2:
    sum = a + b;
    prod = a * b;
    r1 = g(sum);
    r2 = g(prod);
    r3 = g(fact);
Fiber-3:
    return (r1 + r2 + r3);
[Figure: the fibers are annotated with the counts 1, 1, and 3.]
10/30/2006 ELEG652-06F 168
Fiber States
• A Fiber shares its “enclosing frame” with the other Fibers within the same function invocation
• The state of a Fiber includes
– its instruction pointer
– its “temporary register set”
• A Fiber is “ultra-lightweight”: it does not need dynamic storage (frame) allocation.
10/30/2006 ELEG652-06F 169
The Fiber Execution Model
[Figure: a Multithreaded Program Graph (MPG). Nodes are “Fiber” actors annotated with counts; arcs carry signal tokens, data tokens, and locality tokens.]
10/30/2006 ELEG652-06F 170
EARTH Fiber Firing Rule
• A Fiber in an MPG becomes enabled once it has received all of its input signals;
• An enabled Fiber may be selected for execution when the required hardware resource is allocated;
• When a Fiber finishes its execution, it sends signals to its destination Fibers in the MPG, updating the corresponding synchronization slots.
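The firing rule can be sketched in C as a counter per synchronization slot. This is only an illustration: the names `sync_slot`, `slot_init`, and `slot_signal` are hypothetical, and the re-arming behavior is an assumption modeled on the `sync_cnt` and `reset_cnt` operands of INIT_SYNC. It is not the EARTH runtime.

```c
#include <stdbool.h>

/* Hypothetical model of one synchronization slot (not EARTH runtime code). */
typedef struct {
    int count;        /* signals still awaited before the fiber is enabled */
    int reset_count;  /* value restored after the fiber has been enabled */
    int fiber_id;     /* fiber to enable when the count reaches zero */
} sync_slot;

/* Shaped after INIT_SYNC ss_off, sync_cnt, reset_cnt, ip. */
static void slot_init(sync_slot *s, int sync_cnt, int reset_cnt, int fiber_id)
{
    s->count = sync_cnt;
    s->reset_count = reset_cnt;
    s->fiber_id = fiber_id;
}

/* Deliver one signal. Returns true when this signal enables the fiber. */
static bool slot_signal(sync_slot *s)
{
    s->count -= 1;
    if (s->count == 0) {
        s->count = s->reset_count;  /* re-arm the slot for the next round */
        return true;                /* all input signals have arrived */
    }
    return false;
}
```

A slot initialized with a count of 3 reports the fiber as enabled only on the third call to `slot_signal`.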
10/30/2006 ELEG652-06F 171
Fiber States
[Figure: state diagram with the states DORMANT, ENABLED, and ACTIVE. Transition labels: thread created, synchronization received, CPU ready, thread completed, thread terminated.]
10/30/2006 ELEG652-06F 172
The EARTH Model of Computation
[Figure legend: Fiber within a frame; parallel function invocation; a sync operation; invoke a threaded function.]
10/30/2006 ELEG652-06F 173
EARTH Multithreaded Architecture Model
[Figure: processing elements (PEs) connected by a NETWORK; each PE contains local memory, an SU, and an EU.]
10/30/2006 ELEG652-06F 174
The EARTH Operation Set
• The base operations
• Thread synchronization and scheduling ops: SPAWN, SYNC
• Split-phase data & sync ops: GET_SYNC, DATA_SYNC
• Threaded function invocation and load balancing ops: INVOKE, TOKEN
10/30/2006 ELEG652-06F 175
Topic 6d
Programming Models for
Multithreaded Architectures:
The EARTH Threaded-C Experience
10/30/2006 ELEG652-06F 176
EARTH-MANNA Testbed
[Figure: processing elements (PEs) connected by a NETWORK; each PE contains local memory, an SU, and an EU.]
10/30/2006 ELEG652-06F 177
Features of Threaded Programming
• Thread partition
- Thread length vs useful parallelism
- Where to “cut”?
• Split-phase synchronization and communication
• Parallel threaded function invocation
• Dynamic load balancing
[Side label next to the bullets: Latency tolerance and management]
10/30/2006 ELEG652-06F 178
Table 1: EARTH Instruction Set
• Basic instructions: arithmetic, logic, and branching; typical RISC instructions, e.g., those from the i860
• Thread switching
    FETCH_NEXT
• Synchronization
    SPAWN fp, ip
    SYNC fp, ss_off
    INIT_SYNC ss_off, sync_cnt, reset_cnt, ip
    INCR_SYNC fp, ss_off, value
10/30/2006 ELEG652-06F 179
Table 1: EARTH Instruction Set (continued)
• Data transfer & synchronization
    DATA_SPAWN value, dest_addr, fp, ip
    DATA_SYNC value, dest_addr, fp, ss_off
    BLOCKDATA_SPAWN src_addr, dest_addr, size, fp, ip
    BLOCKDATA_SYNC src_addr, dest_addr, size, fp, ss_off
• Split-phase data requests
    GET_SPAWN src_addr, dest_addr, fp, ip
    GET_SYNC src_addr, dest_addr, fp, ss_off
    GET_BLOCK_SPAWN src_addr, dest_addr, size, fp, ip
    GET_BLOCK_SYNC src_addr, dest_addr, size, fp, ss_off
• Function invocation
    INVOKE dest_PE, f_name, no_params, params
    TOKEN f_name, no_params, params
    END_FUNCTION
10/30/2006 ELEG652-06F 180
Threaded-C: A Base Language
• To serve as a target language for high-level language compilers
• To serve as a machine language for the EARTH architecture
10/30/2006 ELEG652-06F 181
The Role of Threaded-C
[Figure: flow from Users through high-level languages (C, Fortran), high-level language translation, Threaded-C, and the Threaded-C compiler down to the EARTH platforms.]
10/30/2006 ELEG652-06F 182
Parallel Function Invocation
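[Figure: tree of “activation frames” with links between frames. The frame of fib n has children fib n-1 and fib n-2, which in turn have children such as fib n-2 and fib n-3. Each frame holds the caller’s <fp, ip>, local variables, and SYNC slots.]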
10/30/2006 ELEG652-06F 183
The fib Example
        if (n < 2)
            DATA_RSYNC(1, result, done);
        else {
            TOKEN(fib, n-1, &sum1, slot_1);
            TOKEN(fib, n-2, &sum2, slot_1);
        }
        END_THREAD();
    THREAD_1:
        DATA_RSYNC(sum1 + sum2, result, done);
        END_THREAD();
    END_FUNCTION
[Figure: fib receives n, result, and done; its two threads are annotated with the counts 0 0 and 2 2.]
10/30/2006 ELEG652-06F 184
The inner product Example
        BLKMOV_SYNC(a, row_a, N, slot_1);
        BLKMOV_SYNC(b, column_b, N, slot_1);
        sum = 0;
        END_THREAD();
    THREAD_1:
        for (i = 0; i < N; ++i)
            sum = sum + (row_a[i] * column_b[i]);
        DATA_RSYNC(sum, result, done);
        END_THREAD();
    END_FUNCTION
[Figure: inner receives a, b, result, and done; its two threads are annotated with the counts 0 0 and 2 2.]
10/30/2006 ELEG652-06F 185
Matrix Multiply: Sequential Version
    void main()
    {
        int i, j, k;
        float sum;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                sum = 0;
                for (k = 0; k < N; k++)
                    sum = sum + a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }
10/30/2006 ELEG652-06F 186
The Matrix Multiply Example
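        /* one inner-product token per element c[i][j]: N*N tokens in total */
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; ++j) {
                row_a = &a[i][0];
                column_b = &b[0][j];
                TOKEN(inner, &c[i][j], row_a, column_b, slot_1);
            }
        }
    THREAD_1:
        RETURN();
        END_THREAD();
    END_FUNCTION
[Figure: the threads are annotated with the counts 0 0 and N² N²; each token invokes inner.]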
10/30/2006 ELEG652-06F 187
Topic 6e
Transactional Memory
An Overview
10/30/2006 ELEG652-06F 188
Transactional Memory
• Comes from the database world
• An all-or-none scheme
• A group of operations (of arbitrary size) is considered a transaction
– A transaction is atomic
– Get data, operate, commit
• At commit:
– If the memory cell(s) have not been modified, write your results to memory
– If a “modification” has taken place, discard your results and try again
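The get, operate, commit cycle can be sketched for a single memory cell with a C11 compare-and-swap loop. This is only an approximation of the idea: the helper names `atomic_update` and `times_100` are hypothetical, and a value comparison stands in for "has not been modified", so it cannot notice a cell that was changed and then changed back.

```c
#include <stdatomic.h>

/* Optimistically apply f to *cell: get the data, operate on a private copy,
   then commit only if the cell still holds the value that was read.
   Returns the number of retries that were needed. */
static int atomic_update(_Atomic int *cell, int (*f)(int))
{
    int retries = 0;
    int seen = atomic_load(cell);            /* get data */
    for (;;) {
        int result = f(seen);                /* operate */
        /* commit: succeeds only if *cell == seen; on failure, seen is
           refreshed with the current value and the work is redone */
        if (atomic_compare_exchange_weak(cell, &seen, result))
            return retries;
        retries++;                           /* discard result, try again */
    }
}

/* Example operation, matching the a *= 100 used in the LL/SC slides. */
static int times_100(int v) { return v * 100; }
```

The weak form of compare-and-swap is allowed to fail spuriously, which is why the commit sits inside a retry loop.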
10/30/2006 ELEG652-06F 189
Final Side Note: A Review of LL and SC
• Load-linked (LL) and store-conditional (SC) instructions are found in PowerPC and many other architectures
• Provide a way to optimistically execute a piece of code
• If a “violation” has taken place, discard your results
• Many implementations
– PowerPC: lwarx and stwcx.
10/30/2006 ELEG652-06F 190
Final Side Note: The LL and SC Behavior
• The lwarx instruction
– Loads a word from a word-aligned location
– Side effects:
• A reservation is created
• The storage coherence mechanism is notified that a reservation exists
• The stwcx. instruction
– Conditionally stores a word to a given memory location
• The condition depends on the reservation
– On success, the change is committed to memory
– Otherwise, the change is discarded
10/30/2006 ELEG652-06F 191
Final Side Note: Reservations
• At most one per processor
• A reservation is lost when
– The processor holding the reservation executes
• A lwarx or ldarx
• A stwcx. or stdcx. (whether or not the reservation matches)
– Another processor executes
• A store or a dcbz to the granule
– Some other mechanism modifies a storage location in the same reservation granule
• Interrupts do not clear reservations
– But interrupt handlers might
• Granularity
– The length of the memory block kept under surveillance
10/30/2006 ELEG652-06F 192
Final Side Note: Examples
[Figure sequence: two processors run LL/SC retry loops on the same word a. P0 executes LL a; a *= 100; SC a; brnz. P1 executes LL a; a += 100; SC a; brnz. The storage mechanism tracks one reservation per processor; X marks a lost reservation.]
• P0 and P1 each execute LL a and hold a reservation on a.
• A store a = 100 from elsewhere reaches memory; both reservations are lost, so both loops have to retry.
• P1 and then P0 re-execute LL, read a = 100, and hold reservations again.
• P1's SC succeeds: memory holds a = 200 and P0's reservation is lost.
• P0's SC fails. P0 retries, reads a = 200 with LL, and its SC succeeds: memory holds a = 20000.
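The interleaving in these example slides can be replayed with a toy, single-threaded model of reservations. Everything here is an assumption made for illustration: the `machine` type, the helper names, one reservation granule, and the initial value 0 for a (the slides show "?"). It models the rules from the Reservations slide, not real PowerPC hardware.

```c
#include <stdbool.h>

#define NPROC 2

/* Toy model: one memory word plus one reservation flag per processor. */
typedef struct {
    int  a;                 /* the shared word */
    bool reserved[NPROC];   /* does processor p hold a reservation on a? */
} machine;

/* Load-linked: read a and create a reservation for processor p. */
static int ll(machine *m, int p)
{
    m->reserved[p] = true;
    return m->a;
}

/* Any store to the granule clears every reservation on it. */
static void plain_store(machine *m, int value)
{
    m->a = value;
    for (int p = 0; p < NPROC; p++)
        m->reserved[p] = false;
}

/* Store-conditional: succeeds only if p still holds its reservation. */
static bool sc(machine *m, int p, int value)
{
    bool ok = m->reserved[p];
    if (ok)
        plain_store(m, value);  /* also clears the other reservations */
    m->reserved[p] = false;     /* an SC always drops p's own reservation */
    return ok;
}

/* Replays the slide sequence; returns the final value of a, or -1. */
static int replay_slides(void)
{
    machine m = { .a = 0, .reserved = { false, false } };

    int v0 = ll(&m, 0);                /* P0: LL a */
    int v1 = ll(&m, 1);                /* P1: LL a */
    plain_store(&m, 100);              /* outside store: both reservations lost */
    bool ok0 = sc(&m, 0, v0 * 100);    /* fails */
    bool ok1 = sc(&m, 1, v1 + 100);    /* fails */

    v1 = ll(&m, 1);                    /* P1 retries and reads 100 */
    v0 = ll(&m, 0);                    /* P0 retries and reads 100 */
    ok1 = sc(&m, 1, v1 + 100);         /* succeeds: a = 200 */
    ok0 = sc(&m, 0, v0 * 100);         /* fails: reservation was lost */

    v0 = ll(&m, 0);                    /* P0 retries again and reads 200 */
    ok0 = sc(&m, 0, v0 * 100);         /* succeeds: a = 20000 */

    return (ok0 && ok1) ? m.a : -1;
}
```

In this model `replay_slides()` ends with a = 20000, the value on the last example slide.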
10/30/2006 ELEG652-06F 202
Final Side Note: LL / SC Disadvantages
• Only works for granule size memory cells
• Cannot target different memory cells at the same time
• Since at most one reservation can be held by a processor, nesting is out of the question
10/30/2006 ELEG652-06F 203
Transactional Memory
• Similar to the concept presented by LL and SC
• Based on the concept of a transaction:
– A group of instructions is executed atomically with respect to other transactions
– The memory affected might be of different sizes or distributed across the system
– A transaction will commit or abort depending on the memory state
[Figure: transaction stages: Get Sets, Do Ops, Validate, Atomic WB, with Retry and Abort paths.]
10/30/2006 ELEG652-06F 204
Transactional Memory
• More on the concept of a transaction:
– Transactions run in isolation
• No side effects are visible to the outside world
– Transaction properties
• Atomicity: all or none
• Serializability: transactions execute one after the other, in the same order for all who observe them (can be weakened)
10/30/2006 ELEG652-06F 205
Transactions: Scalability
• Multiple readers
– Not allowed for “normal” locks
• Exception: reader and writer locks
– Transactions “naturally” allow multiple readers
• Concurrent access to disjoint data
– Normal locks: fine-grain locking is the programmer’s responsibility
– Transactions allow (given enough hardware resources) concurrent access to disjoint data
10/30/2006 ELEG652-06F 206
Transactional Memory
• Atomicity and isolation
– Two basic properties of any implementation
– Data versioning
• Memory ops:
– Unprotected reads and writes
– Transactions
• Strong atomicity
– Any write op will produce a violation
– Any read will see the whole transaction or none of it
• Weak atomicity
– Only transactional writes are considered to produce violations
– A read from a non-transactional memory op may see a partial set of the uncommitted transaction’s writes
– Conflict detection
• Detect read-write and write-write conflicts
10/30/2006 ELEG652-06F 207
Transactional Memory
• Strongest transactional model: found in database management systems (DBMSs)
– Called ACID
• Atomicity
• Consistency
• Isolation
• Durability
• Implicit versus explicit
– Programming-language centric
• Explicit: provide a collection of low-level constructs or function calls
• Implicit: provide a general “abstraction” for transactions
10/30/2006 ELEG652-06F 208
Transactional Memory: Data Versioning
• Management of data from new and old transactions
• Eager
– Memory rollback
– Pros: faster commits and direct reads
– Cons: slower aborts, no fault tolerance
• Lazy
– Buffer rollback
– Pros: faster aborts, fault tolerance
– Cons: slower commits, indirect reads
10/30/2006 ELEG652-06F 209
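Transactional Memory: Data Versioning
Eager Versioning Example
• Begin: memory holds a=10
• Write: transaction T saves the old value 10, and memory is updated in place to a=15
• Commit: memory already holds a=15
• Abort: the saved value is restored, so memory holds a=10 again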
10/30/2006 ELEG652-06F 210
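Transactional Memory: Data Versioning
Lazy Versioning Example
• Begin: memory holds a=10
• Write: transaction T buffers the new value 15, and memory still holds a=10
• Commit: the buffered value is written back, so memory holds a=15
• Abort: the buffer is discarded, and memory still holds a=10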
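The two versioning examples can be sketched for a single variable as follows. The `txn` type and the function names are hypothetical and exist only to mirror the Begin, Write, Commit, and Abort steps; a real system would track many locations and would need conflict detection as well.

```c
/* Hypothetical single-variable transaction, for illustration only. */
typedef struct {
    int *addr;    /* memory location being written */
    int  saved;   /* eager: old value (undo log); lazy: new value (buffer) */
} txn;

/* Eager versioning: write in place and remember the old value. */
static void eager_write(txn *t, int *addr, int value)
{
    t->addr = addr;
    t->saved = *addr;   /* undo log entry */
    *addr = value;      /* memory is updated immediately */
}
static void eager_commit(txn *t) { (void)t; }              /* nothing to copy */
static void eager_abort(txn *t)  { *t->addr = t->saved; }  /* roll memory back */

/* Lazy versioning: buffer the new value and leave memory untouched. */
static void lazy_write(txn *t, int *addr, int value)
{
    t->addr = addr;
    t->saved = value;   /* write buffer entry */
}
static void lazy_commit(txn *t) { *t->addr = t->saved; }   /* buffer to memory */
static void lazy_abort(txn *t)  { (void)t; }               /* drop the buffer */
```

The cost asymmetry on the Data Versioning slide shows up directly: the eager commit and the lazy abort do no work, while the eager abort and the lazy commit must touch memory.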
10/30/2006 ELEG652-06F 211
Transactional Memory: Conflict Detection
• Read and write sets
– The read set holds all the variables that are only read throughout the transaction
– The write set holds all the variables that are written throughout the transaction
10/30/2006 ELEG652-06F 212
Transactional Memory: Conflict Detection
• A conflict
– The intersection between the read set of one transaction and the write set of another transaction is not empty
– The intersection between the write set of one transaction and the write set of another transaction is not empty
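The conflict definition reduces to set intersections, which can be sketched with bitmasks. Representing each tracked location as one bit of a 32-bit word is an assumption of this sketch, and `tx_sets` and `conflicts` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Each bit stands for one tracked location (an assumption of this sketch). */
typedef struct {
    uint32_t read_set;    /* locations only read by the transaction */
    uint32_t write_set;   /* locations written by the transaction */
} tx_sets;

/* Two transactions conflict if one writes what the other reads or writes. */
static bool conflicts(const tx_sets *x, const tx_sets *y)
{
    bool read_write  = (x->read_set & y->write_set) != 0 ||
                       (y->read_set & x->write_set) != 0;
    bool write_write = (x->write_set & y->write_set) != 0;
    return read_write || write_write;
}
```

Two transactions that only read the same location do not conflict under this test, which is what lets transactions allow multiple readers.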
10/30/2006 ELEG652-06F 213
Transactional Memory: Conflict Detection
• Pessimistic detection
– Conflicts are resolved during reads and writes
• Through coherence blocks, or locks and version numbers
– A manager resolves conflicts
• Stall or abort
– Pros
• Detects conflicts early
– Stalls instead of aborts (in some cases)
– Cons
• No guarantee of forward progress
• Issues with locks and fine-grain communication
10/30/2006 ELEG652-06F 214
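Transactional Memory: Conflict Detection
Pessimistic Conflict Detection
[Figure: four T0/T1 timelines in which every read and write is followed by a check.]
• Success: T0 and T1 write disjoint locations (x, y, z); every check passes and both commit
• Early detect: T0 writes m and T1 then reads m; the check catches the conflict, and T1 stalls until T0 commits
• Abort: T0 reads l and T1 then writes l; the check aborts T0, which restarts and reads l again
• No progress: T0 and T1 both read and write n; each check aborts the other transaction, and the pattern repeats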
10/30/2006 ELEG652-06F 215
Transactional Memory: Conflict Detection
• Optimistic detection
– Detect conflicts at commit
• Compare the to-be-committed write set against other read sets
– The to-be-committed write always succeeds, but may cause others to fail
• Validate write and read sets using locks or version numbers
– Pros
• Forward progress ensured
• Potentially fewer conflicts
– Cons
• Conflicts are detected late
10/30/2006 ELEG652-06F 216
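Transactional Memory: Conflict Detection
Optimistic Conflict Detection
[Figure: four T0/T1 timelines in which checks happen only at commit.]
• Success: T0 and T1 write disjoint locations (x, y, z); both pass the commit-time check and commit
• Abort: T0 writes m and T1 reads m; when T0 commits, T1 aborts and re-executes rd(m) before committing
• Success: T0 reads l and T1 writes l; both pass their commit-time checks and commit
• Forward progress: T0 and T1 both read and write n; one commits while the other aborts and retries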
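Commit-time validation with version numbers can be sketched for a single location. The `cell` and `opt_txn` types and both functions are hypothetical, and the code is single-threaded: a real implementation would have to make the validate-and-publish step atomic.

```c
#include <stdbool.h>

/* A location paired with a version number that grows on every commit. */
typedef struct { int value; unsigned version; } cell;

/* Toy transaction touching exactly one cell. */
typedef struct {
    cell    *target;    /* the location this transaction reads and writes */
    unsigned seen;      /* version observed when the transaction began */
    int      pending;   /* lazily buffered new value */
} opt_txn;

static void txn_begin(opt_txn *t, cell *c)
{
    t->target = c;
    t->seen = c->version;
    t->pending = c->value;
}

/* Validate at commit: succeed only if nobody committed to the cell since
   it was read. On success publish the value and bump the version. */
static bool txn_commit(opt_txn *t)
{
    if (t->target->version != t->seen)
        return false;               /* conflict detected late: abort */
    t->target->value = t->pending;
    t->target->version++;
    return true;
}
```

When two such transactions start from the same version, the first to commit succeeds and the second fails validation and must restart, which matches the forward-progress case.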
10/30/2006 ELEG652-06F 217
Transactional Memory: Conflict Detection
• Granularity
– Object
• Pro: overhead reduction, closer to the programming model
• Con: false sharing
– Word
• Pro: eliminates false sharing
• Con: more overhead
– Cache line
• Pro: a compromise between word and object
10/30/2006 ELEG652-06F 218
Transactional Memory: Nested Transactions
• Transactions inside transactions
– Allow composability with transactions running in library calls or function calls
– Allow multiple transactions to run inside the given transaction
• Remember that the transaction should appear atomic only to other transactions
• Does not impose restrictions on operations inside the transaction
– DMA transfers, thread creation and operations, etc.
10/30/2006 ELEG652-06F 219
Transactional Memory: Nested Transactions
• Closed nested transactions
– Inner transaction’s commit stage
• On success: merge with the parent and let the parent commit the changes
• On failure: roll back inside the parent (or abort)
– Read and write sets may be disjoint from the parent’s
– Only the outermost transaction commits
– Child transactions may fail while the outer one succeeds
– Alternative execution paths!
10/30/2006 ELEG652-06F 220
Transactional Memory: Nested Transactions
• Open nested transactions
– Inner transaction’s commit
• On success: update memory AND merge with the parent
• On failure: abort and roll back inside the parent
– SHOCK!!!! HORROR!!!! ATOMICITY IS BROKEN!!!!
• If write sets are not disjoint
– Moreover, if the parent fails, a mechanism must be provided to roll back the child transactions
10/30/2006 ELEG652-06F 221
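Transactional Memory: Nested Transactions
[Figure: closed vs. open nested transactions. In both cases the outer transaction has write set {A, D}, the inner transaction has write set {A, B, C}, and another transaction reads A.]
• Closed nested transaction: the inner transaction merges {A, B, C} into the parent, and the parent commits {A, B, C, D}
• Open nested transaction: the inner transaction commits {A, B, C} and merges it into the parent, and the parent later commits {A, B, C, D}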
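The write-set bookkeeping in the nesting figure can be sketched with bitmasks. The location constants and both function names are hypothetical, and only write sets are modeled: rollback, read sets, and conflict checks are left out.

```c
#include <stdint.h>

/* One bit per location from the figure (an assumption of this sketch). */
enum { LOC_A = 1u << 0, LOC_B = 1u << 1, LOC_C = 1u << 2, LOC_D = 1u << 3 };

/* Closed nesting: a committing child only folds its write set into the
   parent; nothing reaches memory until the outermost commit. */
static uint32_t closed_child_commit(uint32_t parent_ws, uint32_t child_ws)
{
    return parent_ws | child_ws;
}

/* Open nesting: the child's writes become visible in memory right away
   AND are merged into the parent. */
static uint32_t open_child_commit(uint32_t parent_ws, uint32_t child_ws,
                                  uint32_t *visible)
{
    *visible |= child_ws;
    return parent_ws | child_ws;
}
```

With parent write set {A, D} and child write set {A, B, C}, both functions return {A, B, C, D}; only the open variant also exposes {A, B, C} before the parent commits, which is where another transaction reading A can observe a partial result.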
10/30/2006 ELEG652-06F 222
Transactional Memory: Types
• Hardware, software, and hybrid
– Implementation-dependent classification
– Conflict detection and data revision differ across implementations
• Hardware
– Conflict detection: through the cache coherence protocol
– Data revision: cache lines
– High performance plus binary compatibility
10/30/2006 ELEG652-06F 223
Transactional Memory: Types
• Software
– Translation of programming constructs
– Runtime and compiler support
– Low performance
– Better abstraction than locks for fine-grain constructs
– Data revision: object granularity
– Conflict detection: locks and/or version numbers, runtime data structures
• Hybrid
– A combination of the above approaches
10/30/2006 ELEG652-06F 224
Transactional Memory
[Table: a summary of transactional memory systems and their main characteristics. Courtesy of Carlstrom et al., “The ATOMOS Transactional Programming Language,” PLDI 2006.]
• Programming Language: programming constructs that support transactions, instead of library calls
• Multiprocessor: the main programming model is based on a multithreaded environment instead of a uniprocessor one
10/30/2006 ELEG652-06F 225
Bibliography
• Theobald, Kevin. “EARTH: An Efficient Architecture for Running Threads.” PhD Thesis, McGill University, Quebec, Canada, May 1999
• Carlstrom, Brian; McDonald, Austen; Chafi, Hassan; Chung, JaeWoong; Minh, Chi Cao; Kozyrakis, Christos; Olukotun, Kunle. “The ATOMOS Transactional Programming Language.” Computer Systems Laboratory, Stanford University, PLDI 2006.
• Herlihy, Maurice; Moss, J.E. B. “Transactional Memory: Architectural Support for Lock Free Data Structures.” Proceedings of the Twentieth Annual International Symposium on Computer Architecture. 1993
• Kozyrakis, Christos. “Transactional Memory Tutorial.” PACT 2006