DYNAMO vs. ADORE A Tale of Two Dynamic Optimizers
DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers
Wei Chung Hsu (徐慰中), Computer Science Department, National Chiao Tung University
(work was done at the University of Minnesota, Twin Cities)
3/05/2010
Dynamo
• Dynamo is a dynamic optimizer.
• It won the best paper award at PLDI 2000 and has been cited 612 times.
• Work started at HP Labs and the HP systems lab. MIT took over and ported it to x86, calling it DynamoRIO. This group later started a company, Determina (now acquired by VMware).
• Considered revolutionary, since optimizations had always been performed statically (i.e., at compile time).
SPEC CINT2006 for Opteron X4

| Name | Description | IC×10⁹ | CPI | Tc (ns) | Exec time | Ref time | SPECratio |
|---|---|---|---|---|---|---|---|
| perl | Interpreted string processing | 2,118 | 0.75 | 0.40 | 637 | 9,777 | 15.3 |
| bzip2 | Block-sorting compression | 2,389 | 0.85 | 0.40 | 817 | 9,650 | 11.8 |
| gcc | GNU C Compiler | 1,050 | 1.72 | 0.40 | 724 | 8,050 | 11.1 |
| mcf | Combinatorial optimization | 336 | 10.00 | 0.40 | 1,345 | 9,120 | 6.8 |
| go | Go game (AI) | 1,658 | 1.09 | 0.40 | 721 | 10,490 | 14.6 |
| hmmer | Search gene sequence | 2,783 | 0.80 | 0.40 | 890 | 9,330 | 10.5 |
| sjeng | Chess game (AI) | 2,176 | 0.96 | 0.40 | 837 | 12,100 | 14.5 |
| libquantum | Quantum computer simulation | 1,623 | 1.61 | 0.40 | 1,047 | 20,720 | 19.8 |
| h264avc | Video compression | 3,102 | 0.80 | 0.40 | 993 | 22,130 | 22.3 |
| omnetpp | Discrete event simulation | 587 | 2.94 | 0.40 | 690 | 6,250 | 9.1 |
| astar | Games/path finding | 1,082 | 1.79 | 0.40 | 773 | 7,020 | 9.1 |
| xalancbmk | XML parsing | 1,058 | 2.70 | 0.40 | 1,143 | 6,900 | 6.0 |
| Geometric mean | | | | | | | 11.7 |

Very high cache miss rates; the ideal CPI should be 0.33.
Time = CPI × Instruction count × Clock period
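The Exec time and SPECratio columns follow directly from this formula; a quick sanity check on the mcf row, using the values from the table above:

```python
# Check the mcf row of the table with Time = CPI x Inst x Clock period.
IC = 336e9         # instruction count (336 x 10^9)
CPI = 10.00        # cycles per instruction
Tc = 0.40e-9       # clock period: 0.40 ns
ref_time = 9120    # SPEC reference time in seconds

exec_time = IC * CPI * Tc          # about 1344 s (the table rounds to 1345)
spec_ratio = ref_time / exec_time  # about 6.8, matching the table
print(round(exec_time), round(spec_ratio, 1))
```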
Where have all the cycles gone?
• Cache misses
  – Capacity, Compulsory/Cold, Conflict, Coherence
  – I-cache and D-cache
  – TLB misses
• Branch mis-predictions
  – Static and dynamic prediction
  – Mis-speculation
• Pipeline stalls
  – Ineffective code scheduling, often caused by memory aliasing
These costs are unpredictable and hard to deal with at compile time.
Trend of Multi-cores
Exploiting these potentials demands thread-level parallelism.
(Figure: Intel Core i7 die photo.)
Exploiting Thread-Level Parallelism
Potentially more parallelism with speculation. A traditional parallelizer must prove that "Store *p" and "Load *q" never alias; when it cannot tell whether p != q, the compiler gives up and the code runs sequentially. Thread-Level Speculation (TLS) executes the threads in parallel anyway: if p != q, the speculative load succeeds (Store 88 and Load 20 touch different locations); if p == q, the load should have observed the store's value (Load 88), so the speculation fails and the work is re-executed. The catch: whether p == q is unpredictable at compile time.
(Figure: timelines comparing sequential execution, traditional parallelization, and TLS with a speculation failure.)
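The aliasing scenario on this slide can be sketched as a toy model. This is illustrative only; `run_tls` and its squash-and-replay step are invented names, not ADORE/COBRA's real speculation interface:

```python
# Toy model of the slide's TLS example: the speculative thread loads *q
# before the other thread's "store *p" commits. If p and q alias, the
# hardware detects the dependence violation, squashes the speculative
# work, and re-executes the load.
def run_tls(mem, p, q, store_val):
    speculative_load = mem[q]       # load *q early, overlapping the store
    mem[p] = store_val              # the other thread's "store *p" commits
    if p == q:                      # dependence violation detected
        speculative_load = mem[q]   # squash and re-execute the load
    return speculative_load

mem = {"p": 20, "q": 20}
ok = run_tls(mem, "p", "q", 88)     # p != q: speculation succeeds, loads 20
mem2 = {"x": 20}
redo = run_tls(mem2, "x", "x", 88)  # p == q: failure, replay observes 88
```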
Dynamic Optimizers
Dynamic optimizers include:
• Java VM (JVM) with a JIT compiler (dynamic compilation or adaptive optimization)
• Dynamic Binary Optimizers (DBO):
  – Native-to-native dynamic binary optimizers (x86 → x86, x86-32 → x86-64, IA64 → IA64)
  – Non-native dynamic binary translators (e.g. x86 → IA64, ARM → MIPS, PPC → x86; QEMU, VMware, Rosetta)
More on why dynamic binary optimization
• New architecture/micro-architecture features offer more opportunity for performance, but are not effectively exploited by legacy binaries: x86 P5/P6/PII/PIII, x86-32/x86-64, PA 7200/8000, …
• Software evolution and ISV behaviors reduce the effectiveness of traditional static optimizers: DLLs, middleware, binary distribution, …
• Profile-sensitive optimizations would be more effective if performed at runtime: predication, speculation, branch prediction, prefetching
• A multi-core environment with dynamic resource sharing makes static optimization challenging: shared cache, off-chip bandwidth, shared FUs
How Dynamo Works
Dynamo's control loop (per the flowchart): interpret until a taken branch, then look up the branch target. If it is already in the code cache, jump there. Otherwise, if the target satisfies the start-of-trace condition, increment its counter; when the counter exceeds a threshold, switch to interpret + code gen until an end-of-trace condition is met, then create the trace, optimize it, and emit it into the code cache. A signal handler returns control from the code cache to the runtime. Dynamo is VM based.
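The loop above can be sketched as a toy model. The threshold, the program model, and the counting of "cache hits" are invented simplifications, not Dynamo's real interface:

```python
# Toy model of Dynamo's control loop: interpret until a taken branch,
# count candidate trace heads, and once a counter exceeds the hot
# threshold, "emit" the target into the code cache so that later
# visits execute there instead of being interpreted.
THRESHOLD = 3

def dynamo(next_block, start, steps):
    counters, code_cache, cache_hits = {}, set(), 0
    pc = start
    for _ in range(steps):
        if pc in code_cache:
            cache_hits += 1              # execute the optimized trace
        counters[pc] = counters.get(pc, 0) + 1
        if counters[pc] > THRESHOLD:
            code_cache.add(pc)           # select trace at pc, optimize, emit
        pc = next_block(pc)              # "interpret until taken branch"
    return cache_hits

# A hot two-block loop: after warm-up, execution migrates to the cache.
hits = dynamo(lambda pc: "B" if pc == "A" else "A", "A", steps=20)
```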
Trace Selection
(Figure: a control-flow graph with blocks A through I, including a call and a return. The hot path A, C, D, F, G, I, E is selected as a trace and laid out contiguously in the trace cache, with trace exits to B and to H and a fallback back to the runtime.)
Backpatching
(Figure: the trace A, C, D, F, G, I, E with exits to B and to H, alongside a new trace H, I, E.)
When H becomes hot, a new trace is selected starting from H, and the trace-exit branch in block F is backpatched to branch to the new trace.
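The backpatching step can be sketched as a toy model. The trace names follow the figure; the `Trace` class and its exit table are illustrative only:

```python
# Toy model of trace-exit backpatching: exits of a trace initially fall
# back to the runtime; when an exit target (here H) becomes hot and gets
# its own trace, the exit branch is patched to jump there directly.
RUNTIME = "runtime"

class Trace:
    def __init__(self, blocks):
        self.blocks = blocks
        self.exit_targets = {}          # exit block -> patched destination

    def take_exit(self, block):
        return self.exit_targets.get(block, RUNTIME)

traces = {"ACDFGIE": Trace(list("ACDFGIE"))}
cold = traces["ACDFGIE"].take_exit("H")   # before patching: back to runtime

# H becomes hot: select a new trace starting at H, backpatch F's exit.
traces["HIE"] = Trace(list("HIE"))
traces["ACDFGIE"].exit_targets["H"] = "HIE"
hot = traces["ACDFGIE"].take_exit("H")    # patched: jump straight to HIE
```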
Execution Migrates to the Code Cache
(Figure: hot regions 1, 2, 3 of a.out are materialized as traces in the code cache by the interpreter/emulator, trace selector, and optimizer; execution gradually migrates from a.out into the code cache.)
Trace Based Optimizations
• Full and partial redundancy elimination
• Dead code elimination
• Trace scheduling
• Instruction cache locality improvement
• Dynamic procedure inlining (or procedure outlining)
• Some loop based optimizations
Summary of Dynamo
• Dynamic binary optimization customizes performance delivery:
  – Code is optimized by how it is used (dynamic trace formation and trace-based optimizations)
  – Code is optimized for the machine it runs on
  – Code is optimized when all executables are available
  – Only the parts of the code that really matter are optimized
ADORE
• ADORE means ADaptive Object code RE-optimization
• Developed at the CSE department, University of Minnesota, Twin Cities
• Applied a very different model for dynamic optimization systems
• Considered evolutionary; cited 61 times
Dynamic Binary Optimizer's Models
Model 1: the DBO sits between the application binaries and the OS (Application Binaries / DBO / Operating System / Hardware Platform).
– Translates most execution paths and keeps them in the code cache
– Easy to maintain control
– Dynamo (PA-RISC), DynamoRIO (x86)
Model 2: the DBO runs alongside the application binaries (Application Binaries + DBO / Operating System / Hardware Platform).
– Translates only hot execution paths and keeps them in the code cache
– Lower overhead
– ADORE (IA64, SPARC), COBRA (IA64, x86; ongoing)
ADORE Framework
(Figure: the kernel initializes the hardware Performance Monitoring Unit (PMU), which interrupts on events and on kernel-buffer overflow. A dynamic optimization thread, separate from the main thread, runs phase detection, trace selection, optimization, and deployment: on a phase change, traces are passed to the optimizer; optimized traces are emitted into the code cache and patched into the running code.)
Thread Level View
(Figure: per-thread timelines. ADORE is initialized in the application; each time a thread's user buffer fills, the k-buffer overflow handler wakes ADORE, which otherwise sleeps. The user buffer is maintained for one main event, usually CPU_CYCLES.)
Perf. of ADORE/Itanium on SPEC2000
(Figure: speedup chart for the SPEC2000 benchmarks.)
Performance on BLAST
(Figure: % speed-up, ranging from -15% to 60%, across BLAST queries (blastn nt.1, blastn nt.10(4), blastn nt.10(5), blastn nt.10(7), blastp aa.1, blastx nt.1, tblastn aa.1) for binaries built with GCC -O2, ORC -O2, and ECC -O2.)
ADORE vs. Dynamo

| Tasks | Dynamo | ADORE |
|---|---|---|
| Observation (profiling) | Interpretation/instrumentation based | HPM sampling based |
| Optimization | Trace layout and classic optimizations | I/D-cache related optimizations (prefetching + trace layout) |
| Code cache | Needs a large code cache | Small code cache sufficient |
| Re-direction | Interpretation and trace chaining | Code patching |
ADORE on Multi-cores
• COBRA (Continuous Object code Re-Adaptation) is a follow-up framework, implemented on Itanium Montecito and on new multi-core x86 machines.
• ADORE on SPARC Panther (UltraSPARC IV+) multi-core machines.
• ADORE for TLS tuning.
COBRA Framework
• Optimization Thread: centralized control; initialization, trace selection, trace optimization, trace patching
• Monitor Threads: localized control; per-thread profile
(Figure: on a single system image, a perfmon sampling kernel driver fills a kernel sampling buffer (KSB) from each processor's hardware performance counters. Per-thread monitoring threads drain the KSB into per-thread user sampling buffers (USB) and profile buffers (PB), managed by a per-thread phase and profile manager. The optimization thread's main controller performs trace selection and optimization into a trace cache, and a trace patcher applies the optimized traces. The monitoring and optimizing threads run in the same address space as the multi-threaded program.)
Startup of a 4-thread OpenMP Program
(Figure: the main process (a worker thread) vforks a monitoring process; pthread_create spawns the OMP monitor threads and worker threads, with monitor threads tracking them and an optimizer thread running from start to end, all in the same address space. Numbered steps 1-6 order the startup events.)
Prefetch vs. NoPrefetch
• The prefetch version, when running with 4 threads, suffers significantly from L2_OZQ_FULL stalls.
(Figure: scalability of the DAXPY kernel on a 4-way Itanium 2 machine: normalized execution time relative to baseline at data working-set sizes 128K, 512K, and 2M, for 1, 2, and 4 threads with and without prefetch; annotated gaps of 26% and 34%.)
Prefetch vs. Prefetch with .excl
• The .excl hint prefetches a cache line in Exclusive state instead of Shared state (invalidation-based cache coherence protocol).
(Figure: scalability of the DAXPY kernel on a 4-way Itanium 2 machine: normalized execution time at working-set sizes 128K, 512K, and 2M, for 1, 2, and 4 threads, prefetch without and with .excl hints; annotated gains of 15% and 12%.)
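A toy cost model of why the .excl hint helps a store that follows the prefetch. The cycle numbers are invented purely for illustration, not Itanium's actual coherence costs:

```python
# A line prefetched in Shared state must still be upgraded to Exclusive
# (invalidating other copies) when the store arrives, putting that
# latency on the store's critical path. Prefetching with .excl pays the
# upgrade during the prefetch instead, off the critical path.
UPGRADE_PENALTY = 60      # hypothetical shared-to-exclusive upgrade cycles

def store_latency(prefetched_state):
    cache_hit = 1         # hypothetical cache-hit latency in cycles
    if prefetched_state == "E":
        return cache_hit                  # already exclusive: plain hit
    return cache_hit + UPGRADE_PENALTY    # shared: upgrade on the store

saved = store_latency("S") - store_latency("E")  # cycles off the store path
```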
Execution Time on 4-way SMP
(Figure: speedup relative to the prefetch baseline on NPB OMP v3.0 benchmarks (bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, avg) for (4, prefetch), (4, noprefetch), and (4, prefetch.excl).)
noprefetch: up to 15%, average 4.7% speedup; prefetch.excl: up to 8%, average 2.7% speedup.
Execution Time on cc-NUMA
(Figure: speedup relative to the prefetch baseline on NPB OMP v3.0 benchmarks (bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, avg) for (8, prefetch), (8, noprefetch), and (8, prefetch.excl).)
noprefetch: up to 68%, average 17.5% speedup; prefetch.excl: up to 18%, average 8.5% speedup.
Summary of Results from COBRA
• We showed that coherence misses caused by aggressive prefetching can limit the scalability of multithreaded programs on scalable shared-memory multiprocessors.
• Guided by runtime profiles, we experimented with two optimizations:
  – Reducing the aggressiveness of prefetching: up to 15% (average 4.7%) speedup on a 4-way SMP; up to 68% (average 17.5%) on an SGI Altix cc-NUMA
  – Using the exclusive hint for prefetch: up to 8% (average 2.7%) speedup on a 4-way SMP; up to 18% (average 8.5%) on an SGI Altix cc-NUMA
ADORE/SPARC
• ADORE has been ported to the SPARC/Solaris platform since 2005.
• Some porting issues:
  – ADORE uses the libcpc interface on Solaris for runtime profiling. A kernel buffer enhancement was added to Solaris 10 to reduce profiling and phase-detection overhead.
  – Reachability is a real problem (e.g. Oracle, Dyna3D).
  – The lack of a branch trace buffer is painful (e.g. BLAST).
Performance of In-Thread Opt. (USIII+)
(Figure: speedup chart, roughly -10% to 60%, for Base and Peak builds.)
Helper Thread Prefetching for Multi-Core
(Figure: timeline on two cores. The main thread on the first core fires a trigger to activate the helper (about 65 cycles of delay); the helper on the second core, otherwise spin-waiting, initiates prefetches so that the main thread's L2 cache misses are avoided, then spins again waiting for the next trigger.)
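The trigger-and-spin protocol can be sketched with ordinary threads. This is a stand-in model: `helper`, the events, and the list-based "prefetch" are invented for illustration, and real helper threads spin on shared memory rather than OS events:

```python
# Sketch of helper-thread prefetching: the helper waits for the main
# thread's trigger, then touches ("prefetches") the data the main
# thread will need, so its later accesses hit in the shared cache.
import threading

def helper(trigger, data, prefetched, done):
    while not done.is_set():
        if trigger.wait(timeout=0.1):     # spin-wait for activation
            prefetched.extend(data)       # stand-in for issuing prefetches
            trigger.clear()
            done.set()                    # one activation suffices here

trigger, done = threading.Event(), threading.Event()
data, prefetched = [1, 2, 3], []
t = threading.Thread(target=helper, args=(trigger, data, prefetched, done))
t.start()
trigger.set()       # main thread fires the trigger before its hot loop
t.join()
```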
Performance of Dynamic Helper Thread (on Sun UltraSPARC IV+)
(Figure: speedup chart, roughly -20% to 100%, for Base and Peak builds.)
Evaluation Environment for TLS
• Benchmarks: SPEC2000 written in C, -O3 optimization
• Underlying architecture: 4-core chip multiprocessor (CMP); speculation supported by coherence
• Simulator: superscalar with a detailed memory model; simulates communication latency; models bandwidth and contention
• Detailed, cycle-accurate simulation
(Figure: four processor/cache pairs connected by an interconnect.)
![Page 35: DYNAMO vs. ADORE A Tale of Two Dynamic Optimizers](https://reader036.fdocument.pub/reader036/viewer/2022062323/568166de550346895ddb0e26/html5/thumbnails/35.jpg)
35
Dynamic Tuning for TLS
ammp artbzip
2cra
fty
equake gap gcc gzip mcfmesa
parser
perlbmk
twolf
vorte
xvp
r-p vpr-r
G.M.
0
0.5
1
1.5
2
2.5
3
Simple Quantitative Quantitative+StaticHint
Spee
dup
wrt
. SEQ
1.17x
1.23x
1.37x
Parallel Code Overhead
Summary of ADORE
• ADORE uses Hardware Performance Monitoring (HPM) capabilities to implement a lightweight runtime profiling system. Efficient profiling and phase detection is the key to the success of dynamic native binary optimizers.
• ADORE can speed up real-world large applications already optimized by production compilers.
• ADORE works on two architectures, Itanium and SPARC. COBRA is a follow-up system of ADORE; it works on Itanium and x86.
• ADORE/COBRA can also optimize for multi-cores.
• ADORE has recently been applied to dynamic TLS tuning.
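The sampling-plus-phase-detection idea in the first bullet can be sketched abstractly. The overlap metric and threshold below are invented for illustration, not ADORE's actual phase-detection algorithm:

```python
# Sketch of phase detection over HPM samples: periodically collect the
# sampled code addresses and declare a phase change when the sample
# distribution shifts to a new hot region.
from collections import Counter

def phase_changed(prev_samples, new_samples, threshold=0.5):
    prev, new = Counter(prev_samples), Counter(new_samples)
    # fraction of new samples that fall on previously-hot addresses
    overlap = sum(min(prev[a], new[a]) for a in prev) / max(len(new_samples), 1)
    return overlap < threshold    # little overlap: a new phase has begun

stable = phase_changed([0x400, 0x400, 0x404] * 10, [0x400, 0x404, 0x400] * 10)
shifted = phase_changed([0x400] * 30, [0x900] * 30)
```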
Conclusion
"It was the best of times, it was the worst of times…" -- opening line of "A Tale of Two Cities"
• Best of times for research: new areas where innovations are needed.
• Worst of times for research: saturated areas where technologies are mature or well understood, and hard to innovate.