DYNAMO vs. ADORE A Tale of Two Dynamic Optimizers


Wei Chung Hsu (徐慰中)
Computer Science Department, Chiao Tung University (交通大學)

(work was done at the University of Minnesota, Twin Cities)
3/05/2010

Dynamo

• Dynamo is a dynamic optimizer
• It won the best paper award at PLDI 2000 and has been cited 612 times
• Work was started by HP Labs and the HP system lab. MIT took over and ported it to x86, calling it DynamoRIO. This group later started a company, Determina (now acquired by VMware)
• Considered revolutionary, since optimizations had always been performed statically (i.e., at compile time)

SPEC CINT2006 for Opteron X4

Name        Description                    IC×10⁹   CPI    Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing   2,118   0.75    0.40       637      9,777      15.3
bzip2       Block-sorting compression       2,389   0.85    0.40       817      9,650      11.8
gcc         GNU C Compiler                  1,050   1.72    0.40       724      8,050      11.1
mcf         Combinatorial optimization        336  10.00    0.40     1,345      9,120       6.8
go          Go game (AI)                    1,658   1.09    0.40       721     10,490      14.6
hmmer       Search gene sequence            2,783   0.80    0.40       890      9,330      10.5
sjeng       Chess game (AI)                 2,176   0.96    0.40       837     12,100      14.5
libquantum  Quantum computer simulation     1,623   1.61    0.40     1,047     20,720      19.8
h264avc     Video compression               3,102   0.80    0.40       993     22,130      22.3
omnetpp     Discrete event simulation         587   2.94    0.40       690      6,250       9.1
astar       Games/path finding              1,082   1.79    0.40       773      7,020       9.1
xalancbmk   XML parsing                     1,058   2.70    0.40     1,143      6,900       6.0
Geometric mean                                                                             11.7

Very high cache miss rates; the ideal CPI should be 0.33.

Time = CPI × Instruction count × Clock period

Where have all the cycles gone?
• Cache misses
  – Capacity, Compulsory/Cold, Conflict, Coherence
  – I-cache and D-cache
  – TLB misses
• Branch mis-predictions
  – Static and dynamic prediction
  – Mis-speculation
• Pipeline stalls
  – Ineffective code scheduling, often caused by memory aliasing

These causes are unpredictable and hard to deal with at compile time.


Trend of Multi-cores

Exploiting these potentials demands thread-level parallelism

Intel Core i7 die photo


Exploiting Thread-Level Parallelism

Potentially more parallelism with speculation

[Diagram: sequential execution of Store *p followed by Load *q, with a possible dependence between them. Traditional parallelization can overlap them only when the compiler can prove p != q; when it cannot, the compiler gives up. Thread-Level Speculation (TLS) executes them in parallel anyway: if p != q (the store writes 88, the load reads its own old value 20), speculation succeeds; if p == q, the load should have seen the stored 88, so speculation fails and the work is re-executed. Whether p == q is unpredictable at compile time.]

Dynamic Optimizers

Dynamic optimizers include:
• Java VM (JVM) with JIT compiler (dynamic compilation or adaptive optimization)
• Dynamic Binary Optimizers (DBO)
  – Native-to-native dynamic binary optimizers (x86 → x86, x86-32 → x86-64, IA64 → IA64)
  – Non-native dynamic binary translators (e.g. x86 → IA64, ARM → MIPS, PPC → x86; QEMU, VMware, Rosetta)

More on why dynamic binary optimization:
• New architecture/micro-architecture features offer more opportunity for performance, but are not effectively exploited by legacy binaries: x86 P5/P6/PII/PIII, x86-32/x86-64, PA 7200/8000, …
• Software evolution and ISV behaviors reduce the effectiveness of traditional static optimizers: DLLs, middleware, binary distribution, …
• Profile-sensitive optimizations would be more effective if performed at runtime: predication, speculation, branch prediction, prefetching
• A multi-core environment with dynamic resource sharing makes static optimization challenging: shared caches, off-chip bandwidth, shared FUs

How Dynamo Works

Dynamo is VM based. [Flowchart:] It interprets until a taken branch, then looks up the branch target in the code cache:
1. If the target is already in the code cache, jump to the code cache and execute natively (a signal handler returns control to Dynamo).
2. Otherwise, increment a counter for the branch target.
3. When the counter exceeds a threshold and a start-of-trace condition holds, switch to interpret + code gen mode until an end-of-trace condition is met, then create the trace, optimize it, and emit it into the code cache.

Trace Selection

[Diagram: a control-flow graph with blocks A–I; A branches to B or C, F branches to G or H, and the hot path includes a call and a return. Trace selection copies the hot path into the trace cache as the straight-line layout A, C, D, F, G, I, E, with side exits to B and to H and a final branch back to the runtime; the call and return are inlined into the trace.]

Backpatching

[Diagram: the trace A, C, D, F, G, I, E with side exits to B and to H and a branch back to the runtime, next to a newly selected trace H, I, E.]

When H becomes hot, a new trace is selected starting from H, and the trace exit branch in block F is backpatched to branch to the new trace.

Execution Migrates to Code Cache

[Diagram: hot regions 1, 2, 3 of a.out are turned into traces 0–4 in the code cache by the interpreter/emulator, trace selector, and optimizer; execution gradually migrates from a.out into the code cache.]

Trace-Based Optimizations
• Full and partial redundancy elimination
• Dead code elimination
• Trace scheduling
• Instruction cache locality improvement
• Dynamic procedure inlining (or procedure outlining)
• Some loop-based optimizations

Summary of Dynamo
• Dynamic binary optimization customizes performance delivery:
  – Code is optimized by how the code is used (dynamic trace formation and trace-based optimizations)
  – Code is optimized for the machine it runs on
  – Code is optimized when all executables are available
  – Only the parts of the code that really matter are optimized

ADORE
• ADORE means ADaptive Object code RE-optimization
• Was developed at the CSE department, University of Minnesota, Twin Cities
• Applied a very different model for dynamic optimization systems
• Considered evolutionary; cited 61 times

Dynamic Binary Optimizers' Models

[Diagram: two stacking models. In the first, the DBO sits between the application binaries and the operating system; in the second, the DBO runs alongside the application binaries on top of the operating system. Both sit above the hardware platform.]

Model 1 (DBO between application and OS):
– Translates most execution paths and keeps them in a code cache
– Easy to maintain control
– Dynamo (PA-RISC), DynamoRIO (x86)

Model 2 (DBO alongside the application):
– Translates only hot execution paths and keeps them in a code cache
– Lower overhead
– ADORE (IA64, SPARC), COBRA (IA64, x86 – ongoing)

ADORE Framework

[Diagram: the kernel initializes the hardware Performance Monitoring Unit (PMU) and interrupts on events; on a kernel-buffer overflow, samples are delivered to the dynamic optimization thread, which runs phase detection, trace selection, optimization, and deployment. On a phase change, traces are passed to the optimizer; optimized traces are emitted into the code cache, and the main thread's code is patched to jump to them.]

Thread-Level View

[Diagram: the application initializes ADORE; the kernel-buffer overflow handler fills a user buffer; whenever the user buffer is full, ADORE's optimizer thread is invoked (per thread: Thread 1, Thread 2) and then goes back to sleep.]

"User buffer full" is maintained for one main event. This event is usually CPU_CYCLES.

Perf. of ADORE/Itanium on SPEC2000

Performance on BLAST

[Chart: % speed-up (−15% to 60%) of ADORE on BLAST queries — blastn nt.1, blastn nt.10(4), blastn nt.10(5), blastn nt.10(7), blastp aa.1, blastx nt.1, tblastn aa.1 — for binaries built with GCC O2, ORC O2, and ECC O2.]

ADORE vs. Dynamo

Tasks          Dynamo                              ADORE
Observation    Interpretation/instrumentation      HPM sampling based
(profiling)    based
Optimization   Trace layout and classic            I/D-cache related optimizations
               optimizations                       (prefetching + trace layout)
Code cache     Needs a large code cache            A small code cache is sufficient
Re-direction   Interpretation and trace chaining   Code patching

ADORE on Multi-Cores
• COBRA (Continuous Object code Re-Adaptation) is a follow-up framework, implemented on Itanium Montecito and x86's new multi-core machines.
• ADORE on SPARC Panther (UltraSPARC IV+) multi-core machines.
• ADORE for TLS tuning.

COBRA Framework
• Optimization Thread
  – Centralized Control
  – Initialization
  – Trace Selection
  – Trace Optimization
  – Trace Patching
• Monitor Threads
  – Localized Control
  – Per-thread Profile

[Diagram: a multi-threaded program with COBRA's monitoring and optimizing threads in the same address space, on a single system image (kernel). Per-processor hardware performance counters (processors 0–3) feed a Perfmon sampling kernel driver and a kernel sampling buffer (KSB); per-thread monitoring threads drain it into per-thread user sampling buffers (USB) and profile buffers (PB) managed by a per-thread phase and profile manager; the optimization thread's main controller drives trace selection and optimization, the trace cache, and the trace patcher for the main/working threads.]

Startup of a 4-Thread OpenMP Program

[Diagram: the main process (worker thread) vforks a monitoring process, which starts the optimizer thread and one monitor thread per worker; pthread_create spawns the OMP monitor threads and worker threads. All run in the same address space; steps 1–6 mark the startup order from start to end.]

Prefetch vs. NoPrefetch
• The prefetch version, when running with 4 threads, suffers significantly from L2_OZQ_FULL stalls.

[Chart: scalability of the DAXPY kernel on a 4-way Itanium 2 machine (# of threads, with/without prefetch). Normalized execution time relative to baseline (0.0–2.0) over data working-set sizes of 128K, 512K, and 2M, for (1, prefetch), (1, noprefetch), (2, prefetch), (2, noprefetch), (4, prefetch), (4, noprefetch); the chart annotates gaps of 26% and 34%.]

Prefetch vs. Prefetch with .excl
• .excl hint: prefetch a cache line in exclusive state instead of shared state (invalidation-based cache coherence protocol).

[Chart: scalability of the DAXPY kernel on a 4-way Itanium 2 machine (# of threads, prefetch without/with .excl hints). Normalized execution time relative to baseline (0.0–1.2) over data working-set sizes of 128K, 512K, and 2M, for (1, prefetch), (1, prefetch.excl), (2, prefetch), (2, prefetch.excl), (4, prefetch), (4, prefetch.excl); the chart annotates gains of 15% and 12%.]

Execution Time on 4-Way SMP

[Chart: speedup relative to baseline (prefetch) for NPB OMP v3.0 benchmarks bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, and avg, comparing (4, prefetch), (4, noprefetch), and (4, prefetch.excl).]

noprefetch: up to 15%, average 4.7% speedup; prefetch.excl: up to 8%, average 2.7% speedup.

Execution Time on cc-NUMA

[Chart: speedup relative to baseline (prefetch) for NPB OMP v3.0 benchmarks bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, and avg, comparing (8, prefetch), (8, noprefetch), and (8, prefetch.excl).]

noprefetch: up to 68%, average 17.5% speedup; prefetch.excl: up to 18%, average 8.5% speedup.

Summary of Results from COBRA
• We showed that coherence misses caused by aggressive prefetching can limit the scalability of multithreaded programs on scalable shared-memory multiprocessors.
• Guided by the runtime profile, we experimented with two optimizations:
  – Reducing the aggressiveness of prefetching
    • Up to 15%, average 4.7% speedup on a 4-way SMP
    • Up to 68%, average 17.5% speedup on SGI Altix cc-NUMA
  – Using the exclusive hint for prefetch
    • Up to 8%, average 2.7% speedup on a 4-way SMP
    • Up to 18%, average 8.5% speedup on SGI Altix cc-NUMA

ADORE/SPARC
• ADORE has been ported to the SPARC/Solaris platform since 2005.
• Some porting issues:
  – ADORE uses the libcpc interface on Solaris to conduct runtime profiling. A kernel buffer enhancement was added to Solaris 10.0 to reduce profiling and phase-detection overhead.
  – Reachability is a true problem (e.g. Oracle, Dyna3D).
  – The lack of a branch trace buffer is painful (e.g. BLAST).

Performance of In-Thread Opt. (USIII+)

[Chart: speedup from −10% to 60% for Base and Peak binaries.]

Helper Thread Prefetching for Multi-Core

[Diagram: the main thread runs on the first core; a helper thread spin-waits on the second core until a trigger activates it (about 65 cycles of delay). The helper thread then runs ahead issuing prefetches, so an L2 cache miss in the main thread is avoided; afterwards it spins again waiting for the next trigger.]

Performance of Dynamic Helper Thread (on Sun UltraSPARC IV+)

[Chart: speedup from −20% to 100% for Base and Peak binaries.]

Evaluation Environment for TLS

Benchmarks
• SPEC2000 written in C, -O3 optimization

Underlying architecture
• 4-core chip multiprocessor (CMP)
• speculation supported by coherence

Simulator
• superscalar with a detailed memory model
• simulates communication latency
• models bandwidth and contention
• detailed, cycle-accurate simulation

[Diagram: four processor/cache (P/C) pairs connected by an interconnect.]

Dynamic Tuning for TLS

[Chart: speedup w.r.t. sequential (SEQ) execution, from 0 to 3, for ammp, art, bzip2, crafty, equake, gap, gcc, gzip, mcf, mesa, parser, perlbmk, twolf, vortex, vpr-p, vpr-r, and G.M., comparing three tuning policies: Simple (1.17x geometric mean), Quantitative (1.23x), and Quantitative+StaticHint (1.37x).]

Parallel Code Overhead

Summary of ADORE
• ADORE uses Hardware Performance Monitoring (HPM) capability to implement a lightweight runtime profiling system. Efficient profiling and phase detection is the key to the success of dynamic native binary optimizers.
• ADORE can speed up real-world, large applications already optimized by production compilers.
• ADORE works on two architectures: Itanium and SPARC. COBRA is a follow-up system of ADORE; it works on Itanium and x86.
• ADORE/COBRA can also optimize for multi-cores.
• ADORE has recently been applied to dynamic TLS tuning.

Conclusion

"It was the best of times, it was the worst of times…" -- opening line of "A Tale of Two Cities"

Best of times for research: new areas where innovations are needed.
Worst of times for research: saturated areas where technologies are mature or well understood, and it is hard to innovate.