Post on 09-Jan-2016
1
ε-Optimal Minimum-Delay/Area Zero-Skew Clock Tree Wire-Sizing in Pseudo-Polynomial Time
Jeng-Liang Tsai
Tsung-Hao Chen
Charlie Chung-Ping Chen (National Taiwan University)
University of Wisconsin-Madison
http://vlsi.ece.wisc.edu
2
Outline
Background
• Motivation and contribution
• Literature overview
ClockTune algorithm
• Problem formulation
• ClockTune algorithm overview
• Optimality and complexity analysis
Experimental results
• Runtime, memory usage, and optimality
• Power/Delay trade-off
• Incremental refinement
3
Motivation
Clock skew adds a cycle-time penalty
• Start with a zero-skew clock tree
• Minimizing clock delay reduces system-level skew (Kuh et al. [DAC '90])
Clock tree is power-hungry (30% in Intel McKinley: 0.18 µm, 1 GHz, 130 W)
• P = f C V^2
• Minimize switching capacitance (wiring area)
Stability affects design convergence
• Allow incremental refinement to accommodate local changes
Interconnect delay dominates total delay
• Wire-sizing is effective in reducing interconnect delay
4
Motivation
Non-convex zero-skew constraints
• No known algorithm solves the zero-skew wire-sizing problem optimally in polynomial runtime
Hence, a good clock tree wire-sizing algorithm should
• Minimize delay and power
• Guarantee optimality and runtime
• Have good stability
5
Contribution
First ε-optimal algorithm for the clock min-delay/power zero-skew wire-sizing optimization problem
Provides a complete (sampled) solution set of delay/power/area trade-off information for design planning
Efficient pseudo-polynomial runtime (6170-branch clock tree in 6 minutes, within 1% of optimal)
Runtime vs. optimality trade-off
Incremental clock re-balancing to speed up design convergence
6
Literature Overview
"Reliable non-zero skew clock tree using wire width optimization", Pillage et al. [DAC '93]
• Iteratively optimizes skew and delay using adjoint sensitivity analysis
• Aimed at reliable clock trees under process variation
Deferred Merging Embedding (DME) algorithm, Kahng et al. [TCAD '92]
• Bottom-up merging segment construction, top-down embedding
Integrated Deferred Merging Embedding (IDME) algorithm, Wong et al. [ISPD '00]
• Handles simultaneous routing, buffer insertion, and wire-sizing
• Merging segment set (MSS): a set of line samples of a merging region
• No optimality guarantee
• The size of the MSS grows exponentially
"Process variation aware clock tree routing", Lu et al. [ISPD '03]
• Based on DME/BST
7
Outline
Background
• Motivation and contribution
• Literature overview
ClockTune algorithm
• Problem formulation
• ClockTune algorithm overview
• Optimality and complexity analysis
Experimental results
• Runtime, memory usage, and optimality
• Power/Delay trade-off
• Incremental refinement
8
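Problem formulation
min-ZSWS (Zero-Skew Wire-Sizing) problem
• Given a clock routing T_v with wire widths w:

$$
\begin{aligned}
\text{minimize}\quad & \alpha_1\,\mathrm{area}(T_v(w)) + \alpha_2\,\mathrm{delay}(T_v(w)) \\
\text{s.t.}\quad & \mathrm{area}(T_v(w)) \le \mathrm{Area}_{\mathrm{Max}} \\
& \mathrm{delay}(T_v(w)) \le \mathrm{Delay}_{\mathrm{Max}} \\
& \mathrm{delay}(P_i, w) = \mathrm{delay}(P_j, w) \quad \forall\, i, j \quad \text{(zero-skew constraints)} \\
& w_m \le w \le w_M
\end{aligned}
$$

where P_i, P_j are paths from v to leaf nodes i and j
Zero-skew constraints are non-convex constraints
• No known algorithm solves the problem optimally in polynomial runtime
[Figure: subtree T_v with paths P_i and P_j from v to leaf nodes i and j]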
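The zero-skew constraint can be illustrated on the smallest possible case, two sibling wires under one node. This is a minimal sketch, not ClockTune itself. It assumes an Elmore delay model with a lumped wire (the slides only name r0 and c0), treats r0 as a sheet resistance and c0 as a capacitance per unit area, and uses made-up wire lengths and sink loads. The names `Branch`, `elmore_delay`, and `matching_width` are hypothetical.

```python
import math
from dataclasses import dataclass

R0 = 0.03    # r0 from the experimental setup; unit assumed to be ohm per square
C0 = 2e-16   # c0 from the experimental setup; unit assumed to be F per um^2


@dataclass
class Branch:
    length: float  # wire length in um (illustrative value)
    width: float   # wire width in um
    load: float    # sink capacitance in F (illustrative value)


def elmore_delay(b: Branch) -> float:
    """Elmore delay of one wire driving a lumped load (assumed model)."""
    resistance = R0 * b.length / b.width
    capacitance = C0 * b.length * b.width
    return resistance * (capacitance / 2 + b.load)


def matching_width(length: float, load: float, target_delay: float) -> float:
    """Width that gives a wire exactly `target_delay` under the model above.

    delay = R0*C0*length^2/2 + R0*length*load/width, solved for width.
    """
    intrinsic = R0 * C0 * length ** 2 / 2
    if target_delay <= intrinsic:
        raise ValueError("target delay is unreachable for this wire")
    return R0 * length * load / (target_delay - intrinsic)


# Fix the left wire, then size the right wire so both sinks see the same delay.
left = Branch(length=1000.0, width=2.0, load=50e-15)
w_right = matching_width(800.0, 80e-15, elmore_delay(left))
right = Branch(length=800.0, width=w_right, load=80e-15)
skew = abs(elmore_delay(left) - elmore_delay(right))
print(f"right width = {w_right:.3f} um, skew = {skew:.2e} s")
```

For one pair of sibling wires the matching width has a closed form under these assumptions. In a deeper tree every width also changes the capacitance seen by the wires above it, so the widths cannot be chosen one branch at a time.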
9
DC region approach
Clock Delay and wiring Capacitance are the top concerns
Define f : R^N -> R^2, such that
• f_Y(w) = Delay(T_v(w)), f_X(w) = Capacitance(T_v(w))
• DC region of v: the projection of the feasible region under f
• Choose a d-c pair from the DC region in R^2
[Figure: example tree T_v with wire widths w1 to w6; f : R^6 -> R^2 maps the feasible region to the DC region, plotted with capacitance C and delay D as axes]
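A DC region can be sampled directly for a two-sink subtree. The sketch below is illustrative only and rests on assumptions not stated in the slides: an Elmore delay model, r0 as a sheet resistance, c0 as a capacitance per unit area, and made-up wire lengths and sink loads. With two widths and one zero-skew equality there is a single degree of freedom, so the "region" here is a curve. All function names are hypothetical.

```python
R0 = 0.03                 # r0; unit assumed to be ohm per square
C0 = 2e-16                # c0; unit assumed to be F per um^2
W_MIN, W_MAX = 1.0, 4.0   # wm, wM in um


def wire_delay(length, width, load):
    """Elmore delay of a wire driving a lumped load (assumed model)."""
    return (R0 * length / width) * (C0 * length * width / 2 + load)


def wire_cap(length, width):
    return C0 * length * width


def matching_width(length, load, target_delay):
    """Width giving `target_delay`, or None if no width can reach it."""
    intrinsic = R0 * C0 * length ** 2 / 2
    if target_delay <= intrinsic:
        return None
    return R0 * length * load / (target_delay - intrinsic)


def sample_dc_region(left, right, samples=64):
    """Return (capacitance, delay) points for a two-sink subtree.

    `left` and `right` are (length, load) pairs. The left width is swept
    over [W_MIN, W_MAX]; the right width is solved to keep zero skew and
    the sample is dropped when that width violates the bounds.
    """
    points = []
    for k in range(samples):
        w_left = W_MIN + (W_MAX - W_MIN) * k / (samples - 1)
        delay = wire_delay(left[0], w_left, left[1])
        w_right = matching_width(right[0], right[1], delay)
        if w_right is None or not (W_MIN <= w_right <= W_MAX):
            continue
        cap = (wire_cap(left[0], w_left) + left[1]
               + wire_cap(right[0], w_right) + right[1])
        points.append((cap, delay))
    return points


def min_cap_within(points, max_delay):
    """Pick the lowest-capacitance d-c pair that meets a delay budget."""
    feasible = [p for p in points if p[1] <= max_delay]
    return min(feasible) if feasible else None


points = sample_dc_region((1000.0, 50e-15), (800.0, 80e-15))
choice = min_cap_within(points, max_delay=3.6e-12)
print(len(points), "feasible samples; chosen (C, D) =", choice)
```

In this toy case the narrowest left widths are infeasible because the right wire would need a width below wm, and the surviving samples trade lower delay for higher capacitance.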
10
ClockTune algorithm overview
Phase 1: bottom-up construction of DC regions for every node
Phase 2: top-down embedding after the delay/power trade-off is chosen
[Figure: (a) bottom-up construction of DC regions and (b) top-down embedding on an example tree, with a C-D plot at each node]
11
Optimality analysis
Embeddings that do not fall on the delay samples are omitted, which gives three sources of error
• Propagated error
• Delay sampling error
• Wire width sampling error (detailed in the paper)
[Figure: C-D plot of the DC region, the DC region using children information, and the sampled DC region, with error labels w, d, and p]
12
Optimality analysis
Error is bounded
• d: delay sampling resolution
• w: wire width sampling resolution
• Constants (k, ...) depend on l, r0, c0, wm, wM, ...
Generally speaking, the error is reduced by about half when the resolution is doubled
[Figure: C-D plot of the DC region and the sampled DC region; plot of error against resolution]
13
Optimality/runtime trade-off
Controlling the sampling resolution trades off optimality against runtime and memory
[Figure: "Minimum delay vs. optimal delay": error (0.0% to 2.0%) against number of samples (128 to 1024) for r1 to r5]
[Figure: "Runtime": runtime (min) against number of nodes for p, q = 128, 256, 512, 1024]
14
Complexity analysis
Runtime
• Bottom-up phase takes O(n p max(p, q))
• Top-down phase takes O(n p)
• Overall: O(n p max(p, q))
Memory
• O(n p)
where n: number of nodes of the clock tree,
p: number of delay samples taken at each node,
q: number of wire width samples taken at each level-2 node
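The bounds above can be turned into a rough scaling calculator. This is only a sketch: the functions return the terms inside the O(·) expressions with all constant factors dropped, so the numbers are useful as ratios, not as actual step or byte counts. The function names are hypothetical, and the r5 branch count is used as a stand-in for n.

```python
def bottom_up_steps(n: int, p: int, q: int) -> int:
    """Term inside O(n p max(p, q)) for the bottom-up phase."""
    return n * p * max(p, q)


def top_down_steps(n: int, p: int) -> int:
    """Term inside O(n p) for the top-down phase."""
    return n * p


def memory_cells(n: int, p: int) -> int:
    """Term inside O(n p) for memory."""
    return n * p


n = 6170  # number of branches in r5, used here as a stand-in for n
for samples in (128, 256, 512, 1024):
    steps = bottom_up_steps(n, samples, samples)
    cells = memory_cells(n, samples)
    print(f"p = q = {samples:4d}: bottom-up ~ {steps:.2e}, memory ~ {cells:.2e}")
```

With p = q, doubling the sample count quadruples the bottom-up term and doubles the memory term, while doubling n doubles both. The dependence on the sample counts p and q, rather than on n alone, is what makes the bound pseudo-polynomial.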
15
Outline
Background
• Motivation and contribution
• Related works
• Problem formulation
ClockTune Algorithm
• Design space projection
• Algorithm overview
• Optimality and complexity analysis
Experimental Results
• Runtime, memory usage, and optimality
• Power/Delay trade-off
• Incremental refinement
16
Experimental setup
• ClockTune is implemented in C++ and executed on a 533 MHz Pentium III PC with 128 MB of memory
• Benchmarks r1 to r5 from Tsay et al. [ICCAD '91]
• Initial routing generated by the BB+DME algorithm with minimum wire width w = 1 µm
• ClockTune uses wm = 1 µm, wM = 4 µm
• p: number of delay samples taken at every node
• q: number of wire width samples taken at every level-2 node
• r0 = 0.03, c0 = 2×10^-16/µm^2
17
Runtime and memory usage
Runtime and memory usage are linear in problem size when p and q are fixed
Within 1% of optimal when p, q = 256 (runtime < 6 minutes, memory ~ 64 MB)
p, q = 256   # sink nodes   # branches   Runtime (s)   Memory (MB)   Optimality
r1           267            527          24.1          6.0           0.38%
r2           598            1185         61.0          12.5          0.71%
r3           862            1710         100.0         14.4          0.46%
r4           1903           3787         202.4         38.0          0.57%
r5           3101           6170         339.2         64.0          0.93%
[Figure: "Runtime": runtime (min) against number of nodes for p, q = 128, 256, 512, 1024]
[Figure: "Memory Usage": memory (MB) against number of nodes for p, q = 128, 256, 512, 1024]
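The linearity claim can be checked against the p, q = 256 table by dividing runtime and memory by the number of branches. The sketch below uses only the numbers reported in that table; the helper name `per_branch` and the choice of branches as the size measure are assumptions made for illustration.

```python
# (branches, runtime in s, memory in MB) for p, q = 256, copied from the table
BENCHMARKS = {
    "r1": (527, 24.1, 6.0),
    "r2": (1185, 61.0, 12.5),
    "r3": (1710, 100.0, 14.4),
    "r4": (3787, 202.4, 38.0),
    "r5": (6170, 339.2, 64.0),
}


def per_branch(column: int) -> dict:
    """Divide one measured column (1 = runtime, 2 = memory) by branch count."""
    return {name: row[column] / row[0] for name, row in BENCHMARKS.items()}


runtime_per_branch = per_branch(1)
memory_per_branch = per_branch(2)

for name in BENCHMARKS:
    print(f"{name}: {runtime_per_branch[name] * 1000:.1f} ms/branch, "
          f"{memory_per_branch[name] * 1024:.1f} KB/branch")

# Ratio of the largest to the smallest per-branch cost; 1.0 would be
# perfectly linear scaling.
runtime_spread = max(runtime_per_branch.values()) / min(runtime_per_branch.values())
memory_spread = max(memory_per_branch.values()) / min(memory_per_branch.values())
print(f"runtime spread = {runtime_spread:.2f}, memory spread = {memory_spread:.2f}")
```

The problem size grows by more than a factor of ten from r1 to r5, while the per-branch runtime and memory stay within a factor of about 1.4 of each other, which is consistent with roughly linear scaling.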
18
Optimality results
Optimality
• Error is below 1% with p = q = 256
• Error is reduced by about half when the resolution is doubled
[Figure: "Minimum delay vs. optimal delay": error (0.0% to 2.0%) against number of samples (128 to 1024) for r1 to r5]
[Figure: "Minimum area vs. optimal area": error (0.0% to 0.8%) against number of samples (128 to 1024) for r1 to r5]
19
Power/Delay trade-off
[Figure: capacitance-delay trade-off curve for r5, from the minimum-power point to the minimum-delay point]
• Capacitance: 0.2~1.1 nF
• Delay: 5~150 ns
• 15:1 delay:power trade-off
20
Incremental refinement
DC region captures the design space
• Enables incremental refinement
[Figure: DC regions (C-D plots) on an example tree, with a local change marked X]
21
Conclusion & Future Work
Provides a zero-skew clock tree wire-sizing algorithm which
• Minimizes delay and area ε-optimally
• Guarantees pseudo-polynomial runtime and memory usage
• Provides delay/power trade-off information to designers
• Speeds up design convergence by allowing clock tree re-balancing with minimal changes
Future work
• Better delay model
• Buffer insertion/sizing capability
22
Thank you!