A Fully Polynomial Time Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion
Professor Shiyan Hu, Ph.D.
Department of Electrical and Computer Engineering
Michigan Technological University
Moore’s law
2
Twice the number of transistors, approximately every two years
Interconnect Delay Dominates Gate Delay
3
Technology Scaling
4
[Figure: interconnect at 130nm versus 65nm]
Global interconnect lengths do not shrink
Local interconnect lengths shrink
Delay ∝ RC
Resistance R = rL/S, where the cross-sectional area S is reduced
Capacitance C changes only slightly
Interconnect Delay Scaling
5
Scaling factor s = 0.7 per generation
Elmore delay of a wire of length l (first order):
$t_{int} = (rl)(cl)/2 = rcl^2/2$
Local interconnects: $t_{int} = (r/s^2)(c)(ls)^2/2 = rcl^2/2$
– Local interconnect delay is roughly unchanged
Global interconnects: $t_{int} = (r/s^2)(c)(l)^2/2 = rcl^2/(2s^2) \approx rcl^2$, since $1/s^2 \approx 2$
– Global interconnect delay doubles, which is unsustainable
Interconnect delay is increasingly dominant
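The scaling argument can be checked numerically. The sketch below is only an illustration: the function names and the unit values r = c = l = 1 are assumptions, not part of the slides.

```python
def wire_delay(r, c, length):
    """First-order Elmore delay of a distributed RC wire: r*c*length^2 / 2."""
    return r * c * length ** 2 / 2


def scaled_delays(r, c, length, s=0.7):
    """Delays after one technology generation with scaling factor s.

    Resistance per unit length grows to r / s^2 (smaller cross-section);
    capacitance per unit length is taken as unchanged.
    """
    r_next = r / s ** 2
    local = wire_delay(r_next, c, length * s)  # local wires shrink with s
    global_ = wire_delay(r_next, c, length)    # global wires keep their length
    return local, global_


base = wire_delay(r=1.0, c=1.0, length=1.0)
local, global_ = scaled_delays(r=1.0, c=1.0, length=1.0)
print("local ratio:", local / base)     # close to 1: roughly unchanged
print("global ratio:", global_ / base)  # close to 2: delay doubles
```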
Timing Driven Buffer Insertion
6
Buffers Reduce RC Wire Delay
7
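[Figure: a wire of length x, unbuffered and with a buffer (resistance R, capacitance C) inserted at x/2; each half-wire is modeled as a π-section with resistance rx/2 and capacitance cx/4 at each end; plot of ∆t against x]
$\Delta t = t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$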
Intuitive Analysis
8
The delay of a wire of length L is $T = rcL^2/2$ (interconnect Elmore delay)
Assume a buffer is inserted every $l = 2$ units of length
Interconnect delay $= \frac{L}{2} \cdot \frac{1}{2} rc \cdot 2^2 = rcL$, since there are L/2 buffers
(Of course, we need to consider buffer delay)
Detailed Analysis
9
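r, c – resistance and capacitance per unit length
R_d – on-resistance of inverter
C_g – gate input capacitance
Assume N identical buffers with equal inter-buffer length l (L = Nl). The total delay is
$T = N\left[R_d(C_g + cl) + rl\left(C_g + \frac{cl}{2}\right)\right] = L\left[\frac{rcl}{2} + (rC_g + R_d c) + \frac{R_d C_g}{l}\right]$
To minimize delay, set $dT/dl = 0$:
$L\left(\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right) = 0$
$l_{opt} = \sqrt{\frac{2 R_d C_g}{rc}}$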
Quadratic Delay -> Linear Delay
10
Substituting lopt back into the interconnect delay expression:
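$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_d c) + \frac{R_d C_g}{l_{opt}}\right]$
$= L\left[\frac{rc}{2}\sqrt{\frac{2R_dC_g}{rc}} + (rC_g + R_d c) + R_dC_g\sqrt{\frac{rc}{2R_dC_g}}\right]$
$T_{opt} = L\left(\sqrt{2R_dC_g rc} + rC_g + R_d c\right)$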
Delay grows linearly with L instead of quadratically. This is why buffer insertion is highly effective and thus widely used for reducing circuit delay.
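A numeric sketch of the quadratic-to-linear claim, under the uniform-buffering model of the detailed analysis (N identical buffers, on-resistance R_d, input capacitance C_g, spacing l). The function names and the unit parameter values are illustrative assumptions.

```python
import math


def buffered_delay(L, l, r, c, Rd, Cg):
    """Delay of a wire of length L cut into segments of length l by identical buffers."""
    N = L / l  # continuous model: N need not be an integer
    return N * (Rd * (Cg + c * l) + r * l * (Cg + c * l / 2))


def optimal_spacing(r, c, Rd, Cg):
    """Spacing that sets dT/dl = 0."""
    return math.sqrt(2 * Rd * Cg / (r * c))


def optimal_delay(L, r, c, Rd, Cg):
    """Closed form obtained by substituting the optimal spacing back into T."""
    return L * (math.sqrt(2 * Rd * Cg * r * c) + r * Cg + Rd * c)


def unbuffered_delay(L, r, c):
    return r * c * L ** 2 / 2


params = dict(r=1.0, c=1.0, Rd=1.0, Cg=1.0)  # hypothetical unit values
l_opt = optimal_spacing(**params)
for L in (10.0, 20.0, 40.0):
    print(L, unbuffered_delay(L, 1.0, 1.0), optimal_delay(L, **params))
```

Doubling L quadruples the unbuffered delay but only doubles the optimally buffered one.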
25% Gates are Buffers
11
Saxena, et al. [TCAD 2004]
ITRS Projections
12
Problem Formulation
13
Minimal cost (area/power) solution
1. Steiner Tree
2. n candidate buffer locations
T
Solution Characterization
14
To model the effect on downstream, a candidate solution is associated with
• v: a node
• C: downstream capacitance
• Q: required arrival time
• W: cumulative buffer cost
Candidate Buffering Solutions
15
Dynamic Programming (DP)
16
Start from sinks
Candidate solutions are generated
Candidate solutions are propagated toward the source
Three operations
– Add Wire
– Insert Buffer
– Merge
Solution Pruning
Solution Propagation: Add Wire
17
(v1, c1, w1, q1) → (v2, c2, w2, q2) across a wire of length x
$c_2 = c_1 + cx$
$q_2 = q_1 - (rcx^2/2 + rxc_1)$
r: wire resistance per unit length
c: wire capacitance per unit length
Solution Propagation: Insert Buffer
18
(v1, c1, w1, q1) → (v1, c1b, w1b, q1b)
$q_{1b} = q_1 - d(b)$
$c_{1b} = C(b)$
$w_{1b} = w_1 + w(b)$
d(b): buffer delay
Solution Propagation: Merge
19
cmerge = cl + cr
wmerge = wl + wr
qmerge = min(ql , qr)
(v, cl , wl , ql) (v, cr, wr, qr)
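The merge rule can be sketched as follows. The `Candidate` type and the function names are hypothetical; the node v is left implicit because both candidates sit at the same branch point.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    c: float  # downstream capacitance
    w: float  # cumulative buffer cost
    q: float  # required arrival time


def merge(left: Candidate, right: Candidate) -> Candidate:
    """Combine one left-branch and one right-branch candidate at a branch node."""
    return Candidate(
        c=left.c + right.c,      # loads add up
        w=left.w + right.w,      # buffer costs add up
        q=min(left.q, right.q),  # the tighter branch sets the timing
    )


def merge_branches(lefts: List[Candidate], rights: List[Candidate]) -> List[Candidate]:
    """All pairwise merges, before any pruning (n1 * n2 candidates)."""
    return [merge(l, r) for l in lefts for r in rights]


merged = merge(Candidate(c=3, w=0, q=16), Candidate(c=1, w=1, q=12))
print(merged)
```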
Example of Solution Propagation
20
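• r = 1, c = 1
• Rb = 1, Cb = 1, tb = 1
• Rd = 1
Solutions are written (v, C, Q, W); each of the two wire segments has length 2.
Sink: (v1, 1, 20, 0)
Add wire: (v2, 3, 16, 0)
Insert buffer: (v2, 1, 12, 1)
Add wire to (v2, 3, 16, 0): (v3, 5, 8, 0)
Add wire to (v2, 1, 12, 1): (v3, 3, 8, 1)
Add driver to (v3, 5, 8, 0): slack = 3
Add driver to (v3, 3, 8, 1): slack = 5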
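The numbers in this example can be reproduced with a short script. It is a sketch under stated assumptions: the buffer delay is taken as d(b) = tb + Rb·C and the driver slack as Q − Rd·C, which is what the slide's values imply; the buffer cost w(b) = 1 is inferred from W going from 0 to 1. All names are hypothetical.

```python
from typing import NamedTuple


class Solution(NamedTuple):
    c: float  # downstream capacitance C
    q: float  # required arrival time Q
    w: float  # cumulative buffer cost W


R_WIRE, C_WIRE = 1.0, 1.0          # r, c per unit length
RB, CB, TB, WB = 1.0, 1.0, 1.0, 1  # buffer resistance, input cap, intrinsic delay, cost (assumed)
RD = 1.0                           # driver resistance


def add_wire(s: Solution, x: float) -> Solution:
    delay = R_WIRE * C_WIRE * x ** 2 / 2 + R_WIRE * x * s.c
    return Solution(s.c + C_WIRE * x, s.q - delay, s.w)


def insert_buffer(s: Solution) -> Solution:
    # Assumed buffer delay model: d(b) = tb + Rb * (downstream capacitance)
    return Solution(CB, s.q - (TB + RB * s.c), s.w + WB)


def slack_at_driver(s: Solution) -> float:
    return s.q - RD * s.c


sink = Solution(c=1, q=20, w=0)
v2_plain = add_wire(sink, 2)        # (3, 16, 0)
v2_buf = insert_buffer(v2_plain)    # (1, 12, 1)
v3_plain = add_wire(v2_plain, 2)    # (5, 8, 0)
v3_buf = add_wire(v2_buf, 2)        # (3, 8, 1)
print(slack_at_driver(v3_plain), slack_at_driver(v3_buf))
```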
Solution Propagation
21
[Figure: solution propagation, steps (1) to (3)]
Exponential Runtime
22
2 solutions
4 solutions
8 solutions
16 solutions
n candidate buffer locations lead to $2^n$ solutions
Too Many Solutions
23
Needs solution pruning for acceleration
Two candidate solutions
– (v, c1, q1, w1)
– (v, c2, q2, w2)
Solution 1 is inferior to Solution 2 if
– c1 ≥ c2: larger load
– and q1 ≤ q2: tighter timing
– and w1 ≥ w2: larger cost
Car Race - Speed
24
END
Car Speed <=> RAT
Car Race - Load
25
Load <=> Load Capacitance
Faster & Smaller Load
26
Faster & smaller load (larger RAT, smaller capacitance): Good
Slower & larger load (smaller RAT, larger capacitance): Inferior
Faster & Larger Load: Result 1
27
END
Faster & Larger Load: Result 2
28
END
Who will be the winner?
Cannot tell at this moment, so keep both of them.
Pruning
29
(Q1, C1, W1) is inferior/dominated by (Q2, C2, W2) if C1 ≥ C2, W1 ≥ W2 and Q1 ≤ Q2
Non-dominated solutions are maintained: for the same Q and W, pick min C
# of solutions depends on # of distinct W and Q, but not their values
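A minimal sketch of dominance pruning over (Q, C, W) triples. It uses a simple quadratic-time filter for clarity rather than any particular data structure from the literature; the names are hypothetical.

```python
from typing import List, NamedTuple


class Sol(NamedTuple):
    q: float  # required arrival time (larger is better)
    c: float  # downstream capacitance (smaller is better)
    w: float  # buffer cost (smaller is better)


def dominates(a: Sol, b: Sol) -> bool:
    """True if a is at least as good as b in all three dimensions."""
    return a.q >= b.q and a.c <= b.c and a.w <= b.w


def prune(solutions: List[Sol]) -> List[Sol]:
    """Keep only non-dominated solutions (duplicates are collapsed)."""
    kept: List[Sol] = []
    for s in solutions:
        if any(dominates(k, s) for k in kept):
            continue  # s is inferior to something already kept
        kept = [k for k in kept if not dominates(s, k)]
        kept.append(s)
    return kept


candidates = [Sol(q=8, c=5, w=0), Sol(q=8, c=3, w=1), Sol(q=7, c=5, w=1)]
print(prune(candidates))
```

Here the third candidate is dropped: it has the same load as the first, a smaller Q, and a larger cost. The first two are kept because neither is better in every dimension.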
Generating Candidates
30
[Figure: candidate generation, steps (1) to (3)]
Pruning Candidates
31
[Figure: candidates (a) and (b) at step (3), pruned to give step (4)]
Both (a) and (b) look the same to the source. Remove the one with the worse slack and cost.
Candidate Example Continued
32
[Figure: candidates at steps (4) and (5)]
Candidate Example Continued
33
After pruning
(5)
At driver, compute the candidate solution satisfying the timing target with minimum cost. The result is optimal.
Branch Merge
34
Right Candidates
Left Candidates
Pruning During Branch Merge
35
With pruning, $O(n_1 n_2)$ solutions after each branch merge. Worst case: $O((n/m)^m)$ solutions.
Selected Milestone Works on Timing Buffering
36
[Timeline: 1990, 1991, …, 1996, …, 2003, 2004, …, 2008, 2009]
Milestones, in chronological order:
– van Ginneken’s algorithm
– Lillis’ algorithm
– Shi and Li’s algorithm
– NP-hardness proof
Is it possible to design a provably good algorithm running in polynomial time with theoretical guarantee on the error to the optimal solution?
This is a major open problem for a decade!
Bridging The Gap
37
We are bridging the gap!
A Fully Polynomial Time Approximation Scheme (FPTAS)
Provably good
Computes a solution with cost at most (1+ɛ) times the optimal cost for any ɛ>0
Runs in time polynomial in n (nodes), b (buffer types) and 1/ɛ
Best solution for an NP-hard problem in theory
Highly practical
The Rough Picture
38
W*: the cost of optimal solution
Make a guess on W* (Key 2: smart guess)
Check it (Key 1: efficient checking)
Good (close to W*): return the solution
Not good: make another guess
Key 1: Efficient Checking
39
Benefit of guess
Only maintain the solutions with cost no greater than the guessed cost
This is the first reason for acceleration
The Oracle
40
Oracle (x): the checker, able to decide whether x>W* or not
– Without knowing W*
– Answer efficiently
Construction of Oracle(x)
41
Scale and round each buffer cost
Only interested in whether there is a solution with cost up to x satisfying the timing constraint
Dynamic Programming
Perform DP on the scaled problem with cost upper bound n/ɛ. Time polynomial in n/ɛ
Scaling and Rounding
42
[Figure: buffer cost axis marked at 0, xɛ/n, 2xɛ/n, 3xɛ/n, 4xɛ/n]
Scaling and Rounding
43
[Figure: after scaling, the buffer cost axis is marked at 0, 1, 2, 3, 4]
# distinct buffer costs is at most O(n/ɛ) since only solutions with W bounded by n/ɛ are propagated.
Rounding error at each buffer is xɛ/n; total rounding error is xɛ.
• Larger xɛ/n: larger error, fewer distinct costs and faster
• Smaller xɛ/n: smaller error, more distinct costs and slower
• Rounding is the second reason for acceleration
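A sketch of the scale-and-round step for a guess x. Rounding down (floor) is an assumption here; it is chosen because it matches the error analysis, in which the true cost exceeds the scaled-back rounded cost by less than xɛ/n per buffer. The function names and sample values are hypothetical.

```python
import math
from typing import List


def scale_and_round(costs: List[float], x: float, eps: float, n: int) -> List[int]:
    """Express each buffer cost in units of x*eps/n, rounded down to an integer."""
    unit = x * eps / n
    return [math.floor(w / unit) for w in costs]


def scale_back(rounded: List[int], x: float, eps: float, n: int) -> List[float]:
    """Map rounded integer costs back to the original cost scale."""
    unit = x * eps / n
    return [k * unit for k in rounded]


costs = [3.0, 12.0, 27.0]  # hypothetical buffer costs
x, eps, n = 100.0, 0.5, 10  # guess, accuracy, number of candidate locations
rounded = scale_and_round(costs, x, eps, n)
restored = scale_back(rounded, x, eps, n)
print(rounded, restored)
```

With these values the unit is xɛ/n = 5, and the DP only needs to track integer costs up to n/ɛ = 20.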
Oracle Construction
44
Run dynamic programming with cost bound n/ɛ
Yes, there is a solution satisfying the timing constraint: with cost rounded and scaled back, the solution has cost at most n/ɛ · xɛ/n + xɛ = (1+ɛ)x, so (1+ɛ)x > W*
No, no such solution: with cost rounded and scaled back, any solution has cost at least n/ɛ · xɛ/n = x, so x ≤ W*
Rounding on Q
45
# solutions bounded by # distinct W and Q
# W = O(n/ɛ1), ɛ1 is used for W
– Rounding before DP
# Q
– Round up Q to nearest value in {0, ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, …, T}, in branch merge (m is # sinks)
– Rounding during DP
– # Q = O(m/ɛ2), ɛ2 is used for Q
– Rounding error bounded by ɛ2T/m per branch merge, by ɛ2T for the whole tree
# non-dominated solutions is O(mn/(ɛ1ɛ2))
[Figure: Q axis marked at 0, ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, 4ɛ2T/m]
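The Q rounding can be sketched as snapping a required arrival time up to the next grid point of spacing ɛ2T/m. The function name and the sample values are illustrative assumptions.

```python
import math


def round_up_q(q: float, T: float, eps2: float, m: int) -> float:
    """Round q up to the nearest value in {0, eps2*T/m, 2*eps2*T/m, ..., T}."""
    step = eps2 * T / m
    k = math.ceil(q / step)  # index of the grid point at or above q
    return min(k * step, T)


T, eps2, m = 100.0, 0.5, 5  # hypothetical timing target, accuracy, number of sinks
print(round_up_q(23.0, T, eps2, m))  # grid spacing is 10
print(round_up_q(30.0, T, eps2, m))  # already on the grid
```

Each rounding changes Q by less than one grid step, which is the ɛ2T/m per-merge error quoted on the slide.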
Q-W Rounding Before Branch Merge
46
[Figure: Q-W grid, with W in bins 0, 1, 2, 3, 4, …, n/ɛ1 on one axis and Q in steps 0, ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, 4ɛ2T/m, …, T on the other]
Buffer Insertion Runtime
47
$O(n/\epsilon_1)$ W-bins
A buffer insertion introduces $O(nb/\epsilon_1)$ non-dominated solutions with b buffer types
$O(mn/(\epsilon_1\epsilon_2))$ solutions after a branch merge
At most $O(mn/(\epsilon_1\epsilon_2) + n^2b/\epsilon_1)$ non-dominated solutions in a single branch
$O(mnb/(\epsilon_1\epsilon_2) + n^2b^2/\epsilon_1)$ time for each node. No cross W-bin pruning.
Branch Merge Runtime - 1
48
Target Q=0
When merging Wl=2 with Wr=1, we previously needed to try a quadratic # of combinations; now only a linear # of combinations.
Branch Merge Runtime - 2
49
Target Q= ɛ2T/m
Branch Merge Runtime - 3
50
Target Q= 2ɛ2T/m
Branch Merge Runtime - 4
51
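For merged W = a, try all $W_l + W_r = a$, where each takes $O(m/\epsilon_2)$ time
For $a = 0, 1, \ldots, n/\epsilon_1$, total runtime is $O(n^2m/(\epsilon_1^2\epsilon_2))$
Including time for putting solutions into bins, it is $O(n^2m/(\epsilon_1^2\epsilon_2) + n^2b/\epsilon_1 + mn/(\epsilon_1\epsilon_2))$
$O(mn/(\epsilon_1\epsilon_2))$ solutions after a branch merge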
Timing-Cost Approximate DP
52
Lemma: a buffering solution with cost at most (1+ɛ1)W* and with timing at most (1+ɛ2)T can be computed in time
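$O\left(\frac{m^2n^2}{\epsilon_1^2\epsilon_2} + \frac{m^2n}{\epsilon_1\epsilon_2} + \frac{mn^2b}{\epsilon_1\epsilon_2} + \frac{mn^2b}{\epsilon_1} + \frac{n^3b^2}{\epsilon_1}\right)$
U (L): upper (lower) bound on W*
Naive binary search style approach
Runtime (# iterations) depends on the initial bounds U and L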
Key 2: Geometric Sequence Based Guess
53
Set U and L on W*
x = (U+L)/2
Oracle (x)
W* < (1+ɛ)x: U = (1+ɛ)x
W* ≥ x: L = x
Adapt ɛ1
54
Rounding factor xɛ1/n for W
Larger ɛ1: faster with rough estimation
Smaller ɛ1: slower with accurate estimation
Adapt ɛ1 according to U and L
U/L Related Scale and Round
55
[Figure: buffer cost axis from 0 to U/L, with rounding intervals of width xɛ/n]
Conceptually
56
Begin with large ɛ1 and progressively reduce it (towards ɛ) according to U/L as x approaches W*
Fix ɛ2=ɛ in rounding T for limiting timing violation
• Set ɛ1 as a geometric sequence of …, 8, 4, 2, 1, 1/2, …, ɛ
• Suppose that one run of DP takes O(n/ɛ1) time. Total runtime is bounded by the last run as O(… + n/8 + n/4 + n/2 + … + n/ɛ) = O(n/ɛ).
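The geometric-sum argument can be checked with a few lines. As on the slide, one DP run is assumed to cost n/ɛ1; the starting value 8, the halving schedule, and the function names are illustrative assumptions.

```python
from typing import List


def epsilon_schedule(eps: float, start: float = 8.0) -> List[float]:
    """Halve eps1 from `start` until it would drop to eps or below, then finish with eps."""
    seq = []
    e = start
    while e > eps:
        seq.append(e)
        e /= 2
    seq.append(eps)
    return seq


def total_cost(n: int, eps: float) -> float:
    """Sum of the assumed per-run cost n/eps1 over the whole schedule."""
    return sum(n / e for e in epsilon_schedule(eps))


n, eps = 100, 0.0625
print(epsilon_schedule(eps))
print(total_cost(n, eps), "versus last run", n / eps)
```

All runs before the last one together cost less than twice the last halving step, so the total stays within a small constant factor of n/ɛ.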
Oracle Query Till U/L<2
57
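Mathematically
58
At oracle query i, with bounds $W^*_{l,i} \le W^* \le W^*_{u,i}$:
$\epsilon'_i = \sqrt{W^*_{u,i}/W^*_{l,i}} - 1$, $x = \sqrt{W^*_{l,i}W^*_{u,i}/(1+\epsilon'_i)}$
After the query, the ratio of the bounds shrinks as
$\frac{W^*_{u,i+1}}{W^*_{l,i+1}} = \left(\frac{W^*_{u,i}}{W^*_{l,i}}\right)^{3/4}$
so the DP runtimes of the successive oracle queries sum to within a constant factor of the last run.
Runtime of one DP run with parameters $\epsilon_1$, $\epsilon_2$:
$O\left(\frac{m^2n^2}{\epsilon_1^2\epsilon_2} + \frac{m^2n}{\epsilon_1\epsilon_2} + \frac{mn^2b}{\epsilon_1\epsilon_2} + \frac{mn^2b}{\epsilon_1} + \frac{n^3b^2}{\epsilon_1}\right)$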
When U/L<2
59
U/L < 2
Scale and round each cost by Lɛ/n
Run DP with W = 2n/ɛ
Pick min cost solution satisfying timing at driver
At least one feasible solution; otherwise there is no solution with cost 2n/ɛ · Lɛ/n = 2L ≥ U
Lɛ/n rounding error per buffer and Lɛ in a solution
A single DP runtime
The Algorithmic Flow
60
Set U and L of W*
Adapt $\epsilon_1 = (U/L)^{1/2} - 1$
Set $x = [UL/(1+\epsilon_1)]^{1/2}$
Oracle (x)
Update U or L; repeat until U/L < 2
Compute final solution
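The bound-narrowing loop of this flow can be sketched with a mock oracle. Two assumptions are made explicit: ɛ1 is read as (U/L)^{1/2} − 1, which is the reading that makes U/L shrink to its 3/4 power per query, and the oracle is a stand-in that already knows W* (a real oracle would run the rounded DP instead). All names are hypothetical.

```python
import math
from typing import Callable, Tuple


def make_mock_oracle(w_star: float) -> Callable[[float, float], bool]:
    """Stand-in oracle: answers yes iff W* <= x.

    A real oracle may also answer yes when x < W* <= (1 + eps1) * x.
    """
    def oracle(x: float, eps1: float) -> bool:
        return w_star <= x
    return oracle


def narrow_bounds(lower: float, upper: float,
                  oracle: Callable[[float, float], bool]) -> Tuple[float, float, int]:
    """Shrink [lower, upper] around W* until upper / lower < 2 (requires lower > 0)."""
    queries = 0
    while upper / lower >= 2:
        eps1 = math.sqrt(upper / lower) - 1         # adapt eps1 to the current gap
        x = math.sqrt(upper * lower / (1 + eps1))   # geometric-style guess
        if oracle(x, eps1):
            upper = (1 + eps1) * x                  # yes: W* <= (1 + eps1) * x
        else:
            lower = x                               # no: W* > x
        queries += 1
    return lower, upper, queries


w_star = 1234.5  # hypothetical optimal cost, hidden inside the mock oracle
lo, hi, used = narrow_bounds(1.0, 1e6, make_mock_oracle(w_star))
print(lo, hi, used)
```

Either branch of the update replaces U/L by (U/L)^{3/4}, so the number of queries grows only doubly logarithmically in the initial gap.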
Main Theorem
61
Theorem: a (1+ɛ) approximation to the timing constrained minimum cost buffering problem can be computed in $O(m^2n^2b/\epsilon^3 + n^3b^2/\epsilon)$ time for 0 < ɛ < 1 and in $O(m^2n^2b/\epsilon + mn^2b + n^3b)$ time for ɛ ≥ 1
Experiments
62
Experimental Setup
– 1000 industrial nets
– 48 industrial buffer types including non-inverting buffers and inverting buffers
Compared to Dynamic Programming which is the state of the art technique and is widely used in industry
Cost Ratio Compared to DP
63
[Figure: buffer cost ratio of FPTAS relative to DP (y-axis "Buffer Cost Ratio", 0 to 0.14) against approximation (x-axis)]
Speedup Compared to DP
64
[Figure: speedup of FPTAS over DP (y-axis "Speedup", 0 to 6) against approximation (x-axis: 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5)]
Observations
65
FPTAS always achieves the theoretical guarantee
Larger ɛ leads to more speedup
On average about 5x faster than dynamic programming
Can run 4.6x faster with 0.57% solution degradation
<5% nets with timing violations, which can be fixed by a simple timing recovery procedure
Our Bridge
66
NP-Hardness Complexity
Exponential Time Algorithm
Conclusion
67
Propose a (1+ ɛ) approximation for timing constrained minimum cost buffering for any ɛ > 0 (DAC’09)
– Runs in $O(m^2n^2b/\epsilon^3 + n^3b^2/\epsilon)$ time
– Timing-cost approximate dynamic programming
– Double-ɛ geometric sequence based oracle search
– 5x speedup in experiments
– Few percent additional buffers as guaranteed theoretically
The first provably good approximation algorithm on this problem, which has been a major open problem in the field
Thanks