Dynamic-Programming Strategies for Analyzing Biomolecular Sequences
description
Transcript of Dynamic-Programming Strategies for Analyzing Biomolecular Sequences
Dynamic-Programming Strategies for Analyzing Biomolecular SequencesKun-Mao Chao (趙坤茂 )
Department of Computer Science and Information EngineeringNational Taiwan University, Taiwan
E-mail: [email protected]: http://www.csie.ntu.edu.tw/~kmchao
2
Dynamic Programming• Dynamic programming is a class of
solution methods for solving sequential decision problems with a compositional cost structure.
• Richard Bellman was one of the principal founders of this approach.
3
Two key ingredients• Two key ingredients for an optimization
problem to be suitable for a dynamic-programming solution:
Each substructure is optimal.
(Principle of optimality)
1. optimal substructures
2. overlapping subproblems
Subproblems are dependent.(otherwise, a divide-and-conquer approach is the choice.)
4
Three basic components• The development of a dynamic-programming algorithm has three basic components:
– The recurrence relation (for defining the value of an optimal solution);– The tabular computation (for computing the value of an optimal solution);– The traceback (for delivering an optimal solution).
5
Fibonacci numbers
.for 21
11
00
i>1iFiFiFFF
The Fibonacci numbers are defined by the following recurrence:
6
How to compute F10?
F10
F9
F8
F8
F7
F7
F6
……
7
Tabular computation• The tabular computation can avoid recompuation.
F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10
0 1 1 2 3 5 8 13 21 34 55
8
Maximum-sum interval• Given a sequence of real numbers
a1a2…an , find a consecutive subsequence with the maximum sum.9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9
For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n2) time.
9
O-notation: an asymptotic upper bound
• f(n) = O(g(n)) iff there exist two positive constant c and n0 such that 0 f(n) cg(n) for all n n0
cg(n)
f(n)
n0
10
How functions grow?
30n 92n log n 26n2 0.68n3 2n
100 0.003 sec. 0.003 sec. 0.0026
sec. 0.68 sec. 4 x 1016
yr.
100,000 3.0 sec. 2.6 min. 3.0 days 22 yr.
For large data sets, algorithms with a complexity greater than O(n log n) are often impractical!
n
function
(Assume one million operations per second.)
11
Maximum-sum interval(The recurrence relation)
• Define S(i) to be the maximum sum of the intervals ending at position i.
0
)1(max)(
iSaiS i
ai
If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.
12
Maximum-sum interval(Tabular computation)
9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7
The maximum sum
13
Maximum-sum interval(Traceback)9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9
S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7
The maximum-sum interval: 6 -2 8 4
14
Two fundamental problems we recently solved (I) (joint work with Lin and Jiang)• Given a sequence of real numbers
of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm.U = 3
9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9
15
Joint work with Huang, Jiang and Lin• Xiaoqiu Huang (黃曉秋 )Iowa State University, USA• Tao Jiang (姜濤 ) University of California Riverside, USA• Yaw-Ling Lin (林耀鈴 )Providence University, Taiwan
16
Two fundamental problems we recently solved (II) (joint work with Lin and Jiang)• Given a sequence of real numbers
of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. --- an O(n log L)-time algorithm.
L = 4
3 2 14 6 6 2 10 2 6 6 14 2 1
17
Another exampleGiven a sequence as follows:
2, 6.6, 6.6, 3, 7, 6, 7, 2and L = 2, the highest-average
interval is the squared area, which has the average value 20/3.
2, 6.6, 6.6, 3, 7, 6, 7, 2
18
C+G rich regions• Our method can be used to locate
a region of length at least L with the highest C+G ratio in O(n log L) time.
ATGACTCGAGCTCGTCA
00101011011011010
Search for an interval of length at least L with the highest average.
19
After I opened this problem in Academia Sinica in Aug. 2001 ...• JCSS (2002) – Lin, Jiang, Chao
– A linear-time algorithm for a maximum-sum segment with a lower bound L and an upper bound U.
– An O(n log L) time algorithm for a maximum-average segment with a lower bound L.
– Two open problems raised:• O(n log L) O(n)?• average (with an upper bound U)?
20
After I opened this problem in Academia Sinica in Aug. 2001 ...• Comb. Math. (2002) – Yeh et. al
– An expected O(nL)-time algorithm• Comb. Math. (2002) – Lu
– A linear time algorithm for a maximum-average segment with a lower bound L in a binary sequence.
21
After I opened this problem in Academia Sinica in Aug. 2001 ...• WABI (2002) & JCSS – Goldwasswer, Kao,
and Lu– O(n) for a lower bound L only– O(n log (U-L+1)) for L and U
• IPL (2003) – Kim– O(n) for a lower bound L only
(Reformulated as a Computational Geometry Problem)
22
After I opened this problem in Academia Sinica in Aug. 2001 ...• Bioinformatics (2003) – Lin, Huang, Jiang,
Chao– A tool for finding k-best maximum-average seg
ments• Bioinformatics (2003) – Wang, Xu
– A tool for locating a maximum-sum segment with a lower bound L and an upper bound U
23
After I opened this problem in Academia Sinica in Aug. 2001 ...• CIAA(2003) – Lu et al.
– An alternative linear-time algorithm for a maximum-sum segment (online; for finding repeats)
• ESA (2003) – Chung, Lu– O(n log (U-L+1)) O(n) for L and U
• ISAAC (2003) – Lin (R.R.), Kuo, Chao– O(nL) for a maximum-average path in a tree– O(n) for a complete tree
• Some were omitted here...• Some are coming soon...
24
Length-unconstrained version
• Maximum-average interval
3 2 14 6 6 2 10 2 6 6 14 2 1
The maximum element is the answer. It can be done in O(n) time.
25
Q: computing density in O(1) time?
• prefix-sum(i) = S[1]+S[2]+…+S[i],– all n prefix sums are computable in O(n) time.
• sum(i, j) = prefix-sum(j) – prefix-sum(i-1)• density(i, j) = sum(i, j) / (j-i+1)
prefix-sum(j)
i j
prefix-sum(i-1)
26
A naive algorithm• A simple shift algorithm can compute the
highest-average interval of a fixed length in O(n) time
• Try L, L+1, L+2, ..., n. In total, O(n2).
27
An O(nL)-time algorithm (Huang, CABIOS’ 94)• Observing that there exists an optimal interv
al of length bounded by 2L, we immediately have an O(nL)-time algorithm.
We can bisect a region of length >= 2L into two segments, where each of them is of length >= L.
28
Good partners• Finding good partner g(i) for each i=1, 2, …, n-L+1.
L
g(i)
maximing avg[i, g(i)]
i + L
29
Right-Skew Decomposition• Partition S into substrings S1,S2,…,Sk such that
– each Si is a right-skew substring of S• the average of any prefix is always less than or equal to
the average of the remaining suffix.– density(S1) > density(S2) > … > density(Sk)
• [Lin, Jiang, Chao]– Unique– Computable in linear time.
30
Right-Skew Decomposition
1 1 0 1 1 0 1 0 1 1 0 0 1
1 2/3 3/5 1/3> >>
Lu’s illustration
31
An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS’02)• Decreasingly right-skew decomposition
(O(n) time)
9 7 8 1 9 8 3 7 2 8
97.5 6
5
8 9 8 75
8
32
Right-Skew pointers p[ ]
9 7 8 1 9 8 3 7 2 8
97.5 6
5
8 9 8 75
8
1 2 3 4 5 6 7 8 9 10
p[ ] 1 3 3 6 5 6 10 8 10 10
33
Illustration
L
i+L g(i)
L
i+L
Lu’s illustration
34
An O(n log L)-time algorithm
• Jumping tables that allows binary search.– O(log L) time for each position to find its good
partner, therefore the total running time is O(n log L).
• We also implemented a program that linearly scans right-skew segments for the good partnership. Our empirical tests showed that it ran in linear time empirically.
35
Pointer-jumping tables
9 7 8 1 9 8 3 7 2 8
97.5 6
5
8 9 8 75
8
1 2 3 4 5 6 7 8 9 10
p(0)[ ] 1 3 3 6 5 6 10 8 10 10
p(1)[ ] 3 6 6 10 6 10 10 10 10 10
p(2)[ ] 10 10 10 10 10 10 10 10 10 10
}],1][[min{][][][
)()()1(
)0(
nippipipip
kkk
36
Goldwasser, Kao, and Lu’s progress
• An O(n) time algorithm for the problem– Appeared in Proceedings of the Secon
d Workshop on Algorithms in Bioinformatics (WABI), Rome, Itay, Sep. 17-21, 2002.
– JCSS, accepted.
37
A new important observation
• i < j < g(j) < g(i) implies– density(i, g(i)) is no more than density(j, g(j))
i g(i)j g(j)
Lu’s illustration
38
i g(i)j g(j)
Lu’s illustration
39
Searching for all g(i) in linear time
Lu’s illustration
40
k non-overlapping maximum-average segments (Lin, Huang, Jiang, Chao, Bioinformatics)• Maintain all candidates in a priority queue.
• A new k-best algorithm + algorithms on trees (Lin and Chao (2003))
41
Longest increasing subsequence(LIS)• The longest increasing subsequence is to find a
longest increasing subsequence of a given sequence of distinct integers a1a2…an .e.g. 9 2 5 3 7 11 8 10 13 6
2 3 7
5 7 10 13
9 7 11
3 5 11 13
are increasing subsequences.
are not increasing subsequences.
We want to find a longest one.
42
A naive approach for LIS• Let L[i] be the length of a longest increasing
subsequence ending at position i.L[i] = 1 + max j = 0..i-1{L[j] | aj < ai}(use a dummy a0 = minimum, and L[0]=0)
9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 ?
43
A naive approach for LIS
9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 4 5 6 3
L[i] = 1 + max j = 0..i-1 {L[j] | aj < ai}
The maximum length
The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence.
This method runs in O(n2) time.
44
Binary search• Given an ordered sequence x1x2 ... xn, where
x1<x2< ... <xn, and a number y, a binary search finds the largest xi such that xi< y in O(log n) time.
n ...
n/2n/4
45
Binary search• How many steps would a binary search
reduce the problem size to 1?n n/2 n/4 n/8 n/16 ... 1
How many steps? O(log n) steps.
nsn s
2log12/
46
An O(n log n) method for LIS• Define BestEnd[k] to be the smallest number of an
increasing subsequence of length k.9 2 5 3 7 11 8 10 13 69 2 2
523
237
23711
2378
2378
10
2378
1013
BestEnd[1]
BestEnd[2]
BestEnd[3]
BestEnd[4]
BestEnd[5]
BestEnd[6]
47
An O(n log n) method for LIS• Define BestEnd[k] to be the smallest number of an
increasing subsequence of length k.9 2 5 3 7 11 8 10 13 69 2 2
523
237
23711
2378
2378
10
2378
1013
23681013
BestEnd[1]
BestEnd[2]
BestEnd[3]
BestEnd[4]
BestEnd[5]
BestEnd[6]
For each position, we perform a binary search to update BestEnd. Therefore, the running time is O(n log n).
48
Longest Common Subsequence (LCS)
• A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent.
• The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.
49
LCSFor instance, Sequence 1: president Sequence 2: providence Its LCS is priden.
president
providence
50
LCSAnother example: Sequence 1: algorithm Sequence 2: alignmentOne of its LCS is algm.
a l g o r i t h m
a l i g n m e n t
51
How to compute LCS?• Let A=a1a2…am and B=b1b2…bn .• len(i, j): the length of an LCS between
a1a2…ai and b1b2…bj
• With proper initializations, len(i, j)can be computed as follows.
,. and 0, if)),1(),1,(max(
and 0, if1)1,1(,0or 0 if0
),(
ji
ji
bajijilenjilenbajijilen
jijilen
52
p r o c e d u r e L C S - L e n g t h ( A , B ) 1 . f o r i ← 0 t o m d o l e n ( i , 0 ) = 0 2 . f o r j ← 1 t o n d o l e n ( 0 , j ) = 0 3 . f o r i ← 1 t o m d o 4 . f o r j ← 1 t o n d o
5 . i f ji ba t h e n
" "),(
1)1,1(),(jiprev
jilenjilen
6 . e l s e i f )1,(),1( jilenjilen
7 . t h e n
" "),(
),1(),(jiprev
jilenjilen
8 . e l s e
" "),(
)1,(),(jiprev
jilenjilen
9 . r e t u r n l e n a n d p r e v
53
i j 0 1 p
2 r
3 o
4 v
5 i
6 d
7 e
8 n
9 c
10 e
0 0 0 0 0 0 0 0 0 0 0 0
1 p 2
0 1 1 1 1 1 1 1 1 1 1
2 r 0 1 2 2 2 2 2 2 2 2 2
3 e 0 1 2 2 2 2 2 3 3 3 3
4 s 0 1 2 2 2 2 2 3 3 3 3
5 i 0 1 2 2 2 3 3 3 3 3 3
6 d 0 1 2 2 2 3 4 4 4 4 4
7 e 0 1 2 2 2 3 4 5 5 5 5
8 n 0 1 2 2 2 3 4 5 6 6 6
9 t 0 1 2 2 2 3 4 5 6 6 6
54
p r o c e d u r e O u tp u t - L C S ( A , p r e v , i , j )
1 i f i = 0 o r j = 0 t h e n r e t u r n
2 i f p r e v ( i , j ) = ” “ t h e n
iajiprevALCSOutput
print )1,1,,(
3 e l s e i f p r e v ( i , j ) = ” “ t h e n O u tp u t - L C S ( A , p r e v , i - 1 , j ) 4 e l s e O u tp u t - L C S ( A , p r e v , i , j - 1 )
55
i j 0 1 p
2 r
3 o
4 v
5 i
6 d
7 e
8 n
9 c
10 e
0 0 0 0 0 0 0 0 0 0 0 0
1 p 2
0 1 1 1 1 1 1 1 1 1 1
2 r 0 1 2 2 2 2 2 2 2 2 2
3 e 0 1 2 2 2 2 2 3 3 3 3
4 s 0 1 2 2 2 2 2 3 3 3 3
5 i 0 1 2 2 2 3 3 3 3 3 3
6 d 0 1 2 2 2 3 4 4 4 4 4
7 e 0 1 2 2 2 3 4 5 5 5 5
8 n 0 1 2 2 2 3 4 5 6 6 6
9 t 0 1 2 2 2 3 4 5 6 6 6
Output: priden
56
Now try this example on your own
Sequence 1: algorithmSequence 2: alignmentTheir LCS?Recall that ...
a l g o r i t h m
a l i g n m e n t