Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

Dynamic-Programming Strategies for Analyzing Biomolecular SequencesKun-Mao Chao (趙坤茂 )

Department of Computer Science and Information EngineeringNational Taiwan University, Taiwan

E-mail: [email protected]: http://www.csie.ntu.edu.tw/~kmchao

2

Dynamic Programming• Dynamic programming is a class of

solution methods for solving sequential decision problems with a compositional cost structure.

• Richard Bellman was one of the principal founders of this approach.

3

Two key ingredients• Two key ingredients for an optimization

problem to be suitable for a dynamic-programming solution:

Each substructure is optimal.

(Principle of optimality)

1. optimal substructures

2. overlapping subproblems

Subproblems are dependent.(otherwise, a divide-and-conquer approach is the choice.)

4

Three basic components• The development of a dynamic-programming algorithm has three basic components:

– The recurrence relation (for defining the value of an optimal solution);– The tabular computation (for computing the value of an optimal solution);– The traceback (for delivering an optimal solution).

5

Fibonacci numbers

.for 21

11

00

i>1iFiFiFFF

The Fibonacci numbers are defined by the following recurrence:

6

How to compute F10？

F10

F9

F8

F8

F7

F7

F6

……

7

Tabular computation• The tabular computation can avoid recompuation.

F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10

0 1 1 2 3 5 8 13 21 34 55

8

Maximum-sum interval• Given a sequence of real numbers

a1a2…an , find a consecutive subsequence with the maximum sum.9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n2) time.

9

O-notation: an asymptotic upper bound

• f(n) = O(g(n)) iff there exist two positive constant c and n0 such that 0 f(n) cg(n) for all n n0

cg(n)

f(n)

n0

10

How functions grow?

30n 92n log n 26n2 0.68n3 2n

100 0.003 sec. 0.003 sec. 0.0026

sec. 0.68 sec. 4 x 1016

yr.

100,000 3.0 sec. 2.6 min. 3.0 days 22 yr.

For large data sets, algorithms with a complexity greater than O(n log n) are often impractical!

n

function

(Assume one million operations per second.)

11

Maximum-sum interval(The recurrence relation)

• Define S(i) to be the maximum sum of the intervals ending at position i.

0

)1(max)(

iSaiS i

ai

If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.

12

Maximum-sum interval(Tabular computation)

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum sum

13

Maximum-sum interval(Traceback)9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum-sum interval: 6 -2 8 4

14

Two fundamental problems we recently solved (I) (joint work with Lin and Jiang)• Given a sequence of real numbers

of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm.U = 3

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

15

Joint work with Huang, Jiang and Lin• Xiaoqiu Huang (黃曉秋 )Iowa State University, USA• Tao Jiang (姜濤 ) University of California Riverside, USA• Yaw-Ling Lin (林耀鈴 )Providence University, Taiwan

16

Two fundamental problems we recently solved (II) (joint work with Lin and Jiang)• Given a sequence of real numbers

of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. --- an O(n log L)-time algorithm.

L = 4

3 2 14 6 6 2 10 2 6 6 14 2 1

17

Another exampleGiven a sequence as follows:

2, 6.6, 6.6, 3, 7, 6, 7, 2and L = 2, the highest-average

interval is the squared area, which has the average value 20/3.

2, 6.6, 6.6, 3, 7, 6, 7, 2

18

C+G rich regions• Our method can be used to locate

a region of length at least L with the highest C+G ratio in O(n log L) time.

ATGACTCGAGCTCGTCA

00101011011011010

Search for an interval of length at least L with the highest average.

19

After I opened this problem in Academia Sinica in Aug. 2001 ...• JCSS (2002) – Lin, Jiang, Chao

– A linear-time algorithm for a maximum-sum segment with a lower bound L and an upper bound U.

– An O(n log L) time algorithm for a maximum-average segment with a lower bound L.

– Two open problems raised:• O(n log L) O(n)?• average (with an upper bound U)?

20

After I opened this problem in Academia Sinica in Aug. 2001 ...• Comb. Math. (2002) – Yeh et. al

– An expected O(nL)-time algorithm• Comb. Math. (2002) – Lu

– A linear time algorithm for a maximum-average segment with a lower bound L in a binary sequence.

21

After I opened this problem in Academia Sinica in Aug. 2001 ...• WABI (2002) & JCSS – Goldwasswer, Kao,

and Lu– O(n) for a lower bound L only– O(n log (U-L+1)) for L and U

• IPL (2003) – Kim– O(n) for a lower bound L only

(Reformulated as a Computational Geometry Problem)

22

After I opened this problem in Academia Sinica in Aug. 2001 ...• Bioinformatics (2003) – Lin, Huang, Jiang,

Chao– A tool for finding k-best maximum-average seg

ments• Bioinformatics (2003) – Wang, Xu

– A tool for locating a maximum-sum segment with a lower bound L and an upper bound U

23

After I opened this problem in Academia Sinica in Aug. 2001 ...• CIAA(2003) – Lu et al.

– An alternative linear-time algorithm for a maximum-sum segment (online; for finding repeats)

• ESA (2003) – Chung, Lu– O(n log (U-L+1)) O(n) for L and U

• ISAAC (2003) – Lin (R.R.), Kuo, Chao– O(nL) for a maximum-average path in a tree– O(n) for a complete tree

• Some were omitted here...• Some are coming soon...

24

Length-unconstrained version

• Maximum-average interval

3 2 14 6 6 2 10 2 6 6 14 2 1

The maximum element is the answer. It can be done in O(n) time.

25

Q: computing density in O(1) time?

• prefix-sum(i) = S[1]+S[2]+…+S[i],– all n prefix sums are computable in O(n) time.

• sum(i, j) = prefix-sum(j) – prefix-sum(i-1)• density(i, j) = sum(i, j) / (j-i+1)

prefix-sum(j)

i j

prefix-sum(i-1)

26

A naive algorithm• A simple shift algorithm can compute the

highest-average interval of a fixed length in O(n) time

• Try L, L+1, L+2, ..., n. In total, O(n2).

27

An O(nL)-time algorithm (Huang, CABIOS’ 94)• Observing that there exists an optimal interv

al of length bounded by 2L, we immediately have an O(nL)-time algorithm.

We can bisect a region of length >= 2L into two segments, where each of them is of length >= L.

28

Good partners• Finding good partner g(i) for each i=1, 2, …, n-L+1.

L

g(i)

maximing avg[i, g(i)]

i + L

29

Right-Skew Decomposition• Partition S into substrings S1,S2,…,Sk such that

– each Si is a right-skew substring of S• the average of any prefix is always less than or equal to

the average of the remaining suffix.– density(S1) > density(S2) > … > density(Sk)

• [Lin, Jiang, Chao]– Unique– Computable in linear time.

30

Right-Skew Decomposition

1 1 0 1 1 0 1 0 1 1 0 0 1

1 2/3 3/5 1/3> >>

Lu’s illustration

31

An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS’02)• Decreasingly right-skew decomposition

(O(n) time)

9 7 8 1 9 8 3 7 2 8

97.5 6

5

8 9 8 75

8

32

Right-Skew pointers p[ ]

9 7 8 1 9 8 3 7 2 8

97.5 6

5

8 9 8 75

8

1 2 3 4 5 6 7 8 9 10

p[ ] 1 3 3 6 5 6 10 8 10 10

33

Illustration

L

i+L g(i)

L

i+L

Lu’s illustration

34

An O(n log L)-time algorithm

• Jumping tables that allows binary search.– O(log L) time for each position to find its good

partner, therefore the total running time is O(n log L).

• We also implemented a program that linearly scans right-skew segments for the good partnership. Our empirical tests showed that it ran in linear time empirically.

35

Pointer-jumping tables

9 7 8 1 9 8 3 7 2 8

97.5 6

5

8 9 8 75

8

1 2 3 4 5 6 7 8 9 10

p(0)[ ] 1 3 3 6 5 6 10 8 10 10

p(1)[ ] 3 6 6 10 6 10 10 10 10 10

p(2)[ ] 10 10 10 10 10 10 10 10 10 10

}],1][[min{][][][

)()()1(

)0(

nippipipip

kkk

36

Goldwasser, Kao, and Lu’s progress

• An O(n) time algorithm for the problem– Appeared in Proceedings of the Secon

d Workshop on Algorithms in Bioinformatics (WABI), Rome, Itay, Sep. 17-21, 2002.

– JCSS, accepted.

http://www.iis.sinica.edu.tw/~hil/icon/face-big-02.jpg

37

A new important observation

• i < j < g(j) < g(i) implies– density(i, g(i)) is no more than density(j, g(j))

i g(i)j g(j)

Lu’s illustration

38

i g(i)j g(j)

Lu’s illustration

39

Searching for all g(i) in linear time

Lu’s illustration

40

k non-overlapping maximum-average segments (Lin, Huang, Jiang, Chao, Bioinformatics)• Maintain all candidates in a priority queue.

• A new k-best algorithm + algorithms on trees (Lin and Chao (2003))

41

Longest increasing subsequence(LIS)• The longest increasing subsequence is to find a

longest increasing subsequence of a given sequence of distinct integers a1a2…an .e.g. 9 2 5 3 7 11 8 10 13 6

2 3 7

5 7 10 13

9 7 11

3 5 11 13

are increasing subsequences.

are not increasing subsequences.

We want to find a longest one.

42

A naive approach for LIS• Let L[i] be the length of a longest increasing

subsequence ending at position i.L[i] = 1 + max j = 0..i-1{L[j] | aj < ai}(use a dummy a0 = minimum, and L[0]=0)

9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 ?

43

A naive approach for LIS

9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 4 5 6 3

L[i] = 1 + max j = 0..i-1 {L[j] | aj < ai}

The maximum length

The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence.

This method runs in O(n2) time.

44

Binary search• Given an ordered sequence x1x2 ... xn, where

x1<x2< ... <xn, and a number y, a binary search finds the largest xi such that xi< y in O(log n) time.

n ...

n/2n/4

45

Binary search• How many steps would a binary search

reduce the problem size to 1?n n/2 n/4 n/8 n/16 ... 1

How many steps? O(log n) steps.

nsn s

2log12/

46

An O(n log n) method for LIS• Define BestEnd[k] to be the smallest number of an

increasing subsequence of length k.9 2 5 3 7 11 8 10 13 69 2 2

523

237

23711

2378

2378

10

2378

1013

BestEnd[1]

BestEnd[2]

BestEnd[3]

BestEnd[4]

BestEnd[5]

BestEnd[6]

47

An O(n log n) method for LIS• Define BestEnd[k] to be the smallest number of an

increasing subsequence of length k.9 2 5 3 7 11 8 10 13 69 2 2

523

237

23711

2378

2378

10

2378

1013

23681013

BestEnd[1]

BestEnd[2]

BestEnd[3]

BestEnd[4]

BestEnd[5]

BestEnd[6]

For each position, we perform a binary search to update BestEnd. Therefore, the running time is O(n log n).

48

Longest Common Subsequence (LCS)

• A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent.

• The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.

49

LCSFor instance, Sequence 1: president Sequence 2: providence Its LCS is priden.

president

providence

50

LCSAnother example: Sequence 1: algorithm Sequence 2: alignmentOne of its LCS is algm.

a l g o r i t h m

a l i g n m e n t

51

How to compute LCS?• Let A=a1a2…am and B=b1b2…bn .• len(i, j): the length of an LCS between

a1a2…ai and b1b2…bj

• With proper initializations, len(i, j)can be computed as follows.

,. and 0, if)),1(),1,(max(

and 0, if1)1,1(,0or 0 if0

),(

ji

ji

bajijilenjilenbajijilen

jijilen

52

p r o c e d u r e L C S - L e n g t h ( A , B ) 1 . f o r i ← 0 t o m d o l e n ( i , 0 ) = 0 2 . f o r j ← 1 t o n d o l e n ( 0 , j ) = 0 3 . f o r i ← 1 t o m d o 4 . f o r j ← 1 t o n d o

5 . i f ji ba t h e n

" "),(

1)1,1(),(jiprev

jilenjilen

6 . e l s e i f )1,(),1( jilenjilen

7 . t h e n

" "),(

),1(),(jiprev

jilenjilen

8 . e l s e

" "),(

)1,(),(jiprev

jilenjilen

9 . r e t u r n l e n a n d p r e v

53

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e

0 0 0 0 0 0 0 0 0 0 0 0

1 p 2

0 1 1 1 1 1 1 1 1 1 1

2 r 0 1 2 2 2 2 2 2 2 2 2

3 e 0 1 2 2 2 2 2 3 3 3 3

4 s 0 1 2 2 2 2 2 3 3 3 3

5 i 0 1 2 2 2 3 3 3 3 3 3

6 d 0 1 2 2 2 3 4 4 4 4 4

7 e 0 1 2 2 2 3 4 5 5 5 5

8 n 0 1 2 2 2 3 4 5 6 6 6

9 t 0 1 2 2 2 3 4 5 6 6 6

54

p r o c e d u r e O u tp u t - L C S ( A , p r e v , i , j )

1 i f i = 0 o r j = 0 t h e n r e t u r n

2 i f p r e v ( i , j ) = ” “ t h e n

iajiprevALCSOutput

print )1,1,,(

3 e l s e i f p r e v ( i , j ) = ” “ t h e n O u tp u t - L C S ( A , p r e v , i - 1 , j ) 4 e l s e O u tp u t - L C S ( A , p r e v , i , j - 1 )

55

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e

0 0 0 0 0 0 0 0 0 0 0 0

1 p 2

0 1 1 1 1 1 1 1 1 1 1

2 r 0 1 2 2 2 2 2 2 2 2 2

3 e 0 1 2 2 2 2 2 3 3 3 3

4 s 0 1 2 2 2 2 2 3 3 3 3

5 i 0 1 2 2 2 3 3 3 3 3 3

6 d 0 1 2 2 2 3 4 4 4 4 4

7 e 0 1 2 2 2 3 4 5 5 5 5

8 n 0 1 2 2 2 3 4 5 6 6 6

9 t 0 1 2 2 2 3 4 5 6 6 6

Output: priden

56

Now try this example on your own

Sequence 1: algorithmSequence 2: alignmentTheir LCS?Recall that ...

a l g o r i t h m

a l i g n m e n t

Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

Documents

Transcript of Dynamic-Programming Strategies for Analyzing Biomolecular Sequences