Chapter 4: Dynamic Programming Techniques

Post on 13-Jan-2016


1

Chapter 4: Dynamic Programming Techniques

邹权 (Ph.D.), Department of Computer Science

2

4.1 Introduction

Fibonacci number F(n)

\[
F(n) = \begin{cases} 1 & \text{if } n = 0 \text{ or } n = 1 \\ F(n-1) + F(n-2) & \text{if } n > 1 \end{cases}
\]

n     0 1 2 3 4 5  6  7  8  9 10
F(n)  1 1 2 3 5 8 13 21 34 55 89

Pseudo code for the recursive algorithm:

F(n)
    if n = 0 or n = 1 then return 1
    else return F(n-1) + F(n-2)

3

The execution of F(7)

[Figure: the recursion tree for F(7), with root F7 and leaves F1 and F0.]

4

The execution of F(7)

Computation of F(2) is repeated 8 times!

[Figure: the recursion tree for F(7).]

5

The execution of F(7)

Computation of F(3) is also repeated 5 times!

[Figure: the recursion tree for F(7).]

6

The execution of F(7)

Many computations are repeated! How can we avoid this?

[Figure: the recursion tree for F(7).]

7

Idea for improvement

Memoization

Store F1(i) somewhere after we have computed its value.

Afterward, we don't need to re-compute F1(i); we can retrieve its value from our memory.

F1(n)
    if v[n] < 0 then
        v[n] ← F1(n-1) + F1(n-2)
    return v[n]

Main()
    v[0] ← 1; v[1] ← 1
    for i ← 2 to n do
        v[i] ← -1
    output F1(n)
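The F1/Main pair above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `fib_memo` and the nested helper `f1` are assumptions chosen for this example, and the table `v` uses -1 as the "not yet computed" marker exactly as in the pseudocode.

```python
def fib_memo(n: int) -> int:
    """Memoized Fibonacci with F(0) = F(1) = 1, mirroring F1 and Main."""
    # v[i] < 0 means "F(i) has not been computed yet".
    v = [1, 1] + [-1] * max(0, n - 1)

    def f1(i: int) -> int:
        if v[i] < 0:
            # First request for F(i): compute it once and store it.
            v[i] = f1(i - 1) + f1(i - 2)
        return v[i]

    return f1(n)


print(fib_memo(7))
```

Each table entry is filled at most once, so the number of additions is linear in n, although the recursion depth still grows with n.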

8

Look at the execution of F(7)

(In the recursion tree, F(i) is written Fi.)

Contents of the table v after each new value is stored:

             v[0] v[1] v[2] v[3] v[4] v[5] v[6] v[7]
initially      1    1   -1   -1   -1   -1   -1   -1
after F(2)     1    1    2   -1   -1   -1   -1   -1
after F(3)     1    1    2    3   -1   -1   -1   -1
after F(4)     1    1    2    3    5   -1   -1   -1
after F(5)     1    1    2    3    5    8   -1   -1
after F(6)     1    1    2    3    5    8   13   -1
after F(7)     1    1    2    3    5    8   13   21

[Figure: the recursion tree for F(7), redrawn after each step; once v[i] is stored, the repeated subtrees rooted at Fi disappear from the tree.]

21

This new implementation saves a lot of overhead.

Can we do even better?

Observation
• The 2nd version still makes many function calls, and each wastes time on parameter passing, dynamic linking, ...
• In general, to compute F(i), we need F(i-1) and F(i-2) only.

Idea to further improve
• Compute the values in bottom-up fashion. That is, compute F(2) (we already know F(0) = F(1) = 1), then F(3), then F(4), ...

F2(n)
    F[0] ← 1
    F[1] ← 1
    for i ← 2 to n do
        F[i] ← F[i-1] + F[i-2]
    return F[n]
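A direct Python rendering of F2 might look like the sketch below. The name `fib_bottom_up` is an assumption for this example; the table `f` plays the role of the array F in the pseudocode.

```python
def fib_bottom_up(n: int) -> int:
    """Bottom-up Fibonacci with F(0) = F(1) = 1, following F2."""
    # Allocate at least two cells so f[0] and f[1] always exist.
    f = [1] * max(n + 1, 2)
    for i in range(2, n + 1):
        # Both smaller sub-problems are already in the table.
        f[i] = f[i - 1] + f[i - 2]
    return f[n]


print([fib_bottom_up(k) for k in range(11)])
```

There are no recursive calls at all here; the loop fills the table once from the smallest index upward.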

22

Recursive vs Dynamic programming

Recursive version (too slow!):

F(n)
    if n = 0 or n = 1 then return 1
    else return F(n-1) + F(n-2)

Dynamic Programming version (efficient! Time complexity is O(n)):

F2(n)
    F[0] ← 1
    F[1] ← 1
    for i ← 2 to n do
        F[i] ← F[i-1] + F[i-2]
    return F[n]

23

Summary of the methodology

• Write down a formula that relates the solution of a problem to those of its sub-problems, e.g. F(n) = F(n-1) + F(n-2).
• Index the sub-problems so that they can be stored and retrieved easily in a table (i.e., an array).
• Fill the table in a bottom-up manner, starting with the solution of the smallest problem. This ensures that when we solve a particular sub-problem, the solutions of all the smaller sub-problems it depends on are available.

For historical reasons, we call this methodology Dynamic Programming. In the late 1940s (when computers were rare), "programming" referred to the "tabular method".

24

Dynamic programming vs. divide-and-conquer

Divide-and-conquer method
1. Subproblems are independent.
2. A subproblem that recurs is solved again each time.

Dynamic programming (DP)
1. Subproblems are not independent.
2. Each subproblem is solved just once.

Common: the problem is partitioned into one or more subproblems, and the solutions of the subproblems are then combined.

DP reduces computation by
• Solving subproblems in a bottom-up fashion.
• Storing the solution to a subproblem the first time it is solved.
• Looking up the solution when the subproblem is met again.

25

26

4.2 Assembly-line scheduling

Si[j]: the j-th station on line i (i = 1 or 2).
ai[j]: the assembly time required at station Si[j].
ei: entry time onto line i.
xi: exit time from line i.
ti[j]: the time to transfer from line i to the other line after station Si[j].

[Figure: two assembly lines between Begin and End. Line i has entry time ei, stations Si[1], ..., Si[n] with assembly times ai[1], ..., ai[n], transfer times ti[1], ..., ti[n-1] between consecutive stations, and exit time xi.]

27

One Solution

Brute force
• Enumerate all possibilities of selecting stations.
• Compute how long it takes in each case and choose the best one.

Problem:
• There are 2^n possible ways to choose stations.
• Infeasible when n is large.

A choice can be encoded as a vector with one entry per step j = 1, 2, ..., n: 1 if line 1 is chosen at step j, 0 if line 2 is chosen at step j. For example: 1 0 0 1 1.

28

fi[j] = the fastest time to get from the starting point through station Si[j]

Let f * denote fastest time to get a chassis all the way through the factory.

j = 1 (getting through station 1)

f1[1] = e1 + a1[1]

f2[1] = e2 + a2[1]

f * = min( f1[n]+x1 , f2[n]+x2 )

[Figure: the two assembly lines with entry times e1, e2, assembly times ai[j], transfer times ti[j], and exit times x1, x2.]

29

2. A Recursive Solution (cont.)

Compute fi[j] for j = 2, 3, ..., n, and i = 1, 2.

The fastest way through S1[j] is either:

• the fastest way through S1[j-1], then directly through S1[j]:
  f1[j] = f1[j-1] + a1[j]

• or the fastest way through S2[j-1], a transfer from line 2 to line 1, then through S1[j]:
  f1[j] = f2[j-1] + t2[j-1] + a1[j]

f1[j] = min(f1[j-1] + a1[j], f2[j-1] + t2[j-1] + a1[j])

[Figure: stations S1[j-1], S1[j], and S2[j-1] with times a1[j-1], a1[j], a2[j-1] and the transfer time t2[j-1].]

30

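The recursive equation

\[
f_1[j] = \begin{cases} e_1 + a_1[1] & \text{if } j = 1 \\ \min(f_1[j-1] + a_1[j],\ f_2[j-1] + t_2[j-1] + a_1[j]) & \text{if } j \ge 2 \end{cases}
\]

\[
f_2[j] = \begin{cases} e_2 + a_2[1] & \text{if } j = 1 \\ \min(f_2[j-1] + a_2[j],\ f_1[j-1] + t_1[j-1] + a_2[j]) & \text{if } j \ge 2 \end{cases}
\]

For example, n = 5:

[Figure: the top-down recursion over f1[j] and f2[j] for j = 5 down to 1; f1[4] and f2[4] are each needed 2 times, f1[3] and f2[3] 4 times, and the counts keep doubling toward j = 1.]

Solving top-down would result in exponential running time.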

31

3. Computing the Optimal Solution

• For j ≥ 2, each value fi[j] depends only on the values of f1[j-1] and f2[j-1].
• Compute the values of fi[j] in increasing order of j.

Bottom-up approach
• First find optimal solutions to subproblems.
• Find an optimal solution to the problem from the subproblems.

[Figure: the table with rows f1[j] and f2[j] and columns j = 1, ..., 5, filled in order of increasing j.]

32

Additional Information

To construct the optimal solution we need the sequence of lines used at each station:

• li[j]: the line number (1 or 2) whose station j-1 has been used to get in the fastest time through Si[j], j = 2, 3, ..., n
• l*: the line whose station n is used to get in the fastest way through the entire factory

[Figure: the table with rows l1[j] and l2[j] and columns j = 2, ..., 5, filled in order of increasing j.]

33

Step 3: Computing the fastest way

DPFastestWay(a, t, e, x, n)
    f1[1] ← e1 + a1[1]; f2[1] ← e2 + a2[1]
    for j ← 2 to n do
        if f1[j-1] + a1[j] ≤ f2[j-1] + t2[j-1] + a1[j] then
            f1[j] ← f1[j-1] + a1[j]
            l1[j] ← 1
        else
            f1[j] ← f2[j-1] + t2[j-1] + a1[j]
            l1[j] ← 2
        if f2[j-1] + a2[j] ≤ f1[j-1] + t1[j-1] + a2[j] then
            f2[j] ← f2[j-1] + a2[j]
            l2[j] ← 2
        else
            f2[j] ← f1[j-1] + t1[j-1] + a2[j]
            l2[j] ← 1
    if f1[n] + x1 ≤ f2[n] + x2 then
        f* ← f1[n] + x1
        l* ← 1
    else
        f* ← f2[n] + x2
        l* ← 2

Running time: Θ(n)
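The procedure above, together with the walk back through l, can be sketched in Python. This is an illustration under stated assumptions: the function name `fastest_way`, the 0-based list layout, and the returned 1-based line list are choices made for this example, and the instance in the usage lines is the six-station factory as read from the figure of this chapter's worked example.

```python
def fastest_way(a, t, e, x):
    """a[i][j]: time at station j of line i; t[i][j]: transfer time away
    from line i after station j; e[i], x[i]: entry and exit times.
    All indices are 0-based here (the slides use 1-based indices)."""
    n = len(a[0])
    f = [[0] * n for _ in range(2)]
    l = [[0] * n for _ in range(2)]  # l[i][j]: line used at station j-1
    for i in (0, 1):
        f[i][0] = e[i] + a[i][0]
    for j in range(1, n):
        for i in (0, 1):
            other = 1 - i
            stay = f[i][j - 1] + a[i][j]
            switch = f[other][j - 1] + t[other][j - 1] + a[i][j]
            # Ties go to staying on the same line, as in the pseudocode.
            if stay <= switch:
                f[i][j], l[i][j] = stay, i
            else:
                f[i][j], l[i][j] = switch, other
    if f[0][n - 1] + x[0] <= f[1][n - 1] + x[1]:
        best, line = f[0][n - 1] + x[0], 0
    else:
        best, line = f[1][n - 1] + x[1], 1
    # Walk back from station n to station 1, as PrintStations does.
    lines = [line]
    for j in range(n - 1, 0, -1):
        line = l[line][j]
        lines.append(line)
    lines.reverse()
    return best, [k + 1 for k in lines]  # report 1-based line numbers


a = [[7, 9, 3, 4, 8, 4], [8, 5, 6, 4, 5, 7]]
t = [[2, 3, 1, 3, 4], [2, 1, 2, 2, 1]]
best, lines = fastest_way(a, t, e=[2, 4], x=[3, 2])
print(best, lines)
```

The returned list gives the line used at stations 1 through n in order, which is the reverse of the order in which the backward walk discovers them.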

34

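For example

[Figure: a factory with n = 6 stations per line between Begin and End. Values read from the figure:]

e1 = 2, e2 = 4, x1 = 3, x2 = 2

j        1   2   3   4   5   6
a1[j]    7   9   3   4   8   4
a2[j]    8   5   6   4   5   7
t1[j]    2   3   1   3   4
t2[j]    2   1   2   2   1

j        1   2   3   4   5   6
f1[j]    9  18  20  24  32  35
f2[j]   12  16  22  25  30  37

f* = 38

j        2   3   4   5   6
l1[j]    1   2   1   1   2
l2[j]    1   2   1   2   2

l* = 1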

35

4. Construct an Optimal Solution

PrintStations(l, n)
    i ← l*
    print “line ” i “, station ” n
    for j ← n downto 2 do
        i ← li[j]
        print “line ” i “, station ” j - 1

Output for the example (l* = 1):

line 1, station 6
line 2, station 5
line 2, station 4
line 1, station 3
line 2, station 2
line 1, station 1

36

[Figure: the example factory with the fastest route through it.]

The fastest assembly way: f* = 38

37

4.3 Matrix-chain multiplication

Problem:

Given a chain A1, A2, ..., An of n matrices, where for i = 1, 2, ..., n matrix Ai has dimension p_{i-1} × p_i, fully parenthesize the product A1 A2 ... An in a way that minimizes the number of scalar multiplications.

Matrix      A1        A2        ...  Ai               Ai+1             ...  An
Dimension   p0 × p1   p1 × p2   ...  p_{i-1} × p_i    p_i × p_{i+1}    ...  p_{n-1} × p_n

A product of matrices is fully parenthesized if it is either a single matrix or the product of two fully parenthesized matrix products, surrounded by parentheses.

38

For example, if the chain of matrices is <A1, A2, A3, A4>, the product A1 A2 A3 A4 can be fully parenthesized in five distinct ways:

(A1 (A2 (A3 A4))) ,

(A1 ((A2 A3) A4)) ,

((A1 A2) (A3 A4)) ,

((A1 (A2 A3)) A4) ,

(((A1 A2) A3) A4).

39

MatrixMultiply(A, B)    (A is n × m, B is p × q)
    if m ≠ p then
        print “Two matrices cannot multiply”
    else for i ← 1 to n do
        for j ← 1 to q do
            C[i, j] ← 0
            for k ← 1 to m do
                C[i, j] ← C[i, j] + A[i, k] · B[k, j]
    return C

Running time: Θ(nmq)

40

Consider the problem of a chain A1, A2, A3 of three matrices. Suppose that the dimensions of the matrices are 10×100, 100×5, and 5×50, respectively.

1. ((A1A2) A3): A1A2 = 10×100×5 = 5,000 (10×5)

((A1 A2) A3) = 10×5×50 = 2,500

Total: 7,500 scalar multiplications

2. (A1 (A2A3)): A2A3 = 100×5×50 = 25,000 (100×50)

(A1 (A2 A3)) = 10×100×50 = 50,000

Total: 75,000 scalar multiplications

one order of magnitude difference!!

41

1. The structure of an optimal parenthesization

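Notation:

\[
A_{i \ldots j} = A_i A_{i+1} \cdots A_j, \quad i \le j
\]

For i < j:

\[
A_{i \ldots j} = A_i A_{i+1} \cdots A_j = A_i A_{i+1} \cdots A_k \, A_{k+1} \cdots A_j = A_{i \ldots k} \, A_{k+1 \ldots j}
\]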

42

2. A recursive solution

Subproblem:

determine the minimum cost of parenthesizing Ai…j = Ai Ai+1 ... Aj for 1 ≤ i ≤ j ≤ n

Let m[i, j] = the minimum number of multiplications needed to compute Ai…j

Full problem (A1..n): m[1, n]

43

2. A Recursive Solution (cont.)

Consider the subproblem of parenthesizing

Ai…j = Ai Ai+1 ... Aj for 1 ≤ i ≤ j ≤ n
     = Ai…k Ak+1…j for i ≤ k < j

m[i, j] = the minimum number of multiplications needed to compute the product Ai…j

m[i, j] = m[i, k] + m[k+1, j] + p_{i-1} p_k p_j

• m[i, k]: min # of multiplications to compute Ai…k
• m[k+1, j]: min # of multiplications to compute Ak+1…j
• p_{i-1} p_k p_j: # of multiplications to compute the product Ai…k Ak+1…j

44

2. A Recursive Solution (cont.)

m[i, j] = m[i, k] + m[k+1, j] + p_{i-1} p_k p_j

• We do not know the value of k.
• There are j - i possible values for k: k = i, i+1, ..., j-1.

Minimizing the cost of parenthesizing the product Ai Ai+1 ... Aj becomes:

\[
m[i,j] = \begin{cases} 0 & \text{if } i = j \\ \min_{i \le k < j} \{ m[i,k] + m[k+1,j] + p_{i-1} p_k p_j \} & \text{if } i < j \end{cases}
\]

45

3. Computing the Optimal Costs

How many subproblems do we have?
• Parenthesize Ai…j for 1 ≤ i ≤ j ≤ n
• One problem for each choice of i and j: Θ(n^2) subproblems

A recursive algorithm may encounter each subproblem many times in different branches of the recursion: overlapping subproblems.

Compute a solution using a tabular bottom-up approach.

\[
m[i,j] = \begin{cases} 0 & \text{if } i = j \\ \min_{i \le k < j} \{ m[i,k] + m[k+1,j] + p_{i-1} p_k p_j \} & \text{if } i < j \end{cases}
\]

46

Reconstructing the Optimal Solution

Additional information to maintain:

s[i, j] = a value of k at which we can split the product Ai Ai+1 ... Aj in order to obtain an optimal parenthesization

47

3. Computing the Optimal Costs (cont.)

How do we fill in the tables m[1..n, 1..n] and s[1..n, 1..n]?
• Determine which entries of the table are used in computing m[i, j].
• m[i, j] = cost of computing a product of j - i + 1 matrices.
• m[i, j] depends only on costs for products of fewer than j - i + 1 matrices.

Ai…j = Ai…k Ak+1…j

Fill in m such that it corresponds to solving problems of increasing length.

48

3. Computing the Optimal Costs (cont.)

Length = 0: i = j, i = 1, 2, ..., n
Length = 1: j = i + 1, i = 1, 2, ..., n-1

[Figure: the n × n table m with axes i and j; the entries filled first (Length = 0) and second (Length = 1) are marked.]

Compute rows from bottom to top and from left to right. In a similar matrix s we keep the optimal values of k.

m[1, n] gives the optimal solution to the problem.

50

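Example: m[i, j] = min over i ≤ k < j of {m[i, k] + m[k+1, j] + p_{i-1} p_k p_j}

\[
m[2,5] = \min \begin{cases} m[2,2] + m[3,5] + p_1 p_2 p_5 & (k = 2) \\ m[2,3] + m[4,5] + p_1 p_3 p_5 & (k = 3) \\ m[2,4] + m[5,5] + p_1 p_4 p_5 & (k = 4) \end{cases}
\]

[Figure: the 6 × 6 table m with axes i and j, showing the entries that m[2, 5] depends on.]

Values m[i, j] depend only on values that have been previously computed.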

51

Example: Compute A1A2A3

A1: 10×100 (p0×p1)
A2: 100×5 (p1×p2)
A3: 5×50 (p2×p3)

m[i, i] = 0 for i = 1, 2, 3

m[1, 2] = m[1, 1] + m[2, 2] + p0p1p2 (A1A2) = 0 + 0 + 10×100×5 = 5,000

m[2, 3] = m[2, 2] + m[3, 3] + p1p2p3 (A2A3) = 0 + 0 + 100×5×50 = 25,000

m[1, 3] = min of
• m[1, 1] + m[2, 3] + p0p1p3 = 75,000 (A1(A2A3))
• m[1, 2] + m[3, 3] + p0p2p3 = 7,500 ((A1A2)A3)
so m[1, 3] = 7,500.

Table of m[i, j], with the optimal split k in parentheses:

        j=1    j=2          j=3
i=1     0      5,000 (1)    7,500 (2)
i=2            0            25,000 (2)
i=3                         0
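The bottom-up filling of m and s by increasing chain length can be sketched in Python as below. The name `matrix_chain_order` and the 1-based table layout (row and column 0 unused) are assumptions made for this example; `p` is the dimension list p0, p1, ..., pn from the problem statement.

```python
import math


def matrix_chain_order(p):
    """Return (m, s) for matrices A1..An where Ai is p[i-1] x p[i].
    m[i][j] is the minimum number of scalar multiplications for Ai..j
    and s[i][j] is an optimal split point k (tables are 1-based)."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):          # number of matrices in the chain
        for i in range(1, n - length + 2):
            j = i + length - 1
            m[i][j] = math.inf
            for k in range(i, j):           # try every split Ai..k, Ak+1..j
                cost = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                if cost < m[i][j]:
                    m[i][j], s[i][j] = cost, k
    return m, s


m, s = matrix_chain_order([10, 100, 5, 50])
print(m[1][3], s[1][3])
```

Because shorter chains are always finished before longer ones, every m[i][k] and m[k+1][j] read in the inner loop has already been computed.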

52

4. Construct the Optimal Solution

Store the optimal choice made at each subproblem:

s[i, j] = a value of k such that an optimal parenthesization of Ai..j splits the product between Ak and Ak+1

s[1, n] is associated with the entire product A1..n

The final matrix multiplication will be split at k = s[1, n]:

A1..n = A1..s[1, n] As[1, n]+1..n

For each subproduct, recursively find the corresponding value of k that results in an optimal parenthesization.

53

4. Construct the Optimal Solution (cont.)

s[i, j] = value of k such that the optimal parenthesization of Ai Ai+1 ... Aj splits the product between Ak and Ak+1

Table s[i, j] for a chain of n = 6 matrices:

        j=1  j=2  j=3  j=4  j=5  j=6
i=1      -    1    1    3    3    3
i=2           -    2    3    3    3
i=3                -    3    3    3
i=4                     -    4    5
i=5                          -    5
i=6                               -

A1..n = A1..s[1, n] As[1, n]+1..n

s[1, n] = 3: A1..6 = A1..3 A4..6
s[1, 3] = 1: A1..3 = A1..1 A2..3
s[4, 6] = 5: A4..6 = A4..5 A6..6

54

4. Construct the Optimal Solution (cont.)

[Figure: the table s[1..6, 1..6] from the previous slide.]

PrintParens(s, i, j)
    if i = j then print “A”i
    else print “(”
        PrintParens(s, i, s[i, j])
        PrintParens(s, s[i, j] + 1, j)
        print “)”
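As a small illustration, the recursion can be written in Python so that it returns the parenthesization as a string instead of printing it piece by piece. The function name `parenthesize` and the dictionary layout of `s` are assumptions for this sketch; the split values are the entries of the six-matrix s table shown in this chapter that the recursion actually visits.

```python
def parenthesize(s, i, j):
    """Build the optimal parenthesization of Ai..Aj from split table s,
    where s[(i, j)] is the split point k chosen for the chain Ai..Aj."""
    if i == j:
        return f"A{i}"
    k = s[(i, j)]
    # Left factor Ai..Ak, right factor Ak+1..Aj, wrapped in parentheses.
    return "(" + parenthesize(s, i, k) + parenthesize(s, k + 1, j) + ")"


# Only the entries reached from (1, 6) are needed.
s = {(1, 6): 3, (1, 3): 1, (2, 3): 2, (4, 6): 5, (4, 5): 4}
print(parenthesize(s, 1, 6))  # ((A1(A2A3))((A4A5)A6))
```

Each call either emits one matrix name or one pair of parentheses, so the work is linear in the number of matrices.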

55

Example: A1 ... A6

[Figure: the table s[1..6, 1..6].]

PrintOPTParens(s, i, j)    (abbreviated POP below)
    if i = j then print “A”i
    else print “(”
        PrintOPTParens(s, i, s[i, j])
        PrintOPTParens(s, s[i, j] + 1, j)
        print “)”

Trace of POP(s, 1, 6):

POP(s, 1, 6): i = 1, j = 6, s[1, 6] = 3, print “(”
    POP(s, 1, 3): i = 1, j = 3, s[1, 3] = 1, print “(”
        POP(s, 1, 1): print “A1”
        POP(s, 2, 3): i = 2, j = 3, s[2, 3] = 2, print “(”
            POP(s, 2, 2): print “A2”
            POP(s, 3, 3): print “A3”
            print “)”
        print “)”
    POP(s, 4, 6) proceeds in the same way and prints ((A4A5)A6)
    print “)”

Result: ((A1(A2A3))((A4A5)A6))

56

4.4 The longest-common-subsequence problem

• Subsequence
• Common subsequence
• Maximum-length common subsequence

Problem: Given two sequences Xm = <x1, x2, ..., xm> and Yn = <y1, y2, ..., yn>, find a maximum-length common subsequence of Xm and Yn.

E.g.:

X = <A, B, C, B, D, A, B>

Subsequences of X: a subset of elements in the sequence taken in order, e.g.

<A, B, D>, <B, C, D, B>, etc.

57

Example

X = <A, B, C, B, D, A, B>

Y = <B, D, C, A, B, A>

<B, C, B, A> and <B, D, A, B> are longest common subsequences of X and Y (length = 4).

<B, C, A>, however, is a common subsequence but not an LCS of X and Y.

58

Brute-Force Solution

For every subsequence of Xm, check whether it's a subsequence of Yn.

• There are 2^m subsequences of Xm to check.
• Each subsequence takes Θ(n) time to check: scan Yn for the first letter, from there scan for the second, and so on.

Running time: Θ(n 2^m)

59

Notations

Given a sequence Xm = <x1, x2, ..., xm>, we define the i-th prefix of Xm, for i = 0, 1, 2, ..., m, as

Xi = <x1, x2, ..., xi>

c[i, j] = the length of an LCS of the sequences Xi = <x1, x2, ..., xi> and Yj = <y1, y2, ..., yj>

60

Step 1: Optimal substructure

Theorem (Optimal substructure of an LCS)

Let Xm = <x1, x2, ..., xm> and Yn = <y1, y2, ..., yn> be sequences, and let Zk = <z1, z2, ..., zk> be some LCS of Xm and Yn.

1. If xm = yn, then zk = xm = yn and Zk-1 is an LCS of Xm-1 and Yn-1.

2. If xm ≠ yn and zk ≠ xm, then Zk is an LCS of Xm-1 and Yn.

3. If xm ≠ yn and zk ≠ yn, then Zk is an LCS of Xm and Yn-1.

61

Step 2: A recursive solution

Case 1: xi = yj

e.g.: Xi = <A, B, D, E>
      Yj = <Z, B, E>

• Append xi = yj to the LCS of Xi-1 and Yj-1.
• Must find an LCS of Xi-1 and Yj-1: the optimal solution to a problem includes optimal solutions to subproblems.

\[
c[i,j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ c[i-1,j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j \\ \max(c[i,j-1],\ c[i-1,j]) & \text{if } i, j > 0 \text{ and } x_i \ne y_j \end{cases}
\]

62

A Recursive Solution

Case 2: xi ≠ yj

e.g.: Xi = <A, B, D, G>
      Yj = <Z, B, D>

Must solve two problems:
• find an LCS of Xi-1 and Yj: Xi-1 = <A, B, D> and Yj = <Z, B, D>
• find an LCS of Xi and Yj-1: Xi = <A, B, D, G> and Yj-1 = <Z, B>

The optimal solution to a problem includes optimal solutions to subproblems.

c[i, j] = max { c[i - 1, j], c[i, j-1] }

63

Overlapping Subproblems

To find an LCS of Xm and Yn:

• We may need to find the LCS of Xm and Yn-1 and that of Xm-1 and Yn.
• Both of the above subproblems contain the subproblem of finding the LCS of Xm-1 and Yn-1.

Subproblems share subsubproblems.

64

Step 3: Computing the Length of the LCS

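\[
c[i,j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ c[i-1,j-1] + 1 & \text{if } x_i = y_j \\ \max(c[i,j-1],\ c[i-1,j]) & \text{if } x_i \ne y_j \end{cases}
\]

[Figure: the table c[i, j] with rows i = 0, 1, ..., m labelled x1, ..., xm and columns j = 0, 1, ..., n labelled y1, ..., yn. Row 0 and column 0 are filled with zeros; the rows are then filled in order (first, second, ...).]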

65

Additional Information

\[
c[i,j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ c[i-1,j-1] + 1 & \text{if } x_i = y_j \\ \max(c[i,j-1],\ c[i-1,j]) & \text{if } x_i \ne y_j \end{cases}
\]

A matrix b[i, j]: it tells us what choice was made to obtain the optimal value.

If xi = yj
    b[i, j] = “↖”
Else, if c[i - 1, j] ≥ c[i, j-1]
    b[i, j] = “↑”
else
    b[i, j] = “←”

[Figure: the tables b and c drawn together; the entry c[i, j] is shown with its neighbours c[i-1, j] (above) and c[i, j-1] (to the left).]

66

LCSLength(X, Y, m, n)
    for i ← 1 to m do c[i, 0] ← 0
    for j ← 0 to n do c[0, j] ← 0
    for i ← 1 to m do
        for j ← 1 to n do
            if xi = yj then
                c[i, j] ← c[i - 1, j - 1] + 1
                b[i, j] ← “↖”
            else if c[i - 1, j] ≥ c[i, j - 1] then
                c[i, j] ← c[i - 1, j]
                b[i, j] ← “↑”
            else
                c[i, j] ← c[i, j - 1]
                b[i, j] ← “←”
    return c and b

Running time: Θ(mn)
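A compact Python sketch of this computation follows. It is an illustration rather than a transcription: the function name `lcs` is an assumption, and instead of storing the arrow table b it re-derives each arrow from c while walking back, using the same tie-breaking rule (prefer “↑” when c[i-1, j] ≥ c[i, j-1]).

```python
def lcs(x, y):
    """Return (length, one LCS) of sequences x and y."""
    m, n = len(x), len(y)
    # c[i][j] = length of an LCS of x[:i] and y[:j]; row 0 and column 0 are 0.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Walk back from (m, n), collecting matched characters in reverse.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:        # diagonal arrow
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:  # up arrow
            i -= 1
        else:                            # left arrow
            j -= 1
    return c[m][n], "".join(reversed(out))


print(lcs("ABCBDAB", "BDCABA"))  # (4, 'BCBA')
```

The filling loop takes Θ(mn) time and the walk back takes at most m + n steps.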

67

Example

X = <A, B, C, B, D, A, B>
Y = <B, D, C, A, B, A>

\[
c[i,j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ c[i-1,j-1] + 1 & \text{if } x_i = y_j \\ \max(c[i,j-1],\ c[i-1,j]) & \text{if } x_i \ne y_j \end{cases}
\]

if xi = yj then
    b[i, j] = “↖”
else if c[i - 1, j] ≥ c[i, j-1] then
    b[i, j] = “↑”
else
    b[i, j] = “←”

Table c[i, j]:

         j    0   1   2   3   4   5   6
 i       yj       B   D   C   A   B   A
 0  xi        0   0   0   0   0   0   0
 1  A         0   0   0   0   1   1   1
 2  B         0   1   1   1   1   2   2
 3  C         0   1   1   2   2   2   2
 4  B         0   1   1   2   2   3   3
 5  D         0   1   2   2   2   3   3
 6  A         0   1   2   2   3   3   4
 7  B         0   1   2   2   3   4   4

68

4. Constructing an LCS

• Start at b[m, n] and follow the arrows.
• When we encounter a “↖” in b[i, j], xi = yj is an element of the LCS.

[Figure: the example table c[i, j] for X = <A, B, C, B, D, A, B> and Y = <B, D, C, A, B, A>, with the arrows followed from b[7, 6] back to the border.]

69

Step 4: Constructing an LCS

PrintLCS(b, X, i, j)
    if i = 0 or j = 0 then return
    if b[i, j] = “↖” then
        PrintLCS(b, X, i - 1, j - 1)
        print xi
    else if b[i, j] = “↑” then
        PrintLCS(b, X, i - 1, j)
    else PrintLCS(b, X, i, j - 1)

71

Improving the Code

What can we say about how each entry c[i, j] is computed?
• It depends only on c[i-1, j-1], c[i-1, j], and c[i, j-1].
• Eliminate table b and compute in O(1) which of the three values was used to compute c[i, j].
• We save Θ(mn) space from table b.
• However, we do not asymptotically decrease the auxiliary space requirements: we still need table c.

If we only need the length of the LCS:
• LCSLength works only on two rows of c at a time: the row being computed and the previous row.
• We can reduce the asymptotic space requirements by storing only these two rows.
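The two-row idea can be sketched in Python as follows. The name `lcs_length_two_rows` is an assumption for this example; `prev` holds row i-1 of c and `cur` holds the row being computed.

```python
def lcs_length_two_rows(x, y):
    """Length of an LCS of x and y using only two rows of the c table."""
    prev = [0] * (len(y) + 1)           # row 0 of c: all zeros
    for xi in x:
        cur = [0]                       # column 0 of the current row
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur                      # the current row becomes the previous one
    return prev[-1]


print(lcs_length_two_rows("ABCBDAB", "BDCABA"))  # 4
```

The space drops from Θ(mn) to Θ(n), at the price of no longer being able to walk back through the table to recover the subsequence itself.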

72

An Interesting Example before this chapter

Answers for English Exam

73

4.5 sequence comparison

why compare sequences?

sequence comparison: operation consisting of finding which parts of the sequences are alike and which parts differ / Algorithms for an efficient solution

74

[Figure: an alignment of two DNA sequences, shown in three blocks; in the figure, "|" joins matching characters and "." marks a space.]

TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT

75

Two notions

• Similarity: a measure of how similar two sequences are

• Alignment: a basic operation to compare two sequences, a way of placing one sequence above the other in order to make clear the correspondence between similar characters or substrings from the sequences.

76

Four cases, task and application

• (1) Two sequences of tens of thousands (10,000s) of characters that differ only by isolated insertions, deletions, and substitutions, as rarely as one per hundred (100) characters. Task: find the differences. Application: comparing two different sequencings.

• (2) Two sequences of a few hundred (100s) characters. Task: decide whether a prefix of one is similar to a suffix of the other; if so, the prefix and suffix are produced.

• (3) Two sequences. Task: decide whether one is a subsequence of the other. Application: searching for a certain sequence pattern.

• (4) Two sequences. Task: decide whether there are two similar substrings, one from each sequence. Application: analyzing conserved sequences.

77

comparing two sequences

alignments involving:
• global comparisons: entire sequences
• local comparisons: just substrings of sequences
• semiglobal comparisons: prefixes and suffixes

dynamic programming (DP)

78

global comparison- example

example of aligning GACGGATTAG and GATCGGAATAG:

GA-CGGATTAG
GATCGGAATAG

• an extra T
• a change from A to T
• space: shown as a dash

79

global comparison- the basic algorithm

Definitions

Alignment:
• insertion of spaces so that both sequences have the same size
• creating a correspondence: one sequence placed over the other
• a column with a space in both sequences is not allowed
• (spaces can be inserted at the beginning or end)

Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps):
• a match: +1 / identical characters
• a mismatch: -1 / distinct characters
• a space: -2
• Scoring system: rewards matches and penalizes mismatches and spaces

80

global comparison- the basic algorithm

GA-CGGATTAG
GATCGGAATAG

Example: total score is 6 (9 matches, 1 mismatch, 1 space: 9×1 + 1×(-1) + 1×(-2) = 6)

similarity sim(s, t):
• the maximum alignment score; there may be many alignments with this score

best alignment:
• an alignment whose score equals the similarity

81

global comparison- the basic algorithm

Basic DP algorithm for comparison of two sequences
• The number of alignments between two sequences is exponential.
• Efficient algorithm: DP over prefixes, from shorter to larger.

Idea:
• an (m+1)×(n+1) array: entry (i, j) is the similarity between s[1..i] and t[1..j]
• p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j]

\[
sim(s[1..i], t[1..j]) = \max \begin{cases} sim(s[1..i], t[1..j-1]) - 2 \\ sim(s[1..i-1], t[1..j-1]) + p(i,j) \\ sim(s[1..i-1], t[1..j]) - 2 \end{cases}
\]

\[
a[i,j] = \max \begin{cases} a[i,j-1] - 2 \\ a[i-1,j-1] + p(i,j) \\ a[i-1,j] - 2 \end{cases}
\]

82

global comparison- the basic algorithm

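Array a for s = AAAC (rows) and t = AGC (columns):

         j    0    1    2    3
 i            -    A    G    C
 0  -         0   -2   -4   -6
 1  A        -2    1   -1   -3
 2  A        -4   -1    0   -2
 3  A        -6   -3   -2   -1
 4  C        -8   -5   -4   -1

[Figure: in each cell the figure also shows p(i, j), +1 or -1, in the upper left corner.]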

83

global comparison- the basic algorithm

A good computing order:
• row by row: left to right on each row
• column by column: top to bottom on each column
• any other order that makes sure a[i, j-1], a[i-1, j], and a[i-1, j-1] are available when a[i, j] must be computed

Notes:
• parameter g: specifies the space penalty (usually g < 0); here g = -2
• scoring function p for pairs of characters: p(a, b) = 1 if a = b, and p(a, b) = -1 if a ≠ b

84

global comparison- the basic algorithm

Algorithm Similarity
input: sequences s and t
output: similarity between s and t
    m ← |s|
    n ← |t|
    for i ← 0 to m do
        a[i, 0] ← i × g
    for j ← 0 to n do
        a[0, j] ← j × g
    for i ← 1 to m do
        for j ← 1 to n do
            a[i, j] ← max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)
    return a[m, n]
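Algorithm Similarity translates almost line for line into Python. In the sketch below the function name `similarity` and the keyword parameters `match`, `mismatch`, and `g` are assumptions for this example; their defaults are the scoring values used in this section (+1, -1, and a space penalty of -2).

```python
def similarity(s, t, match=1, mismatch=-1, g=-2):
    """Global similarity sim(s, t) by the basic DP algorithm."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g                 # s[1..i] against the empty prefix of t
    for j in range(n + 1):
        a[0][j] = j * g                 # empty prefix of s against t[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g,      # space in s
                          a[i - 1][j - 1] + p,  # s[i] over t[j]
                          a[i - 1][j] + g)      # space in t
    return a[m][n]


print(similarity("GACGGATTAG", "GATCGGAATAG"))  # 6
print(similarity("AAAC", "AGC"))                # -1
```

The first call reproduces the score of the example alignment of GACGGATTAG and GATCGGAATAG, and the second reproduces the bottom-right entry of the AAAC / AGC array.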

85

optimal alignments

How do we construct an optimal alignment between two sequences, given their similarity?

Idea of Algorithm Align
• All we need to do is start at entry (m, n) and follow the arrows until we get to (0, 0).
• An optimal alignment can be easily constructed from right to left if we have the matrix a computed by the basic algorithm.
• The variables align-s and align-t are treated as globals in the code.
• The call Align(m, n, len) will construct an optimal alignment.
• Note: max(|s|, |t|) ≤ len ≤ m + n

86

Recursive algorithm for optimal alignment

Algorithm Align
input: indices i, j, array a given by algorithm Similarity
output: alignment in align-s, align-t, and length in len
    if i = 0 and j = 0 then
        len ← 0
    else if i > 0 and a[i, j] = a[i-1, j] + g then
        Align(i-1, j, len)
        len ← len + 1
        align-s[len] ← s[i]
        align-t[len] ← -
    else if i > 0 and j > 0 and a[i, j] = a[i-1, j-1] + p(i, j) then
        Align(i-1, j-1, len)
        len ← len + 1
        align-s[len] ← s[i]
        align-t[len] ← t[j]
    else // has to be j > 0 and a[i, j] = a[i, j-1] + g
        Align(i, j-1, len)
        len ← len + 1
        align-s[len] ← -
        align-t[len] ← t[j]
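The traceback can be sketched in Python as below. This is an illustration under stated assumptions: the function name `align` is made up for this example, the matrix is rebuilt inside the function so the block is self-contained, and the recursion of Algorithm Align is replaced by a loop that walks from (m, n) to (0, 0) and reverses the collected columns. The three cases are tested in the same order as in the pseudocode.

```python
def align(s, t, g=-2):
    """Return one optimal global alignment (align_s, align_t) of s and t."""
    def p(i, j):
        return 1 if s[i - 1] == t[j - 1] else -1

    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p(i, j), a[i - 1][j] + g)

    cols_s, cols_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            cols_s.append(s[i - 1]); cols_t.append("-")       # space in t
            i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):
            cols_s.append(s[i - 1]); cols_t.append(t[j - 1])  # two symbols
            i, j = i - 1, j - 1
        else:  # has to be j > 0 and a[i][j] == a[i][j-1] + g
            cols_s.append("-"); cols_t.append(t[j - 1])       # space in s
            j -= 1
    # Columns were collected right to left, so reverse them.
    return "".join(reversed(cols_s)), "".join(reversed(cols_t))


print(align("ATAT", "TATA"))  # ('-ATAT', 'TATA-')
```

Using a loop instead of recursion avoids deep call stacks on long sequences; the alignment produced is the same because the cases are tried in the same order.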

87

optimal alignments

Arrow preference

When there is a choice, a column with a space in t has precedence over a column with two symbols, which in turn has precedence over a column with a space in s. For s = ATAT and t = TATA we get

-ATAT
TATA-

rather than

ATAT-
-TATA

For AA and AAAA, there are 4!/(2!)^2 = 6 optimal alignments.

[Figure: the three arrows into an entry, ordered from maximum preference to minimum preference.]

88

optimal alignments

Complexity of the algorithms for time and space:

• Basic dynamic programming: comparison of two sequences / figure 3.2 / to compute Similarity: O(mn), or O(n^2) when both sequences have length about n

• Recursive algorithm for optimal alignment: O(len) = O(m+n)

89

local comparison

Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t.

Algorithm: find the highest scoring local alignment between two sequences.

90

local comparison

Idea:

Data structure:
• an (m+1)×(n+1) array
• entry (i, j): holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]

Initialization:
• The first row and column are initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.

\[
a[i,j] = \max \begin{cases} a[i,j-1] + g \\ a[i-1,j-1] + p(i,j) \\ a[i-1,j] + g \\ 0 \end{cases}
\]
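A minimal Python sketch of the local-comparison recurrence is given below. The function name `local_similarity` and its keyword parameters are assumptions for this example, with defaults equal to the scoring used for global comparison in this chapter. The usage line passes match = 2, mismatch = -1, g = -1; those values are not stated in the text and are only inferred here because they reproduce the local alignment array shown for the sequences afgfde and abbcdae.

```python
def local_similarity(s, t, match=1, mismatch=-1, g=-2):
    """Highest score of an alignment between a substring of s and one of t."""
    m, n = len(s), len(t)
    # First row and column stay 0: the empty suffixes always align with score 0.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + g,
                          0)            # restart: align the empty suffixes
            best = max(best, a[i][j])
    return best


print(local_similarity("afgfde", "abbcdae", match=2, mismatch=-1, g=-1))  # 3
```

The answer is the maximum over the whole array rather than the entry (m, n), since the best local alignment may end anywhere.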

91

Local alignment

         j    0   1   2   3   4   5   6   7
 i                a   b   b   c   d   a   e
 0            0   0   0   0   0   0   0   0
 1  a         0   2   1   0   0   0   2   1
 2  f         0   1   1   0   0   0   1   1
 3  g         0   0   0   0   0   0   0   0
 4  f         0   0   0   0   0   0   0   0
 5  d         0   0   0   0   0   2   1   0
 6  e         0   0   0   0   0   1   1   3

92

Global alignment

93

semiglobal comparison

Similar to global comparison, but the score ignores end spaces.

Definitions
• end spaces: spaces that appear before the first or after the last character in a sequence
• semiglobal comparison: scoring alignments while ignoring some of the end spaces in the sequences

Suitability: when the lengths of the two sequences differ considerably (e.g. 8 and 18), any alignment contains many spaces, which give a large negative contribution to the score.

94

semiglobal comparison example

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------        (3.4) [score: -19, or 3 when end spaces are not charged]

CAGCACTTGGATTCTCGG
CAGC-----G-T----GG         (3.5) [score: -12]

Comparison:
• According to the scoring system we have been using so far, (3.5) (-12) is better than (3.4) (-19).
• If we disregard (do not charge for) end spaces, (3.4) (3) is better than (3.5) (-12), and it is pretty good.
• From the point of view of finding similar regions in the sequences, the shorter sequence in (3.5) was simply torn apart brutally by the spaces just for the sake of matching exactly its characters.
• If we are looking for regions of the longer sequence that are approximately the same as the shorter sequence, then undoubtedly (3.4) is more to the point.

95

semiglobal comparison

Ignore the end spaces in the first sequence s: take the maximum along the last row

\[
sim(s,t) = \max_{j=1}^{n} a[m,j]
\]

Ignore the end spaces in the second sequence t: take the maximum along the last column

\[
sim(s,t) = \max_{i=1}^{m} a[i,n]
\]

Without charging any end spaces: take the maximum along the border formed by the last row and the last column.

96

semiglobal comparison

Ignore the beginning of the first sequence s
• Not charging for initial spaces in s leads to the best alignment between s and a suffix of t.
• Each entry (i, j) contains the highest similarity between s[1..i] and a suffix of a prefix of t, which is to say a suffix of t[1..j].
• Initialize the first row with 0; the answer is in a[m, n].

Between suffixes of both s and t:
• Both the first row and the first column are initialized with 0.
• Optimal alignment: start from the maximum entry, trace back until the border, then back to the origin.

97

semiglobal comparison

Summary
• Forgiving initial spaces: initializing certain positions with zero
• Forgiving final spaces: looking for the maximum along certain positions

Place where spaces are not charged for    Action
Beginning of first sequence               Initialize first row with zeros
End of first sequence                     Look for maximum in last row
Beginning of second sequence              Initialize first column with zeros
End of second sequence                    Look for maximum in last column
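One row of choices from this table can be sketched in Python: spaces before the beginning and after the end of the second sequence t are free, everything else is charged as in global comparison. The function name `semiglobal_similarity` is an assumption for this example, and the scoring (+1, -1, g = -2) is the one used earlier in this section.

```python
def semiglobal_similarity(s, t, g=-2):
    """Best score when spaces before and after the second sequence t are free."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        a[0][j] = j * g     # spaces at the beginning of s are still charged
    # Column 0 stays 0: spaces at the beginning of t are not charged.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = 1 if s[i - 1] == t[j - 1] else -1
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p, a[i - 1][j] + g)
    # Spaces at the end of t are not charged: maximum in the last column.
    return max(a[i][n] for i in range(m + 1))


print(semiglobal_similarity("xxabcxx", "abc"))  # 3
```

With s the longer and t the shorter sequence, this is the setting in which alignment (3.4) scores 3, so the value returned for that pair is at least 3.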

98

extensions to the basic algorithms

Motivation: better algorithms for special situations

Improvements

Computational complexity
• Space: O(mn) ⇒ O(m+n), at the cost of doubling the time
• Time reduction: only for similar sequences and for a certain family of scoring parameters

Biological interpretation of alignments
• A series of consecutive spaces is more realistic than individual spaces

99

saving space

Computing sim(s, t)

Algorithm BestScore
input: sequences s and t
output: vector a
m ← |s|
n ← |t|
for j ← 0 to n do
    a[j] ← j×g
for i ← 1 to m do
    old ← a[0]
    a[0] ← i×g
    for j ← 1 to n do
        temp ← a[j]
        a[j] ← max(a[j]+g, old+p(i,j), a[j-1]+g)
        old ← temp
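A direct Python transcription of BestScore may make the role of `old` and `temp` easier to follow. The scoring values (match 1, mismatch -1, g = -2) are assumptions; the slide leaves p and g abstract.

```python
def best_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return the last row of the DP matrix using a single vector.

    On return, a[j] = sim(s, t[:j]); in particular a[-1] = sim(s, t).
    """
    m, n = len(s), len(t)
    a = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        old = a[0]            # value of a[i-1][0], needed as the diagonal
        a[0] = i * gap
        for j in range(1, n + 1):
            temp = a[j]       # a[i-1][j], the diagonal for the next column
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap,      # from above:  a[i-1][j]
                       old + p,         # diagonal:    a[i-1][j-1]
                       a[j - 1] + gap)  # from left:   a[i][j-1]
            old = temp
    return a
```

Only O(n) cells are kept, so the score is available but the alignment itself cannot be traced back from this vector alone.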

100

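saving space

An optimal alignment in linear space

Idea: divide-and-conquer strategy

Fix a position i in s and consider what matches s[i] in an optimal alignment. There are two possibilities:

1. The symbol t[j] matches s[i], for some j in 1..n:

$$ \mathrm{optimal}\begin{pmatrix} s[1..i-1] \\ t[1..j-1] \end{pmatrix} + \begin{matrix} s[i] \\ t[j] \end{matrix} + \mathrm{optimal}\begin{pmatrix} s[i+1..m] \\ t[j+1..n] \end{pmatrix} \qquad (3.6) $$

2. A space between t[j] and t[j+1] matches s[i], for some j in 0..n (j = 0 places the space before t[1]):

$$ \mathrm{optimal}\begin{pmatrix} s[1..i-1] \\ t[1..j] \end{pmatrix} + \begin{matrix} s[i] \\ - \end{matrix} + \mathrm{optimal}\begin{pmatrix} s[i+1..m] \\ t[j+1..n] \end{pmatrix} \qquad (3.7) $$

Recursive method

1. For a fixed i, decide which j and which of the two cases give the best total score
2. To decide which value of i to use in each recursive call: pick i as close as possible to the middle of the sequence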
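The divide-and-conquer idea of (3.6) and (3.7) can be sketched as follows. This is an illustrative sketch under assumptions: the names `best_score` and `linear_space_align` are hypothetical, the scoring values are match 1, mismatch -1, space -2, and the suffix scores are obtained by running the linear-space score routine on the reversed strings.

```python
def best_score(s, t, match=1, mismatch=-1, gap=-2):
    """Last DP row in linear space: result[j] = sim(s, t[:j])."""
    a = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * gap
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap, old + p, a[j - 1] + gap)
            old = temp
    return a


def linear_space_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return an optimal global alignment (top, bottom) of s and t."""
    if not s:                       # nothing left in s: t faces only spaces
        return "-" * len(t), t
    n, mid = len(t), len(s) // 2    # s[mid] is the fixed position i
    pref = best_score(s[:mid], t, match, mismatch, gap)
    # suff[k] = sim(s[mid+1:], last k characters of t)
    suff = best_score(s[mid + 1:][::-1], t[::-1], match, mismatch, gap)
    best = None
    for j in range(n + 1):
        # Case (3.7): s[mid] faces a space placed right after t[:j].
        cand = pref[j] + gap + suff[n - j]
        if best is None or cand > best[0]:
            best = (cand, j, False)
        # Case (3.6): s[mid] faces t[j] (0-based index).
        if j < n:
            p = match if s[mid] == t[j] else mismatch
            cand = pref[j] + p + suff[n - j - 1]
            if cand > best[0]:
                best = (cand, j, True)
    _, j, paired = best
    left = linear_space_align(s[:mid], t[:j], match, mismatch, gap)
    rest = t[j + 1:] if paired else t[j:]
    right = linear_space_align(s[mid + 1:], rest, match, mismatch, gap)
    middle = t[j] if paired else "-"
    return left[0] + s[mid] + right[0], left[1] + middle + right[1]
```

Each call keeps only two score vectors of length n+1, and the recursion splits s roughly in half, which is why choosing i near the middle matters.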

101

saving space

102

general gap penalty functions

Examples

AAAAAAAA      AAAAAAAA
-A-A-A--      AAA-----

Motivation

Gap definition: a consecutive number k>1 of spaces

Fact: when mutations are involved, the occurrence of a gap with k spaces is more probable than the occurrence of k isolated spaces

Reason: a gap may be due to a single mutational event (more often), while separated spaces are due to distinct events

Penalty function so far: w(k)=bk, which makes no distinction between clustered and isolated spaces

103

general gap penalty functions

Blocks

Scoring is additive at the block level rather than at the column level

Three kinds of blocks
• 1. Two aligned characters from ∑
• 2. A maximal series of consecutive characters in t aligned with spaces in s
• 3. A maximal series of consecutive characters in s aligned with spaces in t
• Scores: 1) p(a, b); 2), 3): -w(k)

104

general gap penalty functions

Example

S1 AAC---AATTCCGACTAC
S2 ACTACCT------CGC--

Blocks cannot follow other blocks arbitrarily: because blocks of type 2 and 3 are maximal, a block of type 2 or 3 cannot follow a block of the same type

Algorithm

Initialization / data structure
• 3 arrays of size (m+1)×(n+1), one for each type of ending block
    a[0, 0] = 0: a holds alignments ending in character-character blocks
    b[0, j] = -w(j): b holds alignments ending in spaces in s
    c[i, 0] = -w(i): c holds alignments ending in spaces in t
    All other initial entries: -∞

105

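general gap penalty functions

Update process / recurrence relations

$$ a[i,j] = p(i,j) + \max \left\{ \begin{array}{l} a[i-1,j-1] \\ b[i-1,j-1] \\ c[i-1,j-1] \end{array} \right. $$

$$ b[i,j] = \max \left\{ \begin{array}{ll} a[i,j-k] - w(k), & \text{for } 1 \le k \le j \\ c[i,j-k] - w(k), & \text{for } 1 \le k \le j \end{array} \right. $$

$$ c[i,j] = \max \left\{ \begin{array}{ll} a[i-k,j] - w(k), & \text{for } 1 \le k \le i \\ b[i-k,j] - w(k), & \text{for } 1 \le k \le i \end{array} \right. $$

sim(s, t): maximum among a[m,n], b[m,n] and c[m,n]

Time complexity: O(mn² + m²n)
• Computing the three entries at position (i, j) takes 3+2j+2i accesses to earlier entries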

106

general gap penalty functions

• Total over all entries:

$$ \sum_{i=1}^{m} \sum_{j=1}^{n} (2i + 2j + 3) = mn^2 + 5mn + m^2 n $$

Optimal alignment: traced back as before

Summary: considerable increase in time and space (auxiliary arrays), which is critical
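The three-array scheme for a general gap function can be sketched as below. The name `sim_general_gap` and the default scores (match 1, mismatch -1) are assumptions; `w` is any caller-supplied penalty function, and -∞ is represented by `float("-inf")`.

```python
def sim_general_gap(s, t, w, match=1, mismatch=-1):
    """Global similarity with an arbitrary gap penalty function w(k).

    a: alignments ending in a character-character block
    b: alignments ending in spaces in s (characters of t facing spaces)
    c: alignments ending in spaces in t (characters of s facing spaces)
    """
    NEG = float("-inf")
    m, n = len(s), len(t)
    a = [[NEG] * (n + 1) for _ in range(m + 1)]
    b = [[NEG] * (n + 1) for _ in range(m + 1)]
    c = [[NEG] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -w(j)
    for i in range(1, m + 1):
        c[i][0] = -w(i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # A type-2 block of length k may only follow an a- or c-ending.
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k)
                          for k in range(1, j + 1))
            # A type-3 block of length k may only follow an a- or b-ending.
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k)
                          for k in range(1, i + 1))
    return max(a[m][n], b[m][n], c[m][n])
```

The inner loops over k are what raise the running time to O(mn² + m²n). With the linear function `w = lambda k: 2 * k` the result coincides with the basic algorithm using a space penalty of -2.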

107

affine gap penalty functions

Motivation

Linear gap function: O(n²)

General gap function: O(n³)

Here: a less general function that still charges less for a gap with k spaces than for k isolated spaces. Can it be handled in O(n²)?

Two concepts

Subadditive function: w(k1+k2+…+kn) ≤ w(k1) + … + w(kn)

Affine function:
• w(k) = h+gk for k≥1, with w(0)=0
• Subadditive if h, g > 0
• The first space in a gap costs h+g; each of the other spaces costs g

108

affine gap penalty functions Algorithm

Initialization / data structure• 3 arrays of (m+1)×(n+1): one for each type of ending block

a[0, 0]=0

a[i, 0]= -∞ for 1≤i≤m

a[0, j]= -∞ for 1≤j≤n

b[i, 0]= -∞ for 1≤i≤m

b[0, j]= -(h+gj) for 1≤j≤n

c[i, 0]= -(h+gi) for 1≤i≤m

c[0, j]= -∞ for 1≤j≤n

109

affine gap penalty functions

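Update process / recurrence relations

$$ a[i,j] = p(i,j) + \max \left\{ \begin{array}{l} a[i-1,j-1] \\ b[i-1,j-1] \\ c[i-1,j-1] \end{array} \right. $$

$$ b[i,j] = \max \left\{ \begin{array}{l} -(h+g) + a[i,j-1] \\ -g + b[i,j-1] \\ -(h+g) + c[i,j-1] \end{array} \right. $$

$$ c[i,j] = \max \left\{ \begin{array}{l} -(h+g) + a[i-1,j] \\ -(h+g) + b[i-1,j] \\ -g + c[i-1,j] \end{array} \right. $$

sim(s, t): maximum among a[m,n], b[m,n] and c[m,n]

Optimal alignment: trace back from (m, n) until (0, 0)

Time complexity: O(mn) / space complexity: O(mn), or O(n) if only the score is needed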
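The affine recurrences translate almost line by line into code. This is a minimal sketch: the name `sim_affine` and the default values h = 2, g = 1, match 1, mismatch -1 are assumptions made only for illustration.

```python
def sim_affine(s, t, h=2, g=1, match=1, mismatch=-1):
    """Global similarity with affine gap penalty w(k) = h + g*k, in O(mn) time."""
    NEG = float("-inf")
    m, n = len(s), len(t)
    a = [[NEG] * (n + 1) for _ in range(m + 1)]
    b = [[NEG] * (n + 1) for _ in range(m + 1)]
    c = [[NEG] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Extending a gap in s costs g; opening one costs h + g.
            b[i][j] = max(-(h + g) + a[i][j - 1],
                          -g + b[i][j - 1],
                          -(h + g) + c[i][j - 1])
            # Extending a gap in t costs g; opening one costs h + g.
            c[i][j] = max(-(h + g) + a[i - 1][j],
                          -(h + g) + b[i - 1][j],
                          -g + c[i - 1][j])
    return max(a[m][n], b[m][n], c[m][n])
```

Each entry now looks at a constant number of neighbours, which is where the drop from cubic to O(mn) time comes from.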

110

comparing similar sequences motivation

Two sequences are similar ("look alike") when the scores of their alignments are very close to the maximum possible

Goal: faster algorithms to find good alignments; global alignments only

Assumption: s and t have the same length n

Properties:
• The DP matrix is a square matrix
• Main diagonal from (0, 0) to (n, n): the unique alignment without spaces between s and t
• Spaces are always inserted in pairs, one in s and one in t
• Each space pair throws the alignment off the main diagonal
• The number of space pairs is greater than or equal to the maximum departure from the main diagonal

111

comparing similar sequences

[Figure: DP matrix with GCGCATGGATTGAGCGA along one axis and TGCGCCATGGATGAGCA along the other]

112

comparing similar sequences example

s = GCGCATGGATTGAGCGA
t = TGCGCCATGGATGAGCA

s = -GCGC-ATGGATTGAGCGA
t = TGCGCCATGGAT-GAGC-A

Idea

If the sequences are similar, the path of the best alignment stays near the main diagonal

It is not necessary to fill the entire matrix ⇒ a narrow band suffices

Algorithm KBand

Fill in a band of horizontal (or vertical) width 2k+1 around the main diagonal

a[n, n]: highest score of an alignment confined to that band

O(kn): a big win over the usual O(n²) if k≪n
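A banded fill can be sketched as follows. This is an illustration under assumptions, not the slides' exact KBand: the name `kband`, the dictionary used to store only the band cells, and the scores (match 1, mismatch -1, space -2) are all choices made for the example.

```python
def kband(s, t, k, match=1, mismatch=-1, gap=-2):
    """Best score of a global alignment confined to the band |i - j| <= k.

    Assumes len(s) == len(t) == n. Only O(kn) cells are created.
    """
    n = len(s)
    assert len(t) == n, "KBand as stated assumes equal lengths"
    NEG = float("-inf")
    a = {}
    for d in range(min(k, n) + 1):      # border cells that lie inside the band
        a[(d, 0)] = d * gap
        a[(0, d)] = d * gap
    for i in range(1, n + 1):
        for j in range(max(1, i - k), min(n, i + k) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[(i, j)] = max(a[(i - 1, j - 1)] + p,           # diagonal is always in band
                            a.get((i - 1, j), NEG) + gap,    # may fall outside the band
                            a.get((i, j - 1), NEG) + gap)
    return a[(n, n)]
```

With k = 0 only the main diagonal is filled, so the result is the score of the alignment without spaces. Increasing k can only keep or raise the score, and k = n covers the whole matrix.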

113

comparing multiple sequences motivation

Multiple alignment (MA) of s1, …, sk: shows which parts of the sequences are similar and which parts are different

Multiple alignment is a generalization of pairwise alignment and is built with similar operations: spaces are inserted so that all sequences end up with the same length, and no column is made exclusively of spaces

114

comparing multiple sequences

Multiple alignments are more common with amino acid (protein) sequences

How to evaluate different MAs of the same set of sequences?

115

comparing multiple sequences

Multiple sequence alignments are used for many reasons, including:

(1) to detect regions of variability or conservation in a family of proteins,

(2) to provide stronger evidence than pairwise similarity for structural and functional inferences,

(3) to serve as the first step in phylogenetic reconstruction, in RNA secondary structure

prediction, and in building profiles (probabilistic models) for protein families or DNA

signals.

116

comparing multiple sequences

Scoring scheme:

(1) SP measure: scoring an alignment based on pairwise alignments

(2) star alignment

(3) tree alignment

117

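the SP measure

Scoring MA

Additive functions here: the score of an alignment is the sum of its column scores

$$ \mathrm{Score}(\alpha) = \sum_i s_i $$

where s_i is the score of column i

One way to score a column: a table over every combination of arguments / for k = 10: 1023 ⇒ more than 1000

"Reasonable" properties

(1) The function is independent of the order of its arguments, e.g. SP(I, -, I, V) = SP(V, I, I, -)

(2) It rewards the presence of many equal or strongly related residues and penalizes unrelated residues and spaces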

118

the SP measure

The sum-of-pairs (SP) function is a function that meets the two properties: the score of a column is the sum of the pairwise scores of its symbols

E.g., SP-score(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)

• With match = 1, mismatch = -1, and gap = -2:

SP(I, -, I, V) = score(I, -) + score(I, I) + score(I, V) + score(-, I) + score(-, V) + score(I, V) = -2 + 1 + -1 + -2 + -2 + -1 = -7
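The column computation above can be checked with a few lines of code. The helper names `pair_score` and `sp_column` are hypothetical; the scores are the ones used in the example, and a pair of two spaces is assumed to score 0.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of one pair of symbols; '-' denotes a space."""
    if x == "-" and y == "-":
        return 0                      # assumption: two spaces cost nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_column(column):
    """Sum-of-pairs score of a single alignment column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


print(sp_column("I-IV"))   # the column (I, -, I, V) from the example
print(sp_column("VII-"))   # same symbols in another order
```

Because every unordered pair is visited exactly once, reordering the symbols of a column cannot change its score, which is property (1).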

119

the SP measure

Although there is never an entire column of gaps, if we look at any 2 sequences in the alignment, there may be columns where both have gaps

p(-, -)=0

120

the SP measure

1. In an MA, select two of the sequences, forget all the rest, and remove the columns with two spaces to derive a true PA

2. E.g.,

PEAALYGRFT---IKSDVM
PEALNYGRY---SSESDVW

becomes

PEAALYGRFT-IKSDVM
PEALNYGRY-SSESDVW

• αij: the PA induced by α on si and sj

Summary
• Way 1: compute the score of each column, and then add all column scores
• Way 2: compute the scores of the induced PAs, and then add these scores
• Common: both add the scores p(s'i[c], s'j[c]) / s': extended sequence (with spaces inserted)

$$ \text{SP-score}(\alpha) = \sum_{i<j} \mathrm{score}(\alpha_{ij}) $$

This equality is true only if p(-, -) = 0
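The two ways of computing the SP score can be compared directly. In this sketch the function names are hypothetical, the scores are match 1, mismatch -1, space -2, and p(-, -) = 0 is assumed so that both ways agree.

```python
from itertools import combinations


def p(x, y, match=1, mismatch=-1, gap=-2):
    """Pairwise symbol score with p('-', '-') = 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def induced(r1, r2):
    """Pairwise alignment induced by two rows: drop columns with two spaces."""
    kept = [(x, y) for x, y in zip(r1, r2) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


def sp_by_columns(rows):
    """Way 1: score each column, then add the column scores."""
    return sum(sum(p(x, y) for x, y in combinations(col, 2))
               for col in zip(*rows))


def sp_by_pairs(rows):
    """Way 2: score each induced pairwise alignment, then add the scores."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        a1, a2 = induced(r1, r2)
        total += sum(p(x, y) for x, y in zip(a1, a2))
    return total
```

For example, `induced("PEAALYGRFT---IKSDVM", "PEALNYGRY---SSESDVW")` drops the two columns in which both rows have a space.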

121

the SP measure

Note: An optimum MSA-SP may not give the optimal pairwise alignments. For example, the optimum MSA-SP for the sequences AT, A, T, AT and AT is as follows, with the induced alignment

between A & T also shown.

122

the SP measure

However, the optimum pairwise arrangement for the two sequences A and T is

which, if it were part of the MSA would yield:

123

DP

124

DP

Therefore, we must: (1) fill out the entire array, which has (n+1)^k cells, (2) consider 2^k - 1 possibilities for each cell in the array, (3) compute the sum-of-pairs measure for each of these possibilities, which involves C(k, 2) pairs.

Therefore, the time complexity of DP is

$$ (n+1)^k \, (2^k - 1) \binom{k}{2} $$

This is impractical even for numbers as small as k = 10
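To get a feeling for the size of this product, it can simply be evaluated. The function name `msa_dp_work` and the sample values of n and k are assumptions used only for illustration.

```python
from math import comb


def msa_dp_work(n, k):
    """(n+1)^k cells, 2^k - 1 predecessors per cell, C(k, 2) pairs per predecessor."""
    return (n + 1) ** k * (2 ** k - 1) * comb(k, 2)


for k in (2, 3, 5, 10):
    # Ten sequences of length 100 already give an astronomically large count.
    print(k, msa_dp_work(100, k))
```

The factor (n+1)^k dominates: each extra sequence multiplies the work by roughly n.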

125

DP

Summary

Total time complexity: O(k² 2^k n^k) ⇒ exponential in the number of input sequences

Polynomial algorithms: unlikely, because the MA problem with the SP measure is NP-complete

126

star alignments

Heuristic method for multiple sequence alignments

Select a sequence sc as the center of the star

For each sequence si among s1, …, sk with index i ≠ c, perform a global alignment between si and sc

Aggregate the alignments with the principle "once a gap, always a gap."

127

star alignments

For example, say your sequences are:

S1 ATTGCCATT
S2 ATGGCCATT
S3 ATCCAATTTT
S4 ATCTTCTT
S5 ACTGACC

(1) Find the center sequence sc: the one that maximizes

$$ \sum_{i \ne c} \mathrm{score}(s_i, s_c) $$
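Step (1) can be sketched as a search over all candidates for the center. The helper names `sim` and `pick_center` are hypothetical, and the pairwise scores (match 1, mismatch -1, space -2) are an assumption, since the slide does not state which scoring is used for the star.

```python
def sim(s, t, match=1, mismatch=-1, gap=-2):
    """Global similarity of s and t, computed in linear space."""
    a = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * gap
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap, old + p, a[j - 1] + gap)
            old = temp
    return a[-1]


def pick_center(seqs):
    """Index of the sequence whose summed similarity to all others is largest."""
    totals = [sum(sim(s, t) for j, t in enumerate(seqs) if j != i)
              for i, s in enumerate(seqs)]
    return max(range(len(seqs)), key=totals.__getitem__)


seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
print("center index:", pick_center(seqs))
```

This needs all k(k-1)/2 pairwise scores, but only scores, so the linear-space routine is enough at this stage.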

128

star alignments

(2) do pairwise alignments

129

star alignments

(3) build the multiple alignment
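Steps (2) and (3) can be sketched together: align every sequence with the center, then merge the pairwise alignments so that a gap once placed in the center row is never removed. The names `align_pair` and `star_align`, the scores (match 1, mismatch -1, space -2), and the choice of the center index are assumptions for illustration.

```python
def align_pair(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment of s and t by full DP and traceback."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)
    i, j, top, bot = m, n, [], []
    while i > 0 or j > 0:
        p = (match if s[i - 1] == t[j - 1] else mismatch) if i > 0 and j > 0 else None
        if p is not None and a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bot.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + gap:
            top.append(s[i - 1]); bot.append("-"); i -= 1
        else:
            top.append("-"); bot.append(t[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bot))


def star_align(seqs, c):
    """Rows of the star alignment; row 0 is the center, then the others in order."""
    rows = [seqs[c]]
    for idx, s in enumerate(seqs):
        if idx == c:
            continue
        cen, oth = align_pair(seqs[c], s)
        merged, new, p, q = [[] for _ in rows], [], 0, 0
        while p < len(rows[0]) or q < len(cen):
            if p < len(rows[0]) and q < len(cen) and rows[0][p] == cen[q]:
                for r, out in zip(rows, merged):   # columns agree: keep both
                    out.append(r[p])
                new.append(oth[q]); p += 1; q += 1
            elif q < len(cen) and cen[q] == "-":   # new gap in the center row
                for out in merged:
                    out.append("-")
                new.append(oth[q]); q += 1
            else:                                  # old gap the new pair lacks
                for r, out in zip(rows, merged):
                    out.append(r[p])
                new.append("-"); p += 1
        rows = ["".join(x) for x in merged] + ["".join(new)]
    return rows
```

For instance, `star_align(["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"], 0)` returns five rows of equal length, with S1 taken as the center purely for the sake of the example.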

130

tree alignments

Model the k sequences with a tree having k leaves (1 to 1 correspondence)

Compute a weight for each edge, which is the similarity score

Sum of all the weights is the score of the tree

Find the tree with maximum score

131

tree alignments

Motivation: an evolutionary tree for the sequences involved

Comparison with the SP measure
• Tree alignment: computes overall similarity based on PAs along tree edges
• SP measure: takes into account all pairwise similarities

Tree alignment

Leaves: one-to-one correspondence between the k leaves and the k sequences

Interior nodes: sequences are to be assigned

Weight of each edge: similarity between the two sequences in the nodes incident to the edge

Score of tree: sum of all edge weights

Tree alignment problem: find a sequence assignment maximizing the score

Star alignment: a particular case in which the tree is a star
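Scoring a tree once every node has a sequence is a plain sum over edges. In this sketch the names `sim` and `tree_score`, the edge-list representation, and the node labels are hypothetical; the scores (match +1, mismatch 0, space -1) are an assumption.

```python
def sim(s, t, match=1, mismatch=0, gap=-1):
    """Global similarity of s and t (linear space)."""
    a = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * gap
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap, old + p, a[j - 1] + gap)
            old = temp
    return a[-1]


def tree_score(edges, label):
    """Sum of edge weights; the weight of (u, v) is sim(label[u], label[v])."""
    return sum(sim(label[u], label[v]) for u, v in edges)


# Hypothetical two-leaf star: interior node "x" assigned CT, leaves CTG and CAT.
label = {"x": "CT", "leaf1": "CTG", "leaf2": "CAT"}
edges = [("x", "leaf1"), ("x", "leaf2")]
print(tree_score(edges, label))
```

The hard part of tree alignment is choosing the interior sequences; this function only evaluates one given assignment.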

132

tree alignments

example

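Scoring: match +1, gap -1, mismatch 0

[Figure: example tree with leaf sequences CTG, CGGT and CAT, interior nodes x = CT and y = CG, and edge weights 1, 1, 2, 1, 1]

Tree alignment problem: NP-hard

Approximation algorithms: use distance as the edge weight and look for a sequence assignment minimizing the distance sum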

133

References

Algorithms on Strings, Trees and Sequences (Dan Gusfield)

Introduction to Computational Molecular Biology (计算分子生物学导论, Setubal)

134

4.6 0/1 Knapsack Problem

The 0-1 knapsack problem

A thief robbing a store finds n items: the i-th item is worth vi dollars and weighs wi pounds (vi, wi integers)

The thief can only carry W pounds in his knapsack

Items must be taken entirely or left behind

Which items should the thief take to maximize the value of his load?

135

The 0-1 Knapsack Problem

Thief has a knapsack of capacity W

There are n items: for i-th item value vi and weight wi

Goal:

find xi ∈ {0, 1}, i = 1, 2, …, n, such that

$$ \sum_{i=1}^{n} w_i x_i \le W \quad \text{and} \quad \sum_{i=1}^{n} x_i v_i \ \text{is maximum} $$

136

Optimal Substructure

Consider the most valuable load that weighs at most W pounds

If we remove item j from this load

The remaining load must be the most valuable load weighing at most W - wj that can be taken from the remaining n - 1 items

137

0-1 Knapsack - Dynamic Programming

V(i, w): the maximum profit that can be obtained from items 1 to i, if the knapsack has size w

Case 1: thief takes item i

V(i, w) = vi + V(i - 1, w - wi)

Case 2: thief does not take item i

V(i, w) = V(i - 1, w)

138

DPKnapsack(S, W)

for w ← 0 to w1 - 1 do V[1, w] ← 0
for w ← w1 to W do V[1, w] ← v1
for i ← 2 to n do
    for w ← 0 to W do
        if wi > w then
            V[i, w] ← V[i-1, w]
            b[i, w] ← "↑"
        else if V[i-1, w] > V[i-1, w-wi] + vi then
            V[i, w] ← V[i-1, w]
            b[i, w] ← "↑"
        else
            V[i, w] ← V[i-1, w-wi] + vi
            b[i, w] ← "↖"
return V and b

Running time: Θ(nW)
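The table-filling part of DPKnapsack can be sketched in Python as below. The name `dp_knapsack` and the use of an extra row 0 of zeros (instead of initializing row 1 separately) are choices made for this illustration; ties are resolved in favour of taking the item, as in the pseudocode.

```python
def dp_knapsack(weights, values, W):
    """Return the table V, where V[i][w] is the best value using items 1..i
    with capacity w. Row 0 (no items) is all zeros."""
    n = len(weights)
    V = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        wi, vi = weights[i - 1], values[i - 1]
        for w in range(W + 1):
            V[i][w] = V[i - 1][w]                      # item i not taken
            if wi <= w and V[i - 1][w - wi] + vi >= V[i][w]:
                V[i][w] = V[i - 1][w - wi] + vi        # item i taken
    return V


# Four items with weights 2, 1, 3, 2 and values 12, 10, 20, 15; capacity 5.
V = dp_knapsack([2, 1, 3, 2], [12, 10, 20, 15], 5)
for row in V:
    print(row)
```

The two nested loops make the Θ(nW) running time visible: one constant-time update per table entry.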

139

0-1 Knapsack - Dynamic Programming

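V(i, w) = max {vi + V(i - 1, w - wi), V(i - 1, w)}

First term: item i was taken. Second term: item i was not taken.

[Figure: the table with rows 0..n and columns 0..W, row 0 and column 0 filled with 0; entry (i, w) is computed from entries (i-1, w-wi) and (i-1, w) in row i-1]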

140

Example

V(i, w) = max {vi + V(i - 1, w - wi), V(i - 1, w)}

W = 5

Item i   wi   vi
1        2    12
2        1    10
3        3    20
4        2    15

i \ w   0    1    2    3    4    5
0       0    0    0    0    0    0
1       0    0    12   12   12   12
2       0    10   12   22   22   22
3       0    10   12   22   30   32
4       0    10   15   25   30   37

V(1, 1) = V(0, 1) = 0
V(1, 2) = max{12+0, 0} = 12
V(1, 3) = max{12+0, 0} = 12
V(1, 4) = max{12+0, 0} = 12
V(1, 5) = max{12+0, 0} = 12

V(2, 1) = max{10+0, 0} = 10
V(2, 2) = max{10+0, 12} = 12
V(2, 3) = max{10+12, 12} = 22
V(2, 4) = max{10+12, 12} = 22
V(2, 5) = max{10+12, 12} = 22

V(3, 1) = V(2, 1) = 10
V(3, 2) = V(2, 2) = 12
V(3, 3) = max{20+0, 22} = 22
V(3, 4) = max{20+10, 22} = 30
V(3, 5) = max{20+12, 22} = 32

V(4, 1) = V(3, 1) = 10
V(4, 2) = max{15+0, 12} = 15
V(4, 3) = max{15+10, 22} = 25
V(4, 4) = max{15+12, 30} = 30
V(4, 5) = max{15+22, 32} = 37

141

Reconstructing the Optimal Solution

i \ w   0    1    2    3    4    5
0       0    0    0    0    0    0
1       0    0    12   12   12   12
2       0    10   12   22   22   22
3       0    10   12   22   30   32
4       0    10   15   25   30   37

Start at V(n, W)

When you go left-up, item i has been taken

When you go straight up, item i has not been taken

Path: V(4, 5) = 37 ↖ V(3, 3) = 22 ↑ V(2, 3) = 22 ↖ V(1, 2) = 12 ↖ V(0, 0) = 0

• Item 4

• Item 2

• Item 1
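The walk back through the table can be sketched as follows. This version is an assumption-laden illustration: the name `knapsack_items` is hypothetical, and instead of the arrow array b it compares V[i][w] with V[i-1][w] to decide whether item i was taken, which also yields an optimal load.

```python
def knapsack_items(weights, values, W):
    """Return (best value, list of chosen item numbers, 1-based, last item first)."""
    n = len(weights)
    V = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        wi, vi = weights[i - 1], values[i - 1]
        for w in range(W + 1):
            V[i][w] = V[i - 1][w]
            if wi <= w:
                V[i][w] = max(V[i][w], V[i - 1][w - wi] + vi)
    items, w = [], W
    for i in range(n, 0, -1):          # start at V(n, W) and move upward
        if V[i][w] != V[i - 1][w]:     # value changed: go left-up, item taken
            items.append(i)
            w -= weights[i - 1]
        # otherwise go straight up: item i not taken
    return V[n][W], items


print(knapsack_items([2, 1, 3, 2], [12, 10, 20, 15], 5))
```

On the example data the walk visits (4, 5), (3, 3), (2, 3), (1, 2) and ends at (0, 0), the same path as on the slide.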

142

The problem we presented before

We get to observe the “qualities” of m secretaries: X1,…,Xm sequentially according to a random order. Our goal is to maximize the probability of finding the “best” candidate with no looking back!

Pm=1/m;