A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices


Page 1: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

A Sub-quadratic Sequence Alignment

Algorithm for Unrestricted Cost Matrices

Maxime Crochemore

Gad M. Landau

Michal Ziv-Ukelson

Page 2: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Presentation

• R89922024 蘇展弘
• B86202049 葉恆青
• R90725054 呂育恩
• R90922001 張文亮
• R90922091 游騰楷

Page 3: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Outline

• Introduction and preliminaries.
– LZ-78.
– The basic concept.

• Global alignment.

• Local alignment.

• Proof for LZ-76.

• Proof for SMAWK algorithm.

Page 4: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

LZ-78

Example: the string aacgacga is parsed into the phrases a | ac | g | acg | a.

Trie of the phrases (node 0 is the root):

0 -a-> 1
1 -c-> 2
0 -g-> 3
2 -g-> 4

The number of distinct code words: $O(n / \log n)$

Page 5: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Sample of LZ-78

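A = ctacgaga, parsed into phrases:

1: c    2: t    3: a    4: cg    5: ag    (the final a repeats phrase 3)

B = aacgacga, parsed into phrases:

1: a    2: ac    3: g    4: acg    (the final a repeats phrase 1)

Trie for A (node 0 is the root): 0 -c-> 1, 0 -t-> 2, 0 -a-> 3, 1 -g-> 4, 3 -g-> 5

Trie for B (node 0 is the root): 0 -a-> 1, 1 -c-> 2, 0 -g-> 3, 2 -g-> 4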
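The parsing shown on this slide can be reproduced with a short sketch. The function name `lz78_parse` and the dictionary-based trie are illustrative choices, not something taken from the slides.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases (a trailing phrase may repeat an earlier one)."""
    trie = {}                 # (parent node, character) -> node index
    phrases = []
    node, start = 0, 0        # node 0 is the root (empty phrase)
    for pos, ch in enumerate(text):
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt        # still inside a known phrase, keep walking
            continue
        phrases.append(text[start:pos + 1])   # known phrase plus one new character
        trie[(node, ch)] = len(phrases)       # new trie node for the new phrase
        node, start = 0, pos + 1
    if start < len(text):
        phrases.append(text[start:])          # leftover that repeats a known phrase
    return phrases


print(lz78_parse("ctacgaga"))  # ['c', 't', 'a', 'cg', 'ag', 'a']
print(lz78_parse("aacgacga"))  # ['a', 'ac', 'g', 'acg', 'a']
```

Each new phrase extends an earlier phrase by exactly one character, which is why the trie grows by one node per phrase.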

Page 6: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Basic Concept

[Figure: alignment grid of A = c | t | a | cg | ag | a (rows, phrases 1 to 5) against B = a | ac | g | acg | a (columns, phrases 1 to 4). Block 5/4 aligns phrase 5 of A (ag) with phrase 4 of B (acg). Its prefix blocks are highlighted in turn.]

• Left prefix: block 5/2 (ag against ac)
• Top prefix: block 3/4 (a against acg)
• Diagonal prefix: block 3/2 (a against ac)

Input border: I
Output border: O
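As a rough sketch of how the three prefix blocks of a block can be located, the code below records for each LZ78 phrase the phrase it extends (its trie parent). The helper names `lz78_parents` and `prefix_blocks` are hypothetical and only serve this illustration.

```python
def lz78_parents(text):
    """parent[k] is the phrase that phrase k extends by one character (0 = root)."""
    trie, parent = {}, [None]      # index 0 is unused so phrases are 1-based
    node = 0
    for ch in text:
        if (node, ch) in trie:
            node = trie[(node, ch)]
        else:
            parent.append(node)
            trie[(node, ch)] = len(parent) - 1
            node = 0
    return parent


def prefix_blocks(parent_a, parent_b, i, j):
    """Blocks whose phrases are the one-character-shorter prefixes of block (i, j)."""
    return {
        "left": (i, parent_b[j]),               # same A phrase, shorter B phrase
        "top": (parent_a[i], j),                # shorter A phrase, same B phrase
        "diagonal": (parent_a[i], parent_b[j]), # both phrases shortened
    }


pa = lz78_parents("ctacgaga")   # [None, 0, 0, 0, 1, 3]
pb = lz78_parents("aacgacga")   # [None, 0, 1, 0, 2]
print(prefix_blocks(pa, pb, 5, 4))
# {'left': (5, 2), 'top': (3, 4), 'diagonal': (3, 2)}
```

Because every phrase is an earlier phrase plus one character, all three prefix blocks have already been processed when block (5, 4) is reached.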

Page 7: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Basic Concept I/O Propagation Across G

DIST matrix (rows: input border entries I0 to I5, here all 0; columns: output border entries 0 to 5; △ marks an undefined entry):

          0    1    2    3    4    5
I0 = 0    0   -1   -2   -3    △    △
I1 = 0   -1   -1   -2   -1   -3    △
I2 = 0   -2    0    0    1   -1   -3
I3 = 0    △   -2   -2    0   -2   -2
I4 = 0    △    △   -2    0   -1   -1
I5 = 0    △    △    △   -2   -1    0

OUT matrix for the input border I = (1, 2, 3, 2, 1, 3), where OUT[i,j] = I[i] + DIST[i,j]:

          0    1    2    3    4    5
I0 = 1    1    0   -1   -2   -∞   -∞
I1 = 2    1    1    0    1   -1   -∞
I2 = 3    1    3    3    4    2    0
I3 = 2  -12    0    0    2    0    0
I4 = 1  -13  -13   -1    1    0    0
I5 = 3  -14  -14  -14    1    2    3

Undefined entries in the upper right corner are directly assigned -∞.

Undefined entries in the lower left corner are assigned OUT[i,j] = -(n+i+1) × k, where k is the maximal absolute value in the penalty matrix.

Output border (the column maxima of OUT):

O0 = 1, O1 = 3, O2 = 3, O3 = 4, O4 = 2, O5 = 3
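The propagation on this slide can be checked with a brute-force sketch that builds OUT from the input border and DIST and then takes column maxima. The values n = 8 and k = 1 are assumptions inferred from the padded entries -12, -13 and -14; the function names are illustrative.

```python
NEG_INF = float("-inf")
U = None  # undefined DIST entry (the triangle symbol on the slide)

DIST = [
    [0, -1, -2, -3, U, U],
    [-1, -1, -2, -1, -3, U],
    [-2, 0, 0, 1, -1, -3],
    [U, -2, -2, 0, -2, -2],
    [U, U, -2, 0, -1, -1],
    [U, U, U, -2, -1, 0],
]
I = [1, 2, 3, 2, 1, 3]


def build_out(dist, inputs, n, k):
    """OUT[i][j] = I[i] + DIST[i][j], with the slide's padding for undefined entries."""
    out = []
    for i, row in enumerate(dist):
        out_row = []
        for j, d in enumerate(row):
            if d is not None:
                out_row.append(inputs[i] + d)
            elif j > i:
                out_row.append(NEG_INF)            # upper right corner
            else:
                out_row.append(-(n + i + 1) * k)   # lower left corner
        out.append(out_row)
    return out


def column_maxima(out):
    """Output border: the maximum of every column of OUT."""
    return [max(col) for col in zip(*out)]


OUT = build_out(DIST, I, n=8, k=1)   # n and k are assumed values
O = column_maxima(OUT)
print(O)  # [1, 3, 3, 4, 2, 3]
```

This quadratic scan is only a reference; the point of the later slides is that SMAWK finds the same maxima in time linear in the border length.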

Page 8: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Basic Concept Monge Property

DIST matrix:

          0    1    2    3    4    5
I0 = 0    0   -1   -2   -3    △    △
I1 = 0   -1   -1   -2   -1   -3    △
I2 = 0   -2    0    0    1   -1   -3
I3 = 0    △   -2   -2    0   -2   -2
I4 = 0    △    △   -2    0   -1   -1
I5 = 0    △    △    △   -2   -1    0

Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays.

Def: A matrix M[m × n] is Monge if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:

1. Convex condition: $M[a,c] + M[b,d] \le M[b,c] + M[a,d]$ for all $a < b$ and $c < d$.

2. Concave condition: $M[a,c] + M[b,d] \ge M[b,c] + M[a,d]$ for all $a < b$ and $c < d$.
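A small sketch can test the concave Monge condition on the DIST matrix of this slide. Skipping every quadruple that touches an undefined entry is an assumption of this sketch, as is the function name.

```python
from itertools import combinations

U = None  # undefined entry

DIST = [
    [0, -1, -2, -3, U, U],
    [-1, -1, -2, -1, -3, U],
    [-2, 0, 0, 1, -1, -3],
    [U, -2, -2, 0, -2, -2],
    [U, U, -2, 0, -1, -1],
    [U, U, U, -2, -1, 0],
]


def is_concave_monge(m):
    """Check M[a,c] + M[b,d] >= M[b,c] + M[a,d] for all a < b, c < d (defined entries only)."""
    for a, b in combinations(range(len(m)), 2):
        for c, d in combinations(range(len(m[0])), 2):
            quad = (m[a][c], m[b][d], m[b][c], m[a][d])
            if None in quad:
                continue                  # assumption: ignore undefined entries
            if quad[0] + quad[1] < quad[2] + quad[3]:
                return False
    return True


print(is_concave_monge(DIST))              # True
print(is_concave_monge([[0, 1], [1, 0]]))  # False: 0 + 0 < 1 + 1
```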

Page 9: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Basic Concept Totally Monotone

An important property of Monge arrays is that of being totally monotone.

Def: A matrix M[m × n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:

1. Convex condition: $M[a,c] \ge M[b,c] \implies M[a,d] \ge M[b,d]$ for all $a < b$ and $c < d$.

2. Concave condition: $M[a,c] \le M[b,c] \implies M[a,d] \le M[b,d]$ for all $a < b$ and $c < d$.

Both DIST and OUT matrices are totally monotone by the concave condition.

Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which computes in O(n) time all row and column maxima of an n × n totally monotone matrix by querying only O(n) elements of the array.
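The concave condition can be verified by brute force on the padded OUT matrix of the running example (input border 1, 2, 3, 2, 1, 3). The checker below is an illustrative sketch; its name is not from the slides.

```python
from itertools import combinations

NEG_INF = float("-inf")

OUT = [
    [1, 0, -1, -2, NEG_INF, NEG_INF],
    [1, 1, 0, 1, -1, NEG_INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]


def is_concave_totally_monotone(m):
    """M[a,c] <= M[b,c] must imply M[a,d] <= M[b,d] for all a < b and c < d."""
    for a, b in combinations(range(len(m)), 2):
        for c, d in combinations(range(len(m[0])), 2):
            if m[a][c] <= m[b][c] and not m[a][d] <= m[b][d]:
                return False
    return True


print(is_concave_totally_monotone(OUT))               # True
print(is_concave_totally_monotone([[0, 1], [1, 0]]))  # False
```

The check costs O(n^4) comparisons and is meant only as a sanity test on small examples.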

Page 10: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

DIST matrix

          0    1    2    3    4    5
I0 = 1    0   -1   -2   -3    Δ    Δ
I1 = 2   -1   -1   -2   -1   -3    Δ
I2 = 3   -2    0    0    1   -1   -3
I3 = 2    Δ   -2   -2    0   -2   -2
I4 = 1    Δ    Δ   -2    0   -1   -1
I5 = 3    Δ    Δ    Δ   -2   -1    0

Page 11: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

OUT matrix

   1    0   -1   -2   -∞   -∞
   1    1    0    1   -1   -∞
   1    3    3    4    2    0
 -12    0    0    2    0    0
 -13  -13   -1    1    0    0
 -14  -14  -14    1    2    3

Page 12: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

OUT matrix

• Concave monotonicity: if, in the left column, the upper entry is smaller than the lower one, then in the right column the upper entry is also smaller than the lower one.

• No new column maximum: the padding values –(n + i + 1) * k and -∞ never become a column maximum.

Page 13: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

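[Figure: alignment grid of A = c | t | a | cg | ag | a (rows, phrases 1 to 5) against B = a | ac | g | acg | a (columns, phrases 1 to 4).]

Graph G for Block(5,4)

[Figure: the grid graph G aligning ag with acg.]

The New block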

Page 14: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

[Figure: the block aligning ag with acg, with its input border I0 to I5 and the output border entry O3 highlighted.]

DIST matrix

          0    1    2    3    4    5
I0 = 1    0   -1   -2   -3    Δ    Δ
I1 = 2   -1   -1   -2   -1   -3    Δ
I2 = 3   -2    0    0    1   -1   -3
I3 = 2    Δ   -2   -2    0   -2   -2
I4 = 1    Δ    Δ   -2    0   -1   -1
I5 = 3    Δ    Δ    Δ   -2   -1    0

Corresponding matrices

Page 15: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Maintaining Direct Access to DIST Columns

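• Goal: the OUT matrix needed when running SMAWK must be supplied from DIST and the input, and every entry of the OUT matrix must be obtainable in constant time. At the same time, the space bound must not be exceeded.

• Approach: store only the newly generated column, and maintain a data structure.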

Page 16: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

[Figure: the trie for A (nodes 0 to 5) and the trie for B (nodes 0 to 4) next to the alignment grid of A against B; DIST(5,4) is shown as a list of its columns.]

Data Structure

Page 17: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

[Figure: the trie for A (nodes 0 to 5) and the trie for B (nodes 0 to 4) next to the alignment grid of A against B; DIST(5,4) is shown as a list of its columns.]

Construction

Page 18: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Time and Space Complexity

• Build the new column.

• Build the DIST vector (that is, locate all the columns of this DIST matrix).

• Use SMAWK to compute the output maxima from this DIST (plus the input).

O(t) per block

[Figure: the trie for A, the trie for B, and DIST(5,4) before and after adding the new column, next to the alignment grid of A against B.]

Data Structure

Page 19: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Total complexity

[Figure: a block with input border I0 to I5 and output border O0 to O5, shown inside the alignment grid of A against B. The grid dimensions are labeled $hn / \log n$ and $n$.]

Total: $(hn / \log n) \times n = O(hn^2 / \log n)$

Page 20: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Sub-Quadratic Local Alignment

Eric, Yu En Lu

Information Management Dept.

National Taiwan University

Page 21: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Sub-Quadratic Global Alignment

• Exploits the redundancy (self-repetition) that Lempel-Ziv compression exposes in the sequences to obtain the sub-quadratic bound

Page 22: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Sub-Quadratic Local Alignment

• Requires additional knowledge of where a locally optimal alignment starts and ends

• Because the algorithm works block by block, we have to compute additional information specific to each block

• This information is then used as the cue to the final score

Page 23: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Additional Information

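[Figure: a block annotated with its input border I, the vectors S[i] and E[k], and the value C.]

$F = \max\{\max_{i=0}^{t} \{I[i] + E[i]\},\ C\}$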

Page 24: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Algorithm Body

• Given: $DIST_G$

• Encoding
– Compute values of E
– Compute values of S
– Compute values of C

• Propagation
– Compute values of O’ (modified from the O in global alignment)
– Compute F

• Seek highest score
– Find the highest score F

Page 25: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Back-Tracking the Exact Path

• Global Alignment

• Local Alignment
– Given the block with the max F value
– We trace its path by recursively moving to whichever of its lp, tp, or dia blocks (left prefix, top prefix, diagonal prefix) attains max{lp, tp, dia}, until the score reaches 0

Page 26: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Time/Space Analysis

• Encoding
– E: $E[i] = \max\{E[i]_{lp},\ E[i]_{tp},\ DIST[i, lc]\}$, O(t)
– S: all other entries can be copied, except $S_{lr,lc} = \max\{S_{lr-1,lc} + W,\ S_{lr,lc-1} + W,\ S_{lr-1,lc-1} + W\}$, O(t)
– C: $C = \max\{C_{lp},\ C_{tp},\ S[lc]\}$, O(1)

• Propagation
– $O[i] = \max\{O[i],\ S[i]\}$, O(t)
– $F = \max\{\max_{i=0}^{t} \{I[i] + E[i]\},\ C\}$, O(t)

• Find F: $O(hn^2 / \log^2 n)$

• Total complexity: $O(hn^2 / \log n)$
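The two propagation formulas on this slide can be illustrated with a minimal sketch. All numbers below are made up for the example, and the function name `propagate_local` is hypothetical; only the two max formulas come from the slide.

```python
def propagate_local(out_global, s, inputs, e, c):
    """Apply O[i] = max{O[i], S[i]} and F = max{max_i (I[i] + E[i]), C} for one block."""
    o_prime = [max(o, s_i) for o, s_i in zip(out_global, s)]
    f = max(max(i_val + e_val for i_val, e_val in zip(inputs, e)), c)
    return o_prime, f


# Hypothetical values for one block with a border of length 6.
O = [1, 3, 3, 4, 2, 3]   # output border from the global propagation
S = [0, 1, 4, 2, 3, 1]   # assumed S vector
I = [1, 2, 3, 2, 1, 3]   # input border
E = [0, 2, 1, 3, 0, 1]   # assumed E vector
C = 4                    # assumed C value

o_prime, F = propagate_local(O, S, I, E, C)
print(o_prime)  # [1, 3, 4, 4, 3, 3]
print(F)        # 5
```

Both formulas are single passes over the border, which matches the O(t) cost stated for the propagation step.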

Page 27: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Further Improvements

• Efficient alignment storage algorithm
– Conditioned on “discrete weights”
– Gives a minimal encoding of DIST (O(t) → O(1)) for G
– Thus we obtain $O(hn^2 / (\log n)^2)$ storage complexity in the global alignment problem
– While the time complexity is $O(hn^2 / \log n)$

Page 28: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Now, we are going to have presentations on

SMAWK & LZ-76

Thank you!

Page 29: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

$O\left(\frac{hn^2}{\log n}\right)$

Page 30: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

The Maximum Numbers of Distinct Words

Speaker : Emory Chang

Date : 2002/1/31

Reference: Lempel and Ziv, 1976, “On the Complexity of Finite Sequences”

Page 31: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

What is a Distinct Word?

• Ex (LZ78):

A = {0,1}, α = |A| = 2
S = 0101000, n = |S| = 7

Parse: 0 • 1 • 01 • 00 • 0

[Figure: trie of the parse.]

We have four distinct words, and five steps to generate the sequence.

Page 32: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Notation

• A : the alphabet (the set of symbols)
• α : the size of the alphabet, α = |A|

• S : a sequence belonging to $A^n$

• n : the length of S

• C(S) : production complexity of S

• N : the maximum possible number of distinct words.

Page 33: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

The upper bound

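• Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct words.

• For every $S \in A^n$:

$C(S) < \frac{n}{(1 - \varepsilon_n)\log n}$, where $\varepsilon_n = 2\,\frac{1 + \log\log(\alpha n)}{\log n}$ (logarithms to base $\alpha$)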

Page 34: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Special case (1/2)

Let N denote the maximum possible number of distinct words.

Clearly C(S) < N+1 (with a possible exception of the last one).

Consider the special case:

The sequence is formed by all distinct words of length one, two, …, k.

Ex: S = 0 • 1 • 00 • 01 • 10 • 11

[Figure: trie with nodes 1 to 6, one node per word.]

$n = 1 \cdot 2^1 + 2 \cdot 2^2 = 10$

Page 35: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Special case(2/2)

$N_k = \sum_{i=1}^{k} \alpha^i$ (the number of nodes at level i is $\alpha^i$)

$n_k = \sum_{i=1}^{k} i\,\alpha^i$ (the length of the symbols at level i is $i$)

For $k > 1$: $C(S) \le N_k < \frac{n_k}{k-1}$
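To make the special case concrete, the sketch below counts the words and the total length when S consists of all words of length 1 to k over an alphabet of size alpha. The function name `word_counts` is an illustrative assumption, and the loop at the end only spot-checks the inequality N_k < n_k / (k - 1) for a few small values.

```python
def word_counts(alpha, k):
    """Return (N_k, n_k): number of words and total length for all words of length 1..k."""
    n_words = sum(alpha ** i for i in range(1, k + 1))       # alpha^i words at level i
    length = sum(i * alpha ** i for i in range(1, k + 1))    # each of them has length i
    return n_words, length


print(word_counts(2, 2))  # (6, 10): the words 0, 1, 00, 01, 10, 11

# Spot check of N_k < n_k / (k - 1) for k > 1, written without division.
for alpha in range(2, 5):
    for k in range(2, 8):
        N_k, n_k = word_counts(alpha, k)
        assert N_k * (k - 1) < n_k
```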

Page 36: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

General Case

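$n = n_k + \Delta_k$, with $0 < \Delta_k < (k+1)\,\alpha^{k+1}$

$C(S) \le N_k + \frac{\Delta_k}{k+1} < \frac{n_k}{k-1} + \frac{\Delta_k}{k+1} \le \frac{n_k + \Delta_k}{k-1} = \frac{n}{k-1}$

Level k+1: the length of every word at level k+1 is k+1.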

Page 37: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

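Proof of $C(S) < \frac{n}{(1 - \varepsilon_n)\log n}$:

Since

$\log n < \log n_{k+1} = \log \sum_{i=1}^{k+1} i\,\alpha^i \le (k+1) + 2\log(k+1)$

We have

$k - 1 \ge \log n - 2 - 2\log(k+1)$

Therefore (using $k \le \log n$)

$k - 1 \ge \log n - 2 - 2\log(1 + \log n) = (1 - \varepsilon_n)\log n$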

Page 38: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

SMAWK

A Linear Time Algorithm for the Maximum Problem on Wide Totally

Monotone Matrices

Page 39: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Definition

• Let A be an n × m matrix with real entries.
– $A^j$ denotes the jth column of A and $A_i$ denotes the ith row of A.
– $A[i_1,…,i_k; j_1,…,j_k]$ denotes the submatrix of A formed by rows $i_1,…,i_k$ and columns $j_1,…,j_k$.
– Let j(i) be the smallest column index j such that A(i,j) equals the maximum value in $A_i$.

Page 40: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

   1    1    1  -12  -13  -14
   0    1    3    0  -13  -14
  -1    0    3    0   -1  -14
  -2    1    4    2    1    1
  -∞   -1    2    0    0    2
  -∞   -∞    0    0    0    3

For i = 2, the row maximum 3 is in column 3, so j(i) = 3.

Page 41: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Definition

• An n × m matrix A is monotone if for $1 \le i_1 < i_2 \le n$, $j(i_1) \le j(i_2)$.

• A is totally monotone if every submatrix of A is monotone.

Page 42: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Another Definition

• In the previous paper, the definition of totally monotone is:
– A matrix M[0…m, 0…n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:
– 1. Convex condition: $M[a,c] \ge M[b,c] \implies M[a,d] \ge M[b,d]$ for all a < b and c < d
– 2. Concave condition: $M[a,c] \le M[b,c] \implies M[a,d] \le M[b,d]$ for all a < b and c < d

• We use concave here.

Page 43: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Comparison

• Now we want to compare these two definitions.

• The definition in the SMAWK paper is called Ds; the definition in this paper is called Dc (a transpose is needed to match the rows and columns of the two definitions).

Page 44: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Comparison(cont.)

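• To prove Dc ⟹ Ds:

– Dc holds on the matrix A[0…n, 0…m].

– Let A'[i…i', j…j'] be a submatrix of A with $i \le i_1 < i_2 \le i'$, $j_1 = j(i_1)$, $j_2 = j(i_2)$. We must show $j_1 \le j_2$.

Row $i_1$:  a  b  c  d   (d is in column $j_1$)
Row $i_2$:  e  f  g  h

a, b, c ≤ d ⟹ e, f, g ≤ h

So $j_1 \le j_2$.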

Page 45: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Comparison(cont.)

• Ds does not imply Dc:

– The following matrix satisfies Ds but not Dc.

1 2 0
0 2 1
1 2 0

• Dc is stronger.

Page 46: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

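Lemma 1

– We say an entry A[i,j] is dead if $j \ne j(i)$.

– Lemma 1: Let A be a totally monotone n × m matrix and let $1 \le j_1 < j_2 \le m$.
  If $A(r, j_1) \ge A(r, j_2)$, then the entries in $\{A(i, j_2) : 1 \le i \le r\}$ are dead.
  If $A(r, j_1) < A(r, j_2)$, then the entries in $\{A(i, j_1) : r \le i \le n\}$ are dead.

[Figure: columns $j_1$ and $j_2$ with row r marked, showing the dead entries of each case.]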

Page 47: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

REDUCE(A)

• REDUCE(A)

– C = A; k = 1

– While C has more than n columns do

– case

– C(k,k) ≥ C(k,k+1) and k < n : k = k+1

– C(k,k) ≥ C(k,k+1) and k = n : Delete column $C^{k+1}$

– C(k,k) < C(k,k+1) : Delete column $C^k$; if k > 1 then k = k-1

– return C
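The REDUCE procedure can be sketched in Python as follows. This version uses 0-based indices, returns the indices of the surviving columns instead of a new matrix, and the name `reduce_columns` is an illustrative choice. The 3 × 6 test matrix W is an assumption of this example: it consists of rows 1, 4 and 6 of the transposed OUT matrix.

```python
NEG_INF = float("-inf")


def reduce_columns(a):
    """REDUCE: return the indices of at most n columns that can still hold a row maximum."""
    n = len(a)
    cols = list(range(len(a[0])))
    k = 0                                    # 0-based counterpart of the slide's k
    while len(cols) > n:
        if a[k][cols[k]] >= a[k][cols[k + 1]]:
            if k < n - 1:
                k += 1                       # case 1: move on to the next row
            else:
                del cols[k + 1]              # case 2: the right column is dead
        else:
            del cols[k]                      # case 3: the left column is dead
            if k > 0:
                k -= 1
    return cols


W = [
    [1, 1, 1, -12, -13, -14],
    [-2, 1, 4, 2, 1, 1],
    [NEG_INF, NEG_INF, 0, 0, 0, 3],
]
print(reduce_columns(W))  # [0, 2, 5]
```

The leftmost row maxima of W sit in columns 0, 2 and 5, so every maximum survives the reduction.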

Page 48: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

REDUCE(A)


Page 49: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

REDUCE(A)

Page 50: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

REDUCE(A)


Page 51: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Time Complexity

• Cases 2 and 3 each delete one column, so together they occur m – n times.

• Case 1 occurs at most n + (m – n) – 1 times.

• In total, at most 2m – n – 1 steps.

• O(m)

Page 52: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

MAXCOMPUTE(A)

• MAXCOMPUTE(A)
– B = REDUCE(A)
– If n = 1 then output the maximum and return
– C = B[2, 4, …, 2⌊n/2⌋; 1, 2, …, n]
– MAXCOMPUTE(C)
– From the known positions of the maxima in the even rows of B, find the maxima in its odd rows.
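A compact sketch of the whole recursion is given below. It folds REDUCE into a stack-based loop, uses 0-based indices, and returns the column of the leftmost maximum of every row. The function name and the test matrix (the transposed OUT matrix of the running example) are choices made for this illustration, not the presenters' code.

```python
NEG_INF = float("-inf")


def smawk_row_maxima(a):
    """Column index of the leftmost maximum of each row of a totally monotone matrix."""
    result = [None] * len(a)

    def solve(rows, cols):
        if not rows:
            return
        # REDUCE: keep at most len(rows) columns that can still hold a maximum.
        kept = []
        for c in cols:
            while kept:
                r = rows[len(kept) - 1]
                if a[r][kept[-1]] >= a[r][c]:
                    break              # the kept column still wins in row r
                kept.pop()             # by Lemma 1 the kept column is dead
            if len(kept) < len(rows):
                kept.append(c)
        # Recurse on the even rows (2, 4, ... in the slide's 1-based numbering).
        solve(rows[1::2], kept)
        # Fill in the odd rows by scanning between neighbouring known maxima.
        pos = 0
        for idx in range(0, len(rows), 2):
            r = rows[idx]
            last = result[rows[idx + 1]] if idx + 1 < len(rows) else kept[-1]
            best = kept[pos]
            while kept[pos] != last:
                pos += 1
                if a[r][kept[pos]] > a[r][best]:
                    best = kept[pos]
            result[r] = best

    solve(list(range(len(a))), list(range(len(a[0]))))
    return result


A = [
    [1, 1, 1, -12, -13, -14],
    [0, 1, 3, 0, -13, -14],
    [-1, 0, 3, 0, -1, -14],
    [-2, 1, 4, 2, 1, 1],
    [NEG_INF, -1, 2, 0, 0, 2],
    [NEG_INF, NEG_INF, 0, 0, 0, 3],
]
positions = smawk_row_maxima(A)
print(positions)                                     # [0, 2, 2, 2, 2, 5]
print([A[i][j] for i, j in enumerate(positions)])    # [1, 3, 3, 4, 2, 3]
```

The row maxima of the transposed OUT matrix are exactly the output border values O0 to O5 of the running example.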

Page 53: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

IDEA

Page 54: A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

Time Complexity

• $T(n,m) = c_1 m + c_2 n + T(n/2, n)$

• $= c_1 m + (c_1 + c_2) n + c_2 n/2 + T(n/4, n/2)$

• $T(n,m) \le 2(c_1 + c_2) n + c_1 m = O(m)$