A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
A Sub-quadratic Sequence Alignment
Algorithm for Unrestricted Cost Matrices
Maxime Crochemore
Gad M. Landau
Michal Ziv-Ukelson
Presentation
• R89922024 蘇展弘
• B86202049 葉恆青
• R90725054 呂育恩
• R90922001 張文亮
• R90922091 游騰楷
Outline
• Introduction and preliminaries.
  – LZ-78.
  – The basic concept.
• Global alignment.
• Local alignment.
• Proof for LZ-76.
• Proof for SMAWK algorithm.
LZ-78
[Figure: LZ-78 parsing of the string aacgacga, one phrase at a time, building a trie with root 0 and nodes 1-4]
The number of distinct code words:
$O(n / \log n)$
Sample of LZ-78
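Phrase:  1  2  3  4   5
A:       c  t  a  cg  ag  a

Phrase:  1  2   3  4
B:       a  ac  g  acg  a

[Figure: trie for A with root 0; nodes 1 (c), 2 (t) and 3 (a) are children of the root, node 4 (cg) hangs below node 1 on edge g, and node 5 (ag) hangs below node 3 on edge g]
[Figure: trie for B with root 0; nodes 1 (a) and 3 (g) are children of the root, node 2 (ac) hangs below node 1 on edge c, and node 4 (acg) hangs below node 2 on edge g]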
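The parsing on this slide can be reproduced with a few lines of Python. This is a minimal sketch of LZ-78 phrase splitting only (no code-word output); the function name `lz78_parse` is a hypothetical helper, not something from the paper.

```python
def lz78_parse(s):
    """Split s into LZ-78 phrases.

    Each phrase is the longest previously seen phrase extended by one
    new symbol. A trailing remainder that repeats an earlier phrase is
    returned as the last element.
    """
    seen = set()
    phrases = []
    current = ""
    for ch in s:
        current += ch
        if current not in seen:      # new phrase: record it, start over
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:                      # leftover that matches an old phrase
        phrases.append(current)
    return phrases


print(lz78_parse("ctacgaga"))  # ['c', 't', 'a', 'cg', 'ag', 'a']
print(lz78_parse("aacgacga"))  # ['a', 'ac', 'g', 'acg', 'a']
```

The two calls use the sample strings from the slide, and the distinct phrases correspond to the numbered trie nodes.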
Basic Concept
[Figure: alignment grid partitioned into blocks by the LZ-78 phrases; columns 0-4 follow the parse a, ac, g, acg, a of aacgacga, and rows 1-5 follow the parse c, t, a, cg, ag. Block 5/4 (row phrase ag, column phrase acg) is highlighted together with blocks 5/2, 3/4 and 3/2.]
Left prefix: block 5/2
Top prefix: block 3/4
Diagonal prefix: block 3/2
Input border: I
Output border: O
Basic Concept I/O Propagation Across G
DIST matrix
          0    1    2    3    4    5
I0=0      0   -1   -2   -3    △    △
I1=0     -1   -1   -2   -1   -3    △
I2=0     -2    0    0    1   -1   -3
I3=0      △   -2   -2    0   -2   -2
I4=0      △    △   -2    0   -1   -1
I5=0      △    △    △   -2   -1    0
OUT matrix
          0    1    2    3    4    5
I0=1      1    0   -1   -2   -∞   -∞
I1=2      1    1    0    1   -1   -∞
I2=3      1    3    3    4    2    0
I3=2    -12    0    0    2    0    0
I4=1    -13  -13   -1    1    0    0
I5=3    -14  -14  -14    1    2    3

Undefined entries in the upper-right triangle: directly assign -∞.
Undefined entries in the lower-left triangle: OUT[i,j] = -(n+i+1) x k, where k is the maximal absolute value in the penalty matrix.

Column maxima of OUT:
         O0   O1   O2   O3   O4   O5
          1    3    3    4    2    3
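The propagation step on these slides (OUT[i,j] = I[i] + DIST[i,j], followed by column maxima) can be checked numerically. The sketch below is only an illustration: it takes the maxima by brute force rather than with SMAWK, marks undefined DIST entries with `None`, and assumes n = 8 and k = 1, which are the values that reproduce the -12, -13, -14 padding shown on the slide.

```python
NEG_INF = float("-inf")
U = None  # undefined DIST entry (the triangle symbol on the slide)

DIST = [
    [0, -1, -2, -3, U, U],
    [-1, -1, -2, -1, -3, U],
    [-2, 0, 0, 1, -1, -3],
    [U, -2, -2, 0, -2, -2],
    [U, U, -2, 0, -1, -1],
    [U, U, U, -2, -1, 0],
]
I = [1, 2, 3, 2, 1, 3]   # input border values from the slide
n, k = 8, 1              # assumed: sequence length 8, max |penalty| 1


def out_entry(i, j):
    """OUT[i][j] = I[i] + DIST[i][j], with padding for undefined entries."""
    d = DIST[i][j]
    if d is not None:
        return I[i] + d
    if j > i:                    # upper-right triangle
        return NEG_INF
    return -(n + i + 1) * k      # lower-left triangle


OUT = [[out_entry(i, j) for j in range(6)] for i in range(6)]
# Output border: maximum of every OUT column (brute force here).
O = [max(OUT[i][j] for i in range(6)) for j in range(6)]
print(O)  # [1, 3, 3, 4, 2, 3]
```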
Basic Concept Monge Property
Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays.
Def: A matrix M[m x n] is Monge if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:
1. convex condition: $M[a,c] + M[b,d] \le M[b,c] + M[a,d]$ for all a < b and c < d.
2. concave condition: $M[a,c] + M[b,d] \ge M[b,c] + M[a,d]$ for all a < b and c < d.
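A brute-force check of the Monge conditions can make the definition concrete. This is a sketch under one assumption: the conditions are read as M[a,c] + M[b,d] ≤ M[b,c] + M[a,d] (convex) and the same with ≥ (concave). The function `is_monge` and the example matrix are hypothetical and only for illustration.

```python
from itertools import combinations


def is_monge(M, concave=True):
    """Check the Monge condition on every pair of rows a < b and columns c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    for a, b in combinations(rows, 2):        # a < b
        for c, d in combinations(cols, 2):    # c < d
            lhs = M[a][c] + M[b][d]
            rhs = M[b][c] + M[a][d]
            if concave and lhs < rhs:
                return False
            if not concave and lhs > rhs:
                return False
    return True


# M[i][j] = i * j gives lhs - rhs = (b - a) * (d - c) > 0, so it is
# concave Monge but not convex Monge; negating it swaps the two.
prod = [[i * j for j in range(4)] for i in range(4)]
neg = [[-x for x in row] for row in prod]
print(is_monge(prod, concave=True), is_monge(prod, concave=False))
print(is_monge(neg, concave=False))
```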
Basic Concept Totally Monotone
An important property of Monge arrays is that of being totally monotone.
Def: A matrix M[m x n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:
1. convex condition: $M[a,c] \ge M[b,c] \implies M[a,d] \ge M[b,d]$ for all a < b and c < d.
2. concave condition: $M[a,c] \le M[b,c] \implies M[a,d] \le M[b,d]$ for all a < b and c < d.
Both DIST and OUT matrices are totally monotone by the concave condition.
Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which can compute in O(n) time all row and column maxima of an n x n totally monotone matrix, by querying only O(n) elements of the array.
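The claim that OUT is totally monotone by the concave condition can be tested directly on the sample OUT matrix from these slides. The sketch assumes the concave condition reads M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for a < b, c < d; the checker name is hypothetical.

```python
from itertools import combinations

NEG_INF = float("-inf")

# Sample OUT matrix from the slides (rows I0..I5, columns 0..5).
OUT = [
    [1, 0, -1, -2, NEG_INF, NEG_INF],
    [1, 1, 0, 1, -1, NEG_INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]


def is_totally_monotone_concave(M):
    """Brute-force test of the concave total-monotonicity condition."""
    for a, b in combinations(range(len(M)), 2):         # rows a < b
        for c, d in combinations(range(len(M[0])), 2):  # columns c < d
            if M[a][c] <= M[b][c] and not (M[a][d] <= M[b][d]):
                return False
    return True


print(is_totally_monotone_concave(OUT))               # True
print(is_totally_monotone_concave([[0, 1], [1, 0]]))  # False
```

The brute-force check costs O(m²n²) and is meant only to validate small examples, not to be part of the alignment algorithm.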
DIST matrix
          0    1    2    3    4    5
I0 = 1    0   -1   -2   -3    Δ    Δ
I1 = 2   -1   -1   -2   -1   -3    Δ
I2 = 3   -2    0    0    1   -1   -3
I3 = 2    Δ   -2   -2    0   -2   -2
I4 = 1    Δ    Δ   -2    0   -1   -1
I5 = 3    Δ    Δ    Δ   -2   -1    0

OUT matrix
    1    0   -1   -2   -∞   -∞
    1    1    0    1   -1   -∞
    1    3    3    4    2    0
  -12    0    0    2    0    0
  -13  -13   -1    1    0    0
  -14  -14  -14    1    2    3

• Concave monotonicity: if the upper entry in the left column is smaller than the lower one, then the upper entry in the right column is also smaller than the lower one.
• No new column maximum: –(n + i + 1) * k
[Figure: block grid for A = ctacgaga (rows 1-5: c, t, a, cg, ag) and B = aacgacga (columns 0-4: a, ac, g, acg, a)]
Graph G for Block(5,4)
[Figure: alignment graph G for block (5,4), with rows a, g and columns a, c, g]
The New block
[Figure: the new block with input border I0…I5 and output O3, shown with its DIST matrix and the inputs I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3]
Corresponding matrices
Maintaining Direct Access to DIST Columns
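• Goal: the OUT matrix needed when running SMAWK must be supplied from DIST and the input, with each entry of the OUT matrix obtainable in constant time, while the space bound must not be exceeded.
• Approach: store only the newly generated column, and maintain a data structure.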
[Figure: trie for A (nodes 1-5) and trie for B (nodes 1-4); the columns of DIST(5,4) are stored with the trie nodes, alongside the block grid for A and B]
Data Structure
Construction
Time and Space Complexity
• Build the new column
[Figure: tries for A and B with the columns of DIST(5,4), alongside the block grid for A and B]
Data Structure
• Build the DIST vector (that is, locate all the columns of this DIST matrix).
• Use SMAWK to compute the output maxima from this DIST (plus the input).
O(t)
Total complexity
[Figure: a block with input border I0…I5 and output border O0…O5, alongside the block grid, with dimensions labelled n and h n / log(n)]
O(h n^2 / log(n))
Sub-Quadratic Local Alignment
Eric, Yu En Lu
Information Management Dept.
National Taiwan University
Sub-Quadratic Global Alignment
• Exploits the redundancy (self-repetition) within the sequences, as exposed by Lempel-Ziv compression, to obtain the sub-quadratic running time
Sub-Quadratic Local Alignment
• Requires additional knowledge of where a locally optimal string starts and ends
• However, because the algorithm works on a per-block basis, we have to compute additional information specific to each block
• This information is then used as the cue to the final score
Additional Information
[Figure: a block with input border I and the values S[i], C and E[k]]
F = max{ max_{i=0..t} {I[i] + E[i]}, C }
Algorithm Body
• Given: DIST_G
• Encoding
  – Compute values of E
  – Compute values of S
  – Compute values of C
• Propagation
  – Compute values of O’ (modified from the O in global alignment)
  – Compute F
• Seek Highest Score
  – Find the highest score F
Back-Tracking the Exact Path
• Global Alignment
• Local Alignment
  – Given the block with the max F value
  – We trace its path back through whichever of its lp, tp or dia blocks attains max{lp, tp, dia}, recursively, until the score reaches 0
Time/Space Analysis
• Encoding
  – E: max{E[i]_lp, E[i]_tp, DIST[i, lc]}   O(t)
  – S: (all others can be copied, except..) S_{lr,lc} = max{S_{lr-1,lc}+W, S_{lr,lc-1}+W, S_{lr-1,lc-1}+W}   O(t)
  – C: max{C_lp, C_tp, S[lc]}   O(1)
• Propagation
  – O[i] = max{O[i], S[i]}   O(t)
  – F = max{ max_{i=0..t} {I[i]+E[i]}, C }   O(t)
• Find F   O(hn^2/log^2 n)
• Total Complexity   O(hn^2/log n)
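The two propagation formulas above are simple maxima over border vectors. The sketch below only illustrates them: the function names and the numeric values are hypothetical, and the O(t) encoding step that produces E, S and C is not shown.

```python
def block_best_score(I, E, C):
    """F = max( max_i (I[i] + E[i]), C ) for one block."""
    best_through_input = max(i_val + e_val for i_val, e_val in zip(I, E))
    return max(best_through_input, C)


def propagate(O, S):
    """O'[i] = max(O[i], S[i]): a local alignment may start inside the block."""
    return [max(o, s) for o, s in zip(O, S)]


# Hypothetical border values, only to show the calls.
I = [1, 2, 3]
E = [0, 1, 0]
print(block_best_score(I, E, C=2))      # 3
print(block_best_score(I, E, C=7))      # 7
print(propagate([1, 3, 3], [0, 4, 3]))  # [1, 4, 3]
```

Both functions run in time linear in the border length t, matching the O(t) entries in the list.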
Further Improvements
• Efficient alignment storage algorithm
  – Conditioned on “discrete weights”
  – Gives a minimal encoding of DIST (O(t) → O(1)) for G
  – Thus we obtain O(hn^2/(log n)^2) storage complexity in the Global-Alignment problem
  – While time complexity is O(hn^2/log n)
Now, we are going to have presentations on
SMAWK & LZ-76
Thank you!
$O(hn^2 / \log n)$
The Maximum Numbers of Distinct Words
Speaker : Emory Chang
Date : 2002/1/31
Reference: Lempel and Ziv, 1976, “On the Complexity of Finite Sequences”
What is a Distinct Word?
• EX: (LZ78)
A = {0,1}, α = |A| = 2
S = 0101000, n = |S| = 7
[Figure: trie with a root and nodes 1-5 for the parse of S]
0 • 1 • 0 1 • 0 0 • 0
We have four distinct words, and five steps to generate the sequence.
Notation
• A : the alphabet (the set of symbols)
• α : the number of symbols in A
• S : a sequence over A
• n : the length of S
• C(S) : production complexity of S
• N : the maximum possible number of distinct words.
The upper bound
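• Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct words.
• For every $S \in A^n$:
$$C(S) < \frac{n}{(1-\varepsilon_n)\log n}, \qquad \varepsilon_n = 2\,\frac{1+\log\log(\alpha n)}{\log n}$$
where logarithms are taken to base α.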
Special case(1/2)Let N denote the maximum possible number of distinct words.
Clearly C(S) < N+1 (a possible exception of the last one)
Consider the special case :
The sequence is formed by all distinct words of length of one, two, …, k
ex: S = 0 • 1 • 00 • 01 • 10 • 11
[Figure: trie with a root and nodes 1-6, one node per word]
n = 1•2^1 + 2•2^2
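The special case can be reproduced by concatenating every word of length 1 to k and counting. The sketch below is only an illustration with a hypothetical helper name; it returns the sequence S and its list of words, from which n = |S| and N (the number of distinct words) follow.

```python
from itertools import product


def all_words_sequence(alphabet, k):
    """Concatenate all words of length 1, 2, ..., k over the alphabet."""
    words = [
        "".join(w)
        for length in range(1, k + 1)
        for w in product(alphabet, repeat=length)
    ]
    return "".join(words), words


S, words = all_words_sequence("01", 2)
n, N = len(S), len(words)
print(S, n, N)  # 0100011011 10 6

# The same counts from the level sums: alpha**i nodes at level i,
# each contributing a word of length i.
alpha, k = 2, 2
assert N == sum(alpha**i for i in range(1, k + 1))
assert n == sum(i * alpha**i for i in range(1, k + 1))
```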
Special case(2/2)
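$$N_k = \sum_{i=1}^{k} \alpha^i = \frac{\alpha(\alpha^k - 1)}{\alpha - 1}$$
$$n_k = \sum_{i=1}^{k} i\,\alpha^i = \frac{\alpha^{k+1}}{\alpha-1}\left(k - \frac{1}{\alpha-1}\right) + \frac{\alpha}{(\alpha-1)^2}$$
In the sum for $n_k$, the factor i is the length of the words at level i, and $\alpha^i$ is the number of nodes at level i.
$$C(S) \le N_k < \frac{n_k}{k-1} \qquad (k > 1)$$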
General Case
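$$n = n_k + \Delta, \qquad 0 \le \Delta < (k+1)\,\alpha^{k+1}$$
$$C(S) \le N_k + \frac{\Delta}{k+1} < \frac{n_k}{k-1} + \frac{\Delta}{k+1} \le \frac{n}{k-1}$$
Level k+1: the length of each word at level k+1 is k+1.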
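Proof:
$$C(S) < \frac{n}{(1-\varepsilon_n)\log n}$$
Since
$$\log n < \log n_{k+1} = \log \sum_{i=1}^{k+1} i\,\alpha^i \le (k+1) + 2\log(k+1)$$
We have
$$k - 1 > \log n - 2 - 2\log(k+1)$$
Therefore
$$k - 1 > \log n - 2 - 2\log(1+\log n) = (1-\varepsilon_n)\log n$$
using $k \le \log n$ (from $n \ge n_k \ge \alpha^k$) and $\log(\alpha n) = 1 + \log n$ for base-α logarithms.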
SMAWK
A Linear Time Algorithm for the Maximum Problem on Wide Totally
Monotone Matrices
Definition
• Let A be an n × m matrix with real entries.
  – A^j denotes the jth column of A and A_i denotes the ith row of A.
  – A[i1,…,ik; j1,…,jk] denotes the submatrix of A.
  – Let j(i) be the smallest column index j such that A(i,j) equals the maximum value in A_i.

Example (i = 2, j(i) = 3):
    1    1    1  -12  -13  -14
    0    1    3    0  -13  -14
   -1    0    3    0   -1  -14
   -2    1    4    2    1    1
   -∞   -1    2    0    0    2
   -∞   -∞    0    0    0    3
Definition
• An n × m matrix A is monotone if for 1 ≤ i1 < i2 ≤ n, j(i1) ≤ j(i2).
• A is totally monotone if every submatrix of A is monotone.
Another Definition
• In the previous paper, the definition of totally monotone is:
  – A matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n:
  – 1. Convex condition: M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d] for all a<b and c<d
  – 2. Concave condition: M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for all a<b and c<d
• We use concave here.
Comparison
• Now we want to compare these two definitions.
• Call the definition in the SMAWK paper Ds and the definition in this paper Dc (a transpose is needed to match the rows and columns of the two definitions).
Comparison(cont.)
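• To prove Dc ⟹ Ds.
  – Dc holds on matrix A[0…n,0…m]
  – Let A’[i…i’,j…j’] be a submatrix of A, with i ≤ i1 < i2 ≤ i’, j1 = j(i1), j2 = j(i2). Claim: j1 ≤ j2.
[Figure: row i1 with entries a b c d and row i2 with entries e f g h; column j1 is the column of d]
a, b, c ≤ d ⟹ e, f, g ≤ h
So j1 ≤ j2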
Comparison(cont.)
• To show that Ds does not imply Dc:
  – The matrix
      1 2 0
      0 2 1
      1 2 0
    satisfies Ds but not Dc.
• Dc is stronger.
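Lemma 1
– We define an entry A[i,j] to be dead if j ≠ j(i).
– Lemma 1: Let A be a totally monotone n × m matrix and let 1 ≤ j1 < j2 ≤ m.
  If A(r, j1) ≥ A(r, j2) then the entries in {A(i, j2) : 1 ≤ i ≤ r} are dead.
  If A(r, j1) < A(r, j2) then the entries in {A(i, j1) : r ≤ i ≤ n} are dead.
[Figure: columns j1 and j2 crossed by row r, marking the dead entries of column j2 in rows up to r and the dead entries of column j1 in rows from r on]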
REDUCE(A)
• REDUCE(A)
  – C = A; k = 1
  – While C has more than n columns do
  –   case
  –     C(k,k) ≥ C(k,k+1) and k < n : k = k+1
  –     C(k,k) ≥ C(k,k+1) and k = n : Delete column C^{k+1}
  –     C(k,k) < C(k,k+1) : Delete column C^k;
  –       if k>1 then k = k-1
Time Complexity
• Case 2 + Case 3 = m – n
• Case 1 at most n + (m – n) –1
• Totally 2m – n – 1
• O(m)
MAXCOMPUTE(A)
• MAXCOMPUTE(A)
  – B = REDUCE(A)
  – If n = 1 then output the maximum and return
  – C = B[2, 4, …, 2⌊n/2⌋; 1, 2, …, n]
  – MAXCOMPUTE(C)
  – From the known positions of maxima in the even rows of B, find the maxima in its odd rows.
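REDUCE and MAXCOMPUTE can be combined into one short recursive routine. The following is a minimal sketch, not the presenters' code: it assumes a totally monotone matrix accessed through a hypothetical callback `value(row, col)`, keeps the left column on ties (so it returns leftmost maxima), and uses 0-based indices, so the slide's "even rows" are `rows[1::2]` here. The example matrix -(j - 2i)^2 is also an assumption, chosen because it is Monge and therefore totally monotone.

```python
def smawk(rows, cols, value):
    """Map each row to the column of its leftmost maximum.

    Assumes the matrix given by value(row, col) is totally monotone,
    i.e. the maxima positions move right as the row index grows.
    """
    answer = {}

    def maxcompute(rows, cols):
        if not rows:
            return
        # REDUCE: keep at most len(rows) columns, dropping dead ones.
        kept = []
        for c in cols:
            while kept:
                r = rows[len(kept) - 1]
                if value(r, kept[-1]) >= value(r, c):
                    break            # stack top survives at row r
                kept.pop()           # stack top is dead from row r on
            if len(kept) < len(rows):
                kept.append(c)       # otherwise column c is dead
        # Recurse on every second row (the slide's even rows).
        maxcompute(rows[1::2], kept)
        # Fill in the remaining rows between known neighbouring maxima.
        lo = 0
        for i in range(0, len(rows), 2):
            r = rows[i]
            if i + 1 < len(rows):
                hi = kept.index(answer[rows[i + 1]], lo)
            else:
                hi = len(kept) - 1
            best = lo
            for x in range(lo + 1, hi + 1):
                if value(r, kept[x]) > value(r, kept[best]):
                    best = x
            answer[r] = kept[best]
            lo = hi

    maxcompute(list(rows), list(cols))
    return answer


# Hypothetical test matrix: A[i][j] = -(j - 2*i)**2, maximum at j = 2*i.
print(smawk(range(4), range(9), lambda i, j: -(j - 2 * i) ** 2))
# {3: 6, 1: 2, 0: 0, 2: 4}
```

The scan for `hi` restarts from `lo`, so the fill-in pass touches each kept column a constant number of times, in line with the linear bound derived on the time-complexity slides.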
IDEA
Time Complexity
• T(n,m) = c1m + c2n + T(n/2, n)
• = c1m + (c1+c2)n + c2n/2 + T(n/4, n/2)
• T(n,m) = 2 (c1+c2)n + c1m = O(m)