Post on 15-Feb-2016
Introduction
Optimal alignments in linear space
Eugene W. Myers and Webb Miller

Outline
Introduction
Gotoh's algorithm
O(N) space Gotoh's algorithm
Main algorithm
Implementation
Conclusion

Introduction
Space, not time
Hirschberg's Algorithm: maximizing the similarity score of an alignment
Gotoh's Algorithm: minimizing the difference score of a conversion
Linear-space version for affine gap penalties
For a megabyte of memory:
Myers and Miller: sequences of length 62500
Altschul and Erickson: sequences of length < 1070

Transformation (1/2)
Hirschberg's match/mismatch scores versus Gotoh's replacement costs w(a,b) and gap penalties
Transformation (2/2)
Match = 8, Mismatch = -5, Gap Symbol = -3, Gap-open = -4
Example (1/2)
Example (2/2)
Hirschberg's Algorithm: maximum score
Gotoh's Algorithm: minimum conversion cost

Gotoh's algorithm (R99922005)
Some notations
A_i: the i-symbol prefix of A
B_j: the j-symbol prefix of B
C(i, j): minimum cost of a conversion of A_i to B_j
Simple gap (1/4)
gap(k) = h*k
Simple gap (2/4)
      -    A    A    G
-   0.0  2.5  3.0  3.5
A   2.5  0.0  2.5  3.0
G   3.0  2.5  1.0  2.5
T   3.5  3.0  3.5  2.0
A   4.0  3.5  3.0  4.5
C   4.5  4.0  4.5  4.0
Space = O(n^2)
Simple gap (3/4)
m/2
Simple gap (4/4)
Forward score and backward score
Space: O(m+n)
Affine gap (1/8)
A gap of length k: cost = g + k*h
A - - - T A A C T
C G A A T C - - T
Affine gap (2/8)
C(i, j): minimum cost of a conversion of A_i to B_j
D(i, j): minimum cost of a conversion of A_i to B_j that deletes a_i
I(i, j): minimum cost of a conversion of A_i to B_j that inserts b_j
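Affine gap (3/8)
\[
C(i,j) = \begin{cases}
\min\{D(i,j),\ I(i,j),\ C(i-1,j-1) + w(a_i,b_j)\} & \text{if } i > 0 \text{ and } j > 0 \\
gap(j) & \text{if } i = 0 \text{ and } j > 0 \\
gap(i) & \text{if } i > 0 \text{ and } j = 0 \\
0 & \text{if } i = 0 \text{ and } j = 0
\end{cases}
\]
Affine gap (4/8)
\[
D(i,j) = \begin{cases}
\min\{D(i-1,j),\ C(i-1,j) + g\} + h & \text{if } i > 0 \text{ and } j > 0 \\
C(0,j) + g & \text{if } i = 0 \text{ and } j > 0
\end{cases}
\]
Affine gap (5/8)
\[
I(i,j) = \begin{cases}
\min\{I(i,j-1),\ C(i,j-1) + g\} + h & \text{if } i > 0 \text{ and } j > 0 \\
C(i,0) + g & \text{if } i > 0 \text{ and } j = 0
\end{cases}
\]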
Affine gap (6/8)
Affine gap (7/8)
Matrices C, D, and I for A = AGTAC (rows) and B = AAG (columns); * marks an undefined entry.
C
      -    A    A    G
-   0.0  2.5  3.0  3.5
A   2.5  0.0  2.5  3.0
G   3.0  2.5  1.0  2.5
T   3.5  3.0  3.5  2.0
A   4.0  3.5  3.0  4.5
C   4.5  4.0  4.5  4.0
D
      -    A    A    G
-     *  4.5  5.0  5.5
A     *  5.0  5.5  6.0
G     *  2.5  5.0  5.5
T     *  3.0  3.5  5.0
A     *  3.5  4.0  4.5
C     *  4.0  4.5  5.0
I
      -    A    A    G
-     *    *    *    *
A   4.5  5.0  2.5  3.0
G   5.0  5.5  5.0  3.5
T   5.5  6.0  5.5  6.0
A   6.0  6.5  6.0  5.5
C   6.5  7.0  6.5  7.0
Affine gap (8/8)
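
The affine-gap recurrences for C, D, and I can be checked with a small quadratic-space sketch. The function name `gotoh_matrices` and the cost function `unit_mismatch` (0 for a match, 1 for a mismatch) are assumptions made for illustration; the unit mismatch cost and the values g = 2.0, h = 0.5 are inferred from the numbers in the slides' matrices, not stated as the presenters' code.

```python
def gotoh_matrices(a, b, g, h, w):
    """Fill C, D, I with the affine-gap recurrences; None marks an undefined cell."""
    m, n = len(a), len(b)
    C = [[None] * (n + 1) for _ in range(m + 1)]
    D = [[None] * (n + 1) for _ in range(m + 1)]
    I = [[None] * (n + 1) for _ in range(m + 1)]
    C[0][0] = 0.0
    for j in range(1, n + 1):
        C[0][j] = g + j * h           # gap(j)
        D[0][j] = C[0][j] + g         # boundary value of D in row 0
    for i in range(1, m + 1):
        C[i][0] = g + i * h           # gap(i)
        I[i][0] = C[i][0] + g         # boundary value of I in column 0
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j], C[i - 1][j] + g) + h
            I[i][j] = min(I[i][j - 1], C[i][j - 1] + g) + h
            C[i][j] = min(D[i][j], I[i][j],
                          C[i - 1][j - 1] + w(a[i - 1], b[j - 1]))
    return C, D, I


def unit_mismatch(x, y):
    # Assumed replacement cost: 0 for a match, 1 for a mismatch.
    return 0.0 if x == y else 1.0


C, D, I = gotoh_matrices("AGTAC", "AAG", 2.0, 0.5, unit_mismatch)
print(C[5][3])  # optimal conversion cost of AGTAC to AAG
```

Storing all three matrices costs O(MN) space, which is the cost the linear-space version avoids.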

O(N) space Gotoh's algorithm (R99922041)
Observation
The i-th row of C and D depends only on rows i and i-1.
The i-th row of I depends only on row i.
Linear Space
Use two one-dimensional arrays (CC and DD) and three variables.
Linear Space
Algorithm
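
A minimal sketch of the cost-only computation with two arrays, assuming the scalars s, c, e, and t play the roles suggested by the slide labels (s holds C(i-1, j-1), c holds C(i, j), e holds I(i, j), t tracks the boundary gap cost). The function name `gotoh_cost_linear` and the cost function `unit_mismatch` are hypothetical names chosen for this example.

```python
def gotoh_cost_linear(a, b, g, h, w):
    """Return the final CC and DD rows using O(N) space."""
    n = len(b)
    CC = [0.0] * (n + 1)
    DD = [0.0] * (n + 1)              # DD[0] is never read
    t = g
    for j in range(1, n + 1):         # row 0
        t += h
        CC[j] = t                     # gap(j)
        DD[j] = t + g
    t = g
    for i in range(1, len(a) + 1):
        s = CC[0]                     # C(i-1, 0), the next diagonal value
        t += h
        c = CC[0] = t                 # C(i, 0) = gap(i)
        e = t + g                     # I(i, 0)
        for j in range(1, n + 1):
            e = min(e, c + g) + h                 # I(i, j)
            DD[j] = min(DD[j], CC[j] + g) + h     # D(i, j)
            c = min(DD[j], e, s + w(a[i - 1], b[j - 1]))
            s = CC[j]                 # keep C(i-1, j) for the next diagonal
            CC[j] = c                 # C(i, j)
    return CC, DD


def unit_mismatch(x, y):
    # Assumed replacement cost: 0 for a match, 1 for a mismatch.
    return 0.0 if x == y else 1.0


CC, DD = gotoh_cost_linear("AGTAC", "AAG", 2.0, 0.5, unit_mismatch)
print(CC[-1])  # optimal conversion cost
```

Only the last row survives, so this yields the optimal cost but not the conversion itself.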
[Figure sequence: step-by-step trace of the linear-space algorithm on A = AGTAC, B = AAG with g = 2.0 and h = 0.5. Each frame shows the C, D, and I matrices beside the arrays CC and DD and the variables s, c, e, and t. The frames show t = 2.0 at initialization, then t = 4.5 at i = 5, then j = 1 in row i = 5.]
Optimal conversion cost.
What is the conversion of AGTAC to AAG?

Main algorithm part 1 (B95902077)
Midpoint
Hirschberg (1975): recursive divide-and-conquer
Forward computing and backward computing
Gap Penalty
Cells: (i-1, j-1), (i, j-1), (i-1, j), (i, j)
Gap Penalty
CC(j) = minimum cost of a conversion of A_{i*} to B_j
DD(j) = minimum cost of a conversion of A_{i*} to B_j that ends with a delete
Gap Penalty
RR(N - j) = minimum cost of a conversion of A_{i*}^T to B_j^T
SS(N - j) = minimum cost of a conversion of A_{i*}^T to B_j^T that begins with a delete
(A_i^T and B_j^T denote the suffixes of A and B that follow a_i and b_j.)
Find Midpoint with Gap Penalty
Backward computing and forward computing
How to compute the midpoint?

Main algorithm part 2 (R99922035)
Midpoint
The difficulty in calculating the midpoint is that when we concatenate two partial conversions into one, two gaps may coalesce into a single gap, so the gap-open penalty g must be charged only once.
This means that we may consider min { CC + RR, DD + SS - g, II + JJ - g }.
Midpoint
Recall that the algorithm above saves space by not storing II and JJ.
We can reduce the expression to min { CC + RR, DD + SS - g }.
Midpoint
Remember that we should find
\[
\min_{j \in [0, N]} \min\{\, CC(j) + RR(N-j),\ DD(j) + SS(N-j) - g,\ II(j) + JJ(N-j) - g \,\}
\]
Midpoint
Type 1 recurrence (the minimum is attained by CC + RR)
Type 2 recurrence (the minimum is attained by DD + SS - g)
Example
A = agtac, B = aag, i* = 2
agtac
a__ag
Recursive calls on (a, a) and (ac, ag)
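
The midpoint search in this example can be sketched as follows. This is an illustration under stated assumptions, not the full recursive algorithm: the helper names `cost_vectors` and `midpoint` are hypothetical, the backward vectors RR and SS are obtained by running the forward pass on the reversed suffix of A and the reversed B, DD(0) is set equal to CC(0), and the replacement cost is assumed to be 0 for a match and 1 for a mismatch.

```python
def cost_vectors(a, b, g, h, w):
    """Forward pass in O(N) space; returns the last rows CC and DD."""
    n = len(b)
    CC = [0.0] * (n + 1)
    DD = [0.0] * (n + 1)
    t = g
    for j in range(1, n + 1):
        t += h
        CC[j] = t
        DD[j] = t + g
    t = g
    for i in range(1, len(a) + 1):
        s = CC[0]
        t += h
        c = CC[0] = t
        e = t + g
        for j in range(1, n + 1):
            e = min(e, c + g) + h
            DD[j] = min(DD[j], CC[j] + g) + h
            c = min(DD[j], e, s + w(a[i - 1], b[j - 1]))
            s = CC[j]
            CC[j] = c
    DD[0] = CC[0]   # converting to the empty prefix can only end with a delete
    return CC, DD


def midpoint(a, b, i_star, g, h, w):
    """Return (cost, j_star, kind) where kind is 1 (CC + RR) or 2 (DD + SS - g)."""
    n = len(b)
    CC, DD = cost_vectors(a[:i_star], b, g, h, w)
    # Reversing both strings turns "ends with a delete" into "begins with a delete".
    RR, SS = cost_vectors(a[i_star:][::-1], b[::-1], g, h, w)
    best = None
    for j in range(n + 1):
        candidates = ((1, CC[j] + RR[n - j]), (2, DD[j] + SS[n - j] - g))
        for kind, cost in candidates:
            if best is None or cost < best[0]:
                best = (cost, j, kind)
    return best


def unit_mismatch(x, y):
    # Assumed replacement cost: 0 for a match, 1 for a mismatch.
    return 0.0 if x == y else 1.0


print(midpoint("AGTAC", "AAG", 2, 2.0, 0.5, unit_mismatch))
```

For i* = 2 the minimum is a type 2 case at j* = 1, which matches the recursive calls on (a, a) and (ac, ag): the deleted symbols g and t straddle the midpoint row.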

Implementation (R99922062)
Implementation
Storage requirement
Memory vs. sequence length
Comparison with the classic dynamic programming algorithm
Linear-space algorithm: space, not time
Storage Requirement (1/4)
Vectors: CC, DD, RR, and SS
Space: 4N words
M + N words for an optimal conversion
M = N = 38
Storage Requirement (2/4)
16384 words for the table w of replacement costs (128*128)
w            ASCII[1]  ASCII[2]  ASCII[3]  ASCII[4]  ...  ASCII[128]
ASCII[1]     W1,1      W1,2      W1,3      W1,4      ...  W1,128
ASCII[2]     W2,1      W2,2      W2,3      W2,4      ...  W2,128
ASCII[3]     W3,1      W3,2      W3,3      W3,4      ...  W3,128
ASCII[4]     W4,1      W4,2      W4,3      W4,4      ...  W4,128
...          ...       ...       ...       ...       ...  ...
ASCII[128]   W128,1    W128,2    W128,3    W128,4    ...  W128,128
Storage Requirement (3/4)
16 words for the table w of replacement costs (4*4)
     A        T        C        G
A    W(A,A)   W(A,T)   W(A,C)   W(A,G)
T    W(T,A)   W(T,T)   W(T,C)   W(T,G)
C    W(C,A)   W(C,T)   W(C,C)   W(C,G)
G    W(G,A)   W(G,T)   W(G,C)   W(G,G)
Storage Requirement (4/4)
M + N bytes for the sequences A and B
A and B could be compressed
For DNA sequences, only 2(M + N) bits are necessary
Compress -> Huffman code
Memory vs. sequence length
Maximum length of sequences that can be aligned in a given amount of memory
Altschul and Erickson: 7MN-bit approach
Memory (bytes)   Linear Space (w/o op.)   Linear Space (with op.)   Altschul and Erickson
64K              4000                     2666                      270
128K             8000                     5333                      382
256K             16000                    10666                     540
1000K            62500                    41666                     1069
Formula          N = Memory / (4*4)       N = Memory / (6*4)        N = sqrt(Memory * 8 / 7)
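
The three formulas in the table can be checked with a few lines of arithmetic. This sketch assumes 4-byte words and that "K" means 1000 bytes, both inferred from the tabulated values; the function name `max_lengths` is hypothetical.

```python
import math

WORD_BYTES = 4  # assumption: 4-byte words, as implied by the "/ (4*4)" formula


def max_lengths(memory_bytes):
    """Maximum sequence length N (with M = N) for each approach, rounded down."""
    linear = memory_bytes // (4 * WORD_BYTES)        # CC, DD, RR, SS: 4N words
    linear_with_op = memory_bytes // (6 * WORD_BYTES)  # plus M + N words: 6N words
    altschul_erickson = math.isqrt(memory_bytes * 8 // 7)  # 7*N*N bits
    return linear, linear_with_op, altschul_erickson


for kilobytes in (64, 128, 256, 1000):
    print(kilobytes, max_lengths(kilobytes * 1000))
```

The linear-space limits grow in proportion to memory, while the 7MN-bit approach grows only with its square root.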
Compared with the classic dynamic programming algorithm
Classic dynamic programming algorithm (Wagner and Fischer, 1974)
Space:
classic dynamic programming algorithm: O(MN)
linear-space algorithm: O(N + lg M)
Time: both O(MN)
In practice, the linear-space algorithm is slower than the classic dynamic programming algorithm.
linear-space : classic DP = 2.84 : 1

Conclusion (R99945020)
      -    C    G    G    A    T    C    A    T
-     0   -3   -6   -9  -12  -15  -18  -21  -24
C    -3    8    5    2   -1   -4   -7  -10  -13
T    -6    5    3    0   -3    7    4    1   -2
T    -9    2    0   -2   -5    5    2   -1    9
A   -12   -1   -3   -5    6    3    0   10    7
A   -15   -4   -6   -8    3    1   -2    8    5
C   -18   -7   -9  -11    0   -2    9    6    3
T   -21  -10  -12  -14   -3    8    6    4   14
Reduce problem
Reduce problem (cont.)
Reduce problem (cont.)
Partition line at m/2
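
The reduction at the m/2 partition line can be sketched for the simple (non-affine) scoring case. This is a hedged illustration: the function names `last_row_scores` and `partition_column` are hypothetical, and the scores match = 8, mismatch = -5, gap = -3 are taken from the Transformation slide and assumed to be the ones behind the score matrix for CTTAACT and CGGATCAT.

```python
def last_row_scores(a, b, match=8, mismatch=-5, gap=-3):
    """Similarity scores of a against every prefix of b, in O(len(b)) space."""
    prev = [j * gap for j in range(len(b) + 1)]
    for x in a:
        cur = [prev[0] + gap]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def partition_column(a, b):
    """Split a at row m/2 and find a column where an optimal path crosses it."""
    mid = len(a) // 2
    forward = last_row_scores(a[:mid], b)
    # Backward scores come from aligning the reversed suffix with reversed b.
    backward = last_row_scores(a[mid:][::-1], b[::-1])
    n = len(b)
    totals = [forward[j] + backward[n - j] for j in range(n + 1)]
    best = max(totals)
    return mid, totals.index(best), best


mid, column, score = partition_column("CTTAACT", "CGGATCAT")
print(mid, column, score)
```

The two subproblems on either side of the returned cell can then be solved recursively, each needing only two rows of scores at a time.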