18 May 2006
Klaus-Bernd Schürmann, Jens Stoye
Technische Fakultät, Universität Bielefeld, Germany
Counting Suffix Arrays and Strings
Dagstuhl, May 2006 - Jens Stoye Slide 2
Suffix Array Data Structure
Text to be indexed:

position   1  2  3  4  5  6  7  8  9 10 11 12 13
t          T  C  T  T  C  T  C  T  T  C  T  C  $

Suffix array: the lexicographically sorted list of all suffixes (start position - suffix):

13 - $
12 - C$
10 - CTC$
 5 - CTCTTCTC$
 7 - CTTCTC$
 2 - CTTCTCTTCTC$
11 - TC$
 9 - TCTC$
 4 - TCTCTTCTC$
 6 - TCTTCTC$
 1 - TCTTCTCTTCTC$
 8 - TTCTC$
 3 - TTCTCTTCTC$
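As a small cross-check of this listing, the suffix array can be recomputed by plain sorting. This is only a minimal sketch and not a construction algorithm from the talk; the function name suffix_array is made up for illustration, and the code relies on '$' sorting before the letters in ASCII.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes in sorted order."""
    # Naive approach: sort the start positions by the suffix they begin.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


text = "TCTTCTCTTCTC$"
for pos in suffix_array(text):
    print(pos, "-", text[pos - 1:])
```

Sorting whole suffixes is quadratic or worse in the text length, which is fine for a 13-character example but not for real indexing.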
Dagstuhl, May 2006 - Jens Stoye Slide 3
Overview
1. Classify strings sharing same suffix array
2. Counting strings sharing same suffix array
3. Counting suffix arrays, yielding a lower bound for suffix array compression
4. Summation identities
Dagstuhl, May 2006 - Jens Stoye Slide 4
1. Classify Strings for Suffix Array
t - string of length n,
P - permutation of {1,...,n},
R - inverse of P.

Theorem:
P is the suffix array of t if and only if for all i ∈ {1,...,n-1}
a) t[P[i]] ≤ t[P[i+1]] and
b) t[P[i]] = t[P[i+1]] ⇒ R[P[i]+1] < R[P[i+1]+1],
where, given a), condition b) is the same as
b) R[P[i]+1] > R[P[i+1]+1] ⇒ t[P[i]] < t[P[i+1]].

Convention assumed here: R[n+1] = 0, i.e. the empty suffix ranks before all others.
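The theorem can be tested exhaustively on short strings. The sketch below is an illustration, not code from the talk: the names suffix_array, satisfies_theorem and theorem_agrees are hypothetical, and the convention R[n+1] = 0 (empty suffix ranks first) is an assumption that the slide does not state explicitly.

```python
from itertools import permutations, product


def suffix_array(t):
    """1-based start positions of the suffixes of t in sorted order."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def satisfies_theorem(t, P):
    """Check conditions a) and b) for every pair of adjacent entries of P."""
    n = len(t)
    R = {p: rank for rank, p in enumerate(P, start=1)}  # inverse of P
    R[n + 1] = 0  # assumed convention: the empty suffix ranks first
    for i in range(n - 1):
        x, y = P[i], P[i + 1]
        if t[x - 1] > t[y - 1]:                           # violates a)
            return False
        if t[x - 1] == t[y - 1] and R[x + 1] > R[y + 1]:  # violates b)
            return False
    return True


def theorem_agrees(n, alphabet):
    """True if, for every string, exactly the real suffix array passes."""
    for chars in product(alphabet, repeat=n):
        t = "".join(chars)
        sa = suffix_array(t)
        for perm in permutations(range(1, n + 1)):
            if satisfies_theorem(t, list(perm)) != (list(perm) == sa):
                return False
    return True


print(theorem_agrees(4, "ABC"))
```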
Dagstuhl, May 2006 - Jens Stoye Slide 5
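1. Classify Strings for Suffix Array
Text to be indexed:

position   1  2  3  4  5
t          A  A  B  C  B

i          1      2      3      4      5
P[i]       1      2      5      3      4
suffix     AABCB  ABCB   B      BCB    CB
t[P[i]]    A      A      B      B      C

a) t[P[i]] ≤ t[P[i+1]] and
b) R[P[i]+1] > R[P[i+1]+1] ⇒ t[P[i]] < t[P[i+1]]

A position i with R[P[i]+1] > R[P[i+1]+1] is an R+-descent. In this example the R+-descents are at i = 2 and i = 4 (taking R[n+1] = 0).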
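The R+-descents of the example permutation can be listed mechanically. This is a minimal sketch; the function name r_plus_descents is hypothetical, and R[n+1] = 0 is an assumed convention for the rank of the empty suffix.

```python
def r_plus_descents(P):
    """Return the 1-based positions i with R[P[i]+1] > R[P[i+1]+1]."""
    n = len(P)
    R = {p: rank for rank, p in enumerate(P, start=1)}  # inverse of P
    R[n + 1] = 0  # assumed convention: the empty suffix ranks first
    return [i for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1]]


print(r_plus_descents([1, 2, 5, 3, 4]))  # [2, 4]
```

At exactly these positions the character must strictly increase, which is why the example string needs at least three distinct characters.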
Dagstuhl, May 2006 - Jens Stoye Slide 6
1. Classify Strings for Suffix Array
Equivalences between strings

Text to be indexed:

position   1  2  3  4  5
t          A  A  B  C  B
t2         A  A  C  D  C    (order-equivalent to t)
t3         A  B  D  E  D    (order-distinct from t)

All three strings share the same suffix array P:

i             1      2      3      4      5
P[i]          1      2      5      3      4
suffix of t   AABCB  ABCB   B      BCB    CB
suffix of t2  AACDC  ACDC   C      CDC    DC
suffix of t3  ABDED  BDED   D      DED    ED
Dagstuhl, May 2006 - Jens Stoye Slide 7
2. Counting Strings for Suffix Array
Text to be indexed: t = AABCB (base string), t2 = AACDC, t3 = ABDED

i          1   2   3   4   5
P[i]       1   2   5   3   4

t[P[i]]    A   A   B   B   C
offset    +0  +0  +1  +1  +1
t2[P[i]]   A   A   C   C   D

t[P[i]]    A   A   B   B   C
offset    +0  +1  +2  +2  +2
t3[P[i]]   A   B   D   D   E

Read in suffix array order, each string is the base string plus a non-decreasing sequence of offsets.
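The offset sequences in this example can be recomputed directly. The sketch below is illustrative only; the helper name offsets is made up, and characters are compared through their ASCII codes, which assumes the alphabet is A, B, C, ... in that order.

```python
def offsets(base, other, P):
    """Character offsets of `other` relative to `base`, in suffix array order."""
    return [ord(other[p - 1]) - ord(base[p - 1]) for p in P]


def is_non_decreasing(seq):
    return all(x <= y for x, y in zip(seq, seq[1:]))


P = [1, 2, 5, 3, 4]
for other in ("AACDC", "ABDED"):
    seq = offsets("AABCB", other, P)
    print(other, seq, is_non_decreasing(seq))
```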
Dagstuhl, May 2006 - Jens Stoye Slide 8
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents.
Number of strings over an alphabet of size a for P
= number of non-decreasing sequences of length n over a-d elements
\[ = \binom{n+a-d-1}{a-d-1} \]
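For small inputs the binomial formula can be compared with a brute-force count over all strings. This is a sketch under stated assumptions, not the authors' proof: all function names are hypothetical, characters are modelled as the integers 0..a-1, R[n+1] = 0 is assumed, and the count is taken to be 0 when a ≤ d.

```python
from itertools import product
from math import comb


def suffix_array(t):
    """1-based start positions of the suffixes of t in sorted order."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def count_r_plus_descents(P):
    n = len(P)
    R = {p: rank for rank, p in enumerate(P, start=1)}
    R[n + 1] = 0  # assumed convention: the empty suffix ranks first
    return sum(R[P[i] + 1] > R[P[i + 1] + 1] for i in range(n - 1))


def brute_force_count(P, a):
    """Count strings over {0,...,a-1} whose suffix array is P."""
    return sum(1 for t in product(range(a), repeat=len(P))
               if suffix_array(t) == P)


def formula_count(P, a):
    """Binomial formula from the slide; taken as 0 when a <= d."""
    n, d = len(P), count_r_plus_descents(P)
    return comb(n + a - d - 1, a - d - 1) if a > d else 0


P = [1, 2, 5, 3, 4]
for a in range(1, 6):
    print(a, brute_force_count(P, a), formula_count(P, a))
```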
Dagstuhl, May 2006 - Jens Stoye Slide 9
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents.
Number of strings composed of exactly k distinct characters for P is
\[ \binom{n-d-1}{k-d-1} \]
Dagstuhl, May 2006 - Jens Stoye Slide 10
2. Counting Strings for Suffix Array
Number of strings over an alphabet of size 20 for suffix arrays of length n with 10 R+-descents:

 n   Strings composed of up to 20 characters   Strings composed of all 20 characters
 5                                     2,002                                       0
10                                    92,378                                       0
15                                 1,307,504                                       0
20                                10,015,005                                       1
25                                52,451,256                                   2,002
30                               211,915,132                                  92,378
35                               708,930,508                               1,307,504
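The table entries follow from the two binomial formulas on the preceding slides with a = k = 20 and d = 10. The sketch below recomputes them; the function names are made up for illustration, and negative binomial arguments are treated as giving 0, which is an assumption about how the formula is meant to be read for small n.

```python
from math import comb


def up_to_a_characters(n, a, d):
    """binom(n+a-d-1, a-d-1): strings using up to a characters."""
    return comb(n + a - d - 1, a - d - 1) if a > d else 0


def exactly_k_characters(n, k, d):
    """binom(n-d-1, k-d-1): strings using all k characters."""
    if n - d - 1 < 0 or k - d - 1 < 0:
        return 0  # assumed reading of the formula outside its natural range
    return comb(n - d - 1, k - d - 1)


for n in range(5, 40, 5):
    print(n, up_to_a_characters(n, 20, 10), exactly_k_characters(n, 20, 10))
```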
Dagstuhl, May 2006 - Jens Stoye Slide 11
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents.
Number of order-distinct strings over an alphabet of size a is
\[ \sum_{k=d+1}^{a} \binom{n-d-1}{k-d-1} \]
Number of order-distinct strings where all k distinct characters must appear is
\[ \binom{n-d-1}{k-d-1} \]
Dagstuhl, May 2006 - Jens Stoye Slide 12
3. Counting Suffix Arrays
Definition:
Let P be a permutation of {1,...,n}.
Position i ∈ {1,...,n-1} is a permutation descent if P[i] > P[i+1].
Definition:
The Eulerian number \( \left\langle{n \atop d}\right\rangle \) gives the number of permutations of {1,...,n} with exactly d permutation descents.
Dagstuhl, May 2006 - Jens Stoye Slide 13
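3. Counting Suffix Arrays
Well-known fact:
Recursive enumeration of Eulerian numbers
a) \( \left\langle{n \atop 0}\right\rangle = 1 \),
b) \( \left\langle{n \atop d}\right\rangle = 0 \) for n ≤ d, and
c) \( \left\langle{n \atop d}\right\rangle = (d+1) \left\langle{n-1 \atop d}\right\rangle + (n-d) \left\langle{n-1 \atop d-1}\right\rangle \)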
Dagstuhl, May 2006 - Jens Stoye Slide 14
3. Counting Suffix Arrays
Definition:
Let A(n,d) be the number of permutations of length n with d R+-descents.
Observation:
a) A(n,0) = 1
b) A(n,d) = 0 for n ≤ d
c) see next
Dagstuhl, May 2006 - Jens Stoye Slide 15
3. Counting Suffix Arrays
Text to be indexed: t = AABCB

i          1       2       3       4       5
Pt[i]      1       2       5       3       4
suffix     AABCB   ABCB    B       BCB     CB

Prepending A: At = AAABCB

i          1       2       3       4       5       6
PAt[i]     1       2       3       6       4       5
suffix     AAABCB  AABCB   ABCB    B       BCB     CB

(d+1) possible positions without additional R+-descent
Dagstuhl, May 2006 - Jens Stoye Slide 16
3. Counting Suffix Arrays
Text to be indexed: t = AABCB

i          1       2       3       4       5
Pt[i]      1       2       5       3       4
suffix     AABCB   ABCB    B       BCB     CB

Prepending B: Bt = BAABCB

i          1       2       3       4       5       6
PBt[i]     2       3       6       1       4       5
suffix     AABCB   ABCB    B       BAABCB  BCB     CB

(d+1) possible positions without additional R+-descent
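The effect of inserting the new, longest suffix at each rank can be tried out on the example. The sketch below is an illustration with hypothetical helper names: the old suffix array is shifted by one position, the new suffix 1 is inserted at every possible slot, and the R+-descents are counted with the assumed convention R[n+1] = 0.

```python
def count_r_plus_descents(P):
    n = len(P)
    R = {p: rank for rank, p in enumerate(P, start=1)}
    R[n + 1] = 0  # assumed convention: the empty suffix ranks first
    return sum(R[P[i] + 1] > R[P[i + 1] + 1] for i in range(n - 1))


def insert_new_first_suffix(P, slot):
    """Shift all old start positions by one and put suffix 1 at `slot`."""
    shifted = [p + 1 for p in P]
    return shifted[:slot] + [1] + shifted[slot:]


P_t = [1, 2, 5, 3, 4]  # suffix array of t = AABCB, d = 2
for slot in range(len(P_t) + 1):
    extended = insert_new_first_suffix(P_t, slot)
    print(slot + 1, extended, count_r_plus_descents(extended))
```

For this example, three of the six slots keep d = 2 (they correspond to prepending A, B or C), and the other three slots raise the count to 3.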
Dagstuhl, May 2006 - Jens Stoye Slide 17
3. Counting Suffix Arrays
Together:
a) A(n,0) = 1,
b) A(n,d) = 0 for n ≤ d, and
c) A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1)
Theorem:
The number A(n,d) of permutations of length n with d R+-descents is the Eulerian number \( \left\langle{n \atop d}\right\rangle \).
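The theorem can be checked by brute force for small n: enumerate all permutations, count their R+-descents, and compare the histogram with the recurrence. This is a sketch for illustration only; the function names are hypothetical and R[n+1] = 0 is an assumed convention.

```python
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number via rules a) to c), for n >= 1."""
    if d < 0 or n <= d:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def count_r_plus_descents(P):
    n = len(P)
    R = {p: rank for rank, p in enumerate(P, start=1)}
    R[n + 1] = 0  # assumed convention: the empty suffix ranks first
    return sum(R[P[i] + 1] > R[P[i + 1] + 1] for i in range(n - 1))


def descent_histogram(n):
    """hist[d] = number of permutations of {1..n} with d R+-descents."""
    hist = [0] * n
    for P in permutations(range(1, n + 1)):
        hist[count_r_plus_descents(P)] += 1
    return hist


for n in range(1, 7):
    print(n, descent_histogram(n), [eulerian(n, d) for d in range(n)])
```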
Dagstuhl, May 2006 - Jens Stoye Slide 18
3. Counting Suffix Arrays
The number of distinct suffix arrays of length n for strings over an alphabet of size a:
\[ \sum_{d=0}^{a-1} \left\langle{n \atop d}\right\rangle \]
Lower bound for compressibility of suffix arrays in the Kolmogorov sense:
\[ \log \sum_{d=0}^{a-1} \left\langle{n \atop d}\right\rangle \]
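Both expressions are easy to evaluate, and the count can be confirmed by enumerating all strings for small n and a. The sketch below uses hypothetical function names; taking the logarithm to base 2 (so the bound reads as bits) is an assumption, since the slide does not name a base.

```python
from functools import lru_cache
from itertools import product
from math import log2


@lru_cache(maxsize=None)
def eulerian(n, d):
    if d < 0 or n <= d:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def suffix_array_count(n, a):
    """Sum of Eulerian numbers for d = 0 .. a-1."""
    return sum(eulerian(n, d) for d in range(a))


def brute_force_suffix_array_count(n, a):
    """Number of distinct suffix arrays over all a**n strings."""
    seen = set()
    for t in product(range(a), repeat=n):
        seen.add(tuple(sorted(range(n), key=lambda i: t[i:])))
    return len(seen)


for n in (4, 6, 8, 10, 12):
    count = suffix_array_count(n, 4)
    print(n, count, round(log2(count), 1))  # count and assumed base-2 bound
```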
Dagstuhl, May 2006 - Jens Stoye Slide 19
3. Counting Suffix Arrays
Number of distinct suffix arrays of length n for strings over an alphabet of size 20:

 n   String count (20^n)   Suffix array count
 4               160,000                   24
 6            6.4 × 10^7                  720
 8           2.6 × 10^10               40,320
10           1.0 × 10^13           3.6 × 10^6
12           4.1 × 10^15           4.8 × 10^8
14           1.6 × 10^18          8.7 × 10^10
16           6.6 × 10^20          2.1 × 10^13
18           2.6 × 10^23          6.4 × 10^15
Dagstuhl, May 2006 - Jens Stoye Slide 20
3. Counting Suffix Arrays
Number of distinct suffix arrays of length n for strings over an alphabet of size 4:

 n   String count (4^n)   Suffix array count
 4                  256                   24
 6                4,096                  662
 8               65,536               20,160
10            1,048,576              504,046
12           16,777,216           10,670,040
14          268,435,456          202,964,470
16        4,294,967,296        3,614,083,520
18       68,719,476,736       61,786,015,150
Dagstuhl, May 2006 - Jens Stoye Slide 21
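4. Summation Identities
Worpitzky's identity by summing up the number of strings of length n for each suffix array:
\[ a^n = \sum_{d=0}^{a-1} \left\langle{n \atop d}\right\rangle \binom{n+a-d-1}{a-d-1} = \sum_{i} \left\langle{n \atop i}\right\rangle \binom{a+i}{n} \]
Summation rule for Eulerian numbers to generate the Stirling numbers of the second kind:
\[ k! \left\{ {n \atop k} \right\} = \sum_{d=0}^{k-1} \left\langle{n \atop d}\right\rangle \binom{n-d-1}{k-d-1} = \sum_{i} \left\langle{n \atop i}\right\rangle \binom{i}{n-k} \]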
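Both identities can be spot-checked numerically in their d-indexed form. This is only a sanity check on small values, not a proof; the function names are hypothetical, and the Stirling numbers are computed with the standard recurrence S(n,k) = k S(n-1,k) + S(n-1,k-1).

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def eulerian(n, d):
    if d < 0 or n <= d:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def worpitzky_sum(n, a):
    """Sum over d < a of (suffix arrays with d descents) * (strings each)."""
    return sum(eulerian(n, d) * comb(n + a - d - 1, a - d - 1)
               for d in range(a))


def stirling_sum(n, k):
    """Same sum, restricted to strings using exactly k characters (k <= n)."""
    return sum(eulerian(n, d) * comb(n - d - 1, k - d - 1)
               for d in range(k))


print(all(worpitzky_sum(n, a) == a ** n
          for n in range(1, 8) for a in range(1, 8)))
print(all(stirling_sum(n, k) == factorial(k) * stirling2(n, k)
          for n in range(1, 8) for k in range(1, n + 1)))
```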
Dagstuhl, May 2006 - Jens Stoye Slide 22
Summary
Constructive proofs to count strings sharing the same suffix array
Constructive proof to count distinct suffix arrays yielding lower bound for suffix array compression
Constructive proofs for Worpitzky's identity and for the summation rule of Eulerian numbers that yields the Stirling numbers of the second kind
Dagstuhl, May 2006 - Jens Stoye Slide 23
Outlook
Efficient enumeration algorithm for suffix arrays
Compressed suffix arrays for fast querying in bioinformatics applications
Average case analysis under non-uniform model