Your Small Step, My Giant Leap. Jen-Wei Huang (黃仁暐), [email protected], National Taiwan...
112/04/21 Jen-Wei Huang 2
* http://www.wretch.cc/blog/EtudeBIKE
* http://www.giant-bicycles.com/zh-TW/
* http://cape7.pixnet.net/blog
* http://www.wretch.cc/blog/orzboyz
* http://blog.sina.com.tw/9winds/
* http://atomcinema.pixnet.net/blog
* http://www.amazon.com
* http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html
A General Model for Sequential Pattern Mining
with a Progressive Database
Jen-Wei Huang, Chi-Yao Tseng,
Jian-Chih Ou and Ming-Syan Chen
National Taiwan University
* IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008
Outline
* Introduction
* Preliminaries
* Algorithm Pisa
* Experiments
* Conclusions
* Q & A
Introduction to SPM
* "Mining of frequently occurring patterns related to time or other sequences." (J. Han, Data Mining: Concepts and Techniques)
* "Given a set of sequences, find the complete set of frequent subsequences." (J. Pei, PrefixSpan)
* Example: which items a customer will buy, given that he or she has already bought certain items.
Time-related data
* Customers' buying behavior
* Natural phenomena
* Sensor network data
* Web access patterns
* Stock price changes
* DNA sequence applications
Definition
* Let I = {x1, x2, ..., xn} be a set of distinct items.
* An element e, denoted by (xi xj ...), is a subset of I whose items appear in a sequence at the same time.
* A sequence s, denoted by <e1, e2, ..., em>, is an ordered list of elements.
* A sequence database Db contains a set of sequences, and |Db| denotes the number of sequences in Db.
Definition
* A sequence α = <a1, a2, ..., an> is a subsequence of another sequence β = <b1, b2, ..., bm> if there exists a set of integers 1 ≤ i1 < i2 < ... < in ≤ m such that a1 ⊆ b_i1, a2 ⊆ b_i2, ..., and an ⊆ b_in.
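The subsequence definition can be checked mechanically. The following is a minimal sketch, assuming sequences are represented as Python lists of item sets; the function name `is_subsequence` and the sample data are illustrative and do not come from the slides.

```python
def is_subsequence(alpha, beta):
    """Return True if sequence alpha is a subsequence of sequence beta.

    Both sequences are lists of elements; each element is a set of items.
    """
    j = 0  # position in beta from which the next element may be matched
    for a in alpha:
        # Greedily take the earliest remaining element of beta that contains a.
        while j < len(beta) and not set(a) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


beta = [{"A"}, {"B", "C"}, {"A", "D"}]
print(is_subsequence([{"A"}, {"D"}], beta))  # True: A at position 1, D at position 3
print(is_subsequence([{"B"}, {"C"}], beta))  # False: B and C only occur at the same time
```

Matching greedily at the earliest possible position is sufficient here, because taking an earlier match never removes a later option.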
Definition
* Sequential pattern mining can be defined as: "Given a sequence database Db and a user-defined minimum support min_sup, find the complete set of subsequences whose occurrence frequencies are ≥ min_sup × |Db|."
Three Categories
* Depending on how the underlying database is managed, sequential pattern mining can be divided into three categories, namely sequential pattern mining with
  * a static database
  * an incremental database
  * a progressive database
How To Do Sequential Pattern Mining on a Static Database
An Overview
2006/03/24jwhuang National Taiwan
University 24
How?
* Apriori-like algorithms
  * AprioriAll, by Agrawal et al.
  * GSP, by R. Srikant et al.
* Partition-based algorithms
  * FreeSpan, by J. Han et al.
  * PrefixSpan, by J. Pei et al.
* Vertical format algorithms
  * SPADE, by Zaki et al.
  * SPAM, by Ayres et al.
Apriori-like Algorithms
1. Sort phase
  * Sort the database with the customer ID as the primary key and the time as the secondary key.
2. Litemset (large itemset) phase
  * Count the frequency of each itemset, i.e., the fraction of customers who bought the itemset.
Apriori-like Algorithms
3. Transformation phase
  * Transform each transaction into the set of all litemsets it contains, giving customer sequences of the form
    C01: <(1,5) (2) (3) (4)>
    C02: <(1) (3) (4) (3,5)>
    C03: <(1) (2) (3) (4)>
    C04: <(1) (3) (5)>
    C05: <(4) (5)>
CID  Items
2    10 20
5    90
2    30
2    40 60 70
4    30
3    30 50 70
1    30
1    90
4    40 70
4    90
3    10
5    10
1    40 70
5    20
2    90
3    20

CID  Items
1    30 90 {40 70}
2    {10 20} 30 {40 60 70} 90
3    {30 50 70} 10 20
4    30 {40 70} 90
5    90 10 20

Itemset      #
10           3
20           3
30           4
40           3
50           1
60           1
70           4
90           4
{10 20}      1
{40 60}      1
{40 70}      3
{60 70}      1
{40 60 70}   1
{30 50}      1
{30 70}      1
{50 70}      1
{30 50 70}   1
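The litemset phase on this example can be sketched in a few lines. This is a minimal illustration, not the AprioriAll implementation: it enumerates every item subset of every transaction by brute force, and the threshold of 2 customers is an assumption inferred from the patterns that the example later reports as frequent. The names `itemset_supports` and `min_count` are hypothetical.

```python
from itertools import combinations

# Customer sequences of the example (CID -> ordered list of itemsets).
db = {
    1: [{30}, {90}, {40, 70}],
    2: [{10, 20}, {30}, {40, 60, 70}, {90}],
    3: [{30, 50, 70}, {10}, {20}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}, {10}, {20}],
}


def itemset_supports(db):
    """Count, for every itemset, how many customers bought it at least once."""
    counts = {}
    for sequence in db.values():
        seen = set()  # a customer contributes at most once per itemset
        for element in sequence:
            items = sorted(element)
            for r in range(1, len(items) + 1):
                seen.update(combinations(items, r))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return counts


counts = itemset_supports(db)
min_count = 2  # assumed threshold, not stated on the slide
litemsets = sorted((s for s, c in counts.items() if c >= min_count),
                   key=lambda s: (len(s), s))
for new_id, itemset in enumerate(litemsets, start=1):
    print(itemset, counts[itemset], new_id)
```

Numbering the surviving itemsets in this order gives the same kind of mapping that the transformation phase uses to rewrite the customer sequences.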
Itemset   #  New
10        3  1
20        3  2
30        4  3
40        3  4
70        4  5
90        4  6
{40 70}   3  7

CID  Items
1    3 6 {4, 5, 7}
2    {1, 2} 3 {4, 5, 7} 6
3    {3, 5} 1 2
4    3 {4, 5, 7} 6
5    6 1 2
Apriori-like Algorithms
4. Mining phase
  * Apply an Apriori-like algorithm to the transformed sequences.
5. Maximal phase
  * Find the maximal patterns.
Itemset  #
1 2: 2   1 3: 1   1 4: 1   1 5: 1   1 6: 1   1 7: 1
2 1: 0   2 3: 1   2 4: 1   2 5: 1   2 6: 1   2 7: 1
3 1: 1   3 2: 1   3 4: 3   3 5: 3   3 6: 3   3 7: 3
4 1: 0   4 2: 0   4 3: 0   4 5: 0   4 6: 2   4 7: 0
5 1: 1   5 2: 1   5 3: 0   5 4: 0   5 6: 2   5 7: 0
6 1: 1   6 2: 1   6 3: 0   6 4: 1   6 5: 1   6 7: 1
7 1: 0   7 2: 0   7 3: 0   7 4: 0   7 5: 0   7 6: 2
Itemset  #
3 4 6    2
3 5 6    2
3 7 6    2

Therefore, the frequent sequential patterns are:
<1 2> <3 4> <3 5> <3 6> <3 7> <4 6> <5 6> <7 6> <3 4 6> <3 5 6> <3 7 6>

According to the mappings, the original frequent sequential patterns are:
<10 20> <30 40> <30 70> <30 90> <30 {40 70}> <40 90> <70 90> <{40 70} 90> <30 40 90> <30 70 90> <30 {40 70} 90>
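The support counts behind these patterns can be recomputed directly on the original item values. The sketch below is a brute-force check rather than the Apriori-like mining phase itself; the helper names `contains` and `support` are illustrative, and the five sequences are the customer sequences of the example.

```python
# The five customer sequences of the example, in CID order.
db = [
    [{30}, {90}, {40, 70}],
    [{10, 20}, {30}, {40, 60, 70}, {90}],
    [{30, 50, 70}, {10}, {20}],
    [{30}, {40, 70}, {90}],
    [{90}, {10}, {20}],
]


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (elements are item sets)."""
    j = 0
    for p in pattern:
        while j < len(sequence) and not p <= sequence[j]:
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(contains(sequence, pattern) for sequence in db)


reported = [
    [{10}, {20}], [{30}, {40}], [{30}, {70}], [{30}, {90}], [{30}, {40, 70}],
    [{40}, {90}], [{70}, {90}], [{40, 70}, {90}],
    [{30}, {40}, {90}], [{30}, {70}, {90}], [{30}, {40, 70}, {90}],
]
for pattern in reported:
    print(pattern, support(db, pattern))
```

Every pattern in the list reaches a support of at least 2, which is the count the example treats as frequent.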
Because
* <30 40> and <30 70> are contained in <30 {40 70}>,
* <40 90> and <70 90> are contained in <{40 70} 90>,
* <30 40 90> and <30 70 90> are contained in <30 {40 70} 90>,
the final maximal sequential patterns are:
<10 20> <30 90> <30 {40 70}> <{40 70} 90> <30 {40 70} 90>
Related Works
* Static database
  * AprioriAll, by Agrawal et al.
  * GSP, by R. Srikant et al.
  * SPADE, by Zaki et al.
  * FreeSpan, by J. Han et al.
  * PrefixSpan, by J. Pei et al.
  * SPAM, by Ayres et al.
Related Works
* Incremental database
  * ISM, by Parthasarathy et al.
  * IncSP, by Lin et al.
  * ISE, by Masseglia et al.
  * IncSpan, by Cheng et al.
  * MILE, by Chen et al.
Motivation
* The assumption of having a static database may not hold in practice.
  * Real-world data change on the fly.
* Sequential patterns found in an incremental database may be of little interest to users.
  * Users are usually more interested in recent data than in old data.
Motivation
* If a sequence does not receive any newly arriving elements, it still stays in the database and undesirably contributes to |Db|.
  * As a result, new sequential patterns that appear frequently in recent sequences may not be recognized as frequent sequential patterns.
Definition: Period of Interest
* The Period of Interest (abbreviated as POI) is a sliding window whose length is a user-specified time interval and which advances continuously as time goes by.
* Only the sequences having elements whose timestamps fall into the current POI contribute to |Db| for the current sequential patterns.
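To make the role of the POI concrete, the sketch below counts the sequences that contribute to |Db| for a window ending at timestamp q. The data layout (a dictionary of timestamped elements), the function name `db_size_in_poi` and the three sample sequences are assumptions made for illustration only.

```python
def db_size_in_poi(sequences, q, poi):
    """|Db p,q|: number of sequences with at least one element timestamped in [p, q],
    where p = q - poi + 1."""
    p = q - poi + 1
    return sum(
        any(p <= timestamp <= q for timestamp, _ in elements)
        for elements in sequences.values()
    )


# Hypothetical data: sequence ID -> list of (timestamp, element) pairs.
sequences = {
    "S01": [(1, {"A"}), (2, {"B"})],
    "S02": [(1, {"A", "D"})],
    "S03": [(6, {"C"})],
}

print(db_size_in_poi(sequences, q=5, poi=5))  # window t1..t5: S01 and S02 -> 2
print(db_size_in_poi(sequences, q=7, poi=5))  # window t3..t7: only S03 -> 1
```

A sequence such as S01 stops contributing as soon as its last element falls out of the window, which is exactly the behavior a static or incremental database cannot express.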
[Figure: example progressive sequence database with sequences S01 to S06 on a timeline t1 to t10, and the sliding sub-databases Db1,5, Db2,6, Db3,7, Db4,8, Db5,9 and Db6,10. POI = 5, min_supp = 0.5]
Progressive Sequential Pattern
* The progressive sequential pattern mining problem is defined as follows: "Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."
Naïve Algorithm
* Use a conventional static sequential pattern mining algorithm to mine the sequential patterns of every POI separately, e.g., Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, etc.
* For a sequence database whose elements span an interval of n timestamps, the total number of POIs in this interval is n − POI + 1.
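The window count n − POI + 1 can be verified by listing the windows that the naïve approach would have to mine one by one. This is a small sketch; the function name `poi_windows` is illustrative.

```python
def poi_windows(n, poi):
    """All (p, q) windows of length poi inside the timestamps 1..n."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]


windows = poi_windows(n=10, poi=5)
print(len(windows))  # 6, i.e. 10 - 5 + 1
print(windows)       # [(1, 5), (2, 6), (3, 7), (4, 8), (5, 9), (6, 10)]
```

With n = 10 and POI = 5 the windows correspond to Db1,5 through Db6,10; a naïve miner would run a full static mining pass on each of them.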
Prior Work
* The only prior work on progressive databases is GSP+ and MFS+, proposed by Zhang et al. and based on the static algorithms GSP and MFS (also derived by the same authors).
* However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS.
* Moreover, the performance improvement of GSP+ and MFS+ over GSP and MFS is within 15%, as reported by their authors.
Algorithm DirApp
* DirApp stands for Direct Append.
* It consists of two procedures:
  * Progressively Updating, abbreviated as PrUp
  * Immediately Filtering, abbreviated as ImFi
Procedure PrUp
* While progressively reading newly incoming elements, Procedure PrUp can
  * update each sequence in the sequence database
  * generate candidate sequential patterns
  * calculate the occurrence frequencies of all candidate sequential patterns in the current POI
Procedure ImFi
* DirApp uses Procedure ImFi to
  * filter out obsolete data from the existing sequence database
  * prune away obsolete candidate sequential patterns from the candidate set
  * report the most up-to-date frequent sequential patterns to the user in every POI
[Figure: the example sequence database (S01 to S06 over t1 to t10), with the elements of S01 (A, B, C, (AD), B) laid out on the timeline]
Example
Candidate sets of S01, whose elements A, B, C, (AD), B arrive at t1, t2, t4, t5, t7. The number after each candidate is its starting timestamp.
(1) Db1,1: A1
(2) Db1,2: A1 B2 AB1
(3) Db1,3: A1 B2 AB1
(4) Db1,4: A1 B2 AB1 C4 AC1 BC2 ABC1
(5) Db1,5: A5 B2 AB1 C4 AC1 BC2 ABC1 D5 (AD)5 AD1 A(AD)1 BA2 BD2 B(AD)2 ABD1 AB(AD)1 CA4 CD4 C(AD)4 ACD1 AC(AD)1 BCA2 BCD2 BC(AD)2 ABCD1 ABC(AD)1
(6) Db2,6: A5 B2 C4 BC2 D5 (AD)5 BA2 BD2 B(AD)2 CA4 CD4 C(AD)4 BCA2 BCD2 BC(AD)2
(7) Db3,7: A5 C4 D5 (AD)5 CA4 CD4 C(AD)4 B7 AB5 CB4 DB5 (AD)B5 CAB4 CDB4 C(AD)B4
…
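The candidate sets of this example can be reproduced with a short sketch of the two DirApp procedures applied to a single sequence. This is an illustration written from the behavior visible in the example, not the authors' code: the names `candidate_elements`, `update_candidates` and `fmt` are hypothetical, and the rule of skipping an arriving element that already occurs in a candidate is inferred from the listed candidate sets.

```python
from itertools import combinations


def candidate_elements(element):
    """All non-empty item subsets of an element, e.g. (AD) -> A, D, (AD)."""
    items = sorted(element)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def update_candidates(cands, element, t, poi):
    """One step for a single sequence at timestamp t.

    cands maps a pattern (tuple of frozensets) to its starting timestamp;
    element is the set of items arriving at t (empty if nothing arrives).
    """
    oldest = t - poi + 1
    # Filtering (ImFi-like): drop candidates that start before the current POI.
    kept = {p: ts for p, ts in cands.items() if ts >= oldest}
    new = dict(kept)
    subsets = candidate_elements(element)
    # Updating (PrUp-like): append each arriving subset to every kept candidate.
    for p, ts in kept.items():
        for s in subsets:
            if s in p:  # skip elements already present in the pattern
                continue
            q = p + (s,)
            new[q] = max(ts, new.get(q, ts))  # keep the latest starting time
    for s in subsets:
        new[(s,)] = t  # the arriving subsets themselves start at t
    return new


def fmt(pattern):
    """Render a pattern in the notation of the slides, e.g. C(AD)B."""
    parts = ["".join(sorted(e)) for e in pattern]
    return "".join(p if len(p) == 1 else f"({p})" for p in parts)


arrivals = {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"}  # S01 in the example
cands = {}
for t in range(1, 8):
    cands = update_candidates(cands, set(arrivals.get(t, "")), t, poi=5)
print(sorted(f"{fmt(p)}{ts}" for p, ts in cands.items()))
```

After timestamp t7 the dictionary holds the 15 candidates listed for Db3,7. A full DirApp would keep one such candidate set per sequence and merge them to count occurrences, which is where the extra working space comes from.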
Candidate sets at Db1,2, per sequence:
S01: A1 B2 AB1
S02: A1 D1 (AD)1 B2 AB1 DB1 (AD)B1
S03: A1 B2 C2 (BC)2 AB1 AC1 A(BC)1
S04: D2

Db1,2 (4 sequences), candidate sequential patterns and occurrence counts:
AB1      3
A(BC)1   1
AC1      1
(AD)B1   1
DB1      1
Frequent sequential pattern: AB1 (3)
[Tables: merged candidate sets with occurrence counts for each POI. The number in parentheses after each sub-database is its number of sequences. Frequent sequential patterns reported at each step:]
(2) Db1,2 (4): AB1 (3)
(3) Db1,3 (5): AB1 (3)
(4) Db1,4 (5): AB1 (3)
(5) Db1,5 (5): AB1 (3), DA3 (3), BA4 (3)
(6) Db2,6 (5): BA4 (4), CA4 (3)
(7) Db3,7 (5): CA4 (3)
(8) Db4,8 (6): AC6 (4)
(9) Db5,9 (5): AC8 (5)
The Advantages of DirApp
* DirApp needs only one scan of the newly arriving elements and the candidate set at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
* DirApp can
  * maintain the latest data sequences
  * find the complete set of up-to-date sequential patterns
  * delete obsolete data and patterns rapidly
The Disadvantages of DirApp
* DirApp needs a lot of working space to store the candidate sets of all sequences.
* Scanning all candidate sets takes considerable execution time.
* DirApp needs an additional data structure to calculate the occurrence frequencies of all candidate sequential patterns.
Algorithm Pisa
* Pisa stands for Progressive mIning of Sequential pAtterns.
* Pisa uses a Progressive Sequential tree (abbreviated as PS-tree) to
  * maintain the information of all sequences in each POI
  * update each sequence
  * find up-to-date sequential patterns
PS-tree
* The nodes in a PS-tree are of two types:
  * Root node
  * Common nodes
* Each common node stores two pieces of information:
  * Node label: an element in a sequence
  * Sequence list: the IDs of the sequences containing this element, each marked by its corresponding timestamp
[Figure: a Root node and a common node, showing the node's Label and its sequence list of Sequence ID and Timestamp entries]
PS-tree
* Whenever a series of elements appears in the same sequence, the PS-tree holds a series of nodes, one labeled by each element, all with the same sequence ID in their sequence lists.
  * The first node, representing the first element, is connected to the Root node.
  * The other nodes are connected to the first node analogously.
[Figure: the example sequence database next to a PS-tree in which a chain of nodes labeled A, B and C hangs from Root, each node holding sequence ID 01 with timestamp 1 in its sequence list]
PS-tree
* The path from the Root node to any other node represents a candidate sequential pattern appearing in this sequence.
* The timestamp at which each candidate sequential pattern appears is marked in the node labeled by its last element.
[Figure: PS-tree holding the candidate sequential patterns of sequence 01 (A, B, C, AB, AC, BC and ABC), each node carrying sequence ID 01 and the timestamp of its pattern]
Algorithm Pisa
* When receiving elements at timestamp t+1, Pisa traverses the PS-tree of timestamp t in post-order to
  * delete the obsolete elements from it
  * update the current sequences in it
  * insert the newly arriving elements into it
* In doing so, Pisa transforms it into the PS-tree of timestamp t+1.
For a common node
* Pisa deletes the obsolete sequences from the sequence list of this node.
  * If no sequence ID is left in the sequence list, Pisa prunes this node away from its parent.
* Pisa checks the sequence IDs left in the sequence list to see whether any of these sequences has a newly arriving element.
  * If there is no newly arriving element, Pisa goes on to the next node.
For a common node
* Otherwise, Pisa generates all combinations of candidate elements from the arriving element.
  * Example: ABC -> A, B, C, AB, AC, BC, ABC
* For each candidate element that does not already exist on the path from Root to the current node:
  * If there is a child with the same label, Pisa updates the timestamp of this sequence in that child to the timestamp of the same sequence in the parent's sequence list.
  * Otherwise, Pisa creates a new child for this element, with the sequence ID and the timestamp of the same sequence in the parent's sequence list.
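Generating all combinations of candidate elements from an arriving element amounts to enumerating its non-empty item subsets. A minimal sketch, assuming an element is given as a string of single-letter items; the function name `candidate_elements` is illustrative.

```python
from itertools import combinations


def candidate_elements(element):
    """All non-empty combinations of the items in an element, shortest first."""
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


print(candidate_elements("ABC"))  # ['A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC']
```

An element with k items yields 2^k − 1 candidate elements, so wide elements make each update step noticeably more expensive.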
For Root node
* Instead of checking a sequence list, Pisa examines all sequences that have newly arriving elements.
* After Pisa generates all combinations of candidate elements, for each of them:
  * If there is a child with the same label, Pisa updates the timestamp of this sequence to t+1.
  * Otherwise, Pisa creates a new child for this element, with the sequence ID and timestamp t+1.
Algorithm Pisa
* After Pisa processes a common node, if the number of sequence IDs in its sequence list is at least min_supp × |Dbp,q|, the path from Root to this node is output as a frequent sequential pattern.
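The node rules described on the preceding slides can be put together in a compact sketch of a PS-tree update. This is an illustrative reading of the slides, not the published implementation. Two simplifications are assumptions of this sketch: frequent patterns are collected in a separate pass after the update instead of during the traversal, and |Dbp,q| is taken as the number of distinct sequence IDs found in the children of Root. The names `Node`, `advance`, `patterns` and `frequent` are hypothetical.

```python
from itertools import combinations


class Node:
    def __init__(self, label=None):
        self.label = label   # frozenset of items; None for Root
        self.seqs = {}       # sequence list: sequence ID -> timestamp
        self.children = {}   # child label -> Node


def candidate_elements(element):
    items = sorted(element)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def _visit(node, path, arrivals, t, oldest):
    # Post-order: handle the existing children before the node itself,
    # so that children created in this step are not extended again.
    for label, child in list(node.children.items()):
        _visit(child, path + (label,), arrivals, t, oldest)
        if not child.seqs:
            del node.children[label]           # prune emptied nodes
    if node.label is None:
        starts = {sid: t for sid in arrivals}  # Root: patterns start now
    else:
        node.seqs = {s: ts for s, ts in node.seqs.items() if ts >= oldest}
        starts = {s: ts for s, ts in node.seqs.items() if s in arrivals}
    for sid, ts in starts.items():
        for cand in candidate_elements(arrivals[sid]):
            if cand in path:                   # already on the path from Root
                continue
            if cand not in node.children:
                node.children[cand] = Node(cand)
            node.children[cand].seqs[sid] = ts


def advance(root, arrivals, t, poi):
    """Turn the PS-tree of timestamp t-1 into the PS-tree of timestamp t."""
    _visit(root, (), arrivals, t, t - poi + 1)


def patterns(root):
    """Map every Root-to-node path to that node's sequence list."""
    out, stack = {}, [((), root)]
    while stack:
        path, node = stack.pop()
        for label, child in node.children.items():
            out[path + (label,)] = child.seqs
            stack.append((path + (label,), child))
    return out


def frequent(root, min_supp):
    found = patterns(root)
    db_size = len({sid for p, seqs in found.items() if len(p) == 1 for sid in seqs})
    return {p: len(s) for p, s in found.items() if len(s) >= min_supp * db_size}


# First two timestamps of the example: S01..S03 arrive at t1, S04 joins at t2.
root = Node()
advance(root, {"S01": {"A"}, "S02": {"A", "D"}, "S03": {"A"}}, t=1, poi=5)
advance(root, {"S01": {"B"}, "S02": {"B"}, "S03": {"B", "C"}, "S04": {"D"}}, t=2, poi=5)
result = frequent(root, min_supp=0.5)
print(result[(frozenset("A"), frozenset("B"))])  # 3, the pattern AB of Db1,2
```

Unlike the per-sequence candidate sets of DirApp, identical patterns from different sequences share one path here, so counting a pattern is just reading the length of one sequence list. This sketch also reports frequent single elements, which the example tables do not list.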
[Figures: PS-tree snapshots for the example sequence database (S01 to S06), one per timestamp from t1 to t10, with POI = 5 and min_supp = 0.5. Each node shows its Label and a sequence list of Sequence ID and Timestamp entries. The number in parentheses after each sub-database is its number of sequences. Frequent sequential patterns marked in each snapshot:]
Db1,1 (3)
Db1,2 (4): AB1 (3)
Db1,3 (5): AB1 (3)
Db1,4 (5): AB1 (3)
Db1,5 (5): AB1 (3), BA4 (3), DA3 (3)
Db2,6 (5): CA4 (3), BA4 (4)
Db3,7 (5): CA4 (3)
Db4,8 (6): AC6 (3)
Db5,9 (5): AC8 (4)
Db6,10 (5): CD8 (4)
The Advantages of Pisa
* Pisa needs only one scan of the newly arriving elements and the PS-tree at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
* Pisa can
  * maintain the latest data sequences
  * find the complete set of up-to-date sequential patterns
  * delete obsolete data and patterns rapidly
The Advantages of Pisa
* Each path from Root to any other node of the PS-tree forms a unique candidate sequential pattern. Pisa therefore merges identical candidate patterns, and patterns do not have to store their prefix elements separately.
  * The PS-tree consumes less space.
  * Handling identical sequential patterns together is also efficient in execution time.
* FastPisa is a faster variant that gives approximate results.
Experiments
* Comparative algorithms
  * GSP+: re-mining version of GSP
  * SPAM+: re-mining version of SPAM
  * DirApp
* Environment
  * Pentium 4 3 GHz CPU and 2 GB RAM
  * Coded in C++
Experiments
* The synthetic datasets are generated in a way similar to the IBM data generator designed for testing sequential pattern mining algorithms.
Experiments
* We divide the target dataset into n timestamps.
* According to the POI, the first m timestamps (m = POI and m < n) are viewed as the original database, and the remaining transactions in the dataset are received by the system incrementally.
Experiments
* The first run of the experiments mines the first POI, i.e., the first m timestamps of the dataset.
* After that, we shift the POI forward by t timestamps (t << m) for each of the following runs.
Experiments
* The real datasets are from KDDCUP'07.
  * We randomly choose 120 consecutive days for the performance evaluation.
  * One timestamp is set to 3 days in order to obtain sufficient frequent sequential patterns.
  * Therefore, there are 40 timestamps in total, and the POI is set to 10.
  * The new datasets contain more than 5000 sequences and 2000 different items.
[Figure: Cumulative Execution Time]
[Figure: Minimum Support]
[Figure: Length of POI]
[Figure: Number of Sequences]
[Figure: Scalability of Pisa]
[Figure: Real Data Set]
[Figure: Improvement of FastPisa]
[Figure: Information Loss of FastPisa]
Conclusions
* We proposed a progressive algorithm, Pisa, to handle the progressive sequential pattern mining problem without re-mining all sub-databases at each timestamp.
* Pisa needs only one scan of the newly arriving elements and the PS-tree at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
Conclusions
* Pisa can
  * maintain the latest information of sequences
  * find the complete set of up-to-date sequential patterns
  * delete obsolete data and patterns rapidly
* Pisa also
  * consumes less space
  * has high efficiency
  * possesses great scalability
References
* R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. of EDBT, 1996.
* J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, "Sequential pattern mining using a bitmap representation," Proc. of ACM SIGKDD, 2002.
* M. Zhang, B. Kao, D. W.-L. Cheung, and C. L. Yip, "Efficient algorithms for incremental update of frequent sequences," Proc. of PAKDD, 2002.
Thank You! Q & A