One Small Step for You, One Giant Leap for Me (你的一小步，我的一大步)

Jen-Wei Huang (黃仁暐) [email protected]

National Taiwan University

112/04/21 Jen-Wei Huang 2


* http://www.wretch.cc/blog/EtudeBIKE


* http://www.giant-bicycles.com/zh-TW/


* http://cape7.pixnet.net/blog


* http://www.wretch.cc/blog/orzboyz

* http://blog.sina.com.tw/9winds/

* http://atomcinema.pixnet.net/blog


* http://www.amazon.com


* http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html

A General Model for Sequential Pattern Mining with a Progressive Database

Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen

National Taiwan University

* IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008


Outline

* Introduction
* Preliminaries
* Algorithm Pisa
* Experiments
* Conclusions
* Q & A


Introduction to SPM

* “Mining of frequently occurring patterns related to time or other sequences.” (J. Han, Data Mining: Concepts and Techniques)
* “Given a set of sequences, find the complete set of frequent subsequences.” (J. Pei, PrefixSpan)
* Example: which items will a customer buy next, given the items he or she has already bought?


Time-related data

* Customers’ buying behavior
* Natural phenomena
* Sensor network data
* Web access patterns
* Stock price changes
* DNA sequence applications


Definition

* Let I = {x1, x2, ..., xn} be a set of distinct items.
* An element e, denoted by (xi xj ...), is a subset of I whose items appear in a sequence at the same time.
* A sequence s, denoted by < e1, e2, ..., em >, is an ordered list of elements.
* A sequence database Db contains a set of sequences, and |Db| is the number of sequences in Db.


Definition

* A sequence α = < a_1, a_2, ..., a_n > is a subsequence of another sequence β = < b_1, b_2, ..., b_m > if there exist integers 1 ≤ i_1 < i_2 < ... < i_n ≤ m such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, ..., and a_n ⊆ b_{i_n}.


Definition

* Sequential pattern mining can be defined as: "Given a sequence database Db and a user-defined minimum support min_sup, find the complete set of subsequences whose occurrence frequencies are ≥ min_sup ∗ |Db|."
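The subsequence test and the support threshold can be sketched as follows. This is a minimal illustration, not code from the slides: the function names and the toy database `db` are hypothetical, and each element is modeled as a Python set of items.

```python
def is_subsequence(alpha, beta):
    """True if alpha = <a1..an> is a subsequence of beta = <b1..bm>.

    Elements are sets; a_k must be a subset of some b_{i_k}, with the
    indices i_1 < i_2 < ... strictly increasing. Taking the earliest
    possible match for each a_k is sufficient.
    """
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


def support_count(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def is_frequent(pattern, database, min_sup):
    """Occurrence frequency >= min_sup * |Db|."""
    return support_count(pattern, database) >= min_sup * len(database)


# Hypothetical toy database with three sequences.
db = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"a"}, {"c"}],
]
```

With this toy data, `is_frequent([{"a"}, {"c"}], db, 0.5)` checks whether at least half of the sequences contain an element holding `a` followed later by an element holding `c`.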


Three Categories

* Depending on how the corresponding database is managed, sequential pattern mining can be divided into three categories:
  * sequential pattern mining with a static database
  * sequential pattern mining with an incremental database
  * sequential pattern mining with a progressive database

How to Do Sequential Pattern Mining on a Static Database: An Overview

2006/03/24, jwhuang, National Taiwan University

How?

* Apriori-like algorithms
  * AprioriAll, by Agrawal et al.
  * GSP, by R. Srikant et al.
* Partition-based algorithms
  * FreeSpan, by J. Han et al.
  * PrefixSpan, by J. Pei et al.
* Vertical format algorithms
  * SPADE, by Zaki et al.
  * SPAM, by Ayres et al.

Apriori-like Algorithms

1. Sort phase
  * Sort the database with customer ID as the primary key and transaction time as the secondary key.
2. Litemset (large itemset) phase
  * Count the frequency of each itemset, i.e., the fraction of customers who bought the itemset.

Apriori-like Algorithms

3. Transformation phase
  * Transform each transaction into the set of all litemsets it contains, so that each customer sequence takes the form
    C01: <(1,5) (2) (3) (4)>
    C02: <(1) (3) (4) (3,5)>
    C03: <(1) (2) (3) (4)>
    C04: <(1) (3) (5)>
    C05: <(4) (5)>

Example: original transactions

CID  Items
2    10 20
5    90
2    30
2    40 60 70
4    30
3    30 50 70
1    30
1    90
4    40 70
4    90
3    10
5    10
1    40 70
5    20
2    90
3    20

Example: customer sequences after the sort phase

CID  Items
1    30 90 {40 70}
2    {10 20} 30 {40 60 70} 90
3    {30 50 70} 10 20
4    30 {40 70} 90
5    90 10 20

Example: itemset counts from the litemset phase

Itemset      #
10           3
20           3
30           4
40           3
50           1
60           1
70           4
90           4
{10 20}      1
{40 60}      1
{40 70}      3
{60 70}      1
{40 60 70}   1
{30 50}      1
{30 70}      1
{50 70}      1
{30 50 70}   1

Example: litemsets and their new identifiers

Itemset   #   New
10        3   1
20        3   2
30        4   3
40        3   4
70        4   5
90        4   6
{40 70}   3   7

Example: transformed customer sequences

CID  Items
1    3 6 {4, 5, 7}
2    {1, 2} 3 {4, 5, 7} 6
3    {3, 5} 1 2
4    3 {4, 5, 7} 6
5    6 1 2
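The litemset and transformation phases can be sketched on the example data as follows. This is an illustrative sketch, not the original AprioriAll code: the function names are hypothetical, and the minimum support count of 2 customers is an assumption chosen to be consistent with the counts shown on the slides.

```python
from itertools import combinations

# Customer sequences after the sort phase (taken from the example).
customer_sequences = {
    1: [{30}, {90}, {40, 70}],
    2: [{10, 20}, {30}, {40, 60, 70}, {90}],
    3: [{30, 50, 70}, {10}, {20}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}, {10}, {20}],
}


def itemset_counts(sequences):
    """Litemset phase: count, per itemset, the customers who bought it."""
    counts = {}
    for seq in sequences.values():
        seen = set()  # each customer is counted at most once per itemset
        for tx in seq:
            items = sorted(tx)
            for r in range(1, len(items) + 1):
                seen.update(combinations(items, r))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return counts


def transform(sequences, min_count):
    """Transformation phase: replace each transaction by the IDs of the
    litemsets it contains. min_count is an assumed absolute threshold."""
    counts = itemset_counts(sequences)
    litemsets = sorted(
        (s for s, c in counts.items() if c >= min_count),
        key=lambda s: (len(s), s),
    )
    ids = {s: i for i, s in enumerate(litemsets, start=1)}
    transformed = {}
    for cid, seq in sequences.items():
        new_seq = []
        for tx in seq:
            mapped = {ids[s] for s in ids if set(s) <= tx}
            if mapped:  # transactions without any litemset are dropped
                new_seq.append(mapped)
        transformed[cid] = new_seq
    return ids, transformed


ids, transformed = transform(customer_sequences, min_count=2)
```

Sorting the litemsets by size and then by item value is only a convenient way to reproduce the identifiers 1 to 7 used in the example; any one-to-one mapping would do.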

Apriori-like Algorithms

4. Mining phase
  * Apply an Apriori-like algorithm to the transformed sequences.
5. Maximal phase
  * Find the maximal patterns.


Transformed customer sequences

CID  Items
1    3 6 {4, 5, 7}
2    {1, 2} 3 {4, 5, 7} 6
3    {3, 5} 1 2
4    3 {4, 5, 7} 6
5    6 1 2

Candidate 2-sequences and their support counts (#)

Sequence  #     Sequence  #     Sequence  #
<1 2>     2     <3 4>     3     <5 6>     2
<1 3>     1     <3 5>     3     <5 7>     0
<1 4>     1     <3 6>     3     <6 1>     1
<1 5>     1     <3 7>     3     <6 2>     1
<1 6>     1     <4 1>     0     <6 3>     0
<1 7>     1     <4 2>     0     <6 4>     1
<2 1>     0     <4 3>     0     <6 5>     1
<2 3>     1     <4 5>     0     <6 7>     1
<2 4>     1     <4 6>     2     <7 1>     0
<2 5>     1     <4 7>     0     <7 2>     0
<2 6>     1     <5 1>     1     <7 3>     0
<2 7>     1     <5 2>     1     <7 4>     0
<3 1>     1     <5 3>     0     <7 5>     0
<3 2>     1     <5 4>     0     <7 6>     2
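The support counts of the candidate 2-sequences can be reproduced with a short sketch. It is illustrative only: it enumerates all ordered pairs of distinct litemset IDs, as the table does, rather than using the candidate join of a real Apriori-style miner, and the threshold of 2 customers is an assumption consistent with the example.

```python
from itertools import permutations

# Transformed customer sequences from the example (sets of litemset IDs).
transformed = {
    1: [{3}, {6}, {4, 5, 7}],
    2: [{1, 2}, {3}, {4, 5, 7}, {6}],
    3: [{3, 5}, {1}, {2}],
    4: [{3}, {4, 5, 7}, {6}],
    5: [{6}, {1}, {2}],
}


def contains(sequence, pattern):
    """True if the litemset IDs of pattern occur, in order, in strictly
    later and later transactions of the sequence."""
    j = 0
    for litemset_id in pattern:
        while j < len(sequence) and litemset_id not in sequence[j]:
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def count_candidates(database, candidates):
    """Support count of every candidate sequence."""
    return {
        c: sum(contains(seq, c) for seq in database.values())
        for c in candidates
    }


candidates = list(permutations(range(1, 8), 2))  # the 42 ordered pairs
counts = count_candidates(transformed, candidates)
frequent = sorted(c for c, n in counts.items() if n >= 2)  # assumed threshold
```

Note that `<1 2>` is supported only by customers 3 and 5: customer 2 bought items 1 and 2 in the same transaction, which does not count as one following the other.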


Candidate 3-sequences and their support counts (#)

Sequence   #
<3 4 6>    2
<3 5 6>    2
<3 7 6>    2

Therefore, the frequent sequential patterns are:
<1 2> <3 4> <3 5> <3 6> <3 7> <4 6> <5 6> <7 6> <3 4 6> <3 5 6> <3 7 6>

Litemset mapping

Itemset   #   New
10        3   1
20        3   2
30        4   3
40        3   4
70        4   5
90        4   6
{40 70}   3   7

According to the mapping, the original frequent sequential patterns are:
<10 20> <30 40> <30 70> <30 90> <30 {40 70}> <40 90> <70 90> <{40 70} 90> <30 40 90> <30 70 90> <30 {40 70} 90>


Among these patterns:

* <30 40> and <30 70> are contained in <30 {40 70}>,
* <40 90> and <70 90> are contained in <{40 70} 90>,
* <30 40 90> and <30 70 90> are contained in <30 {40 70} 90>,
* <30 90>, <30 {40 70}> and <{40 70} 90> are themselves contained in <30 {40 70} 90>.

The final maximal sequential patterns are therefore: <10 20> <30 {40 70} 90>
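The maximal phase can be sketched as a containment filter over the frequent patterns. This is a minimal illustration with hypothetical function names; it assumes the usual definition that a pattern is maximal when it is not contained in any other frequent pattern.

```python
def is_contained(alpha, beta):
    """True if sequence alpha is a subsequence of sequence beta
    (each element of alpha is a subset of a later and later element of beta)."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def maximal_patterns(patterns):
    """Keep only the patterns not contained in another pattern of the list."""
    return [
        p for p in patterns
        if not any(p != q and is_contained(p, q) for q in patterns)
    ]


# The frequent sequential patterns of the example, in original item values.
frequent_patterns = [
    [{10}, {20}],
    [{30}, {40}],
    [{30}, {70}],
    [{30}, {90}],
    [{30}, {40, 70}],
    [{40}, {90}],
    [{70}, {90}],
    [{40, 70}, {90}],
    [{30}, {40}, {90}],
    [{30}, {70}, {90}],
    [{30}, {40, 70}, {90}],
]

maximal = maximal_patterns(frequent_patterns)
```

The filter compares every pair of patterns, which is quadratic in the number of frequent patterns; that is acceptable for a sketch of this size.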


Related Works

* Static database
  * AprioriAll, by Agrawal et al.
  * GSP, by R. Srikant et al.
  * SPADE, by Zaki et al.
  * FreeSpan, by J. Han et al.
  * PrefixSpan, by J. Pei et al.
  * SPAM, by Ayres et al.

Related Works

* Incremental database
  * ISM, by Parthasarathy et al.
  * IncSP, by Lin et al.
  * ISE, by Masseglia et al.
  * IncSpan, by Cheng et al.
  * MILE, by Chen et al.


Motivation

* The assumption of a static database may not hold in practice.
  * Real-world data change on the fly.
* Sequential patterns found in an incremental database may be of little interest to users.
  * Users are usually more interested in recent data than in old data.

Motivation

* If a sequence has no newly arriving elements, it still stays in the database and undesirably contributes to |Db|.
  * New sequential patterns that appear frequently in recent sequences may therefore not be recognized as frequent sequential patterns.


Definition: Period of Interest

* The Period of Interest (abbreviated as POI) is a sliding window whose length is a user-specified time interval and which advances continuously as time goes by.
* The sequences having elements whose timestamps fall within the POI contribute to |Db| for the current sequential patterns.
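How the POI determines |Db| can be sketched as follows. The function name and the toy data are hypothetical; each sequence is modeled as a mapping from timestamp to the element received at that time.

```python
def db_size(sequences, p, q):
    """|Db p,q|: number of sequences with at least one element whose
    timestamp lies in the window [p, q]."""
    return sum(
        any(p <= t <= q for t in elements)
        for elements in sequences.values()
    )


# Hypothetical toy data: sequence ID -> {timestamp: element}.
toy = {
    "S1": {1: "A", 2: "B"},
    "S2": {4: "C"},
    "S3": {7: "A"},
}
```

As the window slides from [1, 5] to [5, 9], sequences S1 and S2 stop contributing to |Db| because none of their elements fall inside the window any more.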

[Figure: example progressive sequence database with six sequences (SID S01 to S06) over timestamps t1, t2, ..., t10, showing the sliding windows Db1,5, Db2,6, Db3,7, Db4,8, Db5,9 and Db6,10. POI = 5, min_supp = 0.5.]


Progressive Sequential Pattern

* The progressive sequential pattern mining problem is defined as follows: "Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."


Naïve Algorithm

* Use conventional static sequential pattern mining algorithms to mine sequential patterns separately from every POI, e.g., Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, etc.
* For a sequence database whose elements appear in an interval of n timestamps, the total number of POIs in this interval is n − POI + 1.
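The naïve approach can be sketched as a loop over all POI windows. This is an illustration under stated assumptions: `mine` stands for any static miner supplied by the caller (hypothetical here), and sequences are modeled as mappings from timestamp to element.

```python
def poi_windows(n, poi):
    """All windows [p, p + poi - 1] inside timestamps 1..n;
    there are n - poi + 1 of them."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]


def naive_progressive(sequences, n, poi, mine):
    """Re-mine every sub-database from scratch with a static miner.

    `mine` is a caller-supplied function (an assumption of this sketch)
    that takes a list of sequences and returns their patterns."""
    results = {}
    for p, q in poi_windows(n, poi):
        sub_db = []
        for elements in sequences.values():
            inside = [elements[t] for t in sorted(elements) if p <= t <= q]
            if inside:  # only sequences with elements in the POI count
                sub_db.append(inside)
        results[(p, q)] = mine(sub_db)
    return results


# Hypothetical toy data; len is used as a stand-in "miner" that just
# reports the size of each sub-database.
toy = {"S1": {1: "A", 2: "B"}, "S2": {4: "C"}, "S3": {7: "A"}}
sizes = naive_progressive(toy, n=10, poi=5, mine=len)
```

For n = 10 and POI = 5 the loop visits six windows, (1, 5) through (6, 10), and every window triggers a full run of the static miner.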


Prior Work

* The only prior work on progressive databases is GSP+ and MFS+, proposed by Zhang et al. and based on the static algorithms GSP and MFS (also derived by the same authors).
* However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS.
* As a result, the performance improvement of GSP+ and MFS+ over GSP and MFS is only within 15%, as reported by their authors.


Algorithm DirApp

* DirApp stands for Direct Append.
* It consists of two procedures:
  * Progressively Updating, abbreviated as PrUp
  * Immediately Filtering, abbreviated as ImFi


Procedure PrUp

* When progressively reading newly incoming elements, Procedure PrUp can
  * update each sequence in the sequence database,
  * generate candidate sequential patterns,
  * calculate the occurrence frequencies of all candidate sequential patterns in the current POI.


Procedure ImFi

* DirApp uses Procedure ImFi to
  * filter out obsolete data from the existing sequence database,
  * prune away obsolete candidate sequential patterns from the candidate set,
  * report the most up-to-date frequent sequential patterns to the user in every POI.

[Figure: the example database of six sequences (SID S01 to S06) over timestamps t1, ..., t10, with sequence S01 singled out: A at t1, B at t2, C at t4, AD at t5, B at t7.]


Example

Sequence S01 receives A at t1, B at t2, C at t4, AD at t5 and B at t7. The candidate set of S01 evolves as follows; the number written after each candidate is its starting timestamp.

(1) Db1,1: A1

(2) Db1,2: A1, B2, AB1

(3) Db1,3: A1, B2, AB1

(4) Db1,4: A1, B2, AB1, C4, AC1, BC2, ABC1

(5) Db1,5: A5, B2, AB1, C4, AC1, BC2, ABC1, D5, (AD)5, AD1, A(AD)1, BA2, BD2, B(AD)2, ABD1, AB(AD)1, CA4, CD4, C(AD)4, ACD1, AC(AD)1, BCA2, BCD2, BC(AD)2, ABCD1, ABC(AD)1

(6) Db2,6: A5, B2, C4, BC2, D5, (AD)5, BA2, BD2, B(AD)2, CA4, CD4, C(AD)4, BCA2, BCD2, BC(AD)2

(7) Db3,7: A5, C4, D5, (AD)5, CA4, CD4, C(AD)4, B7, AB5, CB4, DB5, (AD)B5, CAB4, CDB4, C(AD)B4

…
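The evolution of the candidate set of S01 can be reproduced with a small sketch. It is not the authors' DirApp implementation: the function names are hypothetical, elements are encoded as strings of single-character items, and the rule that a candidate element already present in a pattern is not appended again is borrowed from the Pisa description later in the slides.

```python
from itertools import combinations


def element_subsets(element):
    """All non-empty item combinations of an element, e.g. 'AD' -> A, D, AD."""
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def dirapp_step(cands, t, element, poi):
    """One timestamp for a single sequence.

    cands maps a pattern (tuple of elements) to its starting timestamp.
    """
    # Filtering (ImFi-like): drop candidates that start before the POI.
    oldest = t - poi + 1
    kept = {p: s for p, s in cands.items() if s >= oldest}
    if not element:
        return kept
    # Updating (PrUp-like): append each candidate element to each kept
    # pattern, inheriting the starting timestamp of that pattern.
    updated = dict(kept)
    subs = element_subsets(element)
    for pattern, start in kept.items():
        for e in subs:
            if e not in pattern:  # assumed rule, see lead-in
                updated[pattern + (e,)] = start
    # Single-element candidates start (or restart) at the current time.
    for e in subs:
        updated[(e,)] = t
    return updated


S01 = {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"}  # from the example
cands, history = {}, {}
for t in range(1, 8):
    cands = dirapp_step(cands, t, S01.get(t, ""), poi=5)
    history[t] = dict(cands)
```

After the loop, `history[5]`, `history[6]` and `history[7]` hold the candidate sets listed for Db1,5, Db2,6 and Db3,7, keyed by pattern with the starting timestamp as value.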

Example: all sequences at t2

Candidate sets of the individual sequences in Db1,2:

* S01: A1, B2, AB1
* S02: A1, D1, (AD)1, B2, AB1, DB1, (AD)B1
* S03: A1, B2, C2, (BC)2, AB1, AC1, A(BC)1
* S04: D2

Merged candidates of Db1,2 (4 sequences) with their occurrence counts:

Candidate  #
AB1        3
A(BC)1     1
AC1        1
(AD)B1     1
DB1        1

Frequent sequential pattern: AB1 (3)

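Example: merged candidates from t2 to t5

The number in parentheses after each database is its number of sequences; the number in parentheses after each candidate is its occurrence count.

(2) Db1,2 (4): AB1 (3), A(BC)1 (1), AC1 (1), (AD)B1 (1), DB1 (1)
    Frequent: AB1 (3)

(3) Db1,3 (5): AB1 (3), A(BC)1 (1), AC1 (1), (AD)B1 (1), DB1 (1), A(BC)B1 (1), ACB1 (1), (BC)B2 (1), CB2 (1), DC2 (1)
    Frequent: AB1 (3)

(4) Db1,4 (5): AB1 (3), A(BC)1 (1), AC1 (2), (AD)B1 (1), DB3 (2), A(BC)B1 (1), ACB1 (1), (BC)B2 (1), CB2 (1), DC2 (1), ABC1 (2), A(BC)BC1 (1), A(BC)C1 (1), (AD)A1 (1), (AD)BA1 (1), BA2 (1), BC3 (2), (BC)BC2 (1), (BC)C2 (1), DA1 (1), DBA1 (1)
    Frequent: AB1 (3)

(5) Db1,5 (5): AB1 (3), A(BC)1 (1), AC1 (2), (AD)B1 (1), DB3 (2), A(BC)B1 (1), ACB1 (1), (BC)B2 (1), CB2 (1), DC2 (1), ABC1 (2), A(BC)BC1 (1), A(BC)C1 (1), (AD)A1 (1), (AD)BA1 (1), BA4 (3), BC3 (2), (BC)BC2 (1), (BC)C2 (1), DA3 (3), DBA3 (2), A(AD)1 (1), AB(AD)1 (1), ABC(AD)1 (1), ABCD1 (1), ABD1 (1), AC(AD)1 (1), ACD1 (1), AD1 (1), B(AD)2 (1), BCA2 (1), BC(AD)2 (1), BCD2 (1), BD2 (1), CA4 (2), C(AD)4 (1), CD4 (1), DCA2 (1)
    Frequent: AB1 (3), DA3 (3), BA4 (3)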

Example: merged candidates from t6 to t9

(6) Db2,6 (5): DB3 (1), (BC)B2 (1), CB2 (1), DC2 (1), BA4 (4), BC3 (2), (BC)BC2 (1), (BC)C2 (1), DA3 (2), DBA3 (1), B(AD)2 (1), BCA3 (2), BC(AD)2 (1), BCD2 (1), BD2 (1), CA4 (3), C(AD)4 (1), CD4 (1), DCA2 (1), (BC)A2 (1), (BC)BA2 (1), (BC)BCA2 (1), (BC)CA2 (1), CBA2 (1)
    Frequent: BA4 (4), CA4 (3)

(7) Db3,7 (5): DB5 (2), BA4 (2), BC4 (2), DA3 (1), DBA3 (1), BCA3 (1), CA4 (3), C(AD)4 (1), CD4 (1), AB5 (2), A(BC)5 (1), AC5 (2), (AD)B5 (1), BAC4 (1), CAB4 (2), CA(BC)3 (1), C(AD)B4 (1), CB4 (2), C(BC)3 (1), CDB4 (1), DAC3 (1), DBAC3 (1), DBC3 (1), DC3 (1)
    Frequent: CA4 (3)

(8) Db4,8 (6): DB5 (1), BA4 (1), BC7 (2), CA4 (2), C(AD)4 (1), CD4 (1), AB5 (2), A(BC)5 (1), AC6 (4), (AD)B5 (1), BAC4 (1), CAB4 (1), C(AD)B4 (1), CB4 (1), CDB4 (1), ABC5 (1), (AD)BC5 (1), (AD)C5 (1), DBC5 (1), DC5 (1)
    Frequent: AC6 (4)

(9) Db5,9 (5): DB5 (1), BC7 (1), AB5 (2), A(BC)5 (1), AC8 (5), (AD)B5 (1), ABC5 (1), (AD)BC5 (1), (AD)C5 (1), DBC5 (1), DC5 (1), ACD6 (2), AD6 (2), CD8 (2)
    Frequent: AC8 (5)


The Advantages of DirApp

* DirApp needs only one scan of the newly arriving elements and the candidate set at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
* DirApp can
  * maintain the latest data sequences,
  * find the complete set of up-to-date sequential patterns,
  * delete obsolete data and patterns rapidly.


The Disadvantages of DirApp

* DirApp needs a large amount of working space to store the candidate sets of all sequences.
* Scanning all candidate sets takes a large amount of execution time.
* DirApp needs another data structure to calculate the occurrence frequencies of all candidate sequential patterns.


Algorithm Pisa

* Pisa stands for Progressive mIning of Sequential pAtterns.
* Pisa utilizes a Progressive Sequential tree (abbreviated as PS-tree) to
  * maintain the information of all sequences in each POI,
  * update each sequence,
  * find up-to-date sequential patterns.


PS-tree

* The nodes in a PS-tree are of two different types:
  * Root node
  * Common nodes
* Each common node stores two pieces of information:
  * Node label = an element in a sequence
  * Sequence list = the sequence IDs containing this element, each marked with its corresponding timestamp

[Figure: a common node attached to the Root, annotated with its label and its sequence list of sequence IDs and timestamps.]


PS-tree

* Whenever a series of elements appears in the same sequence, there will be a series of nodes, each labeled by one of the elements, with the same sequence ID in their sequence lists.
  * The first node, representing the first element, is connected to the Root node.
  * The other nodes are connected to the first node analogously.


[Figure: the example database next to a PS-tree fragment in which nodes labeled A, B and C each list sequence ID 01 with timestamp 1.]


PS-tree

* The path from the Root node to any other node represents a candidate sequential pattern appearing in this sequence.
* The starting timestamp of each candidate sequential pattern is marked in the node labeled by its last element.


[Figure: PS-tree for sequence S01 at t4. The Root has children A (sequence ID 01, timestamp 1), B (01, timestamp 2) and C (01, timestamp 4); the nodes on the paths A-B, A-C and A-B-C carry timestamp 1, and the node on the path B-C carries timestamp 2.]


Algorithm Pisa

* When receiving elements at timestamp t+1, Pisa traverses the PS-tree of timestamp t in post-order and transforms it into the PS-tree of timestamp t+1. During the traversal, Pisa
  * deletes the obsolete elements from the PS-tree,
  * updates the current sequences in the PS-tree,
  * inserts the newly arriving elements into the PS-tree.


For a common node

* Pisa deletes the obsolete sequences from the sequence list of this node.
  * If no sequence ID is left in the sequence list, Pisa prunes this node away from its parent.
* Pisa checks the sequence IDs left in the sequence list to see whether any of these sequences has a newly arriving element.
  * If there is no newly arriving element, Pisa goes on to the next node.

For a common node

* Otherwise, Pisa generates all combinations of candidate elements from the arriving element.
  * Ex) ABC -> A, B, C, AB, AC, BC, ABC
* For each candidate element that does not exist on the path from Root to the current node:
  * If there is a child with the same label, Pisa updates the timestamp of this sequence to the timestamp of the same sequence in the parent’s sequence list.
  * Otherwise, Pisa creates a new child for this element with the sequence ID and the timestamp of the same sequence in the parent’s sequence list.


For Root node

* Instead of checking a sequence list, Pisa examines all sequences that have newly arriving elements.
* After Pisa generates all combinations of candidate elements, for each of them:
  * If there is a child with the same label, Pisa updates the timestamp of this sequence to t+1.
  * Otherwise, Pisa creates a new child for this element with the sequence ID and timestamp t+1.


Algorithm Pisa

* After Pisa processes a common node, if the number of sequence IDs in its sequence list is no less than min_supp*|Dbp,q|, the path from Root to this node is output as a frequent sequential pattern.
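The update rules for common nodes and for the Root can be sketched as a post-order traversal. This is a simplified illustration rather than the authors' implementation: the class and function names are hypothetical, elements are encoded as strings of single-character items, and the number of sequences in the current POI is assumed to be supplied by the caller as an absolute count.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    label: str = "Root"
    seqs: dict = field(default_factory=dict)      # sequence ID -> timestamp
    children: dict = field(default_factory=dict)  # label -> Node


def subsets(element):
    """All candidate elements of an element, e.g. 'AD' -> A, D, AD."""
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def pisa_update(root, t, arrivals, poi):
    """Turn the PS-tree of timestamp t-1 into the PS-tree of timestamp t.
    arrivals maps a sequence ID to the element it receives at time t."""
    oldest = t - poi + 1

    def visit(node, path):
        # Post-order: existing children first, so that they still see the
        # timestamps their parent had before this update.
        for child in list(node.children.values()):
            visit(child, path + (child.label,))
            if not child.seqs:
                del node.children[child.label]  # prune emptied nodes
        if node is root:
            for sid, element in arrivals.items():
                for e in subsets(element):
                    root.children.setdefault(e, Node(e)).seqs[sid] = t
            return
        # Delete obsolete sequences from the sequence list.
        node.seqs = {s: ts for s, ts in node.seqs.items() if ts >= oldest}
        # Append candidate elements not already on the path.
        for sid, start in node.seqs.items():
            for e in subsets(arrivals.get(sid, "")):
                if e not in path:
                    node.children.setdefault(e, Node(e)).seqs[sid] = start

    visit(root, ())


def frequent_patterns(root, min_count):
    """Paths whose sequence list holds at least min_count sequence IDs."""
    found = []

    def walk(node, path):
        for child in node.children.values():
            p = path + (child.label,)
            if len(child.seqs) >= min_count:
                found.append((p, len(child.seqs)))
            walk(child, p)

    walk(root, ())
    return found


def candidates_of(root, sid):
    """Candidate patterns of one sequence with their starting timestamps."""
    return {p: n.seqs[sid] for p, n in _paths(root, ()) if sid in n.seqs}


def _paths(node, path):
    for child in node.children.values():
        yield path + (child.label,), child
        yield from _paths(child, path + (child.label,))


S01 = {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"}  # sequence S01 of the example
root = Node()
for t in range(1, 8):
    pisa_update(root, t, {"S01": S01[t]} if t in S01 else {}, poi=5)
```

Because all sequences share one tree, a pattern common to several sequences is stored once, and its support is simply the length of the sequence list at the end of its path.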



[Figure: the example database with POI = 5 and min_supp = 0.5, followed by the PS-tree after timestamp t1 for Db1,1 (3 sequences). Arriving elements: A, AD, A. Node A lists sequences 01, 02 and 03 with timestamp 1; nodes D and AD list sequence 02 with timestamp 1.]

[Figure: PS-tree after timestamp t2 for Db1,2 (4 sequences). Arriving elements: B, B, D, BC. Frequent sequential pattern: AB1 (3).]

[Figure: PS-tree after timestamp t3 for Db1,3 (5 sequences). Arriving elements: C, D, B. Frequent sequential pattern: AB1 (3).]

[Figure: PS-tree after timestamp t4 for Db1,4 (5 sequences). Arriving elements: C, B, A, C. Frequent sequential pattern: AB1 (3).]

[Figure: PS-tree after timestamp t5 for Db1,5 (5 sequences). Frequent sequential patterns: AB1 (3), BA4 (3), DA3 (3).]

[Figure: PS-tree after timestamp t6 for Db2,6 (5 sequences). Arriving element: A. Frequent sequential patterns: BA4 (4), CA4 (3).]

[Figure: PS-tree after timestamp t7 for Db3,7 (5 sequences). Frequent sequential pattern: CA4 (3).]

[Figure: PS-tree after timestamp t8 for Db4,8 (6 sequences). Arriving elements: C, C, A. Frequent sequential pattern: AC6 (3).]

[Figure: PS-tree after timestamp t9 for Db5,9 (5 sequences). Arriving elements: D, C, D. Frequent sequential pattern: AC8 (4).]

[Figure: PS-tree after timestamp t10 for Db6,10 (5 sequences). Arriving elements: BD, D. Frequent sequential pattern: CD8 (4).]


The Advantages of Pisa

* Pisa needs only one scan of the newly arriving elements and the PS-tree at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
* Pisa can
  * maintain the latest data sequences,
  * find the complete set of up-to-date sequential patterns,
  * delete obsolete data and patterns rapidly.


The Advantages of Pisa

* Each path from Root to any other node of the PS-tree forms a unique candidate sequential pattern.
  * Pisa thus merges identical candidate patterns, and patterns do not have to store their prefix elements separately.
  * The PS-tree consumes less space.
  * Handling identical sequential patterns together is also very efficient in execution time.
* Fast Pisa with approximate results.


Experiments

* Comparative algorithms
  * GSP+: re-mining version of GSP
  * SPAM+: re-mining version of SPAM
  * DirApp
* Environment
  * Pentium 4, 3 GHz CPU and 2 GB RAM
  * Coded in C++


Experiments

* The synthetic datasets are generated in a way similar to the IBM data generator designed for testing sequential pattern mining algorithms.


Experiments

* We divide the target dataset into n timestamps.
* According to the POI, the first m timestamps (m = POI and m < n) are viewed as the original database, and the rest of the transactions in the dataset are received by the system incrementally.


Experiments

* The first run of the experiments mines the first POI from the first m timestamps of the dataset.
* After that, we shift the POI forward by t (t << m) timestamps for each of the following runs.


Experiments

* The real datasets are from KDDCUP’07.
  * We randomly choose 120 successive days for the performance evaluation.
  * A timestamp is set to 3 days in order to obtain sufficient frequent sequential patterns.
* Therefore, there are a total of 40 timestamps, and the POI is set to 10.
  * The new datasets contain more than 5000 sequences and 2000 different items.


[Figure: Cumulative Execution Time]

[Figure: Minimum Support]

[Figure: Length of POI]

[Figure: Number of Sequences]

[Figure: Scalability of Pisa]

[Figure: Real Data Set]

[Figure: Improvement of FastPisa]

[Figure: Information Loss of FastPisa]


Conclusions

* We proposed a progressive algorithm, Pisa, to handle the progressive sequential pattern mining problem without re-mining all sub-databases at each timestamp.
* Pisa needs only one scan of the newly arriving elements and the PS-tree at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.


Conclusions

* Pisa can
  * maintain the latest information of sequences,
  * find the complete set of up-to-date sequential patterns,
  * delete obsolete data and patterns rapidly.
* Pisa also
  * consumes less space,
  * has high efficiency,
  * possesses great scalability.


References

* R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements.” Proc. of EDBT, 1996.
* J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential pattern mining using a bitmap representation.” Proc. of ACM SIGKDD, 2002.
* M. Zhang, B. Kao, D. W.-L. Cheung, and C. L. Yip, “Efficient algorithms for incremental update of frequent sequences.” Proc. of PAKDD, 2002.


Thank You! Q & A
