Your One Small Step, My One Giant Leap (你的一小步,我的一大步)

Jen-Wei Huang (黃仁暐)
jwhuang@gmail.com
National Taiwan University

112/04/21 Jen-Wei Huang

[Slides 2-14: image-only slides. Image credits:]
* http://www.wretch.cc/blog/EtudeBIKE
* http://www.giant-bicycles.com/zh-TW/
* http://cape7.pixnet.net/blog
* http://www.wretch.cc/blog/orzboyz
* http://blog.sina.com.tw/9winds/
* http://atomcinema.pixnet.net/blog
* http://www.amazon.com
* http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html

A General Model for Sequential Pattern Mining with a Progressive Database

Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen
National Taiwan University

* IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008

Outline
- Introduction
- Preliminaries
- Algorithm Pisa
- Experiments
- Conclusions
- Q & A

Introduction to SPM
- “Mining of frequently occurring patterns related to time or other sequences.” (J. Han, Data Mining: Concepts and Techniques)
- “Given a set of sequences, find the complete set of frequent subsequences.” (J. Pei, PrefixSpan)
- Example: which items a customer will buy, given that he or she has already bought certain items.

Time-related data
- Customers’ buying behavior
- Natural phenomena
- Sensor network data
- Web access patterns
- Stock price changes
- DNA sequence applications

Definition
- Let I = {x1, x2, ..., xn} be a set of distinct items.
- An element e, denoted by (xi xj ...), is a subset of I whose items appear in a sequence at the same time.
- A sequence s, denoted by < e1, e2, ..., em >, is an ordered list of elements.
- A sequence database Db contains a set of sequences, and |Db| is the number of sequences in Db.

Definition
- A sequence α = < a1, a2, ..., an > is a subsequence of another sequence β = < b1, b2, ..., bm > if there exist integers 1 ≤ i1 < i2 < ... < in ≤ m such that a1 ⊆ b_i1, a2 ⊆ b_i2, ..., and an ⊆ b_in.
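The subsequence test above can be sketched in a few lines. This is only an illustration, assuming each sequence is a Python list of sets (one set per element); the function name `is_subsequence` is hypothetical and does not come from the slides.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Both arguments are lists of sets. Each element of alpha must be a
    subset of some element of beta, and the matched positions in beta
    must be strictly increasing (greedy left-to-right matching).
    """
    j = 0
    for a in alpha:
        # Skip elements of beta that do not contain a.
        while j < len(beta) and not set(a) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match a later position
    return True


beta = [{"a", "d"}, {"e"}, {"b", "c", "f"}]
print(is_subsequence([{"a"}, {"b", "c"}], beta))  # True
print(is_subsequence([{"b"}, {"a"}], beta))       # False: wrong order
```

Greedy matching is sufficient here because taking the earliest possible match for each element never rules out a later match.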

Definition
- Sequential pattern mining can be defined as: "Given a sequence database Db and a user-defined minimum support min_sup, find the complete set of subsequences whose occurrence frequencies are ≥ min_sup ∗ |Db|."

Three Categories
Depending on how the underlying database is managed, sequential pattern mining can be divided into three categories, namely sequential pattern mining with
- a static database,
- an incremental database,
- a progressive database.

How To Do Sequential Pattern Mining on a Static Database: An Overview

2006/03/24 jwhuang, National Taiwan University

How?
- Apriori-like algorithms
  - AprioriAll, by Agrawal et al.
  - GSP, by R. Srikant et al.
- Partition-based algorithms
  - FreeSpan, by J. Han et al.
  - PrefixSpan, by J. Pei et al.
- Vertical-format algorithms
  - SPADE, by Zaki et al.
  - SPAM, by Ayres et al.

Apriori-like Algorithms
1. Sort phase
   - Sort the database with customer ID as the primary key and time as the secondary key.
2. Litemset phase
   - Count the frequency of each itemset, i.e. the fraction of customers who bought the itemset.

Apriori-like Algorithms
3. Transformation phase
   - Transform each transaction into the set of all litemsets it contains, which gives customer sequences of the form:
     C01: <(1,5) (2) (3) (4)>
     C02: <(1) (3) (4) (3,5)>
     C03: <(1) (2) (3) (4)>
     C04: <(1) (3) (5)>
     C05: <(4) (5)>

[Table: raw transactions (CID, items). The cell contents are not recoverable from the transcript.]

Customer sequences:

CID  Items
1    30 90 {40 70}
2    {10 20} 30 {40 60 70} 90
3    {30 50 70} 10 20
4    30 {40 70} 90
5    90 10 20

Itemset counts (number of customers whose sequence contains the itemset):

Itemset       #
10            3
20            3
30            4
40            3
50            1
60            1
70            4
90            4
{10 20}       1
{40 60}       1
{40 70}       3
{60 70}       1
{40 60 70}    1
{30 50}       1
{30 70}       1
{50 70}       1
{30 50 70}    1
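The litemset phase on this example can be sketched as below. The customer sequences are the ones tabulated on the slide; the function name `itemset_counts` and the threshold `min_count = 2` are assumptions made for the illustration (the slides do not state the threshold; 2 is inferred from which patterns they later report as frequent).

```python
from itertools import combinations

# Customer sequences from the slide: one list of elements (itemsets) per customer.
customers = {
    1: [{30}, {90}, {40, 70}],
    2: [{10, 20}, {30}, {40, 60, 70}, {90}],
    3: [{30, 50, 70}, {10}, {20}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}, {10}, {20}],
}


def itemset_counts(customers):
    """Count, for every itemset, the number of customers that bought it.

    An itemset is counted at most once per customer, no matter how many
    of the customer's elements contain it.
    """
    counts = {}
    for elements in customers.values():
        seen = set()
        for element in elements:
            items = sorted(element)
            for r in range(1, len(items) + 1):
                seen.update(combinations(items, r))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return counts


counts = itemset_counts(customers)
min_count = 2  # assumed threshold, not stated on the slides
litemsets = sorted((s for s, c in counts.items() if c >= min_count),
                   key=lambda s: (len(s), s))
print(litemsets)
# [(10,), (20,), (30,), (40,), (70,), (90,), (40, 70)]
```

The seven itemsets printed are the ones that receive new labels in the transformation step.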

Mapping of the litemsets to new labels:

Itemset    #   New
10         3   1
20         3   2
30         4   3
40         3   4
70         4   5
90         4   6
{40 70}    3   7

Transformed customer sequences:

CID  Items
1    3 6 {4, 5, 7}
2    {1, 2} 3 {4, 5, 7} 6
3    {3, 5} 1 2
4    3 {4, 5, 7} 6
5    6 1 2
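A minimal sketch of the transformation step, assuming each element is replaced by the set of labels of all litemsets it contains and that elements containing no litemset are dropped. The names `mapping` and `transform` are hypothetical; the data are the ones from the slide.

```python
# Litemset -> new label, as in the mapping table.
mapping = {
    frozenset({10}): 1,
    frozenset({20}): 2,
    frozenset({30}): 3,
    frozenset({40}): 4,
    frozenset({70}): 5,
    frozenset({90}): 6,
    frozenset({40, 70}): 7,
}

customers = {
    1: [{30}, {90}, {40, 70}],
    2: [{10, 20}, {30}, {40, 60, 70}, {90}],
    3: [{30, 50, 70}, {10}, {20}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}, {10}, {20}],
}


def transform(elements, mapping):
    """Replace each element by the labels of the litemsets it contains."""
    out = []
    for element in elements:
        labels = {label for itemset, label in mapping.items()
                  if itemset <= element}
        if labels:  # elements without any litemset disappear
            out.append(labels)
    return out


for cid, elements in customers.items():
    print(cid, transform(elements, mapping))
```

For customer 2, for example, the element {40 60 70} becomes {4, 5, 7} because it contains the litemsets 40, 70 and {40 70}, while 60 is not a litemset.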

Apriori-like Algorithms
4. Mining phase
   - Find the frequent sequences with an Apriori-like algorithm.
5. Maximal phase
   - Find the maximal patterns.

Support counts of the candidate 2-sequences over the transformed customer sequences (row = first item, column = second item):

     1  2  3  4  5  6  7
1    -  2  1  1  1  1  1
2    0  -  1  1  1  1  1
3    1  1  -  3  3  3  3
4    0  0  0  -  0  2  0
5    1  1  0  0  -  2  0
6    1  1  0  1  1  -  1
7    0  0  0  0  0  2  -

Support counts of the candidate 3-sequences:

Sequence   #
3 4 6      2
3 5 6      2
3 7 6      2

Therefore, the frequent sequential patterns are:
<1 2> <3 4> <3 5> <3 6> <3 7> <4 6> <5 6> <7 6> <3 4 6> <3 5 6> <3 7 6>

According to the mapping (10 -> 1, 20 -> 2, 30 -> 3, 40 -> 4, 70 -> 5, 90 -> 6, {40 70} -> 7), the original frequent sequential patterns are:
<10 20> <30 40> <30 70> <30 90> <30 {40 70}> <40 90> <70 90> <{40 70} 90> <30 40 90> <30 70 90> <30 {40 70} 90>
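The counts of the mining phase can be checked with a brute-force sketch. Note that this enumerates every ordered combination of labels instead of generating candidates Apriori-style, so it illustrates the result, not the algorithm; `contains`, `frequent` and the minimum count of 2 are assumptions made for the illustration.

```python
from itertools import permutations

# Transformed customer sequences from the slide.
db = [
    [{3}, {6}, {4, 5, 7}],
    [{1, 2}, {3}, {4, 5, 7}, {6}],
    [{3, 5}, {1}, {2}],
    [{3}, {4, 5, 7}, {6}],
    [{6}, {1}, {2}],
]


def contains(sequence, pattern):
    """True if the labels of pattern occur in order, each in a later element."""
    pos = 0
    for label in pattern:
        while pos < len(sequence) and label not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def frequent(db, k, min_count):
    """Brute force: count every ordered k-tuple of distinct labels."""
    labels = sorted(set().union(*(e for seq in db for e in seq)))
    result = {}
    for pattern in permutations(labels, k):
        count = sum(contains(seq, pattern) for seq in db)
        if count >= min_count:
            result[pattern] = count
    return result


print(frequent(db, 2, 2))
print(frequent(db, 3, 2))
```

With a minimum count of 2 this yields the eight 2-sequences and three 3-sequences listed above.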

According to the mapping, the original frequent sequential patterns are:
<10 20> <30 40> <30 70> <30 90> <30 {40 70}> <40 90> <70 90> <{40 70} 90> <30 40 90> <30 70 90> <30 {40 70} 90>

Because
- <30 40> and <30 70> are contained in <30 {40 70}>,
- <40 90> and <70 90> are contained in <{40 70} 90>,
- <30 40 90> and <30 70 90> are contained in <30 {40 70} 90>,
removing these six patterns leaves <10 20> <30 90> <30 {40 70}> <{40 70} 90> <30 {40 70} 90>. Of these, <30 90>, <30 {40 70}> and <{40 70} 90> are in turn contained in <30 {40 70} 90>, so the final maximal sequential patterns are <10 20> and <30 {40 70} 90>.
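A small sketch of the maximal phase, assuming the strict definition (a pattern is maximal if no other frequent pattern contains it). The helper names `seq`, `contained` and `maximal` are hypothetical. Applying the containment test to every pair of patterns, rather than only to the three groups named above, leaves two patterns.

```python
def seq(*elements):
    """Build a pattern as a tuple of frozensets, one per element."""
    return tuple(frozenset(e) for e in elements)


def contained(a, b):
    """True if pattern a is a subsequence of pattern b (subset per element)."""
    pos = 0
    for element in a:
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def maximal(patterns):
    """Keep the patterns that are not contained in any other pattern."""
    return [p for p in patterns
            if not any(p != q and contained(p, q) for q in patterns)]


patterns = [
    seq({10}, {20}), seq({30}, {40}), seq({30}, {70}), seq({30}, {90}),
    seq({30}, {40, 70}), seq({40}, {90}), seq({70}, {90}),
    seq({40, 70}, {90}), seq({30}, {40}, {90}), seq({30}, {70}, {90}),
    seq({30}, {40, 70}, {90}),
]

for p in maximal(patterns):
    print([sorted(e) for e in p])
# [[10], [20]]
# [[30], [40, 70], [90]]
```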

Related Works: Static database
- AprioriAll, by Agrawal et al.
- GSP, by R. Srikant et al.
- SPADE, by Zaki et al.
- FreeSpan, by J. Han et al.
- PrefixSpan, by J. Pei et al.
- SPAM, by Ayres et al.

Related Works: Incremental database
- ISM, by Parthasarathy et al.
- IncSP, by Lin et al.
- ISE, by Masseglia et al.
- IncSpan, by Cheng et al.
- MILE, by Chen et al.

Motivation
- The assumption of having a static database may not hold in practice.
  - Data in the real world change on the fly.
- Sequential patterns found in an incremental database may be of little interest to users.
  - Users are usually more interested in recent data than in old data.
- If a sequence does not receive any newly arriving elements, it still stays in the database and undesirably contributes to |Db|.
  - New sequential patterns that appear frequently in the recent sequences may then not be recognized as frequent sequential patterns.

Definition: Period of Interest
- The Period of Interest (abbreviated as POI) is a sliding window whose length is a user-specified time interval and which advances continuously as time goes by.
- The sequences having elements whose timestamps fall into the POI contribute to |Db| for the current sequential patterns.
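To make the role of the POI concrete, here is a minimal sketch that computes which sequences contribute to |Db| for a given window. The representation (a dict from sequence ID to a dict of timestamp -> element), the function name `sequences_in_poi` and the toy data are all hypothetical.

```python
def sequences_in_poi(db, p, q):
    """IDs of the sequences with at least one element timestamped in [p, q]."""
    return [sid for sid, elements in db.items()
            if any(p <= t <= q for t in elements)]


# Toy progressive database (illustrative data only).
db = {
    "S01": {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"},
    "S02": {1: "AD", 2: "B"},
    "S03": {9: "C"},
}

poi = 5
for now in (5, 8, 10):
    window = (now - poi + 1, now)
    members = sequences_in_poi(db, *window)
    print(window, members, len(members))
```

In this toy data S02 stops contributing to |Db| once the window has moved past t2, which is the behavior the motivation slides ask for.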

[Figure: example progressive sequence database. Six sequences, S01 to S06, receive elements such as A, B and (AD) at timestamps t1 to t10. The sub-databases Db1,5, Db2,6, Db3,7, Db4,8, Db5,9 and Db6,10 are the successive periods of interest. POI = 5, min_supp = 0.5]


Progressive Sequential Pattern
- The progressive sequential pattern mining problem is defined as follows: "Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."

Naïve Algorithm
- Use a conventional static sequential pattern mining algorithm to mine the sequential patterns of every POI separately, e.g. Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, etc.
- For a sequence database whose elements appear within an interval of n timestamps, the total number of POIs in this interval is (n − POI + 1).
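The window count can be checked with a short sketch that enumerates the sub-databases the naive approach would have to mine one by one. The function name `poi_windows` is hypothetical.

```python
def poi_windows(n, poi):
    """All (start, end) timestamp windows of length poi within 1..n."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]


windows = poi_windows(n=10, poi=5)
print(windows)       # [(1, 5), (2, 6), (3, 7), (4, 8), (5, 9), (6, 10)]
print(len(windows))  # 6 == 10 - 5 + 1
```

Each of these windows would need a full run of the static algorithm, which is the cost the progressive algorithms try to avoid.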

Prior Work
- The only prior work on progressive databases is GSP+ and MFS+, proposed by Zhang et al. and based on the static algorithms GSP and MFS (also derived by the same authors).
- However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS.
- As a result, the performance improvement of GSP+ and MFS+ over GSP and MFS is only within 15%, as reported by their authors.

Algorithm DirApp
- DirApp stands for Direct Append.
- It consists of two procedures:
  - Progressively Updating, abbreviated as PrUp
  - Immediately Filtering, abbreviated as ImFi

Procedure PrUp
- When progressively reading newly incoming elements, Procedure PrUp can
  - update each sequence in the sequence database,
  - generate candidate sequential patterns,
  - calculate the occurrence frequencies of all candidate sequential patterns in the current POI.

Procedure ImFi
- DirApp uses Procedure ImFi to
  - filter out obsolete data from the existing sequence database,
  - prune obsolete candidate sequential patterns from the candidate set,
  - report the most up-to-date frequent sequential patterns to the user in every POI.

Example
[Figure: the example database with sequence S01 singled out. S01 receives A at t1, B at t2, C at t4, (AD) at t5 and B at t7.]

Candidate set of S01 as its elements A (t1), B (t2), C (t4), (AD) (t5) and B (t7) arrive. The number after each candidate is its starting timestamp, e.g. AB1 is the pattern <A B> starting at t1.

(1) Db1,1: A1
(2) Db1,2: A1 B2 AB1
(3) Db1,3: A1 B2 AB1
(4) Db1,4: A1 B2 AB1 C4 AC1 BC2 ABC1
(5) Db1,5: A5 B2 AB1 C4 AC1 BC2 ABC1 D5 (AD)5 AD1 A(AD)1 BA2 BD2 B(AD)2 ABD1 AB(AD)1 CA4 CD4 C(AD)4 ACD1 AC(AD)1 BCA2 BCD2 BC(AD)2 ABCD1 ABC(AD)1
(6) Db2,6: A5 B2 C4 BC2 D5 (AD)5 BA2 BD2 B(AD)2 CA4 CD4 C(AD)4 BCA2 BCD2 BC(AD)2
(7) Db3,7: A5 C4 D5 (AD)5 CA4 CD4 C(AD)4 B7 AB5 CB4 DB5 (AD)B5 CAB4 CDB4 C(AD)B4
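The candidate sets listed for S01 can be reproduced with a short sketch of the update-and-filter idea for a single sequence. This is a reading of the example, not the authors' implementation: the function names `subsets` and `step`, the dict representation (pattern tuple -> starting timestamp) and the rule of skipping an element that already occurs in the pattern are assumptions inferred from the listed candidate sets.

```python
from itertools import combinations


def subsets(element):
    """All non-empty item combinations of an element: 'AD' -> ['A', 'D', 'AD']."""
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def step(cands, element, t, poi):
    """Advance one sequence's candidate set to timestamp t.

    cands maps a pattern (tuple of elements) to its starting timestamp;
    element is the newly arriving element, or "" if nothing arrives.
    """
    start = t - poi + 1
    # Filtering: drop candidates that start before the current POI.
    kept = {p: s for p, s in cands.items() if s >= start}
    new = dict(kept)
    # Updating: append each candidate element to every surviving pattern
    # that does not already contain it; the start time is inherited.
    for p, s in kept.items():
        for c in subsets(element):
            if c not in p:
                new[p + (c,)] = s
    # Each candidate element also (re)starts a one-element pattern at t.
    for c in subsets(element):
        new[(c,)] = t
    return new


arrivals = {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"}  # S01 from the slides
cands = {}
for t in range(1, 8):
    cands = step(cands, arrivals.get(t, ""), t, poi=5)

print(len(cands))                 # 15 candidates in Db3,7
print(cands[("C", "AD", "B")])    # 4, i.e. C(AD)B4
```

Stopping the loop after t4, t5 or t6 gives candidate sets of 7, 26 and 15 patterns, the same sizes as the lists for Db1,4, Db1,5 and Db2,6.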


[Figure: DirApp on the whole example database. The candidate sets of S01 to S04 at Db1,2 are merged into a table of candidate patterns with their occurrence counts, and this table is updated at every timestamp. The number in parentheses after each sub-database is its number of sequences; the number in parentheses after each pattern is its occurrence count. Frequent sequential patterns reported:
(2) Db1,2 (4): AB1 (3)
(3) Db1,3 (5): AB1 (3)
(4) Db1,4 (5): AB1 (3)
(5) Db1,5 (5): AB1 (3), DA3 (3), BA4 (3)
(6) Db2,6 (5): BA4 (4), CA4 (3)
(7) Db3,7 (5): CA4 (3)
(8) Db4,8 (6): AC6 (4)
(9) Db5,9 (5): AC8 (5)]

The Advantages of DirApp
- DirApp needs only one scan of the newly arriving elements and the candidate set at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
- DirApp can
  - maintain the latest data sequences,
  - find the complete set of up-to-date sequential patterns,
  - delete obsolete data and patterns rapidly.

The Disadvantages of DirApp
- DirApp needs a lot of working space to store the candidate sets of all sequences.
- Scanning all candidate sets costs a great deal of execution time.
- DirApp needs another data structure to calculate the occurrence frequencies of all candidate sequential patterns.


Algorithm Pisa
- Pisa stands for Progressive mIning of Sequential pAtterns.
- Pisa utilizes a Progressive Sequential tree (abbreviated as PS-tree) to
  - maintain the information of all sequences in each POI,
  - update each sequence,
  - find up-to-date sequential patterns.

PS-tree
- The nodes of a PS-tree can be divided into two types:
  - the Root node
  - common nodes
- Each common node stores two pieces of information:
  - Node label: an element in a sequence
  - Sequence list: the IDs of the sequences containing this element, each marked with the corresponding timestamp
[Figure legend: Root; common node with a Label and a list of (Sequence ID, Timestamp) entries]

PS-tree
- Whenever a series of elements appears in the same sequence, there is a series of nodes, one labeled by each element, with the same sequence ID in their sequence lists.
  - The first node, representing the first element, is connected to the Root node.
  - The other nodes are connected to the first node analogously.

PS-tree
[Figure: the example database next to a PS-tree in which the nodes A, B and C are chained below Root, each holding sequence ID 01 with timestamp 1 in its sequence list.]

PS-tree
- The path from the Root node to any other node represents a candidate sequential pattern appearing in this sequence.
- The timestamp at which each candidate sequential pattern appears is marked in the node labeled by its last element.

PS-tree
[Figure: PS-tree of S01 after A (t1), B (t2) and C (t4). Root has the children A (01, 1), B (01, 2) and C (01, 4); below A are B (01, 1) and C (01, 1); below A-B is C (01, 1); below B is C (01, 2).]

Algorithm Pisa
- When receiving elements at timestamp t+1, Pisa traverses the PS-tree of timestamp t in post-order to
  - delete the obsolete elements from it,
  - update the current sequences in it,
  - insert the newly arriving elements into it,
  and thereby transforms it into the PS-tree of timestamp t+1.

For a common node
- Pisa deletes the obsolete sequences from the sequence list of this node.
  - If no sequence ID is left in the sequence list, Pisa prunes this node away from its parent.
- Pisa checks the sequence IDs left in the sequence list to see whether any of these sequences has a newly arriving element.
  - If there is no newly arriving element, Pisa goes to the next node.
  - Otherwise, Pisa generates all combinations of candidate elements from the arriving element, e.g. ABC -> A, B, C, AB, AC, BC, ABC.
- For each candidate element that does not exist on the path from Root to the current node:
  - If there is a child with the same label, Pisa updates the timestamp of this sequence in the child to the timestamp of the same sequence in the parent's sequence list.
  - Otherwise, Pisa creates a new child for this element with the sequence ID and the timestamp of the same sequence in the parent's sequence list.

For the Root node
- Instead of checking a sequence list, Pisa examines all sequences that have newly arriving elements.
- After Pisa generates all combinations of candidate elements, for each of them:
  - If there is a child with the same label, Pisa updates the timestamp of this sequence to t+1.
  - Otherwise, Pisa creates a new child for this element with the sequence ID and timestamp t+1.

Algorithm Pisa
- After Pisa processes a common node, if the number of sequence IDs in its sequence list is larger than min_supp*|Dbp,q|, the path from Root to this node is output as a frequent sequential pattern.
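The traversal described above can be sketched as follows. This is a simplified reading of the slides, not the authors' implementation: the names `Node`, `subsets`, `update` and `counts` are hypothetical, elements are strings of single-letter items, and the arrivals at t1 and t2 are taken from the example as far as they can be read from the transcript. Reporting of frequent patterns is reduced to returning the sequence count of every path.

```python
from itertools import combinations


class Node:
    def __init__(self, label=None):
        self.label = label
        self.seqs = {}      # sequence ID -> starting timestamp
        self.children = {}  # element label -> Node


def subsets(element):
    """All non-empty item combinations: 'BC' -> ['B', 'C', 'BC']."""
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def update(root, arrivals, t, poi):
    """Transform the PS-tree into the tree of timestamp t (post-order)."""
    start = t - poi + 1

    def visit(node, path):
        # Post-order: children first, so children created in this step
        # are not extended again with the same arriving element.
        for label, child in list(node.children.items()):
            visit(child, path + (label,))
            if not child.seqs:
                del node.children[label]  # prune emptied node
        if node is root:
            for sid, element in arrivals.items():
                for c in subsets(element):
                    node.children.setdefault(c, Node(c)).seqs[sid] = t
            return
        # Delete obsolete sequences from this node's sequence list.
        node.seqs = {s: ts for s, ts in node.seqs.items() if ts >= start}
        # Extend the path with the arriving elements of remaining sequences.
        for sid, ts in node.seqs.items():
            for c in subsets(arrivals.get(sid, "")):
                if c not in path:
                    node.children.setdefault(c, Node(c)).seqs[sid] = ts

    visit(root, ())


def counts(node, path=()):
    """Map every root-to-node path to the size of its sequence list."""
    out = {}
    for label, child in node.children.items():
        out[path + (label,)] = len(child.seqs)
        out.update(counts(child, path + (label,)))
    return out


root = Node()
update(root, {"01": "A", "02": "AD", "03": "A"}, t=1, poi=5)
update(root, {"01": "B", "02": "B", "03": "BC", "04": "D"}, t=2, poi=5)
print(counts(root)[("A", "B")])  # 3 sequences support <A B>
```

A path would be reported as frequent when its count reaches the support threshold for the current POI; whether the comparison is strict is left to the caller here, since the slides use both "larger than" and "greater than or equal to".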

[Figure: the example database (sequences S01 to S06, timestamps t1 to t10) used for the PS-tree walkthrough, with the node legend Label / Sequence ID / Timestamp. POI = 5, min_supp = 0.5]

PS-tree walkthrough. The number in parentheses after each sub-database is its number of sequences; the number in parentheses after each pattern is its occurrence count.

[Figure: PS-tree for Db1,1 (3) after the elements of t1 arrive. Root has the children A (sequences 01, 02, 03; timestamp 1), D (02; 1) and AD (02; 1).]
[Figure: PS-tree for Db1,2 (4) after the elements of t2 arrive. Frequent sequential pattern: AB1 (3).]
[Figure: PS-tree for Db1,3 (5) after the elements of t3 arrive. Frequent sequential pattern: AB1 (3).]
[Figure: PS-tree for Db1,4 (5) after the elements of t4 arrive. Frequent sequential pattern: AB1 (3).]
[Figure: PS-tree for Db1,5 (5). Frequent sequential patterns: AB1 (3), BA4 (3), DA3 (3).]
[Figure: PS-tree for Db2,6 (5) after the elements of t6 arrive. Frequent sequential patterns: CA4 (3), BA4 (4).]
[Figure: PS-tree for Db3,7 (5) after the elements of t7 arrive. Frequent sequential pattern: CA4 (3).]
[Figure: PS-tree for Db4,8 (6) after the elements of t8 arrive. Frequent sequential pattern: AC6 (3).]
[Figure: PS-tree for Db5,9 (5) after the elements of t9 arrive. Frequent sequential pattern: AC8 (4).]
[Figure: PS-tree for Db6,10 (5) after the elements of t10 arrive. Frequent sequential pattern: CD8 (4).]

The Advantages of Pisa
- Pisa needs only one scan of the newly arriving elements and the PS-tree at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
- Pisa can
  - maintain the latest data sequences,
  - find the complete set of up-to-date sequential patterns,
  - delete obsolete data and patterns rapidly.

The Advantages of Pisa
- Each path from Root to any other node of the PS-tree forms a unique candidate sequential pattern. Pisa thus merges identical candidate patterns, and patterns do not have to store their prefix elements separately.
  - The PS-tree consumes less space.
  - Dealing with identical sequential patterns together is also efficient in execution time.
- FastPisa, a faster variant, returns approximate results.


Experiments
- Comparative algorithms
  - GSP+: re-mining version of GSP
  - SPAM+: re-mining version of SPAM
  - DirApp
- Environment
  - Pentium 4, 3 GHz CPU and 2 GB RAM
  - Coded in C++

Experiments
- The synthetic datasets are generated in a way similar to the IBM data generator designed for testing sequential pattern mining algorithms.
- We divide the target dataset into n timestamps.
- According to the POI, the first m timestamps (m = POI and m < n) are viewed as the original database, and the remaining transactions in the dataset are received by the system incrementally.
- The first run of the experiments mines the first POI from the first m timestamps of the dataset.
- After that, we shift the POI forward by t (t << m) timestamps for each of the following runs.

Experiments
- The real data sets are from KDDCUP’07.
- We randomly choose 120 successive days for the performance evaluation.
  - A timestamp is set to 3 days in order to obtain sufficient frequent sequential patterns.
  - Therefore, there are 40 timestamps in total and the POI is set to 10.
  - The new datasets contain more than 5000 sequences and 2000 different items.

[Figures: experimental result charts, one per slide]
- Cumulative Execution Time
- Minimum Support
- Length of POI
- Number of Sequences
- Scalability of Pisa
- Real Data Set
- Improvement of FastPisa
- Information Loss of FastPisa


Conclusions
- We proposed a progressive algorithm, Pisa, to handle the progressive sequential pattern mining problem without re-mining all sub-databases at each timestamp.
- Pisa needs only one scan of the newly arriving elements and the PS-tree at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
- Pisa can
  - maintain the latest information of sequences,
  - find the complete set of up-to-date sequential patterns,
  - delete obsolete data and patterns rapidly.
- Pisa also
  - consumes less space,
  - has high efficiency,
  - possesses great scalability.

References
- R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. of EDBT, 1996.
- J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential pattern mining using a bitmap representation,” Proc. of ACM SIGKDD, 2002.
- M. Zhang, B. Kao, D. W.-L. Cheung, and C. L. Yip, “Efficient algorithms for incremental update of frequent sequences,” Proc. of PAKDD, 2002.

Thank You! Q & A