Presenter: Cheng-Han Tsai. Authors: Liang Bai, Jiye Liang, Chuangyin Dang. KBS, 2011



Page 1: An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data

Intelligent Database Systems Lab

National Yunlin University of Science and Technology

Presenter: Cheng-Han Tsai

Authors: Liang Bai, Jiye Liang, Chuangyin Dang

KBS, 2011

Page 2: Outline

N.Y.U.S.T. I. M.

· Motivation

· Objectives

· Methodology

· Experiments

· Conclusions

· Comments

Page 3: Motivation

· The k-modes algorithm is sensitive to the choice of initial cluster centers and requires the number of clusters to be specified in advance.

· There is no guarantee that the number of clusters we select is the best one.

Page 4: Objectives

• To propose an initialization method that finds both the initial cluster centers and the number of clusters.

• The method should handle large categorical data sets efficiently, in linear time.

Page 5: Methodology

Flowchart of the method: Data set → Construct a set S of potential exemplars → Set the estimated number of clusters → K-modes-type algorithm → The clustering result
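The last stage of the flow, a k-modes-type algorithm started from given initial centers, can be sketched as follows. This is a minimal illustration of the standard k-modes iteration (assign each object to the nearest mode, then recompute each mode attribute by attribute), not the authors' implementation. The function names, the toy data, and the choice of initial centers are all assumptions made for the example; the slides' construction of the exemplar set S is not reproduced here.

```python
from collections import Counter


def dissimilarity(x, y):
    """Simple matching: number of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def k_modes(data, init_centers, max_iter=100):
    """Run k-modes from the given initial centers; return (centers, labels)."""
    centers = [tuple(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center
        # (ties are broken by the lowest center index).
        new_labels = [
            min(range(len(centers)), key=lambda j: dissimilarity(x, centers[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes cannot change
        labels = new_labels
        # Update step: each center becomes the attribute-wise mode of its cluster.
        for j in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return centers, labels


# Toy categorical data (hypothetical), with two hand-picked initial centers.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
centers, labels = k_modes(data, init_centers=[data[0], data[3]])
print(centers)  # [('a', 'x', 'p'), ('b', 'z', 'r')]
print(labels)   # [0, 0, 0, 1, 1, 1]
```

Because the result depends on `init_centers`, a different starting pair can lead to a different partition, which is the sensitivity the Motivation slide refers to.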

Page 6: Methodology

· The k-modes algorithm

· Hamming distance: the number of positions at which two codes differ (computed with XOR). Example:

    10001001
XOR 10110001
------------
    00111000 → Hamming distance = 3
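The XOR example can be checked with a short sketch. The second function shows the same idea applied to categorical objects, where the distance is the number of attributes on which two objects disagree (often called simple matching). Both function names and the sample attribute values are illustrative assumptions, not taken from the slides.

```python
def hamming_bits(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings, via XOR."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    # XOR leaves a 1 exactly where the two codes differ; count those 1s.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def simple_matching(x, y) -> int:
    """Number of attributes on which two categorical objects disagree."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(xi != yi for xi, yi in zip(x, y))


print(hamming_bits("10001001", "10110001"))  # 3, matching the example
print(simple_matching(("red", "round", "small"),
                      ("red", "square", "large")))  # 2
```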

Page 7: Methodology

· New cluster centers initialization method

· Finding the number of clusters

Page 8: Methodology

· New cluster centers initialization method

Page 9: Methodology

Page 10: Methodology

Page 11: Methodology

Page 12: Methodology

· Finding the number of clusters

─ We need to input a value k’, which is an estimated number of clusters.

─ If k’ cannot be determined, we set k’ = |S|.
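The rule for choosing k’ amounts to a default value. A minimal sketch, assuming S is available as a Python list of exemplars and using a hypothetical helper name `resolve_k_prime`:

```python
def resolve_k_prime(S, k_prime=None):
    """Return the estimated number of clusters k', defaulting to |S|."""
    if k_prime is None:
        # No estimate supplied: fall back to the size of the exemplar set.
        return len(S)
    if k_prime < 1:
        raise ValueError("k' must be a positive integer")
    return k_prime


S = ["e1", "e2", "e3", "e4"]   # placeholder exemplars, for illustration only
print(resolve_k_prime(S))      # 4
print(resolve_k_prime(S, 2))   # 2
```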

Page 13: Methodology

Page 14: Methodology

Page 15: Methodology

· More than one knee point of the function P(k)

· More than one peak of the function C(k)
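When a curve over k has several peaks, each peak is a candidate for the number of clusters. The transcript does not give the definition of C(k), so the sketch below only shows a generic way to list local maxima of a sequence indexed by k. The helper name and the values in `c` are made up for illustration and are not results from the paper.

```python
def local_peaks(values_by_k):
    """Return every k whose value is strictly greater than both neighbours."""
    ks = sorted(values_by_k)
    peaks = []
    # Slide a window of three consecutive k values over the curve.
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        if values_by_k[k] > values_by_k[prev_k] and values_by_k[k] > values_by_k[next_k]:
            peaks.append(k)
    return peaks


# Hypothetical curve values, not taken from the paper.
c = {2: 0.10, 3: 0.35, 4: 0.20, 5: 0.15, 6: 0.30, 7: 0.12}
print(local_peaks(c))  # [3, 6]
```

With two peaks, both k = 3 and k = 6 would be reported as candidates, and choosing between them needs further judgment or domain knowledge.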

Page 16: Experiments

· Performance analysis

─ Soybean data (4 diseases)

─ Lung cancer data (3 classes)

─ Zoo data (7 classes: 3 big clusters and 4 small clusters)

─ Mushroom data (2 classes)

· Scalability analysis

Page 17: Experiments

· Performance analysis

Page 18: Experiments

Page 19: Experiments

· Scalability analysis

─ 67557 data points and 42 categorical attributes

Page 20: Conclusions

· The proposed method is effective and efficient at obtaining good initial cluster centers and the number of clusters.

· Its time complexity has been analyzed and is linear.

Page 21: Comments

· Advantages

─ Improves on the older approach of setting the two parameters (the initial cluster centers and the number of clusters)

· Applications

─ Data clustering