Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011
Intelligent Database Systems Lab
National Yunlin University of Science and Technology
An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data
Presenter : Cheng-Han Tsai Authors : Liang Bai, Jiye Liang, Chuangyin Dang
KBS, 2011
N.Y.U.S.T.
I. M.
Outline
· Motivation
· Objectives
· Methodology
· Experiments
· Conclusions
· Comments
Motivation
· The k-modes algorithm is sensitive to the initial cluster centers and requires the number of clusters to be given in advance.
· We can’t guarantee that the number of clusters we select is the best one.
Objectives
· To propose an initialization method that simultaneously finds the initial cluster centers and the number of clusters.
· The method should deal efficiently with large categorical data sets, running in linear time.
Methodology
[Flowchart: Data set → Construct a potential exemplars set S → Set the estimated number of clusters → K-modes-type algorithm → The clustering result]
Methodology
· The k-modes algorithm
· Hamming distance: the number of positions at which two codes differ (computed with XOR)
  ex:     10001001
      XOR 10110001
      ------------
          00111000 → Hamming distance = 3
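The XOR example above, and the way k-modes uses the same idea on categorical records, can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names (`hamming_bits`, `matching_distance`, `k_modes`) are hypothetical, and the loop is the plain k-modes procedure started from centers supplied by the caller, which is exactly the input the proposed initialization method is meant to provide.

```python
from collections import Counter


def hamming_bits(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings, via XOR."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def matching_distance(x, y) -> int:
    """Number of attributes on which two categorical records differ."""
    return sum(u != v for u, v in zip(x, y))


def k_modes(records, centers, max_iter=100):
    """Plain k-modes from caller-supplied initial centers.

    Returns (centers, labels). The result depends on the initial
    centers and on how many of them are given.
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center under the matching distance.
        new_labels = [
            min(range(len(centers)),
                key=lambda j: matching_distance(r, centers[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each center becomes the attribute-wise mode.
        for j in range(len(centers)):
            members = [r for r, l in zip(records, labels) if l == j]
            if members:
                centers[j] = tuple(
                    Counter(col).most_common(1)[0][0]
                    for col in zip(*members)
                )
    return centers, labels
```

For example, `hamming_bits("10001001", "10110001")` reproduces the distance of 3 from the slide.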
Methodology
· New cluster centers initialization method
· Finding the number of clusters
Methodology
· New cluster centers initialization method.
Methodology
· Finding the number of clusters
  ─ We need to input a value k’, which is an estimated number of clusters
  ─ If k’ can’t be determined, we set k’ = |S|
Methodology
· More than 1 knee point of the function P(k)
· More than 1 peak of the function C(k)
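To make the "more than one knee point / more than one peak" situation concrete, here is a small Python sketch that lists every candidate instead of returning a single k. The transcript does not preserve the definitions of P(k) and C(k), so the sketch simply treats each function as a list of numbers sampled at consecutive k values. The helper names, the `k_start` and `top` parameters, and the second-difference knee heuristic are all assumptions made for illustration, not the authors' procedure.

```python
def local_peaks(values, k_start=1):
    """Return every k at which the sequence has a strict local maximum.

    values[i] is taken to be the function value at k = k_start + i.
    """
    return [
        k_start + i
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


def knee_candidates(values, k_start=1, top=2):
    """Return up to `top` k values with the largest positive second difference.

    Assumed heuristic: a knee of a decreasing curve is where a steep
    drop is followed by a flat stretch, i.e. where
    values[i-1] - 2*values[i] + values[i+1] is large.
    """
    curvature = [
        (values[i - 1] - 2 * values[i] + values[i + 1], k_start + i)
        for i in range(1, len(values) - 1)
    ]
    curvature.sort(key=lambda t: t[0], reverse=True)
    return [k for c, k in curvature[:top] if c > 0]
```

When either helper returns more than one k, each candidate would need to be examined, for instance by running the clustering once per candidate.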
Experiments
· Performance analysis
  ─ Soybean data (4 diseases)
  ─ Lung cancer data (3 classes)
  ─ Zoo data (7 classes, consisting of 3 big clusters and 4 small clusters)
  ─ Mushroom data (2 classes)
· Scalability analysis
Experiments
· Performance analysis
Experiments
· Scalability analysis
  ─ 67557 data points and 42 categorical attributes
Conclusions
· The proposed method is effective and efficient for obtaining good initial cluster centers and the number of clusters
· The time complexity analysis shows that the method runs in linear time
Comments
· Advantages
  ─ Improves on the old way of setting the two parameters (the initial cluster centers and the number of clusters)
· Applications
  ─ Data clustering