
CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

Paper presentation in data mining class

Presenters: 許明壽, 蘇建仲
Date: 2001/12/18


About this paper

• Authors: George Karypis, Eui-Hong (Sam) Han, Vipin Kumar
• Department of Computer Science and Engineering, University of Minnesota
• IEEE Computer, August 1999


Outline

• Problem definition
• Main algorithm
• Key features of CHAMELEON
• Experiments and related work
• Conclusion and discussion


Problem definition

• Clustering
  • Intracluster similarity is maximized
  • Intercluster similarity is minimized
• Problems of existing clustering algorithms
  • Static model constraints
  • Break down when clusters are of diverse shapes, densities, and sizes
  • Susceptible to noise, outliers, and artifacts


Static model constraints

• Data space constraint
  • K-means, PAM, etc.
  • Suitable only for data in metric spaces
• Cluster shape constraint
  • K-means, PAM, CLARANS
  • Assume clusters are ellipsoidal or globular and of similar sizes
• Cluster density constraint
  • DBSCAN
  • Points within a genuine cluster are density-reachable, and points across different clusters are not
• Similarity determination constraint
  • CURE, ROCK
  • Use a static model to determine the most similar clusters to merge


Partition techniques problem

(a) Clusters of widely different sizes
(b) Clusters with convex shapes


Hierarchical technique problem (1/2)

• The pair {(c), (d)} will be chosen for merging when only closeness is considered


Hierarchical technique problem (2/2)

• The pair {(a), (c)} will be chosen for merging when only inter-connectivity is considered


Main algorithm

• Two-phase algorithm
  • Phase I
    • Use a graph partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters.
  • Phase II
    • Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.


Framework

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters


Key features of CHAMELEON

• Modeling the data
• Modeling the cluster similarity
• Partition algorithms
• Merge schemes


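Terms

• Parameters needed
  • K: number of neighbors used to build the k-nearest neighbor graph
  • MINSIZE: size threshold for the initial clusters (a cluster smaller than MINSIZE is not split further)
  • T_RI: threshold on relative inter-connectivity
  • T_RC: threshold on relative closeness
  • α: parameter controlling the relative weight of RI and RC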


Modeling the data

• K-nearest neighbor graph approach
• Advantages
  • Data points that are far apart are completely disconnected in G_k
  • G_k captures the concept of neighborhood dynamically
  • Edge weights in dense regions of G_k tend to be large, and edge weights in sparse regions tend to be small
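
A minimal sketch of how a k-nearest neighbor graph could be built, to make the first and third advantages concrete. The function name `knn_graph`, the use of Euclidean distance, and the weight 1 / (1 + distance) are assumptions for illustration; the slides do not fix a similarity function.

```python
import math

def knn_graph(points, k):
    """Return {(i, j): weight} with i < j for a symmetric k-nearest neighbor graph.

    Assumption: weight = 1 / (1 + Euclidean distance), so closer points
    get larger weights.
    """
    edges = {}
    for i, p in enumerate(points):
        # Sort all other points by distance to p and keep the k closest.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            edges[(min(i, j), max(i, j))] = 1.0 / (1.0 + d)
    return edges

# Two tight groups far apart: with k = 2 no edge crosses between them.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
graph = knn_graph(points, k=2)
```

In this toy data set every edge stays inside one of the two groups, which is the "far apart points are disconnected" property listed above.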


Example of k-nearest neighbor graph


Modeling the clustering similarity (1/2)

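• Relative interconnectivity

\[
RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\frac{|EC_{C_i}| + |EC_{C_j}|}{2}}
\]

• Relative closeness

\[
RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}
\]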
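
A small worked example of the two measures, assuming the edge-cut quantities have already been computed. As in the CHAMELEON paper, RI divides the total weight of the edges connecting C_i and C_j by the average internal edge cut (min-cut bisector weight) of the two clusters, and RC divides the average weight of the connecting edges by the size-weighted average of the internal bisector edge weights. The function names and all numbers below are hypothetical.

```python
def relative_interconnectivity(ec_between, ec_i, ec_j):
    """RI = |EC_{Ci,Cj}| / ((|EC_Ci| + |EC_Cj|) / 2)."""
    return ec_between / ((ec_i + ec_j) / 2.0)

def relative_closeness(s_between, s_i, s_j, size_i, size_j):
    """RC = S_between / (size-weighted average of S_i and S_j)."""
    total = size_i + size_j
    internal = (size_i / total) * s_i + (size_j / total) * s_j
    return s_between / internal

# Hypothetical numbers for illustration only.
ri = relative_interconnectivity(ec_between=6.0, ec_i=10.0, ec_j=14.0)
rc = relative_closeness(s_between=0.3, s_i=0.5, s_j=0.7, size_i=30, size_j=10)
```

Here RI = 6 / 12 = 0.5, and RC = 0.3 / (0.75 · 0.5 + 0.25 · 0.7) = 0.3 / 0.55.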


Modeling the clustering similarity (2/2)

• If the relative measures are considered, {(c), (d)} will be merged


Partition algorithm (Phase I)

• What
  • Finding the initial sub-clusters
• Why
  • RI and RC can’t be accurately calculated for clusters containing only a few data points
• How
  • Use a multilevel graph partitioning algorithm (hMETIS)
    • Coarsening phase
    • Partitioning phase
    • Uncoarsening phase


Partition algorithm (cont.)

• Initially, all points belong to the same cluster
• Repeat until (size of all clusters < MINSIZE)
  • Select the largest cluster and use hMETIS to bisect it
• Note
  • Balance constraint
    • Split C_i into C_i^A and C_i^B such that each sub-cluster contains at least 25% of the nodes of C_i
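
The loop above can be sketched as follows. This is only an illustration of the control flow: `bisect` is a hypothetical stand-in that splits at the median, not hMETIS, which instead looks for a bisection that minimizes the edge cut of the k-nearest neighbor graph.

```python
def bisect(cluster):
    """Stand-in for hMETIS: sort the points and split them in half.

    A median split puts about half of the points on each side, so the 25%
    balance constraint holds trivially.
    """
    ordered = sorted(cluster)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def initial_subclusters(points, minsize):
    if minsize < 2:
        raise ValueError("minsize must be at least 2")
    clusters = [list(points)]          # initially one cluster with all points
    while True:
        largest = max(clusters, key=len)
        if len(largest) < minsize:     # every cluster is below MINSIZE
            return clusters
        clusters.remove(largest)
        clusters.extend(bisect(largest))

points = [(float(x), float(x % 7)) for x in range(100)]
subclusters = initial_subclusters(points, minsize=10)
```

Because the largest cluster is always split first, the loop ends as soon as the largest remaining cluster is smaller than MINSIZE.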


Merge schemes (Phase II)

• What
  • Merging sub-clusters using a dynamic framework
• How
  • Finding and merging the pair of sub-clusters that are the most similar
  • Scheme 1

\[
RI(C_i, C_j) \ge T_{RI} \quad \text{and} \quad RC(C_i, C_j) \ge T_{RC}
\]

  • Scheme 2

\[
RI(C_i, C_j) \cdot RC(C_i, C_j)^{\alpha}
\]
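
A hedged sketch of the two merge criteria, assuming RI and RC have already been computed for each candidate pair. The first function keeps pairs whose RI and RC both reach the thresholds T_RI and T_RC; the second picks the pair maximizing RI · RC^α, the combined form used in the CHAMELEON paper. The function names and the (RI, RC) values are hypothetical.

```python
def scheme1_candidates(pairs, t_ri, t_rc):
    """Keep pairs whose RI and RC both reach the thresholds."""
    return [name for name, (ri, rc) in pairs.items() if ri >= t_ri and rc >= t_rc]

def scheme2_best(pairs, alpha):
    """Pick the pair maximizing RI * RC**alpha."""
    return max(pairs, key=lambda name: pairs[name][0] * pairs[name][1] ** alpha)

# Hypothetical (RI, RC) values for three candidate pairs.
pairs = {("A", "B"): (0.9, 0.4), ("A", "C"): (0.5, 0.8), ("B", "C"): (0.2, 0.9)}
passing = scheme1_candidates(pairs, t_ri=0.4, t_rc=0.5)
best_low_alpha = scheme2_best(pairs, alpha=0.5)
best_high_alpha = scheme2_best(pairs, alpha=2.0)
```

With α below 1 the highly interconnected pair (A, B) wins; with α above 1 closeness counts for more and (A, C) is selected instead.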


Experiments and related work

• Introduction of CURE
• Introduction of DBSCAN
• Results of experiment
• Performance analysis


Introduction of CURE (1/n)

• Clustering Using Representatives

1. Properties:
  • Suited to non-spherical shapes.
  • Shrinking helps dampen the effects of outliers.
  • Multiple representative points are chosen for non-spherical clusters.
  • In each iteration, the merge procedure chooses some scattered points as representatives and shrinks them by the shrink ratio.
  • Random sampling of the data set makes it suitable for large databases.
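
A minimal sketch of the representative-point idea, assuming a farthest-point heuristic for picking scattered points and shrinking toward the cluster centroid. The function name `representatives` and the sample data are hypothetical, and the heuristic is one simple way to choose well-scattered points rather than a full CURE implementation.

```python
import math

def representatives(cluster, c, s):
    """Pick up to c well-scattered points, then shrink them toward the centroid by s."""
    dim = len(cluster[0])
    centroid = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
    # Farthest-point heuristic: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from those already chosen.
    chosen = [max(cluster, key=lambda p: math.dist(p, centroid))]
    while len(chosen) < min(c, len(cluster)):
        chosen.append(max(
            (p for p in cluster if p not in chosen),
            key=lambda p: min(math.dist(p, r) for r in chosen),
        ))
    # Shrinking: move each representative a fraction s of the way to the centroid.
    return [tuple(x + s * (m - x) for x, m in zip(p, centroid)) for p in chosen]

cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
reps = representatives(cluster, c=4, s=0.5)
```

For this square-shaped cluster the four corners are chosen and each moves halfway toward the centroid (2, 2), which is how shrinking reduces the influence of points on the cluster boundary.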


Introduction of CURE (2/n)

2. Drawbacks:
  • The partitioning method cannot guarantee that the data points chosen are good.
  • Clustering accuracy depends on the parameters below:
    (1) Shrink factor s: CURE always finds the right clusters for s values in the range 0.2 to 0.7.
    (2) Number of representative points c: CURE always found the right clusters for values of c greater than 10.
    (3) Number of partitions p: with as many as 50 partitions, CURE always discovered the desired clusters.
    (4) Random sample size r:
      (a) for sample sizes up to 2000, the clusters found were of poor quality
      (b) from 2500 sample points and above, about 2.5% of the data set size, CURE always correctly found the clusters.


3. Clustering algorithm : Representative points


• Merge procedure


Introduction of DBSCAN (1/n)

• Density-Based Spatial Clustering of Applications with Noise

1. Properties:
  • Can discover clusters of arbitrary shape.
  • Each cluster has a typical density of points that is higher than outside the cluster.
  • The density within areas of noise is lower than the density in any of the clusters.
  • Takes Eps and MinPts as input parameters
  • Easy to implement in C++ using an R*-tree
  • Run time is slightly higher than linear in the number of points; time complexity is O(n log n) with a spatial index


Introduction of DBSCAN (2/n)

2. Drawbacks:
  • Cannot be applied to polygons.
  • Cannot be applied to high-dimensional feature spaces.
  • Cannot process the shape of the k-dist graph with multiple features.
  • Not suited to large databases because no method is applied to reduce the spatial database.

3. Definitions
  • Eps-neighborhood of a point p
    N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
  • Each cluster contains at least MinPts points


Introduction of DBSCAN (3/n)

4. p is directly density-reachable from q if
  (1) p ∈ N_Eps(q), and
  (2) |N_Eps(q)| ≥ MinPts (core point condition)
  • Directly density-reachable is symmetric when p and q are both core points; it is asymmetric when one is a core point and the other is a border point.

5. p is density-reachable from q if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that each p_(i+1) is directly density-reachable from p_i
  • Density-reachable is transitive, but not symmetric
  • Density-reachable is symmetric for core points.


Introduction of DBSCAN (4/n)

6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s.
  • Density-connected is a symmetric and reflexive relation
  • A cluster is defined to be a set of density-connected points that is maximal with respect to density-reachability.
  • Noise is the set of points not belonging to any cluster.

7. How to find a cluster C?
  • Maximality
    ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C
  • Connectivity
    ∀ p, q ∈ C: p is density-connected to q

8. How to find noise?
  • ∀ p: if p does not belong to any cluster, then p is a noise point
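
The definitions above can be turned into a compact, brute-force sketch. It assumes Euclidean distance and scans all points for each neighborhood query (O(n^2) overall) instead of using an R*-tree; the function name `dbscan` and the sample data are illustrative.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Eps-neighborhood, including the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # tentatively noise; may become a border point
            continue
        cluster_id += 1                 # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also core: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two squares come out as clusters 0 and 1, and the isolated point (20, 20) is labeled -1, matching the noise definition in item 8.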


Results of experiment


Performance analysis (1/2)

• Time to construct the k-nearest neighbor graph
  • Low-dimensional data sets: based on k-d trees, overall complexity of O(n log n)
  • High-dimensional data sets: k-d trees are not applicable, overall complexity of O(n^2)
• Finding initial sub-clusters
  • Obtains m clusters by repeatedly partitioning successively smaller graphs; overall computational complexity is O(n log (n/m))
  • Is bounded by O(n log n)
  • A faster partitioning algorithm obtains the initial m clusters in time O(n + m log m) using a multilevel m-way partitioning algorithm


Performance analysis (2/2)

• Merging sub-clusters using a dynamic framework
  • The time to compute the internal inter-connectivity and internal closeness for each initial cluster is O(nm)
  • The time to find the most similar pair of clusters to merge is O(m^2 log m) using a heap-based priority queue
  • So the overall complexity of CHAMELEON is O(n log n + nm + m^2 log m)
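
To see which term of the overall bound dominates for a given setting, the three terms can be evaluated directly. This is rough arithmetic only: constants are ignored, base-2 logarithms are an assumption, and the values n = 10,000 and m = 100 are hypothetical.

```python
import math

def chameleon_cost_terms(n, m):
    """Evaluate the three terms of O(n log n + nm + m^2 log m), ignoring constants."""
    return {
        "n log n": n * math.log2(n),
        "n m": n * m,
        "m^2 log m": m * m * math.log2(m),
    }

terms = chameleon_cost_terms(n=10_000, m=100)
dominant = max(terms, key=terms.get)
```

For these values the nm term (1,000,000) is the largest, so the cost of computing the internal measures for the initial clusters outweighs graph construction and the merge queue.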


Conclusion and discussion

• Dynamic model with relative interconnectivity and closeness
• This paper ignores the issue of scaling to large data sets
• Other graph representation methodologies?
• Other partitioning algorithms?