
Similarity/Clustering

Artificial Intelligence Laboratory, Moon Hong-gu (문홍구)

2006. 1. 17

2

Content

What is Clustering

Clustering Methods

Distance-based
- Hierarchical
- Flat

Geometric embedding approaches
- Self-organizing maps
- Multidimensional scaling
- Latent semantic indexing

3

Formulations and Approaches

Partitioning Approaches

One possible goal that we can set up for a clustering algorithm is to partition the document collection into k subsets or clusters D1, ···, Dk so as to minimize the intracluster distance or, equivalently, maximize the intracluster resemblance.

Bottom-up clustering

Top-down clustering

4

Formulations and Approaches

5

Distance based

Hierarchical clustering

- The tree of a hierarchical clustering can be produced either bottom-up or top-down.

Bottom-up (agglomerative clustering)
– start with the individual objects, grouping the most similar ones
– repeatedly join the pair of clusters with maximum similarity

Top-down (divisive clustering)
– start with all objects and divide them into groups so as to maximize within-group similarity
– repeatedly split the least coherent cluster

6

Three methods in hierarchical clustering

Single-link: similarity of the two most similar members

Complete-link: similarity of the two least similar members

Group-average: average similarity between members
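The three linkage criteria can be sketched in a few lines of Python. This is a minimal illustration only — the function name `agglomerative` and the toy points are my own, and a real implementation would cache inter-cluster distances instead of recomputing them at every merge:

```python
import math

def agglomerative(points, k, linkage="single"):
    """Bottom-up clustering: start with singleton clusters, then repeatedly
    merge the pair of clusters with the smallest linkage distance until
    only k clusters remain. Returns clusters as lists of point indices."""
    def cluster_dist(c1, c2):
        ds = [math.dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":    # distance of the two closest members
            return min(ds)
        if linkage == "complete":  # distance of the two farthest members
            return max(ds)
        return sum(ds) / len(ds)   # group average

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

On four toy points forming two tight pairs, all three linkages recover the two pairs; the criteria only start to disagree on data where chains of close points connect otherwise distant groups.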

7

Single link Clustering

Similarity of the two most similar members => O(n²)

Locally coherent: close objects end up in the same cluster

Chaining effect: because it follows a chain of large similarities without taking the global context into account, single-link can string distant objects together => low global cluster quality

8

Complete link Clustering

Similarity of the two least similar members => O(n³)

The criterion focuses on global cluster quality: it avoids elongated clusters — a/f or b/e is tighter than a/d (tight clusters are better than 'straggly' ones)

9

Group average agglomerative clustering

Averages the similarity between members; the complexity of computing the average similarity is O(n²)

Average similarities are recomputed each time a new group is formed

A compromise between single-link and complete-link

10

Comparison

Single-link: relatively efficient; long straggly clusters rather than ellipsoidal ones; loosely bound clusters

Complete-link: tightly bound clusters

Group-average: intermediate between single-link and complete-link

11

Distance based

Flat clustering

- k-means

- Unlike hierarchical clustering, the k-means clustering method is a mutually exclusive (hard) clustering method in which each object belongs to exactly one cluster.

- The number of clusters is fixed in advance, and the method then determines which cluster each object belongs to; this makes it a useful method for clustering large amounts of data.
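The procedure just described — fix k up front, then alternate between assigning each object to its nearest centroid and recomputing centroids — can be sketched as follows (a minimal illustration; the function name, seed, and toy data are my own choices):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Flat, mutually exclusive clustering: each point belongs to exactly
    one cluster. Alternates nearest-centroid assignment with centroid
    recomputation for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[j].append(p)
        # update step: each centroid becomes the mean of its group
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids, groups   # groups reflect the last assignment step
```

On two well-separated blobs of three points each, any initialization converges within a few iterations to the two obvious groups and their exact means.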

12

Distance based

k-means

13

Geometric Embedding Approaches

Self - organizing maps

Multidimensional scaling

Latent semantic indexing

★ A different form of partition-based clustering is to identify dense regions in space.

14

Geometric Embedding Approaches

Self - organizing maps(SOMs)

- Self-organizing maps are a close cousin of k-means, except that, unlike k-means, which is concerned only with determining the association between clusters and documents, the SOM algorithm also embeds the clusters in a low-dimensional space right from the beginning and proceeds in a way that places related clusters close together in that space.
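A minimal 1-D SOM sketch: units live on a line, and each input pulls not only its best-matching unit but also that unit's grid neighbours toward it, which is what places related clusters adjacently in the low-dimensional grid. The learning-rate and neighbourhood schedules below are illustrative choices, not from the slides:

```python
import math
import random

def train_som(data, n_units, epochs=50, seed=0):
    """Train a 1-D self-organizing map: a line of n_units weight vectors.
    Each input updates the best-matching unit (BMU) and, more weakly, its
    grid neighbours, with learning rate and radius decaying over time."""
    rng = random.Random(seed)
    units = [list(rng.choice(data)) for _ in range(n_units)]
    for t in range(epochs):
        lr = 0.5 * (1 - t / epochs)                         # decaying learning rate
        radius = max(1.0, (n_units / 2) * (1 - t / epochs))  # shrinking neighbourhood
        for x in data:
            bmu = min(range(n_units), key=lambda u: math.dist(units[u], x))
            for u in range(n_units):
                # Gaussian kernel over GRID distance, not data distance
                h = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                units[u] = [w + lr * h * (xi - w) for w, xi in zip(units[u], x)]
    return units
```

Because every update moves a weight part-way toward a data point, the learned units always stay inside the data's bounding box.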

15

SOM : Example

SOM computed from over a million documents taken from 80 Usenet newsgroups. Light

areas have a high density of documents.

16

Geometric Embedding Approaches

Multidimensional scaling (MDS)

- The goal of MDS is to represent documents as points in a low-dimensional space (often 2D or 3D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
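Classical (metric) MDS has a closed-form solution: double-center the squared input distances and take the top eigenvectors of the result. A NumPy sketch (the function name is illustrative):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS: given an n x n matrix D of pairwise distances,
    recover dim-dimensional coordinates whose Euclidean distances
    approximate D as closely as possible."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]         # keep the dim largest
    L = np.sqrt(np.maximum(vals[idx], 0.0))    # clamp tiny negatives from noise
    return vecs[:, idx] * L                    # n x dim coordinates
```

When the input distances really are Euclidean distances of points in dim dimensions, this recovers them exactly (up to rotation and reflection); for non-Euclidean inputs it gives the best low-rank approximation.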

17

Geometric Embedding Approaches

Latent semantic indexing (LSI)

- The latent semantic indexing (LSI) method is an attempt to solve the synonymy problem while staying within the vector-space model framework.

18

Latent semantic indexing (LSI)

- Truncated SVD of the term-document matrix A (terms × documents):

  A ≈ U_k · D_k · V_k^T   (k-dim vectors)

  The rows of U_k embed the terms and the rows of V_k embed the documents in the same k-dimensional latent space, so that synonymous terms such as "car" and "auto", which co-occur with similar documents, end up close together.
19

EM algorithm

A soft version of k-means clustering

① both clusters move towards the centroid of all three objects
② they reach the stable final state

20

EM algorithm(2)

We want to calculate the probability P(c_j | x_i)

Assume that cluster j has a normal distribution:

  n(x; μ_j, Σ_j) = 1 / ((2π)^(m/2) |Σ_j|^(1/2)) · exp( −(1/2) (x − μ_j)^T Σ_j^(−1) (x − μ_j) )

Maximum likelihood of the mixture form:

  P(x_i) = Σ_{j=1..k} π_j · n(x_i; μ_j, Σ_j)

21

Procedure of EM

Expectation Step (E): compute h_ij, the expectation of z_ij (the probability that object i belongs to cluster j):

  h_ij = E(z_ij | x_i; Θ) = π_j · P(x_i | n_j; Θ) / Σ_{l=1..k} π_l · P(x_i | n_l; Θ)

Maximization Step (M): re-estimate the parameters from the soft assignments:

  μ_j = Σ_{i=1..n} h_ij · x_i / Σ_{i=1..n} h_ij

  Σ_j = Σ_{i=1..n} h_ij (x_i − μ_j)(x_i − μ_j)^T / Σ_{i=1..n} h_ij

  π_j = (1/n) · Σ_{i=1..n} h_ij
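The E and M steps above can be sketched for the one-dimensional case, where each cluster is a scalar Gaussian (the initialization, toy data, and function name are illustrative choices):

```python
import math

def em_gmm_1d(xs, k=2, iters=50):
    """EM for a 1-D Gaussian mixture -- a 'soft k-means'. The E-step
    computes responsibilities h[i][j] = P(cluster j | x_i); the M-step
    re-fits each cluster's mean, std-dev and weight from them."""
    n = len(xs)
    mus = [min(xs), max(xs)] if k == 2 else list(xs[:k])  # crude init
    sigmas = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities h_ij = pi_j * n(x_i; mu_j, sigma_j) / sum_l ...
        h = []
        for x in xs:
            p = [pis[j]
                 * math.exp(-(x - mus[j]) ** 2 / (2 * sigmas[j] ** 2))
                 / (sigmas[j] * math.sqrt(2 * math.pi))
                 for j in range(k)]
            s = sum(p)
            h.append([pj / s for pj in p])
        # M-step: weighted mean, std-dev and mixing weight per cluster
        for j in range(k):
            nj = sum(h[i][j] for i in range(n))
            mus[j] = sum(h[i][j] * xs[i] for i in range(n)) / nj
            sigmas[j] = math.sqrt(
                sum(h[i][j] * (xs[i] - mus[j]) ** 2 for i in range(n)) / nj
            ) or 1e-6   # guard against a collapsed cluster
            pis[j] = nj / n
    return mus, sigmas, pis
```

Unlike k-means, every point contributes to every cluster's parameters, weighted by its responsibility; hard k-means is the limit where each h_ij is forced to 0 or 1.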