DP-IV presentation - ashutosh

Performance analysis of C-means Clustering on Big Data using Hadoop

Fuzzy C-means

Guided By: Prof. A. J. Umbarkar

Presented By: A. S. Sathe

BROAD AREA: DISTRIBUTED COMPUTING, DATA MINING

SUB AREA: CLUSTERING ALGORITHMS, DATA CLUSTERING


Presentation Agenda
• Literature Survey
• Problem Statement
• Objectives achieved
• Results
• Future Scope
• References


Data Growth Rate[7]


Relevance
• Data Clustering – Classification of a data set into groups of similar items based on some criteria
• Big Data – Data too large to process with traditional database and software techniques
• Hadoop – A distributed computing framework based on the MapReduce architecture
• Document Clustering
  • Text-based data stored in files or in an unstructured format
  • Based on text properties such as word frequency, provided keywords, etc.
  • Text properties serve as the similarity criteria
  • Documents are differentiated based on the similarity criteria
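The word-frequency similarity criterion described above can be illustrated with a minimal Python sketch. The helper names `term_frequencies` and `cosine_similarity`, the whitespace tokenizer, the choice of cosine similarity, and the sample sentences are all assumptions made for illustration; the slides do not state how the text properties were actually extracted or compared.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count lowercase, whitespace-separated words in a document."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two word-frequency vectors (0.0 to 1.0)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[w] * tf_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, not taken from the presentation's dataset.
doc_a = term_frequencies("hadoop runs mapreduce jobs on big data")
doc_b = term_frequencies("mapreduce jobs process big data on hadoop")
doc_c = term_frequencies("fuzzy membership values sum to one")

sim_ab = cosine_similarity(doc_a, doc_b)  # six shared words out of seven
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared words
```

Documents that share most of their vocabulary score close to 1, while documents with no words in common score 0, which is the kind of signal a clustering algorithm can group on.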


Relevance
• Need of data clustering
  • Data Mining is used for Knowledge Discovery from Data (KDD)
  • It is based on historical data
  • Historical data may be Big Data
  • Big Data processing is a very tedious task
  • Data clustering is a preprocessing step for Big Data processing
  • Processed data will be used for data mining
  • Clustered data gives better results than randomly placed data


Relevance
• Why text clustering
  • Text is a type of unstructured data
  • Free from any database constraints
  • Files can be very large without any restrictions
  • In real-world scenarios, text clustering is used to
    • Retrieve, filter, and categorize documents
    • Support information retrieval
  • Clustered data is useful for Knowledge Data Retrieval


Relevance
• Why Hadoop
  • Distributed framework
  • Can use processor capacity on the fly
  • Made for Big Data processing


Problem Statement

• “Performance Analysis of C-means Clustering on Big Data using Hadoop.”


Objectives achieved
• Design of a processing model of the Fuzzy C-Means algorithm for Map-Reduce
• Implementation of the C-means algorithm on Map-Reduce
• Testing and performance analysis of the above algorithm with Big Data on Map-Reduce
• Comparison of C-means with other equivalent works
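The slides do not show the processing model itself, so the following is only one plausible way to split a Fuzzy C-Means iteration into map and reduce steps, simulated in plain Python rather than on Hadoop. It assumes 1-D data and the standard centroid update (sum of u^m · x divided by sum of u^m); the function names `memberships`, `map_phase`, `reduce_phase`, and `fcm_iteration` and the sample data are hypothetical.

```python
from collections import defaultdict


def memberships(x, centroids, m=2.0):
    """Fuzzy C-Means membership of point x in every cluster."""
    dists = [abs(x - c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]


def map_phase(split, centroids, m=2.0):
    """Mapper: emit (cluster index, (u^m * x, u^m)) for each point in a split."""
    for x in split:
        for j, u in enumerate(memberships(x, centroids, m)):
            weight = u ** m
            yield j, (weight * x, weight)


def reduce_phase(pairs, n_clusters):
    """Reducer: sum the partial terms per cluster and divide to get centroids."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for j, (weighted_x, weight) in pairs:
        totals[j][0] += weighted_x
        totals[j][1] += weight
    return [totals[j][0] / totals[j][1] for j in range(n_clusters)]


def fcm_iteration(splits, centroids, m=2.0):
    """One iteration: map every split independently, then reduce all pairs."""
    pairs = [pair for split in splits for pair in map_phase(split, centroids, m)]
    return reduce_phase(pairs, len(centroids))


# Hypothetical 1-D data divided into two input splits.
splits = [[2.0, 3.0, 4.0], [10.0, 11.0, 12.0]]
new_centroids = fcm_iteration(splits, [3.0, 11.0])
```

Because each mapper only needs its own split and the current centroids, the per-cluster sums can be accumulated in any order, so the result does not depend on how the data is divided into splits.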


Fuzzy C-means Clustering



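• For example: we have initial centroids 3 & 11 (with m = 2)

• For node 2 (1st element):

U11 = The membership of first node to first cluster

$$U_{11} = \frac{1}{\left(\frac{|2-3|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-3|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + \frac{1}{81}} = \frac{81}{82} = 98.78\%$$

U12 = The membership of first node to second cluster

$$U_{12} = \frac{1}{\left(\frac{|2-11|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-11|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{81 + 1} = \frac{1}{82} = 1.22\%$$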
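The membership values for the example with centroids 3 and 11 and m = 2 can be checked with a few lines of Python. This is a minimal sketch for 1-D points that assumes the point does not coincide with a centroid; the function name `fcm_membership` is hypothetical.

```python
def fcm_membership(x, centroids, m=2.0):
    """Membership of a 1-D point x in each cluster (x must not equal a centroid)."""
    exponent = 2.0 / (m - 1.0)
    distances = [abs(x - c) for c in centroids]
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


u11, u12 = fcm_membership(2.0, [3.0, 11.0], m=2.0)
print(f"U11 = {u11:.2%}, U12 = {u12:.2%}")  # U11 = 98.78%, U12 = 1.22%
```

The two memberships sum to 1, as they should for any point in Fuzzy C-Means.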

Dataset Conversion


Hadoop-based K-Means on Documents


Fuzzy C-Means on Documents


Hadoop-based Fuzzy C-Means on Documents

Results

Experimental Setup

Algorithm                    Split            3 Centroids     4 Centroids     5 Centroids     6 Centroids
                                              4 Itr   6 Itr   4 Itr   6 Itr   4 Itr   6 Itr   4 Itr   6 Itr
Classical K-Means            Not Applicable   √       √       √       √       √       √       √       √
Hadoop Based K-Means         4 MB             √       √       √       √       √       √       √       √
                             8 MB             √       √       √       √       √       √       √       √
                             16 MB
                             32 MB            √       √       √       √       √       √       √       √
Classical Fuzzy C-Means      Not Applicable   √       √       √       √       √       √       √       √
Hadoop Based Fuzzy C-Means   4 MB             √       √       √       √       √       √       √       √
                             8 MB             √       √       √       √       √       √       √       √
                             16 MB
                             32 MB            √       √       √       √       √       √       √       √


Experimental Setup


[Figure: Time (sec) by No. of Nodes for Classical K-Means and 2-, 4-, and 8-node K-Means, with 3, 4, 5, and 6 centroids]


[Figure: Time (sec) by No. of Nodes for Classical FCM and 2-, 4-, and 8-node FCM, with 3, 4, 5, and 6 centroids]


[Figure: 4MB Split KM Performance. Speedup vs. No. of Nodes (2, 4, and 8 nodes) for 4 and 6 iterations]
Speedup Comparison of KM w.r.t. HKM

[Figure: 4MB Split FCM Performance. Speedup vs. No. of Nodes for 4 and 6 iterations]
Speedup Comparison of FCM w.r.t. HFCM
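The speedup plotted in these charts is presumably the classical (single-machine) running time divided by the Hadoop running time for the same workload; the slides do not define it, so treat that definition as an assumption. A minimal sketch, with a hypothetical `speedup` helper and placeholder timings that are not measurements from this work:

```python
def speedup(classical_seconds, hadoop_seconds):
    """Speedup of the Hadoop run relative to the classical (single-machine) run."""
    if hadoop_seconds <= 0:
        raise ValueError("hadoop_seconds must be positive")
    return classical_seconds / hadoop_seconds


# Placeholder timings in seconds, invented for illustration only;
# they are not measurements from the presentation.
classical_time = 900.0
hadoop_times = {2: 600.0, 4: 450.0, 8: 300.0}

speedups = {nodes: speedup(classical_time, t) for nodes, t in hadoop_times.items()}
# speedups == {2: 1.5, 4: 2.0, 8: 3.0}
```

A value above 1 means the Hadoop run finished faster than the classical one; a value below 1 means the distribution overhead outweighed the parallelism.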



[Figure: 8MB Split HKM Performance. Speedup vs. No. of Nodes for 4 and 6 iterations]

[Figure: 8MB Split HFCM Performance. Speedup vs. No. of Nodes for 4 and 6 iterations]



[Figure: HKM and HFCM speedup performances and comparison. Two panels (4 iterations and 6 iterations), each showing speedup for HKM and HFCM at 4 MB, 8 MB, and 32 MB splits on 2, 4, and 8 nodes]

Analysis based on cluster sizes

[Figure: Time for KM and 2-, 4-, and 8-node HKM with 3, 4, 5, and 6 centroids]
Average KM and HKM time consumption w.r.t. cluster sizes

[Figure: Time for FCM and 2-, 4-, and 8-node HFCM with 3, 4, 5, and 6 centroids]
Average FCM and HFCM time consumption w.r.t. cluster sizes


Future Scope


Paper publication
• Submitted to IEEE CONECCT 2015


Tools and Platform Required
1. Text Dataset
4. Hadoop 1.21
5. JDK 1.6
6. O.S. Ubuntu 14.04


References
1. Cui, Xiaoli et al. "Optimized big data K-means clustering using MapReduce." The Journal of Supercomputing, Vol. 70, pp. 1249-1259, 2014.

2. Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. "Data clustering: a review." ACM Computing Surveys (CSUR), Vol. 31, pp. 264-323, 1999. DOI:10.1145/331499.331504

3. Zhao, Weizhong et al. "Parallel k-means clustering based on MapReduce." In Cloud Computing, Springer Berlin Heidelberg, Vol. 5931, pp. 674-679, 2009.

4. Xie, Jiong, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin. "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters." In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, pp. 1-9. IEEE, 2010. DOI:10.1109/IPDPSW.2010.5470880


References (cont...)
5. Dean, J. and S. Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM, Vol. 51(1), pp. 107-113, Jan. 2008.

6. Asuncion, A. and D. J. Newman. UCI Machine Learning Repository. Available: http://archive.ics.uci.edu/ml/ (accessed: 07-Jan-2015)

7. https://www.linkedin.com/pulse/big-data-whats-deal-debarchan-sarkar (accessed: 09-Apr-2015)


QUESTIONS???


Thank You
