DP-IV presentation - ashutosh

Performance Analysis of C-means Clustering on Big Data using Hadoop. Guided by Prof. A. J. Umbarkar. Presented by A. S. Sathe. Broad area: Distributed Computing, Data Mining. Sub area: Clustering Algorithms, Data Clustering.

Transcript of DP-IV presentation - ashutosh

Page 1: DP-IV presentation - ashutosh

Performance analysis of C-means Clustering on Big Data using Hadoop

Guided By Prof. A. J. Umbarkar

Presented By A. S. Sathe

BROAD AREA: DISTRIBUTED COMPUTING, DATA MINING

SUB AREA: CLUSTERING ALGORITHMS, DATA CLUSTERING

Page 2: DP-IV presentation - ashutosh

Presentation Agenda
• Literature Survey
• Problem Statement
• Objectives achieved
• Results
• Future Scope
• References

Page 3: DP-IV presentation - ashutosh

Data Growth Rate[7]


Page 4: DP-IV presentation - ashutosh

Relevance
• Data Clustering: classification of a data set into similar groups based on some criteria
• Big Data: an amount of data that is difficult to process using traditional database and software techniques
• Hadoop: a distributed computing framework based on the MapReduce architecture
• Document Clustering
  • Text-based data stored in file format or unstructured format
  • Based on text properties such as word frequency, provided keywords, etc.
  • Text properties are used as the similarity criteria
  • Documents are differentiated based on the similarity criteria
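A minimal sketch of how word frequency can serve as a similarity criterion between documents. The tokenizer, the choice of cosine similarity, and the sample sentences are assumptions made for illustration; the slides do not state which text measure was actually used.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word occurrences in a document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    # Counter returns 0 for words missing from tf_b.
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, not taken from the data set used in this work.
docs = [
    "hadoop runs map reduce jobs",
    "map reduce jobs on hadoop clusters",
    "fuzzy membership of data points",
]
vectors = [term_frequencies(d) for d in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])  # four shared words
sim_02 = cosine_similarity(vectors[0], vectors[2])  # no shared words
print(sim_01, sim_02)
```

Documents that share many words score close to 1 and documents with no words in common score 0, which gives a clustering algorithm a numeric notion of "similar".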

Page 5: DP-IV presentation - ashutosh


Page 6: DP-IV presentation - ashutosh

Relevance
• Need for data clustering
  • Data mining is used for Knowledge Discovery from Data (KDD)
  • It is based on historical data
  • Historical data may be Big Data
  • Big Data processing is a very tedious task
  • Data clustering is a preprocessing step for Big Data processing
  • The processed data is then used for data mining
  • Data clustering gives better results than randomly placed data

Page 7: DP-IV presentation - ashutosh

Relevance
• Why text clustering
  • Text is a type of unstructured data
  • Free from any database constraints
  • Files can be very large without any restrictions
  • Real-time uses of text clustering
    • Retrieving, filtering, and categorizing documents
    • Information retrieval
  • Clustered data is useful for knowledge data retrieval

Page 8: DP-IV presentation - ashutosh

Relevance
• Why Hadoop
  • Distributed framework
  • Can use processor capacity on the fly
  • Made for Big Data processing

Page 9: DP-IV presentation - ashutosh

Problem Statement

• “Performance Analysis of C-means Clustering on Big Data using Hadoop.”


Page 10: DP-IV presentation - ashutosh

Objectives achieved
• Design of a processing model of the Fuzzy C-Means algorithm for Map-Reduce
• Implementation of the C-means algorithm on Map-Reduce
• Testing and performance analysis of the above algorithm with Big Data on Map-Reduce
• Comparison of C-means with other equivalent works

Page 11: DP-IV presentation - ashutosh

Fuzzy C-means Clustering


Page 12: DP-IV presentation - ashutosh

Fuzzy C-means Clustering


Page 13: DP-IV presentation - ashutosh

Fuzzy C-means Clustering


Page 14: DP-IV presentation - ashutosh

Fuzzy C-means Clustering


Page 15: DP-IV presentation - ashutosh

Fuzzy C-means Clustering


Page 16: DP-IV presentation - ashutosh

Fuzzy C-means Clustering


Page 17: DP-IV presentation - ashutosh

Fuzzy C-means Clustering


Page 18: DP-IV presentation - ashutosh

Fuzzy C-means Clustering

• For example, take initial centroids 3 and 11 (with m = 2).
• For node 2 (the 1st element):
  U11 = the membership of the first node in the first cluster
  U12 = the membership of the first node in the second cluster

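$$U_{11} = \frac{1}{\left(\frac{|2-3|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-3|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + \frac{1}{81}} = \frac{81}{82} = 98.78\%$$

$$U_{12} = \frac{1}{\left(\frac{|2-11|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-11|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{81 + 1} = \frac{1}{82} = 1.22\%$$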
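The membership calculation for this example (node 2, centroids 3 and 11, m = 2) can be checked with a short script. This is a minimal sketch of the standard Fuzzy C-Means membership formula for one-dimensional data; the function name `fcm_memberships` and the handling of a point that sits exactly on a centroid are assumptions for illustration.

```python
def fcm_memberships(point, centroids, m=2.0):
    """Return the fuzzy membership of a 1-D point in each cluster."""
    distances = [abs(point - c) for c in centroids]
    # Assumption: a point lying exactly on a centroid belongs fully to it
    # (shared equally if several centroids coincide with the point).
    on_centroid = [d == 0 for d in distances]
    if any(on_centroid):
        share = 1.0 / sum(on_centroid)
        return [share if hit else 0.0 for hit in on_centroid]
    exponent = 2.0 / (m - 1.0)
    # u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1))
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


u = fcm_memberships(2, [3, 11], m=2)
print([round(100 * value, 2) for value in u])  # [98.78, 1.22]
```

The two memberships always sum to 1, so node 2 belongs almost entirely to the cluster centred at 3 while keeping a small membership in the cluster centred at 11.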

Page 19: DP-IV presentation - ashutosh

Dataset Conversion


Page 20: DP-IV presentation - ashutosh

Hadoop based K-Means on Documents

Page 21: DP-IV presentation - ashutosh

Fuzzy C-Means on Documents

Page 22: DP-IV presentation - ashutosh

Hadoop based Fuzzy C-Means on Documents
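One way a Fuzzy C-Means iteration can be divided into map and reduce steps is sketched below. It is an assumption for illustration and not necessarily the processing model designed in this work: each map call handles one input split and emits per-cluster partial sums, and the reduce step adds those sums and divides them to obtain the new centroids. Plain Python lists stand in for Hadoop splits, and one-dimensional points stand in for document vectors; all function names are hypothetical.

```python
from functools import reduce


def memberships(x, centroids, m=2.0):
    """Fuzzy membership of the 1-D point x in every cluster."""
    distances = [abs(x - c) for c in centroids]
    on_centroid = [d == 0 for d in distances]
    if any(on_centroid):
        share = 1.0 / sum(on_centroid)
        return [share if hit else 0.0 for hit in on_centroid]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in distances)
            for d_j in distances]


def map_split(split, centroids, m=2.0):
    """Map step: per-cluster (sum of u^m * x, sum of u^m) for one split."""
    sums = [(0.0, 0.0)] * len(centroids)
    for x in split:
        weights = [u ** m for u in memberships(x, centroids, m)]
        sums = [(num + w * x, den + w) for (num, den), w in zip(sums, weights)]
    return sums


def reduce_partials(left, right):
    """Reduce step: add the partial sums produced by two splits."""
    return [(n1 + n2, d1 + d2) for (n1, d1), (n2, d2) in zip(left, right)]


def fcm_iteration(splits, centroids, m=2.0):
    """One full iteration: map every split, reduce, then update centroids."""
    totals = reduce(reduce_partials,
                    (map_split(s, centroids, m) for s in splits))
    # Keep the old centroid if a cluster received no weight at all.
    return [num / den if den else old
            for (num, den), old in zip(totals, centroids)]


# Hypothetical data: four points stored in two splits.
splits = [[2.0, 4.0], [10.0, 12.0]]
new_centroids = fcm_iteration(splits, [3.0, 11.0])
print(new_centroids)
```

Because the partial sums are simply added, the result does not depend on how the points are distributed across splits, which is what makes the update suitable for a distributed run.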

Page 23: DP-IV presentation - ashutosh

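Results

Experimental Setup

| Algorithm | Split | 3 Centroids, 4 Itr | 3 Centroids, 6 Itr | 4 Centroids, 4 Itr | 4 Centroids, 6 Itr | 5 Centroids, 4 Itr | 5 Centroids, 6 Itr | 6 Centroids, 4 Itr | 6 Centroids, 6 Itr |
| Classical K-Means | Not Applicable | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based K-Means | 4 MB Split | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based K-Means | 8 MB Split | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based K-Means | 16 MB Split | | | | | | | | |
| Hadoop Based K-Means | 32 MB Split | √ | √ | √ | √ | √ | √ | √ | √ |
| Classical Fuzzy C-Means | Not Applicable | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based Fuzzy C-Means | 4 MB Split | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based Fuzzy C-Means | 8 MB Split | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based Fuzzy C-Means | 16 MB Split | | | | | | | | |
| Hadoop Based Fuzzy C-Means | 32 MB Split | √ | √ | √ | √ | √ | √ | √ | √ |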

Page 24: DP-IV presentation - ashutosh

[Figure: Time (sec), 0 to 1000, for Classical K-Means, 2 Node K-Means, 4 Node K-Means, and 8 Node K-Means; axis label: No. of Nodes; series: 3, 4, 5, and 6 centroids.]

Page 25: DP-IV presentation - ashutosh

[Figure: Time in sec, 0 to 9000, for Classical FCM, 2 Node FCM, 4 Node FCM, and 8 Node FCM; axis label: No. of Nodes; series: 3, 4, 5, and 6 centroids.]

Page 26: DP-IV presentation - ashutosh

[Figure: 4MB Split KM Performance. Speedup (0 to 3) against No. of Nodes (2 Node, 4 Node, 8 Node); series: 4 ITR, 6 ITR.]

Speedup Comparison of KM w.r.t. HKM

[Figure: 4MB Split FCM Performance. Speedup (0 to 6) against No. of Nodes (2 Node, 4 Node, 8 Node); series: 4 ITR, 6 ITR.]

Speedup Comparison of FCM w.r.t. HFCM
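The speedup in these comparisons can be computed as sketched below, assuming the usual definition: the running time of the classical (single-machine) algorithm divided by the running time of its Hadoop counterpart. The timings in the example are hypothetical placeholders, not measurements from this work.

```python
def speedup(classical_seconds, hadoop_seconds):
    """Speedup of the Hadoop run relative to the classical run."""
    if hadoop_seconds <= 0:
        raise ValueError("hadoop_seconds must be positive")
    return classical_seconds / hadoop_seconds


# Hypothetical timings in seconds, NOT results reported in the slides.
classical_time = 900.0
hadoop_times = {2: 600.0, 4: 450.0, 8: 300.0}  # node count -> seconds

speedups = {nodes: speedup(classical_time, seconds)
            for nodes, seconds in hadoop_times.items()}
print(speedups)  # {2: 1.5, 4: 2.0, 8: 3.0}
```

A value above 1 means the Hadoop run finished faster than the classical run; a value below 1 means the distribution overhead outweighed the gain.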

Page 27: DP-IV presentation - ashutosh

[Figure: 8MB Split HKM Performance. Speedup (0 to 2.5) against No. of Nodes (2 Node, 4 Node, 8 Node); series: 4 ITR, 6 ITR.]

Speedup Comparison of KM w.r.t. HKM

[Figure: 8MB Split HFCM Performance. Speedup (0 to 6) against No. of Nodes (2 Node, 4 Node, 8 Node); series: 4 ITR, 6 ITR.]

Speedup Comparison of FCM w.r.t. HFCM

Page 28: DP-IV presentation - ashutosh

[Figure: two panels, one for 4 iterations and one for 6 iterations. Each shows Speedup (0 to 6) for HKM and HFCM at 4 MB, 8 MB, and 32 MB splits; series: 2 Node, 4 Node, 8 Node.]

HKM and HFCM speedup performances and comparison

Page 29: DP-IV presentation - ashutosh

Analysis based on cluster sizes

[Figure: Time (0 to 12000) for KM, 2 Node HKM, 4 Node HKM, and 8 Node HKM; series: 3, 4, 5, and 6 Centroids.]

Average KM and HKM time consumption w.r.t. cluster sizes

CONT…

Page 30: DP-IV presentation - ashutosh

[Figure: Time (0 to 100000) for FCM, 2 Node HFCM, 4 Node HFCM, and 8 Node HFCM; series: 3, 4, 5, and 6 Centroids.]

Average FCM and HFCM time consumption w.r.t. cluster sizes

Page 31: DP-IV presentation - ashutosh

Future Scope


Page 32: DP-IV presentation - ashutosh

Paper publication
• Submitted to IEEE CONECCT 2015

Page 33: DP-IV presentation - ashutosh

Tools and Platform Required
1. Text Dataset
4. Hadoop 1.2.1
5. JDK 1.6
6. O.S. Ubuntu 14.04

Page 34: DP-IV presentation - ashutosh

References
1. Cui, Xiaoli et al. "Optimized big data K-means clustering using MapReduce." The Journal of Supercomputing, Vol. 70, pp. 1249-1259, 2014.
2. Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. "Data clustering: a review." ACM Computing Surveys (CSUR), Vol. 31, pp. 264-323, 1999. DOI:10.1145/331499.331504
3. Zhao, Weizhong et al. "Parallel k-means clustering based on MapReduce." In Cloud Computing, Springer Berlin Heidelberg, Vol. 5931, pp. 674-679, 2009.
4. Xie, Jiong, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin. "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters." In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, pp. 1-9. IEEE, 2010. DOI:10.1109/IPDPSW.2010.5470880

Page 35: DP-IV presentation - ashutosh

References (cont...)
5. J. Dean, S. Ghemawat. "MapReduce." Commun. ACM, 51(1), p. 107, Jan. 2008.
6. A. Asuncion and D. J. Newman. UCI Machine Learning Repository. Available: http://archive.ics.uci.edu/ml/ (accessed: 07-Jan-2015)
7. https://www.linkedin.com/pulse/big-data-whats-deal-debarchan-sarkar (accessed: 09-Apr-2015)

Page 36: DP-IV presentation - ashutosh

QUESTIONS???

Page 37: DP-IV presentation - ashutosh

Thank You