Tarazu Optimizing MapReduce On Heterogeneous Clusters



Page 1: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Tarazu: Optimizing MapReduce On Heterogeneous Clusters

72130310 임규찬

Page 2: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

1. Abstract of Paper
   1. Abstract of paper
   2. Reference of paper – LATE
2. Introduction
3. Issue with Heterogeneity
4. Tarazu
5. Experimental Result

Contents

Page 3: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

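The paper studies how to optimize MapReduce in heterogeneous cluster environments.
  ◦ Data-center-scale clusters are adopting heterogeneous hardware for economic reasons.
  ◦ MapReduce has made big-data processing possible on such clusters.
With existing techniques, performance actually got worse.
  ◦ Prior work based on managing straggler tasks is not effective.
As an example of such work, the paper "Improving MapReduce Performance in Heterogeneous Environments" is used for comparison.

Abstract of Paper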

Page 4: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Optimizes for heterogeneity by controlling straggler tasks
  ◦ A straggler is a node that is available but performing poorly
  ◦ Stragglers can arise for many reasons, such as faulty hardware and misconfiguration
Proposes the LATE scheduler
  ◦ Longest Approximate Time to End
  ◦ Uses a per-task progress rate
      Progress rate = ProgressScore / amount of time the task has been running
Unfortunately, LATE alone is not sufficient to address hardware heterogeneity.

Reference of Paper: Improving MapReduce Performance in Heterogeneous Environments
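
The progress-rate heuristic can be sketched in a few lines of Python. This is only an illustration: the remaining-time estimate (1 - ProgressScore) / progress rate follows the LATE paper's description rather than this slide, and the function names and task values are hypothetical.

```python
def progress_rate(progress_score, elapsed_seconds):
    """Progress rate = ProgressScore / time the task has been running."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return progress_score / elapsed_seconds


def estimated_time_to_end(progress_score, elapsed_seconds):
    """Estimated remaining time = (1 - ProgressScore) / progress rate."""
    rate = progress_rate(progress_score, elapsed_seconds)
    if rate == 0:
        return float("inf")  # no progress yet, so no finite estimate
    return (1.0 - progress_score) / rate


def pick_speculation_candidate(tasks):
    """Return the running task with the longest estimated time to end."""
    return max(tasks, key=lambda name: estimated_time_to_end(*tasks[name]))


# Hypothetical running tasks: name -> (ProgressScore in [0, 1], seconds running)
tasks = {
    "map_1": (0.9, 30.0),
    "map_2": (0.2, 40.0),
    "map_3": (0.5, 20.0),
}
candidate = pick_speculation_candidate(tasks)
```

Here `map_2` has made the least progress per second, so it has the longest estimated time to end and would be the first choice for a speculative copy.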

Page 5: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

A Hindi word meaning "balance"
  ◦ Designed to pursue balance in MapReduce computation

Introduction - Tarazu (तराजू)

Page 6: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

A framework built to process large volumes of data in parallel in a distributed computing environment
  ◦ Optimized for homogeneous clusters.

Introduction - MapReduce

Page 7: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

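Computing on systems built from different kinds of cores
  ◦ GPGPU using CPU/GPU
      Aims to improve performance by maximizing the respective strengths of the CPU and GPU.
      Examples include OpenCL, CUDA, and DirectCompute.
      Not covered in this paper.
  ◦ Clustering with high-end/low-end nodes
      Optimizes monetary factors such as power and price.
      This paper uses 10 Xeon nodes and 80 Atom nodes.

Introduction - Heterogeneous Computing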

Page 8: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Four-phase execution model
  ◦ Map computation
      Produces <Key, Value> tuples
  ◦ Shuffle
      All-Map-to-all-Reduce personalized communication
  ◦ Sort
      Groups all the tuples for the same key
  ◦ Reduce computation
      Processes all the tuples for a key and produces the final output

Issue with Heterogeneity - Background: MapReduce
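
The four phases can be illustrated with a single-process word count in Python. This is a toy sketch, not Hadoop code: the function names are hypothetical, and the byte-sum partitioner is chosen only to be deterministic.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    # Map: emit one <key, value> tuple per word.
    return [(word, 1) for word in split.split()]


def shuffle_phase(map_outputs, num_reducers):
    # Shuffle: every Map output may send tuples to every Reduce partition.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in chain.from_iterable(map_outputs):
        # Deterministic toy partitioner (Python's str hash is randomized).
        partitions[sum(key.encode()) % num_reducers].append((key, value))
    return partitions


def sort_phase(partition):
    # Sort: group all tuples that share a key.
    grouped = defaultdict(list)
    for key, value in sorted(partition):
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Reduce: process all tuples for each key and produce the final output.
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits, num_reducers=2):
    map_outputs = [map_phase(s) for s in splits]
    result = {}
    for partition in shuffle_phase(map_outputs, num_reducers):
        result.update(reduce_phase(sort_phase(partition)))
    return result


counts = word_count(["a b a", "b c"])
```

In a real cluster each `map_phase` call and each partition's sort and reduce would run on a different node, which is why the Shuffle becomes all-to-all network communication.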

Page 9: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Dynamic load balancing in MapReduce
  ◦ Slower nodes get fewer tasks; faster nodes get more tasks
A heterogeneous cluster can be slower than a homogeneous one
  ◦ 20-75% slower for six out of eleven benchmarks.
  ◦ Heterogeneity can degrade performance
The poor performance is due to two key factors
  ◦ One is non-intuitive
  ◦ The other is intuitive

Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters

Page 10: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Factor 1: Non-intuitive
  ◦ Interaction between load balancing and network traffic
On a heterogeneous cluster, load balancing causes remote tasks
  ◦ Xeons are fast and Atoms are slow, so Xeons steal tasks from Atoms
  ◦ Remote tasks generate network traffic
  ◦ The network traffic is exacerbated by a heavy Shuffle

Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters

Page 11: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Factor 2: Intuitive
  ◦ Reduce-phase imbalance is amplified by heterogeneity
Reduce-phase load imbalance
  ◦ Different processing speeds make the slower nodes take a long time

Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters
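
A small worked example shows why an even split of Reduce work hurts on mixed hardware. The relative speeds and work units below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def reduce_finish_time(work_per_task, node_speeds):
    """The Reduce phase ends when the slowest node finishes its task."""
    return max(work_per_task / speed for speed in node_speeds)


# Hypothetical relative speeds: one fast node and two slow nodes.
speeds = [3.0, 1.0, 1.0]
total_work = 30.0

# Even split: every node receives the same amount of Reduce work.
even_share = total_work / len(speeds)
finish_even = reduce_finish_time(even_share, speeds)

# Lower bound if work could be divided perfectly in proportion to speed.
finish_ideal = total_work / sum(speeds)
```

With an even split the fast node finishes early and idles while the slow nodes are still running, so the phase takes 10 time units instead of the ideal 6.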

Page 12: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Issue with Heterogeneity - A simple(?) analytical model

Map Finish Time (MFT): the time at which Map computation finishes on whichever system (high-end or low-end) finishes later

Amount of input data crossing the bisection: data moved because of remote tasks plus shuffle data

Shuffle Finish Time (SFT): the time due to remote tasks, or MFT

Reduce Finish Time (RFT): the time due to remote tasks, or MFT

Page 13: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Two problems in MapReduce
  ◦ Map-side built-in load balancing results in remote Map tasks
  ◦ Reduce-side load imbalance across the nodes
Tarazu consists of three components
  ◦ Communication-Aware Load Balancing of Map computation (CALB)
  ◦ Communication-Aware Scheduling of Map computation (CAS)
  ◦ Predictive Load Balancing of Reduce computation (PLB)

Tarazu

Page 14: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Based on a key observation
  ◦ Because Map computation and Shuffle overlap, either one can be the critical phase
When Shuffle is critical: 'no-steal' mode
  ◦ Remote tasks are picked up only when the Shuffle ends
      No remote Map tasks compete with the Shuffle
      Reduces the I/O processing overhead
      Slower nodes perform more work

Tarazu - Communication-Aware Load Balancing of Map computation

Page 15: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

When Map computation is critical: 'task-steal' mode
  ◦ This case is the concern of CAS.
CALB switches modes using shuffleLag
  ◦ Measured with the monitoring that MapReduce already uses for fault tolerance
      shuffleLag is the difference between the number of Map tasks that have completed their computation and the number that have completed their communication, over all nodes
  ◦ Deciding the source of criticality once is enough, without repeated dynamic checks.

Tarazu - Communication-Aware Load Balancing of Map computation
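
The shuffleLag measurement and mode choice might look like the following Python sketch. The slide gives only the definition of shuffleLag; the decision rule (a large lag means Shuffle is critical), the `lag_threshold` parameter, and all names here are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MapProgress:
    computed: int      # Map tasks on this node that finished computing
    communicated: int  # Map tasks on this node whose output finished shuffling


def shuffle_lag(nodes):
    """Computed-minus-communicated Map task count, summed over all nodes."""
    return sum(n.computed for n in nodes) - sum(n.communicated for n in nodes)


def choose_mode(nodes, lag_threshold):
    # Assumption: a large lag means the Shuffle cannot keep up with Map
    # computation, so the Shuffle is treated as the critical phase.
    # lag_threshold is a hypothetical tuning knob, not a value from the paper.
    if shuffle_lag(nodes) > lag_threshold:
        return "no-steal"
    return "task-steal"


# Hypothetical snapshot of two nodes taken from the job monitor.
nodes = [MapProgress(computed=40, communicated=10),
         MapProgress(computed=12, communicated=6)]
mode = choose_mode(nodes, lag_threshold=20)
```

Because the source of criticality is decided once, a scheduler built this way would call `choose_mode` a single time early in the job rather than on every heartbeat.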

Page 16: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Determines how many remote tasks are needed
  ◦ Used in CALB's 'task-steal' mode
  ◦ Used to avoid increasing SFT
To avoid bursts of traffic, CAS spreads out the remote tasks by interleaving them with local tasks

Tarazu - Communication-Aware Scheduling of Map computation

Page 17: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

CAS has other benefits
  ◦ By interleaving remote tasks with local tasks, CAS achieves better overlap between remote task communication and local task computation on both the sender and receiver sides
  ◦ Remote tasks read input data faster by avoiding bursts

Tarazu - Communication-Aware Scheduling of Map computation
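
The interleaving idea can be sketched as a simple schedule builder in Python. This only illustrates spreading remote tasks evenly among local ones; the even-gap rule and the task names are assumptions, not the paper's scheduling algorithm.

```python
def interleave(local_tasks, remote_tasks):
    """Spread remote tasks among local tasks instead of running them back to back."""
    if not remote_tasks:
        return list(local_tasks)
    # Number of local tasks to run before each remote task.
    gap = len(local_tasks) // (len(remote_tasks) + 1)
    local_iter = iter(local_tasks)
    schedule = []
    for remote in remote_tasks:
        for _ in range(gap):
            schedule.append(next(local_iter))
        schedule.append(remote)
    # Whatever local work is left runs after the last remote task.
    schedule.extend(local_iter)
    return schedule


order = interleave(["L1", "L2", "L3", "L4", "L5", "L6"], ["R1", "R2"])
```

In this ordering each remote task's input transfer can overlap with the computation of the neighboring local tasks, rather than all remote reads hitting the network at once.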

Page 18: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Achieves better load balance in the Reduce phase
  ◦ By skewing the intermediate key distribution
  ◦ Reduces the max term in RFT
Each Reduce task stores the number of fast/slow nodes.

Tarazu - Predictive Load Balancing of Reduce computation
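
One way to picture a skewed intermediate key distribution is a partitioner that gives each Reduce task a share of the hash space in proportion to its node's speed. The sketch below rests on that assumption; the relative speeds, bucket count, and function name are hypothetical, and it does not reproduce how the paper predicts the skew.

```python
import bisect
import hashlib
from itertools import accumulate


def make_skewed_partitioner(reducer_speeds, buckets=1000):
    """Return a function that maps a key to a Reduce task index, weighted by speed."""
    total = sum(reducer_speeds)
    # Cumulative bucket boundaries: faster reducers own more hash buckets.
    bounds = [round(buckets * c / total) for c in accumulate(reducer_speeds)]

    def partition(key):
        digest = hashlib.sha256(key.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % buckets
        # First reducer whose upper boundary lies above this bucket.
        return bisect.bisect_right(bounds, bucket)

    return partition


# Hypothetical relative speeds: reducer 0 on a fast node, 1 and 2 on slow nodes.
partition = make_skewed_partitioner([3.0, 1.0, 1.0])
assignments = [partition(f"key{i}") for i in range(10000)]
share_fast = assignments.count(0) / len(assignments)
```

With these weights the fast reducer receives roughly three fifths of the keys, so all three reducers would finish at about the same time instead of waiting on the slow nodes.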

Page 19: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Uses a heterogeneous cluster environment
  ◦ 10 Xeon-based and 80 Atom-based server nodes
Uses Hadoop 0.20.2
Compared against another solution, LATE

Experimental Methodology

Page 20: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

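Maximizes the system's strengths through heterogeneity-aware techniques
  ◦ In the Shuffle-critical case, the results reflect the large number of Atom nodes
  ◦ In the Map-critical case, the results reflect the performance of the Xeon nodes

Experimental Result - Performance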

Page 21: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Experimental Result - Effect of CALB, CAS and PLB

Page 22: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Experimental Result - Sensitivity to extent of heterogeneity

Page 23: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Experimental Result - Effect of skewed input data distribution

Page 24: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Improving MapReduce Performance in Heterogeneous Environments – University of California, Berkeley
https://developers.google.com/appengine/docs/python/dataprocessing/
http://www.cpubenchmark.net/

Reference