Tarazu: Optimizing MapReduce on Heterogeneous Clusters
72130310 임규찬
1. Abstract of Paper
   1. Abstract of paper
   2. Reference of paper - LATE
2. Introduction
3. Issue with Heterogeneity
4. Tarazu
5. Experimental Result
Contents
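The paper studies how to optimize MapReduce in a heterogeneous cluster environment.
◦ Data-center-scale clusters are adopting heterogeneous hardware for economic reasons.
◦ MapReduce makes big-data processing possible on such clusters.
With existing techniques, performance actually got worse.
◦ Prior work based on managing straggler tasks is not effective.
As an example, the paper compares against "Improving MapReduce Performance in Heterogeneous Environments".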
Abstract of Paper
Optimizes for heterogeneity by controlling straggler tasks
◦ Straggler condition: a node is available but is performing poorly
◦ Can arise for many reasons, such as faulty hardware and misconfiguration
Proposes the LATE scheduler
◦ Longest Approximate Time to End
◦ Uses a per-task progress rate
  ProgressRate = ProgressScore / (amount of time the task has been running)
Unfortunately, LATE alone is not sufficient to address hardware heterogeneity.
Reference of Paper - Improving MapReduce Performance in Heterogeneous Environments
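The progress-rate idea can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop's or the LATE authors' code: the function names and the task dictionary are hypothetical, and the time-to-end estimate `(1 - ProgressScore) / ProgressRate` comes from the LATE paper rather than from the slide. The real scheduler also applies speculation caps and a slow-task threshold, which are omitted here.

```python
def progress_rate(progress_score: float, elapsed_seconds: float) -> float:
    """ProgressScore divided by the time the task has been running."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return progress_score / elapsed_seconds


def estimated_time_to_end(progress_score: float, elapsed_seconds: float) -> float:
    """Remaining work divided by the observed progress rate."""
    rate = progress_rate(progress_score, elapsed_seconds)
    if rate == 0:
        return float("inf")  # no progress yet, so the estimate is unbounded
    return (1.0 - progress_score) / rate


def pick_speculation_candidate(tasks: dict) -> str:
    """Return the task with the longest approximate time to end.

    `tasks` maps a (hypothetical) task name to (progress_score, elapsed_seconds).
    """
    return max(tasks, key=lambda name: estimated_time_to_end(*tasks[name]))


if __name__ == "__main__":
    running = {"map_07": (0.5, 50.0), "map_12": (0.2, 100.0)}
    print(pick_speculation_candidate(running))
```

Ranking by estimated time to end, rather than by raw progress, is what lets the scheduler pick the task whose backup copy would help the job most.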
A Hindi word meaning "balance"
◦ Designed to pursue balance in MapReduce computation
Introduction - Tarazu(तराजू)
A framework built to process large volumes of data in parallel in a distributed computing environment
◦ Optimized for homogeneous clusters
Introduction - MapReduce
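Computing on systems made up of different kinds of cores
◦ GPGPU using CPU/GPU
  Aims to improve performance by maximizing the respective strengths of the CPU and the GPU.
  Examples include OpenCL, CUDA, and DirectCompute.
  Not covered in this paper.
◦ Clustering with high-end/low-end nodes
  Optimizes monetary factors such as power and price.
  This paper uses 10 Xeon nodes and 80 Atom nodes.
Introduction - Heterogeneous Computing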
Four-phase execution model
◦ Map computation
  Produces <Key, Value> tuples
◦ Shuffle
  All-Map-to-all-Reduce personalized communication
◦ Sort
  Groups all the tuples for the same key
◦ Reduce computation
  Processes all the tuples for a key and produces the final output
Issue with Heterogeneity - Background: MapReduce
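The four phases can be illustrated with a single-process word count. This is a toy sketch for orientation only: the function names are hypothetical, everything runs in one process, and the CRC-based partitioner is an assumption chosen to keep the example deterministic. It is not how Hadoop implements these phases.

```python
import zlib
from collections import defaultdict


def map_phase(documents):
    """Map computation: each Map task emits <key, value> tuples for its input."""
    return [[(word, 1) for word in doc.split()] for doc in documents]


def shuffle_phase(map_outputs, num_reducers):
    """Shuffle: every Map output sends each tuple to the Reduce task owning its key."""
    partitions = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            owner = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[owner].append((key, value))
    return partitions


def sort_phase(partition):
    """Sort: group all the tuples that share a key."""
    grouped = defaultdict(list)
    for key, value in sorted(partition):
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce computation: process all tuples for a key and produce the output."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(documents, num_reducers=2):
    partitions = shuffle_phase(map_phase(documents), num_reducers)
    result = {}
    for partition in partitions:
        result.update(reduce_phase(sort_phase(partition)))
    return result


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

In a real cluster the Shuffle step is network communication between nodes, which is why it interacts with load balancing in the slides that follow.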
Dynamic load balancing in MapReduce
◦ Slower nodes get fewer tasks; faster nodes get more tasks
Heterogeneous clusters run slower than homogeneous ones
◦ 20-75% slower for six out of eleven benchmarks
◦ Heterogeneity can degrade performance
Poor performance is due to two key factors
◦ One is non-intuitive
◦ The other is intuitive
Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters
Factor 1: Non-intuitive
◦ Interaction between load balancing and network traffic
Heterogeneity causes remote tasks
◦ Xeons are fast and Atoms are slow, so Xeons steal Atom tasks
◦ Remote tasks generate network traffic
◦ The network traffic problem is exacerbated when Shuffle is heavy
Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters
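A small simulation shows how pull-based load balancing turns a speed gap into remote tasks. It is a hedged sketch under simplifying assumptions: the node names, block counts, and per-task times are invented for illustration, an idle node always steals from the node with the most unprocessed blocks, and the extra time a remote read costs is ignored.

```python
import heapq


def count_remote_tasks(local_blocks, task_seconds):
    """Count how many Map tasks each node runs on another node's data.

    local_blocks: node -> number of input blocks stored locally (hypothetical).
    task_seconds: node -> seconds the node needs per Map task (hypothetical).
    """
    remaining = dict(local_blocks)
    remote = {node: 0 for node in local_blocks}
    # Min-heap of (time at which the node becomes free, node name).
    free_at = [(0.0, node) for node in sorted(local_blocks)]
    heapq.heapify(free_at)
    while any(remaining.values()):
        now, node = heapq.heappop(free_at)
        if remaining[node] > 0:
            remaining[node] -= 1          # local task: input is already on this node
        else:
            victim = max(remaining, key=remaining.get)
            remaining[victim] -= 1        # remote task: input must cross the network
            remote[node] += 1
        heapq.heappush(free_at, (now + task_seconds[node], node))
    return remote


if __name__ == "__main__":
    blocks = {"xeon": 4, "atom": 4}
    print(count_remote_tasks(blocks, {"xeon": 1.0, "atom": 4.0}))
    print(count_remote_tasks(blocks, {"xeon": 1.0, "atom": 1.0}))
```

With equal speeds no task is remote; once one node is four times faster, it finishes its own blocks early and starts pulling the slower node's data over the network.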
Factor 2: Intuitive
◦ Reduce-phase imbalance is amplified by heterogeneity
Reduce-phase load imbalance
◦ Different processing speeds across nodes lengthen the Reduce phase
Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters
Issue with Heterogeneity - A Simple(?) analytical model
Map Finish Time (MFT): the time at which the slower of the high-end/low-end systems finishes its Map computation
Amount of input data crossing the bisection: data moved because of remote tasks + Shuffle data
Shuffle Finish Time (SFT): the time due to remote tasks, or MFT
Reduce Finish Time (RFT): the time due to remote tasks, or MFT
Two problems in MapReduce
◦ Map-side built-in load balancing results in remote Map tasks
◦ Reduce-side load imbalance across the nodes
Tarazu consists of three components
◦ Communication-Aware Load Balancing of Map computation (CALB)
◦ Communication-Aware Scheduling of Map computation (CAS)
◦ Predictive Load Balancing of Reduce computation (PLB)
Tarazu
Based on a key observation
◦ Because Map computation and Shuffle overlap, only one of the two is critical
When Shuffle is critical: "no-steal mode"
◦ Remote tasks are picked up only once Shuffle ends
  No remote Map tasks compete with Shuffle
  Reduces the I/O processing overhead
  Slower nodes perform more work
Tarazu - Communication-Aware Load Balancing of Map computation
When Map computation is critical: "task-steal mode"
◦ This case is the concern of CAS
CALB changes mode using shuffleLag
◦ Measured with the monitoring MapReduce already performs for fault tolerance
  shuffleLag: the difference between the number of Map tasks that have completed their computation and the number that have completed their communication, across all nodes
◦ Deciding the source of criticality once is enough, without repeated, dynamic checks
Tarazu - Communication-Aware Load Balancing of Map computation
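The mode decision can be sketched as follows. Only the definition of shuffleLag comes from the slides; the function names, the `lag_threshold` parameter, and the simple "lag above threshold means Shuffle is critical" rule are assumptions made for illustration. The slides do not say how Tarazu actually sets the cut-off.

```python
def shuffle_lag(maps_done_computing: int, maps_done_communicating: int) -> int:
    """Map tasks that finished computing minus Map tasks whose output is shuffled."""
    return maps_done_computing - maps_done_communicating


def choose_calb_mode(maps_done_computing: int,
                     maps_done_communicating: int,
                     lag_threshold: int) -> str:
    """Pick the CALB mode once, from cluster-wide counters.

    lag_threshold is a hypothetical tuning knob, not a value from the paper.
    A large lag means Shuffle is falling behind Map, so Shuffle is treated
    as the critical phase and stealing is disabled.
    """
    lag = shuffle_lag(maps_done_computing, maps_done_communicating)
    return "no-steal" if lag > lag_threshold else "task-steal"


if __name__ == "__main__":
    print(choose_calb_mode(100, 40, lag_threshold=20))
    print(choose_calb_mode(100, 95, lag_threshold=20))
```

Because the decision is made once, the function is stateless: it takes a snapshot of the two counters and returns a mode for the rest of the job.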
Determines how many remote tasks are needed
◦ Used in CALB's "task-steal" mode
◦ Used to avoid increasing SFT
To avoid traffic bursts, CAS spreads out the remote tasks by interleaving them with local tasks
Tarazu - Communication-Aware Scheduling of Map computation
CAS has other benefits
◦ By interleaving remote tasks with local tasks, CAS achieves better overlap between remote task communication and local task computation on both sender and receiver sides
◦ Remote tasks read input data faster by avoiding bursts
Tarazu - Communication-Aware Scheduling of Map computation
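The interleaving idea can be sketched as a simple schedule builder. This is an illustration only: the function name and the evenly spaced "gap" rule are assumptions, and the slides do not describe the exact policy CAS uses to order tasks.

```python
def interleave_remote_tasks(local_tasks, remote_tasks):
    """Spread remote tasks through the local ones instead of running them in a burst."""
    local, remote = list(local_tasks), list(remote_tasks)
    if not remote:
        return local
    # Number of local tasks to run between consecutive remote tasks (at least one).
    gap = max(1, len(local) // len(remote))
    schedule = []
    next_remote = 0
    for index, task in enumerate(local, start=1):
        schedule.append(task)
        if index % gap == 0 and next_remote < len(remote):
            schedule.append(remote[next_remote])
            next_remote += 1
    # Any remote tasks left over (more remote than local slots) run at the end.
    schedule.extend(remote[next_remote:])
    return schedule


if __name__ == "__main__":
    local = ["L1", "L2", "L3", "L4", "L5", "L6"]
    print(interleave_remote_tasks(local, ["R1", "R2"]))
```

Each remote input fetch then has a stretch of local computation to overlap with, which is the benefit the slide attributes to CAS.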
Better load balance in the Reduce phase
◦ Skews the intermediate key distribution
◦ Reduces the max term in RFT
Each Reduce task saves the number of fast/slow nodes.
Tarazu - Predictive Load Balancing of Reduce computation
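Skewing the key distribution can be illustrated by dividing keys in proportion to node speed. This is a sketch under stated assumptions: the node names and relative speeds are hypothetical, and proportional splitting is a stand-in for whatever prediction PLB actually uses, which the slides do not spell out.

```python
def assign_key_shares(node_speeds, num_keys):
    """Split `num_keys` intermediate keys across Reduce nodes in proportion to speed.

    node_speeds: node -> relative processing speed (hypothetical numbers).
    """
    total_speed = sum(node_speeds.values())
    shares = {node: int(num_keys * speed / total_speed)
              for node, speed in node_speeds.items()}
    # Rounding down can leave a few keys unassigned; give them to the fastest nodes.
    leftover = num_keys - sum(shares.values())
    fastest_first = sorted(node_speeds, key=node_speeds.get, reverse=True)
    for node in fastest_first[:leftover]:
        shares[node] += 1
    return shares


if __name__ == "__main__":
    speeds = {"xeon": 3.0, "atom1": 1.0, "atom2": 1.0}
    print(assign_key_shares(speeds, 10))
```

If every node gets work proportional to its speed, all Reduce tasks finish at about the same time, which shrinks the maximum that determines RFT.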
Uses a heterogeneous cluster environment
◦ 10 Xeon-based and 80 Atom-based server nodes
Uses Hadoop 0.20.2
Compares against another solution, LATE
Experimental Methodology
Maximizes each system's strengths through heterogeneity
◦ Shuffle-critical cases reflect the large number of Atom nodes
◦ Map-critical cases reflect the performance of the Xeon nodes
Experimental Result - Performance
Experimental Result - Effect of CALB, CAS and PLB
Experimental Result - Sensitivity to extent of heterogeneity
Experimental Result - Effect of skewed input data distribution
Improving MapReduce Performance in Heterogeneous Environments - University of California, Berkeley
https://developers.google.com/appengine/docs/python/dataprocessing/
http://www.cpubenchmark.net/
Reference