[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史


Description

The upcoming Apache Hadoop 2.6 release is packed with new features; it could be called the biggest update since the 2.x line shipped. This talk gives, from a Hadoop developer's point of view, a basic explanation of YARN, the core of Hadoop 2, and introduces the newest features planned for the Apache Hadoop 2.6 release. In particular, it covers in detail the YARN ResourceManager high-availability mechanism that the speaker has been working on, and the YARN resource-management settings that are essential for operating Hadoop 2.

Transcript of [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史

Page 1: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Apache Hadoop - What’s next? - @db tech showcase 2014

Tsuyoshi [email protected]

Page 2: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Tsuyoshi Ozawa

• Researcher & Engineer @ NTT

• Twitter: @oza_x86_64

• A Hadoop developer

• Merged patches – 53 patches!

• Author of “Hadoop 徹底入門 2nd Edition”, Chapter 22 (YARN)

About me

Page 3: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Quiz!!

Page 4: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Does Hadoop have a SPoF?

Quiz

Page 5: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Quiz

All master nodes in Hadoop can run in highly available mode

Page 6: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Is Hadoop only for MapReduce?

Quiz

Page 7: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Quiz

Hadoop is not only for MapReduce,

but also for Spark, Tez, Storm, and so on…

Page 8: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Current Status of Hadoop - New features since Hadoop 2 -

  • HDFS
    • No SPoF with NameNode HA + JournalNode
    • Scaling out the NameNode with NameNode Federation

  • YARN
    • Resource Management with YARN
    • No SPoF with ResourceManager HA

  • MapReduce
    • No SPoF with ApplicationMaster restart

• What’s next? - Coming features in the 2.6 release -

  • HDFS
    • Heterogeneous Storage
    • Memory as a Storage Tier

  • YARN
    • Label-based scheduling
    • RM HA Phase 2

Agenda

Page 9: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


HDFS IN HADOOP 2

Page 10: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Once upon a time, the NameNode was a SPoF

• In Hadoop 2, the NameNode has the Quorum Journal Manager

• Replication is done by a Paxos-based protocol

See also:

http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1/

NameNode with JournalNode

[Figure: the NameNode writes edits through the QuorumJournalManager to three JournalNodes, each persisting them to its local disk]
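
As a minimal sketch (hostnames, ports, and the nameservice ID are placeholders, not taken from the talk), the HA NameNode pair and the JournalNode quorum are wired together in hdfs-site.xml roughly like this:

<!-- illustrative values only -->
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>master1:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>master2:8020</value></property>
<property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value></property>
<property><name>dfs.client.failover.proxy.provider.mycluster</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
<property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>

Automatic failover additionally needs a fencing method (dfs.ha.fencing.methods) and a ZooKeeper quorum for the ZKFailoverController (ha.zookeeper.quorum in core-site.xml); see the HDFS HA documentation for details.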

Page 11: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Once upon a time, NameNode scalability was limited by the memory of a single NameNode

• In Hadoop 2, the NameNode has a Federation feature

• Distributing metadata per namespace

NameNode Federation

Figures from: https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/Federation.html
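
As a sketch (nameservice IDs and hostnames are illustrative), a federated cluster lists several nameservices in hdfs-site.xml and gives each one its own NameNode addresses; every DataNode registers with all of them:

<!-- illustrative values only -->
<property><name>dfs.nameservices</name><value>ns1,ns2</value></property>
<property><name>dfs.namenode.rpc-address.ns1</name><value>namenode1:8020</value></property>
<property><name>dfs.namenode.http-address.ns1</name><value>namenode1:50070</value></property>
<property><name>dfs.namenode.rpc-address.ns2</name><value>namenode2:8020</value></property>
<property><name>dfs.namenode.http-address.ns2</name><value>namenode2:50070</value></property>

Each nameservice then owns an independent part of the namespace, so metadata is distributed across NameNodes.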

Page 12: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


RESOURCE MANAGEMENT IN HADOOP 2

Page 13: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Generic resource management framework

• YARN = Yet Another Resource Negotiator

• Proposed by Arun C Murthy in 2011

• Container-level resource management

• A container is a more generic unit of resource than a slot

• Separates the JobTracker’s roles:

• Job Scheduling/Resource Management/Isolation

• Task Scheduling

What’s YARN?

[Figure: MRv1 architecture (JobTracker managing map/reduce slots on each TaskTracker) vs. the MRv2/YARN architecture (a YARN ResourceManager with per-framework masters such as an Impala Master, Spark Master, and MRv2 Master, and YARN NodeManagers hosting generic containers)]

Page 14: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Running various processing frameworks on the same cluster

• Batch processing with MapReduce

• Interactive query with Impala

• Interactive deep analytics (e.g. machine learning) with Spark

Why YARN? (Use case)

[Figure: MRv2/Tez (periodic long batch queries), Impala (interactive aggregation queries), and Spark (interactive machine-learning queries) all running on YARN on top of HDFS]

Page 15: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• More effective resource management for multiple processing frameworks

• Difficult to use the entire cluster’s resources without thrashing

• Cannot move *real* big data out of HDFS/S3

Why YARN? (Technical reason)

[Figure: without YARN, the master for MapReduce and the master for Impala each run their own scheduler over their own slaves (map/reduce slots, Impala slaves) on the same machines that host the HDFS slaves, so concurrent jobs (Job1, Job2) can cause thrashing]

Page 16: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Resource is managed by JobTracker

• Job-level Scheduling

• Resource Management

MRv1 Architecture

[Figure: the master for MapReduce (JobTracker) manages map/reduce slots on each MapReduce slave, while a separate master for Impala manages its own slaves; each scheduler only knows its own resource usage]

Page 17: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Idea

• One global resource manager (the ResourceManager)

• A common resource pool for all frameworks (NodeManagers and containers)

• A scheduler for each framework (the AppMaster)

YARN Architecture

[Figure: (1) a client submits a job to the ResourceManager; (2) the ResourceManager launches the job’s Master in a container on a NodeManager; (3) the Master launches its slave tasks in containers across the NodeManagers]
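
To make the architecture concrete, the sketch below shows an illustrative minimal configuration (the ResourceManager hostname is a placeholder) that lets MapReduce run as one of these frameworks: clients are pointed at YARN in mapred-site.xml, and NodeManagers run the MapReduce shuffle as an auxiliary service in yarn-site.xml.

<!-- mapred-site.xml -->
<property><name>mapreduce.framework.name</name><value>yarn</value></property>

<!-- yarn-site.xml -->
<property><name>yarn.resourcemanager.hostname</name><value>rm.example.com</value></property>
<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>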

Page 18: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


YARN and Mesos

YARN

• An AppMaster is launched for each job

• More scalability

• Higher latency

• One container per req

• One Master per Job

Mesos

• An AppMaster is launched for each app (framework)

• Less scalability

• Lower latency

• Bundle of containers per req

• One Master per Framework

[Figure: in YARN, Master1 and Master2 run as per-job masters under one ResourceManager and its NodeManagers; in Mesos, Master1 and Master2 are per-framework masters talking to the ResourceMaster and its slaves]

Policy/Philosophy is different

Page 19: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• MapReduce
  • Of course, it works

• DAG-style processing frameworks
  • Spark on YARN
  • Hive on Tez on YARN

• Interactive Query
  • Impala on YARN (via Llama)

• Users
  • Yahoo!
  • Twitter
  • LinkedIn
  • Hadoop 2 @ Twitter: http://www.slideshare.net/Hadoop_Summit/t-235p210-cvijayarenuv2

YARN Eco-system

Page 20: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


YARN COMPONENTS

Page 21: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Master Node of YARN

• Role

• Accepting requests from

1. Application Masters for allocating containers

2. Clients for submitting jobs

• Managing Cluster Resources

• Job-level Scheduling

• Container Management

• Launching the application-level Master (e.g. for MapReduce)

ResourceManager (RM)

[Figure: (1) a client submits a job to the ResourceManager; (2) the ResourceManager launches the job’s Master in a container; (3) the Master sends container allocation requests to the ResourceManager; (4) the ResourceManager sends container allocation requests to the NodeManagers]

Page 22: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Slave Node of YARN

• Role

• Accepting requests from RM

• Monitoring the local machine and reporting it to the RM

• Health Check

• Managing local resources

NodeManager (NM)

[Figure: (1) clients or a Master request containers from the ResourceManager; (2) the ResourceManager allocates containers on a NodeManager; (3) the NodeManager launches the containers; (4) container information (host, port, etc.) is returned; the NodeManager also sends a periodic health check to the ResourceManager via heartbeat]
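
The health check mentioned above can also run an admin-supplied script; a hedged yarn-site.xml sketch (the script path and interval are placeholders, not from the talk):

<property><name>yarn.nodemanager.health-checker.script.path</name><value>/etc/hadoop/nm-health-check.sh</value></property>
<property><name>yarn.nodemanager.health-checker.interval-ms</name><value>600000</value></property>

If the script prints a line starting with ERROR, the node is reported as unhealthy and the ResourceManager stops scheduling new containers on it.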

Page 23: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• The master of an application (e.g. the master of MapReduce, Tez, Spark, etc.)

• Runs in a container

• Roles

• Getting containers from ResourceManager

• Application-level scheduling
  • How many map tasks run, and where?

  • When will reduce tasks be launched?

ApplicationMaster (AM)

[Figure: the Master of MapReduce, running in a container on a NodeManager, (1) requests containers from the ResourceManager and (2) receives the list of allocated containers]

Page 24: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


RESOURCE MANAGER HA

Page 25: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• What happens when the ResourceManager fails?

• New jobs cannot be submitted

• NOTE:

• Launched apps continue to run

• AppMaster recovery is done by each framework (e.g. MRv2)

ResourceManager High Availability

[Figure: even while the ResourceManager is down, the Masters and slaves already running in containers on the NodeManagers continue their jobs; only new job submission from clients is blocked]

Page 26: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Approach

• Storing RM information in ZooKeeper

• Automatic failover by the EmbeddedElector

• Manual failover by RMHAUtils

• NodeManagers use a local RMProxy to access the RMs

ResourceManager High Availability

[Figure: (1) the active ResourceManager stores all of its state in the RMStateStore on the ZooKeeper ensemble; (2) the active ResourceManager fails; (3) the EmbeddedElector detects the failure and the standby ResourceManager becomes active; (4) failover completes; (5) the new active ResourceManager loads its state from the RMStateStore]
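
A minimal yarn-site.xml sketch for this setup (hostnames, the cluster ID, and the ZooKeeper quorum are placeholders, not from the talk):

<property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
<property><name>yarn.resourcemanager.hostname.rm1</name><value>master1</value></property>
<property><name>yarn.resourcemanager.hostname.rm2</name><value>master2</value></property>
<property><name>yarn.resourcemanager.zk-address</name><value>zk1:2181,zk2:2181,zk3:2181</value></property>
<property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value></property>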

Page 27: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


CAPACITY PLANNING ON YARN

Page 28: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Define resources with XML (etc/hadoop/yarn-site.xml)

Resource definition on NodeManager

[Figure: a NodeManager with 8 CPU cores and 8 GB of memory]

<property><name>yarn.nodemanager.resource.cpu-vcores</name><value>8</value></property>

<property><name>yarn.nodemanager.resource.memory-mb</name><value>8192</value></property>

8 CPU cores 8 GB memory

Page 29: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Container allocation on ResourceManager

• The RM accepts a container request and sends it to an NM, but the request can be rewritten

• Small requests are rounded up to minimum-allocation-mb

• Large requests are rounded down to maximum-allocation-mb

<property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>

<property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>

[Figure: a Master requests a 512 MB container; the ResourceManager rounds the request up and asks a NodeManager for a 1024 MB container]

Page 30: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Define how much resource map tasks and reduce tasks use

• MapReduce: etc/hadoop/mapred-site.xml

Container allocation at framework side

[Figure: a NodeManager with 8 CPU cores and 8 GB of memory]

<property><name>mapreduce.map.memory.mb</name><value>1024</value></property>

<property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>

[Figure: the Master asks the ResourceManager to “give us containers for map tasks: 1024 MB of memory and 1 CPU core each”, and receives matching containers on the NodeManagers]
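
Since each map or reduce task runs in a JVM inside its container, the JVM heap is usually set somewhat below the container size; an illustrative companion setting for mapred-site.xml (the heap values are placeholders, not from the slides):

<property><name>mapreduce.map.java.opts</name><value>-Xmx800m</value></property>
<property><name>mapreduce.reduce.java.opts</name><value>-Xmx3276m</value></property>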

Page 31: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


WHAT’S NEXT? – HDFS -

Page 32: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• HDFS-2832, HDFS-5682

• Handling various storage types in HDFS

• SSD, memory, disk, and so on.

• Setting quota per storage types

• Setting SSD quota on /home/user1 to 10 TB.

• Setting SSD quota on /home/user2 to 10 TB.

• Not configuring any SSD quota on the remaining user directories (i.e. leaving it to defaults)

Heterogeneous Storages for HDFS Phase 2

<configuration>

...

<property>

<name>dfs.datanode.data.dir</name>

<value>[DISK]/mnt/sdc2/,[DISK]/mnt/sdd2,[SSD]/mnt/sde2</value>

</property>

...

</configuration>

Page 33: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• HDFS-5851

• Introducing an explicit “cache” layer in HDFS

• Discardable Distributed Memory

• Applications can accelerate their processing by keeping data in memory

• Discardable Memory and Materialized Queries is one example

• Difference between RDD and DDM

• Multi-tenancy aware

• Handling data in the processing layer or in the storage layer

Support memory as a storage medium

Page 34: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Archival storage

• HDFS-6584

• Transparent encryption

• HDFS-6134

And, more!

Page 35: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


WHAT’S NEXT? – YARN -

Page 36: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Non-stop YARN updating (YARN-666)

• NodeManager, ResourceManager, applications

• Before 2.6.0

• Restarting the RM -> the RM restarts all AMs -> all jobs restart

• Restarting NMs -> the NMs are removed from the cluster -> containers are restarted!

• After 2.6.0

• Restarting the RM -> AMs continue to run

• Restarting an NM -> the NM restores its state from local data (see the config sketch below)

Support for rolling upgrades in YARN

[Figure: YARN cluster with the ResourceManager, NodeManagers, and containers; the Masters and slaves running in containers keep working across component restarts]
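
A hedged yarn-site.xml sketch of the restart-related settings behind this behavior (the recovery directory and the fixed NodeManager port are placeholders; pinning yarn.nodemanager.address is needed so running containers can reconnect after an NM restart):

<property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.work-preserving-recovery.enabled</name><value>true</value></property>
<property><name>yarn.nodemanager.recovery.enabled</name><value>true</value></property>
<property><name>yarn.nodemanager.recovery.dir</name><value>/var/hadoop/yarn-nm-recovery</value></property>
<property><name>yarn.nodemanager.address</name><value>0.0.0.0:45454</value></property>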

Page 37: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Now we can run various subsystems on YARN

• Interactive query engines: Spark, Impala, …

• Batch processing engines: MapReduce, Tez, …

• Problem

• Interactive query engines allocate resources at the same time, which can delay the daily batch

• Time-based reservation scheduling

• 8:00am – 6:00pm, allocating resources for Impala

• 6:00pm – 0:00am, allocating resources for MapReduce

YARN reservation-subsystem

[Figure: timeline from 8:00am to 6:00pm reserved for the interactive query engine, and from 6:00pm to 0:00am for batch processing for the next day]
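
If the reservation subsystem ships as planned, enabling it should look roughly like the sketch below (the property name is taken from the development branch and is an assumption, not something stated in the talk; a queue also has to be marked as reservable in the scheduler configuration):

<!-- assumed property name; verify against the 2.6 release -->
<property><name>yarn.resourcemanager.reservation-system.enable</name><value>true</value></property>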

Page 38: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• YARN-796

• Handling heterogeneous machines in one YARN cluster

• GPU cluster

• High memory cluster

• 40Gbps Network cluster

• Labeling them and scheduling based on labels

• Admin can add/remove labels via yarn rmadmin commands

Support for admin-specified labels in YARN

[Figure: one group of NodeManagers labeled “GPU” and another labeled “40G network”; a client submits a job to the ResourceManager asking for it to run on the GPU nodes]
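
As a sketch of how this is expected to be wired up in 2.6 (the HDFS store directory is a placeholder, and the property names are assumptions based on the development branch): node labels are switched on and persisted via yarn-site.xml, the admin then defines labels and maps them to nodes with yarn rmadmin as noted above, and queues are granted access to labels in the scheduler configuration.

<!-- assumed property names; verify against the 2.6 release -->
<property><name>yarn.node-labels.enabled</name><value>true</value></property>
<property><name>yarn.node-labels.fs-store.root-dir</name><value>hdfs://namenode:8020/yarn/node-labels</value></property>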

Page 39: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Timeline service security

• YARN-1935

• Minimal support for running long-running services on YARN

• YARN-896

• Support for automatic, shared cache for YARN application artifacts

• YARN-1492

• And, and more!

• Please check Wiki http://wiki.apache.org/hadoop/Roadmap

And, more!

Page 40: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Hadoop 2 is evolving rapidly
  • I hope this presentation helps you catch up!

• New components since v2
  • HDFS
    • Quorum Journal Manager
    • NameNode Federation
  • YARN
    • ResourceManager
    • NodeManager
    • ApplicationMaster

• New features in 2.6:
  • Discardable memory store on HDFS, and so on
  • Rolling update, labels for heterogeneous clusters on YARN, the reservation system, and so on…

• Questions or feedback -> [email protected]

• Issues -> https://issues.apache.org/jira/browse/{HDFS,YARN,HADOOP,MAPREDUCE}

Summary

Page 41: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• YARN-666

• https://www.youtube.com/watch?v=O4Q73e2ua9Y&feature=youtu.be

• http://www.slideshare.net/Hadoop_Summit/t-145p230avavilapalli-mac