[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史


Description

The upcoming Apache Hadoop 2.6 release is packed with new features; it could be called the biggest update since the 2.x line shipped. This talk gives, from a Hadoop developer's point of view, a basic explanation of YARN, the core of Hadoop 2, and introduces the newest features planned for the Apache Hadoop 2.6 release. In particular, it covers in detail the YARN ResourceManager high-availability mechanism that the speaker has been working on, and the YARN resource-management settings that are essential for operating Hadoop 2.

Transcript of [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史

Page 1: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Apache Hadoop - What’s next? - @db tech showcase 2014

Tsuyoshi [email protected]

Page 2: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Tsuyoshi Ozawa

• Researcher & Engineer @ NTT

• Twitter: @oza_x86_64

• A Hadoop developer

• Merged patches – 53 patches!

• Author of “Hadoop 徹底入門 2nd Edition”, Chapter 22 (YARN)

About me

Page 3: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Quiz!!

Page 4: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Does Hadoop have a SPoF?

Quiz

Page 5: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Quiz

All master nodes in Hadoop can run in highly available mode

Page 6: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Is Hadoop only for MapReduce?

Quiz

Page 7: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Quiz

Hadoop is not only for MapReduce,

but also for Spark, Tez, Storm, and so on…

Page 8: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Current Status of Hadoop - New features since Hadoop 2 -

  • HDFS
    • No SPoF with NameNode HA + JournalNode
    • Scaling out the NameNode with NameNode Federation

  • YARN
    • Resource Management with YARN
    • No SPoF with ResourceManager HA

  • MapReduce
    • No SPoF with ApplicationMaster restart

• What’s next? - Coming features in the 2.6 release -

  • HDFS
    • Heterogeneous Storage
    • Memory as a Storage Tier

  • YARN
    • Label-based scheduling
    • RM HA Phase 2

Agenda

Page 9: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


HDFS IN HADOOP 2

Page 10: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Once upon a time, the NameNode was a SPoF

• In Hadoop 2, the NameNode has the Quorum Journal Manager

• Replication is done by a Paxos-based protocol

See also:

http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1/

NameNode with JournalNode

[Figure: the NameNode writes edits through the QuorumJournalManager to three JournalNodes, each persisting them to its local disk]
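
As a minimal sketch (hostnames, ports, and the nameservice ID are placeholders, not taken from the talk), the HA NameNode pair and the JournalNode quorum are wired together in hdfs-site.xml roughly like this:

<!-- illustrative values only -->
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>master1:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>master2:8020</value></property>
<property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value></property>
<property><name>dfs.client.failover.proxy.provider.mycluster</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
<property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>

Automatic failover additionally needs a fencing method (dfs.ha.fencing.methods) and a ZooKeeper quorum for the ZKFailoverController (ha.zookeeper.quorum in core-site.xml); see the HDFS HA documentation for details.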

Page 11: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Once upon a time, NameNode scalability was limited by the memory of a single NameNode

• In Hadoop 2, the NameNode has a Federation feature

• Distributing metadata per namespace

NameNode Federation

Figures from: https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/Federation.html
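
As a sketch (nameservice IDs and hostnames are illustrative), a federated cluster lists several nameservices in hdfs-site.xml and gives each one its own NameNode addresses; every DataNode registers with all of them:

<!-- illustrative values only -->
<property><name>dfs.nameservices</name><value>ns1,ns2</value></property>
<property><name>dfs.namenode.rpc-address.ns1</name><value>namenode1:8020</value></property>
<property><name>dfs.namenode.http-address.ns1</name><value>namenode1:50070</value></property>
<property><name>dfs.namenode.rpc-address.ns2</name><value>namenode2:8020</value></property>
<property><name>dfs.namenode.http-address.ns2</name><value>namenode2:50070</value></property>

Each nameservice then owns an independent part of the namespace, so metadata is distributed across NameNodes.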

Page 12: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


RESOURCE MANAGEMENT IN HADOOP 2

Page 13: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Generic resource management framework

• YARN = Yet Another Resource Negotiator

• Proposed by Arun C Murthy in 2011

• Container-level resource management

• A container is a more generic unit of resource than a slot

• Separates the JobTracker’s roles:

• Job Scheduling/Resource Management/Isolation

• Task Scheduling

What’s YARN?

[Figure: MRv1 architecture (JobTracker managing map/reduce slots on each TaskTracker) vs. the MRv2/YARN architecture (a YARN ResourceManager with per-framework masters such as an Impala Master, Spark Master, and MRv2 Master, and YARN NodeManagers hosting generic containers)]

Page 14: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Running various processing frameworks on the same cluster

• Batch processing with MapReduce

• Interactive query with Impala

• Interactive deep analytics (e.g. machine learning) with Spark

Why YARN? (Use case)

[Figure: MRv2/Tez (periodic long batch queries), Impala (interactive aggregation queries), and Spark (interactive machine-learning queries) all running on YARN on top of HDFS]

Page 15: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• More effective resource management for multiple processing frameworks

• Difficult to use the entire cluster’s resources without thrashing

• Cannot move *real* big data out of HDFS/S3

Why YARN? (Technical reason)

[Figure: without YARN, the master for MapReduce and the master for Impala each run their own scheduler over their own slaves (map/reduce slots, Impala slaves) on the same machines that host the HDFS slaves, so concurrent jobs (Job1, Job2) can cause thrashing]

Page 16: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Resource is managed by JobTracker

• Job-level Scheduling

• Resource Management

MRv1 Architecture

[Figure: the master for MapReduce (JobTracker) manages map/reduce slots on each MapReduce slave, while a separate master for Impala manages its own slaves; each scheduler only knows its own resource usage]

Page 17: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Idea

• One global resource manager (the ResourceManager)

• A common resource pool for all frameworks (NodeManagers and containers)

• A scheduler for each framework (the AppMaster)

YARN Architecture

[Figure: (1) a client submits a job to the ResourceManager; (2) the ResourceManager launches the job’s Master in a container on a NodeManager; (3) the Master launches its slave tasks in containers across the NodeManagers]
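
To make the architecture concrete, the sketch below shows an illustrative minimal configuration (the ResourceManager hostname is a placeholder) that lets MapReduce run as one of these frameworks: clients are pointed at YARN in mapred-site.xml, and NodeManagers run the MapReduce shuffle as an auxiliary service in yarn-site.xml.

<!-- mapred-site.xml -->
<property><name>mapreduce.framework.name</name><value>yarn</value></property>

<!-- yarn-site.xml -->
<property><name>yarn.resourcemanager.hostname</name><value>rm.example.com</value></property>
<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>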

Page 18: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


YARN and Mesos

YARN

• An AppMaster is launched for each job

• More scalability

• Higher latency

• One container per req

• One Master per Job

Mesos

• An AppMaster is launched for each app (framework)

• Less scalability

• Lower latency

• Bundle of containers per req

• One Master per Framework

[Figure: in YARN, Master1 and Master2 run as per-job masters under one ResourceManager and its NodeManagers; in Mesos, Master1 and Master2 are per-framework masters talking to the ResourceMaster and its slaves]

Policy/Philosophy is different

Page 19: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• MapReduce
  • Of course, it works

• DAG-style processing frameworks
  • Spark on YARN
  • Hive on Tez on YARN

• Interactive Query
  • Impala on YARN (via Llama)

• Users
  • Yahoo!
  • Twitter
  • LinkedIn
  • Hadoop 2 @ Twitter: http://www.slideshare.net/Hadoop_Summit/t-235p210-cvijayarenuv2

YARN Eco-system

Page 20: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


YARN COMPONENTS

Page 21: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Master Node of YARN

• Role

• Accepting requests from

1. Application Masters for allocating containers

2. Clients for submitting jobs

• Managing Cluster Resources

• Job-level Scheduling

• Container Management

• Launching the application-level Master (e.g. for MapReduce)

ResourceManager (RM)

[Figure: (1) a client submits a job to the ResourceManager; (2) the ResourceManager launches the job’s Master in a container; (3) the Master sends container allocation requests to the ResourceManager; (4) the ResourceManager sends container allocation requests to the NodeManagers]

Page 22: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Slave Node of YARN

• Role

• Accepting requests from RM

• Monitoring the local machine and reporting it to the RM

• Health Check

• Managing local resources

NodeManager (NM)

[Figure: (1) clients or a Master request containers from the ResourceManager; (2) the ResourceManager allocates containers on a NodeManager; (3) the NodeManager launches the containers; (4) container information (host, port, etc.) is returned; the NodeManager also sends a periodic health check to the ResourceManager via heartbeat]
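
The health check mentioned above can also run an admin-supplied script; a hedged yarn-site.xml sketch (the script path and interval are placeholders, not from the talk):

<property><name>yarn.nodemanager.health-checker.script.path</name><value>/etc/hadoop/nm-health-check.sh</value></property>
<property><name>yarn.nodemanager.health-checker.interval-ms</name><value>600000</value></property>

If the script prints a line starting with ERROR, the node is reported as unhealthy and the ResourceManager stops scheduling new containers on it.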

Page 23: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• The master of an application (e.g. the master of MapReduce, Tez, Spark, etc.)

• Runs in a container

• Roles

• Getting containers from ResourceManager

• Application-level scheduling
  • How many map tasks run, and where?

  • When will reduce tasks be launched?

ApplicationMaster (AM)

[Figure: the Master of MapReduce, running in a container on a NodeManager, (1) requests containers from the ResourceManager and (2) receives the list of allocated containers]

Page 24: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


RESOURCE MANAGER HA

Page 25: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• What happens when the ResourceManager fails?

• New jobs cannot be submitted

• NOTE:

• Launched apps continue to run

• AppMaster recovery is done by each framework (e.g. MRv2)

ResourceManager High Availability

[Figure: even while the ResourceManager is down, the Masters and slaves already running in containers on the NodeManagers continue their jobs; only new job submission from clients is blocked]

Page 26: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Approach

• Storing RM information in ZooKeeper

• Automatic failover by the EmbeddedElector

• Manual failover by RMHAUtils

• NodeManagers use a local RMProxy to access the RMs

ResourceManager High Availability

[Figure: (1) the active ResourceManager stores all of its state in the RMStateStore on the ZooKeeper ensemble; (2) the active ResourceManager fails; (3) the EmbeddedElector detects the failure and the standby ResourceManager becomes active; (4) failover completes; (5) the new active ResourceManager loads its state from the RMStateStore]
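
A minimal yarn-site.xml sketch for this setup (hostnames, the cluster ID, and the ZooKeeper quorum are placeholders, not from the talk):

<property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
<property><name>yarn.resourcemanager.hostname.rm1</name><value>master1</value></property>
<property><name>yarn.resourcemanager.hostname.rm2</name><value>master2</value></property>
<property><name>yarn.resourcemanager.zk-address</name><value>zk1:2181,zk2:2181,zk3:2181</value></property>
<property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value></property>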

Page 27: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


CAPACITY PLANNING ON YARN

Page 28: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Define resources with XML (etc/hadoop/yarn-site.xml)

Resource definition on NodeManager

[Figure: a NodeManager with 8 CPU cores and 8 GB of memory]

<property><name>yarn.nodemanager.resource.cpu-vcores</name><value>8</value></property>

<property><name>yarn.nodemanager.resource.memory-mb</name><value>8192</value></property>

8 CPU cores 8 GB memory

Page 29: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


Container allocation on ResourceManager

• The RM accepts a container request and sends it to an NM, but the request can be rewritten

• Small requests are rounded up to minimum-allocation-mb

• Large requests are rounded down to maximum-allocation-mb

<property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>

<property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>

[Figure: a Master requests a 512 MB container; the ResourceManager rounds the request up and asks a NodeManager for a 1024 MB container]

Page 30: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Define how much resource map tasks and reduce tasks use

• MapReduce: etc/hadoop/mapred-site.xml

Container allocation at framework side

[Figure: a NodeManager with 8 CPU cores and 8 GB of memory]

<property><name>mapreduce.map.memory.mb</name><value>1024</value></property>

<property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>

[Figure: the Master asks the ResourceManager to “give us containers for map tasks: 1024 MB of memory and 1 CPU core each”, and receives matching containers on the NodeManagers]
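
Since each map or reduce task runs in a JVM inside its container, the JVM heap is usually set somewhat below the container size; an illustrative companion setting for mapred-site.xml (the heap values are placeholders, not from the slides):

<property><name>mapreduce.map.java.opts</name><value>-Xmx800m</value></property>
<property><name>mapreduce.reduce.java.opts</name><value>-Xmx3276m</value></property>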

Page 31: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


WHAT’S NEXT? – HDFS -

Page 32: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• HDFS-2832, HDFS-5682

• Handling various storage types in HDFS

• SSD, memory, disk, and so on.

• Setting quota per storage types

• Setting SSD quota on /home/user1 to 10 TB.

• Setting SSD quota on /home/user2 to 10 TB.

• Not configuring any SSD quota on the remaining user directories (i.e. leaving it to defaults)

Heterogeneous Storages for HDFS Phase 2

<configuration>

...

<property>

<name>dfs.datanode.data.dir</name>

<value>[DISK]/mnt/sdc2/,[DISK]/mnt/sdd2,[SSD]/mnt/sde2</value>

</property>

...

</configuration>

Page 33: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• HDFS-5851

• Introducing an explicit “cache” layer in HDFS

• Discardable Distributed Memory

• Applications can accelerate their processing by keeping data in memory

• Discardable Memory and Materialized Queries is one example

• Difference between RDD and DDM

• Multi-tenancy aware

• Handling data in the processing layer or in the storage layer

Support memory as a storage medium

Page 34: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Archival storage

• HDFS-6584

• Transparent encryption

• HDFS-6134

And, more!

Page 35: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


WHAT’S NEXT? – YARN -

Page 36: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Non-stop YARN updating (YARN-666)

• NodeManager, ResourceManager, applications

• Before 2.6.0

• Restarting the RM -> the RM restarts all AMs -> all jobs restart

• Restarting NMs -> the NMs are removed from the cluster -> containers are restarted!

• After 2.6.0

• Restarting the RM -> AMs continue to run

• Restarting an NM -> the NM restores its state from local data (see the config sketch below)

Support for rolling upgrades in YARN

[Figure: YARN cluster with the ResourceManager, NodeManagers, and containers; the Masters and slaves running in containers keep working across component restarts]
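
A hedged yarn-site.xml sketch of the restart-related settings behind this behavior (the recovery directory and the fixed NodeManager port are placeholders; pinning yarn.nodemanager.address is needed so running containers can reconnect after an NM restart):

<property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.work-preserving-recovery.enabled</name><value>true</value></property>
<property><name>yarn.nodemanager.recovery.enabled</name><value>true</value></property>
<property><name>yarn.nodemanager.recovery.dir</name><value>/var/hadoop/yarn-nm-recovery</value></property>
<property><name>yarn.nodemanager.address</name><value>0.0.0.0:45454</value></property>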

Page 37: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Now we can run various subsystems on YARN

• Interactive query engines: Spark, Impala, …

• Batch processing engines: MapReduce, Tez, …

• Problem

• Interactive query engines allocate resources at the same time, which can delay the daily batch

• Time-based reservation scheduling

• 8:00am – 6:00pm, allocating resources for Impala

• 6:00pm – 0:00am, allocating resources for MapReduce

YARN reservation-subsystem

[Figure: timeline from 8:00am to 6:00pm reserved for the interactive query engine, and from 6:00pm to 0:00am for batch processing for the next day]
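
If the reservation subsystem ships as planned, enabling it should look roughly like the sketch below (the property name is taken from the development branch and is an assumption, not something stated in the talk; a queue also has to be marked as reservable in the scheduler configuration):

<!-- assumed property name; verify against the 2.6 release -->
<property><name>yarn.resourcemanager.reservation-system.enable</name><value>true</value></property>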

Page 38: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• YARN-796

• Handling heterogeneous machines in one YARN cluster

• GPU cluster

• High memory cluster

• 40Gbps Network cluster

• Labeling them and scheduling based on labels

• Admin can add/remove labels via yarn rmadmin commands

Support for admin-specified labels in YARN

[Figure: one group of NodeManagers labeled “GPU” and another labeled “40G network”; a client submits a job to the ResourceManager asking for it to run on the GPU nodes]
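
As a sketch of how this is expected to be wired up in 2.6 (the HDFS store directory is a placeholder, and the property names are assumptions based on the development branch): node labels are switched on and persisted via yarn-site.xml, the admin then defines labels and maps them to nodes with yarn rmadmin as noted above, and queues are granted access to labels in the scheduler configuration.

<!-- assumed property names; verify against the 2.6 release -->
<property><name>yarn.node-labels.enabled</name><value>true</value></property>
<property><name>yarn.node-labels.fs-store.root-dir</name><value>hdfs://namenode:8020/yarn/node-labels</value></property>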

Page 39: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Timeline service security

• YARN-1935

• Minimal support for running long-running services on YARN

• YARN-896

• Support for automatic, shared cache for YARN application artifacts

• YARN-1492

• And, and more!

• Please check Wiki http://wiki.apache.org/hadoop/Roadmap

And, more!

Page 40: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• Hadoop 2 is evolving rapidly
  • I hope this presentation helps you catch up!

• New components since v2
  • HDFS
    • Quorum Journal Manager
    • NameNode Federation
  • YARN
    • ResourceManager
    • NodeManager
    • ApplicationMaster

• New features in 2.6:
  • Discardable memory store on HDFS, and so on
  • Rolling update, labels for heterogeneous clusters on YARN, the reservation system, and so on…

• Questions or feedback -> [email protected]

• Issues -> https://issues.apache.org/jira/browse/{HDFS,YARN,HADOOP,MAPREDUCE}

Summary

Page 41: [db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史


• YARN-666

• https://www.youtube.com/watch?v=O4Q73e2ua9Y&feature=youtu.be

• http://www.slideshare.net/Hadoop_Summit/t-145p230avavilapalli-mac