Hadoop Ecosystem Fundamentals
These training materials were prepared as part of the Istanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
How to Scale for Big Data?
• Data volumes are massive
• Reliably storing PBs of data is challenging
• All kinds of failures: disk/hardware/network failures
• The probability of failure simply increases with the number of machines …
Distributed processing is non-trivial
• How to assign tasks to different workers in an efficient way?
• What happens if tasks fail?
• How do workers exchange results?
• How to synchronize distributed tasks allocated to different workers?
One popular solution: Hadoop
Hadoop Cluster at Yahoo! (Credit: Yahoo)
Hadoop History
• Dec 2004 – Google MapReduce paper published (the GFS paper had appeared at SOSP 2003)
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Hadoop becomes a Lucene subproject
• Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster
• Jan 2008 – Hadoop becomes an Apache Top-Level Project
• Jul 2008 – A 4000-node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
• Feb 2009 – The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data used in every Yahoo! Web search query.
• June 2009 – On June 10, 2009, Yahoo! made available the source code of the Hadoop version it runs in production.
• 2010 – Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage; on July 27, 2011 they announced that the data had grown to 30 PB.
• Hadoop is an open-source implementation based on Google File System (GFS) and MapReduce from Google
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005
• Hadoop was donated to Apache in 2006
Hadoop offers
• Redundant, Fault-tolerant data storage
• Parallel computation framework
• Job coordination
Programmers no longer need to worry about:
Q: Where is the file located?
Q: How to handle failures & data loss?
Q: How to divide the computation?
Q: How to program for scaling?
Typical Hadoop Cluster
(Figure: cluster nodes arranged in racks, each rack with its own rack switch, and the racks connected through an aggregation switch)
• 40 nodes/rack, 1000-4000 nodes per cluster
• 1 Gbps bandwidth within a rack, 8 Gbps out of the rack
• Node specs (Yahoo terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
A real-world example from the New York Times
• Goal: Make the entire archive of articles available online: 11 million articles, from 1851 onward
• Task: Convert 4 TB of TIFF images to PDF files
• Solution: Used Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3)
• Time: < 24 hours
• Cost: $240
The Basic Hadoop Components
• Hadoop Common – libraries and utilities
• Hadoop Distributed File System (HDFS) – a distributed file system
• Hadoop MapReduce – a programming model for large-scale data processing
• Hadoop YARN – a resource-management and job-scheduling platform
Hadoop Stack
(Figure: a computation layer running on top of a storage layer)
Applications and Frameworks
• HBase – a scalable, distributed database with support for large tables. Column-oriented database management system and key-value store, based on Google Bigtable
• Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying of data using a SQL-like language called HiveQL
• Pig – a high-level data-flow language and execution framework for parallel computation. Uses the language Pig Latin
• Spark – a fast and general compute engine for Hadoop data. Supports a wide range of applications: ETL, machine learning, stream processing, and graph analytics
MapReduce Framework
• Typically compute and storage nodes are the same.
• MapReduce tasks and HDFS running on the same nodes
• Can schedule tasks on nodes with data already present.
Original MapReduce Framework
• Single master JobTracker
• JobTracker schedules, monitors, and re-executes failed tasks.
• One slave TaskTracker per cluster node
• TaskTracker executes tasks per JobTracker requests.
Original HDFS Design
• Single NameNode - a master server that manages the file system namespace and regulates access to files by clients.
• Multiple DataNodes – typically one per node in the cluster. Functions:
– Manage storage
– Serve read/write requests from clients
– Block creation, deletion, and replication based on instructions from the NameNode
Hadoop 1.0 and Hadoop 2.0 (YARN)
What's wrong with the original MapReduce?
• Scaling beyond 4000 nodes
• High availability of the JobTracker
• Poor resource utilization
YARN(Yet Another Resource Negotiator)
Main idea – split the resource-management and job-scheduling/monitoring functions of the JobTracker into separate daemons:
• Global ResourceManager (RM) – a highly available scheduler
• ApplicationMaster (AM) – one for each application; a framework-specific entity tasked with negotiating resources from the RM
Resource Manager
The ResourceManager contains a pure scheduler:
• responsible for allocating resources to running applications
• performs no monitoring or tracking of application status (the RM relies on heartbeats from the NodeManagers)
• gives no guarantees on restarting tasks that fail due to application or hardware failures
• runs on the master node
The scheduler performs its scheduling by using a resource container.
NodeManager and ApplicationMaster
The NodeManager is a generalized TaskTracker:
• one per-machine slave, runs on the slave nodes
• launches containers and kills them
• monitors their resource usage (CPU, memory, disk, network)
• reports this usage to the ResourceManager
The ApplicationMaster:
• one per application
• negotiates resources with the RM
YARN Architecture
Resources model
An application can request resources via the ApplicationMaster with highly specific requirements, such as:
• Resource-name (including hostname, rackname, and possibly complex network topologies)
• Amount of memory
• CPUs (number/type of cores)
• Eventually resources like disk/network I/O, GPUs, etc.
The Scheduler responds to a resource request by granting a container, which satisfies the requirements; a sketch of such a request follows below.
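As an illustration only, here is a minimal sketch of how an ApplicationMaster might ask the ResourceManager for a container through the AMRMClient API; the memory/vcore amounts, priority, and placement hints are illustrative assumptions, not values from this material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResourceRequestSketch {
  public static void main(String[] args) throws Exception {
    // Client used by an ApplicationMaster to talk to the ResourceManager.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();

    // Register with the RM before requesting containers (host/port/URL omitted here).
    rmClient.registerApplicationMaster("", 0, "");

    // Capability: 1024 MB of memory and 2 virtual cores (illustrative numbers).
    Resource capability = Resource.newInstance(1024, 2);
    Priority priority = Priority.newInstance(0);

    // Optional placement constraints by node and/or rack; null means "anywhere".
    String[] nodes = null;  // e.g. new String[] { "worker-node-1" } (hypothetical hostname)
    String[] racks = null;  // e.g. new String[] { "/rack1" }        (hypothetical rackname)

    // Ask the scheduler for one container matching the capability.
    rmClient.addContainerRequest(new ContainerRequest(capability, nodes, racks, priority));

    // A real AM would then call allocate() periodically and receive granted containers.
  }
}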
Container
The YARN container launch specification API is platform agnostic and contains:
• the command line to launch the process within the container
• environment variables
• local resources needed on the machine prior to launch, such as jars, shared objects, auxiliary data files, etc.
• security-related tokens
This design allows the ApplicationMaster to work with the NodeManager to launch containers ranging from simple shell scripts to C/Java/Python processes on Unix/Windows to full-fledged virtual machines; a small sketch of building such a launch specification follows.
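A minimal sketch of filling in a launch specification with the YARN records API; the script name and environment variable below are illustrative assumptions.

import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class LaunchSpecSketch {
  public static ContainerLaunchContext buildLaunchContext() {
    // Empty launch-context record; a real AM would also attach local resources and tokens.
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);

    // Command line executed inside the container (my_task.sh is a hypothetical script).
    ctx.setCommands(Arrays.asList("/bin/sh my_task.sh"));

    // Environment variables visible to the launched process.
    ctx.setEnvironment(Collections.singletonMap("TASK_MODE", "demo"));

    return ctx;
  }
}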
Basic Hadoop API*
Mapper
void setup(Mapper.Context context)
Called once at the beginning of the task
void map(K key, V value, Mapper.Context context)
Called once for each key/value pair in the input split
void cleanup(Mapper.Context context)
Called once at the end of the task
Reducer/Combiner
void setup(Reducer.Context context)
Called once at the start of the task
void reduce(K key, Iterable<V> values, Reducer.Context context)
Called once for each key
void cleanup(Reducer.Context context)
Called once at the end of the task
*Note that there are two versions of the API!
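To make the method signatures above concrete, here is a minimal word-count Mapper and Reducer written against the newer org.apache.hadoop.mapreduce API (one of the two API versions mentioned above); the class names are illustrative, not part of the original material.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMR {

  // map() is called once per input record (here: one line of text).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);            // emit (word, 1)
        }
      }
    }
  }

  // reduce() is called once per key, with all values for that key.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // emit (word, total count)
    }
  }
}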
Basic Hadoop API*
Partitioner
int getPartition(K key, V value, int numPartitions)
Get the partition number given total number of partitions
Job
Represents a packaged Hadoop job for submission to cluster
Need to specify input and output paths
Need to specify input and output formats
Need to specify mapper, reducer, combiner, partitioner classes
Need to specify intermediate/final key/value classes
Need to specify number of reducers (but not mappers, why?)
Don't depend on defaults!
*Note that there are two versions of the API!
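A matching driver sketch showing the Job settings listed above: paths, formats, mapper/combiner/reducer/partitioner classes, intermediate and final key/value classes, and the number of reducers are all set explicitly instead of relying on defaults. It reuses the hypothetical WordCountMR classes from the earlier sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Input/output paths come from the command line (e.g. HDFS directories).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Input/output formats, set explicitly rather than relying on defaults.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Mapper, combiner, reducer, and partitioner classes.
    job.setMapperClass(WordCountMR.TokenizerMapper.class);
    job.setCombinerClass(WordCountMR.SumReducer.class);
    job.setReducerClass(WordCountMR.SumReducer.class);
    job.setPartitionerClass(HashPartitioner.class);

    // Intermediate and final key/value classes.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Number of reducers must be chosen; the number of mappers follows from the input splits.
    job.setNumReduceTasks(2);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}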
Basic Cluster Components*
One of each:
Namenode (NN): master node for HDFS
Set of each per slave machine:
Datanode (DN): serves HDFS data blocks
NameNode Metadata
Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g., creation time, replication factor
A Transaction Log
– Records file creations, file deletions, etc.
Namenode Responsibilities
Managing the file system namespace:
Holds file/directory structure, metadata, file-to-block mapping,
access permissions, etc.
Coordinating file operations:
Directs clients to datanodes for reads and writes
No data is moved through the namenode
Maintaining overall health:
Periodic communication with the datanodes
Block re-replication and rebalancing
Garbage collection
DataNode
A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block (e.g. CRC)
– Serves data and meta-data to Clients
Block Report
– Periodically sends a report of all existing blocks to the NameNode
Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
HDFS Working Flow
(Figure adapted from Ghemawat et al., SOSP 2003: the application uses an HDFS client, which sends a file name to the HDFS namenode and receives the block ids and block locations; the client then requests block data from the HDFS datanodes by block id and byte range. The namenode maintains the file namespace, e.g. /foo/bar → block 3df2, and sends instructions to the datanodes, which report their state back; each datanode stores blocks on its local Linux file system.)
NameNode Failure
A single point of failure
Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
Need to develop a real HA solution
Hadoop Workflow
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job
3a. Go back to Step 2
4. Retrieve data from HDFS
Recommended Workflow
Example:
Develop code in Eclipse on host machine
Build distribution on host machine
Check out copy of code on VM
Copy (i.e., scp) jars over to the VM (in the same directory structure); see the sketch after this list
Run job on VM
Iterate
…
Commit code on host machine and push
Pull from inside VM, verify
Avoid using the UI of the VM
Directly ssh into the VM
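A minimal sketch of the copy-and-run step under stated assumptions: a job jar named wordcount.jar, a VM reachable as hduser@hadoop-vm, and a driver class named WordCountDriver; all of these names are hypothetical.

$ scp target/wordcount.jar hduser@hadoop-vm:~/jobs/    # copy the freshly built jar to the VM
$ ssh hduser@hadoop-vm                                 # log in directly instead of using the VM's UI
$ cd ~/jobs
$ hadoop jar wordcount.jar WordCountDriver /user/hduser/input /user/hduser/output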
Hadoop Platforms
• Platforms: Unix and Windows.
– Linux: the only supported production platform.
– Other variants of Unix, like Mac OS X: can run Hadoop for development.
– Windows + Cygwin: development platform (openssh)
• Java 8
– Java 1.8.x (aka 8.0.x aka 8) is recommended for running Hadoop.
Hadoop Installation
• Download a stable version of Hadoop: – http://hadoop.apache.org/core/releases.html
• Untar the hadoop file: – tar xvfz hadoop-0.20.2.tar.gz
• JAVA_HOME at hadoop/conf/hadoop-env.sh: – Mac OS:
/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home (/Library/Java/Home)
– Linux: which java
• Environment Variables: – export PATH=$PATH:$HADOOP_HOME/bin
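For example, on a typical Linux setup the following lines could be used; the JDK path and the Hadoop directory are illustrative assumptions (find your own JDK location with "which java"):

# In hadoop/conf/hadoop-env.sh (the JDK path is an assumption; adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# In ~/.bashrc (HADOOP_HOME points at the untarred Hadoop directory; adjust as needed)
export HADOOP_HOME=/home/hduser/hadoop-2.7.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin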
Hadoop Modes
• Standalone (or local) mode – There are no daemons running and everything runs in a
single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
• Pseudo-distributed mode – The Hadoop daemons run on the local machine, thus
simulating a cluster on a small scale.
• Fully distributed mode – The Hadoop daemons run on a cluster of machines.
Pseudo Distributed Mode
• Create an RSA key to be used by Hadoop when ssh'ing to localhost:
– ssh-keygen -t rsa -P ""
– cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
– ssh localhost
• Configuration files
– core-site.xml
– mapred-site.xml
– yarn-site.xml
– hdfs-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Start Hadoop
• hdfs namenode -format
• bin/start-all.sh (or start-dfs.sh / start-yarn.sh)
• jps (to verify the running daemons; see the example below)
• bin/stop-all.sh
• Web-based UI
– http://localhost:50070 (Namenode report)
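As a sanity check, in pseudo-distributed mode jps should list all of the Hadoop daemons on the single machine; the output looks roughly like the following (process ids are illustrative):

$ jps
5232 NameNode
5347 DataNode
5513 SecondaryNameNode
5672 ResourceManager
5784 NodeManager
5890 Jps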
Basic File Commands in HDFS
• hdfs dfs -cmd <args> (the older hadoop fs / hadoop dfs forms are equivalent)
• URI: //authority/path
– authority: hdfs://localhost:9000
• Adding files
– hdfs dfs -mkdir
– hdfs dfs -put
• Retrieving files
– hdfs dfs -get
• Deleting files
– hdfs dfs -rm
• hdfs dfs -help ls
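Put together, a small session might look like this (localfile.txt and the /user/hduser paths are illustrative):

$ hdfs dfs -mkdir -p /user/hduser/input                        # create a directory in HDFS
$ hdfs dfs -put localfile.txt /user/hduser/input               # copy a local file into HDFS
$ hdfs dfs -ls /user/hduser/input                              # list the directory
$ hdfs dfs -get /user/hduser/input/localfile.txt ./copy.txt    # copy a file back to the local FS
$ hdfs dfs -rm /user/hduser/input/localfile.txt                # delete a file in HDFS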
Run WordCount
• Create an input directory in HDFS
• Run the wordcount example
– hadoop jar hadoop-examples-2.7.4.jar wordcount /user/hduser/input /user/hduser/output
• Check the output directory
– hdfs dfs -ls -R /user/hduser/output
– http://localhost:50070
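End to end, a run might look like the following; the input file name is illustrative, and in the Apache 2.x tarball the examples jar usually lives under share/hadoop/mapreduce/ as hadoop-mapreduce-examples-<version>.jar (your jar name and location may differ):

$ hdfs dfs -mkdir -p /user/hduser/input
$ hdfs dfs -put mytext.txt /user/hduser/input/
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar \
    wordcount /user/hduser/input /user/hduser/output
$ hdfs dfs -cat /user/hduser/output/part-r-00000 | head    # inspect (word, count) pairs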
Installation Tutorials
• http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
• http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
• http://oreilly.com/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html
• http://snap.stanford.edu/class/cs246-2011/hw_files/hadoop_install.pdf