Hadoop Week2 PPT

Post on 03-Oct-2015


Transcript of Hadoop Week2 PPT

  • Course Topics

    Week 1 Introduction to HDFS

    Week 2 Setting Up Hadoop Cluster

    Week 3 Map-Reduce Basics, types and

    Week 4 PIG

    Week 5 HIVE

    Week 6 HBASE

    Week 7 ZOOKEEPER

    Week 8 SQOOP

  • Topics for Today


    Hadoop Modes

    Terminal Commands

    Web UI Urls

    Usecase in Healthcare

    Sample example list in Hadoop

    Running Teragen Example

    Hadoop Configuration Files

    Slaves & Masters

    Name Node Recovery

    Dump of MR jobs

    Data Loading Techniques

  • Class 1 - Revision

    HDFS: Hadoop Distributed File System (storage)

    MapReduce (processing)

  • Let's Revise

    1. What is HDFS?

    2. What is the difference between a Hadoop database and a relational database?

    3. What is Namenode?

    4. What is Secondary Namenode?

    5. Gen 1 and Gen 2 Hadoop.

  • Hadoop Modes

    Hadoop can be run in one of three modes:

    Standalone (or local) mode

      no daemons, everything runs in a single JVM

      suitable for running MapReduce programs during development

      has no dfs

    Pseudo-distributed mode

      Hadoop daemons run on the local machine

    Fully distributed mode

      Hadoop daemons run on a cluster of machines
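
    One rough way to tell the three modes apart is to look at the fs.default.name setting. The sketch below is a heuristic for illustration only: the function name hadoop_mode and the example host namenode.example.com are hypothetical, and it assumes the Hadoop 1.x convention that standalone mode uses the local filesystem (file:///) while the other two modes point at an hdfs:// NameNode.

```shell
# Guess the Hadoop mode from an fs.default.name value.
# Heuristic sketch only; not a command that ships with Hadoop.
hadoop_mode() {
  local fs_name="$1"
  case "$fs_name" in
    ""|file:*)                            echo "standalone" ;;
    hdfs://localhost*|hdfs://127.0.0.1*)  echo "pseudo-distributed" ;;
    hdfs://*)                             echo "fully distributed" ;;
    *)                                    echo "unknown" ;;
  esac
}

hadoop_mode "file:///"
hadoop_mode "hdfs://localhost:8020/"
hadoop_mode "hdfs://namenode.example.com:8020/"   # hypothetical host
```

    A cluster whose NameNode runs on a remote host is reported as fully distributed, even if it has only one node, so treat the result as a hint rather than a definitive answer.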

  • Terminal Commands

  • Web UI URLs

    NameNode status: http://localhost:50070/dfshealth.jsp

    JobTracker status: http://localhost:50030/jobtracker.jsp

    TaskTracker status: http://localhost:50060/tasktracker.jsp

    DataBlock Scanner Report: http://localhost:50075/blockScannerReport
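
    The URLs above can be probed from a terminal. The following is a minimal sketch, assuming the default ports listed on this slide and that curl may or may not be installed; the helper name hadoop_ui_url is hypothetical.

```shell
# Build the web UI URL for a daemon, using the ports listed above.
hadoop_ui_url() {
  local daemon="$1" host="${2:-localhost}"
  case "$daemon" in
    namenode)     echo "http://${host}:50070/dfshealth.jsp" ;;
    jobtracker)   echo "http://${host}:50030/jobtracker.jsp" ;;
    tasktracker)  echo "http://${host}:50060/tasktracker.jsp" ;;
    blockscanner) echo "http://${host}:50075/blockScannerReport" ;;
    *) echo "unknown daemon: $daemon" >&2; return 1 ;;
  esac
}

# Report which UIs answer; without curl or a running daemon,
# every entry is reported as not reachable.
for d in namenode jobtracker tasktracker blockscanner; do
  url="$(hadoop_ui_url "$d")"
  if command -v curl >/dev/null 2>&1 &&
     curl -s -o /dev/null --connect-timeout 2 --max-time 2 "$url"; then
    echo "$d: reachable at $url"
  else
    echo "$d: not reachable at $url"
  fi
done
```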


  • Sample Examples List

  • Running the Teragen Example

  • Checking the Output

  • Hadoop Configuration Files

  • Sample Cluster Configuration





    Slave05 DataNode





  • Hadoop Configuration Files

    Configuration filename: Description

    hadoop-env.sh: Environment variables that are used in the scripts to run Hadoop

    core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce

    hdfs-site.xml: Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes

    mapred-site.xml: Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers

    masters: A list of machines (one per line) that each run a secondary namenode

    slaves: A list of machines (one per line) that each run a datanode and a tasktracker

    hadoop-metrics.properties: Properties for controlling how metrics are published in Hadoop

    log4j.properties: Properties for system log files, the namenode audit log and the task log for the tasktracker child process

  • DD for each component

    Core core-site.xml

    HDFS hdfs-site.xml

    MapReduce mapred-site.xml

  • core-site.xml and hdfs-site.xml

    core-site.xml

      fs.default.name = hdfs://localhost:8020/

    hdfs-site.xml

      dfs.replication = 1
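
    As an illustration, the two settings on this slide could be written into the configuration files as follows. This is a sketch only: it writes to a temporary directory (conf_dir is a hypothetical stand-in for the real conf/ directory) so that it can be run without touching an installation.

```shell
# Write minimal core-site.xml and hdfs-site.xml files with the slide's values.
conf_dir="$(mktemp -d)"   # stand-in for the installation's conf/ directory

cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>
EOF

cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

cat "$conf_dir/core-site.xml" "$conf_dir/hdfs-site.xml"
```

    A replication factor of 1 suits a single-machine setup, where there is only one datanode to hold each block.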

  • Defining HDFS details in hdfs-site.xml

    Property: dfs.data.dir

    Value: /disk1/hdfs/data,/di

    Default: ${hadoop.tmp.dir}/dfs/data

    Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

    Property: fs.checkpoint.dir

    Value: /disk1/hdfs/names

    Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
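
    Both properties take a comma-separated list of directories. The sketch below shows one way to split such a list and pre-create each directory before starting the daemons. The function name make_dirs_from_list and the disk2 path are hypothetical, and the directories are created under a temporary base so the example is safe to run.

```shell
# Split a comma-separated directory list (the format used by dfs.data.dir
# and fs.checkpoint.dir) and create each directory in it.
make_dirs_from_list() {
  local list="$1" dir
  local IFS=','
  for dir in $list; do
    [ -n "$dir" ] || continue          # skip empty entries such as "a,,b"
    mkdir -p "$dir" && echo "created $dir"
  done
}

base="$(mktemp -d)"   # temporary root, used here instead of real mount points
make_dirs_from_list "$base/disk1/hdfs/data,$base/disk2/hdfs/data"
```

    Listing one directory per physical disk is the usual reason for giving more than one entry.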


  • mapred-site.xml



  • All Properties

    1. http://hadoop.apache.org/docs/r1.1.2/core-default.html

    2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html

    3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html


  • Defining mapred-site.xml

    Property: mapred.job.tracker

    Value: localhost:

    Description: The hostname and the port that the jobtracker's RPC server runs on. If set to the default value of local, then the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).

    Property: mapred.local.dir

    Value: ${hadoop.tmp.dir}

    Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

    Property: mapred.system.dir

    Value: ${hadoop.tmp.dir}

    Description: The directory relative to fs.default.name where shared files are stored during a job run.

    Value: 2

    Description: The number of map tasks that may be run on a tasktracker at any one time.

    Value: 2

    Description: The number of reduce tasks that may be run on a tasktracker at any one time.
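
    A mapred-site.xml carrying these settings might be generated as sketched below. Several details are assumptions rather than values from the slide: the port 8021 (the slide's value is cut off after "localhost:"), and the property names mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, which are the Hadoop 1.x names matching the two descriptions whose names are not shown. The helper functions property and mapred_site are hypothetical.

```shell
# Print one <property> element for a Hadoop *-site.xml file.
property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print a complete mapred-site.xml document.
mapred_site() {
  echo '<?xml version="1.0"?>'
  echo '<configuration>'
  property mapred.job.tracker "localhost:8021"            # port is an assumption
  property mapred.tasktracker.map.tasks.maximum 2         # assumed property name
  property mapred.tasktracker.reduce.tasks.maximum 2      # assumed property name
  echo '</configuration>'
}

mapred_site
```

    Redirecting the output, for example mapred_site > mapred-site.xml, would produce a file in the same layout as the other site files.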

  • Slaves and masters

    Two files are used by the startup and shutdown commands:

    slaves: contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers

    masters: contains a list of hosts, one per line, that are to host secondary NameNode servers
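
    To make the one-host-per-line format concrete, here is a small sketch of how a startup script might read these files. It only prints what it would do; the host names and the function list_hosts are hypothetical, and skipping comments and blank lines is an assumption about how such a script would treat the file.

```shell
# Print the hosts in a masters/slaves style file, one per line,
# ignoring "#" comments and blank lines.
list_hosts() {
  sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$1"
}

conf_dir="$(mktemp -d)"   # stand-in for the real conf/ directory
printf 'slave01\nslave02\n# slave03 is down\n\nslave04\n' > "$conf_dir/slaves"
printf 'master02\n' > "$conf_dir/masters"

for host in $(list_hosts "$conf_dir/slaves"); do
  echo "would start DataNode and TaskTracker on $host"
done
for host in $(list_hosts "$conf_dir/masters"); do
  echo "would start secondary NameNode on $host"
done
```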

  • Per-process runtime environment

    JVM hadoop-env.sh

    hadoop-env.sh file:

    Set the JAVA_HOME parameter in this file.

    This file also offers a way to provide custom parameters for each of the servers. hadoop-env.sh, located in the conf/ directory of the installation, is sourced by all of the Hadoop Core scripts.
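
    Since JAVA_HOME is the one setting that must be present, a quick sanity check before starting the daemons can save some debugging. The function below is a hypothetical helper, not something shipped with Hadoop; it only verifies that the variable is set and points at a directory containing an executable bin/java.

```shell
# Check that JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no java executable under $JAVA_HOME/bin"
    return 1
  fi
  echo "JAVA_HOME ok: $JAVA_HOME"
}

check_java_home || true   # report the result without aborting the shell
```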

  • hadoop-env.sh

    Examples of environment variables that you can specify: export HADOOP_DATANODE_HEAPSIZE="128


  • hadoop-env.sh Sample

    # Set Hadoop-specific environment variables here.

    # The only required environment variable is JAVA_HOME. All others are
    # optional. When running a distributed configuration it is best to
    # set JAVA_HOME in this file, so that it is correctly defined on
    # remote nodes.

    # The java implementation to use. Required.
    # export JAVA_HOME=/usr/lib/j2sdk1.6-sun

    # Extra Java runtime options. Empty by default.
    export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"

    .. .. ..

    # A string representing this instance of hadoop. $USER by default.
    export HADOOP_IDENT_STRING=$USER

  • Namenode Recovery

    1 Shut down the secondary NameNode

    2 Copy secondary:fs.checkpoint.dir to Namenode:dfs.name.dir

    3 Copy secondary:fs.checkpoint.edits.dir to Namenode:dfs.name.edits.dir

    4 When the copy completes, start the NameNode and restart the secondary NameNode
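
    Reading steps 2 and 3 as copy operations from the secondary NameNode's directories to the NameNode's directories, the core of the recovery can be sketched as below. This is a local simulation under stated assumptions: temporary directories stand in for the two machines, the file names under current/ are placeholders, and restore_from_checkpoint is a hypothetical helper. On a real cluster the copy would cross hosts (for example with scp or rsync) and the daemons would be stopped and started around it.

```shell
# Copy the full contents of a checkpoint directory into a name directory.
restore_from_checkpoint() {
  local checkpoint_dir="$1" name_dir="$2"
  [ -d "$checkpoint_dir" ] || { echo "missing $checkpoint_dir" >&2; return 1; }
  mkdir -p "$name_dir"
  cp -Rp "$checkpoint_dir"/. "$name_dir"/   # "/." copies contents, not the directory itself
}

work="$(mktemp -d)"
# Simulated fs.checkpoint.dir on the secondary NameNode (placeholder files).
mkdir -p "$work/secondary/checkpoint/current"
echo "image" > "$work/secondary/checkpoint/current/fsimage"
echo "edits" > "$work/secondary/checkpoint/current/edits"

# Simulated dfs.name.dir on the NameNode.
restore_from_checkpoint "$work/secondary/checkpoint" "$work/namenode/name"
ls "$work/namenode/name/current"
```

    Anything written to HDFS after the last checkpoint is not in the copied files, so the recovered NameNode reflects the state at checkpoint time.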

  • Reporting

    hadoop-metrics.properties

    This file controls the reporting of metrics

    The default is not to report


  • Dump of a MR Job

  • Data Loading Techniques

    Using Hadoop Copy Commands

    Using Flume

    Using Sqoop



    Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.


    Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
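
    For the first technique, the Hadoop copy commands, the sketch below builds the command line for uploading a local file with hadoop fs -put or hadoop fs -copyFromLocal. It only prints the command so that it runs on a machine without Hadoop installed; the function name hdfs_copy_cmd and both paths are hypothetical.

```shell
# Build (but do not run) a "hadoop fs" copy command.
# $1 is the operation: put or copyFromLocal.
hdfs_copy_cmd() {
  local op="$1" src="$2" dst="$3"
  case "$op" in
    put|copyFromLocal) printf 'hadoop fs -%s %s %s\n' "$op" "$src" "$dst" ;;
    *) echo "unsupported operation: $op" >&2; return 1 ;;
  esac
}

hdfs_copy_cmd put /tmp/events.log /user/demo/events.log
hdfs_copy_cmd copyFromLocal /tmp/events.log /user/demo/events.log

if command -v hadoop >/dev/null 2>&1; then
  echo "hadoop is installed; the commands above can be run as printed"
else
  echo "hadoop is not installed; commands shown for reference only"
fi
```

    Copy commands suit files that already exist on local disk; Flume and Sqoop, described above, cover streaming event data and relational databases respectively.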


  • Assignment for this Week

    Attempt the following assignment using the document present in the LMS under the tab Week 2:

    Flume Set-up on Cloudera

    Attempt Assignment Week 2

    Refresh your Java Skills using Java for Hadoop Tutorial on LMS

  • Ask your doubts

    Q & A..?

  • Thank You See You in Class Next Week