MapReduce in Action
Team 306, led by
Chen Lin
College of Information Science and Technology
Data Mining Group @ Xiamen University
1. Basic MapReduce Programs
2. Advanced MapReduce
3. Beyond the horizon
4. Discussion
Contents
[Figure labels: Job, Configuration, Master (JobTracker)]
Basic MapReduce Programs
Implement Interface
Environment Configuration
Basic MapReduce Programs
Job Configuration?
Java Class
Interface
Combiner, InputFormat, OutputFormat
Mapper, Reducer, Partitioner
Configure
JVM: mapred.child.java.opts
{mapred.local.dir}
InputPath, OutputPath
How many Map/Reduce tasks?
Basic MapReduce Program
InputFormat → Map → Reduce → OutputFormat
[Figure labels: Text; InputSplit; <K1,V2>; K1,List<V1>; List<K1,V1>]
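The data flow on this slide can be sketched in a few lines of plain Python. This is only an illustration of the InputFormat → Map → Reduce → OutputFormat stages, not the Hadoop API; the function names (`map_fn`, `reduce_fn`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_fn(offset, line):
    # Input record: (byte offset, line text). Emit one (word, 1) pair per word.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Receives one key together with the list of all values grouped under it.
    yield word, sum(counts)

def run_job(lines):
    # "InputFormat" stand-in: turn raw text into (offset, line) records.
    records, offset = [], 0
    for line in lines:
        records.append((offset, line))
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    # Map phase.
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # Shuffle and sort: group values by key, visit keys in sorted order.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase; the "OutputFormat" here is simply a Python list.
    return [out for k in sorted(groups) for out in reduce_fn(k, groups[k])]

print(run_job(["to be or", "not to be"]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In a real job the framework performs the grouping step across machines; only the map and reduce functions are user code.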
Basic MapReduce
Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.
A partitioner determines which reducer is responsible for processing a particular key; the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.
PARTITIONERS AND COMBINERS
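A minimal sketch of both ideas in plain Python, assuming a word-count style job. The `partition` function uses `zlib.crc32` as a deterministic stand-in for a hash partitioner, and `combine` sums counts on one mapper's output; both names are hypothetical and neither is Hadoop's own implementation.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    # Deterministic stand-in for a hash partitioner:
    # the same key is always routed to the same reducer index.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def combine(pairs):
    # Local aggregation on a single mapper's output: sum counts per key.
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(mapper_output)  # 4 pairs shrink to 2 before the shuffle

# Route each combined pair to its reducer bucket.
buckets = {}
for key, count in combined:
    buckets.setdefault(partition(key, 3), []).append((key, count))
```

The combiner reduces how much data is shuffled; the partitioner decides where each remaining pair goes.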
Basic MapReduce Program: InputFormat
Built-in InputFormats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, NLineInputFormat
Creating a custom InputFormat
• TextInputFormat - Each line in the text files is a record. The key is the byte offset of the line, and the value is the content of the line.
• KeyValueTextInputFormat - Each line in the text files is a record. The first separator character divides each line: everything before the separator is the key, and everything after it is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character.
• NLineInputFormat - Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
InputFormat
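To make the difference between these record formats concrete, here is a rough plain-Python imitation of how each one turns the same bytes into records. The function names are hypothetical, the input is assumed to use `\n` line endings with no trailing newline, and in this sketch the last split may hold fewer than N lines when the line count is not a multiple of N.

```python
def text_input_format(data):
    # Key = byte offset of the line, value = content of the line.
    offset = 0
    for raw in data.split(b"\n"):
        yield offset, raw.decode("utf-8")
        offset += len(raw) + 1  # +1 for the newline separator

def key_value_text_input_format(data, separator="\t"):
    # Split each line at the FIRST separator: key before it, value after it.
    for raw in data.split(b"\n"):
        key, _, value = raw.decode("utf-8").partition(separator)
        yield key, value

def n_line_splits(data, n=1):
    # Same records as text_input_format, grouped into splits of n lines.
    records = list(text_input_format(data))
    return [records[i:i + n] for i in range(0, len(records), n)]

data = b"a\t1\nbb\t2\tx\nc\t3"
text_records = list(text_input_format(data))
kv_records = list(key_value_text_input_format(data))
splits = n_line_splits(data, n=2)
```

Note how the second line keeps its second tab inside the value, because only the first separator divides the line.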
Basic MapReduce Program
types for the key/value pairs
Summary for Basic Programs
What's a complete MapReduce job?
Code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters.
The execution framework handles everything else.
Chaining MapReduce jobs
LOCAL AGGREGATION
SECONDARY SORTING
Work on Hadoop Files
Advanced MapReduce
So far, you've been doing data processing tasks that a single MapReduce job can accomplish.
But as you get more comfortable writing MapReduce programs and take on more ambitious data processing tasks, you'll find that many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce job.
Chaining MapReduce jobs
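As a rough illustration of chaining, the sketch below runs two jobs in sequence in plain Python: the output of a word-count job becomes the input of a second job that groups words by their count. The `run_job` helper and the two example jobs are assumptions made for this sketch, not Hadoop classes.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    # One simulated MapReduce job: map, group by key, reduce in key order.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reducer(k, groups[k]))
    return out

# Job 1: count how often each word occurs.
def count_map(_offset, line):
    for word in line.split():
        yield word, 1

def count_reduce(word, ones):
    yield word, sum(ones)

# Job 2: group words by their count.
def invert_map(word, count):
    yield count, word

def invert_reduce(count, words):
    yield count, sorted(words)

lines = [(0, "a b a"), (6, "c b a")]
word_counts = run_job(lines, count_map, count_reduce)
by_count = run_job(word_counts, invert_map, invert_reduce)  # chained on job 1
```

Each job stays simple; the complexity lives in how the jobs are wired together.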
In Hadoop, intermediate results are written to local disk before being sent over the network.
Reductions in the amount of intermediate data translate into increases in algorithmic efficiency.
With the use of combiners, it is possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.
LOCAL AGGREGATION
Pseudo-code for computing the mean of values associated with the same string.
LOCAL AGGREGATION: Is it right?
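A small numeric check suggests why the answer is "no" when the mean-computing reducer is simply reused as a combiner: the mean of partial means is generally not the overall mean. The values and the two-mapper layout below are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

# Values for ONE key, spread unevenly over two mappers (hypothetical data).
mapper_a = [1, 2, 3, 4]
mapper_b = [10]

# What the job should compute: the mean over all values for the key.
true_mean = mean(mapper_a + mapper_b)  # 20 / 5

# What happens if each mapper's combiner emits a local mean
# and the reducer then averages those local means.
mean_of_means = mean([mean(mapper_a), mean(mapper_b)])  # (2.5 + 10) / 2

print(true_mean, mean_of_means)  # 4.0 6.25
```

The two results only coincide when every mapper happens to hold the same number of values for the key.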
1. Combiners must have the same input and output key-value types.
2. Combiners are optimizations; they must not change the correctness of the algorithm.
Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times.
LOCAL AGGREGATION
LOCAL AGGREGATION: the right usage
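The slide's own pseudo-code is not preserved in this transcript. One common way to make a combiner safe for computing a mean, sketched here in plain Python under that assumption, is to have the mapper emit (sum, count) pairs so that the combiner's input and output types match and the result is the same no matter how many times the combiner runs. All function names are hypothetical.

```python
def map_fn(key, value):
    # Emit a partial (sum, count) pair instead of the raw value.
    yield key, (value, 1)

def combine(key, pairs):
    # Input and output are both (sum, count) pairs, so this can run
    # zero, one, or many times without changing the final answer.
    pairs = list(pairs)
    yield key, (sum(s for s, _ in pairs), sum(c for _, c in pairs))

def reduce_fn(key, pairs):
    pairs = list(pairs)
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield key, total / count

def run(mapper_inputs, combiner_passes):
    # Simulate several mappers, each applying the combiner a given number of times.
    shuffled = []
    for values in mapper_inputs:
        pairs = [p for v in values for _, p in map_fn("k", v)]
        for _ in range(combiner_passes):
            pairs = [p for _, p in combine("k", pairs)]
        shuffled.extend(pairs)
    return next(reduce_fn("k", shuffled))[1]

data = [[1, 2, 3, 4], [10]]
results = [run(data, passes) for passes in (0, 1, 2)]
print(results)  # [4.0, 4.0, 4.0]
```

Only the reducer performs the division, which is what keeps the combiner a pure optimization.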
Sometimes we also need to sort by value:
(k1, m1, v8) (k1, m2, v1) (k1, m3, v7) ... (k2, m1, v2) (k2, m2, v6) (k2, m3, v9)
k1 → (m1, v8)
(k1, m1) → (v8)
SECONDARY SORTING
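The idea hinted at on this slide is to move part of the value into the key, so that the framework's own sort orders the values. Below is a minimal plain-Python sketch of that composite-key approach using the slide's sample tuples; the in-memory sort and `groupby` stand in for the shuffle, and grouping on the first key component stands in for a custom partitioner and grouping rule.

```python
from itertools import groupby

# Unsorted input records (k, m, v), taken from the slide's example.
records = [("k1", "m3", "v7"), ("k2", "m2", "v6"), ("k1", "m1", "v8"),
           ("k2", "m1", "v2"), ("k1", "m2", "v1"), ("k2", "m3", "v9")]

# Value-to-key conversion: (k, m, v) becomes ((k, m), v).
pairs = [((k, m), v) for k, m, v in records]

# The framework sorts by the full composite key (k, m).
pairs.sort(key=lambda kv: kv[0])

# Group on the natural key k only, so each reducer call sees
# all of k's values already ordered by m.
grouped = {
    k: [(composite[1], v) for composite, v in group]
    for k, group in groupby(pairs, key=lambda kv: kv[0][0])
}
```

The reducer no longer has to buffer and sort values in memory, which matters when one key has many values.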
It's a shame: the topics I will talk about next play an important role in MapReduce, but they are beyond my horizon.
So I need all your help to master them together.
Beyond the horizon
Beyond the horizon
Create a custom InputFormat
Manipulate local files
Create a custom Partitioner
Pipes for C++
Streaming for other languages
Beyond the horizon
Joining data from different sources
Hive
Pig
HBase
Multiple file output
Joining data from different sources
Customers file: CSV format; record fields: (Customer ID, Name, Phone Number)
Orders file: CSV format; record fields: (Customer ID, Order ID, Price, Purchase Date)
Customers:
Joey Leung,555-555-55
Edward,123-456-7890
Jose Madriz,281-330-8004
David Stork,408-555-0000
…
Orders:
A,12.95,02-Jun-2008
B,88.25,20-May-2008
C,32.00,30-Nov-2007
D,25.02,22-Jan-2009
Joining data from different sources
Joined output:
Joey Leung,555-555-5555,B,88.25,20-May-2008
Edward,123-456-7890,C,32.00,30-Nov-2007
Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
Jose Madriz,281-330-8004,D,25.02,22-Jan-2009
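One way to produce such a result is a reduce-side join, sketched here in plain Python. The mapper tags each record with its source and keys it by Customer ID; the reducer pairs every customer record with every order sharing that ID. The customer IDs (1 to 4) are assumptions chosen to be consistent with the joined output, since the sample rows do not show them, and the function names are hypothetical.

```python
from collections import defaultdict

# Customer IDs in the first column are hypothetical (assumed for this sketch).
customers = ["1,Joey Leung,555-555-5555", "2,Edward,123-456-7890",
             "3,Jose Madriz,281-330-8004", "4,David Stork,408-555-0000"]
orders = ["3,A,12.95,02-Jun-2008", "1,B,88.25,20-May-2008",
          "2,C,32.00,30-Nov-2007", "3,D,25.02,22-Jan-2009"]

def map_fn(source, line):
    # Key by the join field (Customer ID) and tag the rest with its source.
    customer_id, rest = line.split(",", 1)
    yield customer_id, (source, rest)

def reduce_fn(customer_id, tagged):
    # Separate the two sources, then emit every customer/order combination.
    cust = [rest for source, rest in tagged if source == "customers"]
    ords = [rest for source, rest in tagged if source == "orders"]
    for c in cust:
        for o in ords:
            yield f"{c},{o}"

def join(customers, orders):
    groups = defaultdict(list)
    for source, lines in (("customers", customers), ("orders", orders)):
        for line in lines:
            for key, value in map_fn(source, line):
                groups[key].append(value)
    return [row for key in sorted(groups) for row in reduce_fn(key, groups[key])]

joined = join(customers, orders)
print("\n".join(joined))
```

A customer with no orders (David Stork here) produces no output row, which makes this an inner join.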
Thank you!
Data Mining Group @ Xiamen University