Joe Hummel, PhD Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois,...
“Introducing Hadoop on Azure: hello Map-Reduce!”
Joe Hummel, PhD
Visiting Researcher: U. of California, Irvine
Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago
Materials: http://www.joehummel.net/downloads.html
Email: [email protected]
Hadoop on Azure 3
A little history…
Map-Reduce is from functional programming
// function returns 1 if i is prime, 0 if not:
let isPrime(i) = ...
// sums 2 numbers:
let sum(x, y) = return x + y
// count the number of primes in 1..N:
let countPrimes(N) =
    let L = [ 1 .. N ]          // [ 1, 2, 3, 4, 5, 6, ... ]
    let T = map isPrime L       // [ 0, 1, 1, 0, 1, 0, ... ]
    let count = reduce sum T    // 42
    return count
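The same idea can be tried in JavaScript. This is a minimal sketch, not the presenter's code: the slide elides the body of isPrime, so the trial-division version here is an assumption made only to get something runnable.

```javascript
// Returns 1 if i is prime, 0 if not.
// Assumed implementation (trial division); the slide leaves the body out.
function isPrime(i) {
  if (i < 2) return 0;
  for (let d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the primes in 1..N: map isPrime over the list, then reduce with sum.
function countPrimes(N) {
  const L = Array.from({ length: N }, (_, k) => k + 1); // [1, 2, 3, ..., N]
  const T = L.map((i) => isPrime(i));                    // [0, 1, 1, 0, 1, 0, ...]
  return T.reduce(sum, 0);
}

console.log(countPrimes(10)); // the primes up to 10 are 2, 3, 5, 7
```

Each call to isPrime is independent of the others, which is what lets the map step be spread across many machines.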
A little more history…
Created by Google to drive internet search
◦ BIG data: scalable to TBs and beyond
◦ Parallelism: to get the performance
◦ Data partitioning: to drive the parallelism
◦ Fault tolerance: at this scale, machines are going to crash, a lot…
[Figure: BIG data → page hits]
Who’s using Hadoop
Search engines: Google, Yahoo, Bing
Facebook
Twitter
Financials
Health industry
Insurance
Credit card companies
Just about any company collecting user data…
Hadoop today
Freely-available framework for big data
◦ http://hadoop.apache.org/
Based on concept of Map-Reduce:
[Figure: BIG data → Map, Map, Map, Map, ... (map function) → Reduce (reduce intermediate results) → R]
Massively-parallel
[Figure: many Mappers running in parallel, feeding a smaller number of Reducers]
Workflow
[Figure: Data → Map → Sort → Merge → Reduce → R, with several Map → Sort stages side by side]
After Map: [ <key1,value>, <key4,value>, <key2,value>, … ]
After Sort: [ <key1,value>, <key1,value>, … ]
After Merge: [ <key1, [value,value,…]>, <key2, [value,value,…]>, … ]
After Reduce: [ <key1, value>, <key2, value>… ] R
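The workflow stages can be imitated in a few lines of plain JavaScript. This is only a single-process sketch for intuition, not how Hadoop is implemented; the names runMapReduce and emit are made up for this illustration.

```javascript
// Runs map over each input record, sorts the emitted pairs by key,
// merges equal keys into <key, [values]>, then reduces each group.
// runMapReduce and emit are hypothetical names, not part of Hadoop.
function runMapReduce(records, map, reduce) {
  // Map: each record may emit any number of [key, value] pairs.
  const pairs = [];
  for (const record of records) {
    map(record, (key, value) => pairs.push([key, value]));
  }

  // Sort: order the pairs by key so equal keys become adjacent.
  pairs.sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));

  // Merge: collapse adjacent pairs with the same key into one group.
  const groups = [];
  for (const [key, value] of pairs) {
    const last = groups[groups.length - 1];
    if (last && last[0] === key) last[1].push(value);
    else groups.push([key, [value]]);
  }

  // Reduce: turn each <key, [values]> group into a single <key, result>.
  return groups.map(([key, values]) => [key, reduce(key, values)]);
}

// Example: count how often each word occurs.
const result = runMapReduce(
  ["b a", "a c a"],
  (line, emit) => line.split(" ").forEach((w) => emit(w, 1)),
  (key, values) => values.reduce((s, v) => s + v, 0)
);
console.log(result);
```

In this sketch everything happens in one loop on one machine; the point of Hadoop is that the Map and Sort stages run on many machines at once and only the merged groups reach the reducers.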
Example
Netflix data-mining…
[Figure: Netflix Movie Reviews (.txt) → Netflix Data Mining App → Average rating…]
movieid,userid,rating,date
1,2390087,3,2005-09-06
217,5567801,5,2006-01-03
42,1121098,3,2006-03-25
1,8972234,5,2003-12-02
...
Netflix Workflow
[Figure: Data → Map → Sort → Merge → Reduce → R, with several Map → Sort stages side by side]
After Map: [ <1,3>, <217,5>, <42,3>, <1,5>, <134,2>, <42,1>, … ]
After Sort: [ <1,3>, <1,5>, <42,3>, <42,1>, <134,2>, <217,5>, … ]
After Merge: [ <1, [3,5]>, <42, [3,1]>, <134, [2, …]>, <217, [5, …]>, … ]
After Reduce: [ <1, 4>, <42, 2>, <134, ?>, … ] R
Netflix map/reduce functions?
To compute average rating for every movie:
// JavaScript version:
var map = function (key, value, context)
{
    var values = value.split(",");
    // field 0 contains movieid, field 2 the rating:
    context.write(values[0], values[2]);
};
var reduce = function (key, values, context)
{
    var sum = 0;
    var count = 0;
    while (values.hasNext()) {
        count++;
        sum += parseInt(values.next());
    }
    context.write(key, sum/count);
};
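To try these two functions without a cluster, one option is a small local harness. The sketch below assumes only what the slide's code itself uses: a context with write(key, value) and a values object with hasNext() and next(). The helpers makeContext and makeIterator are hypothetical stand-ins written for this example, not part of the Hadoop on Azure runtime.

```javascript
// map and reduce as on the slide.
var map = function (key, value, context) {
  var values = value.split(",");
  // field 0 contains movieid, field 2 the rating:
  context.write(values[0], values[2]);
};
var reduce = function (key, values, context) {
  var sum = 0;
  var count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next());
  }
  context.write(key, sum / count);
};

// Hypothetical stand-ins for the runtime: a context that collects the
// written pairs, and an iterator with hasNext()/next() over an array.
function makeContext() {
  const pairs = [];
  return { pairs, write: (k, v) => pairs.push([k, v]) };
}
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

// Sample rows from the slide (header line omitted).
const rows = [
  "1,2390087,3,2005-09-06",
  "217,5567801,5,2006-01-03",
  "42,1121098,3,2006-03-25",
  "1,8972234,5,2003-12-02",
];

// Map every row, then group the emitted ratings by movie id.
const mapped = makeContext();
rows.forEach((row, offset) => map(offset, row, mapped));
const groups = new Map();
for (const [k, v] of mapped.pairs) {
  if (!groups.has(k)) groups.set(k, []);
  groups.get(k).push(v);
}

// Reduce each group to its average rating.
const reduced = makeContext();
for (const [k, vs] of groups) reduce(k, makeIterator(vs), reduced);
console.log(reduced.pairs);
```

With only these four rows, movie 1 averages (3 + 5) / 2 = 4, matching the <1, 4> entry in the workflow figure.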
Traditional use of Hadoop
Upload data to HDFS
◦ Hadoop file system
Write map / reduce functions
◦ default is to use Java
◦ most languages supported: C, C++, C#, JavaScript, Python, …
Compile and upload code
◦ For Java, you upload .jar file
◦ For others, .exe or script
Submit MapReduce job
Wait for job to complete
When to use Hadoop?
Queries against big datasets
Embarrassingly-parallel problems
◦ Solution must fit into map-reduce framework
Non-real-time demands
Hadoop is not for:
◦ Small datasets (< 1GB?)
◦ Sub-second / real-time needs (though clearly Google makes it work)
Data set for demo
We’ll be working with Chicago crime data…
◦ https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
◦ http://www.cityofchicago.org/city/en/narr/foia/CityData.html
Size: 1 GB, 5M rows
Goal?
Compute top-10 crimes…
IUCR    Count
0486    366903
0820    308074
...
0890    166916
IUCR = Illinois Uniform Crime Reporting codes
https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
Hadoop on Azure…
Supports traditional Hadoop usage
◦ Upload data
◦ Write MapReduce program
◦ Submit job
Additional features:
◦ Allows access to persistent data from Azure Storage Vault
◦ Provides interactive JavaScript console
◦ Built-in higher-level query languages (PIG, HIVE)
Demo: Hadoop on Azure
Demo: map reduce functions
// JavaScript version:
var map = function (key, value, context)
{
    var values = value.split(",");
    context.write(values[4], 1);
};
var reduce = function (key, values, context)
{
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
0486    366903
0820    308074
...
Demo: PIG command
// interactive PIG with explicit Map-Reduce functions:
pig.from("asv://datafiles/CC-from-2001.txt").
    mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
    orderBy("Count DESC").
    take(10).
    to("output-from-2001")
// visualize the results:
file = fs.read("output-from-2001/part-r-00000")
data = parse(file.data, "IUCR, Count:long")
graph.bar(data)
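For intuition about what the count, orderBy("Count DESC") and take(10) steps compute, here is a plain JavaScript sketch that does the same thing in memory. It says nothing about how PIG executes the job. The function name topCrimes and the sample rows are made up for this illustration; only the use of field 4 for the IUCR code comes from the slide's map function.

```javascript
// Counts rows per IUCR code (field 4, as in the slide's map function),
// orders the counts descending, and keeps the first n.
// topCrimes is a hypothetical helper, not part of the console's API.
function topCrimes(rows, n) {
  const counts = new Map();
  for (const row of rows) {
    const iucr = row.split(",")[4];
    counts.set(iucr, (counts.get(iucr) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // same effect as orderBy("Count DESC")
    .slice(0, n);                // same effect as take(n)
}

// Made-up rows for illustration only; just the fifth field matters here.
const sampleRows = [
  "x,x,x,x,0486",
  "x,x,x,x,0820",
  "x,x,x,x,0486",
  "x,x,x,x,0890",
  "x,x,x,x,0820",
  "x,x,x,x,0486",
];
console.log(topCrimes(sampleRows, 2));
```

An in-memory Map works for a toy list; at 5M rows and beyond, the counting is the part worth handing to map/reduce, while the final sort over a few hundred codes is cheap.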
Hadoop on Azure
Microsoft is offering free access to Hadoop
◦ Request invitation @ http://www.hadooponazure.com/
Hadoop connector for Excel
◦ Process data using Hadoop, analyze/visualize using Excel
Summary
Hadoop is all about big data processing
◦ Scalable, parallel, fault-tolerant
Easy to understand programming model
◦ Map-Reduce
◦ But then solution must fit into this framework…
Rich ecosystem developing around Hadoop
◦ Technologies: PIG, HIVE, HBase, …
◦ Companies: Cloudera, Hortonworks, MapR, …
Presenter: Joe Hummel
◦ Email: [email protected]
◦ Materials: http://www.joehummel.net/downloads.html
For more info:
◦ http://www.hadooponazure.com/
◦ http://msdn.microsoft.com/en-us/magazine/jj190805.aspx
◦ Overview, including how to access via .NET API: http://www.simple-talk.com/cloud/data-science/analyze-big-data-with-apache-hadoop-on-windows-azure-preview-service-update-3/
Thank you for attending
Hadoop on Azure