Tools and examples of MapReduce jobs with Apache Hadoop
Large Scale Math with Hadoop MapReduce
Tsz-Wo (Nicholas) Sze, PhD
Hadoop Summit, June 29, 2011

Who am I?
• Hortonworks Software Engineer
• Apache Hadoop PMC Member
• Mathematician
▶ Interests:
★ Distributed Computing
★ Algorithms
★ Number Theory

Agenda
• Introduction
• Integer Multiplication
• MapReduce-FFT
• MapReduce-Sum
• MapReduce-SSA
• A New World Record
• The “Machine” Behind the Computation
Tsz-Wo Sze, Hadoop Summit 2011 3
Typical Hadoop Applications
▶ Major applications of Hadoop include
• Search and crawling
• Text processing
• Machine learning
• ...
▶ But not yet commonly used in scientific or mathematical applications.
Why?

Why Not Math?
▶ No MapReduce math libraries available, and
▶ More fundamentally, MapReduce math algorithms are not well studied.

Existing Library
▶ Really no MapReduce Math Library?
Not exactly.
▶ Apache Mahout
• A machine learning library.
• Includes packages for matrix operations.
▶ Apache Hama (Incubation)
• A matrix computational package.

Computationally Intensive Problems (1)
▶ Integer Factoring
• a.k.a. breaking the RSA cryptosystem: given N, e and c, compute m such that
$$c \equiv m^e \pmod{N},$$
where N is a product of two primes.
• A 768-bit RSA modulus was factored¹ in 2009.
¹ Kleinjung et al., Factorization of a 768-bit RSA modulus, CRYPTO 2010.

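To make the link between factoring and breaking RSA concrete, here is a minimal sketch on a toy modulus. The function names and the numbers (N = 61 · 53, e = 17, m = 65) are illustrative assumptions and are not from the talk; trial division is used only because the modulus is tiny.

```python
def factor_semiprime(n):
    """Trial division: fine for a toy modulus, hopeless for real RSA sizes."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p, n // p
        p += 1
    raise ValueError("n has no non-trivial factor")


def recover_message(n, e, c):
    """Given N, e and c, compute m with c = m^e (mod N) by factoring N."""
    p, q = factor_semiprime(n)
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e; needs Python 3.8+
    return pow(c, d, n)


# Toy parameters (assumed for illustration only).
N, e, m = 61 * 53, 17, 65
c = pow(m, e, N)
print(recover_message(N, e, c))
```

Once the two prime factors are known, the private exponent follows immediately, which is why the difficulty of factoring N is the whole security argument.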
Computationally Intensive Problems (2)
▶ Solving PDEs (Partial Differential Equations)
• Fluid dynamics
• Electromagnetism
• Financial analysis
• ...
(Figure: two-dimensional turbulence, courtesy of Y.K. Tsang)

Computationally Intensive Problems (3)
▶ Finding complex zeros of the Riemann zeta function
$$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s} \quad \text{for } s \in \mathbb{C},\ \Re(s) > 1,$$
and then analytically continued to all $s \neq 1$.
• Disprove the Riemann Hypothesis (RH)
Then, you will get $1,000,000².
However, RH is unlikely to be false.
• More likely:
Obtain more evidence which supports RH.
² See http://www.claymath.org/millennium/Riemann_Hypothesis/.

Computationally Intensive Problems (4)
▶ Computing π
Latest world records:
• Five trillion decimal digits (August 2010)
★ by Alexander Yee & Shigeru Kondo³
• The two quadrillionth bit (July 2010)
★ by Tsz-Wo Sze & the Yahoo! Cloud Computing Team⁴
³ See http://www.numberworld.org/misc_runs/pi-5t/announce_en.html
⁴ See http://developer.yahoo.net/blogs/hadoop/2010/09/two_quadrillionth_bit_pi.html

Missing Functionalities
▶ Fast Fourier Transform (FFT): the basic routine behind many algorithms.
▶ Arbitrary Precision Arithmetic
★ Integer functions
★ Floating-point functions
★ Complex functions
▶ ...

Why Integer Multiplication?
▶ There exist fast algorithms.
▶ Many applications
• Division
• Logarithm
• Trigonometric functions
• ...

Prerequisite of Algorithms
(D.J. Bernstein, Fast multiplication and its applications, ANTS 2008.)

Integer Multiplication Algorithms
▶ Naïve, $O(N^2)$
▶ Karatsuba, $O(N^{\log_2 3}) = O(N^{1.585})$
▶ Toom-Cook, $O(N^{\log(2D-1)/\log D})$
If $D = 3$, then $O(N^{\log 5/\log 3}) = O(N^{1.465})$
▶ FFT-based algorithms, $O(N \log N \cdots)$

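As an illustration of where the exponent $\log_2 3$ comes from, the following is a minimal sketch of Karatsuba multiplication: each call makes three half-size recursive products instead of four. The function name and the 32-bit base-case threshold are assumptions chosen for the example, not taken from the talk.

```python
def karatsuba(x, y):
    """Multiply non-negative integers using three half-size products per level."""
    if x < 2**32 or y < 2**32:
        return x * y  # base case: fall back to built-in multiplication
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask  # x = x1 * 2^half + x0
    y1, y0 = y >> half, y & mask  # y = y1 * 2^half + y0
    low = karatsuba(x0, y0)
    high = karatsuba(x1, y1)
    # (x0 + x1)(y0 + y1) - low - high equals x0*y1 + x1*y0
    mid = karatsuba(x0 + x1, y0 + y1) - low - high
    return (high << (2 * half)) + (mid << half) + low


a, b = 3**200 + 7, 7**150 + 11
print(karatsuba(a, b) == a * b)
```

Three subproblems of half the size give the recurrence T(N) = 3T(N/2) + O(N), which solves to $O(N^{\log_2 3})$.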
FFT-based Algorithms
▶ Basic FFT, $O(N \log N \log\log N \log\log\log N \cdots)$
▶ Schönhage-Strassen, $O(N \log N \log\log N)$
▶ Nussbaumer, $O(N \log N \log\log N)$
▶ Fürer, $O(N (\log N)\, 2^{\log^* N})$
▶ De-Kurur-Saha-Saptharishi, $O(N (\log N)\, 2^{\log^* N})$

Convolution
▶ By the convolution theorem,
$$a \times b = \mathrm{dft}^{-1}(\mathrm{dft}(a) * \mathrm{dft}(b)),$$
where
$\times$ denotes the convolution operator,
$*$ denotes componentwise multiplication,
$\mathrm{dft}(\cdot)$ denotes the discrete Fourier transform.

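The convolution theorem can be checked numerically on small inputs. The sketch below assumes a naive $O(n^2)$ DFT over the complex numbers and a cyclic convolution; both choices, and all function names, are assumptions made for illustration (the talk itself works with DFTs modulo an integer).

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform over the complex numbers."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]
    return [v / n for v in out] if inverse else out


def cyclic_convolution(a, b):
    """Direct cyclic convolution, used as the reference answer."""
    n = len(a)
    return [sum(a[k] * b[(j - k) % n] for k in range(n)) for j in range(n)]


def convolution_via_dft(a, b):
    """dft^{-1}(dft(a) * dft(b)), rounded back to integers."""
    product = [u * v for u, v in zip(dft(a), dft(b))]
    return [round(v.real) for v in dft(product, inverse=True)]


a, b = [1, 2, 3, 0], [4, 5, 6, 0]
print(convolution_via_dft(a, b) == cyclic_convolution(a, b))
```

Rounding is safe here only because the inputs are small integers; for very large operands floating-point error grows, which is one reason to use exact arithmetic modulo an integer instead.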
Schönhage-Strassen Algorithm (SSA)
▶ Represent integers as polynomials. Then, compute convolution with DFTs modulo an integer⁵.
⁵ It has the form $2^n + 1$ and is called the Schönhage-Strassen modulus.

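SSA Steps
▶ Step 1: two DFTs,
$\hat{a} \stackrel{\mathrm{def}}{=} \mathrm{dft}(a)$ and $\hat{b} \stackrel{\mathrm{def}}{=} \mathrm{dft}(b)$;
▶ Step 2: componentwise multiplication,
$\hat{p} \stackrel{\mathrm{def}}{=} \hat{a} * \hat{b}$;
▶ Step 3: a DFT inverse,
$p = \mathrm{dft}^{-1}(\hat{p})$;
▶ Step 4: normalization.
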
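The four steps can be sketched on a toy scale as follows. This is not the full recursive Schönhage-Strassen algorithm: it is a single-level illustration that assumes a modulus $2^{32} + 1$, a 16-point naive DFT, and 8-bit digits, chosen so that every convolution coefficient stays below the modulus. It relies on the fact that 2 has multiplicative order $2n$ modulo $2^n + 1$. All names and parameter values are assumptions made for the example.

```python
def dft_mod(x, omega, mod):
    """Naive O(D^2) DFT over the integers modulo `mod` with root of unity `omega`."""
    d = len(x)
    return [
        sum(x[k] * pow(omega, j * k, mod) for k in range(d)) % mod
        for j in range(d)
    ]


def ssa_multiply(a, b, n=32, d=16, w=8):
    """Multiply a and b (each below 2^(w*d/2)) with DFTs modulo 2^n + 1."""
    mod = (1 << n) + 1
    omega = pow(2, 2 * n // d, mod)  # 2^n = -1, so 2 has order 2n modulo 2^n + 1
    limit = 1 << (w * d // 2)
    assert 0 <= a < limit and 0 <= b < limit
    mask = (1 << w) - 1
    # Polynomial representation: little-endian w-bit digits, zero-padded to length d.
    xa = [(a >> (w * k)) & mask for k in range(d)]
    xb = [(b >> (w * k)) & mask for k in range(d)]
    # Step 1: two DFTs.
    fa, fb = dft_mod(xa, omega, mod), dft_mod(xb, omega, mod)
    # Step 2: componentwise multiplication.
    fp = [(u * v) % mod for u, v in zip(fa, fb)]
    # Step 3: inverse DFT (inverse root, then divide by d).
    inv_omega, inv_d = pow(omega, -1, mod), pow(d, -1, mod)
    p = [(v * inv_d) % mod for v in dft_mod(fp, inv_omega, mod)]
    # Step 4: normalization, i.e. evaluate the polynomial at 2^w (propagates carries).
    return sum(c << (w * k) for k, c in enumerate(p))


x, y = 0xFFFFFFFFFFFFFFFF, 0xFEDCBA9876543210
print(ssa_multiply(x, y) == x * y)
```

With eight 8-bit digits per operand, each product coefficient is at most 8 · 255² = 520,200, comfortably below $2^{32} + 1$, so the values recovered modulo the modulus are the true coefficients.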
Calculating DFTs
▶ A DFT can be calculated by a family of algorithms called the Fast Fourier Transform (FFT).

FFT Family
▶ Recursive-FFT
▶ Parallel-FFT
▶ Cooley-Tukey (decimation-in-time)
▶ Gentleman-Sande (decimation-in-frequency)
▶ Danielson-Lanczos
▶ Ping-pong FFT
▶ ...

Data Model (1)
▶ Need a data model which allows accessing terabit integers efficiently.
▶ An integer x is represented as a D-dimensional tuple
$$x = (x_{D-1}, x_{D-2}, \ldots, x_0).$$

Data Model (2)
▶ Write
$$D = IJ,$$
where I and J are powers of two.
▶ Define J-dimensional tuples
$$x^{(i)} \stackrel{\mathrm{def}}{=} (x_{(J-1)I+i}, x_{(J-2)I+i}, \ldots, x_i)$$
for $0 \le i < I$.

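Data Model (3)
▶ Then,
$$\begin{pmatrix} x^{(0)} \\ x^{(1)} \\ \vdots \\ x^{(I-1)} \end{pmatrix} = \begin{pmatrix} x_{(J-1)I} & x_{(J-2)I} & \cdots & x_0 \\ x_{(J-1)I+1} & x_{(J-2)I+1} & \cdots & x_1 \\ \vdots & \vdots & \ddots & \vdots \\ x_{(J-1)I+(I-1)} & x_{(J-2)I+(I-1)} & \cdots & x_{I-1} \end{pmatrix}$$
▶ We call it the (I, J)-format of x.
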
Data Model (4)
▶ Each $x^{(i)}$ is a sequence of J records.
▶ Each record is a key-value pair.
Record #    <Key, Value>
0           <$i$, $x_i$>
1           <$I + i$, $x_{I+i}$>
...         ...
J − 1       <$(J-1)I + i$, $x_{(J-1)I+i}$>

Data Model(5)
▶ Thus, an integer is stored as I SequenceFiles in HDFS; each SequenceFile contains J records.

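The (I, J)-format described above can be sketched in a few lines. In this illustration each in-memory list stands in for one SequenceFile, and each (key, value) tuple for one record; the function name and the use of plain Python lists are assumptions, not the prototype's actual storage code.

```python
def to_ij_format(x, I, J):
    """Split components x[0..D-1] (x[k] is x_k) into I sequences of J records.

    Sequence i holds the records <j*I + i, x_{j*I + i}> for j = 0, ..., J-1,
    i.e. every component whose index is congruent to i modulo I.
    """
    assert len(x) == I * J
    return [[(j * I + i, x[j * I + i]) for j in range(J)] for i in range(I)]


# D = 8 components split as I = 2 "files" of J = 4 records each.
components = [10 * k for k in range(8)]
for i, records in enumerate(to_ij_format(components, 2, 4)):
    print(i, records)
```

The stride-I layout means each sequence is exactly the input of one inner J-point DFT in the parallel FFT, so no data needs to be rearranged before the first step.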
Parallel-FFT Steps
▶ Step 1: I inner DFTs with J-point,
$\hat{a}^{(i)} = \mathrm{dft}(a^{(i)})$;
▶ Step 2: componentwise shifting,
$z_{jI+i} \stackrel{\mathrm{def}}{=} \zeta^{ij} \hat{a}^{(i)}_j$;
▶ Step 3: transposition,
$z^{[j]} \stackrel{\mathrm{def}}{=} (z_{jI+(I-1)}, z_{jI+(I-2)}, \ldots, z_{jI})$;
▶ Step 4: J outer DFTs with I-point,
$\hat{z}^{[j]} \stackrel{\mathrm{def}}{=} \mathrm{dft}(z^{[j]})$.

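The four steps can be checked against a direct DFT on a small example. The sketch below assumes complex arithmetic with $\zeta = e^{-2\pi i/D}$, stores tuples in ascending index order (the slides list them in descending order), and assumes that component $l$ of the $j$-th outer DFT is coefficient $j + Jl$ of the full transform. These conventions and all names are assumptions made for illustration.

```python
import cmath


def dft(x, root):
    """Naive DFT: out[j] = sum_k x[k] * root^(j*k)."""
    n = len(x)
    return [sum(x[k] * root ** (j * k) for k in range(n)) for j in range(n)]


def parallel_fft(a, I, J):
    """D-point DFT (D = I*J) computed as I inner J-point and J outer I-point DFTs."""
    D = I * J
    assert len(a) == D
    zeta = cmath.exp(-2j * cmath.pi / D)
    # (I, J)-format: a^(i) holds the components with index congruent to i mod I.
    rows = [[a[m * I + i] for m in range(J)] for i in range(I)]
    # Step 1: I inner DFTs with J points (the root of unity is zeta^I).
    inner = [dft(row, zeta ** I) for row in rows]
    # Step 2: componentwise shifting by zeta^(i*j).
    z = [[zeta ** (i * j) * inner[i][j] for j in range(J)] for i in range(I)]
    # Step 3: transposition, gathering z^[j] = (z_{jI+i}) over i.
    cols = [[z[i][j] for i in range(I)] for j in range(J)]
    # Step 4: J outer DFTs with I points (the root of unity is zeta^J).
    outer = [dft(col, zeta ** J) for col in cols]
    result = [0j] * D
    for j in range(J):
        for l in range(I):
            result[j + J * l] = outer[j][l]
    return result


a = [1, 2, 3, 4, 5, 6, 7, 8]
direct = dft(a, cmath.exp(-2j * cmath.pi / 8))
print(all(abs(u - v) < 1e-9 for u, v in zip(parallel_fft(a, 2, 4), direct)))
```

The inner and outer DFTs are independent of one another within each step, which is what makes the decomposition suitable for parallel execution.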
MapReduce Model
(Diagram: Input → Map1, Map2, Map3, Map4 → Shuffle → Reduce1, Reduce2, Reduce3, Reduce4 → Output)

MapReduce-FFT
(Diagram: Input → Inner FFT1, Inner FFT2, Inner FFT3, Inner FFT4 → Transposition (by shuffle) → Outer FFT1, Outer FFT2, Outer FFT3, Outer FFT4 → Output)

Data Locality
▶ The FFT transposition, for which preserving locality is traditionally difficult, becomes trivial in MapReduce.

MapReduce-FFT (1)
▶ Map function:
$(k_1, v_1) \longrightarrow \mathrm{list}\langle k_2, v_2\rangle$
Algorithm 1 (Forward FFT, Mapper).
(f.m.1) read key $i$, value $a^{(i)}$;
(f.m.2) calculate a J-point DFT;
(f.m.3) componentwise multiply;
(f.m.4) for $0 \le j < J$, emit key $j$, value $(i, z_{jI+i})$.

MapReduce-FFT (2)
▶ Reduce function:
$(k_2, \mathrm{list}\langle v_2\rangle) \longrightarrow \mathrm{list}\langle k_3, v_3\rangle$.
Algorithm 2 (Forward FFT, Reducer).
(f.r.1) receive key $j$, list $[(i, z_{jI+i})]_{0 \le i < I}$;
(f.r.2) calculate an I-point DFT;
(f.r.3) write key $j$, value $\hat{z}^{[j]}$.

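Algorithms 1 and 2 can be simulated in memory to see how the shuffle performs the transposition. The sketch below is not Hadoop code: the mapper, reducer, and driver are plain Python functions with hypothetical names, the shuffle is a dictionary grouping values by key, and complex arithmetic with $\zeta = e^{-2\pi i/D}$ is assumed in place of modular arithmetic.

```python
import cmath
from collections import defaultdict


def dft(x, root):
    """Naive DFT: out[j] = sum_k x[k] * root^(j*k)."""
    n = len(x)
    return [sum(x[k] * root ** (j * k) for k in range(n)) for j in range(n)]


def fft_mapper(i, a_i, I, zeta):
    """Algorithm 1: J-point DFT, componentwise multiply, emit (j, (i, z_{jI+i}))."""
    a_hat = dft(a_i, zeta ** I)
    for j in range(len(a_i)):
        yield j, (i, zeta ** (i * j) * a_hat[j])


def fft_reducer(j, values, J, zeta):
    """Algorithm 2: order the received values by i, then take an I-point DFT."""
    ordered = [z for _, z in sorted(values, key=lambda pair: pair[0])]
    return j, dft(ordered, zeta ** J)


def run_forward_fft(a, I, J):
    """Drive the simulated job: map, shuffle (group by key), reduce."""
    zeta = cmath.exp(-2j * cmath.pi / (I * J))
    shuffle = defaultdict(list)
    for i in range(I):
        a_i = [a[m * I + i] for m in range(J)]  # contents of the i-th input sequence
        for key, value in fft_mapper(i, a_i, I, zeta):
            shuffle[key].append(value)
    return dict(fft_reducer(j, shuffle[j], J, zeta) for j in range(J))


a = [3, 1, 4, 1, 5, 9, 2, 6]
out = run_forward_fft(a, I=2, J=4)
full = dft(a, cmath.exp(-2j * cmath.pi / 8))
print(all(abs(out[j][l] - full[j + 4 * l]) < 1e-9 for j in range(4) for l in range(2)))
```

Grouping by the key j is all the transposition requires, which is why the step that is awkward on other parallel architectures comes for free from the shuffle.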
Normalization
▶ Normalization can be viewed as a summation of three integers.

Summation
▶ Integer summation can be done by (1) componentwise summation, (2) carry evaluation and then (3) parallel carrying.

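The three phases can be sketched on small digit arrays. In this illustration the carry evaluation is a simple sequential pass over the column sums, after which every output digit can be fixed independently; the base, the fixed digit count, and all function names are assumptions, and the prototype's actual carry evaluation may be organized differently.

```python
def to_digits(x, base, length):
    """Little-endian base-`base` digits of x, zero-padded to `length` digits."""
    assert 0 <= x < base ** length
    return [(x // base ** k) % base for k in range(length)]


def from_digits(digits, base):
    return sum(d * base ** k for k, d in enumerate(digits))


def three_phase_sum(numbers, base=2**16, length=8):
    rows = [to_digits(x, base, length) for x in numbers]
    # (1) Componentwise summation: no carries, every column is independent.
    sums = [sum(column) for column in zip(*rows)]
    # (2) Carry evaluation: carries[k] is the carry entering position k.
    carries = [0] * (length + 1)
    for k in range(length):
        carries[k + 1] = (sums[k] + carries[k]) // base
    # (3) Parallel carrying: with all carries known, each digit is independent.
    digits = [(sums[k] + carries[k]) % base for k in range(length)]
    return digits + [carries[length]]


nums = [2**100 + 12345, 2**99 + 2**64 - 1, 2**16 - 1]
print(from_digits(three_phase_sum(nums), 2**16) == sum(nums))
```

Only phase (2) has a dependency between positions, and it handles one small number per position rather than the full data, so the bulk of the work in phases (1) and (3) parallelizes cleanly.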
MapReduce-Sum
(Diagram: Input → Summation1, Summation2, Summation3, Summation4 → Carry Evaluation (modified shuffle) → Carrying1, Carrying2, Carrying3, Carrying4 → Output)

Job 1: Componentwise Summation
(Diagram: Input → Summation1, Summation2, Summation3, Summation4 → Output)
▶ A map-only job.

Job 2: Carrying
(Diagram: Input → Carry Evaluation → Carrying1, Carrying2, Carrying3, Carrying4 → Output)

MapReduce-SSA
▶ two concurrent forward FFT jobs;
▶ a backward FFT job with componentwise multiplication and splitting;
▶ a componentwise summation map-only job;
▶ a carrying job⁶.
⁶ It is possible to combine the last two jobs if we modify the shuffle process in MapReduce [.next].

Prototype Implementation
▶ DistMpMult: distributed multi-precision multiplication
★ DistFft: distributed FFT
★ DistCompSum: distributed componentwise summation
★ DistCarrying: distributed carrying
▶ Open source, available at
https://issues.apache.org/jira/browse/MAPREDUCE-2471

Cluster Configuration
▶ A shared cluster:
★ Apache Hadoop 0.20
★ 1350 nodes
★ 6 GB memory per node
★ 2 map tasks & 1 reduce task per node
★ Imposed a limitation on the aggregated memory usage of individual jobs.

Running Time
▶ Actual running time for $2^{36} \le N \le 2^{40}$.
(Figure: log(t) versus log(N), where t is the elapsed time in seconds.)

What is π?
▶ π is a mathematical constant such that, for any circle,
$$\pi = \frac{\text{circumference}}{\text{diameter}} = \frac{C}{d}.$$
▶ We have π = 3.244 (in hexadecimal)

Decimal, Hexadecimal & Binary
▶ Representing π in different bases
π = 3.1415926535 8979323846 2643383279 ... (decimal)
  = 3.243F6A88 85A308D3 13198A2E ... (hexadecimal)
  = 11.00100100 00111111 01101010 ... (binary)
▶ Bit position is counted after the radix point.
▶ e.g., the eight bits starting at the ninth bit position are 00111111 in binary or 3F in hexadecimal.

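The bit-position convention can be checked directly from the hexadecimal expansion shown on the slide. This small sketch hard-codes the 24 hexadecimal digits quoted above; the constant and function names are assumptions for illustration.

```python
# Hexadecimal digits of pi after the radix point, as quoted on the slide.
PI_HEX_FRACTION = "243F6A8885A308D313198A2E"


def pi_bits(start, count):
    """Return `count` bits of pi beginning at 1-based bit position `start`."""
    bits = "".join(format(int(h, 16), "04b") for h in PI_HEX_FRACTION)
    return bits[start - 1:start - 1 + count]


byte = pi_bits(9, 8)
print(byte, format(int(byte, 2), "02X"))  # prints: 00111111 3F
```

Each hexadecimal digit contributes four bits, so bit positions 9 to 16 are exactly the third and fourth hexadecimal digits, 3 and F.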
A New World Record
▶ Yahoo! Cloud Computing (July 2010)
• Machines: idle slices of 1000-node clusters
Each node has two quad-core 1.8-2.5 GHz CPUs
• Duration: 23 days
• CPU time: 503 years
• Verification: 582 years CPU time

A New World Record
▶ Bit values (in hexadecimal)
0E6C1294 AED40403 F56D2D76 4026265B
CA98511D 0FCFFAA1 0F4D28B1 BB5392B8
(256 bits)
★ The first bit position: 1,999,999,999,999,997 ($= 2 \cdot 10^{15} - 3$)
★ The last bit position: 2,000,000,000,000,252 ($= 2 \cdot 10^{15} + 252$)
★ The two quadrillionth ($2 \cdot 10^{15}$th) bit is 0.

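The positions quoted above are consistent with the published hexadecimal string, as the following sketch verifies. It only re-reads the 64 hexadecimal digits from the slide; the constant and function names are assumptions for illustration.

```python
# The 64 hexadecimal digits (256 bits) published on the slide.
RECORD_HEX = (
    "0E6C1294" "AED40403" "F56D2D76" "4026265B"
    "CA98511D" "0FCFFAA1" "0F4D28B1" "BB5392B8"
)
FIRST_POSITION = 2 * 10**15 - 3  # bit position of the first published bit

RECORD_BITS = "".join(format(int(h, 16), "04b") for h in RECORD_HEX)
LAST_POSITION = FIRST_POSITION + len(RECORD_BITS) - 1


def record_bit(position):
    """Bit of pi at the given position, for positions covered by the record."""
    return int(RECORD_BITS[position - FIRST_POSITION])


print(len(RECORD_BITS), LAST_POSITION, record_bit(2 * 10**15))
```

The two quadrillionth bit is the fourth bit of the first hexadecimal digit, and since that digit is 0 the bit is 0.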
BBC News (16 Sep 2010)
▶ Pi record smashed as team finds two-quadrillionth digit
http://www.bbc.co.uk/news/technology-11313194

NewScientist (17 Sep 2010)
▶ New pi record exploits Yahoo’s computers
http://www.newscientist.com/article/dn19465-new-pi-record-exploits-yahoos-computers.html

Other News Coverage
▶ New Pi Record Exploits Yahoo’s Computers
http://cacm.acm.org/news/99207-new-pi-record-exploits-yahoos-computers
▶ The Yahoo! boffin scores pi’s two quadrillionth bit
http://www.theregister.co.uk/2010/09/16/pi_record_at_yahoo
▶ Pi calculation more than doubles old record
http://www.radionz.co.nz/news/world/57128/pi-calculation-more-than-doubles-old-record
▶ Hadoop used to calculate Pi’s two quadrillionth bit
http://www.zdnet.co.uk/blogs/mapping-babel-10017967/hadoop-used-to-calculate-pis-two-quadrillionth-bit-10018670/
▶ Yahoo! researcher breaks Pi record in finding the two-quadrillionth digit
http://www.engadget.com/2010/09/17/yahoo-researcher-breaks-pi-record-in-finding-the-two-quadrillio
▶ Nicholas Sze of Yahoo Finds Two-Quadrillionth Digit of Pi
http://science.slashdot.org/story/10/09/16/2155227/Nicholas-Sze-of-Yahoo-Finds-Two-Quadrillionth-Digit-of-Pi
▶ The 2,000,000,000,000,000th digit of the mathematical constant pi discovered
http://news.gather.com/viewArticle.action?articleId=281474978525563
▶ Researcher Shatters Pi Record by Finding Two-Quadrillionth Digit
http://www.maximumpc.com/article/news/researcher_shatters_pi_record_finding_two-quadrillionth_digit
▶ A bigger slice of pi
http://radar.oreilly.com/2010/09/strata-week-grabbing-a-slice.html
▶ 2 Quadrillionth digit of PI is found: Scientist celebration in worldwide Pandemonium
http://engforum.pravda.ru/showthread.php?296242-2-Quadrillionth-digit-of-PI-is-found-Scientist-celebration-in-worldwide-Pandemonium
▶ And the number is...0
http://www.hexus.net/content/item.php?item=26505
▶ Pi Record Smashed as Team Finds Two-Quadrillionth Digit
http://hardocp.com/news/2010/09/16/pi_record_smashed_as_team_finds_twoquadrillionth_digit
▶ Yahoo Engineer Calculates Two Quadrillionth Bit Of Pi
http://www.webpronews.com/topnews/2010/09/17/yahoo-engineer-calculates-two-quadrillionth-bit-of-pi
▶ A Cloud Computing Milestone: Yahoo! Reaches the 2 Quadrillionth Bit of Pi
http://www.readwriteweb.com/cloud/2010/09/a-cloud-computing-milestone-ya.php
▶ Yahoo researcher Nicolas Sze determines the 2,000,000,000,000,000th digit of the mathematical constant pi
http://www.thaindian.com/newsportal/sci-tech/yahoo-researcher-nicolas-sze-determines-the-2000000000000000th-digit-of-the-mathematical-constant-pi_100430278.html
▶ ...

Computing π
▶ How to compute the nth bit of π?
Let’s ignore this question in this talk ...
and focus on:
▶ How to execute such a huge computation?

Map- & Reduce-side Computations
▶ Developed a generic framework to execute tasks on either the map-side or the reduce-side.
▶ Applications define two functions:
• partition(c, m):
partition the computation c into m parts.
• compute(c):
execute the computation c.

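The two-function contract can be sketched with a summation as the computation. The example below is an assumption-laden illustration, not the framework's API: the `Summation` class and function bodies are hypothetical, and the series used is the classic Bailey-Borwein-Plouffe (BBP) formula for π, which is not necessarily the exact variant used in the record computation. Exact rational arithmetic is used so the parts recombine without rounding error.

```python
from fractions import Fraction


class Summation:
    """Hypothetical computation c: the sum of term(k) for start <= k < end."""

    def __init__(self, start, end):
        self.start, self.end = start, end


def term(k):
    """k-th term of the classic BBP series for pi (an assumed example series)."""
    return Fraction(1, 16 ** k) * (
        Fraction(4, 8 * k + 1) - Fraction(2, 8 * k + 4)
        - Fraction(1, 8 * k + 5) - Fraction(1, 8 * k + 6)
    )


def partition(c, m):
    """Partition the computation c into at most m contiguous parts."""
    size = max(1, -(-(c.end - c.start) // m))  # ceiling division
    return [Summation(s, min(s + size, c.end)) for s in range(c.start, c.end, size)]


def compute(c):
    """Execute the computation c."""
    return sum((term(k) for k in range(c.start, c.end)), Fraction(0))


whole = Summation(0, 40)
total = sum(compute(part) for part in partition(whole, 4))
# First eight hexadecimal digits of pi after the radix point.
print(format(int((total - 3) * 16 ** 8), "08X"))  # prints: 243F6A88
```

Because each part is self-contained, the same `compute` call can run inside a mapper or a reducer, which is the flexibility the framework exploits.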
Map-side Job
▶ Contains multiple mappers and zero reducers
• A PartitionInputFormat partitions c into m parts
• Each part is executed by a mapper

Reduce-side Job
▶ Contains a mapper and multiple reducers
• A SingletonInputFormat launches a PartitionMapper
• An Indexer launches m reducers.

Abstract Machine (1)
▶ Machine: an abstract base class that allows abstract Runner(s) to execute MachineComputable tasks.
▶ Machine subclasses
• Map Side Machine
m100t3: 100 maps with 3 threads each.
• Reduce Side Machine
r50t2: 50 reduces with 2 threads each.

Abstract Machine (2)
▶ More Machine subclasses
• Mix Machine: chooses Map-/Reduce-side jobs according to the cluster status.
x-m200t1-r100t2-5: either launch a job with 200 maps with 1 thread each, or a job with 100 reduces with 2 threads each.
• Alternation Machine: alternates Map-side and Reduce-side jobs in a regular pattern.
a-m200t1-r100t2-mrr: submit a map job, then a reduce job, then another reduce job, and repeat this pattern.
• Null Machine: does nothing; for testing.

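The naming scheme for the simple machines, and the repeating pattern of the Alternation Machine, can be illustrated with a small parser. This is a hypothetical sketch based only on the examples shown on the slides (m100t3, r50t2, and the pattern mrr); it is not the parser used by the actual implementation, and it deliberately does not interpret the full x-... and a-... strings.

```python
import re
from itertools import cycle, islice


def parse_machine(name):
    """Parse names such as 'm100t3' into (kind, tasks, threads per task)."""
    match = re.fullmatch(r"([mr])(\d+)t(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized machine name: {name!r}")
    kind = "map" if match.group(1) == "m" else "reduce"
    return kind, int(match.group(2)), int(match.group(3))


def alternation(pattern, map_machine, reduce_machine):
    """Endless job sequence following a pattern such as 'mrr'."""
    choices = {"m": parse_machine(map_machine), "r": parse_machine(reduce_machine)}
    return (choices[ch] for ch in cycle(pattern))


print(parse_machine("m100t3"))
print(list(islice(alternation("mrr", "m200t1", "r100t2"), 4)))
```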
Utilizing The Idle Slices
▶ Monitor cluster status
• Submit a map-side (or reduce-side) job if there are sufficient available map (or reduce) slots.
▶ Small jobs
• Hold resources only for a short period of time
▶ Interruptible & resumable
• Can be interrupted at any time by simply killing the running jobs

Running The Jobs
The Implementation
▶ Main programs:
★ DistBbp: a program to submit jobs.
★ DistSum: distributed summation.
▶ Open source, available at
https://issues.apache.org/jira/browse/MAPREDUCE-1923

The World Record Computation
▶ 35,000 MapReduce jobs, each job either has:
• 200 map tasks with one thread each, or
• 100 reduce tasks with two threads each.
▶ Each thread computes 200,000,000 terms
• ∼45 minutes.
▶ Submit up to 60 concurrent jobs
▶ The entire computation took:
• 23 days of real time and 503 CPU years

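A back-of-envelope calculation puts these figures in proportion. The sketch below uses only the numbers on the slide plus one assumption, a year of 365.25 days; the derived quantities are rough estimates and are not figures reported in the talk.

```python
JOBS = 35_000
THREADS_PER_JOB = 200 * 1        # equivalently, 100 reduce tasks * 2 threads
TERMS_PER_THREAD = 200_000_000
CPU_YEARS, WALL_DAYS = 503, 23
DAYS_PER_YEAR = 365.25           # assumption for the unit conversion

# Total number of series terms evaluated across all jobs.
total_terms = JOBS * THREADS_PER_JOB * TERMS_PER_THREAD

# Average number of CPUs kept busy over the 23 days of wall-clock time.
average_busy_cpus = CPU_YEARS * DAYS_PER_YEAR / WALL_DAYS

print(f"{total_terms:.2e} terms, about {average_busy_cpus:.0f} CPUs busy on average")
```

This works out to 1.4 · 10^15 terms in total and roughly eight thousand CPUs busy on average, harvested from otherwise idle slots.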
References
• [1] Tsz-Wo Sze. Schönhage-Strassen Algorithm with MapReduce for Multiplying Terabit Integers. Symbolic-Numeric Computation 2011, to appear. Preprint available at http://people.apache.org/~szetszwo/ssmr20110430.pdf
• [2] Tsz-Wo Sze. The Two Quadrillionth Bit of Pi is 0! Distributed Computation of Pi with Apache Hadoop. In IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom), pages 727-732, 2010. (Earlier versions available at http://arxiv.org/abs/1008.3171)

Thank you!