Large Scale Math with Hadoop MapReduce

Tsz-Wo (Nicholas) Sze, PhD, Hadoop Summit, June 29, 2011

description

Hadoop Summit 2011 presentation on Large Scale Math with Apache Hadoop MapReduce

Transcript of Large Scale Math with Hadoop MapReduce

Page 1: Large Scale Math with Hadoop MapReduce

Large Scale Math with Hadoop MapReduce

Tsz-Wo (Nicholas) Sze, PhD

Hadoop Summit, June 29, 2011

Page 2: Large Scale Math with Hadoop MapReduce

Who am I?

• Hortonworks Software Engineer
• Apache Hadoop PMC Member
• Mathematician
  • Interests:
    • Distributed Computing
    • Algorithms
    • Number Theory

Page 3: Large Scale Math with Hadoop MapReduce

Agenda

• Introduction

• Integer Multiplication

• MapReduce-FFT

• MapReduce-Sum

• MapReduce-SSA

• A New World Record

• The “Machine” Behind the Computation

Tsz-Wo Sze, Hadoop Summit 2011 3



Page 6: Large Scale Math with Hadoop MapReduce

Typical Hadoop Applications

• Major applications of Hadoop include
  • Search and crawling
  • Text processing
  • Machine learning
  • ...
• But not yet commonly used in scientific or mathematical applications.
  Why?

Page 7: Large Scale Math with Hadoop MapReduce

Why Not Math?

• No MapReduce math libraries available, and
• More fundamentally, MapReduce math algorithms are not well studied.


Page 10: Large Scale Math with Hadoop MapReduce

Existing Library

• Really no MapReduce Math Library?
  Not exactly.
• Apache Mahout
  • A machine learning library.
  • Includes packages for matrix operations.
• Apache Hama (Incubation)
  • A matrix computational package.

Page 11: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (1)

• Integer Factoring
  • a.k.a. breaking the RSA cryptosystem: given $N$, $e$ and $c$, compute $m$ such that
    $$c \equiv m^e \pmod{N},$$
    where $N$ is a product of two primes.
  • A 768-bit RSA modulus was factored¹ in 2009.

¹ Kleinjung et al., Factorization of a 768-bit RSA modulus, CRYPTO 2010.

Page 12: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (2)

• Solving PDEs (Partial Differential Equations)
  • Fluid dynamics
  • Electromagnetism
  • Financial analysis
  • ...

(Figure: two-dimensional turbulence, courtesy of Y.K. Tsang)


Page 14: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (3)

• Finding complex zeros of the Riemann zeta function
  $$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s} \quad \text{for } s \in \mathbb{C},\ \Re(s) > 1,$$
  and then analytically continued to all $s \neq 1$.
  • Disprove the Riemann Hypothesis (RH).
    Then you will get 1,000,000 dollars².
    However, RH is unlikely to be false.

² See http://www.claymath.org/millennium/Riemann_Hypothesis/.

Page 15: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (3)

• Finding complex zeros of the Riemann zeta function
  $$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s} \quad \text{for } s \in \mathbb{C},\ \Re(s) > 1,$$
  and then analytically continued to all $s \neq 1$.
  • Disprove the Riemann Hypothesis (RH).
    Then you will get 1,000,000 dollars.
    However, RH is unlikely to be false.
  • More likely: obtain more evidence that supports RH.

Page 16: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (4)

• Computing π
  Latest world records:
  • Five trillion decimal digits (August 2010)
    • by Alexander Yee & Shigeru Kondo³

³ See http://www.numberworld.org/misc_runs/pi-5t/announce_en.html

Page 17: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (4)

• Computing π
  Latest world records:
  • Five trillion decimal digits (August 2010)
    • by Alexander Yee & Shigeru Kondo
  • The two quadrillionth bit (July 2010)
    • by Tsz-Wo Sze & the Yahoo! Cloud Computing Team⁴

⁴ See http://developer.yahoo.net/blogs/hadoop/2010/09/two_quadrillionth_bit_pi.html

Page 18: Large Scale Math with Hadoop MapReduce

Missing Functionalities

• Fast Fourier Transform (FFT) – the basic routine behind many algorithms.
• Arbitrary Precision Arithmetic
  • Integer functions
  • Floating-point functions
  • Complex functions
• ...

Page 19: Large Scale Math with Hadoop MapReduce

Agenda

• Introduction

• Integer Multiplication

• MapReduce-FFT

• MapReduce-Sum

• MapReduce-SSA

• A New World Record

• The “Machine” Behind the Computation


Page 20: Large Scale Math with Hadoop MapReduce

Why Integer Multiplication?

• There exist fast algorithms.
• Many applications
  • Division
  • Logarithm
  • Trigonometric functions
  • ...

Page 21: Large Scale Math with Hadoop MapReduce

Prerequisite of Algorithms

(D.J. Bernstein, Fast multiplication and its applications, ANTS 2008.)

Page 22: Large Scale Math with Hadoop MapReduce

Integer Multiplication Algorithms

• Naïve, $O(N^2)$
• Karatsuba, $O(N^{\log_2 3}) = O(N^{1.585})$
• Toom-Cook, $O(N^{\log(2D-1)/\log D})$
  If $D = 3$, then $O(N^{\log 5/\log 3}) = O(N^{1.465})$
• FFT-based algorithms, $O(N \log N \cdots)$
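As an illustration of the second entry in the list above, here is a minimal Python sketch of Karatsuba's idea: one multiplication of two large numbers is replaced by three multiplications of numbers half the size. The function name and the `cutoff_bits` parameter are choices made for this sketch and are not from the talk.

```python
def karatsuba(x: int, y: int, cutoff_bits: int = 64) -> int:
    """Multiply non-negative integers using three half-size products."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y  # small operand: fall back to built-in multiplication
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask   # x = x_hi * 2^half + x_lo
    y_hi, y_lo = y >> half, y & mask   # y = y_hi * 2^half + y_lo
    hi = karatsuba(x_hi, y_hi, cutoff_bits)
    lo = karatsuba(x_lo, y_lo, cutoff_bits)
    # (x_hi + x_lo)(y_hi + y_lo) - hi - lo equals x_hi*y_lo + x_lo*y_hi
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo, cutoff_bits) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo


x, y = 3 ** 500, 7 ** 400
print(karatsuba(x, y) == x * y)
```

The three recursive calls per level are what give the $O(N^{\log_2 3})$ bound quoted in the list.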

Page 23: Large Scale Math with Hadoop MapReduce

FFT-based Algorithms

• Basic FFT, $O(N \log N \log\log N \log\log\log N \cdots)$
• Schönhage-Strassen, $O(N \log N \log\log N)$
• Nussbaumer, $O(N \log N \log\log N)$
• Fürer, $O(N (\log N)\, 2^{\log^* N})$
• De-Kurur-Saha-Saptharishi, $O(N (\log N)\, 2^{\log^* N})$

Page 24: Large Scale Math with Hadoop MapReduce

Convolution

• By the convolution theorem,
  $$a \times b = \mathrm{dft}^{-1}(\mathrm{dft}(a) * \mathrm{dft}(b)),$$
  where
  $\times$ denotes the convolution operator,
  $*$ denotes componentwise multiplication,
  $\mathrm{dft}(\cdot)$ denotes the discrete Fourier transform.
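The convolution theorem can be checked numerically on a tiny example. The sketch below uses a DFT over the integers modulo 17 with root of unity 2 (2 has order 8 modulo 17); these parameters, the naive quadratic DFT, and the helper names are assumptions picked only to keep the example small.

```python
MOD = 17    # small prime modulus, for illustration only
OMEGA = 2   # 2 has multiplicative order 8 modulo 17


def dft(x, omega=OMEGA, mod=MOD):
    """Naive DFT of length len(x) over the integers modulo `mod`."""
    n = len(x)
    return [sum(x[i] * pow(omega, i * j, mod) for i in range(n)) % mod
            for j in range(n)]


def inverse_dft(x_hat, omega=OMEGA, mod=MOD):
    n = len(x_hat)
    omega_inv = pow(omega, n - 1, mod)   # omega^(n-1) is the inverse since omega^n = 1
    n_inv = pow(n, mod - 2, mod)         # inverse of n modulo a prime (Fermat)
    return [(v * n_inv) % mod for v in dft(x_hat, omega_inv, mod)]


def cyclic_convolution(a, b, mod=MOD):
    """Direct cyclic convolution, the left-hand side a x b."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % mod
    return out


a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
via_dft = inverse_dft([(u * v) % MOD for u, v in zip(dft(a), dft(b))])
direct = cyclic_convolution(a, b)
print(via_dft == direct)
```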

Page 25: Large Scale Math with Hadoop MapReduce

Schönhage-Strassen Algorithm (SSA)

• Represent integers as polynomials. Then, compute the convolution with DFTs modulo an integer⁵.

⁵ It has the form $2^n + 1$ and is called the Schönhage-Strassen modulus.

Page 26: Large Scale Math with Hadoop MapReduce

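SSA Steps

• Step 1: two DFTs,
  $$\hat{a} \overset{\mathrm{def}}{=} \mathrm{dft}(a) \quad\text{and}\quad \hat{b} \overset{\mathrm{def}}{=} \mathrm{dft}(b);$$
• Step 2: componentwise multiplication,
  $$\hat{p} \overset{\mathrm{def}}{=} \hat{a} * \hat{b};$$
• Step 3: a DFT inverse,
  $$p = \mathrm{dft}^{-1}(\hat{p});$$
• Step 4: normalization.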
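The four steps can be traced on a toy scale. The sketch below multiplies two integers of up to 64 bits with 16-point DFTs modulo $2^{32} + 1$, using a power of 2 as the root of unity. The parameter values, the naive quadratic DFT in place of an FFT, and the simplified normalization (evaluating the product polynomial at the digit base) are all assumptions of this sketch, not the prototype described in the talk.

```python
def dft(x, omega, mod):
    """Naive O(D^2) DFT over the integers modulo `mod`; an FFT would replace this."""
    size = len(x)
    return [sum(x[i] * pow(omega, i * j, mod) for i in range(size)) % mod
            for j in range(size)]


def ssa_style_multiply(a, b, n=32, size=16, digit_bits=8):
    """Multiply a and b with DFTs modulo 2^n + 1 (toy parameters)."""
    mod = (1 << n) + 1                     # modulus of the form 2^n + 1
    omega = pow(2, 2 * n // size, mod)     # 2^n = -1 (mod 2^n + 1), so omega^(size/2) = -1
    digit_max = (1 << digit_bits) - 1
    assert size & (size - 1) == 0 and (2 * n) % size == 0
    assert 0 <= a < 1 << (digit_bits * size // 2)   # inputs fit in size/2 digits,
    assert 0 <= b < 1 << (digit_bits * size // 2)   # so the cyclic convolution never wraps
    assert (size // 2) * digit_max ** 2 < mod       # coefficients are recovered exactly

    def digits(x):
        # Polynomial representation: coefficient i is the i-th base-2^digit_bits digit.
        return [(x >> (digit_bits * i)) & digit_max for i in range(size)]

    # Step 1: two forward DFTs.
    a_hat = dft(digits(a), omega, mod)
    b_hat = dft(digits(b), omega, mod)
    # Step 2: componentwise multiplication.
    p_hat = [(u * v) % mod for u, v in zip(a_hat, b_hat)]
    # Step 3: inverse DFT (inverse root, then divide by size = 2^k).
    omega_inv = pow(omega, size - 1, mod)
    size_inv = pow(2, 2 * n - (size.bit_length() - 1), mod)   # 2^(2n) = 1 (mod 2^n + 1)
    p = [(v * size_inv) % mod for v in dft(p_hat, omega_inv, mod)]
    # Step 4: normalization, here simply summing the shifted coefficients.
    return sum(c << (digit_bits * i) for i, c in enumerate(p))


a, b = 0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0
print(ssa_style_multiply(a, b) == a * b)
```

Because the root of unity is a power of 2, every multiplication by a power of `omega` is a bit shift modulo $2^n + 1$, which is one reason this modulus is attractive.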

Page 27: Large Scale Math with Hadoop MapReduce

Calculating DFTs

• The DFT can be calculated by a family of algorithms called the Fast Fourier Transform (FFT).

Page 28: Large Scale Math with Hadoop MapReduce

FFT Family

• Recursive-FFT
• Parallel-FFT
• Cooley-Tukey (decimation-in-time)
• Gentleman-Sande (decimation-in-frequency)
• Danielson-Lanczos
• Ping-pong FFT
• ...

Page 29: Large Scale Math with Hadoop MapReduce

Data Model (1)

• Need a data model which allows accessing terabit integers efficiently.
• An integer $x$ is represented as a $D$-dimensional tuple
  $$x = (x_{D-1}, x_{D-2}, \ldots, x_0).$$

Page 30: Large Scale Math with Hadoop MapReduce

Data Model (2)

• Write $D = IJ$, where $I$ and $J$ are powers of two.
• Define $J$-dimensional tuples
  $$x^{(i)} \overset{\mathrm{def}}{=} (x_{(J-1)I+i}, x_{(J-2)I+i}, \ldots, x_i)$$
  for $0 \le i < I$.

Page 31: Large Scale Math with Hadoop MapReduce

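Data Model (3)

• Then,
  $$\begin{pmatrix} x^{(0)} \\ x^{(1)} \\ \vdots \\ x^{(I-1)} \end{pmatrix} = \begin{pmatrix} x_{(J-1)I} & x_{(J-2)I} & \cdots & x_0 \\ x_{(J-1)I+1} & x_{(J-2)I+1} & \cdots & x_1 \\ \vdots & \vdots & \ddots & \vdots \\ x_{(J-1)I+(I-1)} & x_{(J-2)I+(I-1)} & \cdots & x_{I-1} \end{pmatrix}.$$
• We call it the $(I, J)$-format of $x$.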

Page 32: Large Scale Math with Hadoop MapReduce

Data Model (4)

• Each $x^{(i)}$ is a sequence of $J$ records.
• Each record is a key-value pair.

  Record #    <Key, Value>
  0           < $i$, $x_i$ >
  1           < $I + i$, $x_{I+i}$ >
  ...         ...
  $J - 1$     < $(J-1)I + i$, $x_{(J-1)I+i}$ >
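A short Python sketch of the $(I, J)$-format may make the indexing concrete. Lists are written in ascending index order here (the slides write tuples in descending order), and the function names are hypothetical.

```python
def to_ij_format(x, I, J):
    """Split x = [x_0, ..., x_{D-1}], D = I*J, into I rows; row i holds x_{jI+i} for j = 0..J-1."""
    assert len(x) == I * J
    return [[x[j * I + i] for j in range(J)] for i in range(I)]


def records(x, I, J, i):
    """Key-value records of x^(i): record j is the pair (jI + i, x_{jI+i})."""
    return [(j * I + i, x[j * I + i]) for j in range(J)]


x = list(range(100, 108))        # D = 8 arbitrary components x_0 .. x_7
rows = to_ij_format(x, I=2, J=4)
print(rows)                      # [[100, 102, 104, 106], [101, 103, 105, 107]]
print(records(x, 2, 4, i=1))     # [(1, 101), (3, 103), (5, 105), (7, 107)]
```

Each row corresponds to one file in the storage layout, so consecutive components of $x$ end up in different files with stride $I$.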

Page 33: Large Scale Math with Hadoop MapReduce

Data Model (5)

• Thus, an integer is stored as $I$ SequenceFiles in HDFS; each SequenceFile contains $J$ records.

Page 34: Large Scale Math with Hadoop MapReduce

Parallel-FFT Steps

• Step 1: $I$ inner $J$-point DFTs,
  $$\hat{a}^{(i)} = \mathrm{dft}(a^{(i)});$$
• Step 2: componentwise shifting,
  $$z_{jI+i} \overset{\mathrm{def}}{=} \zeta^{ij}\, \hat{a}^{(i)}_j;$$
• Step 3: transposition,
  $$z^{[j]} \overset{\mathrm{def}}{=} (z_{jI+(I-1)}, z_{jI+(I-2)}, \ldots, z_{jI});$$
• Step 4: $J$ outer $I$-point DFTs,
  $$\hat{z}^{[j]} \overset{\mathrm{def}}{=} \mathrm{dft}(z^{[j]}).$$
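The four steps can be verified against a direct DFT on a small input. The sketch below works modulo 17 with $\zeta = 2$ (a primitive 8th root of unity modulo 17) and uses a naive DFT for the inner and outer transforms; these choices, and the assumption that component $k$ of $\hat{z}^{[j]}$ is component $kJ + j$ of the full transform, belong to this sketch rather than to the talk.

```python
MOD = 17   # illustrative prime modulus
ZETA = 2   # primitive D-th root of unity modulo 17 for D = 8


def dft(x, omega, mod=MOD):
    n = len(x)
    return [sum(x[k] * pow(omega, k * m, mod) for k in range(n)) % mod
            for m in range(n)]


def parallel_fft(a, I, J, zeta=ZETA, mod=MOD):
    D = I * J
    assert len(a) == D
    # Step 1: I inner J-point DFTs of a^(i) = (a_i, a_{I+i}, ...), with root zeta^I.
    inner = [dft([a[j * I + i] for j in range(J)], pow(zeta, I, mod), mod)
             for i in range(I)]
    # Step 2: componentwise shifting, z_{jI+i} = zeta^(ij) * a_hat^(i)_j.
    z = [0] * D
    for i in range(I):
        for j in range(J):
            z[j * I + i] = pow(zeta, i * j, mod) * inner[i][j] % mod
    # Step 3: transposition, z^[j] = (z_{jI}, ..., z_{jI+I-1}).
    columns = [[z[j * I + i] for i in range(I)] for j in range(J)]
    # Step 4: J outer I-point DFTs, with root zeta^J.
    outer = [dft(col, pow(zeta, J, mod), mod) for col in columns]
    # outer[j][k] is component kJ + j of the D-point DFT of a.
    result = [0] * D
    for j in range(J):
        for k in range(I):
            result[k * J + j] = outer[j][k]
    return result


a = [3, 1, 4, 1, 5, 9, 2, 6]
print(parallel_fft(a, I=2, J=4) == dft(a, ZETA))
```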

Page 35: Large Scale Math with Hadoop MapReduce

MapReduce Model

(Diagram: Input → Map1, Map2, Map3, Map4 → Shuffle → Reduce1, Reduce2, Reduce3, Reduce4 → Output)

Page 36: Large Scale Math with Hadoop MapReduce

MapReduce-FFT

(Diagram: Input → Inner FFT1, Inner FFT2, Inner FFT3, Inner FFT4 → Transposition (by shuffle) → Outer FFT1, Outer FFT2, Outer FFT3, Outer FFT4 → Output)

Page 37: Large Scale Math with Hadoop MapReduce

Data Locality

• The FFT transposition, for which preserving locality is traditionally difficult, becomes trivial in MapReduce.

Page 38: Large Scale Math with Hadoop MapReduce

MapReduce-FFT (1)

• Map function:
  $$(k_1, v_1) \longrightarrow \mathrm{list}\langle k_2, v_2 \rangle$$

Algorithm 1 (Forward FFT, Mapper).
(f.m.1) read key $i$, value $a^{(i)}$;
(f.m.2) calculate a $J$-point DFT;
(f.m.3) componentwise multiply;
(f.m.4) for $0 \le j < J$, emit key $j$, value $(i, z_{jI+i})$.

Page 39: Large Scale Math with Hadoop MapReduce

MapReduce-FFT (2)

• Reduce function:
  $$(k_2, \mathrm{list}\langle v_2 \rangle) \longrightarrow \mathrm{list}\langle k_3, v_3 \rangle.$$

Algorithm 2 (Forward FFT, Reducer).
(f.r.1) receive key $j$, list $[(i, z_{jI+i})]_{0 \le i < I}$;
(f.r.2) calculate an $I$-point DFT;
(f.r.3) write key $j$, value $\hat{z}^{[j]}$.
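The mapper and reducer can be simulated in plain Python, with a dictionary standing in for the shuffle. This is only a single-process sketch: it assumes that step (f.m.3) multiplies by the factors $\zeta^{ij}$ of the Parallel-FFT shifting step, works modulo 17 with $\zeta = 2$, and uses hypothetical function names rather than the Hadoop API.

```python
from collections import defaultdict

MOD, ZETA = 17, 2   # illustrative: 2 is a primitive 8th root of unity modulo 17


def dft(x, omega, mod=MOD):
    n = len(x)
    return [sum(x[k] * pow(omega, k * m, mod) for k in range(n)) % mod
            for m in range(n)]


def mapper(i, a_i, I, J):
    """(f.m.1)-(f.m.4): J-point DFT of a^(i), multiply by zeta^(ij), emit (j, (i, z))."""
    a_hat = dft(a_i, pow(ZETA, I, MOD))
    for j in range(J):
        yield j, (i, pow(ZETA, i * j, MOD) * a_hat[j] % MOD)


def reducer(j, values, I, J):
    """(f.r.1)-(f.r.3): order the received values by i, I-point DFT, write (j, z_hat^[j])."""
    column = [z for _, z in sorted(values)]
    return j, dft(column, pow(ZETA, J, MOD))


def mapreduce_fft(a, I, J):
    shuffle = defaultdict(list)              # stands in for the MapReduce shuffle
    for i in range(I):
        a_i = [a[j * I + i] for j in range(J)]
        for key, value in mapper(i, a_i, I, J):
            shuffle[key].append(value)       # grouping by key j performs the transposition
    output = dict(reducer(j, shuffle[j], I, J) for j in range(J))
    # Component m = kJ + j of the full DFT is entry k of the output for key j.
    return [output[m % J][m // J] for m in range(I * J)]


a = [3, 1, 4, 1, 5, 9, 2, 6]
print(mapreduce_fft(a, I=2, J=4) == dft(a, ZETA))
```

Grouping the emitted pairs by key $j$ is all the transposition amounts to, which is the sense in which the shuffle does it for free.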

Page 40: Large Scale Math with Hadoop MapReduce

Normalization

• Normalization can be viewed as a summation of three integers.

Page 41: Large Scale Math with Hadoop MapReduce

Summation

• Integer summation can be done by (1) componentwise summation, (2) carry evaluation and then (3) parallel carrying.
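The three phases can be sketched in Python on digit lists. In this sketch the carry evaluation is a simple sequential scan; phases (1) and (3) touch each digit position independently, which is what makes them easy to distribute. The base, the fixed digit count, and the function names are assumptions of the example.

```python
def to_digits(x, base, length):
    """Little-endian digits of x in the given base, padded to `length` digits."""
    out = []
    for _ in range(length):
        x, d = divmod(x, base)
        out.append(d)
    assert x == 0, "number does not fit in the requested number of digits"
    return out


def three_phase_sum(numbers, base=10_000, length=8):
    assert len(numbers) <= base   # keeps every carry below the base
    digit_rows = [to_digits(x, base, length) for x in numbers]
    # (1) Componentwise summation: each position is summed independently.
    sums = [sum(column) for column in zip(*digit_rows)]
    # (2) Carry evaluation: the carry into position i+1 depends on the lower positions.
    carries = [0]
    for s in sums:
        carries.append((s + carries[-1]) // base)
    # (3) Parallel carrying: each position applies its own carry-in independently.
    digits = [(s + c) % base for s, c in zip(sums, carries)]
    digits.append(carries[-1])    # final carry becomes the top digit
    return sum(d * base ** i for i, d in enumerate(digits))


numbers = [10 ** 32 - 1, 1, 12345678901234567890]
print(three_phase_sum(numbers) == sum(numbers))
```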

Page 42: Large Scale Math with Hadoop MapReduce

MapReduce Model

(Diagram: Input → Map1, Map2, Map3, Map4 → Shuffle → Reduce1, Reduce2, Reduce3, Reduce4 → Output)

Page 43: Large Scale Math with Hadoop MapReduce

MapReduce-Sum

(Diagram: Input → Summation1, Summation2, Summation3, Summation4 → Carry Evaluation (modified shuffle) → Carrying1, Carrying2, Carrying3, Carrying4 → Output)

Page 44: Large Scale Math with Hadoop MapReduce

Job 1: Componentwise Summation

(Diagram: Input → Summation1, Summation2, Summation3, Summation4 → Output)

• A map-only job.

Page 45: Large Scale Math with Hadoop MapReduce

Job 2: Carrying

(Diagram: Input → Carry Evaluation → Carrying1, Carrying2, Carrying3, Carrying4 → Output)

Page 46: Large Scale Math with Hadoop MapReduce

MapReduce-SSA

• two concurrent forward FFT jobs;
• a backward FFT job with componentwise multiplication and splitting;
• a componentwise summation map-only job;
• a carrying job⁶.

⁶ It is possible to combine the last two jobs if we modify the shuffle process in MapReduce [.next].

Page 47: Large Scale Math with Hadoop MapReduce

Prototype Implementation

• DistMpMult – distributed multi-precision multiplication
  • DistFft – distributed FFT
  • DistCompSum – distributed componentwise summation
  • DistCarrying – distributed carrying
• Open source – available at
  https://issues.apache.org/jira/browse/MAPREDUCE-2471

Page 48: Large Scale Math with Hadoop MapReduce

Cluster Configuration

• A shared cluster:
  • Apache Hadoop 0.20
  • 1350 nodes
  • 6 GB memory per node
  • 2 map tasks & 1 reduce task per node
  • Imposed a limitation on the aggregated memory usage of individual jobs.

Page 49: Large Scale Math with Hadoop MapReduce

Running Time

• Actual running time for $2^{36} \le N \le 2^{40}$.

(Plot: log(t) against log(N), where t is the elapsed time in seconds.)

Page 50: Large Scale Math with Hadoop MapReduce

Agenda

• Introduction

• Integer Multiplication

• MapReduce-FFT

• MapReduce-Sum

• MapReduce-SSA

• A New World Record

• The “Machine” Behind the Computation



Page 53: Large Scale Math with Hadoop MapReduce

What is π?

• π is a mathematical constant such that, for any circle,
  $$\pi = \frac{\text{circumference}}{\text{diameter}} = \frac{C}{d}.$$
• We have π = 3.244 (in hexadecimal)

Page 54: Large Scale Math with Hadoop MapReduce

Decimal, Hexadecimal & Binary

• Representing π in different bases
  π = 3.1415926535 8979323846 2643383279 ...   (decimal)
    = 3.243F6A88 85A308D3 13198A2E ...   (hexadecimal)
    = 11.00100100 00111111 01101010 ...   (binary)
• Bit position is counted after the radix point.
• E.g., the eight bits starting at the ninth bit position are 00111111 in binary or 3F in hexadecimal.
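The bit-position convention can be checked with a few lines of Python. The hexadecimal digits below are the ones shown on the slide; the function name is hypothetical.

```python
PI_HEX_FRACTION = "243F6A8885A308D313198A2E"   # hex digits of π after the radix point


def bits_at(position, count, hex_fraction=PI_HEX_FRACTION):
    """Return `count` bits starting at the 1-based bit `position` after the radix point."""
    all_bits = "".join(format(int(h, 16), "04b") for h in hex_fraction)
    assert position >= 1 and position - 1 + count <= len(all_bits)
    return all_bits[position - 1:position - 1 + count]


eight_bits = bits_at(9, 8)
print(eight_bits)                              # 00111111
print(format(int(eight_bits, 2), "02X"))       # 3F
```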

Page 55: Large Scale Math with Hadoop MapReduce

A New World Record

• Yahoo! Cloud Computing (July 2010)
  • Machines: idle slices of 1000-node clusters.
    Each node has two quad-core 1.8-2.5 GHz CPUs.
  • Duration: 23 days
  • CPU time: 503 years
  • Verification: 582 years CPU time


Page 57: Large Scale Math with Hadoop MapReduce

A New World Record

• Bit values (in hexadecimal)
  0E6C1294 AED40403 F56D2D76 4026265B
  CA98511D 0FCFFAA1 0F4D28B1 BB5392B8
  (256 bits)
  • The first bit position: 1,999,999,999,999,997 ($= 2 \cdot 10^{15} - 3$)
  • The last bit position: 2,000,000,000,000,252 ($= 2 \cdot 10^{15} + 252$)
  • The two quadrillionth ($2 \cdot 10^{15}$th) bit is 0.
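The positions quoted above are easy to cross-check from the hexadecimal string itself. The sketch below only re-reads the published digits; the variable and function names are hypothetical.

```python
RECORD_HEX = ("0E6C1294AED40403F56D2D764026265B"
              "CA98511D0FCFFAA10F4D28B1BB5392B8")
FIRST_POSITION = 2 * 10 ** 15 - 3      # position of the first published bit

bits = "".join(format(int(h, 16), "04b") for h in RECORD_HEX)
last_position = FIRST_POSITION + len(bits) - 1


def bit_at(position):
    """Bit of π at the given position, for positions covered by the record."""
    return int(bits[position - FIRST_POSITION])


print(len(bits))                # 256
print(last_position)            # 2000000000000252
print(bit_at(2 * 10 ** 15))     # 0
```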

Page 58: Large Scale Math with Hadoop MapReduce

BBC News (16 Sep 2010)

• Pi record smashed as team finds two-quadrillionth digit
  http://www.bbc.co.uk/news/technology-11313194

Page 59: Large Scale Math with Hadoop MapReduce

NewScientist (17 Sep 2010)

• New pi record exploits Yahoo’s computers
  http://www.newscientist.com/article/dn19465-new-pi-record-exploits-yahoos-computers.html

Page 60: Large Scale Math with Hadoop MapReduce

Other News Coverage

• New Pi Record Exploits Yahoo’s Computers
  http://cacm.acm.org/news/99207-new-pi-record-exploits-yahoos-computers
• The Yahoo! boffin scores pi’s two quadrillionth bit
  http://www.theregister.co.uk/2010/09/16/pi_record_at_yahoo
• Pi calculation more than doubles old record
  http://www.radionz.co.nz/news/world/57128/pi-calculation-more-than-doubles-old-record
• Hadoop used to calculate Pi’s two quadrillionth bit
  http://www.zdnet.co.uk/blogs/mapping-babel-10017967/hadoop-used-to-calculate-pis-two-quadrillionth-bit-10018670/

Page 61: Large Scale Math with Hadoop MapReduce

• Yahoo! researcher breaks Pi record in finding the two-quadrillionth digit
  http://www.engadget.com/2010/09/17/yahoo-researcher-breaks-pi-record-in-finding-the-two-quadrillio
• Nicholas Sze of Yahoo Finds Two-Quadrillionth Digit of Pi
  http://science.slashdot.org/story/10/09/16/2155227/Nicholas-Sze-of-Yahoo-Finds-Two-Quadrillionth-Digit-of-Pi
• The 2,000,000,000,000,000th digit of the mathematical constant pi discovered
  http://news.gather.com/viewArticle.action?articleId=281474978525563
• Researcher Shatters Pi Record by Finding Two-Quadrillionth Digit
  http://www.maximumpc.com/article/news/researcher_shatters_pi_record_finding_two-quadrillionth_digit

Page 62: Large Scale Math with Hadoop MapReduce

• A bigger slice of pi
  http://radar.oreilly.com/2010/09/strata-week-grabbing-a-slice.html
• 2 Quadrillionth digit of PI is found: Scientist celebration in worldwide Pandemonium
  http://engforum.pravda.ru/showthread.php?296242-2-Quadrillionth-digit-of-PI-is-found-Scientist-celebration-in-worldwide-Pandemonium
• And the number is...0
  http://www.hexus.net/content/item.php?item=26505
• Pi Record Smashed as Team Finds Two-Quadrillionth Digit
  http://hardocp.com/news/2010/09/16/pi_record_smashed_as_team_finds_twoquadrillionth_digit

Page 63: Large Scale Math with Hadoop MapReduce

• Yahoo Engineer Calculates Two Quadrillionth Bit Of Pi
  http://www.webpronews.com/topnews/2010/09/17/yahoo-engineer-calculates-two-quadrillionth-bit-of-pi
• A Cloud Computing Milestone: Yahoo! Reaches the 2 Quadrillionth Bit of Pi
  http://www.readwriteweb.com/cloud/2010/09/a-cloud-computing-milestone-ya.php
• Yahoo researcher Nicolas Sze determines the 2,000,000,000,000,000th digit of the mathematical constant pi
  http://www.thaindian.com/newsportal/sci-tech/yahoo-researcher-nicolas-sze-determines-the-2000000000000000th-digit-of-the-mathematical-constant-pi_100430278.html
• ...


Page 66: Large Scale Math with Hadoop MapReduce

Computing π

• How to compute the nth bit of π?
  Let’s ignore this question in this talk ... and focus on:
• How to execute such a huge computation?

Page 67: Large Scale Math with Hadoop MapReduce

Map- & Reduce-side Computations

• Developed a generic framework to execute tasks on either the map-side or the reduce-side.
• Applications define two functions:
  • partition(c, m): partition the computation c into m parts.
  • compute(c): execute the computation c.
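A minimal sketch of the two application-defined functions, using a range summation as the computation. The `RangeSum` class and the placeholder `term` function are hypothetical and stand in for whatever computation an application supplies; the actual framework interfaces are not shown in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeSum:
    """Hypothetical computation c: the sum of term(k) for start <= k < end."""
    start: int
    end: int


def term(k):
    return k * k   # placeholder term; a real application would plug in its own


def partition(c, m):
    """Partition the computation c into m contiguous parts of near-equal size."""
    total = c.end - c.start
    bounds = [c.start + total * p // m for p in range(m + 1)]
    return [RangeSum(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


def compute(c):
    """Execute the computation c."""
    return sum(term(k) for k in range(c.start, c.end))


whole = RangeSum(0, 1000)
parts = partition(whole, 7)
print(sum(compute(p) for p in parts) == compute(whole))
```

Because the parts are independent, each one can be handed to a separate mapper or reducer, and the partial results combined afterwards.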

Page 68: Large Scale Math with Hadoop MapReduce

Map-side Job

• Contains multiple mappers and zero reducers
  • A PartitionInputFormat partitions c into m parts
  • Each part is executed by a mapper

Page 69: Large Scale Math with Hadoop MapReduce

Reduce-side Job

• Contains a mapper and multiple reducers
  • A SingletonInputFormat launches a PartitionMapper
  • An Indexer launches m reducers.

Page 70: Large Scale Math with Hadoop MapReduce

Abstract Machine (1)

• Machine – an abstract base class that allows abstract Runner(s) to execute MachineComputable tasks.
• Machine subclasses
  • Map Side Machine
    m100t3: 100 maps with 3 threads each.
  • Reduce Side Machine
    r50t2: 50 reduces with 2 threads each.

Page 71: Large Scale Math with Hadoop MapReduce

Abstract Machine (2)

• More Machine subclasses
  • Mix Machine – chooses Map-/Reduce-side jobs according to the cluster status.
    x-m200t1-r100t2-5: launch either a job with 200 maps with 1 thread each, or a job with 100 reduces with 2 threads each.
  • Alternation Machine – alternates Map-side and Reduce-side jobs in a regular pattern.
    a-m200t1-r100t2-mrr: submit a map job, then a reduce job, then another reduce job, and repeat this pattern.
  • Null Machine – does nothing; used for testing.

Page 72: Large Scale Math with Hadoop MapReduce

Utilizing The Idle Slices

• Monitor cluster status
  • Submit a map-side (or reduce-side) job if there are sufficient available map (or reduce) slots.
• Small jobs
  • Hold resources only for a short period of time
• Interruptible & resumable
  • Can be interrupted at any time by simply killing the running jobs
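The slot-based decision and the alternation pattern can be sketched as follows. This is a hypothetical illustration: the function names, the default sizes (taken from the 200-map and 100-reduce job shapes mentioned in the talk), and the rule of preferring a map-side job when both kinds of slots are free are assumptions, not the behaviour of the actual Machine classes.

```python
from itertools import cycle, islice


def choose_job(free_map_slots, free_reduce_slots, maps_needed=200, reduces_needed=100):
    """Pick a job type from the current free slots, or None if the cluster is busy."""
    if free_map_slots >= maps_needed:
        return "map"        # assumption: prefer a map-side job when both would fit
    if free_reduce_slots >= reduces_needed:
        return "reduce"
    return None             # wait and check the cluster status again later


def alternation(pattern="mrr"):
    """Endless job-type sequence following a pattern such as 'mrr'."""
    names = {"m": "map", "r": "reduce"}
    return cycle([names[ch] for ch in pattern])


print(choose_job(250, 10))                   # map
print(choose_job(50, 120))                   # reduce
print(list(islice(alternation("mrr"), 6)))   # ['map', 'reduce', 'reduce', 'map', 'reduce', 'reduce']
```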

Page 73: Large Scale Math with Hadoop MapReduce

Running The Jobs


Page 74: Large Scale Math with Hadoop MapReduce

The Implementation

• Main programs:
  • DistBbp – a program to submit jobs.
  • DistSum – distributed summation.
• Open source – available at
  https://issues.apache.org/jira/browse/MAPREDUCE-1923

Page 75: Large Scale Math with Hadoop MapReduce

The World Record Computation

• 35,000 MapReduce jobs; each job has either:
  • 200 map tasks with one thread each, or
  • 100 reduce tasks with two threads each.
• Each thread computes 200,000,000 terms
  • ∼45 minutes.
• Submit up to 60 concurrent jobs
• The entire computation took:
  • 23 days of real time and 503 CPU years

Page 76: Large Scale Math with Hadoop MapReduce

References

• [1] Tsz-Wo Sze. Schönhage-Strassen Algorithm with MapReduce for Multiplying Terabit Integers. Symbolic-Numeric Computation 2011, to appear. Preprint available at http://people.apache.org/~szetszwo/ssmr20110430.pdf
• [2] Tsz-Wo Sze. The Two Quadrillionth Bit of Pi is 0! Distributed Computation of Pi with Apache Hadoop. In IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom), pages 727-732, 2010. (Earlier versions available at http://arxiv.org/abs/1008.3171)

Page 77: Large Scale Math with Hadoop MapReduce

Thank you!
