Indian Institute of Science, Bangalore, India
One Trillion Edges: Graph Processing at Facebook Scale
Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, Sambhavi Muthukrishnan
Facebook
Presented by: Swapnil Gandhi
21st November 2018
Königsberg* Bridge Problem
Its negative resolution by Leonhard Euler in 1736 laid the foundations of graph theory.
* Located in the Kingdom of Prussia (now Kaliningrad, Russia)
Graphs are common
Web & Social Networks
‣ Web graph, Citation Networks, Twitter, Facebook, Internet
Knowledge networks & relationships
‣ Google’s Knowledge Graph, CMU’s NELL
Cybersecurity
‣ Telecom call logs, financial transactions, Malware
Internet of Things
‣ Transport, Power, Water networks
Bioinformatics
‣ Gene sequencing, Gene expression networks
Graph Algorithms
Traversals: Paths & flows between different parts of the graph
‣ Breadth First Search, Shortest path, Minimum Spanning Tree, Eulerian paths, Max-Cut
Clustering: Closeness between sets of vertices
‣ Community detection & Evolution, Connected components, K-means clustering, Max Independent Set
Centrality: Relative importance of vertices
‣ PageRank, Betweenness Centrality
Graphs are Central to Analytics
[Figure: analytics pipeline over raw Wikipedia XML. Stages shown: Hyperlinks → PageRank → Top 20 Pages (Title, PR); Text Table (Title, Body) → Term-Doc Graph → Topic Model (LDA) → Word Topics (Word, Topic); Editor Graph → Community Detection → User Community (User, Com.); Discussion Table (User, Disc.) → Community Topic (Topic, Com.)]
But Graphs can be challenging
Shared-memory algorithms don’t scale!
Graphs do not fit naturally into Hadoop/MapReduce
‣ Multiple MR jobs (iterative MR)
‣ Topology & data written to HDFS each time
‣ Tuple-centric, rather than graph-centric, abstraction
Lots of work on parallel graph libraries for HPC
‣ Boost Graph Library, Graph500
‣ Storage & compute are (loosely) coupled, not fault tolerant
‣ But not everyone has a supercomputer!
• If you do own a supercomputer, stick with HPC algorithms
PageRank using MR
MapReduce: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
PageRank using MR
MR will run for multiple iterations (typically 30)
Mapper will
‣ Initially, load the adjacency list and initialize the default PR
‣ In subsequent iterations, load the adjacency list and the new PR
‣ Emit two types of messages from Map: the vertex’s adjacency list, and PR contributions for its neighbours
Reducer will
‣ Reconstruct the adjacency list for each vertex
‣ Update the PageRank value of the vertex based on its neighbours’ PR messages
‣ Write the adjacency list and new PR values to HDFS, to be used by the next Map iteration
SQL v/s MapReduce: http://www.science.smith.edu/dftwiki/images/6/6a/ComparisonOfApproachesToLargeScaleDataAnalysis.pdf
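The mapper and reducer roles listed for PageRank can be sketched as a single in-memory iteration. This is an illustrative sketch only, not Hadoop code: the names `VertexRecord`, `Message` and `pagerank_iteration` are hypothetical, a `std::map` stands in for the shuffle, the damping factor is a parameter chosen by the caller, and every vertex is assumed to have at least one outgoing edge.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct VertexRecord {
    std::vector<int> neighbours;  // adjacency list
    double rank;                  // current PageRank value
};

// The two message types a mapper emits for a key (vertex id).
struct Message {
    bool is_structure;            // true: carries the adjacency list
    std::vector<int> neighbours;  // used when is_structure is true
    double contribution;          // used when is_structure is false
};

typedef std::map<int, VertexRecord> PageGraph;

// One MapReduce job corresponds to one PageRank iteration.
PageGraph pagerank_iteration(const PageGraph& graph, double damping) {
    // Map phase: the std::map plays the role of the shuffle (group by key).
    std::map<int, std::vector<Message>> shuffled;
    for (const auto& entry : graph) {
        const VertexRecord& v = entry.second;
        assert(!v.neighbours.empty());  // assumption: no dangling vertices
        // Message type 1: pass the graph structure along to the reducer.
        shuffled[entry.first].push_back(Message{true, v.neighbours, 0.0});
        // Message type 2: share the rank equally among the out-neighbours.
        for (int n : v.neighbours)
            shuffled[n].push_back(
                Message{false, {}, v.rank / v.neighbours.size()});
    }
    // Reduce phase: rebuild each vertex and compute its new rank.
    PageGraph next;
    const double n = static_cast<double>(graph.size());
    for (const auto& entry : shuffled) {
        VertexRecord out;
        double sum = 0.0;
        for (const Message& m : entry.second) {
            if (m.is_structure) out.neighbours = m.neighbours;
            else sum += m.contribution;
        }
        out.rank = (1.0 - damping) / n + damping * sum;
        next[entry.first] = out;  // in MR this record is written to HDFS
    }
    return next;
}
```

Calling `pagerank_iteration` in a loop (the slides mention roughly 30 rounds) shows where the cost comes from: the adjacency list travels through the shuffle and back to storage on every round even though it never changes.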
Two Birds
Half-caf Double Espresso
Less Data movement over Network
Fault Tolerance
Credits : Google Images
One Stone
Pregel
Credits : Google Images
A play on the English proverb: “Kill two birds with one stone”
Pregel
To overcome these challenges, Google came up with Pregel
Valiant’s BSP
[Figure: processors P1 to P4 across Superstep 1 and Superstep 2. Superstep 1 runs Computation, then Communication, then Barrier Synchronization; Superstep 2 starts again with Computation.]
“Often expensive and should be used as sparingly as possible”
Vertex State Machine
In superstep 0, every vertex is in the active state.
A vertex deactivates itself by voting to halt.
It can be reactivated by receiving an (external) message.
The algorithm terminates when every vertex has voted to halt and no messages are in transit.
Vertex Centric Programming
Vertex Centric Programming Model
‣ Logic written from the perspective of a single vertex.
‣ Executed on all vertices.
Vertices know about
‣ Their own value(s)
‣ Their outgoing edges
Finding Largest Value in a Graph using Pregel
Superstep 0: 3 6 2 1
Superstep 1: 6 6 2 6
Superstep 2: 6 6 6 6
Superstep 3: 6 6 6 6
[Figure legend: active vertices, vertices that have voted to halt, messages, edges, workers]
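The largest-value example combines the superstep loop, voting to halt, and reactivation by message. A minimal single-process sketch follows; it is not the Pregel API. The function name `run_max_value` and the edge set used to exercise it are assumptions (the slide does not list the edges), picked so that the values per superstep match the figure: 3 6 2 1, then 6 6 2 6, then 6 6 6 6.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::vector<std::vector<std::size_t>> AdjacencyList;

// Runs the maximum-value computation superstep by superstep and returns
// the vertex values recorded at the end of every superstep.
std::vector<std::vector<int>> run_max_value(std::vector<int> values,
                                            const AdjacencyList& out_edges) {
    assert(values.size() == out_edges.size());
    const std::size_t n = values.size();
    std::vector<bool> active(n, true);  // superstep 0: every vertex is active
    std::vector<std::vector<int>> inbox(n), history;
    for (std::size_t superstep = 0;; ++superstep) {
        std::vector<std::vector<int>> next_inbox(n);
        bool any_ran = false;
        for (std::size_t v = 0; v < n; ++v) {
            if (!inbox[v].empty()) active[v] = true;  // a message reactivates
            if (!active[v]) continue;
            any_ran = true;
            bool changed = (superstep == 0);  // first, announce own value
            for (int m : inbox[v]) {
                if (m > values[v]) { values[v] = m; changed = true; }
            }
            if (changed) {
                for (std::size_t target : out_edges[v])
                    next_inbox[target].push_back(values[v]);
            } else {
                active[v] = false;  // vote to halt
            }
        }
        if (!any_ran) break;  // all halted and no messages in transit
        history.push_back(values);
        // Barrier: messages sent now are only visible in the next superstep.
        inbox = std::move(next_inbox);
    }
    return history;
}
```

With the assumed edges 0↔1, 1→3 and 2↔3 and start values 3, 6, 2, 1, the returned history has four entries, one per superstep shown in the figure.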
Advantages
Makes distributed programming easy
‣ No locks, semaphores, race conditions
‣ Separates computing from communication phase
Vertex-level parallelization
‣ Bulk message passing for efficiency
Stateful (in-memory)
‣ Only messages & checkpoints hit disk
Lifecycle of a Pregel Program
Apache Giraph, Claudio Martella, Hadoop Summit, Amsterdam, April 2014
Applications
SSSP
class ShortestPathVertex : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    // The source starts at distance 0; every other vertex starts at INF.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    // Take the smallest distance offered by any incoming message.
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      // Found a shorter path: store it and relax every outgoing edge.
      *MutableValue() = mindist;
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    // Sleep until a new message arrives.
    VoteToHalt();
  }
};
In superstep 0, only the source vertex updates its value.
SSSP (1/6 to 6/6)
[Figure sequence: weighted input graph with vertices A, B, C, D, E, H, F, G (edge weights 1 to 5), partitioned across Workers 1 to 4. The distance labels shown on each slide are listed below in the order they appear on the slide.]
Superstep 0: 0 ∞ ∞ ∞ ∞ ∞ ∞ ∞
Superstep 1: 0 1 2 ∞ ∞ ∞ ∞ ∞
Superstep 2: 0 1 2 3 4 6 3 ∞
Superstep 3: 0 1 2 3 4 4 3 6
Final: 0 1 2 3 4 4 3 6
Algorithm has converged
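The SSSP walkthrough can be replayed with a small single-process simulation of the same vertex logic. This is a sketch under loud assumptions: the slides' actual edges are not recoverable from the transcript, so `example_graph` is a hypothetical topology that reuses the ten edge weights shown on the slides and reproduces the distance labels if they are read in the vertex order A, B, C, D, E, H, F, G. The names `Edge`, `WeightedGraph` and `run_sssp` are illustrative and not part of Pregel or Giraph.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

const int INF = std::numeric_limits<int>::max();

struct Edge {
    std::size_t target;
    int weight;
};

typedef std::vector<std::vector<Edge>> WeightedGraph;

// Hypothetical topology (A=0, B=1, C=2, D=3, E=4, F=5, G=6, H=7).
WeightedGraph example_graph() {
    WeightedGraph g(8);
    g[0] = {Edge{1, 1}, Edge{2, 2}};  // A -> B, A -> C
    g[1] = {Edge{3, 2}, Edge{7, 5}};  // B -> D, B -> H
    g[2] = {Edge{4, 2}, Edge{5, 1}};  // C -> E, C -> F
    g[3] = {Edge{6, 3}, Edge{7, 4}};  // D -> G, D -> H
    g[4] = {Edge{6, 2}};              // E -> G
    g[5] = {Edge{7, 1}};              // F -> H
    return g;
}

// Returns the distance labels recorded at the end of each superstep.
std::vector<std::vector<int>> run_sssp(const WeightedGraph& out_edges,
                                       std::size_t source) {
    assert(source < out_edges.size());
    const std::size_t n = out_edges.size();
    std::vector<int> dist(n, INF);
    std::vector<std::vector<int>> inbox(n), history;
    bool first = true;
    while (true) {
        std::vector<std::vector<int>> next_inbox(n);
        bool any_ran = false;
        for (std::size_t v = 0; v < n; ++v) {
            // Every vertex runs in superstep 0; afterwards a vertex runs
            // only if a message woke it up (it always votes to halt).
            if (!first && inbox[v].empty()) continue;
            any_ran = true;
            int mindist = (v == source) ? 0 : INF;
            for (int m : inbox[v]) mindist = std::min(mindist, m);
            if (mindist < dist[v]) {
                dist[v] = mindist;
                for (const Edge& e : out_edges[v])
                    next_inbox[e.target].push_back(mindist + e.weight);
            }
        }
        if (!any_ran) break;  // no vertex active: the algorithm has converged
        history.push_back(dist);
        inbox = std::move(next_inbox);
        first = false;
    }
    return history;
}
```

On the hypothetical graph, H is first reached through B with distance 6 and is corrected to 4 one superstep later through F, which mirrors the 6 to 4 change between Superstep 2 and Superstep 3 on the slides.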
Apache Giraph
Platform Improvements (1/2)
Efficient Memory Management (MM)
‣ Vertex and Edge data stored using serialized byte arrays
‣ Better MM -> less garbage collection (GC)
Support for Multi-Threading
‣ Maximized resource utilization
‣ Linear speed-up for CPU-bound applications like K-Means Clustering
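The byte-array idea can be illustrated with a small edge store that keeps all out-edges of a vertex in one contiguous buffer and decodes an edge only when it is read. This is a layout sketch, not Giraph's implementation: Giraph is written in Java, where the gain comes from replacing many small objects with one array, while C++ has no garbage collector. The class name `ByteArrayEdges` and the record layout (64-bit target id followed by a 32-bit float weight) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bytes per serialized edge: target id followed by weight.
const std::size_t kRecordSize = sizeof(std::int64_t) + sizeof(float);

// Hypothetical edge store: one byte array per vertex instead of one
// object per edge.
class ByteArrayEdges {
public:
    void add(std::int64_t target, float weight) {
        append(&target, sizeof target);
        append(&weight, sizeof weight);
        ++count_;
    }

    std::size_t size() const { return count_; }
    std::size_t bytes() const { return buffer_.size(); }

    // Deserializes edge i on demand into the caller's variables.
    void get(std::size_t i, std::int64_t& target, float& weight) const {
        assert(i < count_);
        const unsigned char* p = buffer_.data() + i * kRecordSize;
        std::memcpy(&target, p, sizeof target);
        std::memcpy(&weight, p + sizeof target, sizeof weight);
    }

private:
    void append(const void* src, std::size_t n) {
        const unsigned char* b = static_cast<const unsigned char*>(src);
        buffer_.insert(buffer_.end(), b, b + n);
    }

    std::vector<unsigned char> buffer_;
    std::size_t count_ = 0;
};
```

The trade-off is extra decoding work on every access in exchange for a compact, predictable memory footprint per vertex.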
Platform Improvements (2/2)
Flexible IO Format
‣ Reduces pre-processing
‣ Allows Vertex and Edge data to be loaded from different sources
Sharded Aggregator
‣ Aggregator responsibilities are balanced across workers
‣ Different Aggregators can be assigned to different workers.
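One way to picture sharded aggregation: each aggregator gets an owning worker, every worker sends its partial value to that owner, and the owner combines them. The sketch below assumes sum aggregators and assigns owners by hashing the aggregator name; both choices, and the names `owner_of` and `reduce_sharded`, are assumptions for illustration rather than Giraph's actual scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical assignment: the owning worker is chosen by hashing the name.
std::size_t owner_of(const std::string& name, std::size_t num_workers) {
    assert(num_workers > 0);
    return std::hash<std::string>{}(name) % num_workers;
}

// partials[w][name] is worker w's locally summed value for an aggregator.
// Returns, for each worker, the final values of the aggregators it owns.
std::vector<std::map<std::string, long>> reduce_sharded(
        const std::vector<std::map<std::string, long>>& partials) {
    std::vector<std::map<std::string, long>> owned(partials.size());
    for (const auto& worker_partials : partials) {
        for (const auto& kv : worker_partials) {
            // "Send" the partial value to the owner, which adds it up.
            const std::size_t owner = owner_of(kv.first, partials.size());
            owned[owner][kv.first] += kv.second;
        }
    }
    return owned;
}
```

Because each aggregator lives on exactly one shard, the combining work is spread across workers instead of being concentrated on a single node.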
Refer to the classroom discussion
Beyond Pregel
Master Compute
‣ Allows centralized execution of computation
‣ Refer to the classroom discussion
Worker Phases
‣ Special methods that bypass the Pregel model but improve usability
‣ Applicability is application-specific
Computation Composability
‣ Decouples Vertex and Computation
‣ Existing Computation implementations can be reused across multiple applications
Superstep Splitting
Master runs the same “message heavy” superstep for a fixed number of iterations
For an iteration:
‣ Vertex computation is invoked only if the vertex passes hash function H
‣ A message is sent to a destination only if the destination passes hash function H’
Applicable to computations where messages are not “aggregatable”.
‣ If they can be aggregated (commutative and associative), then stick with Combiners
Example: Friends-of-Friends Computation
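Superstep splitting can be sketched on the friends-of-friends example: each vertex sends its whole friend list to its friends, and the receiver counts how often each id appears. In the sketch below, H and H’ are assumed to be simple modulo tests on the vertex id, and the names `FriendLists`, `SplitResult` and `friends_of_friends` are hypothetical. The per-vertex update is an increment, so applying the messages in several smaller batches gives the same result as one large batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

typedef std::vector<std::vector<std::size_t>> FriendLists;

struct SplitResult {
    // fof[v][u] = number of friends of v whose friend list contains u
    std::vector<std::map<std::size_t, int>> fof;
    std::size_t peak_messages;  // largest message volume in one iteration
};

SplitResult friends_of_friends(const FriendLists& friends,
                               std::size_t src_splits,
                               std::size_t dst_splits) {
    assert(src_splits > 0 && dst_splits > 0);
    SplitResult r;
    r.fof.resize(friends.size());
    r.peak_messages = 0;
    for (std::size_t a = 0; a < src_splits; ++a) {
        for (std::size_t b = 0; b < dst_splits; ++b) {
            // Messages of this iteration only: (destination, friend list).
            std::vector<std::pair<std::size_t,
                                  const std::vector<std::size_t>*>> inflight;
            std::size_t volume = 0;
            for (std::size_t v = 0; v < friends.size(); ++v) {
                if (v % src_splits != a) continue;      // H: v computes now?
                for (std::size_t d : friends[v]) {
                    if (d % dst_splits != b) continue;  // H': d receives now?
                    inflight.push_back(std::make_pair(d, &friends[v]));
                    volume += friends[v].size();
                }
            }
            r.peak_messages = std::max(r.peak_messages, volume);
            // Partial update of vertex state; the batch is then discarded.
            for (const auto& msg : inflight)
                for (std::size_t u : *msg.second)
                    if (u != msg.first) ++r.fof[msg.first][u];
        }
    }
    return r;
}
```

Calling the function with `(1, 1)` is the unsplit superstep, while `(2, 2)` spreads the same messages over four iterations. The counts are identical, but the peak volume held at any one time drops, which is the point of the technique.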
Key Take-aways
Usability, performance and scalability improvements to Apache Giraph
‣ Code available as open-source to try out!
Memoir detailing Facebook’s experience of using Giraph for production applications
Headline grabber: “Scales to a Trillion Edge graph”