Till Rohrmann
Flink committer / data Artisans
[email protected] | @stsffap
Computing Recommendations at Extreme Scale with Apache Flink
Recommendations: Collaborative Filtering
Recommendations
• Omnipresent nowadays
• Important for user experience and sales
Recommending Websites
Collaborative Filtering
• Recommend items based on users with similar preferences
• Latent factor models capture underlying characteristics of items and preferences of users
• Predicted preference: $\hat{r}_{u,i} = x_u^T y_i$
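As a minimal sketch of the predicted preference, the following plain-Scala snippet computes the dot product of a user's and an item's latent factor vectors. The object and method names are hypothetical and only serve as illustration; they are not part of Flink.

```scala
object PredictedPreference {
  // Predicted rating = dot product of the user's and the item's latent factors.
  def predict(userFactors: Array[Double], itemFactors: Array[Double]): Double = {
    require(userFactors.length == itemFactors.length,
      "factor vectors must have equal length")
    userFactors.zip(itemFactors).map { case (x, y) => x * y }.sum
  }
}
```

For example, `PredictedPreference.predict(Array(1.0, 2.0), Array(3.0, 0.5))` multiplies the factors pairwise and sums them up.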
Rating Matrix
• Explicit or implicit ratings
• Prediction goal: Rating for unseen items
[Figure: rating matrix of users and items; known ratings are numbers such as 10, 5, 2 and 7, unknown ratings are marked "?"]
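Matrix Factorization
• Calculate low rank approximation to obtain latent factors

$$\min_{X,Y} \sum_{r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)$$

• $r_{u,i}$ = rating of user u for item i
• $x_u$ = latent factors of user u
• $y_i$ = latent factors of item i
• $\lambda$ = regularization constant
• $n_u$ = number of rated items for user u
• $n_i$ = number of ratings for item i

[Figure: rating matrix of users and items approximated by the product of the user matrix X and the item matrix Y]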
Alternating Least Squares
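• Hard to optimize since we have two variables, X and Y
• Fixing one variable gives a quadratic problem

$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T, \qquad S^u_{ii} = \begin{cases} 1 & \text{if } r_{u,i} \neq 0 \\ 0 & \text{else} \end{cases}$$

• We only need the item vectors rated by user u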
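A short derivation sketch of why fixing one variable yields a closed-form update. It assumes that Y holds the item vectors y_i as columns, that r_u is the row of ratings of user u with zeros for unrated items, and that the regularization constant is positive. With Y fixed, only the terms shown in the first line depend on x_u:

```latex
\begin{align*}
f(x_u) &= \sum_{i:\, r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2
          + \lambda n_u \lVert x_u \rVert^2 \\
\nabla f(x_u) &= -2 \sum_{i:\, r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right) y_i
          + 2 \lambda n_u x_u = 0 \\
\Big( \sum_{i:\, r_{u,i} \neq 0} y_i y_i^T + \lambda n_u I \Big)\, x_u
       &= \sum_{i:\, r_{u,i} \neq 0} r_{u,i}\, y_i \\
x_u &= \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T
\end{align*}
```

The last step uses that the sum of y_i y_i^T over the rated items equals Y S^u Y^T, and that the sum of r_{u,i} y_i equals Y r_u^T because unrated entries of r_u are zero. The update for an item vector y_i follows by symmetry with the roles of X and Y swapped.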
ALS Algorithm
• Update user matrix X: keep item matrix Y fixed and calculate the update for X
ALS Algorithm contd.
• Update item matrix Y: keep user matrix X fixed and calculate the update for Y
ALS Algorithm contd.
• Repeat update steps until convergence
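The alternating scheme can be sketched on a single machine with plain Scala collections. This is a minimal illustration under stated assumptions, not the Flink implementation: all names (`AlsSketch`, `solve`, `updateVector`, `als`, `objective`) are hypothetical, the item factors are initialized randomly with a fixed seed, a fixed number of iterations replaces a convergence check, and the regularization constant is assumed to be positive so that the small linear systems are always solvable.

```scala
object AlsSketch {
  type Vec = Array[Double]

  // Solve A x = b by Gaussian elimination with partial pivoting (A is k x k).
  def solve(a: Array[Array[Double]], b: Vec): Vec = {
    val n = b.length
    val m = Array.tabulate(n, n + 1)((r, c) => if (c < n) a(r)(c) else b(r))
    for (col <- 0 until n) {
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col to n) m(r)(c) -= f * m(col)(c)
      }
    }
    val x = new Array[Double](n)
    for (r <- n - 1 to 0 by -1) {
      val s = (r + 1 until n).map(c => m(r)(c) * x(c)).sum
      x(r) = (m(r)(n) - s) / m(r)(r)
    }
    x
  }

  // New factor vector for one user (or item):
  // (sum y y^T + lambda * n * I)^-1 * (sum r * y),
  // built only from the rated (other-side factor vector, rating) pairs.
  def updateVector(rated: Seq[(Vec, Double)], lambda: Double, k: Int): Vec = {
    val a = Array.fill(k, k)(0.0)
    val b = new Array[Double](k)
    for ((y, r) <- rated; p <- 0 until k) {
      b(p) += r * y(p)
      for (q <- 0 until k) a(p)(q) += y(p) * y(q)
    }
    for (p <- 0 until k) a(p)(p) += lambda * rated.size
    solve(a, b)
  }

  // ratings: (user, item, rating) triples; returns (user factors, item factors).
  def als(ratings: Seq[(Int, Int, Double)], k: Int, lambda: Double,
          iterations: Int, seed: Long = 42L): (Map[Int, Vec], Map[Int, Vec]) = {
    val rnd = new scala.util.Random(seed)
    val byUser = ratings.groupBy(_._1)
    val byItem = ratings.groupBy(_._2)
    var items: Map[Int, Vec] =
      byItem.keys.map(i => i -> Array.fill(k)(rnd.nextDouble())).toMap
    var users: Map[Int, Vec] = Map.empty
    for (_ <- 1 to iterations) {
      // Keep Y fixed, calculate the update for X.
      users = byUser.map { case (u, rs) =>
        u -> updateVector(rs.map(t => (items(t._2), t._3)), lambda, k)
      }
      // Keep X fixed, calculate the update for Y.
      items = byItem.map { case (i, rs) =>
        i -> updateVector(rs.map(t => (users(t._1), t._3)), lambda, k)
      }
    }
    (users, items)
  }

  // Regularized squared error that ALS minimizes.
  def objective(ratings: Seq[(Int, Int, Double)], users: Map[Int, Vec],
                items: Map[Int, Vec], lambda: Double): Double = {
    def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum
    val fit = ratings.map { case (u, i, r) =>
      math.pow(r - dot(users(u), items(i)), 2)
    }.sum
    val regU = ratings.groupBy(_._1).toSeq.map { case (u, rs) =>
      rs.size * dot(users(u), users(u))
    }.sum
    val regI = ratings.groupBy(_._2).toSeq.map { case (i, rs) =>
      rs.size * dot(items(i), items(i))
    }.sum
    fit + lambda * (regU + regI)
  }
}
```

Because each half-step solves its quadratic subproblem exactly, the value returned by `objective` should not increase from one iteration to the next, which gives a simple sanity check for a sketch like this.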
Apache Flink
What is Apache Flink?
Apache Flink deep-dive by Stephan Ewen: tomorrow, 12:20 – 13:00 on Stage 3
[Figure: Apache Flink stack. Libraries and compatibility layers: Gelly, Table, FlinkML, SAMOA, Dataflow (WiP), MRQL, Cascading (WiP), Hadoop M/R. APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala). Runtime: streaming dataflow runtime. Deployment: Local, Remote, Yarn, Tez, Embedded.]
Why Use Flink for ALS?
• Expressive API
• Pipelined stream processor
• Closed loop iterations
• Operations on managed memory
Expressive APIs
• DataSet: Abstraction for distributed data
• Computation specified as sequence of lazily evaluated transformations
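import org.apache.flink.api.scala._
case class Word(word: String, frequency: Int)
val env = ExecutionEnvironment.getExecutionEnvironment
val lines: DataSet[String] = env.readTextFile(…)
// Split each line into words, then sum up the frequency per word
lines.flatMap(line => line.split(" ").map(word => Word(word, 1))).groupBy("word").sum("frequency").print()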
Pipelined Stream Processor
• Avoiding materialization of intermediate results
Iterate in the Dataflow
Memory Management
ALS implementations with Apache Flink
Naïve Implementation
1. Join item vectors with ratings
2. Group on user ID
3. Compute new user vectors: $x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$
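The first two steps of the naïve data flow can be sketched with plain Scala collections standing in for distributed data sets. This is only an illustration under that assumption; the names `NaiveAlsDataflow`, `joinWithItems` and `groupByUser` are hypothetical and not Flink API. Each group produced by the second step contains exactly the (item vector, rating) pairs that the update formula for one user needs.

```scala
object NaiveAlsDataflow {
  type Vec = Array[Double]

  // Step 1: join ratings with item vectors on the item ID.
  // Every rating record carries its own copy of the item vector.
  def joinWithItems(ratings: Seq[(Int, Int, Double)],
                    items: Map[Int, Vec]): Seq[(Int, Vec, Double)] =
    for ((user, item, r) <- ratings if items.contains(item))
      yield (user, items(item), r)

  // Step 2: group the joined records on the user ID.
  def groupByUser(joined: Seq[(Int, Vec, Double)]): Map[Int, Seq[(Vec, Double)]] =
    joined.groupBy(_._1).map { case (u, recs) =>
      u -> recs.map(t => (t._2, t._3))
    }
}
```

Since the join emits one item vector per rating, an item rated by many users has its vector copied once for each of those ratings before the grouping step moves the records to the users.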
Pros and Cons of Naïve ALS
Pros
• Easy to implement
Cons
• Item vectors are sent redundantly to network nodes
• Two shuffle steps make execution expensive
Blocked ALS Implementation
1. Create user and item rating blocks
2. Cache them on worker nodes
3. Send all item vectors needed by a user rating block bundled together
4. Compute block of user vectors
Based on Spark’s MLlib implementation
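The effect of bundling can be illustrated with a small plain-Scala sketch that counts how many item vectors would be shipped per iteration. It rests on assumptions made only for this example: users are assigned to blocks by user ID modulo the number of blocks, user IDs are non-negative, and the naïve variant ships one item vector per rating. The names `BlockedAlsRouting`, `userBlock`, `itemsNeededPerBlock` and `shippedVectors` are hypothetical.

```scala
object BlockedAlsRouting {
  // Hypothetical block assignment: user ID modulo the number of blocks.
  def userBlock(user: Int, numBlocks: Int): Int = user % numBlocks

  // For each user block, the distinct item IDs whose vectors the block needs.
  def itemsNeededPerBlock(ratings: Seq[(Int, Int, Double)],
                          numBlocks: Int): Map[Int, Set[Int]] =
    ratings.groupBy(r => userBlock(r._1, numBlocks))
      .map { case (block, rs) => block -> rs.map(_._2).toSet }

  // Item vectors shipped per iteration as (naive, blocked):
  // one per rating versus one per distinct (user block, item) pair.
  def shippedVectors(ratings: Seq[(Int, Int, Double)], numBlocks: Int): (Int, Int) =
    (ratings.size,
      itemsNeededPerBlock(ratings, numBlocks).values.toSeq.map(_.size).sum)
}
```

When several users in the same block rated the same item, the blocked count is lower than the naïve one, because the item vector is sent to that block only once.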
Pros and Cons of Blocked ALS
Pros
• Reduces network load by avoiding data duplication
• Caching ratings: Only one shuffle step needed
Cons
• Duplicates the rating matrix (user block/item block partitioning)
Performance Comparison
• 40 node GCE cluster, highmem-8
• 10 ALS iterations with 50 latent factors
• Rating matrix has 28 billion non-zero entries: scale of Netflix or Spotify
Machine Learning with FlinkML
• FlinkML contains blocked ALS
• Support for many other tasks
  • Clustering
  • Regression
  • Classification
• scikit-learn like pipeline support
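import org.apache.flink.api.scala._
import org.apache.flink.ml.common.ParameterMap
import org.apache.flink.ml.recommendation.ALS
val als = ALS()
// Ratings as (user ID, item ID, rating) triples
val ratingDS = env.readCsvFile[(Int, Int, Double)](ratingData)
val parameters = ParameterMap().add(ALS.Iterations, 10).add(ALS.NumFactors, 50).add(ALS.Lambda, 1.5)
als.fit(ratingDS, parameters)
// (user ID, item ID) pairs whose ratings should be predicted
val testingDS = env.readCsvFile[(Int, Int)](testingData)
val predictions = als.predict(testingDS)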
Closing
What Have You Seen?
How to use collaborative filtering to make recommendations
Apache Flink, a powerful parallel stream processing engine
How to use Apache Flink and alternating least squares to factorize really large matrices
flink.apache.org
@ApacheFlink