Java/Scala Lab 2016. Александр Конопко: Machine Learning in Spark.
HADOOP MR
Spark tries to keep intermediate data in memory, whereas MapReduce writes it to disk and reads it back between steps.
Spark is much faster and more convenient than Hadoop MapReduce:
• Caches data in memory
• Pipelines calculations through RDDs with optional caching
• Organizes calculations as a DAG
• Provides user-friendly Scala, Python and Java APIs
• Offers a number of useful Spark libraries: GraphX, MLlib, etc.
Spark also adds libraries for doing things like machine learning, streaming, graph programming and SQL
SOME ACTIONS AND TRANSFORMATIONS
Transformations: map(func), flatMap(func), groupByKey(), reduceByKey(func), mapValues(func), sample(…), union(other), distinct(), sortByKey(), …
Actions: reduce(func), collect(), count(), first(), take(n), saveAsTextFile(path), countByKey(), foreach(func), …
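The difference between the two groups is that transformations are lazy and only describe a computation, while actions force it to run. The sketch below imitates that behaviour in plain Python with iterators. It does not use Spark: `flat_map`, `reduce_by_key` and `tokenize` are hypothetical stand-ins named after the RDD operations, and the sample lines are made up for illustration.

```python
from itertools import chain

seen = []  # records which input lines were actually processed


def tokenize(line):
    """Split a line into words and remember that the line was touched."""
    seen.append(line)
    return line.split()


def flat_map(func, items):
    """Stand-in for flatMap: apply func and flatten the results lazily."""
    return chain.from_iterable(map(func, items))


def reduce_by_key(func, pairs):
    """Stand-in for reduceByKey: a generator, so nothing runs until consumed."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    yield from merged.items()


lines = ["spark keeps data in memory", "spark builds a dag"]

# "Transformations": only a pipeline of lazy iterators is built here.
words = flat_map(tokenize, lines)
pairs = map(lambda word: (word, 1), words)
counts = reduce_by_key(lambda a, b: a + b, pairs)
processed_before_action = list(seen)  # still empty, no line was read yet

# "Action": materialising the result (like collect()) runs the whole chain.
collected = dict(counts)
```

After the last line, `collected` holds the word counts and `seen` contains both input lines, whereas `processed_before_action` is empty. In Spark the same effect is what lets the engine plan the whole DAG before executing anything.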
ALS MODEL AND ALGORITHM
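Model the ratings matrix R (U users x M movies) as the product of a User matrix A (U x K) and a Movie Feature matrix B (M x K), so that R ≈ A·Bᵀ, where K is the number of latent features.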
Alternating Least Squares (ALS):
• Start with random A and B vectors
• Optimize user vectors (A) with movie vectors (B) held fixed
• Optimize movie vectors (B) with user vectors (A) held fixed
• Repeat until converged
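The alternation can be sketched on a single machine in plain Python. This is a minimal illustration, not the distributed implementation that MLlib ships. Several things are assumptions that the slides do not state: the regularisation term `lam`, the stopping tolerance `tol`, the iteration cap, the tiny `ratings` dictionary, and all helper names (`solve`, `update`, `loss`, `als`).

```python
import random


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def update(observed_by_row, fixed, k, lam):
    """Regularised least-squares fit of one side while the other is fixed."""
    updated = []
    for observed in observed_by_row:  # list of (index_on_fixed_side, rating)
        gram = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
        rhs = [0.0] * k
        for idx, rating in observed:
            vec = fixed[idx]
            for i in range(k):
                rhs[i] += rating * vec[i]
                for j in range(k):
                    gram[i][j] += vec[i] * vec[j]
        updated.append(solve(gram, rhs))
    return updated


def loss(ratings, A, B, lam):
    """Squared error on observed ratings plus the (assumed) L2 penalty."""
    err = sum((r - sum(a * b for a, b in zip(A[u], B[m]))) ** 2
              for (u, m), r in ratings.items())
    return err + lam * sum(x * x for row in A + B for x in row)


def als(ratings, n_users, n_movies, k=2, lam=0.1, max_iter=20, tol=1e-6, seed=0):
    rng = random.Random(seed)
    # Start with random A and B vectors.
    A = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    B = [[rng.random() for _ in range(k)] for _ in range(n_movies)]
    by_user = [[] for _ in range(n_users)]
    by_movie = [[] for _ in range(n_movies)]
    for (u, m), r in ratings.items():
        by_user[u].append((m, r))
        by_movie[m].append((u, r))
    history = [loss(ratings, A, B, lam)]
    for _ in range(max_iter):
        A = update(by_user, B, k, lam)   # optimise users, movies fixed
        B = update(by_movie, A, k, lam)  # optimise movies, users fixed
        history.append(loss(ratings, A, B, lam))
        if history[-2] - history[-1] < tol:  # repeat until converged
            break
    return A, B, history


# Hypothetical ratings: (user, movie) -> rating; missing pairs are unobserved.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 1.0, (2, 2): 5.0}
A, B, history = als(ratings, n_users=3, n_movies=3)
```

Each half-step solves its least-squares problem exactly, so the values in `history` never increase. That property is what makes the "repeat until converged" loop well behaved. Keeping `lam` above zero also keeps every small linear system solvable, even for a user or movie with no ratings.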