Unsupervised Learning with Apache Spark

Posted on 19-Aug-2014


Description

Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and Singular Value Decomposition (SVD). Bio: Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.

Transcript of Unsupervised Learning with Apache Spark

● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown

● How many kinds of stuff are there?
● Why is some stuff not like the others?
● How do I contextualize new stuff?
● Is there a simpler way to represent this stuff?

● Learn the hidden structure of your data
● Interpret new data as it relates to this structure

● Clustering
○ Partition data into categories
● Dimensionality reduction
○ Find a condensed representation of your data

● Designing a system for processing huge data in parallel

● Taking advantage of it with algorithms that work well in parallel

[Diagram: bigfile.txt in HDFS is read into the lines RDD, mapped into the numbers RDD across partitions, and the sum is returned to the driver.]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()

[Diagram: the same pipeline, with the numbers partitions cached in memory before the sum is computed on the driver.]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()

[Diagram: because numbers is cached, a subsequent sum reads the in-memory partitions directly instead of recomputing them from bigfile.txt.]

MLlib's algorithm coverage, by supervision (rows) and output type (columns: discrete vs. continuous):

Supervised, discrete: classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)

Supervised, continuous: regression
● Linear regression (and regularized variants)

Unsupervised, discrete: clustering
● K-means

Unsupervised, continuous: dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares

(The MLlib overview table above repeats here, now highlighting unsupervised clustering: K-means.)

● Anomalies as data points far away from any cluster

import org.apache.spark.mllib.clustering.KMeans

val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble))

// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)
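A hedged follow-on sketch (not from the talk): with the trained model, MLlib's KMeansModel exposes computeCost (sum of squared distances to the nearest center) and predict, which gives one way to act on the "anomalies are far from any cluster" idea above. This assumes the Spark 1.0+ API, where points are MLlib Vectors rather than Array[Double]; the threshold value is illustrative.

import org.apache.spark.mllib.linalg.Vectors

// Re-parse the points as MLlib Vectors (Spark 1.0+ KMeansModel API)
val points = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

// Within-cluster sum of squared errors: lower means tighter clusters
val wssse = clusters.computeCost(points)

// Flag points whose squared distance to their assigned center exceeds a threshold
val threshold = 10.0  // illustrative value, not from the talk
val anomalies = points.filter { p =>
  val center = clusters.clusterCenters(clusters.predict(p))
  val sqDist = p.toArray.zip(center.toArray).map { case (x, c) => (x - c) * (x - c) }.sum
  sqDist > threshold
}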

● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
○ Recompute cluster centers from the points in each cluster

● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
■ Process each data point independently
○ Recompute cluster centers from the points in each cluster
■ Average across partitions

// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length

  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)

  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }

  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()

// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value

● K-means is very sensitive to the initial set of centers chosen.
● The best existing algorithm for choosing initial centers (K-means++) is highly sequential.

● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its distance from the closest already-chosen center
● Repeat until all k initial centers are chosen
● The initial clustering is within an expected factor of O(log k) of the optimal cost
● Requires k passes over the data
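A minimal local sketch of this selection rule (illustrative code, not MLlib's implementation; names like kMeansPlusPlusInit are made up). The standard K-means++ rule weights each candidate point by its squared distance to the nearest already-chosen center:

import scala.util.Random

def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def kMeansPlusPlusInit(points: Array[Array[Double]], k: Int, rng: Random): Array[Array[Double]] = {
  // Start with a uniformly random point
  val centers = scala.collection.mutable.ArrayBuffer(points(rng.nextInt(points.length)))
  while (centers.length < k) {
    // Weight each point by its squared distance to the closest center chosen so far
    val weights = points.map(p => centers.map(c => sqDist(p, c)).min)
    // Sample the next center with probability proportional to its weight
    var r = rng.nextDouble() * weights.sum
    var i = 0
    while (i < points.length - 1 && r > weights(i)) { r -= weights(i); i += 1 }
    centers += points(i)
  }
  centers.toArray
}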

● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-means++ on the sampled points to find the initial centers
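This parallel initialization scheme is the k-means|| algorithm, which MLlib uses by default. A hedged sketch of selecting it explicitly through the builder API (parameter values are illustrative, and parsedData is assumed to be an RDD of MLlib Vectors as in Spark 1.0+):

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(2)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.K_MEANS_PARALLEL)  // "k-means||"
  .setInitializationSteps(5)                       // number of oversampling passes
  .run(parsedData)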

(The MLlib overview table repeats again, now highlighting dimensionality reduction and matrix factorization: PCA/SVD and alternating least squares.)

● Select a basis for your data that
○ Is orthonormal
○ Maximizes variance along its axes

● Find dominant trends
● Find a lower-dimensional representation that lets you visualize the data
● Feature learning: find a representation that's good for clustering or classification
● Latent Semantic Analysis

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val data: RDD[Vector] = ...
val mat = new RowMatrix(data)

// Compute the top 5 principal components
val principalComponents = mat.computePrincipalComponents(5)

// Project the data onto the principal-component subspace
val projected = mat.multiply(principalComponents)
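The talk abstract also covers singular value decomposition, which RowMatrix exposes directly. A brief sketch (k = 5 and computeU = true are illustrative choices):

// Compute the top 5 singular values and vectors of the same RowMatrix
val svd = mat.computeSVD(5, computeU = true)
val U = svd.U  // RowMatrix of left singular vectors (distributed)
val s = svd.s  // singular values (local Vector)
val V = svd.V  // Matrix of right singular vectors (local)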

● Center the data
● Compute the covariance matrix
● Its eigenvectors are the principal components
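As a toy local illustration of those three steps (a sketch using Breeze, not the MLlib code; the 3×2 data matrix is made up):

import breeze.linalg.{DenseMatrix, eigSym}

// Toy data: m = 3 observations, n = 2 features
val data = DenseMatrix((1.0, 2.0), (3.0, 4.0), (5.0, 7.0))
val m = data.rows

// 1. Center each column
val centered = data.copy
for (j <- 0 until data.cols) {
  val mean = (0 until m).map(i => data(i, j)).sum / m
  for (i <- 0 until m) centered(i, j) -= mean
}

// 2. Covariance matrix (n x n)
val cov = (centered.t * centered) * (1.0 / (m - 1))

// 3. The eigenvectors of the covariance matrix are the principal components
//    (those with the largest eigenvalues capture the most variance)
val eigen = eigSym(cov)
val principalComponents = eigen.eigenvectors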

[Diagrams: the m×n data matrix is distributed by rows; each partition computes a local n×n contribution to the covariance (Gramian) matrix, and the n×n pieces are summed into the final n×n result.]

def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2

  // Compute the upper triangular part of the gram matrix.
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2
  )

  RowMatrix.triuToFull(n, GU.data)
}
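For context, a hedged usage sketch (assuming the RowMatrix mat from the PCA example above): this per-partition aggregation backs RowMatrix's public matrix computations.

val gram = mat.computeGramianMatrix()  // n x n Gramian, aggregated per partition as above
val cov = mat.computeCovariance()      // n x n covariance matrix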

● n^2 must fit in memory
● Not yet implemented: an EM algorithm can do it with O(kn) memory, where k is the number of principal components