Apache Spark 2.0: Faster, Easier, and Smarter


Transcript of Apache Spark 2.0: Faster, Easier, and Smarter

Page 1: Apache Spark 2.0: Faster, Easier, and Smarter

Apache Spark 2.0: Faster, Easier, and Smarter

Reynold Xin @rxin 2016-05-05 Webinar

Page 2: Apache Spark 2.0: Faster, Easier, and Smarter

About Databricks

Founded by creators of Spark in 2013

Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …

Page 3: Apache Spark 2.0: Faster, Easier, and Smarter

What is Apache Spark?

Unified engine across data workloads and platforms

SQL, Streaming, ML, Graph, Batch, …

Page 4: Apache Spark 2.0: Faster, Easier, and Smarter

A slide from 2013 …

Page 5: Apache Spark 2.0: Faster, Easier, and Smarter
Page 6: Apache Spark 2.0: Faster, Easier, and Smarter
Page 7: Apache Spark 2.0: Faster, Easier, and Smarter

Spark 2.0: Steps to bigger & better things…

Builds on all we learned in the past 2 years

Page 8: Apache Spark 2.0: Faster, Easier, and Smarter

Versioning in Spark

In reality, we hate breaking APIs!
Will not do so except for dependency conflicts (e.g. Guava) and experimental APIs

1.6.0
- Major version (1): may change APIs
- Minor version (6): adds APIs / features
- Patch version (0): only bug fixes

Page 9: Apache Spark 2.0: Faster, Easier, and Smarter

Major Features in 2.0

- Tungsten Phase 2: speedups of 5-20x
- Structured Streaming
- SQL 2003
- Unifying Datasets and DataFrames

Page 10: Apache Spark 2.0: Faster, Easier, and Smarter

API Foundation for the Future

Dataset, DataFrame, SQL, ML

Page 11: Apache Spark 2.0: Faster, Easier, and Smarter

Towards SQL 2003

As of this week, Spark branch-2.0 can run all 99 TPC-DS queries!

- New standard-compliant parser (with good error messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
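For instance, a minimal Scala sketch of an uncorrelated scalar subquery the new parser accepts (assumes a SparkSession named spark, with the TPC-DS store_sales table registered as a temp view):

// Rows whose net paid amount is above the table-wide average,
// computed via an uncorrelated scalar subquery in the WHERE clause.
val overAvg = spark.sql("""
  SELECT ss_item_sk
  FROM store_sales
  WHERE ss_net_paid > (SELECT avg(ss_net_paid) FROM store_sales)
""")
overAvg.show()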

Page 12: Apache Spark 2.0: Faster, Easier, and Smarter

Datasets and DataFrames

In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten

Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
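In the Scala API this merge is a type alias, so the untyped and typed views share one implementation. A minimal sketch (assumes a SparkSession named spark; the data path and Person schema are illustrative):

// In Spark 2.0's Scala API: type DataFrame = Dataset[Row]
import org.apache.spark.sql.Dataset
case class Person(name: String, age: Long)

val df = spark.read.json("people.json")   // DataFrame, i.e. Dataset[Row]
import spark.implicits._
val ds: Dataset[Person] = df.as[Person]   // add static types on top
ds.filter(_.age > 21).show()              // checked against Person's fields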

Page 13: Apache Spark 2.0: Faster, Easier, and Smarter

SparkSession – a new entry point

SparkSession is the “SparkContext” for Dataset/DataFrame

- Entry point for reading data
- Working with metadata
- Configuration
- Cluster resource management
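A minimal sketch of creating and using a SparkSession (app name, master, and file path are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("demo")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "8")   // configuration
  .getOrCreate()

val df = spark.read.json("people.json")          // entry point for reading data
spark.catalog.listTables().show()                // working with metadata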

Page 14: Apache Spark 2.0: Faster, Easier, and Smarter

Notebook demo

http://bit.ly/1SMPEzQ

and

http://bit.ly/1OeqdSn

Page 15: Apache Spark 2.0: Faster, Easier, and Smarter

Long-Term

RDD will remain the low-level API in Spark

Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as interchange format
• Examples: Structured Streaming, MLlib, GraphFrames

Page 16: Apache Spark 2.0: Faster, Easier, and Smarter

Other notable API improvements

DataFrame-based ML pipeline API becoming the main MLlib API

ML model & pipeline persistence with almost complete coverage
• In all programming languages: Scala, Java, Python, R
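A minimal sketch of saving and reloading a fitted pipeline (stages, columns, training DataFrame, and path are illustrative):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression()))

val model = pipeline.fit(training)                 // `training`: DataFrame with text & label columns
model.write.overwrite().save("/tmp/demo-model")    // persist the fitted pipeline
val restored = PipelineModel.load("/tmp/demo-model")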

Improved R support
• (Parallelizable) user-defined functions in R
• Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means

Page 17: Apache Spark 2.0: Faster, Easier, and Smarter

Structured Streaming

How do we simplify streaming?

Page 18: Apache Spark 2.0: Faster, Easier, and Smarter

Background

Real-time processing is vital for streaming analytics

Apps need a combination: batch & interactive queries
• Track state using a stream, then run SQL queries
• Train an ML model offline, then update it

Page 19: Apache Spark 2.0: Faster, Easier, and Smarter

Integration Example

A streaming engine consumes a stream of page-view events:

(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .

and writes per-minute aggregates to MySQL:

Page     Minute   Visits
home     10:09    21
pricing  10:10    30
...      ...      ...

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...

Page 20: Apache Spark 2.0: Faster, Easier, and Smarter

Complex Programming Models

• Data: late arrival, varying distribution over time, …
• Processing: business logic changes & new ops (windows, sessions)
• Output: how do we define output over time & correctness?

Page 21: Apache Spark 2.0: Faster, Easier, and Smarter

The simplest way to perform streaming analytics is not having to reason about streaming.

Page 22: Apache Spark 2.0: Faster, Easier, and Smarter

Spark 1.3: Static DataFrames
Spark 2.0: Infinite DataFrames

Single API!

Page 23: Apache Spark 2.0: Faster, Easier, and Smarter

logs = ctx.read.format("json").open("s3://logs")

logs.groupBy(logs.user_id).agg(sum(logs.time))

.write.format("jdbc")

.save("jdbc:mysql//...")

Example: Batch Aggregation

Page 24: Apache Spark 2.0: Faster, Easier, and Smarter

logs = ctx.read.format("json").stream("s3://logs")

logs.groupBy(logs.user_id).agg(sum(logs.time))

.write.format("jdbc")

.startStream("jdbc:mysql//...")

Example: Continuous Aggregation

Page 25: Apache Spark 2.0: Faster, Easier, and Smarter

Structured Streaming

High-level streaming API built on Spark SQL engine
• Declarative API that extends DataFrames / Datasets
• Event time, windowing, sessions, sources & sinks

Support interactive & batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models

Not just streaming, but “continuous applications”
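For event time and windowing, a hedged sketch (assumes a SparkSession named spark and a streaming DataFrame `events` with `timestamp` and `page` columns):

import org.apache.spark.sql.functions.window
import spark.implicits._

// Count page views in 10-minute windows sliding every 5 minutes,
// keyed by event time rather than arrival time.
val counts = events
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"page")
  .count()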

Page 26: Apache Spark 2.0: Faster, Easier, and Smarter

Goal: end-to-end continuous applications

Example pipeline: Kafka → ETL → Database, feeding reporting applications, an ML model, and ad-hoc queries. Traditional streaming covers only part of this pipeline; the rest are other processing types.

Page 27: Apache Spark 2.0: Faster, Easier, and Smarter

Tungsten Phase 2

Can we speed up Spark by 10X?

Page 28: Apache Spark 2.0: Faster, Easier, and Smarter

Demo

http://bit.ly/1X8LKmH

Page 29: Apache Spark 2.0: Faster, Easier, and Smarter

Going back to the fundamentals

Difficult to get order-of-magnitude performance speedups with profiling techniques
• For a 10x improvement, would need to find top hotspots that add up to 90% and make them instantaneous
• For 100x, 99%

Instead, look bottom up: how fast should it run?

Page 30: Apache Spark 2.0: Faster, Easier, and Smarter

Query plan: Scan → Filter → Project → Aggregate

select count(*) from store_sales
where ss_item_sk = 1000

Page 31: Apache Spark 2.0: Faster, Easier, and Smarter

Volcano Iterator Model

Standard for 30 years: almost all databases do it

Each operator is an “iterator” that consumes records from its input operator

class Filter {
  // Advance the child operator until a row satisfies the predicate
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    return found
  }

  // Return the current row (produced by the child)
  def fetch(): InternalRow = {
    child.fetch()
  }
  …
}

Page 32: Apache Spark 2.0: Faster, Easier, and Smarter

What if we hire a college freshman to implement this query in Java in 10 mins?

select count(*) from store_sales
where ss_item_sk = 1000

var count = 0
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}

Page 33: Apache Spark 2.0: Faster, Easier, and Smarter

Volcano model (30+ years of database research)
vs
college freshman (hand-written code in 10 mins)

Page 34: Apache Spark 2.0: Faster, Easier, and Smarter

High throughput:

Volcano:           13.95 million rows/sec
College freshman:  125 million rows/sec

Note: End-to-end, single thread, single column, and data originated in Parquet on disk

Page 35: Apache Spark 2.0: Faster, Easier, and Smarter

How does a student beat 30 years of research?

Volcano:
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining

Hand-written code:
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining

Take advantage of all the information that is known after query compilation
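To make the contrast concrete, a small sketch (all names hypothetical) of the two styles on the same counting query:

// Volcano style: one virtual call per row through an operator interface.
trait Op { def next(): Option[Long] }
class Scan(data: Array[Long]) extends Op {
  private var i = 0
  def next(): Option[Long] =
    if (i < data.length) { i += 1; Some(data(i - 1)) } else None
}
class FilterOp(child: Op, p: Long => Boolean) extends Op {
  def next(): Option[Long] = {
    var row = child.next()
    while (row.isDefined && !p(row.get)) row = child.next()
    row
  }
}

// Hand-written style: one tight loop the JIT can unroll and vectorize,
// with the running count kept in a register.
def countMatches(data: Array[Long]): Long = {
  var count = 0L
  var i = 0
  while (i < data.length) {
    if (data(i) == 1000L) count += 1
    i += 1
  }
  count
}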

Page 36: Apache Spark 2.0: Faster, Easier, and Smarter

Tungsten Phase 2: Spark as a “Compiler”

The whole query plan (Scan → Filter → Project → Aggregate) collapses into a single tight loop:

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}

Functionality of a general-purpose execution engine; performance as if a system were hand-built just to run your query
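A minimal way to observe this in Spark 2.0 (local session for illustration): operators fused by whole-stage code generation are marked with `*` in the physical plan.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codegen").master("local[*]").getOrCreate()
spark.range(1000L * 1000 * 1000)     // 1 billion rows, generated on the fly
  .filter("id = 1000")
  .selectExpr("count(*)")
  .explain()                         // fused stages appear with a leading '*'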

Page 37: Apache Spark 2.0: Faster, Easier, and Smarter

Performance of Core Primitives

cost per row (single thread)

primitive               Spark 1.6   Spark 2.0
filter                  15 ns       1.1 ns
sum w/o group           14 ns       0.9 ns
sum w/ group            79 ns       10.7 ns
hash join               115 ns      4.0 ns
sort (8-bit entropy)    620 ns      5.3 ns
sort (64-bit entropy)   620 ns      40 ns
sort-merge join         750 ns      700 ns

Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11

Page 38: Apache Spark 2.0: Faster, Easier, and Smarter

[Chart: Runtime (seconds), Preliminary TPC-DS, Spark 2.0 vs 1.6, lower is better. Series: Time (1.6), Time (2.0).]

Page 39: Apache Spark 2.0: Faster, Easier, and Smarter

Databricks Community Edition

Best place to try & learn Spark.

Page 40: Apache Spark 2.0: Faster, Easier, and Smarter
Page 41: Apache Spark 2.0: Faster, Easier, and Smarter

Release Schedule

Today: work-in-progress source code available on GitHub

Next week: preview of Spark 2.0 in Databricks Community Edition

Early June: Apache Spark 2.0 GA

Page 42: Apache Spark 2.0: Faster, Easier, and Smarter

Today’s talk

Spark 2.0 doubles down on what made Spark attractive:
• Faster: Project Tungsten Phase 2, i.e. “Spark as a compiler”
• Easier: unified APIs & SQL 2003
• Smarter: Structured Streaming

Only scratched the surface here, as Spark 2.0 will resolve ~2000 tickets.

Learn Spark on Databricks Community Edition
• Join the beta waitlist: https://databricks.com/ce/

Page 43: Apache Spark 2.0: Faster, Easier, and Smarter

Discount code: Meetup16SF

Page 44: Apache Spark 2.0: Faster, Easier, and Smarter

Thank you. Don’t forget to register for Spark Summit SF!