Track A-2 基於 Spark 的數據分析 (Spark-Based Data Analytics)


Transcript of Track A-2 基於 Spark 的數據分析

Page 1: Track A-2 基於 Spark 的數據分析

Spark Drives Big Data Analytics Application

基於Spark的數據分析 (Spark-Based Data Analytics)

James Chen, Etu CTO

June 16, 2015

Page 2: Track A-2 基於 Spark 的數據分析

Agenda

• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera's Position on Spark
• Etu and Cloudera

Page 3: Track A-2 基於 Spark 的數據分析

A Brief Review of MapReduce

Key advances by MapReduce:

• Data locality: automatic split computation and launch of mappers near the data
• Fault tolerance: writing out intermediate results and restartable mappers meant the ability to run on commodity hardware
• Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions

[Figure: many parallel Map tasks feeding a shuffle into a smaller set of Reduce tasks]
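The model sketched above can be simulated in plain Python (no Hadoop involved; all names here are invented for illustration): a map phase emits key/value pairs per input record, a shuffle groups values by key, and a reduce phase folds each group.

```python
from collections import defaultdict

def map_phase(splits, mapper):
    # Run the mapper over every record of every input split.
    return [kv for split in splits for record in split for kv in mapper(record)]

def shuffle(pairs):
    # Group all values emitted for the same key (what the framework does between phases).
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    return {k: reducer(vs) for k, vs in groups.items()}

# Word count, the canonical example: emit (word, 1), then sum per word.
splits = [["spark makes analytics fast"], ["spark scales"]]
mapper = lambda line: [(w, 1) for w in line.split()]
counts = reduce_phase(shuffle(map_phase(splits, mapper)), sum)
# counts["spark"] == 2
```

The framework-provided shuffle between the two user-written phases is what lets each phase stay embarrassingly parallel.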

Page 4: Track A-2 基於 Spark 的數據分析

MapReduce: The Good

• Built-in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple(?) API

Page 5: Track A-2 基於 Spark 的數據分析

MapReduce: The Bad

• Optimized for disk IO
 – Doesn't leverage memory
 – Iterative algorithms go through the disk IO path again and again
• Primitive API
 – Developers have to build on a very simple abstraction
 – Key/value in, key/value out
 – Even basic operations like join require extensive code
• The result is often many files that need to be combined appropriately

Page 6: Track A-2 基於 Spark 的數據分析

What is Spark?

Spark is a general-purpose computational framework with more flexibility than MapReduce.

Key properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience

Yet retains: linear scalability, fault tolerance, and data-locality-based computations.

Reference: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 7: Track A-2 基於 Spark 的數據分析

Spark: Easy and Fast Big Data

Easy to develop:
– Highly productive language support
– Clean and expressive APIs
– Interactive shell
– Out-of-the-box functionality

Fast to run:
– General execution graphs
– In-memory storage

2–5× less code; up to 10× faster on disk, 100× in memory.

Page 8: Track A-2 基於 Spark 的數據分析

Easy: Example – Word Count

Hadoop MapReduce:

  public static class WordCountMapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class WordCountReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

Spark:

  val spark = new SparkContext(master, appName, [sparkHome], [jars])
  val file = spark.textFile("hdfs://...")
  val counts = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs://...")

Page 9: Track A-2 基於 Spark 的數據分析

Out-of-the-Box Functionality

Hadoop integration:
• Works with Hadoop data
• Runs with YARN

Libraries:
• MLlib
• Spark Streaming
• GraphX (alpha)

Language support:
• Improved Python support
• SparkR
• Java 8
• Schema support in Spark's APIs

Page 10: Track A-2 基於 Spark 的數據分析

Example: Logistic Regression

  data = spark.textFile(...).map(readPoint).cache()

  w = numpy.random.rand(D)

  for i in range(iterations):
      gradient = data.map(
          lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
      ).reduce(lambda x, y: x + y)
      w -= gradient

  print "Final w: %s" % w
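The same update rule can be run without Spark or NumPy. This plain-Python sketch (the toy two-feature data set and all names are invented here) shows what each iteration of the slide's job computes: the logistic-loss gradient summed over the points, then a step on w.

```python
import math

# Toy labeled points: (x, y) with y in {-1, +1}; separable on the first feature.
points = [((2.0, 1.0), 1), ((1.5, -0.5), 1), ((-1.0, 0.5), -1), ((-2.0, -1.0), -1)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

w = [0.0, 0.0]
for _ in range(100):
    # Per-point gradient: (sigmoid(y * w.x) - 1) * y * x, summed over all points --
    # the same expression as the data.map(...).reduce(...) step above.
    grad = [0.0, 0.0]
    for x, y in points:
        scale = (1 / (1 + math.exp(-y * dot(w, x))) - 1) * y
        grad = [g + scale * xi for g, xi in zip(grad, x)]
    w = [wi - 0.1 * gi for wi, gi in zip(w, grad)]

# After training, every point should fall on the correct side of the boundary.
assert all((1 if dot(w, x) > 0 else -1) == y for x, y in points)
```

In Spark the per-point gradient runs distributed over the cached RDD, which is why keeping `data` in memory across iterations matters so much.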

Page 11: Track A-2 基於 Spark 的數據分析

Memory Management Leads to Greater Performance

• A 100-node Hadoop cluster contains 10+ TB of RAM today, and that will double next year
• 1 GB RAM ≈ $10–$20
• Trends:
 – ½ the price every 18 months
 – 2× the bandwidth every 3 years

[Typical node: 64–128 GB RAM, 16 cores, 50 GB per second memory bandwidth]

Memory can be the enabler for high-performance big data applications.

Page 12: Track A-2 基於 Spark 的數據分析

Fast: Using RAM, Operator Graphs

In-memory caching:
• Data partitions are read from RAM instead of disk

Operator graphs:
• Scheduling optimizations
• Fault tolerance

[Figure: an operator DAG of map, join, filter, groupBy, and take stages over RDDs A–F, with shaded boxes marking cached partitions]

Page 13: Track A-2 基於 Spark 的數據分析

Expressiveness of Programming Model

[Figure: one framework covers the single Map → Reduce pass of classic MapReduce, efficient group-by aggregations and other analytics, pipelined MapReduce jobs, and iterative jobs (machine learning)]

Page 14: Track A-2 基於 Spark 的數據分析

Logistic Regression Performance (Data Fits in Memory)

[Chart: running time (s) vs. number of iterations (1–30). Hadoop: 110 s per iteration. Spark: 80 s for the first iteration, 1 s for further iterations]

Page 15: Track A-2 基於 Spark 的數據分析

Agenda

• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera's Position on Spark
• Etu and Cloudera

Page 16: Track A-2 基於 Spark 的數據分析

Spark Engineering in Cloudera

• Cloudera embraced Spark in early 2014
• Engineering with Intel to broaden the Spark ecosystem:
 – Hive-on-Spark
 – Pig-on-Spark
 – Spark-over-YARN
 – Spark Streaming reliability
 – General Spark optimization

Page 17: Track A-2 基於 Spark 的數據分析

Hive on Spark

• Technology
 – Hive: the "standard" SQL tool in Hadoop
 – Spark: next-generation distributed processing framework
 – Hive + Spark: performance with a minimum feature gap
• Industry
 – Many customers have invested heavily in Hive
 – They want to leverage the Spark engine

Page 18: Track A-2 基於 Spark 的數據分析

Design Principles

• No or limited impact on Hive's existing code path
• Maximize code reuse
• Minimum feature customization
• Low future maintenance cost

Page 19: Track A-2 基於 Spark 的數據分析

Class Hierarchy

[Diagram: TaskCompiler is subclassed by MapRedCompiler, TezCompiler, and SparkCompiler. Each compiler generates a Task (MapRedTask, TezTask, SparkTask), which is described by a Work (MapRedWork, TezWork, SparkWork)]

Page 20: Track A-2 基於 Spark 的數據分析

Work – Metadata for Task

• MapReduceWork contains one MapWork and possibly one ReduceWork
• SparkWork contains a graph of MapWorks and ReduceWorks

[Figure: for the query "select name, sum(value) as v from dec group by name order by v;", MapReduce needs two chained jobs (MR Job 1: MapWork1 → ReduceWork1; MR Job 2: MapWork2 → ReduceWork2), while a single Spark job runs MapWork1 → ReduceWork1 → ReduceWork2]

Page 21: Track A-2 基於 Spark 的數據分析

Data Processing via Spark

• Treat a table as a HadoopRDD (input RDD)
• Apply the function that wraps MR's map-side processing
• Shuffle the map output using Spark's transformations (groupByKey, sortByKey, etc.)
• Apply the function that wraps MR's reduce-side processing

Page 22: Track A-2 基於 Spark 的數據分析

Spark Plan

• MapInput – encapsulates a table
• MapTran – map-side processing
• ShuffleTran – shuffling
• ReduceTran – reduce-side processing

Query: select name, sum(value) as v from dec group by name order by v;
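A rough plain-Python sketch of how such a plan evaluates the example query. The operator names mirror the slide, but the implementations below are invented stand-ins, not Hive code: MapInput yields rows, MapTran projects the group key and value, ShuffleTran groups by key, ReduceTran sums, and a final sort orders by the aggregate.

```python
from collections import defaultdict

# Stand-ins for the plan operators on the slide (invented implementations).
def map_input(table):            # MapInput: encapsulate a table as input records
    return iter(table)

def map_tran(rows):              # MapTran: map-side processing -> (key, value) pairs
    return ((r["name"], r["value"]) for r in rows)

def shuffle_tran(pairs):         # ShuffleTran: group values by key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups.items()

def reduce_tran(groups):         # ReduceTran: reduce-side aggregation (sum per name)
    return [(name, sum(vals)) for name, vals in groups]

dec = [{"name": "a", "value": 3}, {"name": "b", "value": 1}, {"name": "a", "value": 2}]
# select name, sum(value) as v from dec group by name order by v
result = sorted(reduce_tran(shuffle_tran(map_tran(map_input(dec)))), key=lambda kv: kv[1])
# result == [("b", 1), ("a", 5)]
```

In the real plan the "order by v" clause is what forces the second shuffle stage shown on the previous slide.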

Page 23: Track A-2 基於 Spark 的數據分析

Current Status

• All Hive functionality is implemented
• First round of optimization is complete
 – Map join, SMB join
 – Split generation and grouping
 – CBO, vectorization
• More optimization and benchmarking coming
• Beta in CDH
 – http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/
 – http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf

Page 24: Track A-2 基於 Spark 的數據分析

Agenda

• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera's Position on Spark

Page 25: Track A-2 基於 Spark 的數據分析

Sample Use Cases

• Conviva
 – Use case: optimize end users' online video experience through real-time analysis of traffic patterns and finer-grained traffic control
 – Spark's value: rapid prototyping; shared business logic between offline and online computation; open-source machine learning algorithms

• Yahoo!
 – Use case: accelerate the model-training cycle for ad serving (3× faster feature extraction); collaborative filtering for content recommendation
 – Spark's value: lower data-pipeline latency; iterative machine learning; efficient P2P broadcast

• Anonymous (large tech company)
 – Use case: near-real-time log aggregation and analysis for monitoring and alerting
 – Spark's value: low-latency, high-frequency "mini" batch jobs over the latest data

• Technicolor
 – Use case: real-time analytics for (telecom) customers; stream processing and real-time query capability
 – Spark's value: simple deployment, requiring only Spark and Spark Streaming; ad hoc queries on live data

Page 26: Track A-2 基於 Spark 的數據分析

Cloudera Use Cases in Verticals

• Large tech company – Spark is used for new machine learning investigations for search personalization
• Financial services – Process millions of stock positions and future scenarios in 4 hours with Spark (compared with 1 week using MapReduce)
• University – Genomics research using Spark pipelines
• Video – Spark and Spark Streaming for video streaming and analysis
• Hospital – Spark for predictive modeling of disease conditions

Page 27: Track A-2 基於 Spark 的數據分析

Cloudera Use Cases with Different Components

• ETL on Spark using Pig
 – To achieve very tight SLAs (Accenture Smart Water application)
• Spark analytics over HBase
 – Patients' physiological data, experiment and user data, serving researchers
• Traffic analysis using MLlib clustering at Dylan
• Annotated variant analysis on Spark
 – Using the Spark/Java framework at Duke
• Sepsis detection with Spark Streaming

Page 28: Track A-2 基於 Spark 的數據分析

Near Real-Time Dashboard by Edmunds.com

• A car shopping website where people from all across the nation come to read reviews, compare prices, and in general get help in all matters car related.
• The goal was to build a near real-time dashboard, providing both unique visitor and page view counts per make and make/model, that could be engineered in a couple of weeks.
• In the past, these updates were restricted to hourly granularity, with an additional hour of delay.
• Furthermore, as this data was not available in an easy-to-use dashboard, manual processing was needed to visualize it.

Page 29: Track A-2 基於 Spark 的數據分析


Prototype Architecture

Page 30: Track A-2 基於 Spark 的數據分析


Page View Per Minute

Page 31: Track A-2 基於 Spark 的數據分析


Unique Visitor Per Minute

Page 32: Track A-2 基於 Spark 的數據分析

Total UV by Make/Model

Page 33: Track A-2 基於 Spark 的數據分析

Case Study: Etu Insight

• Problem domain:
 – Analyze user behavior from website interaction logs
 – Analyze user behavior from existing offline data
 – Aggregate the data, grouping by time and user
• Approach:
 – ETL process from the web logs into Hive structured data
 – Import existing database data
 – Define and implement the aggregation functions in Spark (with Scala)
 – Output the calculation results to HBase
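The aggregation step above is implemented in Spark with Scala; as a language-neutral illustration, this plain-Python sketch (event layout and field names invented here) groups log events by (hour, user) and counts them, which is the shape of the job described.

```python
from collections import Counter
from datetime import datetime

# Hypothetical parsed web-log events: (timestamp, user_id, url).
events = [
    ("2015-06-16 10:05:00", "u1", "/a"),
    ("2015-06-16 10:40:00", "u1", "/b"),
    ("2015-06-16 11:02:00", "u1", "/a"),
    ("2015-06-16 10:15:00", "u2", "/a"),
]

def hour_bucket(ts):
    # Truncate the timestamp to the hour -- the grouping granularity.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:00")

# Group by (hour, user) and count events, like a reduceByKey over the log RDD.
agg = Counter((hour_bucket(ts), user) for ts, user, _ in events)
# agg[("2015-06-16 10:00", "u1")] == 2
```

In the production flow, the equivalent keyed counts are computed over the Hive-backed RDD and written to HBase for serving.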

Page 34: Track A-2 基於 Spark 的數據分析

Architecture & Flow

[Diagram: web logs and user data are loaded into Hive (structured data), processed by Spark, and the results are written to HBase]

Page 35: Track A-2 基於 Spark 的數據分析


Etu Insight Dashboard

Page 36: Track A-2 基於 Spark 的數據分析

Advanced Analytics with Spark

• Written by Cloudera's data science team
 – The first book bridging ML with the Hadoop ecosystem
 – Focuses on use cases and examples rather than being a manual
 – Targeted at data scientists solving real-world analysis problems
 – Generally available in May 2015

Page 37: Track A-2 基於 Spark 的數據分析

Analyzing Big Data

• Building a model to detect credit card fraud using thousands of features and billions of transactions
• Intelligently recommending millions of products to millions of users
• Estimating financial risk through simulations of portfolios including millions of instruments
• Easily manipulating data from thousands of human genomes to detect genetic associations with disease

Page 38: Track A-2 基於 Spark 的數據分析

Agenda

• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera's Position on Spark
• Etu and Cloudera

Page 39: Track A-2 基於 Spark 的數據分析

Cloudera's Investment in Spark

Spark is a fully integrated and supported part of Cloudera's enterprise data hub.

• First vendor to ship and support Spark
 – Invested early to make it a cohesive part of the platform
 – Complemented by Intel's early investment
 – Developed and supported in collaboration with Databricks to ensure success
• Only vendor with Spark committers on staff
• Several Spark use cases in production
• Well-trained support staff and external training courses

Page 40: Track A-2 基於 Spark 的數據分析

Hadoop in the Spark World

[Diagram: the Hadoop stack with HDFS and HBase at the base and YARN above them, running MapReduce2, Hive, Pig, Impala, Search, and Spark with Spark Streaming, GraphX, MLlib, and SparkSQL; a legend distinguishes core Hadoop, supported Spark components, and unsupported add-ons]

Page 41: Track A-2 基於 Spark 的數據分析

Focusing on Open Standards, not just Open Source

Open standards are just as important as open source.

Why does it matter?
• Diverse engineering is more sustainable.
• Broad support ensures vendor portability.
• Project utility depends on ecosystem compatibility, which depends on standards.

Cloudera leads in defining the de facto open standards adopted by the market.

Vendor support by component:

Component (Founder)          Cloudera  Pivotal  MapR  Amazon  IBM  Hortonworks
Spark (UC Berkeley)             ✔        ✔       ✔      ✔      ✔       ✔
Impala (Cloudera)               ✔        ✖       ✔      ✔      ✖       ✖
Hue (Cloudera)                  ✔        ✖       ✔      ✔      ✖       ✔
Sentry (Cloudera)               ✔        ✔       ✔      ✖      ✔       ✖
Flume (Cloudera)                ✔        ✔       ✔      ✖      ✔       ✔
Parquet (Cloudera/Twitter)      ✔        ✔       ✔      ✔      ✔       ✖
Sqoop (Cloudera)                ✔        ✔       ✔      ✔      ✔       ✔
Falcon                          ✖        ✖       ✖      ✖      ✖       ✔
Knox                            ✖        ✖       ✖      ✖      ✖       ✔
Tez                             ✖        ✖       ✔      ✖      ✖       ✔
Ranger                          ✖        ✖       ✖      ✖      ✖       ✔
ORCFile                         ✖        ✖       ✖      ✖      ✖       ✔

Page 42: Track A-2 基於 Spark 的數據分析

Cloudera's Position on Spark

Cloudera is a member of, and aligned with, the broader Spark community.

Spark:
• Will replace MapReduce as the general-purpose Hadoop framework
 – Broad community and vendor adoption
 – Hadoop ecosystem integration (native & 3rd party)
• Goes beyond data science/machine learning
 – Cloudera is working on Spark Core, Streaming, Security, YARN, and MLlib
• Does not replace special-purpose frameworks
 – One size does not fit all for SQL, Search, Graph, and Stream

Page 43: Track A-2 基於 Spark 的數據分析

Agenda

• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera's Position on Spark
• Etu and Cloudera

Page 44: Track A-2 基於 Spark 的數據分析

Cloudera Partners with Etu

Page 45: Track A-2 基於 Spark 的數據分析

Etu's Positioning and Value for Enterprise Hadoop (Etu 在 Hadoop 企業化的定位與價值)

[Diagram: enterprises keep their resources on the application layer and their core value — talent recruiting, team building, application development, data mining, architecture design, capturing the market — while Etu takes over the platform layer: deployment, tuning, operations, and management]

Standardization and automation reduce the complexity of deploying and operating a Hadoop platform:
• Less effort: on-site installation and tuning, project-based technical services
• Less time: consulting and training to help teams get up to speed quickly
• Peace of mind: local technical support lowers adoption risk
• Know-how: years of experience shared to help teams master the platform

Offerings: Etu Manager, plus Etu Services – Etu Professional Service, Etu Consulting, Etu Training, and Etu Support.

Page 46: Track A-2 基於 Spark 的數據分析

Etu Big Data Software Platform and Services

[Diagram: Etu Manager (with Cloudera Manager inside) at the core, surrounded by Etu Services – Etu Support, Etu Professional Service, Etu Consulting, and Etu Training – backed by Cloudera Support]

Page 47: Track A-2 基於 Spark 的數據分析

Etu Manager Makes Hadoop Easier

A fully automated, high-performance, easy-to-manage big data processing platform:
• Bare-metal automated deployment
• Full-cluster management
• Performance optimization
• Runs on mainstream x86 commodity servers
• The only local Hadoop professional services

Page 48: Track A-2 基於 Spark 的數據分析

Etu Services

Etu Technical Support, 8×5 (priced per year):
• Etu Manager feature-module updates
• Technical consulting for HDFS / MapReduce / HBase / Pig / Hive / Impala / Spark (by email)
• Upgrade and update packages aligned with CDH
• Customer issue management

Etu Professional Services (priced per person-day):
• Hadoop cluster planning and design
• Hadoop software architecture and data-model design
• Hadoop system installation and setup (on-site)
• Hadoop data processing and application software development
• Hadoop cluster maintenance, inspection, and tuning (on-site)
• Hadoop data migration services

Etu Technical Consulting (priced per person-day):
• Cluster planning and network architecture design / consulting
• Application architecture design / consulting

Etu Training (priced per attendee):
• Standard courses: the Hadoop learning roadmap – comprehensive hands-on big data training for different roles
• Dedicated corporate classes

Page 49: Track A-2 基於 Spark 的數據分析

Learn More

• Booth 4: Etu Data Lake
• Booth 5: Cloudera

Page 50: Track A-2 基於 Spark 的數據分析

Appendix: Concepts

Page 51: Track A-2 基於 Spark 的數據分析

Spark Concepts – Overview

• Driver & Workers
• RDD – Resilient Distributed Dataset
• Transformations
• Actions
• Caching

Page 52: Track A-2 基於 Spark 的數據分析

Drivers and Workers

[Diagram: a Driver sends tasks to multiple Workers; each Worker holds data partitions in RAM and returns results to the Driver]

Page 53: Track A-2 基於 Spark 的數據分析

RDD – Resilient Distributed Dataset

• A read-only, partitioned collection of records
• Created through:
 – Transformation of data in storage
 – Transformation of other RDDs
• Contains the lineage needed to compute it from storage
• Lazy materialization
• Users control persistence and partitioning
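These properties can be illustrated with a miniature plain-Python class (a sketch, not Spark's implementation): each "RDD" stores only its lineage — a source plus an ordered chain of transformations — and nothing is computed until it is materialized.

```python
class MiniRDD:
    """Toy RDD: records lineage, computes lazily (illustration only)."""
    def __init__(self, source, lineage=()):
        self.source = source          # where the data comes from
        self.lineage = lineage        # ordered transformations to reapply

    def map(self, f):
        # Transformations are cheap: they only extend the lineage.
        return MiniRDD(self.source, self.lineage + (("map", f),))

    def filter(self, f):
        return MiniRDD(self.source, self.lineage + (("filter", f),))

    def collect(self):
        # Materialize: replay the whole lineage from the source.
        data = list(self.source())
        for op, f in self.lineage:
            data = [f(x) for x in data] if op == "map" else [x for x in data if f(x)]
        return data

rdd = MiniRDD(lambda: range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has run yet; collect() replays the lineage from the source.
# rdd.collect() == [0, 4, 16, 36, 64]
```

Because the lineage is all that is stored, a lost partition can always be rebuilt by replaying it — which is exactly the fault-tolerance story on the following slides.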

Page 54: Track A-2 基於 Spark 的數據分析

Operations

Transformations:
• map
• filter
• sample
• join

Actions:
• reduce
• count
• first, take
• saveAs

Page 55: Track A-2 基於 Spark 的數據分析

Operations

• Transformations create a new RDD from an existing one
• Actions run a computation on an RDD and return a value
• Transformations are lazy
• Actions materialize RDDs by computing their transformations
• RDDs can be cached to avoid re-computing them
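Python's own lazy iterators give a feel for the transformation/action split described above (a loose analogy, not Spark itself): building the pipeline runs nothing; only consuming it — the "action" — triggers the computation.

```python
calls = []

def traced_double(x):
    calls.append(x)   # record every time the "transformation" actually runs
    return 2 * x

pipeline = map(traced_double, [1, 2, 3])    # transformation: nothing runs yet
assert calls == []                          # still lazy

result = list(pipeline)                     # action: materializes the pipeline
assert result == [2, 4, 6]
assert calls == [1, 2, 3]                   # the work happened only now
```

Spark's laziness has the same shape but a bigger payoff: the scheduler sees the whole transformation graph before running anything, so it can pipeline stages and place tasks near their data.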

Page 56: Track A-2 基於 Spark 的數據分析

Fault-Tolerance

• RDDs contain lineage
• Lineage: the source location and the list of transformations
• Lost partitions can be re-computed from the source data

  msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                 .map(lambda s: s.split("\t")[2])

[Figure: HDFS file → filter(func = startswith(...)) → filtered RDD → map(func = split(...)) → mapped RDD]

Page 57: Track A-2 基於 Spark 的數據分析

Caching

• persist() and cache() mark data for caching
• An RDD is cached after the first action
• Fault-tolerant: lost partitions will be re-computed
• If there is not enough memory, some partitions will not be cached
• Future actions are performed on the cached partitions, so they are much faster

Use caching for iterative algorithms.
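The effect of cache() can be mimicked in plain Python (a sketch only): without caching, every "action" replays the computation; caching materializes the result once, and later actions reuse it.

```python
compute_count = 0

def expensive(x):
    global compute_count
    compute_count += 1    # count how often the computation actually runs
    return x * x

source = [1, 2, 3]

# Two actions without caching: the computation runs twice per element.
total = sum(expensive(x) for x in source)      # "action" 1
biggest = max(expensive(x) for x in source)    # "action" 2
assert compute_count == 6

# "cache()": materialize once, then both actions reuse the stored result.
compute_count = 0
cached = [expensive(x) for x in source]        # first action materializes the cache
assert (sum(cached), max(cached)) == (14, 9)
assert compute_count == 3                      # no re-computation for the second action
```

This is why caching matters most for iterative algorithms: each iteration is another action over the same input.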

Page 58: Track A-2 基於 Spark 的數據分析

Caching – Storage Levels

• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, …

Page 59: Track A-2 基於 Spark 的數據分析

Easy: Expressive API

Transformations and actions include: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, …