OLAP options on Hadoop

OLAP options on Hadoop. Yuta Imai, Hortonworks. Jul 15, 2016

Transcript of OLAP options on Hadoop

Page 1: OLAP options on Hadoop

OLAP options on Hadoop
Yuta Imai, Hortonworks
Jul 15, 2016

Page 2: OLAP options on Hadoop

© Hortonworks Inc. 2011–2016. All Rights Reserved

Hortonworks company overview

IPO 4Q14 (NASDAQ: HDP)

Page 3: OLAP options on Hadoop

Hortonworks Data Platform

[Figure: Hortonworks Data Platform 2.4 component stack]

• Governance: Apache Falcon, Apache Atlas, Apache Sqoop, Apache Flume, Apache Kafka
• Batch, interactive & real-time data access: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, ISV engines
• YARN: Data Operating System (cluster resource management)
• HDFS (Hadoop Distributed File System)
• Operations: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak
• Security: Apache Ranger, Apache Knox, Apache Atlas, HDFS Encryption
• Deployment choice: Linux, Windows, on-premises, cloud

Page 4: OLAP options on Hadoop

Expanding the possibilities of Hadoop with Hortonworks DataFlow

• Hortonworks DataFlow, powered by Apache NiFi: dynamic insights where freshness matters, fed by the Internet of Anything
• Hortonworks Data Platform, powered by Apache Hadoop: static insights from historical data; store data and metadata; enrich context

Together, Hortonworks DataFlow and Hortonworks Data Platform provide an end-to-end solution for a big data platform.

Connected Data Platform

Page 5: OLAP options on Hadoop

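Contributions to the Apache Community

More than 1,500 ecosystem partners (Hortonworks technology partners)

A team of experts: made up of core members who are deeply involved in development

• Many committers are Hortonworks employees. About one third of the committers on the Apache Hadoop project work at Hortonworks, and Hortonworks employees are involved in many important projects, including the majority of Apache NiFi.

• The committers keep improving and innovating on the connected data platform.

• They take part in the Hadoop roadmap and are in a position to voice important requirements to the community.

Hortonworks is very deeply involved in the Apache Community.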

Page 6: OLAP options on Hadoop

HDP Enterprise and Enterprise Plus subscriptions

Component               Enterprise   Enterprise Plus
Apache Hadoop & YARN    ✔            ✔
Apache Tez              ✔            ✔
Apache Hive             ✔            ✔
Apache Pig              ✔            ✔
Apache Sqoop            ✔            ✔
Apache Flume            ✔            ✔
Apache Mahout           ✔            ✔
Apache Ambari           ✔            ✔
Apache Oozie            ✔            ✔
Apache Falcon           ✔            ✔
Apache Knox             ✔            ✔
Apache HBase            ✔            ✔
Apache Accumulo                      ✔
Apache Storm                         ✔
HDP Advanced Security                ✔
Apache Solr             Separate     Separate

Component               Enterprise   Enterprise Plus
Apache Hadoop & YARN    ✔            ✔
Apache Ambari           ✔            ✔
Apache Falcon           ✔            ✔
Apache Flume            ✔            ✔
Apache HBase            ✔            ✔
Apache Hive             ✔            ✔
Apache Knox             ✔            ✔
Apache Mahout           ✔            ✔
Apache Oozie            ✔            ✔
Apache Phoenix          ✔            ✔
Apache Pig              ✔            ✔
Apache Sqoop            ✔            ✔
Apache Tez              ✔            ✔
Apache ZooKeeper        ✔            ✔
Apache Accumulo                      ✔
Apache Kafka                         ✔
Apache Ranger                        ✔
Apache Spark                         ✔
Apache Storm                         ✔
Apache Solr             Separate     Separate

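What the HDP subscription includes

• 24x7 global support, 365 days a year
• Support via web and phone (Japanese-language desk available)
• Ability to request bug fixes and enhancements
• Access to upgrades, updates, and patches
• Multi-year Z-Stream maintenance for older HDP release versions
• Access to the customer support portal and knowledge base
• Right to use Hortonworks web-based self-learning
• Remote troubleshooting and analysis assistance for:
  – Configuration questions and cluster management
  – Performance problems
  – Data load, processing, and query problems
• Remote advice on application development questions
• Checkpoint review every six months (quarterly for Enterprise Plus)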

Page 7: OLAP options on Hadoop

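The value of the HDP subscription service

• Inquiries about application development
• Consultation on analytics and additional use cases, for example:
  – "We want to run this kind of analysis; what data should we collect?"
  – "Which components do we need to achieve what we want to do?"
• SmartSense, a machine-learning-based service that proactively recommends system health checks

Hortonworks employs many Hadoop development engineers, so it can deliver this service with confidence. In-house development is also supported. Even user companies that employ their own Hadoop engineers, developers, and committers make effective use of subscription support.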

Page 8: OLAP options on Hadoop

Agenda: OLAP options on Hadoop

• Overview
• Apache Kylin
• Apache Druid
• Solution Architecture
• Wrap-up

Page 9: OLAP options on Hadoop

SQL evolution on Hadoop

[Figure: SQL engines on Hadoop plotted by capability along Speed and Feature axes. Capability bands: Batch SQL, Interactive SQL, Sub-Second SQL, OLAP / Cube, ACID / MERGE. Engines shown: Hive 0.x (MapReduce); Hive 1.2- (Tez, Vectorize, ORC, CBO); Hive 2.0 (LLAP); Presto, Impala, Drill, Spark SQL, HAWQ (MPP); Kylin, Druid; commercial: Kyvos Insights, AtScale; Hive (WIP).]

Page 10: OLAP options on Hadoop

SQL evolution on Hadoop

[Figure: the same SQL evolution chart as on page 9.]

Page 11: OLAP options on Hadoop

Agenda: OLAP options on Hadoop

• Overview
• Apache Kylin
• Apache Druid
• Solution Architecture
• Wrap-up

Page 12: OLAP options on Hadoop

Apache Kylin

• An OLAP engine developed at eBay.
• Open-sourced in October 2014.
• Promoted to an Apache Top Level Project in 2015.
• How to pronounce it:
  – kirin
  – kairin
  – chirin

Page 13: OLAP options on Hadoop

Apache Kylin – Motivation

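• eBay originally used Hive for OLAP-style workloads but was not satisfied with its speed.

• Typical OLAP queries were expected to return within a few seconds to around 10 seconds.

• At the time, no open source software met that requirement.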

Page 14: OLAP options on Hadoop

Apache Kylin – Architecture

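[Figure: Kylin architecture. Third-party apps connect through the REST API and BI tools through JDBC/ODBC. Inside Kylin: REST API, Query Engine, Router, Cube Builder, and Metadata. The Cube Builder reads from Hive, and the resulting cubes are stored in HBase.]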

Page 15: OLAP options on Hadoop

Apache Kylin – Interface

Queries
• ANSI SQL
• No direct cube exposure / just through Hive metastore
• No MDX
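As a rough illustration of the SQL-over-REST interface, the sketch below builds (but does not send) an HTTP request for Kylin's query endpoint, `/kylin/api/query`. The host, credentials, project name, and table name are placeholders chosen for the example, not values from the talk.

```python
import base64
import json
import urllib.request


def build_kylin_query(base_url, user, password, project, sql, limit=100):
    """Build (but do not send) a POST request for Kylin's SQL query endpoint."""
    payload = {"sql": sql, "project": project, "limit": limit, "offset": 0}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# All values below are hypothetical placeholders.
request = build_kylin_query(
    base_url="http://kylin.example.com:7070",
    user="analyst",
    password="secret",
    project="demo_project",
    sql="SELECT part_dt, SUM(price) FROM sales GROUP BY part_dt",
)
print(request.get_method(), request.full_url)
# To run it against a live server: urllib.request.urlopen(request)
```

The client only ever sends plain SQL; which cuboid answers the query is decided on the server side.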

Page 16: OLAP options on Hadoop

Apache Kylin – Cube Designer

• Apache Kylin does not provide a build scheduler.

Page 17: OLAP options on Hadoop

Apache Kylin – Partial Cube

• Balance between space and time

A,B,C,D

A,B,C A,B,D A,C,D B,C,D

A,B B,C B,D A,C C,D A,D

A B C D
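The lattice above can be enumerated in a few lines. The sketch below lists every non-empty cuboid for dimensions A to D and then applies one possible pruning rule (keep only cuboids that contain a mandatory dimension) to show how a partial cube trades query coverage for storage. The pruning rule is an illustrative assumption, not a description of Kylin's exact configuration options.

```python
from itertools import combinations


def all_cuboids(dimensions):
    """Return every non-empty combination of dimensions, largest first."""
    return [
        combo
        for size in range(len(dimensions), 0, -1)
        for combo in combinations(dimensions, size)
    ]


def partial_cuboids(dimensions, mandatory):
    """Keep only cuboids containing every mandatory dimension (illustrative rule)."""
    required = set(mandatory)
    return [c for c in all_cuboids(dimensions) if required <= set(c)]


dims = ["A", "B", "C", "D"]
full = all_cuboids(dims)
pruned = partial_cuboids(dims, mandatory=["A"])
print(len(full), len(pruned))  # 15 8
```

With four dimensions the full lattice has 15 cuboids; requiring A in every cuboid cuts that to 8, at the cost of having no precomputed answer for queries that do not involve A.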

Page 18: OLAP options on Hadoop

Apache Kylin – Cube vs. HBase schema

[Figure: mapping from a cube to the HBase schema]

• Pre-joined / aggregated table
  – Dimensions: D1, D2, D3, D4
  – Measures: M1, M2, M3, M4
• HBase row key: Cuboid ID, D1, D2, D3, D4
• HBase row value: M1, M2, M3, M4
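A minimal sketch of that layout, assuming the cuboid ID is a bitmask with one bit per dimension and that values are stored as plain bytes. Both choices are simplifications for illustration; the helper names are hypothetical and the real encoding in Kylin may differ.

```python
import struct

DIMENSIONS = ["D1", "D2", "D3", "D4"]


def cuboid_id(present):
    """Bitmask with one bit per dimension; D1 is the highest bit."""
    mask = 0
    for dim in DIMENSIONS:
        mask = (mask << 1) | (1 if dim in present else 0)
    return mask


def row_key(row):
    """Row key = 8-byte cuboid ID followed by the values of the dimensions present."""
    present = [d for d in DIMENSIONS if d in row]
    values = "\x00".join(str(row[d]) for d in present)
    return struct.pack(">Q", cuboid_id(present)) + values.encode("utf-8")


def row_value(measures):
    """Row value = the aggregated measures, packed here as 8-byte doubles."""
    return struct.pack(f">{len(measures)}d", *measures)


key = row_key({"D1": "JP", "D3": "mobile"})
value = row_value([10.0, 2.5, 7.0, 1.0])
print(cuboid_id(["D1", "D3"]), len(key), len(value))
```

Because the cuboid ID leads the key, all rows of one cuboid sit next to each other, so a query that maps to a single cuboid becomes a contiguous range scan.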

Page 19: OLAP options on Hadoop

Apache Kylin – Incremental cube build

• Incremental build

[Figure: cube segments at decreasing granularity: Y-2011-2012, M-2013-1-8, D-2013-09-1-20, D-2013-09-21]

• Minutes micro cubes
• Kafka source
• In-memory cubing
• Auto merge

Page 20: OLAP options on Hadoop

Apache Kylin – Streaming cubing(WIP)

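[Figure: streaming cubing. A stream feeds inverted indexes that cover the last hour, while cubes hold the data from before the last hour. A Hybrid Storage Interface sits between both stores and the Query Engine, which serves ANSI SQL.]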

Page 21: OLAP options on Hadoop

Apache Kylin – Who is using?

• eBay
  – 90% of queries < 5 seconds
    • User Session Analysis: 26TB, 28+ billion rows
    • Traffic Analysis: 21TB, 20+ billion rows
    • Behavior Analysis: 560GB, 1.2+ billion rows

• Baidu
  – Baidu Map internal analysis

• Many other proofs of concept
  – Huawei, Bloomberg, Law, British Gas, JD.com, Microsoft, StubHub, Tableau…

Page 22: OLAP options on Hadoop

Apache Kylin – Support

• Kyligence

Page 23: OLAP options on Hadoop

Agenda: OLAP options on Hadoop

• Overview
• Apache Kylin
• Apache Druid
• Solution Architecture
• Wrap-up

Page 24: OLAP options on Hadoop

Apache Druid

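• An OLAP engine developed at Metamarkets.
• Open-sourced in October 2012, under the GPL license at that point.
• Moved to Apache License 2.0 in 2015.
• More than 150 contributors.
• The name comes from the RPG character class.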

Page 25: OLAP options on Hadoop

Apache Druid - Concept

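• Column-oriented distributed data store
• Sub-second query response
• Realtime streaming ingestion
• Flexible slicing and dicing
• Automatic data summarization
• Approximation algorithms are also used (HyperLogLog, theta)
• Petabyte scale
• High availability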
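The automatic summarization can be pictured as a rollup at ingestion time: timestamps are truncated to a coarser granularity and rows that share the same truncated time and dimension values are collapsed into one. The sketch below is a plain-Python illustration of that idea with made-up event fields; it is not Druid's ingestion code.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions, metric):
    """Truncate timestamps to the hour and sum the metric per (hour, dimensions) group."""
    totals = defaultdict(lambda: {"count": 0, metric: 0})
    for event in events:
        hour = datetime.fromisoformat(event["timestamp"]).replace(
            minute=0, second=0, microsecond=0
        )
        key = (hour.isoformat(),) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key][metric] += event[metric]
    return dict(totals)


# Hypothetical click events.
events = [
    {"timestamp": "2016-07-15T10:05:00", "publisher": "a.com", "country": "JP", "clicks": 3},
    {"timestamp": "2016-07-15T10:40:00", "publisher": "a.com", "country": "JP", "clicks": 2},
    {"timestamp": "2016-07-15T11:10:00", "publisher": "a.com", "country": "JP", "clicks": 1},
]
rolled = rollup(events, ["publisher", "country"], "clicks")
for key, row in rolled.items():
    print(key, row)
```

Three raw events become two stored rows here; on real data with many events per hour the reduction is what keeps storage and scan cost down.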

Page 26: OLAP options on Hadoop

Apache Druid – Motivation

• Realtime dashboard for an ad delivery service
• Wanted to ingest large volumes of transaction data at high speed and make it explorable
• Append heavy
• Low latency
• Multi-tenant
• Highly available
• Real-time alerting, actionable

Page 27: OLAP options on Hadoop

Apache Druid – Solutions Evaluated

• RDBMS
  – Star schema with aggregate tables
  – Slow performance at large scale (up to 20 sec page load times)
  – Query caching helped, arbitrary queries still slow

• Key/value stores (HBase, Cassandra, BigTable)
  – Pre-aggregate all dimensional combinations
  – Fast queries were achieved
  – Precomputation scales exponentially
  – Takes time to precompute (9 hrs with 14 dimensions)
  – Not cost effective
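The exponential growth is easy to make concrete: pre-aggregating every combination of n dimensions means one aggregate table per subset of dimensions, that is 2^n of them. The short sketch below only counts those subsets; it makes no claim about how long each takes to build.

```python
def dimension_combinations(n_dimensions):
    """Number of dimension subsets (including the empty, grand-total one)."""
    return 2 ** n_dimensions


for n in (4, 10, 14, 20):
    print(n, dimension_combinations(n))
```

At 14 dimensions there are already 16,384 combinations to precompute, and each added dimension doubles the count.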

Page 28: OLAP options on Hadoop

Apache Druid – Architecture in early days

[Figure: early architecture. Batch data is ingested into three Historical Nodes, with a Broker Node in front of them.]

Page 29: OLAP options on Hadoop

Apache Druid – Historical Node

[Figure: segments are pulled from Deep Storage onto the node's disk and mapped into memory.]

• Shared nothing
• Main workhorses of a Druid cluster
• Load immutable, read-optimized segments
• Respond to queries
• Use memory-mapped files to load segments
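Memory-mapping lets a process read parts of a file on demand and leaves caching to the operating system, which suits immutable segments. The sketch below shows the technique with Python's `mmap` module on a stand-in file of 64-bit integers; the file layout is invented for the example and is not Druid's segment format.

```python
import mmap
import os
import struct
import tempfile

# Write a stand-in "segment": a column of 1,000 little-endian 64-bit integers.
path = os.path.join(tempfile.mkdtemp(), "segment.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000q", *range(1000)))


def read_value(segment_path, index):
    """Read one 8-byte value through a read-only memory map, without loading the whole file."""
    with open(segment_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return struct.unpack_from("<q", mapped, index * 8)[0]


value = read_value(path, 742)
print(value)  # 742
```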

Page 30: OLAP options on Hadoop

Apache Druid – Current architecture

[Figure: current architecture. Kafka streams are ingested by two Realtime Nodes, batch data is ingested through Deep Storage into three Historical Nodes, and a Broker Node sits in front of both node types.]

Page 31: OLAP options on Hadoop

Apache Druid – Current architecture

Page 32: OLAP options on Hadoop

Apache Druid – Broker Node

• Keeps track of segment announcements in the cluster
• Scatters queries across historical and realtime nodes
• Merges results from different query nodes
• Caching layer
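The scatter and merge steps follow the usual scatter-gather pattern. The sketch below models each node as a list of local rows, fans the query out in parallel, and merges the partial sums; the function names and row fields are hypothetical, and segment tracking and caching are left out.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def query_node(rows, dimension):
    """Stand-in for one historical or realtime node: aggregate only its local rows."""
    partial = Counter()
    for row in rows:
        partial[row[dimension]] += row["clicks"]
    return partial


def broker_query(nodes, dimension):
    """Scatter the query to every node in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda rows: query_node(rows, dimension), nodes))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


# Two hypothetical nodes holding different rows.
nodes = [
    [{"country": "JP", "clicks": 3}, {"country": "US", "clicks": 1}],
    [{"country": "JP", "clicks": 2}],
]
result = broker_query(nodes, "country")
print(result)
```

This works because sums are mergeable: each node can aggregate independently and the broker only combines small partial results.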

Page 33: OLAP options on Hadoop

Apache Druid – Coordinator Node

• Assigns segments to historical nodes
• Interval-based cost function to distribute segments
• Makes sure query load is uniform across historical nodes
• Handles replication of data
• Configurable rules to load/drop data

Page 34: OLAP options on Hadoop

Apache Druid – Interface

Queries
• REST API
• SQL (community effort, no ANSI compliance yet)
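Druid's REST interface takes queries as JSON documents posted to the broker under `/druid/v2/`. The sketch below builds (but does not send) a native `timeseries` query; the broker address, datasource, interval, and metric name are placeholders for the example.

```python
import json
import urllib.request


def build_timeseries_query(broker_url, datasource, interval, metric):
    """Build (but do not send) a POST request carrying a native timeseries query."""
    query = {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "hour",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": f"total_{metric}", "fieldName": metric}
        ],
    }
    return urllib.request.Request(
        url=f"{broker_url}/druid/v2/",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are hypothetical placeholders.
druid_request = build_timeseries_query(
    broker_url="http://druid-broker.example.com:8082",
    datasource="ad_events",
    interval="2016-07-15T00:00:00/2016-07-16T00:00:00",
    metric="clicks",
)
print(json.dumps(json.loads(druid_request.data), indent=2))
# To run it against a live broker: urllib.request.urlopen(druid_request)
```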

Page 35: OLAP options on Hadoop

Apache Druid – Pivot

• Cube design and visualization

Page 36: OLAP options on Hadoop

Apache Druid – Building cube

• Druid is a totally 'time-based' OLAP data store.
• Basically it builds the cube based on 'time x dimensions'.
• But what about unique counts?
  – Theta Sketches (KMV): open-sourced by Yahoo!
  – Predictable approximation error that can be traded off against sketch size
    • k=4096: RSE of +/-3.2% -> 32768 bytes
    • k=16K: RSE of +/-1.6% -> 131072 bytes
  – Mergeable at query time
    • 'merge rate of about 14.5 million sketches per second per processor thread'
  – Intersection can be computed at query time
  – Duplication insensitive
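To show why such a sketch is mergeable and insensitive to duplicates, here is a minimal K-Minimum-Values (KMV) estimator in plain Python. It is a teaching sketch under simplifying assumptions (SHA-1 as the hash, no intersection support) and not the Yahoo! library's implementation. The byte sizes quoted above match 8 bytes per retained hash, and the quoted errors are close to 2/sqrt(k).

```python
import hashlib
import heapq
import math


def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1) using a stable hash."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keep the k smallest distinct hash values seen so far."""

    def __init__(self, k):
        self.k = k
        self.heap = []        # max-heap of retained hashes (stored negated)
        self.members = set()  # same hashes, for duplicate checks

    def add(self, item):
        h = hash01(item)
        if h in self.members:
            return  # duplicates never change the sketch
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -h)
            self.members.add(h)
        elif h < -self.heap[0]:
            evicted = -heapq.heappushpop(self.heap, -h)
            self.members.discard(evicted)
            self.members.add(h)

    def estimate(self):
        if len(self.heap) < self.k:
            return float(len(self.heap))  # fewer than k uniques: exact count
        return (self.k - 1) / -self.heap[0]

    def merge(self, other):
        """Union: keep the k smallest hashes across both sketches."""
        merged = KMVSketch(self.k)
        for h in heapq.nsmallest(self.k, self.members | other.members):
            heapq.heappush(merged.heap, -h)
            merged.members.add(h)
        return merged


def sketch_bytes(k):
    return k * 8  # one 8-byte hash per retained value


def two_sigma_error(k):
    return 2 / math.sqrt(k)


sketch = KMVSketch(k=1024)
for i in range(100_000):
    sketch.add(f"user-{i}")
print(round(sketch.estimate()), sketch_bytes(4096), two_sigma_error(16384))
```

Merging only needs the retained hashes, so per-segment sketches can be combined at query time without touching the raw rows.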

Page 37: OLAP options on Hadoop

Apache Druid – Who is using?

Page 38: OLAP options on Hadoop

Data mart architecture in Yahoo, Inc

[Figure: data mart data flow. Components shown: Event Data, Hour ETL, ETL Data, Daily Rollup, Aggregate tables, Druid, HDFS, and the User Interface. Stages are annotated 1x, 24x, and ?x.]

Page 39: OLAP options on Hadoop

Agenda: OLAP options on Hadoop

• Overview
• Apache Kylin
• Apache Druid
• Solution Architecture
• Wrap-up

Page 40: OLAP options on Hadoop

Solution Architecture for SQL analysis

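[Figure: three columns labeled Data Source, Source of truth, and Application. Components shown: Kafka, HDFS, Hive, and the OLAP engine. Applications reach the data through OLAP Access or Row Level Access.]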

Page 41: OLAP options on Hadoop

Agenda: Scalable Data Warehousing on Hadoop

• Overview
• Apache Kylin
• Apache Druid
• Solution Architecture
• Wrap-up

Page 42: OLAP options on Hadoop

SQL evolution on Hadoop

[Figure: the same SQL evolution chart as on page 9.]