20151014 Spark Study Session Supplementary Material
![Page 1: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/1.jpg)
Spark Study Session Supplementary Material
2015.10.14
CTO room: tanaka.yuichi
![Page 2: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/2.jpg)
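Supplementary material
This material supplements "Introduction to Big Data Collection at DMM: Large-Scale Information Collection and Analysis with Kafka + HBase + SparkStreaming" (https://prezi.com/an2frel_xvyl/dmm/?utm_campaign=share&utm_medium=copy).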
![Page 3: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/3.jpg)
Kafka supplement
Mapping of Kafka and its surrounding technologies in this DMM case
[Figure: Producer side: API (Node.js) x3. Kafka: Broker x4, with Zookeeper x3. Consumer side: SparkStreaming#partition x3.]
![Page 4: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/4.jpg)
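Kafka supplement
Relationship between Kafka Topics and ConsumerGroups
[Figure: Producer -> Kafka (Topic: "tracking") -> Consumer. ConsumerGroups CG: "Streaming1", CG: "Streaming2", and CG: "Flume1" all read Topic: "tracking". Zookeeper sits alongside Kafka.]
• There is a single input
• Multiple Consumers read from one Topic
• How far each ConsumerGroup has read from the Topic is managed per ConsumerGroup in Zookeeper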
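The per-ConsumerGroup read position can be pictured with a small in-memory model. This is only a sketch in plain Scala, not the Kafka client API: `Topic`, `OffsetStore`, and `poll` are hypothetical names, and the `OffsetStore` map merely stands in for the offsets that the slide says Zookeeper manages.

```scala
// Toy model of one topic read by several consumer groups.
// NOT the Kafka API; every name here is hypothetical.
object ConsumerGroupOffsets {
  // The topic is a single append-only log of messages (the "single input").
  final case class Topic(name: String, log: Vector[String])

  // One read position per consumer group, standing in for what Zookeeper holds.
  final case class OffsetStore(offsets: Map[String, Int]) {
    def position(group: String): Int = offsets.getOrElse(group, 0)
    def commit(group: String, next: Int): OffsetStore =
      OffsetStore(offsets.updated(group, next))
  }

  // Read up to `max` messages for `group`, starting at that group's own offset,
  // and return the messages together with the updated offsets.
  def poll(topic: Topic, store: OffsetStore, group: String, max: Int): (Vector[String], OffsetStore) = {
    val from  = store.position(group)
    val batch = topic.log.slice(from, from + max)
    (batch, store.commit(group, from + batch.size))
  }

  def main(args: Array[String]): Unit = {
    val topic   = Topic("tracking", Vector("e1", "e2", "e3", "e4"))
    val (a, s1) = poll(topic, OffsetStore(Map.empty), "Streaming1", 3)
    val (b, s2) = poll(topic, s1, "Flume1", 2)
    println(s"Streaming1 read $a, Flume1 read $b")
    println(s"offsets per group: ${s2.offsets}")
  }
}
```

Each group advances independently: "Flume1" still starts from the beginning of the log even after "Streaming1" has read three messages.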
![Page 5: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/5.jpg)
Spark supplement: a careful explanation
Apache Spark?
An engine for fast, large-scale data processing
![Page 6: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/6.jpg)
About SparkCore
[Figure: SparkCore with its surroundings. Cluster Manager: Yarn, Mesos. Data Source: Stream, HDFS.]
![Page 7: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/7.jpg)
SparkRDDs supplement
Spark handles data in units called RDDs.
val test = sc.textFile("/tmp/sample.txt")   // test is an RDD
test.count                                   // run an operation on that RDD
Operations such as count, take, collect, map, and filter are run against this RDD.
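The operations named above can be tried on a small in-memory RDD. The following is a minimal sketch, assuming the Spark core library is on the classpath and a local master is acceptable; the sample strings, the app name, and the `RddBasics` object are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  // Runs count, take, filter, map, and collect against a small RDD.
  def run(sc: SparkContext): (Long, Seq[String], Seq[String]) = {
    // parallelize builds an RDD from a local collection (made-up sample data).
    val lines = sc.parallelize(Seq("kafka broker", "spark core", "spark streaming", "hbase region"))

    val total     = lines.count()                      // action: number of elements
    val firstTwo  = lines.take(2).toSeq                // action: first two elements
    val sparkOnly = lines.filter(_.contains("spark"))  // transformation: lazy, yields a new RDD
    val upper     = sparkOnly.map(_.toUpperCase)       // transformation: lazy, yields a new RDD
    (total, firstTwo, upper.collect().toSeq)           // collect: action, brings all elements to the driver
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[2]")
    val sc   = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

map and filter only describe a new RDD; nothing is computed until an action such as count, take, or collect is called.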
![Page 8: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/8.jpg)
SparkStreaming supplement
[Figure: Tweet1 through Tweet10 arrive along a time axis and are cut into one DStream batch per second, at 1 s, 2 s, 3 s, and 4 s.]
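The idea in the figure, cutting a continuous stream into one batch per interval, can be sketched in plain Scala. This is a toy model and not the Spark Streaming API: `Tweet`, `toBatches`, and the timestamps are hypothetical. In Spark Streaming itself, each interval of a DStream is backed by an RDD.

```scala
// Toy model of micro-batching; NOT the Spark Streaming API.
object MicroBatches {
  // A tweet with its arrival time in milliseconds (hypothetical type).
  final case class Tweet(id: Int, arrivedAtMs: Long)

  // Group tweets into consecutive windows of `batchIntervalMs`.
  // Key 0 is the first interval, key 1 the second, and so on.
  def toBatches(tweets: Seq[Tweet], batchIntervalMs: Long): Map[Long, Seq[Int]] =
    tweets
      .groupBy(t => t.arrivedAtMs / batchIntervalMs)
      .map { case (window, ts) => window -> ts.sortBy(_.arrivedAtMs).map(_.id) }

  def main(args: Array[String]): Unit = {
    val tweets = Seq(Tweet(1, 100L), Tweet(2, 900L), Tweet(3, 1200L), Tweet(4, 2500L), Tweet(5, 2600L))
    // With a 1-second interval, each window holds the tweets that arrived in that second.
    toBatches(tweets, 1000L).toSeq.sortBy(_._1).foreach { case (window, ids) =>
      println(s"batch $window: tweets $ids")
    }
  }
}
```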
![Page 9: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/9.jpg)
HBase supplement
Mapping of HBase and its surrounding technologies in this DMM case
[Figure: Consumer side: Flume x3. HBase: Master x2, RegionServer x6, with Zookeeper x3.]
![Page 10: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/10.jpg)
HBase supplement
[Figure: three SparkStreaming partitions (Consumer) write to HBase, which has a Master and RegionServer1 to RegionServer3. RowKey prefixes shown: jp.co.dmm, com.dmm, com.r18.]
HBase data consists of RowKey, ColumnFamily, Column, Value, and Timestamp.
Important: rows are distributed to RegionServers by RowKey.
![Page 11: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/11.jpg)
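Kafka+Spark (supplement to pitfall 3)
[Figure: an Input feeds Streaming (three SparkStreaming partitions). The maximum value at the Input is labeled MaxRate.]
• In SparkStreaming, processing must finish within the unit time (BatchInterval)
• Controlling the amount of data processed per unit time is important
• The Inputs provided by SparkStreaming can basically be throttled with maxRate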
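The relationship between the rate cap and the BatchInterval is simple arithmetic, sketched below in plain Scala. The function names and the throughput figure are assumptions for illustration. As far as I know, the corresponding Spark settings are `spark.streaming.receiver.maxRate` (receiver-based inputs) and `spark.streaming.kafka.maxRatePerPartition` (direct Kafka input), both in records per second; which one applies depends on the input in use.

```scala
object BatchBudget {
  // Upper bound on the records pulled into one batch when every input
  // partition is capped at `maxRatePerPartition` records per second.
  def maxRecordsPerBatch(maxRatePerPartition: Long, partitions: Int, batchIntervalSec: Long): Long =
    maxRatePerPartition * partitions * batchIntervalSec

  // True when the worst-case batch can be processed within the interval,
  // given an assumed throughput (records per second) measured elsewhere.
  def fitsInInterval(maxRecords: Long, throughputPerSec: Long, batchIntervalSec: Long): Boolean =
    maxRecords <= throughputPerSec * batchIntervalSec

  def main(args: Array[String]): Unit = {
    // Hypothetical numbers: 1000 records/s per partition, 3 partitions, 5 s batches.
    val worstCase = maxRecordsPerBatch(1000L, 3, 5L)
    println(s"worst-case records per batch: $worstCase")
    println(s"fits at 4000 records/s: ${fitsInInterval(worstCase, 4000L, 5L)}")
    println(s"fits at 2000 records/s: ${fitsInInterval(worstCase, 2000L, 5L)}")
  }
}
```

If the worst-case batch does not fit, batches queue up behind each other, so the cap is chosen from the measured throughput rather than the other way around.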
![Page 12: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/12.jpg)
HBase supplement (supplement to pitfall 4)
[Figure: three SparkStreaming partitions (Consumer) write to HBase, which has a Master and RegionServer1 to RegionServer3. RowKey prefixes shown: jp.co.dmm, com.dmm, com.r18.]
[Repeated] HBase data consists of RowKey, ColumnFamily, Column, Value, and Timestamp.
Important: rows are distributed to RegionServers by RowKey.
![Page 13: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/13.jpg)
HBase supplement (supplement to pitfall 4)
Relationship between HBase RowKeys and Regions
[Figure: RegionServer1 to RegionServer3 host Regions R1 to R5. RowKeys shown: com.dmm.www_X, com.dmm.p-town_X, com.dmm.nikukai_X, jp.co.dmm.news_X, jp.co.dmm.ip1_X, jp.co.dmm.blog_X.]
Regions are split according to the RowKey (that is, by RowKey range).
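How a RowKey falls into a Region by range can be sketched as a lookup over sorted start keys. This is a plain Scala illustration, not the HBase client API: the split points, the Region-to-RegionServer assignment, and the `_1` key suffix are all hypothetical, and Java string ordering stands in for HBase's byte-wise key ordering.

```scala
// Toy range lookup; NOT the HBase API. Split points are made up.
object RegionLookup {
  // A region covers [startKey, start key of the next region). "" means "from the beginning".
  final case class Region(name: String, startKey: String, server: String)

  // Pick the last region whose start key is <= the RowKey.
  def regionFor(regions: Seq[Region], rowKey: String): Region = {
    val sorted = regions.sortBy(_.startKey)
    sorted.filter(_.startKey.compareTo(rowKey) <= 0).lastOption.getOrElse(sorted.head)
  }

  // Hypothetical layout of five regions over three servers.
  val regions: Seq[Region] = Seq(
    Region("R1", "",            "RegionServer1"),
    Region("R2", "com.dmm.p",   "RegionServer1"),
    Region("R3", "com.r18",     "RegionServer2"),
    Region("R4", "jp.co.dmm.i", "RegionServer2"),
    Region("R5", "jp.co.dmm.n", "RegionServer3")
  )

  def main(args: Array[String]): Unit =
    Seq("com.dmm.www_1", "com.dmm.nikukai_1", "jp.co.dmm.news_1").foreach { key =>
      val r = regionFor(regions, key)
      println(s"$key -> ${r.name} on ${r.server}")
    }
}
```

Keys that share a prefix sort next to each other, so they tend to fall into the same Region unless a split point happens to separate them.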
![Page 14: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/14.jpg)
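HBase supplement (supplement to pitfall 4)
Relationship between HBase RowKeys and Regions
[Figure: RegionServer1 to RegionServer3 host Regions R1 to R5. RowKeys shown: com.dmm.www_X, com.dmm.p-town_X, com.dmm.nikukai_X, jp.co.dmm.news_X, jp.co.dmm.ip1_X, jp.co.dmm.blog_X. One group of RowKeys and one write destination are highlighted.]
Access is heavily skewed toward these RowKeys.
As a result, almost all writes end up going to this one destination.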
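The skew can be made concrete by counting where writes land. Below is a rough sketch in plain Scala, not HBase code: the split points and the traffic mix (eight writes to one site, one each to two others) are invented purely to illustrate the effect.

```scala
// Toy measurement of write skew; NOT the HBase API. All data is made up.
object SkewedWrites {
  // Region index for a RowKey = number of split points that are <= the key.
  def regionIndex(splitPoints: Seq[String], rowKey: String): Int =
    splitPoints.count(_.compareTo(rowKey) <= 0)

  // Number of writes that land in each region.
  def writesPerRegion(splitPoints: Seq[String], rowKeys: Seq[String]): Map[Int, Int] =
    rowKeys.groupBy(k => regionIndex(splitPoints, k)).map { case (r, ks) => r -> ks.size }

  // Fraction of all writes taken by the busiest region.
  def hottestShare(counts: Map[Int, Int]): Double =
    if (counts.isEmpty) 0.0 else counts.values.max.toDouble / counts.values.sum

  def main(args: Array[String]): Unit = {
    val splits = Seq("com.dmm.p", "com.r18", "jp.co.dmm.i", "jp.co.dmm.n")
    // Hypothetical traffic: most writes target one site.
    val writes = (1 to 8).map(i => s"com.dmm.www_$i") ++ Seq("jp.co.dmm.news_1", "jp.co.dmm.blog_1")
    val counts = writesPerRegion(splits, writes)
    println(s"writes per region: $counts")
    println(s"share of the hottest region: ${hottestShare(counts)}")
  }
}
```

Because the RowKey starts with the site name, a popular site concentrates its writes in whichever Region covers that prefix.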
![Page 15: 20151014 spark勉強会補足資料](https://reader031.fdocument.pub/reader031/viewer/2022021507/58a39df21a28abb1348b6593/html5/thumbnails/15.jpg)
That is all.
2015.10.14
CTO room: tanaka.yuichi