Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Transcript of Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
![Page 1: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/1.jpg)
Eric Frenkiel, MemSQL CEO and co-founder
August 11, 2015 • San Diego, CA
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
![Page 2: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/2.jpg)
What’s In Store
MemSQL and a fresh look at Lambda architectures
Building real-time data pipelines for immediate impact
One architecture for many applications
![Page 3: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/3.jpg)
MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus
![Page 4: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/4.jpg)
The Real-Time Database for Transactions and Analytics
In-Memory, Distributed, Relational
Data Center, Software, Cloud
![Page 5: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/5.jpg)
A Fresh Look at Lambda Architectures
• Speed: Fast Updates
• Batch: Fast Appends
• Serving: Unified queries, full SQL
![Page 6: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/6.jpg)
Comprehensive Architecture
• Transactions
![Page 7: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/7.jpg)
Comprehensive Architecture
• Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
• Transactions
![Page 8: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/8.jpg)
Comprehensive Architecture
• Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
• Analytics
• Transactions
![Page 9: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/9.jpg)
Comprehensive Architecture
• Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
• Historical: Batch Layer, Fast Appends, Columnstore
• Analytics
• Transactions
![Page 10: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/10.jpg)
Comprehensive Architecture
• Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
• Historical: Batch Layer, Fast Appends, Columnstore
• Analytics
• Transactions
Execution engine that spans the data spectrum
![Page 11: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/11.jpg)
Comprehensive Architecture
• Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
• Historical: Batch Layer, Fast Appends, Columnstore
• Analytics
• Transactions
![Page 12: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/12.jpg)
Building Real-Time Data Pipelines for Immediate Impact
![Page 13: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/13.jpg)
By 2020, HP predicts that over a trillion sensors will be online
“The Internet of Things Will Drastically Change Our Future” – Datafloq
![Page 14: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/14.jpg)
Going Real-Time is the Next Phase for Big Data
More Devices
More Interconnectivity
More User Demand
…and companies are at risk of being left behind
![Page 15: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/15.jpg)
Expensive, not scalable, batch only, SAN-burdened
1%
![Page 16: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/16.jpg)
Success will be driven by real-time analytic applications.
![Page 17: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/17.jpg)
Designing the Ideal Real-Time Pipeline
Message Queue → Transformation → Speed/Serving Layer
End-to-End Data Pipeline Under One Second
![Page 18: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/18.jpg)
Kafka
• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization
![Page 19: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/19.jpg)
Spark
• In-memory execution engine
• High-level operators for procedural and programmatic analytics
• Faster than MapReduce
![Page 20: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/20.jpg)
MemSQL
• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications
![Page 21: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/21.jpg)
Use Spark and Operational Databases Together

| | Spark | Operational Databases |
|---|---|---|
| Interface | Programmatic | Declarative |
| Execution Environment | Job Scheduler | SQL Engine and Query Optimizer |
| Persistent Storage | Use another system | Built-in |
![Page 22: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/22.jpg)
Subscribing to Kafka
Event: (2015-07-06T16:43:40.33Z, 329280, 23, 60)
Publish to Kafka topic: 0111001010101111101111100000001010111100001110101100000010010010111…
Event added to message queue:
0111001010101111101111100000001010111100001110101100000010010010111…
1110010101000101010001010100010111111010100011110101100011010101000…
0101111000011100101010111110001111011010111100000000101110101100000…
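
The publish step on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's code: the slides do not state the wire format, so the `EVENT_FORMAT` layout, the `serialize_event` and `publish` helpers, and the in-memory `topic` queue standing in for a Kafka topic are all assumptions. The field order (time, house_id, device_id, watts) is inferred from the table on the later "Persist and Prepare for Production" slide.

```python
import struct
from collections import deque
from datetime import datetime, timedelta, timezone

# Hypothetical wire format (the slides do not specify one): epoch
# milliseconds as a signed 64-bit integer, then house_id, device_id and
# watts as unsigned 32-bit integers, all big-endian.
EVENT_FORMAT = ">qIII"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def serialize_event(timestamp, house_id, device_id, watts):
    """Pack one sensor reading into a fixed-width binary message."""
    epoch_ms = (timestamp - EPOCH) // timedelta(milliseconds=1)
    return struct.pack(EVENT_FORMAT, epoch_ms, house_id, device_id, watts)


# In-memory stand-in for a Kafka topic; a real producer would send the
# same bytes to a broker instead of appending to a local queue.
topic = deque()


def publish(message):
    """Append a serialized event to the stand-in topic."""
    topic.append(message)


# The event shown on the slide.
event_time = datetime(2015, 7, 6, 16, 43, 40, 330000, tzinfo=timezone.utc)
payload = serialize_event(event_time, 329280, 23, 60)
publish(payload)
print(len(payload), "bytes queued;", len(topic), "message(s) in topic")
```

A fixed-width binary layout is used here only because the slide depicts the message as a bit string; a text or schema-based encoding would serve the same role.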
![Page 23: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/23.jpg)
Enrich and Transform the Data
Spark polling Kafka for new messages:
0111001010101111101111100000001010111100001110101100000010010010111…
Deserialization: (2015-07-06T16:43:40.33Z, 329280, 23, 60)
Enrichment: (2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
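
The two transformation steps named on this slide, deserialization and enrichment, might look like the sketch below in plain Python. In the talk this work runs inside Spark; here a list comprehension stands in for the per-batch map. The binary layout and the `ZIP_BY_HOUSE` and `TYPE_BY_DEVICE` lookup tables are hypothetical, chosen only so that the slide's example event enriches to the slide's example result.

```python
import struct
from datetime import datetime, timedelta, timezone

# Hypothetical wire format: epoch milliseconds (int64) followed by
# house_id, device_id and watts (uint32 each), big-endian.
EVENT_FORMAT = ">qIII"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical reference data used for enrichment.
ZIP_BY_HOUSE = {329280: 94110}
TYPE_BY_DEVICE = {23: "kitchen_appliance"}


def deserialize(message):
    """Turn a binary message back into (time, house_id, device_id, watts)."""
    epoch_ms, house_id, device_id, watts = struct.unpack(EVENT_FORMAT, message)
    return (EPOCH + timedelta(milliseconds=epoch_ms), house_id, device_id, watts)


def enrich(event):
    """Add the zip code and device type looked up from the reference data."""
    time, house_id, device_id, watts = event
    return (
        time,
        house_id,
        ZIP_BY_HOUSE.get(house_id),      # None if the house is unknown
        device_id,
        TYPE_BY_DEVICE.get(device_id),   # None if the device is unknown
        watts,
    )


def process_batch(messages):
    """Deserialize and enrich every message polled in one batch."""
    return [enrich(deserialize(m)) for m in messages]


# Build one sample message matching the event on the slide.
sample_time = datetime(2015, 7, 6, 16, 43, 40, 330000, tzinfo=timezone.utc)
raw_message = struct.pack(
    EVENT_FORMAT, (sample_time - EPOCH) // timedelta(milliseconds=1), 329280, 23, 60
)
enriched = process_batch([raw_message])
print(enriched[0])
```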
![Page 24: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/24.jpg)
Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...

| time | house_id | zip | device_id | device_type | watts |
|---|---|---|---|---|---|
| 2015-07-06T16:43:40.33Z | 329280 | 94110 | 23 | ‘kitchen_appliance’ | 60 |
| … | … | … | … | … | … |
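
To make the persist step concrete, here is a small sketch of the `INSERT` that a call like `RDD.saveToMemSQL()` would ultimately issue. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without a cluster; it does not call the MemSQL Spark connector. The column names come from the slide, while the column types and the `INSERT_SQL` constant are assumptions.

```python
import sqlite3

# sqlite3 is only a local stand-in for the operational database.
conn = sqlite3.connect(":memory:")

# Column names follow the slide; the types are assumed.
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         INTEGER,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)

INSERT_SQL = (
    "INSERT INTO memcity_table "
    "(time, house_id, zip, device_id, device_type, watts) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)

# One enriched row, taken from the slide's example.
rows = [("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60)]

# executemany sends the whole batch with one parameterized statement.
conn.executemany(INSERT_SQL, rows)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
print(stored)
```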
![Page 25: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/25.jpg)
Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
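
The slide elides the actual query, so the following is only one illustrative possibility for what an application might ask of `memcity_table`: total power draw per device type within a zip code. The query text, the second and third sample rows, and the use of `sqlite3` as a local stand-in for the database are all assumptions made for this sketch.

```python
import sqlite3

# sqlite3 stands in for the operational database so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memcity_table ("
    "time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)

sample_rows = [
    # First row is the slide's example; the other two are hypothetical.
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    ("2015-07-06T16:43:41.10Z", 329280, 94110, 24, "lighting", 15),
    ("2015-07-06T16:43:41.52Z", 329281, 94110, 23, "kitchen_appliance", 40),
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", sample_rows)

# Hypothetical analytic query: power draw by device type for one zip code.
QUERY = """
    SELECT device_type, SUM(watts) AS total_watts
    FROM memcity_table
    WHERE zip = ?
    GROUP BY device_type
    ORDER BY total_watts DESC
"""
totals = conn.execute(QUERY, (94110,)).fetchall()
for device_type, total_watts in totals:
    print(device_type, total_watts)
```

Because the rows are queryable as soon as they are inserted, the same SQL can back a dashboard or an application endpoint without a separate export step.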
![Page 26: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/26.jpg)
One Architecture for Many Applications
![Page 27: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/27.jpg)
Lambda Applies to Real-Time Data Pipelines
Diagram labels: Inputs, Message Queue, Batch, Transformation, Database, Application
![Page 28: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/28.jpg)
Kafka, Spark, and MemSQL Make it Simple
Diagram labels: Inputs, Batch, Application
![Page 29: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/29.jpg)
Monitoring real-time Xfinity programming and video health
![Page 30: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/30.jpg)
• Collect streaming data at scale (hundreds of MemSQL machines)
• Proactively diagnose issues
• Query ad hoc and in real time with full SQL
• From 30 minutes to less than 1 second
Real-time Analytics
![Page 31: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/31.jpg)
Real-Time Trend Analytics
![Page 32: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/32.jpg)
Massive Ingest and Concurrent Analytics
• Instant accuracy to the latest repin
• Build real-time analytic applications
Real-time analytics
![Page 33: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/33.jpg)
Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E
![Page 34: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/34.jpg)
Real-Time Segmentation
![Page 35: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/35.jpg)
Using Real-Time for Personalization
Diagram labels: Ad Servers, EC2, Real-time analytics, PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Operational Data Store (ODS), Star Schema, MicroStrategy
• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times
![Page 36: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/36.jpg)
![Page 37: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases](https://reader030.fdocument.pub/reader030/viewer/2022032700/55d34f70bb61eb26628b4795/html5/thumbnails/37.jpg)
Thank You!
Visit MemSQL at Booth #518
Real-Time Demos, T-Shirt Giveaway, Games