Handling IOT Data with a Modern Data Architecture
Cliff Gilmore - Data Practice Director @ 1904labs
Capture, Process and Serve (All the Things)
Challenges of IOT Data
Scale
Frequency of events
Size of Data
Number of Devices
Number of Users
Latency Demands
Geo Distribution
Processing
Batch Analytics
Realtime Analytics
Aggregations
Machine Learning
Reporting
Applications
Realtime Access
Report Visualization
Production Analytics
ML Driven Decisions
Microservices
IOT Data Architecture
Leading to Lambda and Kappa
Architectural Requirements
❏ Must scale out linearly on ingestion, processing, storage, and access.
❏ Need to be able to store huge amounts of data organized for different access patterns.
❏ Must be able to process data in flight for real-time decision making, alerting, and pattern matching.
❏ Need to serve the data to the rest of the organization through a common API/Service
❏ The architecture must be agile enough to accommodate changes to business logic and processing algorithms.
Stack Components
❏ Distributed Log / Queue
    ❏ A pub/sub partitioned queue
    ❏ Kafka is the de facto choice due to its wide use in production
❏ Stream Processing
    ❏ Ability to process events as they arrive
    ❏ Event at a Time
        ❏ Samza, Flink, Storm
    ❏ Micro Batch
        ❏ Spark Streaming
❏ Batch Processing
    ❏ Process event history in bulk
    ❏ Spark, MapReduce on top of HDFS or Wide Column Stores
❏ Serving
    ❏ Expose data to the rest of the organization and serve application requests
    ❏ Wide Column Store
        ❏ Cassandra, HBase, BigTable
    ❏ Can also be RDBMS for some data sets (Reports, BI Rollups, etc.)
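The "pub/sub partitioned queue" idea can be sketched in a few lines. This is a toy in-memory stand-in, not Kafka's API or its actual partitioner; `PartitionedLog`, `partition_for`, and the partition count of 4 are hypothetical names and values chosen for illustration. It shows the one property that matters for IOT data: events with the same key always land in the same partition, in arrival order.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, for illustration only


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (md5 here, purely illustrative)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class PartitionedLog:
    """Toy in-memory stand-in for a partitioned, append-only log."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def append(self, key, value):
        """Append an event and return its (partition, offset)."""
        p = partition_for(key, self.num_partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset=0):
        """Return the events of one partition, starting at the given offset."""
        return self.partitions[partition][offset:]


log = PartitionedLog()
for i in range(3):
    log.append("device-42", {"seq": i})

p = partition_for("device-42")
print(log.read(p))  # the three device-42 events, in append order
```

Because consumers track an offset per partition, each partition can be read independently, which is what lets ingestion scale out by adding partitions.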
Lambda Architecture
[Diagram: Events flow into a Distributed Log, which feeds a Batch Layer and a Speed Layer; both feed the Serving Layer. Diagram labels: "Raw Data, Pattern Matching and Aggregates" and "Patterns, Rollups, Recommendations".]
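A minimal sketch of how a Lambda serving layer might answer a query, assuming a simple per-device event count as the metric. The function names and the `batch_horizon` cut-off are hypothetical: the batch view covers history up to the last batch run, the speed view covers everything since, and the serving layer adds the two.

```python
from collections import Counter


def batch_view(events, up_to):
    """Recompute per-device counts from the history before the batch horizon."""
    return Counter(e["device_id"] for e in events[:up_to])


def speed_view(events, since):
    """Count only the events that arrived after the last batch run."""
    return Counter(e["device_id"] for e in events[since:])


def serve(batch, speed, device_id):
    """Serving layer: merge the batch result with the real-time delta."""
    return batch.get(device_id, 0) + speed.get(device_id, 0)


events = [{"device_id": d} for d in ["a", "b", "a", "a", "b"]]
batch_horizon = 3  # hypothetical: the batch job has processed the first 3 events

batch = batch_view(events, batch_horizon)
speed = speed_view(events, batch_horizon)
print(serve(batch, speed, "a"))  # 3
print(serve(batch, speed, "b"))  # 2
```

The cost of this design is visible even here: the same counting logic exists twice, once per layer.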
Kappa Architecture
[Diagram elements: Events, Distributed Log, Batch, Streaming, Serving, Stream Results, Stream V1, Stream V2, Table V1, Table V2, Raw Data.]
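The "Stream V1 / Stream V2" and "Table V1 / Table V2" labels suggest the usual Kappa reprocessing pattern: when the logic changes, a new version of the stream job replays the retained log from the start and writes a new table. The sketch below assumes exactly that; the job bodies, the 30.0 threshold, and the `replay` helper are hypothetical.

```python
def stream_v1(event):
    """Hypothetical original logic: pass the Celsius reading through."""
    return {"device_id": event["device_id"], "temp_c": event["temp_c"]}


def stream_v2(event):
    """Hypothetical revised logic: also flag readings above a threshold."""
    out = stream_v1(event)
    out["alert"] = event["temp_c"] > 30.0
    return out


def replay(log, job):
    """Run a stream job over the retained log from offset 0 into a fresh table."""
    return [job(event) for event in log]


# The distributed log keeps the raw events, so any job version can be re-run.
log = [
    {"device_id": "s1", "temp_c": 22.0},
    {"device_id": "s1", "temp_c": 30.2},
]

table_v1 = replay(log, stream_v1)
table_v2 = replay(log, stream_v2)  # built alongside V1; readers switch once it catches up
print(table_v2)
```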
Cassandra to Serve IOT Data
The Art of Time Series
Why Cassandra?
❏ Proven linear scale up to 1000s of nodes in a single cluster
❏ Geo redundancy to collect data where it is created and replicate across the globe
❏ High capacity to ingest parallel individual writes
❏ Low latency and high throughput reads
❏ Wide-column store data model allows for data to be structured around query patterns
❏ Continuous availability suited to and used for the most mission critical systems
❏ AP system in CAP theorem terms; consistency is tunable per operation, so each read or write can trade consistency against availability and latency
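The tunable-consistency point comes down to simple arithmetic: with replication factor RF, a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds RF. A small sketch, with hypothetical helper names:

```python
def quorum(replication_factor):
    """Replicas needed for a QUORUM operation: a strict majority of RF."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must share a replica with every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                               # 2
print(read_overlaps_write(rf, q, q))   # True: QUORUM writes + QUORUM reads
print(read_overlaps_write(rf, 1, 1))   # False: ONE/ONE favors availability and latency
```

High-volume sensor ingestion often runs at the low-consistency end of this range, since a single late reading rarely matters; that is a workload judgment, not a rule.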
Cassandra 101
[Diagram: data centers DC1 and DC2]
Physical Data Model
Partition Key
    Clustering Key Val1 | Col1:Val  Col2:Val  Col3:Val  Col4:Val  Col5:Val  Col6:Val  ...
    Clustering Key Val2 | Col1:Val  Col6:Val  ...
    Clustering Key Val3 | Col6:Val  ...
    Clustering Key Val4 | Col1:Val  Col2:Val  Col3:Val  Col4:Val  Col5:Val  Col6:Val  ...
    Clustering Key Val5 | Col1:Val  Col2:Val  ...
    ...
CQL - Cassandra Query Language
❏ Simple to use language that looks like SQL
❏ No joins, group by etc
❏ Example Queries
❏ SELECT * FROM readings WHERE device_id = ? AND event_time > ? AND event_time <= ?;
❏ INSERT INTO readings (device_id, event_time, temperature) VALUES (?,?,?);
I’ve got this!
TimeSeries Table Example
CREATE TABLE readings (
sensor_id text,
event_time TimeUUID,
temperature decimal,
PRIMARY KEY (sensor_id,event_time)
);
❏ event_time TimeUUID: Time Ordered Sortable UUID
❏ sensor_id: Partition Key
❏ event_time: Clustering Key
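To make the "time ordered" part of TimeUUID concrete: a TimeUUID is a version 1 UUID, which embeds a 60-bit timestamp counted in 100-nanosecond intervals since 1582-10-15. The sketch below uses Python's standard `uuid` module to recover that timestamp; `timeuuid_to_datetime` is a hypothetical helper, not part of any Cassandra driver.

```python
import uuid
from datetime import datetime, timezone

# 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u):
    """Extract the embedded timestamp of a version 1 UUID as a UTC datetime."""
    if u.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    unix_100ns = u.time - UUID_EPOCH_OFFSET
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)


event_id = uuid.uuid1()  # time-based UUID generated on this machine
print(event_id, timeuuid_to_datetime(event_id))
```

Two readings from the same sensor in the same millisecond still get distinct TimeUUIDs, which is why it is a safer clustering key than a plain timestamp. When sorting such values client-side in Python, sort by the `.time` field: comparing raw `UUID` objects does not give time order.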
Physical Data Model
Partition (station) | Clustering key (time): value
Station #1 | 12:05.15: 15.9 C | 12:05.16: 15.9 C | 12:05.17: 16.0 C | 12:05.18: 16.1 C | 12:05.19: 16.0 C | ...
Station #2 | 12:05.15: 22.0 C | 12:05.20: 22.1 C | 12:05.25: 27.9 C | 12:05.30: 27.7 C | 12:05.35: 30.2 C | ...
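The layout above is why time-range queries are cheap: within one partition the rows are already sorted by clustering key, so a range is a contiguous slice. A rough in-memory model using the station readings from the slide (the `partitions` dict and `slice_partition` function are illustrative, not how Cassandra is implemented):

```python
from bisect import bisect_right

# partition key -> rows kept sorted by clustering key (time strings as on the slide)
partitions = {
    "Station #1": [("12:05.15", 15.9), ("12:05.16", 15.9), ("12:05.17", 16.0),
                   ("12:05.18", 16.1), ("12:05.19", 16.0)],
    "Station #2": [("12:05.15", 22.0), ("12:05.20", 22.1), ("12:05.25", 27.9),
                   ("12:05.30", 27.7), ("12:05.35", 30.2)],
}


def slice_partition(station, start_exclusive, end_inclusive):
    """Model of: WHERE key = ? AND event_time > ? AND event_time <= ?"""
    rows = partitions[station]            # one partition lookup
    keys = [t for t, _ in rows]
    lo = bisect_right(keys, start_exclusive)
    hi = bisect_right(keys, end_inclusive)
    return rows[lo:hi]                    # contiguous, already in time order


print(slice_partition("Station #1", "12:05.15", "12:05.18"))
# [('12:05.16', 15.9), ('12:05.17', 16.0), ('12:05.18', 16.1)]
```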
Advanced Data Model Topics
❏ Consider bucketing time in the Partition Key if the sample rate is high
    ❏ Primary Key ((device_id,year,week),event_time)
❏ If per-event granularity is not needed, can batch or roll up events
    ❏ Primary Key (device_id, event_minute)
❏ If we batch events
    ❏ JSON blob of sensor readings within the minute
    ❏ Can’t update sensor readings without read-before-write
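Both techniques are computed on the client before the write. A sketch under stated assumptions: the bucket uses ISO year and week (the slide does not say which week convention is intended), and `weekly_partition_key`, `minute_bucket`, and `batch_by_minute` are hypothetical helper names.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def weekly_partition_key(device_id, event_time):
    """Composite partition key (device_id, year, week), using ISO year/week (an assumption)."""
    iso_year, iso_week, _ = event_time.isocalendar()
    return (device_id, iso_year, iso_week)


def minute_bucket(event_time):
    """Truncate a timestamp to the minute, for an event_minute clustering key."""
    return event_time.replace(second=0, microsecond=0)


def batch_by_minute(readings):
    """Group (device_id, event_time, temperature) into one JSON blob per device-minute."""
    buckets = defaultdict(list)
    for device_id, event_time, temperature in readings:
        key = (device_id, minute_bucket(event_time))
        buckets[key].append({"t": event_time.isoformat(), "temperature": temperature})
    return {key: json.dumps(rows) for key, rows in buckets.items()}


ts = datetime(2021, 6, 15, 12, 5, 17, tzinfo=timezone.utc)
print(weekly_partition_key("device-42", ts))  # ('device-42', 2021, 24)
print(minute_bucket(ts))                      # 2021-06-15 12:05:00+00:00

blobs = batch_by_minute([
    ("device-42", ts, 15.9),
    ("device-42", ts.replace(second=18), 16.0),
    ("device-42", ts.replace(minute=6), 16.1),
])
print(len(blobs))  # 2 rows instead of 3 events
```

The blob form also shows the read-before-write cost: changing one reading means fetching the blob, editing it, and writing the whole minute back.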