Technologies, Data Analytics Service and Enterprise Businesses
SENDAI IT COMMUNE #2 2018-01-09
Satoshi Tagomori (@tagomoris) Treasure Data, Inc.
Satoshi Tagomori (@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, Woothee, ...
Treasure Data, Inc.
Retry-able Failures or Not
Idempotent Operations (in Japanese: 冪等な操作; 冪等 is read "bekitou")
Technologies ↓ Data Analytics Service ↓ Enterprise Business
Enterprise Business ?
• Many different definitions and discussions about "Enterprise"... :(
• MY DEFINITION IN THIS TALK: "Businesses NOT about IT"
• Thus, most businesses are "Enterprise", everywhere, not only in Tokyo
Data Analytics Service ?
• Provides ways to know:
  • How many people are reaching our products?
  • How many times are they seeing our advertisements?
  • And how many times do they buy our products?
  • When do they use our products?
  • When did they buy our products?
  • Where did they buy our products?
  • ...
• Something that helps our business by using data
Data Analytics Service for Enterprise Business ?
• Something that helps "business not about IT" by using data (IT)
• Staff (using the data analytics service) don't know about IT
  • and also don't care about IT
  • but "need" the results of analytics
• Everyone checks the report about yesterday at 10:00 AM
  • We need results before 10:00 AM
  • 10:10 AM is too late, but 2:00 AM is too early...
Deadline and Retries
[Figure: timeline with labels 00:00 AM, 01:00 AM, 05:30 AM and 10:00 AM]
• Big Job, Power 1: Crash! → Delay...
• Big Job, Power 2: Crash! → OK!
• Small Jobs, Power 1: Crash! → OK!
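The three cases above can be read as simple arithmetic. The following is only a rough sketch: the start time, the six hours of work, the meaning of "power" as a plain speed-up factor, and the worst-case crash just before a piece completes are all assumptions made for illustration, since the slides give only the picture. The function name `finish_hour` is hypothetical as well.

```python
# Hypothetical numbers: none of these come from the talk.
START_HOUR = 1.0      # jobs start at 01:00 AM
DEADLINE_HOUR = 10.0  # reports are read at 10:00 AM


def finish_hour(work_hours, power, pieces, crashes=1):
    """Finish time when `crashes` pieces each fail once, right before completing.

    The work is split into `pieces` equal parts that run one after another.
    A crashed part is rerun from its beginning, so its time is spent twice.
    """
    piece_hours = work_hours / power / pieces
    return START_HOUR + piece_hours * (pieces + crashes)


scenarios = {
    "big job, power 1": finish_hour(6.0, power=1, pieces=1),
    "big job, power 2": finish_hour(6.0, power=2, pieces=1),
    "small jobs, power 1": finish_hour(6.0, power=1, pieces=6),
}
for name, hour in scenarios.items():
    status = "OK" if hour <= DEADLINE_HOUR else "late"
    print(f"{name}: finishes at {hour:.1f}h -> {status}")
```

With these assumed numbers, the big job at power 1 misses the deadline after one crash, while doubling the power or splitting the work into small jobs leaves room for the retry.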
Missions of Data Analytics Service for Enterprise Business
• Fast "enough"
• Cheap "enough"
• Stable
• Easy to use "enough"
Technologies for Data Analytics Service
• Data Management System
• Distributed Processing System
• Queue and Scheduler
• Connecting Systems and Services
• Controlling Jobs, Tasks and Workflows
• Managing Retries
Data Management Systems
• Data Collecting Systems
  • Fluentd, Embulk, ...
• Distributed Database and Storage
  • Storing data in efficient format (MPC1, MessagePack columnar format)
  • Managing index
  • Managing schema
  • Providing transactional operations
Distributed Processing System
• Running Analytics Queries
  • MapReduce engines: Hadoop + Hive
  • MPP (Massively Parallel Processing) systems: Presto
• Running Data Management Jobs
  • Converting data formats, re-indexing, detecting schema, ...
• Computing Resource Management
  • Customer queries (and internal use) must be separated!
Queue and Scheduler
• Queuing Queries
  • Allows queries to be enqueued and run one after another
[Figure: Customer, Request, Power 1]
• Scheduling Queries
  • Run queries when it's OK to run them
[Figure: "Data for Queries" on a timeline with labels 01:00 AM and 03:00 AM]
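A minimal sketch of the two ideas, queuing and scheduling, might look like the following. It assumes that "OK to run" means "the input data for the query has arrived"; the function `run_queue`, the table names and the hours are all hypothetical and not taken from any real service.

```python
from collections import deque


def run_queue(queries, data_ready_hour, now_hour):
    """Process queued queries in FIFO order, one at a time.

    `queries` is a list of (name, table) pairs; `data_ready_hour` maps a table
    to the hour at which its data is assumed to be complete. A query whose data
    is not ready yet is left waiting for a later scheduling round.
    Returns (names run now, names still waiting).
    """
    queue = deque(queries)
    ran, waiting = [], []
    while queue:
        name, table = queue.popleft()
        if data_ready_hour[table] <= now_hour:
            ran.append(name)      # a real service would execute the query here
        else:
            waiting.append(name)  # not OK to run yet
    return ran, waiting


# Usage: at 02:00 AM only the access log is complete.
ran, waiting = run_queue(
    [("q1", "access_log"), ("q2", "purchases"), ("q3", "access_log")],
    {"access_log": 1.0, "purchases": 3.0},
    now_hour=2.0,
)
print(ran, waiting)
```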
Connecting Systems and Services
• Non-"connected" Data Analytics Service
[Figure: "Ultra Super Great Analytics Service" with Database, Query, Result]
Not "easy enough"
Connecting Systems and Services
• Data Analytics Service MUST be "connected"
[Figure: Treasure Data with Database, Query, Result]
Control Jobs/Tasks
• A Job needs results of other jobs
[Figure: "Risky" time-based schedule, A,B,C -> D,E -> F. Timeline labels: 01:00 AM, 03:10 AM ?, 03:30 AM, 06:30 AM ?, 07:00 AM, 10:00 AM]
• "Risk" for failures
[Figure: the same "risky" time-based schedule with a crash. Timeline labels: 01:00 AM, Crash!, 03:30 AM ("Oops, No Data..."), 07:00 AM ("Oops, No Data..."), 08:15 AM ?, 10:00 AM]
[Figure: time-based schedule, A,B,C -> D,E -> F, with two gaps labeled "Space for Retries". Timeline labels: 01:00 AM, 03:10 AM ?, 06:00 AM, 08:30 AM ?, 11:00 AM ???]
• "Time based schedule" needs
  • Wide space for retries
  • Big resource for fast results (not cheap!)
Control Jobs/Tasks
• Workflow pattern
[Figure: workflow execution, A,B,C -> D,E -> F, with workflow control barriers. Timeline labels: 01:00 AM, 07:15 AM ?, 10:00 AM]
• Workflow pattern with retries
[Figure: workflow execution, A,B,C -> D,E -> F, with workflow control barriers, a crash ("Crash!") and retries ("Retries!"). Timeline labels: 01:00 AM, 10:00 AM]
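The workflow pattern can be sketched roughly as follows. This is an illustration under assumptions, not how any particular workflow engine works: `run_workflow` and `flaky` are hypothetical names, tasks inside a stage run sequentially here for simplicity, and every exception is retried.

```python
def run_workflow(stages, run_task, max_attempts=3):
    """Run stages in order; a stage starts only after the previous one succeeded.

    `stages` is a list of lists of task names, e.g. [["A", "B", "C"], ["D", "E"], ["F"]].
    `run_task(name)` raises an exception on failure. A failed task is retried
    immediately, up to `max_attempts` times, instead of waiting for a clock time.
    Returns the list of (task, attempt) pairs in execution order.
    """
    log = []
    for stage in stages:            # barrier: the whole stage finishes first
        for task in stage:          # a real engine could run these in parallel
            for attempt in range(1, max_attempts + 1):
                log.append((task, attempt))
                try:
                    run_task(task)
                    break           # success: move on to the next task
                except Exception:
                    if attempt == max_attempts:
                        raise       # give up: later stages never start
    return log


# Usage: task "B" crashes once, then succeeds on its retry.
crashed = set()


def flaky(name):
    if name == "B" and name not in crashed:
        crashed.add(name)
        raise RuntimeError("Crash!")


log = run_workflow([["A", "B", "C"], ["D", "E"], ["F"]], flaky)
print(log)
```

Because D and E start as soon as A, B and C have all succeeded, no fixed start time has to be guessed, and the time saved is available for retries.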
Retry-able Failures or Not
• "Retry-able Failures"
  • Crash of compute nodes
  • Communication errors
  • Service down of "connected" services
  • ...
• Non-"Retry-able Failures"
  • SQL syntax error
  • Missing data sources / Missing tables
  • Wrong API key of "connected" services
  • ...
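One way to express this distinction in code is to give the two kinds of failure different exception types and retry only one of them. The class and function names below (`RetryableError`, `NonRetryableError`, `run_with_retries`) are hypothetical; this is a sketch of the idea, not the API of any real library.

```python
class RetryableError(Exception):
    """Transient failure, e.g. node crash, communication error, service down."""


class NonRetryableError(Exception):
    """Permanent failure, e.g. SQL syntax error, missing table, wrong API key."""


def run_with_retries(operation, max_attempts=3):
    """Call `operation()` and retry it only when it raises RetryableError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # still failing after all attempts
        # NonRetryableError is not caught: it propagates at once, because
        # running the same broken query again cannot succeed.


# Usage: a query that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetryableError("communication error")
    return "result"


print(run_with_retries(flaky_query))
```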
Retry-able Operations ?
• For example:
  • Run Query A
  • Append result of A into B
  • Count rows of B
• Failures?
  • Run Query A
  • Append result of A into B ... (Failed!)
  • Retry Query A
  • Retry to append result of A into B
  • Count rows of B
[Figure: Table B, shown twice; labels 1 2 3 4, 1 2, 1 2 3 4]
Idempotent Operations
• "Idempotent" operation (in Japanese: 冪等; read "bekitou")
  • gives the "same" result when it's executed twice or more
• Idempotent Operation:
  • Run Query A
  • "Replace" table B with result of A
  • Count rows of B
[Figure: Table B, shown twice; labels 1 2 3 4 and 1 2]
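The difference between "append" and "replace" can be tried out with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the tables `a` and `b`, the two-row result and the helper names are invented for illustration, and the retry is simulated by simply running each step twice, as if the first run had been reported as failed after its rows were already written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER)")
conn.execute("CREATE TABLE b (id INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.commit()

QUERY_A = "SELECT id FROM a"  # stands in for "Query A"


def append_step(conn):
    """Non-idempotent: every run adds the rows of A to B again."""
    with conn:
        conn.execute(f"INSERT INTO b {QUERY_A}")


def replace_step(conn):
    """Idempotent: B is emptied and refilled inside one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM b")
        conn.execute(f"INSERT INTO b {QUERY_A}")


def count_b(conn):
    return conn.execute("SELECT COUNT(*) FROM b").fetchone()[0]


append_step(conn)
append_step(conn)         # the retry appends the same rows a second time
appended = count_b(conn)  # 4: the result of A is counted twice

replace_step(conn)
replace_step(conn)        # running it again changes nothing
replaced = count_b(conn)  # 2
print(appended, replaced)
```

Doing the delete and the insert in one transaction matters in this sketch: if the step fails halfway, B is rolled back to its previous contents instead of being left empty.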
Replay-able Data Analytics Workflow
• Need to do a lot of "trial-and-error"
  • w/ updated queries
  • w/ updated data...
• Idempotent operations make workflows "Replay-able"
  • Fast trial-and-error (PDCA!) cycles
  • → Fast business growth!
Enterprise Business ❤ Technologies
Thank you! @tagomoris