Post on 23-Jan-2018
To Have Own Data Analytics Platform, Or NOT To
Aoyama Engineers' Study and Exchange Meetup (青山エンジニア勉強交流会), April 24, 2017
Satoshi Tagomori (@tagomoris)
Satoshi "Moris" Tagomori (@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, ...
Treasure Data, Inc.
http://tsuchinoko.dmmlabs.com/?p=1770
On Feb 23, 2015
• To Have Own Data Analytics Platform, Or NOT To,
In Startup Companies:
• "NOT To, in general"
• Data analytics services:
  • AWS EMR, Redshift
  • Google BigQuery
  • Treasure Data
Options In 2017
• On Premise
  • Cloudera CDH, Hortonworks HDP, ...
• Services
  • AWS EMR, Redshift, Athena, Kinesis Analytics, ...
  • Google BigQuery, Cloud Dataflow, Cloud Dataproc, ...
  • MS Azure SQL Data Warehouse, Stream Analytics, Data Lake Analytics, ...
  • Treasure Data
TO HAVE OR
NOT TO HAVE ?
DO NOT
😝
Anyway,
NO FINE CONCLUSION IN THIS PRESENTATION
On Premise Platform In Past
• 2011-2014: On-premise Hadoop & Presto cluster
  • w/ Fluentd stream processing cluster
  • w/ Norikra stream processing
  • w/ Web UI (Shib)
https://www.slideshare.net/tagomoris/lambda-architecture-using-sql-hadoopcon-2014-taiwan
To Be Considered
• Distributed Processing Platform
• Data Management
• Process Management
• Platform Management
• Visualization and BI
• Connecting Data
Distributed Processing Platform
• Hadoop, Presto, Spark, Flink, Storm, ...
  • + Servers
• EMR, Redshift, Dataproc, ...
  • Cost per instance
• BigQuery, Athena, Treasure Data, ...
  • Cost per data/queries/...
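The two billing models above can be compared with simple arithmetic. The following is a minimal sketch only: the node count, hourly rate, and per-terabyte rate are hypothetical numbers chosen for illustration, not prices of any real service.

```python
def instance_cost(nodes: int, hourly_rate: float, hours: float) -> float:
    """Cost of a cluster billed per instance-hour, whether busy or idle."""
    return nodes * hourly_rate * hours


def scan_cost(tb_scanned: float, rate_per_tb: float) -> float:
    """Cost of a service billed per terabyte scanned by queries."""
    return tb_scanned * rate_per_tb


def break_even_tb(nodes: int, hourly_rate: float, hours: float,
                  rate_per_tb: float) -> float:
    """Terabytes scanned at which both billing models cost the same."""
    return instance_cost(nodes, hourly_rate, hours) / rate_per_tb


# Hypothetical figures: 10 nodes at 0.5 per hour for a 720-hour month,
# versus a per-query service charging 5.0 per terabyte scanned.
monthly_cluster = instance_cost(nodes=10, hourly_rate=0.5, hours=720)
monthly_scan = scan_cost(tb_scanned=200, rate_per_tb=5.0)
threshold = break_even_tb(nodes=10, hourly_rate=0.5, hours=720, rate_per_tb=5.0)

print(monthly_cluster, monthly_scan, threshold)
```

Under these assumed numbers, the per-query model is cheaper until the monthly scan volume reaches the break-even threshold. Neither figure includes the engineering time spent operating a cluster.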
Data Management
• How to collect data?
• How to ingest data?
• How to manage schema?
• How to move data from here to there?
Process Management
• How to run queries on schedule?
• How to build workflow between queries?
• How to run queries after data ingestion?
• How to move data from the platform to elsewhere after queries?
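The questions above come down to running steps in dependency order. Below is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). The task names (`ingest`, `aggregate`, `export`) and their bodies are hypothetical placeholders, and a real platform would use a scheduler or workflow engine instead of this loop.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run each task after all of the tasks it depends on.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of prerequisite task names
    Returns the list of task names in the order they were executed.
    """
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Hypothetical pipeline: ingest data, run a query, export the result.
log = []
tasks = {
    "ingest": lambda: log.append("data ingested"),
    "aggregate": lambda: log.append("query finished"),
    "export": lambda: log.append("result exported"),
}
dependencies = {
    "aggregate": {"ingest"},   # run the query after data ingestion
    "export": {"aggregate"},   # move data elsewhere after the query
}

order = run_workflow(tasks, dependencies)
print(order)
```

This sketch leaves out scheduling, retries, and failure handling, which are the hard parts of process management.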
Platform Management
• How to upgrade software?
• How to add nodes?
• How to manage failures / downtime?
• How to replace hardware?
• How to switch platforms?
• How to provide compatibility for queries?
Visualization and BI
• How to show query results graphically?
• How to show relations between data graphically?
• How to query data interactively?
Connecting Data
• How to join logs and master data?
• How to join logs and user list?
• How to join logs and CRM data?
• How to push query results to marketing tools/services?
• How to send notifications using query results?
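Joining logs with a user list can be illustrated as a lookup by key. The following is a minimal in-memory sketch; the field names (`user_id`, `action`, `plan`) and the sample records are hypothetical. On a real platform this would be a SQL join across tables.

```python
def join_logs_with_users(logs, users):
    """Inner join: attach each user's plan to that user's log events.

    Events whose user_id is not in the user list are dropped.
    """
    users_by_id = {user["user_id"]: user for user in users}
    joined = []
    for event in logs:
        user = users_by_id.get(event["user_id"])
        if user is None:
            continue  # no matching user record
        joined.append({**event, "plan": user["plan"]})
    return joined


# Hypothetical sample data.
logs = [
    {"user_id": 1, "action": "login"},
    {"user_id": 2, "action": "purchase"},
    {"user_id": 9, "action": "login"},  # unknown user
]
users = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "paid"},
]

joined = join_logs_with_users(logs, users)
print(joined)
```

The lookup itself is simple. The difficulty the slide points at is getting the master data, user list, or CRM data into the same platform as the logs in the first place.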
Additional Topics
• Stream Processing Platform
• Machine Learning Platform
• AI(?) Services
In My Past Case:
• Distributed Processing Platform
  • Hadoop & Presto (& Norikra)
• Data Management
  • Hive schema & custom-made UI (Shib)
  • Managed by engineers of each service
• Process Management
  • Custom-made query scheduler (ShibUI)
• Platform Management
  • By tagomoris
• Visualization, BI: N/A
• Connecting Data: N/A
About Treasure Data
• Distributed Processing Platform: Hive, Presto
• Data Management: Fluentd & Schema-less DB
• Process Management: Digdag / Treasure Workflow
• Platform Management: Automatic
• Visualization and BI: Treasure BI
• Connecting Data: Embulk / Data Connector
😝
Recent Improvements around Data Analytics
• Improvements of CDH/HDP to manage clusters
  • Online Upgrade
  • Support many processing frameworks
• Many new data processing software/frameworks
  • Apache Flink, Apache Arrow, Apache Beam, ...
• Many new services available
  • Stream processing, Machine learning, ...
MONEY
• Saving money is important - it's true.
MONEY
• Saving money introduces many issues - it's true!
MONEY
• Money solves many problems - is it true?
Complexity
• Connecting data / processing with applications
• Connecting data / processing with services
• Connecting data / processing with people
Chasing the World
• Many new software / services / platforms / paradigms, day by day
• Data sizes are growing day by day
• Complexity is growing day by day
• A data platform CANNOT live as-is for 5 years!
Finding Treasure From Data
• "Data Processing" is:
  • NOT the purpose
  • just a tool to get something great
• Use developers and their time to find treasures!
TBD
Thank you! @tagomoris