DataTalks #4: Building a data warehouse based on...
BUILDING A DATA WAREHOUSE WITH HADOOP
10.10.2015
IGOR NAKHVAT, DATA INTEGRATION ENGINEER
TABLE OF CONTENTS
I. Building a Data Warehouse with Hadoop
A. Data sources
B. Data storage
C. Data flow
D. ETL tool
E. Conclusions
DATA SOURCES: Games
DATA SOURCES: Services
• SPA Payment
• Forum
• eSport
• Clan wars
• Update
DATA SOURCES: Geography
DATA SOURCES: Relational databases
• Total: 294 (222 + 72)
• Tables, total: 1264
DATA SOURCES: Non-relational data sources
DATA STORAGE: Why Hadoop?
• Open architecture.
• Cost-effective.
• Many interfaces to the data (SQL, Spark, Java, Scala, Python).
• Many ways and formats to store the data.
• Many tools available for data analytics.
DATA STORAGE: Why Hadoop?
Keep in mind
• Shortage of qualified specialists
• Security
DATA STORAGE: Hadoop ecosystem
• Sqoop
• Pig
DATA STORAGE: How does HDFS work?
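The diagrams for this slide did not survive in the transcript. As a rough illustration only, HDFS splits each file into fixed-size blocks and stores several replicas of every block on different DataNodes. The sketch below assumes the Hadoop 2.x defaults (128 MB blocks, replication factor 3) and a simple round-robin placement; the function name `plan_blocks` and the node names are hypothetical, and real HDFS uses a rack-aware placement policy instead.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed: default dfs.blocksize in Hadoop 2.x
REPLICATION = 3                 # assumed: default dfs.replication


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, replica_nodes) for one file.

    Toy model: replicas are placed round-robin, which is NOT the real
    rack-aware HDFS placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # Every block is full-size except possibly the last one.
        length = min(block_size, file_size - i * block_size)
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan
```

For example, a 300 MB file on four DataNodes yields three blocks (128 MB, 128 MB, 44 MB), each held by three distinct nodes.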
DATA FLOW
Shell
SQL
CSV + GZIP
Check row count
Compute stats
Parquet
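One possible reading of the row count check in this flow: after a table is exported to a gzip-compressed CSV file, the number of rows in the file is compared with the count reported by the source before the data is converted further. The sketch below is an assumption about how such a check could look, not the speaker's actual scripts; the function names `export_rows`, `count_rows` and `verify_export` are hypothetical.

```python
import csv
import gzip


def export_rows(rows, path):
    """Write rows to a gzip-compressed CSV file; return the number written."""
    written = 0
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
            written += 1
    return written


def count_rows(path):
    """Count CSV records in a gzip-compressed file (no header row assumed)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.reader(f))


def verify_export(source_count, path):
    """Fail loudly if the file does not hold as many rows as the source reported."""
    exported = count_rows(path)
    if exported != source_count:
        raise ValueError(
            f"row count mismatch: source={source_count}, file={exported}"
        )
    return exported
```

Here `source_count` stands for whatever the source database reports (for instance the result of a `SELECT COUNT(*)`); using `csv.reader` rather than counting newlines keeps quoted multi-line fields from inflating the count.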
DATA FLOW
Parquet
Shell
DATA FLOW
Parquet
DATA FLOW
Shell
Aggregation
Presentation
Audience
Balance
Finance
Data scientist
Manager
ETL TOOL: Jenkins
Continuous integration tool + plugins = ETL tool
• 1000+ jobs
• 5-20 hours
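Using a continuous integration server as an ETL tool largely comes down to running jobs in dependency order, with each job triggering the ones downstream of it. A minimal sketch of that ordering, assuming a hypothetical set of job names (none of them come from the talk) and using Python's standard `graphlib` module (Python 3.9+) rather than anything Jenkins-specific:

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
JOBS = {
    "export_csv": set(),
    "check_row_count": {"export_csv"},
    "compute_stats": {"check_row_count"},
    "convert_to_parquet": {"check_row_count"},
    "build_aggregates": {"compute_stats", "convert_to_parquet"},
}


def run_order(jobs):
    """Return one valid execution order in which every job follows its upstream jobs."""
    return list(TopologicalSorter(jobs).static_order())
```

`run_order(JOBS)` always starts with `export_csv` and ends with `build_aggregates`; the relative order of the two middle jobs is not fixed, which is where a scheduler could run them in parallel.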
ETL TOOL: Apache NiFi
Apache NiFi
• Drag-and-drop works!
• Great visualization.
• Data provenance.
• Flow can be modified at runtime.
Apache NiFi
Keep in mind
• Multiuser development.
• No templates.
• NiFi is not an orchestration tool.
CONCLUSIONS
• Hadoop is good for data warehousing
• Hadoop security is poor
• Impala (SQL on Hadoop) performs and scales well
• The choice of data format is key (Avro, Parquet)
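Why the format matters can be shown with a toy model: Avro is a row-oriented format, while Parquet is columnar. The sketch below is not either file format, only plain Python lists imitating the two layouts; the field names and values are made up for illustration. It counts how many values have to be read to sum a single column.

```python
# Hypothetical records; field names and values are illustrative only.
RECORDS = [
    {"player_id": 1, "region": "EU", "battles": 120},
    {"player_id": 2, "region": "NA", "battles": 300},
    {"player_id": 3, "region": "ASIA", "battles": 45},
]


def to_row_layout(records):
    """Row-oriented: all fields of one record are stored together."""
    return [tuple(r.values()) for r in records]


def to_column_layout(records):
    """Columnar: all values of one field are stored together."""
    return {name: [r[name] for r in records] for name in records[0]}


def sum_from_rows(rows, field_index):
    """Sum one field; every whole record is read to reach that field."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[field_index]
    return total, values_read


def sum_from_columns(columns, name):
    """Sum one field; only that field's column is read."""
    column = columns[name]
    return sum(column), len(column)
```

In this model, summing `battles` reads 9 values from the row layout and 3 from the columnar one. That gap grows with table width, which is one reason columnar storage suits analytical queries, while row-oriented storage suits writing and reading whole records.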
TANKS A LOT!