Apache Hiveの今とこれから

53
© Hortonworks Inc. 2011 – 2015. All Rights Reserved Apache Hiveの今とこれから Yifeng Jiang Solutions Engineer, Hortonworks, inc. June 11, 2015

Transcript of Apache Hiveの今とこれから

1. Hortonworks Inc. 2011 2015. All Rights Reserved Apache Hive Yifeng Jiang Solutions Engineer, Hortonworks, inc. June 11, 2015 2. Hortonworks Inc. 2011 2015. All Rights Reserved (Yifeng Jiang) Solutions Engineer @ Hortonworks Japan HBase book author Twitter: @uprush 3. Hortonworks Inc. 2011 2015. All Rights Reserved Apache Hive 100 4. Hortonworks Inc. 2011 2015. All Rights Reserved : 20116: Yahoo! Hadoop 24 201412: 600Hadoop 4002/3F1000 Apache Project Committers PMC Members Hadoop 27 21 Pig 5 5 Hive 18 6 Tez 16 15 HBase 6 4 Phoenix 4 4 Accumulo 2 2 Storm 3 2 Slider 11 11 Falcon 5 3 Flume 1 1 Sqoop 1 1 Ambari 36 28 Oozie 3 2 Zookeeper 2 1 Knox 13 3 Ranger 11 n/a TOTAL 164 109 5. Hortonworks Inc. 2011 2015. All Rights Reserved 6. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Data Platform (HDP) 2.2 Stack Hive: SQL on Hadoop 7. Hortonworks Inc. 2011 2015. All Rights Reserved Apache Hive Transaction, Temp Table, Security, Performance 8. Hortonworks Inc. 2011 2015. All Rights Reserved Apache Hive Page 8Hive & HBase For Transaction Processing FacebookOSSHadoopSQL 2009Apache Hive SQLMapReduce ETL 9. Hortonworks Inc. 2011 2015. All Rights Reserved Hive 1SQL OLTP,ERP,CRMSystems Unstructureddocuments,emails Clickstream Serverlogs Sen>ment,WebData Sensor.MachineData Geoloca>on / Hive - SQL ETL 10. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 Hive Page 10 Hive at Facebook 100PB 15TBHive 6Hive 1Hive 11. Hortonworks Inc. 2011 2015. All Rights Reserved Insert, Update and Delete SQL Statements 12. Hortonworks Inc. 2011 2015. All Rights Reserved (YES) TPS OLTP (NO) Hive OLTP Hive Replication Analytics Modifications Hive High Concurrency OLTP 13. Hortonworks Inc. 2011 2015. All Rights Reserved Deep Dive: HiveACID Hive : Phase 1: Hive Streaming Ingest (append) Phase 2: INSERT / UPDATE / DELETE Support Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Done] [Done] [Next] Read- Optimized ORCFile Delta File Merged Read- Optimized ORCFile 1. Original File Task reads the latest ORCFile Task Read- Optimized ORCFile Task Task 2. Edits Made Task reads the ORCFile and merges the delta file with the edits 3. Edits Merged Task reads the updated ORCFile Hive ACID Compactor periodically merges the delta files in the background. 14. Hortonworks Inc. 2011 2015. All Rights Reserved Read- Optimized ORCFile Delta File Merged Read- Optimized ORCFile Read- Optimized ORCFile Delta File Delta File Delta File Minor Compaction 10% local Major Compaction 10% global 15. Hortonworks Inc. 2011 2015. All Rights Reserved Hive Users perspective 16. Hortonworks Inc. 2011 2015. All Rights Reserved Ranger: Apache Ranger Hadoop 17. Hortonworks Inc. 2011 2015. All Rights Reserved (Hive) 17 18. Hortonworks Inc. 2011 2015. All Rights Reserved 100 ORC, Tez, CBO, Vectorization 19. Hortonworks Inc. 2011 2015. All Rights Reserved Stinger: Hive100 2013220144 100%Apache SQLEngine Vectorized SQLEngine Columnar Storage ORCFile =100X+ + Distributed Execu>on ApacheTez 20. Hortonworks Inc. 2011 2015. All Rights Reserved TPC-DS Benchmark at 30 Terabyte Scale TPC-DS 50 30 terabyte 52 , 160 7.8 9.3 Hive 14 Cost-Based Optimizer 2.5 21. Hortonworks Inc. 2011 2015. All Rights Reserved ORC Columnar Storage for Hive 22. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 ORCFile Hive (min, max, sum, count) Page 22 23. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 ORCFile Map/reduce 24. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 ORCFile Page 24 create table Addresses ( name string, street string, city string, state string, zip int ) stored as orc tblproperties ("orc.compress"=ZLIB"); 25. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 ORCFile ORC ORC SQL 1ORC Page 25 -- Create Text & ORC tables CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE; CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC; -- Load into Text table LOAD DATA LOCAL INPATH '/home/user/test_details.csv' INTO TABLE test_details_txt; -- Copy to ORC table INSERT OVERWRITE INTO test_details_orc SELECT * FROM test_details_txt; 26. Hortonworks Inc. 2011 2015. All Rights Reserved Tez Beyond MapReduce 27. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 I/O Synchronization Barrier I/O Synchronization Barrier Job 1 ( Join a & b ) Job 3 ( Group by of c ) Job 2 (Group by of a Join b) Job 4 (Join of S & R ) Hive - MR MR vs. Tez Page 27 Single Job Hive - Tez Join a & b Group by of a Join b Group by of c Job 4 (Join of S & R ) 28. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 Tez Page 28 Hive on Tez, Pig on Tez, Cascading on Tez, MapReduce Petabytes YARN 29. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 Tez MapReduce 1MapReduceTez Page 29 set hive.execution.engine=tez; SELECT * FROM my_table; Hadoop 2Tez $ vi hive-site.xml hive.execution.engine=tez 30. Hortonworks Inc. 2011 2015. All Rights Reserved Cost Based Optimizer Making the SQL smarter 31. Hortonworks Inc. 2011 2015. All Rights Reserved Cost Based Optimizer in Hive Cost-Based Optimizer (CBO) Hive cost-based optimization? Join SQL Page 31 32. Hortonworks Inc. 2011 2015. All Rights Reserved Star Schema FACT TABLE Dimension Table Dimension Table Dimension Table Dimension Table 33. Hortonworks Inc. 2011 2015. All Rights Reserved CBO on Selected Queries - 17 store_sales store_returns catalog_sales items store date_dim d1 date_dim d2 date_dim d3 Filter: quarterFilter: quarterFilter: quarter Filter: dateFilter: dateFilter: date customer_sk ticket_number customer_sk Item_sk date_sk date_sk date_sk item_sk store_sk 34. Hortonworks Inc. 2011 2015. All Rights Reserved OLD: Left Deep Plan Reducer 3 Merge join 2 & 10 Map join 1 Map join 6 Map Join 7 Map Join 8 store Map Join 11 item Filter Group By Reduce Map 12 Table_scan Store_returns Map 6 Table_scan d2, filter Map 7 Table_scan d3, filter Reducer 4 Group_By Reduce Reducer 10 Merge join 12, 9 Map 9 Table_scan store_sales Map 1 Table_scan d1, filter Map 2 Table_scan catalog_sales Reducer 5 Limit B B B Map 11 Table_scan item Map 8 Table_scan store B Large Fact tables joined together without filters B 35. Hortonworks Inc. 2011 2015. All Rights Reserved NEW: Complex Bushy Plan Reducer 4 Merge join 3 & 8 Map join store Map join item Reduce Map 10 table_scan store Map 12 Table_scan item Map 3 Store_sales Map join Map 8 Store_returns Map join Reducer 5 Merge_Join Group_By Reduce Map 11 catalog_sales, Map Join Map 9 Table_scan d1, filter Map 1 Table_scan d1, filter Map 2 Table_scan d1, filter Reducer 6 Group by Reduce Reducer7 Limit B B B B B All 3 Large Fact tables joined with date dimension limiting data to few quarters 36. Hortonworks Inc. 2011 2015. All Rights Reserved Query 17 Scale = 30TB Input records ~18600 CBO Elapsed Time (sec) Elapsed Time Intermediate data (GB) Output and Intermediate Records OFF 10,683 ~3 hrs 5,017 135,647,792,123 ON 1,284 ~20 mins 275 8,543,232,360 37. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 CBO CBO Page 37 set hive.cbo.enable=true; set hive.compute.query.using.stats=true; set hive.stats.fetch.column.stats=true; set hive.stats.fetch.partition.stats=true; ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS; 38. Hortonworks Inc. 2011 2015. All Rights Reserved Vectorized Query Execution Process 1024 Rows at a Time 39. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 Vectorization SQL : 1024 : CPU Page 39 40. Hortonworks Inc. 2011 2015. All Rights Reserved Hortonworks Inc. 2015 Vectorization SQL Page 40 set hive.vectorized.execution.enabled = true set hive.vectorized.execution.reduce.enabled = true; ORC 41. Hortonworks Inc. 2011 2015. All Rights Reserved Hive: HiveSQL Petebyte Scalei Hadoop SQL SQLi 90 Hortonworks SQL Hive-on-Tez Speedi HortonworksCustomerSupportmetricsasofFeb/2015 42. Hortonworks Inc. 2011 2015. All Rights Reserved Solving Hives Top Performance Challenges 43. Hortonworks Inc. 2011 2015. All Rights Reserved : Stinger.next and Sub-Second SQL LLAPHBase 44. Hortonworks Inc. 2011 2015. All Rights Reserved ApacheHive:ModernArchitectureStorage ColumnarStorage ORCFile Parquet UnstructuredData JSON CSV Text Avro Custom Weblog Engine SQLEngines RowEngine VectorEngine SQL SQLSupport SQL:2011 Op>mizer HCatalog HiveServer2 Cache BlockCache LinuxCache Distributed Execu>on Hadoop1 MapReduce Hadoop2 Tez Historical Current InDevelopment Legend 45. Hortonworks Inc. 2011 2015. All Rights Reserved ApacheHive:ModernArchitectureStorage ColumnarStorage ORCFile Parquet UnstructuredData JSON CSV Text Avro Custom Weblog Engine SQLEngines RowEngine VectorEngine SQL SQLSupport SQL:2011 Op>mizer HCatalog HiveServer2 Cache BlockCache LinuxCache Distributed Execu>on Hadoop1 MapReduce Hadoop2 Tez VectorCache LLAP PersistentServer Historical Current InDevelopment Legend 46. Hortonworks Inc. 2011 2015. All Rights Reserved HBase : Page 46Hive & HBase For Transaction Processing 47. Hortonworks Inc. 2011 2015. All Rights Reserved LLAP Page 47 Node LLAP Process HDFS Query Fragment LLAP In-Memory columnar cache LLAP process running read task for a query LLAPTez Node Hive Query Node NodeNode Node LLAP LLAP LLAP LLAP LLAP = Live Long And Process 48. Hortonworks Inc. 2011 2015. All Rights Reserved LLAP: Page 48 LLAP LLAP HDFS SQL YARN Node LLAP Process Query Fragment LLAP In- Memory columnar cache LLAP process running a task for a query HDFS 49. Hortonworks Inc. 2011 2015. All Rights Reserved Hive = Sub-Second Hive Metadata Fast,Scalable Metadata Catalog Persistent Server LLAP + + SQLEngine Vectorized HashJoin Choiceof Execu>on Engines Tez + 50. Hortonworks Inc. 2011 2015. All Rights Reserved Hive Present and Future 51. Hortonworks Inc. 2011 2015. All Rights Reserved Hive HiveSQL on Hadoop SQL ETLSQLHive Hive SQL:2011 Analytics 52. Hortonworks Inc. 2011 2015. All Rights Reserved Hive Hive Hive on Tez ORC CBO SQL SQL Hive 53. Hortonworks Inc. 2011 2015. All Rights Reserved Thank you Yifeng Jiang, Solutions Engineer, Hortonworks @uprush