HiveとImpalaのおいしいとこ取り

21
Copyright © CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / 1 1 HiveImpalaのおいしいとこ取り 『プログラミング Hive』『Hadoop 3版』 刊行記念 Hadoopセミナー 2013724 株式会社セラン R&D戦室 須幸憲

description

『Hadoop 第3版』『プログラミング Hive』刊行記念 Hadoopセミナー

Transcript of HiveとImpalaのおいしいとこ取り

  • 1. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / 1 1 HiveImpala HiveHadoop 3 Hadoop 2013724 R&D
  • 2. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / 2
  • 3. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / v@sudabon) v R&D v l19972004 NEC / l20052006 BIGLOBE / BtoB l2012/8 vHadoop1 vHadoopHive 3
  • 4. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / v MOBYLOG l Web l 200512 l PC l MOBYLOG ENGINE OmnitureAdobeSiteCatalystWebTrends WebTrends AnalyticsOEM v Xdata collect l l 20135 l Web CSV l 50 4
  • 5. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / vMOBYLOG l v lEC l l Hadoop 5
  • 6. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / v Hadoop v lAmazon EMR AliveTransient lHadoop Streaming S3EC2PHP Hadoop1 6
  • 7. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / vAWSR Hadoop Hadoop2 7 ClouderaHadoop CDH Cloudera ManagerGUI HiveJavaSQL MapReduce
  • 8. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / v lHive lHive vHive/MapReduce Cloudera Impala Hive745 Hadoop3 8
  • 9. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / Impala 9 2012/10/24 0.1 beta 2012/11/14 0.2 beta 2012/12/05 0.3 beta 2013/01/19 0.4 beta 2013/02/05 0.5 beta 2013/02/27 0.6 beta 2013/04/15 0.7 beta 2013/05/01 1.0 2013/06/18 1.0.1 2012/11/02 5 slideshare 2012/12/07 14 @IT slideshare 2013/03/06 14 slideshare 2013/05/01 14 slideshare HiBench RCFile mem_limitParquet
  • 10. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / Our System Environment 10 v Install using Cloudera Manager Free Edition 4.5.2 Master Slave 11 Servers All servers are connected with 1Gbps Ethernet through an L2 switch Active NameNode DataNode TaskTracker Impalad Stand-by NameNode JobTracker statestored 3 Servers DataNode TaskTracker Impalad DataNode TaskTracker Impalad DataNode TaskTracker Impalad DataNode TaskTracker Impalad DataNode TaskTracker Impalad DataNode TaskTracker Impalad DataNode TaskTracker Impalad DataNode TaskTracker Impalad DataNode TaskTracker Impalad DataNode TaskTracker Impalad
  • 11. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / vCPU lIntel Core 2 Duo 2.13 GHz with Hyper Threading vMemory l4GB vDisk l7,200 rpm SATA mechanical Hard Disk Drive * 1 vOS lCent OS 6.2 Our Server Specication 11
  • 12. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / v Use CDH4.2.1 + Impala 1.0 v Use hivebench in open-sourced benchmark tool HiBench l https://github.com/hibench v Modied datasets to 1/10 scale l Default conguration generates table with 1 billion rows v Modied query sentence l Deleted INSERT INTO TABLE to evaluate read-only performance v Combines a few storage format with a few compression method l TextFile, SequenceFile, RCFile, ParquestFile l No compression, Gzip, Snappy v Comparison with job query latency v Average job latency over 5 measurements Benchmark 12
  • 13. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / v Uservisits table l 100 million rows l 16,895 MB as TextFile l Table Denitions sourceIP string destURL string visitDate string adRevenue double userAgent string countryCode string languageCode string searchWord string duration int Modied Datasets 13 v Rankings table l 12 million rows l 744 MB as TextFile l Table Denitions pageURL string pageRank int avgDuration int
  • 14. Copyright CELLANT Corp. All Rights Reserved. h t t p : / / w w w . c e l l a n t . j p / SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank) FROM rankings_t R JOIN ( SELECT sourceIP, destURL, adRevenue FROM uservisits_t UV WHERE (datedi(UV.visitDate, '1999-01-01')>=0 AND datedi(UV.visitDate, '2000-01-01')