Introduction to Hive and HCatalog

Introduction to Hive and HCatalog presentation by Mark Grover at NYC HUG. A video of this presentation is available at https://www.youtube.com/watch?v=JGwhfr4qw5s

Transcript of Introduction to Hive and HCatalog

Page 1: Introduction to Hive and HCatalog

Introduction to Apache Hive (and HCatalog)  

Mark Grover github.com/markgrover/nyc-hug-hive

Page 2: Introduction to Hive and HCatalog

Me!

• Contributor to Apache Hive
• Section Author of O’Reilly’s Programming Hive book
• Software Developer at Cloudera
• @mark_grover
• [email protected]
• https://github.com/markgrover/nyc-hug-hive

Page 3: Introduction to Hive and HCatalog

Agenda

• What is Hive?
• Why use Hive?
• Hive features
• Hive architecture
• HCatalog
• Demo!

Page 4: Introduction to Hive and HCatalog

Preamble

• This is a remote talk
• Feel free to ask questions any time!

Page 5: Introduction to Hive and HCatalog

Agenda

• What is Hive?
• Why use Hive?
• Hive features
• Hive architecture
• HCatalog
• Demo!

Page 6: Introduction to Hive and HCatalog

Hive

• Data warehouse system for Hadoop
• Enables Extract/Transform/Load (ETL)
• Associate structure with a variety of data formats
  • Logical Table -> Physical Location
  • Logical Table -> Physical Data Format Handler (SerDe)
• Integrates with HDFS, HBase, MongoDB, etc.
• Query execution in MapReduce
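The two mappings on this slide (logical table to physical location, and logical table to SerDe) can be sketched in a few lines of Python. This is not Hive code: `TableDef`, `csv_serde`, and the `/warehouse/page_views` path are hypothetical names made up for illustration. The point is only that the raw data stays as it is and structure is applied when it is read.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def csv_serde(line: str, columns: List[str]) -> Dict[str, str]:
    """Hypothetical stand-in for a SerDe: deserialize one raw line into a row."""
    return dict(zip(columns, line.rstrip("\n").split(",")))


@dataclass
class TableDef:
    """Toy model of what a metastore records for a table (assumed, simplified)."""
    name: str
    columns: List[str]
    location: str                                       # logical table -> physical location
    serde: Callable[[str, List[str]], Dict[str, str]]   # logical table -> format handler

    def read(self, raw_lines):
        # Structure is applied at read time; the raw lines are left untouched.
        return [self.serde(line, self.columns) for line in raw_lines]


page_views = TableDef("page_views", ["user_id", "url"], "/warehouse/page_views", csv_serde)
rows = page_views.read(["42,/home\n", "7,/cart\n"])
```

Swapping `csv_serde` for another function changes how the same bytes are interpreted without moving or rewriting the data.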

Page 7: Introduction to Hive and HCatalog

Agenda

• What is Hive?
• Why use Hive?
• Hive features
• Hive architecture
• HCatalog
• Demo!

Page 8: Introduction to Hive and HCatalog

Why use Hive?

• MapReduce is geared towards developers
• Run SQL-like queries that get compiled and run as MapReduce jobs
• Data in Hadoop, even though generally unstructured, has some loose structure associated with it
• Benefits of MapReduce + HDFS (Hadoop)
  • Fault tolerant
  • Robust
  • Scalable
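To make "SQL-like queries that get compiled and run as MapReduce jobs" a little more concrete, here is a rough Python sketch of how a query such as `SELECT url, COUNT(*) FROM page_views GROUP BY url` could break down into map, shuffle, and reduce steps. The table name and the sample rows are assumptions for illustration, and Hive's real plans are considerably more involved.

```python
from collections import defaultdict

# Assumed sample data standing in for the rows of a table.
rows = [{"url": "/home"}, {"url": "/cart"}, {"url": "/home"}]


def map_phase(row):
    # Emit (group-by key, 1) for every input row.
    yield row["url"], 1


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # COUNT(*) for one group is the sum of the emitted ones.
    return key, sum(values)


pairs = [kv for row in rows for kv in map_phase(row)]
result = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```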

Page 9: Introduction to Hive and HCatalog

Agenda

• What is Hive?
• Why use Hive?
• Hive features
• Hive architecture
• HCatalog
• Demo!

Page 10: Introduction to Hive and HCatalog

Hive features

• Create table, create view, create index - DDL
• Select, where clause, group by, order by, joins
• Pluggable User Defined Functions - UDFs (e.g. from_unixtime)
• Pluggable User Defined Aggregate Functions - UDAFs (e.g. count, avg)
• Pluggable User Defined Table Generating Functions - UDTFs (e.g. explode)
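The difference between the three kinds of function is mostly about row cardinality. The Python sketch below mimics that shape only; the `*_like` functions are hypothetical stand-ins, not Hive's implementations, and the time conversion assumes UTC, whereas Hive's `from_unixtime` uses the session time zone.

```python
from datetime import datetime, timezone


def from_unixtime_like(seconds: int) -> str:
    """UDF shape: one value in, one value out (UTC assumed here)."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def avg_like(values):
    """UDAF shape: many rows in, a single value out."""
    values = list(values)
    return sum(values) / len(values)


def explode_like(array):
    """UDTF shape: one row in, zero or more rows out."""
    for item in array:
        yield item


formatted = from_unixtime_like(0)
average = avg_like([1, 2, 3])
exploded = list(explode_like(["a", "b"]))
```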

Page 11: Introduction to Hive and HCatalog

Hive features

• Pluggable custom Input/Output format
• Pluggable Serialization/Deserialization libraries (SerDes)
• Pluggable custom map and reduce scripts
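A custom map or reduce script is typically plugged in through Hive's `TRANSFORM ... USING` clause, which by default streams rows to the script's standard input as tab-separated lines and reads rows back from its standard output. The sketch below shows what such a script might look like. The transformation itself (upper-casing the second column) and the sample rows are made up for illustration, and the in-memory streams simulate what Hive would pipe through.

```python
import io
import sys


def transform(lines):
    """Upper-case the second column of each tab-separated row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            fields[1] = fields[1].upper()
        yield "\t".join(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # In a real script, this would be called with the default streams.
    for row in transform(stdin):
        stdout.write(row + "\n")


# Simulate the rows that would arrive on stdin (assumed sample data).
fake_input = io.StringIO("1\tabc\n2\tdef\n")
fake_output = io.StringIO()
main(fake_input, fake_output)
```

Saved as a standalone script, the file would end with a plain `main()` call instead of the simulated streams.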

Page 12: Introduction to Hive and HCatalog

What Hive does NOT support

• OLTP workloads - low latency
• Correlated subqueries
• Not super performant with small amounts of data
  • How much data do you need to call it “Big Data”?

Page 13: Introduction to Hive and HCatalog

Other Hive features

• Partitioning
• Sampling
• Bucketing
• Various join optimizations
• Integration with HBase and other storage handlers
• Views – Unmaterialized
• Complex data types – arrays, structs, maps
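Partitioning and bucketing are both ways of deciding where a row physically lands. A partitioned table stores each partition in a nested `key=value` directory, and a bucketed table assigns each row to one of a fixed number of buckets by hashing a column. The Python sketch below illustrates both ideas; the function names and the warehouse path are hypothetical, and `zlib.crc32` is only a deterministic stand-in for Hive's own hash function.

```python
import zlib


def partition_dir(table_location: str, partition: dict) -> str:
    """Build the nested key=value directory for one partition."""
    parts = "/".join(f"{key}={value}" for key, value in partition.items())
    return f"{table_location}/{parts}"


def bucket_for(value: str, num_buckets: int) -> int:
    """Pick a bucket by hashing the column value (crc32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


path = partition_dir("/warehouse/page_views", {"dt": "2013-01-01", "country": "US"})
bucket = bucket_for("user_42", 4)
```

A query that filters on a partition column only needs to read the matching directories, which is where most of the benefit comes from.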

Page 14: Introduction to Hive and HCatalog

Connecting to Hive

• Hive Shell
• JDBC driver
• ODBC driver
• Thrift client
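For the JDBC route, the connection string depends on which server is running: the original Hive server uses the `jdbc:hive://` scheme and Hive Server 2 uses `jdbc:hive2://`, both commonly on port 10000. The small Python helper below just assembles such a URL; the helper name and the host are hypothetical, and the defaults should be checked against the actual deployment.

```python
def hive_jdbc_url(host: str, port: int = 10000, database: str = "default",
                  server2: bool = True) -> str:
    """Assemble a Hive JDBC URL (port and database defaults are assumptions)."""
    scheme = "jdbc:hive2" if server2 else "jdbc:hive"
    return f"{scheme}://{host}:{port}/{database}"


url_v2 = hive_jdbc_url("hive.example.com")
url_v1 = hive_jdbc_url("hive.example.com", server2=False)
```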

Page 15: Introduction to Hive and HCatalog

Agenda

• What is Hive?
• Why use Hive?
• Hive features
• Hive architecture
• HCatalog
• Demo!

Page 16: Introduction to Hive and HCatalog

Hive metastore

• Backed by RDBMS
  • Derby, MySQL, PostgreSQL, etc. supported
• Default Embedded Derby
  • Not recommended for anything but a quick Proof of Concept
• 3 different modes of operation:
  • Embedded Derby (default)
  • Local
  • Remote
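Which of the three modes is in effect follows from a couple of `hive-site.xml` properties: `hive.metastore.uris` points clients at a remote metastore service, and `javax.jdo.option.ConnectionURL` names the backing database. The Python function below is a simplified heuristic for reading those two properties, not Hive's actual logic; the function name and the sample hosts are hypothetical.

```python
# Default connection URL for the embedded Derby database.
EMBEDDED_DERBY_URL = "jdbc:derby:;databaseName=metastore_db;create=true"


def metastore_mode(props: dict) -> str:
    """Guess the metastore mode from hive-site.xml style properties (heuristic)."""
    if props.get("hive.metastore.uris", "").strip():
        # Clients talk to a separate metastore service over Thrift.
        return "remote"
    url = props.get("javax.jdo.option.ConnectionURL", EMBEDDED_DERBY_URL)
    if url.startswith("jdbc:derby:") and not url.startswith("jdbc:derby://"):
        # In-process Derby database, one user at a time.
        return "embedded"
    # Metastore code runs in the client but uses a standalone database.
    return "local"


default_mode = metastore_mode({})
local_mode = metastore_mode(
    {"javax.jdo.option.ConnectionURL": "jdbc:mysql://db.example.com/metastore"})
remote_mode = metastore_mode(
    {"hive.metastore.uris": "thrift://metastore.example.com:9083"})
```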

Page 17: Introduction to Hive and HCatalog

Hive architecture

Page 18: Introduction to Hive and HCatalog

Hive Remote Mode

Page 19: Introduction to Hive and HCatalog

Hive server

Page 20: Introduction to Hive and HCatalog

Problems with Hive Server

• No sessions/concurrency
• Essentially need 1 server per client
• Security
• Auditing/Logging

Page 21: Introduction to Hive and HCatalog

Hive server 2

Page 22: Introduction to Hive and HCatalog

Hive architecture

• Compiler
  • Parser
  • Type checking
  • Semantic Analyzer
  • Plan Generation
  • Task Generation

Page 23: Introduction to Hive and HCatalog

Hive architecture

• Execution Engine
  • Plan
  • Operators
  • SerDes
  • UDFs/UDAFs/UDTFs
• Metastore
  • Stores schema of data
  • HCatalog

Page 24: Introduction to Hive and HCatalog

Architecture Summary

• Use remote metastore service for sharing the metastore with HCatalog and other tools
• Use Hive Server 2 for concurrent queries

Page 25: Introduction to Hive and HCatalog

Agenda

• What is Hive?
• Why use Hive?
• Hive features
• Hive architecture
• HCatalog
• Demo!

Page 26: Introduction to Hive and HCatalog

Hive Metastore Remote Mode

Page 27: Introduction to Hive and HCatalog

HCatalog

• Sub-component of Hive
• Table and storage management service
• Public APIs and webservice wrappers for accessing metadata in Hive metastore
• Metastore contains information of interest to other tools (Pig, MapReduce jobs)
• Expose that information as REST interface
• WebHCat: Web Server for engaging with the Hive metastore
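As a sketch of what the REST interface looks like from a client's point of view, the Python helper below builds the WebHCat URL for describing a table. WebHCat resources live under `/templeton/v1`, the server commonly listens on port 50111, and the caller is identified with the `user.name` parameter. The helper name, host, table, and user are hypothetical, and no request is actually sent here.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host: str, database: str, table: str, user: str,
                      port: int = 50111) -> str:
    """Build the WebHCat DDL URL for one table (port default is an assumption)."""
    path = (f"/templeton/v1/ddl/database/{quote(database, safe='')}"
            f"/table/{quote(table, safe='')}")
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}{path}?{query}"


url = webhcat_table_url("webhcat.example.com", "default", "page_views", "mark")
```

An HTTP GET on that URL would return the table's metadata as JSON, which is how tools outside Hive can read the metastore without linking against it.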

Page 28: Introduction to Hive and HCatalog

WebHCat

REST

Page 29: Introduction to Hive and HCatalog

Agenda

• What is Hive?
• Why use Hive?
• Hive features
• Hive architecture
• HCatalog
• Demo!

Page 30: Introduction to Hive and HCatalog

Applications of Hive

• Web Analytics
• Retail
• Healthcare
• Spam detection
• Data Mining
• Ad optimization
• ETL workloads

Page 31: Introduction to Hive and HCatalog

How to install Hive

• Download Apache Hive tarball
• Use Apache Bigtop packages
• Use a Demo VM

Page 32: Introduction to Hive and HCatalog

Want to learn more about Hive?

Page 33: Introduction to Hive and HCatalog

Contact info

@mark_grover
github.com/markgrover
linkedin.com/in/grovermark
[email protected]