Effective Hive Queries

Effec%ve Hive Queries Secrets From the Pros

We will be star,ng at 11:03 PDT Use the Chat Pane in GoToWebinar to Ask Ques%ons!

Assess your level and learn new stuff This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts) ?

AGENDA This Webinar provides %ps on improving the performance and beJer u%lizing resources using the following best prac%ces:

•  Data Layout (Par%%ons and Buckets) •  Data Sampling (Bucket and Block sampling) •  Data Processing (Bucket Map Join and Parallel execu%on)

Dataset Used

# of records: 276M records Columns: I%nerary ID Year &Quarter of Travel Trip Origin City & State Trip Des%na%on City & State Distance between Origin & Des%na%on

Airline Bookings All Includes stops at intermediate ci%es

# of records: 116M records Columns: I%nerary ID Year &Quarter of Travel Trip Origin City & State Trip Des%na%on City & State Distance between Origin & Des%na%on

Airline Bookings Origin Only Only first leg of travel

# of records: 50 Columns: State code & Name Popula%on

Census Human popula%on by US State

#1 -‐ Data Par%%oning •  Problem PaJern

–  Query a subset of data in a table –  Subset iden%fied by “Column_Name = X” filter

•  Solu%on paJern –  Layout data in sub-‐directories with each directory associated with a value of the par%%on column

–  The filter on par%%on column just picks a single sub directory •  Approach

–  Use PARTITION BY clause •  Benefit

–  Par%%on pruning –  2.7x faster on a query on Airline Bookings Dataset (29 seconds)

#1 -‐ Data Par%%oning

Airline Bookings All Table

Origin State (Par%%on Column / Sub-‐directory) CA WY AL

File1001.dat

File1002.dat

File100n.dat

File3001.dat

File3002.dat

File300n.dat

Filex001.dat

Filex002.dat

Filex00n.dat

Files inside the par%%on

SELECT origin_city, origin_state FROM Airline_Bookings_All WHERE origin_state = ‘CA’

CREATE TABLE Airline_Bookings_All …. PARTITIONED BY (origin_state STRING)

#2 -‐ Data Bucke%ng

•  Problem PaJern –  Join data in two large tables efficiently –  Sample data inside a table efficiently

•  Solu%on paJern – More efficient processing by storing data in hash buckets

•  Approach •  Use bucke%ng using CLUSTERED BY .. INTO n BUCKETS

•  Benefit –  Bucket Map Join –  Bucket Sampling

#2 – Data Bucke%ng CREATE TABLE Airline_Bookings_All … CLUSTERED BY (i%nid) INTO 64 BUCKETS

set hive.enforce.bucke%ng = true; INSERT OVERWRITE TABLE Airline_Bookings_All SELECT … FROM ..

Ailrine_Bookings_All

File00.dat

File63.dat

File01.dat Each File contains all

the rows that correspond to the same hash of i%nid

column

#2 -‐ Data Bucke%ng

File1001.dat

File1002.dat

File100n.dat

Filex001.dat

Filex002.dat

Filex00m.dat

Files containing table data bucketed on a

column

set hive.op%mize.bucketmapjoin = true; SELECT /*+ MAPJOIN(a, b) */ a.*, b.* FROM Airline_Bookings_All a JOIN Airline_Bookings_Origin_Only b ON a.i%nid = b.i%nid

Note: 1.  Both the tables are bucketed on i%nid column 2.  The numbers of buckets in the two tables are a strict mul%ple of each other

#3 -‐ Bucket Sampling

•  Problem PaJern – Work on joinable samples of data from different tables

•  Solu%on paJern –  Use Bucket Sampling

•  Approach •  TABLESAMPLE (BUCKET x OUT OF Y ON column)

•  Benefit –  Useful while working with sample data and joins

#3 -‐ Bucket Sampling

Filex002.dat

Filex030.dat

Filex064.dat

Files containing bookings data bucketed on i%nid

SELECT a.*, b.* FROM Airline_Bookings_All TABLESAMPLE(bucket 30 out of 64 on i%nid) a , Airline_Bookings_Origin_Only TABLESAMPLE(bucket 30 out of 64 on i%nid) b WHERE a.i%nid = b.i%nid

Filex001.dat

Filex063.dat

Filey002.dat

Filey030.dat

Filey064.dat

Filey001.dat

Filey063.dat

#4 – Block Sampling •  Problem PaJern

–  View a sample of a data with in a table –  Sample size expressed as number of rows, %age of data, or number of MBs

•  Solu%on paJern –  Use Block sampling

•  Approach –  Use TABLESAMPLE (n%, nM, or n ROWS)

•  Benefit –  Geyng a random sample from the table –  More op%ons to specify how many samples to generate

#5 – Parallel Execu%on

SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct FROM (

SELECT itinid, year, quarter, origin, originstate FROM air_travel_bookings_8

)a JOIN (

SELECT itinid, origin, originstate FROM air_travel_origins_8

)B ON ( A.itinid = b.itinid and a.origin = b.origin and a.originstate = b.originstate) GROUP BY

a.year, a.quarter, a.origin, a.originstate;

Stage 1

Stage 2

Stage 3

Stage 1

Stage 2

Stage 3

Stage 1 Stage 2

Stage 3

set hive.exec.parallel = false;

set hive.exec.parallel = true;

Summary

•  Iterate quickly on Query Design – Use Bucket and Block Sampling

•  Run queries faster – Par%%oning to invoke Par%%on Pruning – Bucke%ng to invoke Bucket Map Joins – Execute complex queries in parallel

THANK YOU

Managed Cluster Built-‐In Connectors Friendly User-‐Interface Dedicated Support

•  100% Managed Hadoop Cluster in the Cloud •  Auto-‐Scaling Cluster. Full Life-‐cycle Management •  +12 Connectors to Applica%ons and Data Sources •  14-‐Day Free Trial (free account available) •  24/7 Customer Support

What’s Included?

è www.qubole.com/try ç

Effective Hive Queries

Technology

Transcript of Effective Hive Queries

Advanced Wordpress Queries

Tabelas Para Queries

Conversion of fuzzy queries into standard SQL queries ...

Hive Anatomy

Hive ppt (1)

Apresentaçao hive

More SQL: Complex Queries - Texas Southern Universitycs.tsu.edu/ghemri/cs346/classnotes/more_sql.pdf · More SQL: Complex Queries ... SELECT E.EMP_MGR, M.EMP_LNAME ... Outer queries

Componentes - Hive

Hive begins

DESARROLLO INDUSTRIAL Querétaro, Querétaro...HIVE Buenavista Digital- Brochure 2020_Low Author Joaquín Calderón Subject HIVE Buenavista Keywords HIVE, Hive, Buenavista, condominio,

Chương 3: QUERIES

Apache hive

Parameter Queries!

THE HIVE @ PARC

Hoofdstuk 8: Werken met SQL voor eindgebruikers. Overzicht Inleiding op SQL: de select-instructie Eenvoudige queries Join queries Nested queries Queries.

kintone hive amida_150525

Achieving 100k Queries per Hour on Hive on Tez

Triple Hive Stand

Supporting visual queries

1/19 DGFIndex for Smart Grid: Enhancing Hive with a Cost Effective Multi-dimensional Range Index Yue Liu, Songlin Hu*, Tilmann Rabl, Wantao Liu, Hans-Arno.