Post on 31-Oct-2014
description
Effec%ve Hive Queries Secrets From the Pros
We will be star,ng at 11:03 PDT Use the Chat Pane in GoToWebinar to Ask Ques%ons!
Assess your level and learn new stuff This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts) ?
AGENDA This Webinar provides %ps on improving the performance and beJer u%lizing resources using the following best prac%ces:
• Data Layout (Par%%ons and Buckets) • Data Sampling (Bucket and Block sampling) • Data Processing (Bucket Map Join and Parallel execu%on)
Dataset Used
# of records: 276M records Columns: I%nerary ID Year &Quarter of Travel Trip Origin City & State Trip Des%na%on City & State Distance between Origin & Des%na%on
Airline Bookings All Includes stops at intermediate ci%es
# of records: 116M records Columns: I%nerary ID Year &Quarter of Travel Trip Origin City & State Trip Des%na%on City & State Distance between Origin & Des%na%on
Airline Bookings Origin Only Only first leg of travel
# of records: 50 Columns: State code & Name Popula%on
Census Human popula%on by US State
#1 -‐ Data Par%%oning • Problem PaJern
– Query a subset of data in a table – Subset iden%fied by “Column_Name = X” filter
• Solu%on paJern – Layout data in sub-‐directories with each directory associated with a value of the par%%on column
– The filter on par%%on column just picks a single sub directory • Approach
– Use PARTITION BY clause • Benefit
– Par%%on pruning – 2.7x faster on a query on Airline Bookings Dataset (29 seconds)
#1 -‐ Data Par%%oning
Airline Bookings All Table
Origin State (Par%%on Column / Sub-‐directory) CA WY AL
File1001.dat
File1002.dat
File100n.dat
File3001.dat
File3002.dat
File300n.dat
Filex001.dat
Filex002.dat
Filex00n.dat
Files inside the par%%on
SELECT origin_city, origin_state FROM Airline_Bookings_All WHERE origin_state = ‘CA’
CREATE TABLE Airline_Bookings_All …. PARTITIONED BY (origin_state STRING)
#2 -‐ Data Bucke%ng
• Problem PaJern – Join data in two large tables efficiently – Sample data inside a table efficiently
• Solu%on paJern – More efficient processing by storing data in hash buckets
• Approach • Use bucke%ng using CLUSTERED BY .. INTO n BUCKETS
• Benefit – Bucket Map Join – Bucket Sampling
#2 – Data Bucke%ng CREATE TABLE Airline_Bookings_All … CLUSTERED BY (i%nid) INTO 64 BUCKETS
set hive.enforce.bucke%ng = true; INSERT OVERWRITE TABLE Airline_Bookings_All SELECT … FROM ..
Ailrine_Bookings_All
File00.dat
File63.dat
File01.dat Each File contains all
the rows that correspond to the same hash of i%nid
column
#2 -‐ Data Bucke%ng
a
File1001.dat
File1002.dat
File100n.dat
Filex001.dat
Filex002.dat
Filex00m.dat
Files containing table data bucketed on a
column
b
set hive.op%mize.bucketmapjoin = true; SELECT /*+ MAPJOIN(a, b) */ a.*, b.* FROM Airline_Bookings_All a JOIN Airline_Bookings_Origin_Only b ON a.i%nid = b.i%nid
Note: 1. Both the tables are bucketed on i%nid column 2. The numbers of buckets in the two tables are a strict mul%ple of each other
#3 -‐ Bucket Sampling
• Problem PaJern – Work on joinable samples of data from different tables
• Solu%on paJern – Use Bucket Sampling
• Approach • TABLESAMPLE (BUCKET x OUT OF Y ON column)
• Benefit – Useful while working with sample data and joins
#3 -‐ Bucket Sampling
Filex002.dat
Filex030.dat
Filex064.dat
Files containing bookings data bucketed on i%nid
a
SELECT a.*, b.* FROM Airline_Bookings_All TABLESAMPLE(bucket 30 out of 64 on i%nid) a , Airline_Bookings_Origin_Only TABLESAMPLE(bucket 30 out of 64 on i%nid) b WHERE a.i%nid = b.i%nid
Filex001.dat
Filex063.dat
Filey002.dat
Filey030.dat
Filey064.dat
b
Filey001.dat
Filey063.dat
#4 – Block Sampling • Problem PaJern
– View a sample of a data with in a table – Sample size expressed as number of rows, %age of data, or number of MBs
• Solu%on paJern – Use Block sampling
• Approach – Use TABLESAMPLE (n%, nM, or n ROWS)
• Benefit – Geyng a random sample from the table – More op%ons to specify how many samples to generate
#5 – Parallel Execu%on
SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct FROM (
SELECT itinid, year, quarter, origin, originstate FROM air_travel_bookings_8
)a JOIN (
SELECT itinid, origin, originstate FROM air_travel_origins_8
)B ON ( A.itinid = b.itinid and a.origin = b.origin and a.originstate = b.originstate) GROUP BY
a.year, a.quarter, a.origin, a.originstate;
Stage 1
Stage 2
Stage 3
Stage 1
Stage 2
Stage 3
Stage 1 Stage 2
Stage 3
set hive.exec.parallel = false;
set hive.exec.parallel = true;
Summary
• Iterate quickly on Query Design – Use Bucket and Block Sampling
• Run queries faster – Par%%oning to invoke Par%%on Pruning – Bucke%ng to invoke Bucket Map Joins – Execute complex queries in parallel
THANK YOU
Managed Cluster Built-‐In Connectors Friendly User-‐Interface Dedicated Support
• 100% Managed Hadoop Cluster in the Cloud • Auto-‐Scaling Cluster. Full Life-‐cycle Management • +12 Connectors to Applica%ons and Data Sources • 14-‐Day Free Trial (free account available) • 24/7 Customer Support
What’s Included?
è www.qubole.com/try ç