Hive SQL Compilation Process

Hive SQL Compilation Process [email protected] Monday, 30 December, 13

Transcript of Hive SQL Compilation Process

Page 1: Hive SQL Compilation Process

Hive SQL Compilation Process

[email protected]

Monday, 30 December, 13

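Page 2: Hive SQL Compilation Process

Contents

1. How MapReduce implements Join, Group By, and Distinct

2. Converting SQL into MapReduce

(1) Antlr and the AST tree

(2) QueryBlock, the basic building block of a SQL statement

(3) Operator, the logical operator

(4) The logical optimizer

(5) Converting the OperatorTree into MapReduce jobs

(6) The physical optimizer and how MapJoin works

3. How to read a Hive execution plan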

Pages 3-6: Hive SQL Compilation Process

Join

select u.name, o.orderid from order o join user u on o.uid = u.uid;

Table user

uid  name
1    apple
2    orange

Table order

uid  orderid
1    1001
1    1002
2    1003

Map: each mapper emits the join key uid as the key and tags every value with its source table (1 = user, 2 = order).

Map output for user

key  value
1    <1,apple>
2    <1,orange>

Map output for order

key  value
1    <2,1001>
1    <2,1002>
2    <2,1003>

Shuffle/Sort: records with the same key are sent to the same reducer.

Reducer input for key 1

key  value
1    <1,apple>
1    <2,1001>
1    <2,1002>

Reducer input for key 2

key  value
2    <1,orange>
2    <2,1003>

Reduce: each reducer joins the user rows with the order rows that share its key.

name   orderid
apple  1001
apple  1002

name    orderid
orange  1003
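The Map, Shuffle/Sort, and Reduce steps of the join can be simulated in a few lines of Python. This is only an illustrative sketch of a reduce-side join, not Hive's implementation; the function names (`map_user`, `map_order`, `shuffle`, `reduce_join`) are made up for this example, and the tags 1 and 2 mirror the values shown on the slide.

```python
from collections import defaultdict

def map_user(rows):
    # Key by uid; tag 1 marks values that come from the user table.
    return [(uid, (1, name)) for uid, name in rows]

def map_order(rows):
    # Key by uid; tag 2 marks values that come from the order table.
    return [(uid, (2, orderid)) for uid, orderid in rows]

def shuffle(pairs):
    # Group values by key; sorting puts tag-1 values before tag-2 values.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}

def reduce_join(groups):
    # For each key, pair every user row with every order row.
    out = []
    for values in groups.values():
        names = [v for tag, v in values if tag == 1]
        orderids = [v for tag, v in values if tag == 2]
        out.extend((name, oid) for name in names for oid in orderids)
    return out

users = [(1, "apple"), (2, "orange")]
orders = [(1, 1001), (1, 1002), (2, 1003)]
result = reduce_join(shuffle(map_user(users) + map_order(orders)))
print(result)  # [('apple', 1001), ('apple', 1002), ('orange', 1003)]
```

The tag is what lets a single reducer tell the two inputs apart after they have been mixed together by the shuffle.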

Pages 7-10: Hive SQL Compilation Process

Group By

select rank, isonline, count(*) from city group by rank, isonline;

Table city, first input split

rank  isonline
A     1
A     1

Table city, second input split

rank  isonline
A     1
B     0

Map: the group-by fields form the key, and each mapper pre-aggregates the count for its own split.

Map output, first split

key     value
<A, 1>  2

Map output, second split

key     value
<A, 1>  1
<B, 0>  1

Shuffle/Sort

Reducer input for <A, 1>

key     value
<A, 1>  2
<A, 1>  1

Reducer input for <B, 0>

key     value
<B, 0>  1

Reduce: each reducer sums the partial counts for its key.

rank  isonline  value
A     1         3

rank  isonline  value
B     0         1
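The same Group By flow, with map-side pre-aggregation followed by a final sum in the reducer, can be sketched as follows. The helper names `map_side_aggregate` and `reduce_counts` are hypothetical; the sketch uses `collections.Counter` in place of Hive's hash aggregation.

```python
from collections import Counter

def map_side_aggregate(split):
    # Each mapper counts rows per (rank, isonline) within its own split.
    return Counter((rank, isonline) for rank, isonline in split)

def reduce_counts(map_outputs):
    # The reducer adds up the partial counts emitted by all mappers.
    total = Counter()
    for partial in map_outputs:
        total.update(partial)
    return dict(total)

split1 = [("A", 1), ("A", 1)]
split2 = [("A", 1), ("B", 0)]
partials = [map_side_aggregate(split1), map_side_aggregate(split2)]
result = reduce_counts(partials)
print(result)  # {('A', 1): 3, ('B', 0): 1}
```

Pre-aggregating in the mapper means that only one record per distinct key leaves each split, which is why the first split ships `<A, 1> 2` rather than two separate records.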

Pages 11-14: Hive SQL Compilation Process

Distinct

select dealid, count(distinct uid) num from order group by dealid;

Table order, first input split

uid  dealid
1    1001
2    1002
2    1001

Table order, second input split

uid  dealid
1    1002
1    1002
2    1001

Map: the key combines the group-by field and the distinct field, <dealid, uid>, while only dealid is used as the partition key, so all records for one dealid reach the same reducer.

Map output, first split

key        value  partitionKey
<1001, 1>  1      1001
<1002, 2>  1      1002
<1001, 2>  1      1001

Map output, second split

key        value  partitionKey
<1002, 1>  2      1002
<1001, 2>  1      1001

Shuffle/Sort

Reducer input for dealid 1001

key        value
<1001, 1>  1
<1001, 2>  1
<1001, 2>  1

Reducer input for dealid 1002

key        value
<1002, 1>  2
<1002, 2>  1

Reduce: because the keys arrive sorted, each reducer counts the distinct uid values for its dealid.

dealid  num
1001    2

dealid  num
1002    2
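Because the shuffle delivers the composite keys in sorted order, the reducer can count distinct values without holding a set in memory: it only compares each key with the previous one. A minimal sketch of that idea, assuming the sort is done by Python's `sorted` and using a hypothetical helper name `count_distinct_sorted`:

```python
def count_distinct_sorted(keys):
    # keys are (dealid, uid) pairs; sorting stands in for the shuffle sort.
    counts = {}
    last = None
    for dealid, uid in sorted(keys):
        if (dealid, uid) != last:
            # A new (dealid, uid) pair means a uid not yet seen for this dealid.
            counts[dealid] = counts.get(dealid, 0) + 1
            last = (dealid, uid)
    return counts

# Input rows are (uid, dealid), as in the slide.
split1 = [(1, 1001), (2, 1002), (2, 1001)]
split2 = [(1, 1002), (1, 1002), (2, 1001)]
keys = [(dealid, uid) for uid, dealid in split1 + split2]
result = count_distinct_sorted(keys)
print(result)  # {1001: 2, 1002: 2}
```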

Pages 15-17: Hive SQL Compilation Process

Distinct

select dealid, count(distinct uid), count(distinct date) from order group by dealid;

Table order

uid  dealid  date
1    1001    1101
2    1001    1101
2    1001    1102

Map: the key combines the group-by field with both distinct fields, <dealid, uid, date>.

key              value  partitionKey
<1001,1,1101>    1      1001
<1001,2,1101>    1      1001
<1001,2,1102>    1      1001

With this key layout, the Reduce phase has to deduplicate uid and date separately in memory.

Pages 18-20: Hive SQL Compilation Process

Distinct

select dealid, count(distinct uid), count(distinct date) from order group by dealid;

Table order

uid  dealid  date
1    1001    1101
2    1001    1101
2    1001    1102

Map: each input row is expanded into one record per distinct column. The key is <dealid, tag, value>, where tag 0 carries a uid and tag 1 carries a date.

key              value  partitionKey
<1001,0,1>       1      1001
<1001,1,1101>    1      1001
<1001,0,2>       1      1001
<1001,1,1101>    1      1001
<1001,0,2>       1      1001
<1001,1,1102>    1      1001

With this key layout, the Reduce phase only needs to keep track of lastDealid, lastTag, lastuid, and lastDate.
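The tagged-key layout can be sketched in Python as follows. This is an illustration under assumptions rather than Hive's code: `map_tagged` and `reduce_tagged` are hypothetical names, and a single `last` tuple plays the role of lastDealid, lastTag, lastuid, and lastDate.

```python
def map_tagged(rows):
    # Expand each (uid, dealid, date) row into one record per distinct column.
    out = []
    for uid, dealid, date in rows:
        out.append((dealid, 0, uid))   # tag 0: uid
        out.append((dealid, 1, date))  # tag 1: date
    return out

def reduce_tagged(keys, n_distinct=2):
    # Sorted input lets us detect a new distinct value by comparing
    # each key with the previous one; no per-column set is needed.
    result = {}
    last = None
    for dealid, tag, value in sorted(keys):
        counts = result.setdefault(dealid, [0] * n_distinct)
        if (dealid, tag, value) != last:
            counts[tag] += 1
            last = (dealid, tag, value)
    return result

rows = [(1, 1001, 1101), (2, 1001, 1101), (2, 1001, 1102)]
result = reduce_tagged(map_tagged(rows))
print(result)  # {1001: [2, 2]}
```

The cost of this layout is more shuffled records (one per distinct column per row); the benefit is constant reducer memory.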


Pages 22-23: Hive SQL Compilation Process

Compile Workflow

HiveQL -> Parser -> ASTTree

ASTTree -> SemanticAnalyzer -> QB

QB -> Logical Plan Gen -> OperatorTree

OperatorTree -> Logical Optimizer -> OperatorTree

OperatorTree -> Physical Plan Gen -> TaskTree

TaskTree -> Physical Optimizer -> TaskTree


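Page 25: Hive SQL Compilation Process

Antlr

• Antlr is a language recognition tool.

• It can be used to build domain-specific languages.

• You only need to write a grammar file that defines the lexical and syntactic rewrite rules; Antlr then handles lexical analysis, syntax analysis, semantic analysis, intermediate code generation, and so on.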

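Page 26: Hive SQL Compilation Process

AST Tree

To process an expression further, for example to evaluate it, Antlr offers two options. The first is to embed actions, that is, code fragments, directly in the grammar file. The second is to use Antlr's abstract syntax tree grammar: during parsing, the user input is converted into an intermediate representation, the abstract syntax tree, and the computation is then carried out while traversing that tree.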
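The second option, building a tree first and computing while walking it, can be illustrated without Antlr. The sketch below swaps in Python's standard `ast` module as the parser, which is an assumption made purely for illustration; Hive itself uses an Antlr grammar. The `evaluate` function is a hypothetical helper that handles only numeric constants and the four basic arithmetic operators.

```python
import ast
import operator

# Map AST operator node types to the functions that implement them.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def evaluate(node):
    # Compute the value of an expression by walking its syntax tree.
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError(f"unsupported node: {type(node).__name__}")

tree = ast.parse("1 + 2 * 3", mode="eval")
value = evaluate(tree)
print(value)  # 7
```

Separating parsing from evaluation in this way is what allows later stages to inspect and rewrite the tree before anything is executed.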

Pages 27-36: Hive SQL Compilation Process

Example SQL [the query and the AST diagrams on these slides are images and are not included in the transcript]

Sub Query

1.1 From => AST

1.2 Select => AST

1.3 Where => AST


Pages 38-39: Hive SQL Compilation Process

QueryBlock

• QueryBlock: the basic building block of a SQL statement. It has three parts: the input source, the computation, and the output.

• Generating QueryBlocks from the AST tree means finding every basic unit in the abstract syntax tree, along with the relationships between those units. One QB object is created per basic unit, and the unit's individual operations become attributes of that QB object.

Pages 40-46: Hive SQL Compilation Process

QueryBlock

Annotations on the slide diagram [the diagram itself is not included in the transcript]:

• Mapping between table names and aliases

• Subqueries

• QBExpr is meant to express the relationships between QBs, but so far only Union is implemented

• Join ASTTree

• key='insclause-i', value=ASTTree

• Records the source data of the tables

Pages 47-56: Hive SQL Compilation Process

AST Tree => QB

Preorder traversal of the AST tree: SemanticAnalyzer#doPhase1

1. TOK_QUERY > create a QB object, then recurse into the child nodes

2. TOK_FROM > QB#aliasToTabs.put(alias, tabname); QB#aliases.put(alias, tabname); QBParseInfo#aliasToSrc.put(alias.toLowerCase(), ast);

3. TOK_INSERT > recurse into the child nodes

4. TOK_DESTINATION > QBParseInfo#nameToDest.put("insclause-i", astnode)

5. TOK_SELECT > QBParseInfo#destToSelExpr.put("insclause-i", astnode); destToAggregationExprs.put("insclause-i", astnode); destToDistinctFuncExprs.put("insclause-i", astnode);

6. TOK_WHERE > QBParseInfo#destToWhereExpr.put("insclause-i", ast);

Result: QB1 \ QB2 (QB2 is the child of QB1)
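The preorder walk that fills in a QB can be sketched with a toy tree. Everything here is a simplified stand-in: `Node`, `new_qb`, and `do_phase1` are hypothetical names, the QB is a plain dictionary rather than Hive's QB and QBParseInfo classes, and nested TOK_QUERY nodes (which would create a child QB) are not handled.

```python
class Node:
    """A toy AST node: a token name, an optional payload, and child nodes."""
    def __init__(self, token, payload=None, children=()):
        self.token = token
        self.payload = payload
        self.children = list(children)

def new_qb():
    return {"aliasToTabs": {}, "nameToDest": {},
            "destToSelExpr": {}, "destToWhereExpr": {}}

def do_phase1(node, qb, dest="insclause-0"):
    # Preorder: record what the current node contributes, then visit children.
    if node.token == "TOK_TABREF":
        tabname, alias = node.payload
        qb["aliasToTabs"][alias.lower()] = tabname
    elif node.token == "TOK_DESTINATION":
        qb["nameToDest"][dest] = node
    elif node.token == "TOK_SELECT":
        qb["destToSelExpr"][dest] = node
    elif node.token == "TOK_WHERE":
        qb["destToWhereExpr"][dest] = node
    for child in node.children:
        do_phase1(child, qb, dest)
    return qb

ast_root = Node("TOK_QUERY", children=[
    Node("TOK_FROM", children=[Node("TOK_TABREF", payload=("dim.user", "du"))]),
    Node("TOK_INSERT", children=[
        Node("TOK_DESTINATION"),
        Node("TOK_SELECT"),
        Node("TOK_WHERE"),
    ]),
])
qb = do_phase1(ast_root, new_qb())
print(qb["aliasToTabs"])  # {'du': 'dim.user'}
```

The QB ends up holding references to the relevant AST subtrees keyed by clause name, so later stages can look up the select or where expression without walking the whole tree again.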


Page 58: Hive SQL Compilation Process

Operator

• A logical operator performs one specific function in the Map phase or the Reduce phase.

• Common Operators include TableScanOperator, SelectOperator, FilterOperator, JoinOperator, GroupByOperator, and ReduceSinkOperator.

• Each Map or Reduce phase consists of one OperatorTree.

• Computation is streaming: as soon as an Operator finishes processing a row, it passes the row on to its childOperator.

• Some Operators are terminal operators (TerminalOperator) that mark the end of a Map or Reduce phase. For example, FileSinkOperator writes data to a file and marks the end of the current phase.

• ReduceSinkOperator can appear only in the Map phase. It serializes a combination of map-side fields into the reduce key/value and the partition key.
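The streaming, row-at-a-time behaviour of an operator tree can be sketched as a chain of objects that each process a row and forward it to their children. The classes below (`Operator`, `FilterOp`, `SelectOp`, `CollectSink`) are hypothetical and far simpler than Hive's operators; rows are plain dictionaries.

```python
class Operator:
    def __init__(self):
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return child  # returning the child allows chaining

    def forward(self, row):
        # Hand the row to every child operator.
        for child in self.children:
            child.process(row)

    def process(self, row):
        self.forward(row)

class FilterOp(Operator):
    def __init__(self, predicate):
        super().__init__()
        self.predicate = predicate

    def process(self, row):
        if self.predicate(row):
            self.forward(row)

class SelectOp(Operator):
    def __init__(self, columns):
        super().__init__()
        self.columns = columns

    def process(self, row):
        self.forward({c: row[c] for c in self.columns})

class CollectSink(Operator):
    """Terminal operator: collects rows instead of forwarding them."""
    def __init__(self):
        super().__init__()
        self.rows = []

    def process(self, row):
        self.rows.append(row)

scan = Operator()  # stands in for a table scan
sink = CollectSink()
scan.add_child(FilterOp(lambda r: r["uid"] == 1)) \
    .add_child(SelectOp(["orderid"])) \
    .add_child(sink)

for row in [{"uid": 1, "orderid": 1001},
            {"uid": 1, "orderid": 1002},
            {"uid": 2, "orderid": 1003}]:
    scan.process(row)

print(sink.rows)  # [{'orderid': 1001}, {'orderid': 1002}]
```

No operator buffers the whole input here; each row flows through the chain before the next one is read.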

Page 59: Hive SQL Compilation Process

Operator

• RowSchema describes the Operator's output fields.

• InputObjInspector and outputObjInspector parse the input and output fields.

• After a row has been processed by an Operator, Hive renumbers its fields. The LogicalOptimizer uses colExprMap to trace field names back.

• All parameters an Operator needs at run time are stored in its OperatorDesc. The OperatorDesc is serialized to HDFS before the job is submitted, then read from HDFS and deserialized before the MR task runs.

• The Map-phase OperatorTree is stored on HDFS at Job.getConf("hive.exec.plan") + "/map.xml".

Page 60: Hive SQL Compilation Process

QB => Operator Tree

SemanticAnalyzer#genBodyPlan

Inorder traversal of the QBs: SemanticAnalyzer#genPlan(QB qb)

SemanticAnalyzer#genPlan

1. QB#aliasToSubq => recursive call to genPlan()

2. QB#aliasToTabs => TableScanOperator

3. QBParseInfo#joinExpr => QBJoinTree => ReduceSinkOperator + JoinOperator

4. QBParseInfo#destToWhereExpr => FilterOperator

5. QBParseInfo#destToGroupby => ReduceSinkOperator + GroupByOperator

6. QBParseInfo#destToOrderby => ReduceSinkOperator + ExtractOperator

7. ...

Page 61: Hive SQL Compilation Process

QB2 : aliasToTabs => TableScanOperator

QB#aliasToTabs {du=dim.user, c=detail.usersequence_client, p=fact.orderpayment}

TableScanOperator("dim.user") TS[0]

TableScanOperator("detail.usersequence_client") TS[1]

TableScanOperator("fact.orderpayment") TS[2]

Pages 62-65: Hive SQL Compilation Process

QBJoinTree

QB2 : QBParseInfo#joinExpr => QBJoinTree

A preorder traversal of joinExpr generates the QBJoinTree.

Step 1:

    p
   / \
  c   p

Step 2:

      base
      /  \
     p    du
    / \
   c   p

Pages 66-73: Hive SQL Compilation Process

QB2 : QBJoinTree => RS + JOIN

Preorder traversal of the QBJoinTree

TS=TableScanOperator RS=ReduceSinkOperator JOIN=JoinOperator

QBJoinTree:

      base
      /  \
     p    du
    / \
   c   p

The operators are generated step by step: first TS[c] and TS[p], then RS[3] and RS[4] below them, then JOIN[5]; next TS[du], then RS[6] and RS[7], and finally JOIN[8].

Resulting OperatorTree:

TS[c]   TS[p]
  |       |
RS[3]   RS[4]
   \     /
   JOIN[5]    TS[du]
      |         |
    RS[6]     RS[7]
       \      /
       JOIN[8]
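The way each join in the QBJoinTree turns into two ReduceSinkOperators feeding one JoinOperator can be sketched with a small recursive function. This is a simplification under stated assumptions: the join tree is written as nested tuples, `gen_join_plan` is a hypothetical name, operator ids start at 3 because the three TableScanOperators already hold ids 0 to 2, and the recursion order is chosen for clarity rather than taken from Hive.

```python
import itertools

def gen_join_plan(tree, counter, edges):
    """Return the name of the operator that produces the rows of `tree`."""
    if isinstance(tree, str):
        return f"TS[{tree}]"  # a leaf is a table alias
    left, right = tree
    left_op = gen_join_plan(left, counter, edges)
    right_op = gen_join_plan(right, counter, edges)
    # Each join input gets its own RS; both RS operators feed the JOIN.
    rs_left = f"RS[{next(counter)}]"
    rs_right = f"RS[{next(counter)}]"
    join = f"JOIN[{next(counter)}]"
    edges += [(left_op, rs_left), (right_op, rs_right),
              (rs_left, join), (rs_right, join)]
    return join

counter = itertools.count(3)
edges = []
root = gen_join_plan((("c", "p"), "du"), counter, edges)
print(root)  # JOIN[8]
for parent, child in edges:
    print(parent, "->", child)
```

Running it reproduces the numbering on the slides: RS[3] and RS[4] feed JOIN[5], whose output goes through RS[6] while TS[du] goes through RS[7], and both meet in JOIN[8].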

Pages 74-77: Hive SQL Compilation Process

QB2 : genBodyPlan

QBParseInfo#destToWhereExpr > FilterOperator

FIL=FilterOperator SEL=SelectOperator

FIL[9] and then SEL[10] are appended below JOIN[8]:

TS[c]   TS[p]
  |       |
RS[3]   RS[4]
   \     /
   JOIN[5]    TS[du]
      |         |
    RS[6]     RS[7]
       \      /
       JOIN[8]
          |
        FIL[9]
          |
       SEL[10]

Pages 78-82: Hive SQL Compilation Process

QB1 : genBodyPlan

QBParseInfo#destToGroupby > ReduceSinkOperator + GroupByOperator

GBY=GroupByOperator

SEL[11], GBY[12], RS[13], and GBY[14] are appended below SEL[10], one step at a time. GBY[12] aggregates in hash mode (HashMode AGGR).

TS[c]   TS[p]
  |       |
RS[3]   RS[4]
   \     /
   JOIN[5]    TS[du]
      |         |
    RS[6]     RS[7]
       \      /
       JOIN[8]
          |
        FIL[9]
          |
       SEL[10]
          |
       SEL[11]
          |
       GBY[12]
          |
       RS[13]
          |
       GBY[14]

Page 83: Hive SQL Compilation Process

QB1 : genPostGroupByBodyPlan

FS=FileSinkOperator

QB2:

TS[c]   TS[p]
  |       |
RS[3]   RS[4]
   \     /
   JOIN[5]    TS[du]
      |         |
    RS[6]     RS[7]
       \      /
       JOIN[8]
          |
        FIL[9]
          |
       SEL[10]

QB1:

SEL[11] -> GBY[12] -> RS[13] -> GBY[14] -> SEL[15] -> SEL[16] -> FS[17]


Pages 85-87: Hive SQL Compilation Process

Logical Optimizer

The logical optimizers transform the OperatorTree. The number in front of an optimizer indicates which goal it serves:

1) Have each job do as much work as possible / merge jobs

2) Reduce the amount of shuffled data, or even skip the Reduce phase entirely

Name: Purpose

2) PredicatePushDown: predicate pushdown

ColumnPruner: column pruning

2) GroupByOptimizer: map-side aggregation

1) ReduceSinkDeDuplication: merges reduces that share the same partition/sort key in a linear OperatorTree

1) CorrelationOptimizer: exploits correlations within a query to merge correlated jobs, HIVE-2206

2) SimpleFetchOptimizer: optimizes aggregate queries that have no GroupBy expression

2) MapJoinProcessor: MapJoin; a hint must be provided

2) BucketMapJoinOptimizer: BucketMapJoin

Page 88: Hive SQL Compilation Process

PredicatePushDown

Predicates are evaluated earlier.

QB2 before:

TS[c]   TS[p]
  |       |
RS[3]   RS[4]
   \     /
   JOIN[5]    TS[du]
      |         |
    RS[6]     RS[7]
       \      /
       JOIN[8]
          |
        FIL[9]
          |
       SEL[10]

QB2 after (FIL[9] is replaced by FIL[18] directly below TS[p]):

         TS[p]
           |
TS[c]   FIL[18]
  |        |
RS[3]    RS[4]
   \      /
   JOIN[5]    TS[du]
      |         |
    RS[6]     RS[7]
       \      /
       JOIN[8]
          |
       SEL[10]

Page 89: Hive SQL Compilation Process

NonBlockingOpDeDupProc

Merges SEL-SEL or FIL-FIL pairs into a single Operator.

QB1 before:

SEL[11] -> GBY[12] -> RS[13] -> GBY[14] -> SEL[15] -> SEL[16] -> FS[17]

QB1 after:

GBY[12] -> RS[13] -> GBY[14] -> SEL[15] -> FS[17]
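Collapsing adjacent operators of the same kind can be sketched on a linear chain. The function `merge_adjacent` is hypothetical and deliberately naive: it simply drops the second operator of a SEL-SEL or FIL-FIL pair, whereas a real optimizer would also have to combine the two operators' expressions. The chain below assumes SEL[10] from QB2 directly precedes SEL[11].

```python
def merge_adjacent(chain, mergeable=("SEL", "FIL")):
    # chain is a list of (kind, id) pairs in top-to-bottom order.
    merged = []
    for kind, ident in chain:
        if merged and merged[-1][0] == kind and kind in mergeable:
            continue  # fold this operator into the one just above it
        merged.append((kind, ident))
    return merged

chain = [("SEL", 10), ("SEL", 11), ("GBY", 12), ("RS", 13),
         ("GBY", 14), ("SEL", 15), ("SEL", 16), ("FS", 17)]
result = merge_adjacent(chain)
print(" -> ".join(f"{kind}[{ident}]" for kind, ident in result))
# SEL[10] -> GBY[12] -> RS[13] -> GBY[14] -> SEL[15] -> FS[17]
```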

Pages 90-96: Hive SQL Compilation Process

ReduceSinkDeDuplication

Merges two RS operators that are connected in a linear chain.

from (select key, value from src group by key, value) s select s.key group by s.key;

Before the optimization (two jobs):

Stage-1: TS -> SEL -> GBY -> RS -> GBY -> SEL -> GBY -> FS

Stage-2: TS -> RS -> GBY -> SEL -> FS

The two RS operators, pRS (parent RS) and cRS (child RS):

      key         partitionKey
pRS   key,value   key,value
cRS   key         key

Conditions for merging:

• The pRS key fully contains the cRS key, and the sort order is the same

• The pRS partition key fully contains the cRS partition key

• Also to be checked: whether the two jobs have the same number of reducers (numReduce)

After the optimization (one job):

TS -> SEL -> GBY -> RS -> GBY -> SEL -> GBY -> FS

The merged RS uses key : key, value and partitionkey : key
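The merge conditions can be written down as a small check. This is a sketch under assumptions, not Hive's logic: each RS is a plain dictionary, "fully contains with the same sort order" is interpreted here as "the child's key columns are a prefix of the parent's", and the reducer-count check is left out. `can_merge` and `merge` are hypothetical names.

```python
def can_merge(p_rs, c_rs):
    p_keys, c_keys = p_rs["key"], c_rs["key"]
    # Assumption: same sort order means the child key is a prefix of the parent key.
    keys_ok = p_keys[:len(c_keys)] == c_keys
    # The parent partition columns must cover the child partition columns.
    parts_ok = set(c_rs["partition"]) <= set(p_rs["partition"])
    return keys_ok and parts_ok

def merge(p_rs, c_rs):
    # Sort on the wider parent key, partition on the narrower child key.
    return {"key": p_rs["key"], "partition": c_rs["partition"]}

p_rs = {"key": ["key", "value"], "partition": ["key", "value"]}
c_rs = {"key": ["key"], "partition": ["key"]}

if can_merge(p_rs, c_rs):
    merged = merge(p_rs, c_rs)
    print(merged)  # {'key': ['key', 'value'], 'partition': ['key']}
```

Partitioning on `key` alone still sends every (key, value) group to a single reducer, and sorting on (key, value) keeps rows with equal `key` adjacent, so one shuffle can serve both group-bys.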


Page 98: The Hive SQL Compilation Process

MapReduceCompiler

• Generate a MoveTask for the output table
• Starting from one of the root nodes of the OperatorTree, traverse downward depth-first
• A ReduceSinkOperator marks the boundary between Map and Reduce, and the boundary between jobs
• Traverse the remaining root nodes; on reaching a JoinOperator, merge the MapReduceTasks
• Generate a StatTask to update the metadata
• Cut the Operator links between Map and Reduce
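The depth-first traversal can be sketched on the QB2 operator tree used on the following slides. This is a simplified illustration, not Hive's walker: `CHILDREN` and `walk` are hypothetical names, and skipping operators that were already visited is an assumption made so that later roots stop at the join that an earlier root already passed through.

```python
# Child links of the QB2 operator tree (operator names as on the slides).
CHILDREN = {
    "TS[p]": ["FIL[18]"], "FIL[18]": ["RS[4]"], "RS[4]": ["JOIN[5]"],
    "TS[c]": ["RS[3]"], "RS[3]": ["JOIN[5]"],
    "JOIN[5]": ["RS[6]"], "RS[6]": ["JOIN[8]"],
    "TS[du]": ["RS[7]"], "RS[7]": ["JOIN[8]"],
    "JOIN[8]": ["SEL[10]"], "SEL[10]": [],
}


def walk(to_walk, children):
    """Depth-first walk from each root; returns a copy of opStack per visit."""
    visited = set()
    visits = []

    def visit(node, op_stack):
        op_stack.append(node)
        visited.add(node)
        visits.append(list(op_stack))      # this is where a rule would fire
        for child in children[node]:
            if child not in visited:       # assumption: visit each operator once
                visit(child, op_stack)
        op_stack.pop()

    for root in to_walk:
        visit(root, [])                    # opStack is cleared for every root
    return visits


visits = walk(["TS[p]", "TS[du]", "TS[c]"], CHILDREN)
for op_stack in visits:
    print(op_stack)
```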

Page 99: The Hive SQL Compilation Process

R0 gen MoveTask & Fetch Task

QB1: GBY[12] | RS[13] | GBY[14] | SEL[15] | FS[17]

MapredLockWork[Stage-0]

Stage-0 Move Operator

Pages 100-102: The Hive SQL Compilation Process

Begin Walk

QB2 operator tree:
TS[c] | RS[3] | JOIN[5]
TS[p] | FIL[18] | RS[4] | JOIN[5]
JOIN[5] | RS[6] | JOIN[8]
TS[du] | RS[7] | JOIN[8]
JOIN[8] | SEL[10]

Before the walk: opStack {}, toWalk[] {TS[c], TS[du], TS[p]}

The walk starts at TS[p]: opStack {TS[p]}, toWalk[] {TS[c], TS[du]}

Pages 103-106: The Hive SQL Compilation Process

R1 GenMRTableScan1

toWalk[] {TS[du], TS[c]}, opStack {TS[p]}

Rule match: "".join([t + "%" for t in opStack]) matches "TS%"

QB2 operator tree:
TS[c] | RS[3] | JOIN[5]
TS[p] | FIL[18] | RS[4] | JOIN[5]
JOIN[5] | RS[6] | JOIN[8]
TS[du] | RS[7] | JOIN[8]
JOIN[8] | SEL[10]

Result: Stage-1 MapRedTask
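The rule matching on these slides can be sketched as regular-expression matching against the operator names on opStack. The rule names and patterns are the ones shown on the slides (R1 to R4); the selection policy below, where a pattern must fully match a suffix of the stack string and the shortest match wins, is an assumption of this sketch and not taken from the slides.

```python
import re

# Rule patterns as shown on the R1-R4 slides.
RULES = {
    "R1 GenMRTableScan1": "TS%",
    "R2 GenMRRedSink1": "TS%.*RS%",
    "R3 GenMRRedSink2": "RS%.*RS%",
    "R4 GenMRFileSink1": "FS%",
}


def stack_string(op_stack):
    """Same expression as on the slides: every operator name followed by '%'."""
    return "".join(name + "%" for name in op_stack)


def match_cost(pattern, op_stack):
    """Length of the shortest stack suffix the pattern fully matches, else -1."""
    for start in range(len(op_stack) - 1, -1, -1):
        suffix = stack_string(op_stack[start:])
        if re.fullmatch(pattern, suffix):
            return len(suffix)
    return -1


def pick_rule(op_stack):
    """Return the matching rule with the lowest cost, or None if none match."""
    best_name, best_cost = None, None
    for name, pattern in RULES.items():
        cost = match_cost(pattern, op_stack)
        if cost >= 0 and (best_cost is None or cost < best_cost):
            best_name, best_cost = name, cost
    return best_name


print(pick_rule(["TS"]))
print(pick_rule(["TS", "FIL", "RS"]))
print(pick_rule(["TS", "FIL", "RS", "JOIN", "RS"]))
```

With this policy the stack {TS, FIL, RS, JOIN, RS} selects R3 rather than R2, because the suffix RS%JOIN%RS% is shorter than the whole stack string.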

Pages 107-110: The Hive SQL Compilation Process

R2 GenMRRedSink1

toWalk[] {TS[du], TS[c]}, opStack {TS[p], FIL[18], RS[4]}

Rule match: "".join([t + "%" for t in opStack]) matches "TS%.*RS%"

QB2 operator tree:
TS[c] | RS[3] | JOIN[5]
TS[p] | FIL[18] | RS[4] | JOIN[5]
JOIN[5] | RS[6] | JOIN[8]
TS[du] | RS[7] | JOIN[8]
JOIN[8] | SEL[10]

Result: the tree is now divided at RS[4] into a Stage-1 MapTask and a Stage-1 ReduceTask.

Pages 111-116: The Hive SQL Compilation Process

R3 GenMRRedSink2

toWalk[] {TS[du], TS[c]}, opStack {TS[p], FIL[18], RS[4], JOIN[5], RS[6]}

Rule match: "".join([t + "%" for t in opStack]) matches "RS%.*RS%"

QB2 operator tree, with the Stage-1 MapTask, the Stage-1 ReduceTask, and a new Stage-2:
TS[c] | RS[3] | JOIN[5]
TS[p] | FIL[18] | RS[4] | JOIN[5]
JOIN[5] | RS[6] | JOIN[8]
TS[du] | RS[7] | JOIN[8]
JOIN[8] | SEL[10]

splitPlan:
MR[Stage-1]: TS[p] | FIL[18] | RS[4] | JOIN[5] | FS[19]
MR[Stage-2]: TS[20] | RS[6] | JOIN[8] | SEL[10]

The intermediate data is materialized: it is stored in temporary files on HDFS.
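The effect of splitPlan on a single branch can be sketched with plain lists. `split_plan` is a hypothetical helper: it cuts the chain just before the second RS, ends the upstream job with a new FS that writes the temporary file, and starts the downstream job with a new TS that reads it back. The operator ids 19 and 20 are the ones shown on the slide and are passed in rather than generated.

```python
def split_plan(chain, rs_name, fs_id, ts_id):
    """Cut a linear operator chain just before rs_name.

    The upstream job ends in a new FS (writes the temp file); the
    downstream job starts with a new TS (reads that temp file).
    """
    cut = chain.index(rs_name)
    upstream = chain[:cut] + [f"FS[{fs_id}]"]
    downstream = [f"TS[{ts_id}]"] + chain[cut:]
    return upstream, downstream


chain = ["TS[p]", "FIL[18]", "RS[4]", "JOIN[5]", "RS[6]", "JOIN[8]", "SEL[10]"]
stage1, stage2 = split_plan(chain, "RS[6]", fs_id=19, ts_id=20)
print("MR[Stage-1]:", " | ".join(stage1))
print("MR[Stage-2]:", " | ".join(stage2))
```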

Pages 117-121: The Hive SQL Compilation Process

R3 GenMRRedSink2

toWalk[] {TS[du], TS[c]}, opStack {TS[p], FIL[18], RS[4], JOIN[5], RS[6], JOIN[8], SEL[10], GBY[12], RS[13]}

Rule match: "".join([t + "%" for t in opStack]) matches "RS%.*RS%"

Stage-2 before the split:
TS[20] | RS[6] | JOIN[8] | SEL[10] | GBY[12] | RS[13] | GBY[14] | SEL[15] | FS[17]

splitPlan:
MR[Stage-2]: TS[20] | RS[6] | JOIN[8] | SEL[10] | GBY[12] | FS[21]
MR[Stage-3]: TS[22] | RS[13] | GBY[14] | SEL[15] | FS[17]

Pages 122-124: The Hive SQL Compilation Process

R4 GenMRFileSink1

toWalk[] {TS[du], TS[c]}, opStack {TS[p], FIL[18], RS[4], JOIN[5], RS[6], JOIN[8], SEL[10], GBY[12], RS[13], GBY[14], SEL[15], FS[17]}

Rule match: "".join([t + "%" for t in opStack]) matches "FS%"

Tasks before: MR[Stage-1] | MR[Stage-2] | MR[Stage-3], and separately MoveWork[Stage-0]

Tasks after: MR[Stage-1] | MR[Stage-2] | MR[Stage-3] | MoveWork[Stage-0] | StatsWork[Stage-4]

Pages 125-126: The Hive SQL Compilation Process

Begin Walk

opStack.clear()

opStack {}, toWalk[] {TS[c], TS[du]}

Next branch: TS[du] | RS[7] | JOIN[8] | SEL[10] | GBY[12] | FS[21]

Pages 127-129: The Hive SQL Compilation Process

R1 GenMRTableScan1

toWalk[] {TS[c]}, opStack {TS[du]}

Rule match: "".join([t + "%" for t in opStack]) matches "TS%"

Branch: TS[du] | RS[7] | JOIN[8] | SEL[10] | GBY[12] | FS[21]

Result: Stage-5 MapTask

Pages 130-134: The Hive SQL Compilation Process

R2 GenMRRedSink1

toWalk[] {TS[c]}, opStack {TS[du], RS[7]}

Rule match: "".join([t + "%" for t in opStack]) matches "TS%.*RS%"

Branch: TS[du] | RS[7] | JOIN[8] | SEL[10] | GBY[12] | FS[21]

Result: Stage-5 MapTask and Stage-5 ReduceTask

merge map work: Stage-5 MapTask + MR[Stage-2] (TS[20] | RS[6] | JOIN[8] | SEL[10]) gives

MR[Stage-2]:
TS[20] | RS[6] | JOIN[8]
TS[du] | RS[7] | JOIN[8]
JOIN[8] | SEL[10] | GBY[12] | FS[21]
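The "merge map work" step can be sketched as attaching a newly walked map-side branch to the task that already owns the join it feeds. The dictionaries and the `merge_map_work` helper are hypothetical; the sketch only shows the bookkeeping, with the operator names taken from the slide.

```python
def merge_map_work(tasks, join_owner, map_chain, join_name):
    """Add a map-side branch to the task whose reduce side holds join_name."""
    stage = join_owner[join_name]          # task that already contains the join
    tasks[stage]["map"].append(map_chain)  # the branch becomes a second map input
    return stage


# MR[Stage-2] as left behind by splitPlan: one map branch, reduce side with JOIN[8].
tasks = {
    "Stage-2": {
        "map": [["TS[20]", "RS[6]"]],
        "reduce": ["JOIN[8]", "SEL[10]", "GBY[12]", "FS[21]"],
    }
}
join_owner = {"JOIN[8]": "Stage-2"}

# The branch walked from TS[du] feeds JOIN[8], so it is merged into Stage-2.
merged_into = merge_map_work(tasks, join_owner, ["TS[du]", "RS[7]"], "JOIN[8]")
print(merged_into, tasks["Stage-2"]["map"])
```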

Pages 135-136: The Hive SQL Compilation Process

Begin Walk

opStack.clear()

opStack {}, toWalk[] {TS[c]}

Next branch: TS[c] | RS[3] | JOIN[5] | FS[19]

Page 137: The Hive SQL Compilation Process

R1 GenMRTableScan1

toWalk[] {}, opStack {TS[c]}

Rule match: "".join([t + "%" for t in opStack]) matches "TS%"

Branch: TS[c] | RS[3] | JOIN[5] | FS[19]

Result: Stage-6 MapRedTask

Page 138: The Hive SQL Compilation Process

R2 GenMRRedSink1

toWalk[] {}, opStack {TS[c], RS[3]}

Rule match: "".join([t + "%" for t in opStack]) matches "TS%.*RS%"

Branch: TS[c] | RS[3] | JOIN[5] | FS[19]

Result: Stage-6 MapRedTask with Stage-6 MapWork and Stage-6 RedWork

merge map work: Stage-6 + MR[Stage-1] (TS[p] | FIL[18] | RS[4] | JOIN[5] | FS[19]) gives

MR[Stage-1]:
TS[c] | RS[3] | JOIN[5]
TS[p] | FIL[18] | RS[4] | JOIN[5]
JOIN[5] | FS[19]

Pages 139-141: The Hive SQL Compilation Process

breakTaskTree

Before:

MR[Stage-1]:
TS[c] | RS[3] | JOIN[5]
TS[p] | FIL[18] | RS[4] | JOIN[5]
JOIN[5] | FS[19]

MR[Stage-2]:
TS[20] | RS[6] | JOIN[8]
TS[du] | RS[7] | JOIN[8]
JOIN[8] | SEL[10] | GBY[12] | FS[21]

MR[Stage-3]:
TS[22] | RS[13] | GBY[14] | SEL[15] | FS[17]

After (each task is cut after its RS operators):

MR[Stage-1]
map: TS[c] | RS[3] and TS[p] | FIL[18] | RS[4]
reduce: JOIN[5] | FS[19]

MR[Stage-2]
map: TS[20] | RS[6] and TS[du] | RS[7]
reduce: JOIN[8] | SEL[10] | GBY[12] | FS[21]

MR[Stage-3]
map: TS[22] | RS[13]
reduce: GBY[14] | SEL[15] | FS[17]
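Cutting a task after its RS operators can be sketched as follows. `break_task_tree` is a hypothetical helper that treats each root-to-leaf branch as a list and splits it after the first RS; the shared reduce-side tail is kept only once.

```python
def break_task_tree(branches):
    """Split each branch after its first RS into a map part and a reduce part."""
    map_side, reduce_side = [], []
    for branch in branches:
        rs_pos = next(
            (i for i, op in enumerate(branch) if op.startswith("RS[")), None
        )
        if rs_pos is None:              # map-only branch: nothing to cut
            map_side.append(branch)
            continue
        map_side.append(branch[: rs_pos + 1])
        tail = branch[rs_pos + 1:]
        if tail and tail not in reduce_side:
            reduce_side.append(tail)    # branches meeting at a join share one tail
    return map_side, reduce_side


# MR[Stage-1] from the slide, written as its two root-to-leaf branches.
stage1 = [
    ["TS[c]", "RS[3]", "JOIN[5]", "FS[19]"],
    ["TS[p]", "FIL[18]", "RS[4]", "JOIN[5]", "FS[19]"],
]
map_side, reduce_side = break_task_tree(stage1)
print("map:", map_side)
print("reduce:", reduce_side)
```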

Pages 142-146: The Hive SQL Compilation Process

Logical Plan => Physical Plan

Logical plan:
TS[c] | RS[3] | JOIN[5]
TS[p] | FIL[18] | RS[4] | JOIN[5]
JOIN[5] | RS[6] | JOIN[8]
TS[du] | RS[7] | JOIN[8]
JOIN[8] | SEL[10] | GBY[12] | RS[13] | GBY[14] | SEL[15] | FS[17]

Physical plan:

MR[Stage-1]
map: TS[c] | RS[3] and TS[p] | FIL[18] | RS[4]
reduce: JOIN[5] | FS[19]

MR[Stage-2]
map: TS[20] | RS[6] and TS[du] | RS[7]
reduce: JOIN[8] | SEL[10] | GBY[12] | FS[21]

MR[Stage-3]
map: TS[22] | RS[13]
reduce: GBY[14] | SEL[15] | FS[17]

Task chain: MR[Stage-1] (JOIN[5]) | MR[Stage-2] (JOIN[8], GBY[12]) | MR[Stage-3] (GBY[14]) | MoveWork[Stage-0] | StatsWork[Stage-4]

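Page 147: The Hive SQL Compilation Process

Contents

1. How MapReduce implements Join, Group By, and Distinct
2. How SQL is translated into MapReduce
(1) Antlr && ASTTree
(2) QueryBlock, the basic building block of SQL
(3) Operator, the logical operator
(4) Logical optimizer
(5) Translating the OperatorTree into MapReduce jobs
(6) Physical optimizer
3. The Hive execution plan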

Page 148: The Hive SQL Compilation Process

Physical Optimizer

Name: purpose

CommonJoinResolver + MapJoinResolver: MapJoin
SortMergeJoinResolver: works together with buckets, similar to merge sort
SamplingOptimizer: parallel order by
Vectorizer: HIVE-4160

Pages 149-154: The Hive SQL Compilation Process

MapJoin

MapReduce Local Task: reads the Small Table Data, builds HashTable Files, and uploads the files to the Distributed Cache (DC).

MapJoin Task: every Mapper loads the HashTable Files from the Distributed Cache and joins them against the records of the Big Table Data (Record, Record, Record, ...).
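The two phases can be sketched in memory with the user/order tables from the Join example at the start of the deck. This is a plain hash join for illustration, with hypothetical function names; it ignores the file format of the hash table and the distributed cache entirely.

```python
from collections import defaultdict


def build_hash_table(small_rows, key):
    """Local-task phase: index the small table by the join key."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table


def map_join(big_rows, hash_table, key):
    """Mapper phase: stream big-table records and probe the hash table."""
    for row in big_rows:
        for match in hash_table.get(row[key], []):
            yield {**match, **row}


users = [{"uid": 1, "name": "apple"}, {"uid": 2, "name": "orange"}]   # small table
orders = [                                                            # big table
    {"uid": 1, "orderid": 1001},
    {"uid": 1, "orderid": 1002},
    {"uid": 2, "orderid": 1003},
]

hash_table = build_hash_table(users, "uid")
joined = [(r["name"], r["orderid"]) for r in map_join(orders, hash_table, "uid")]
print(joined)
```

No shuffle or reduce phase is involved, which is why the converted task appears on later slides as a map-only job.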

Pages 155-160: The Hive SQL Compilation Process

CommonJoinResolver

Task A | Conditional Task | MapJoin LocalTask | MapJoinTask | Task C
Task A | Conditional Task | CommonJoinTask | Task C

MapJoin LocalTask: Memory Bound
CommonJoinTask: Run as a Backup Task

Page 161: The Hive SQL Compilation Process

CommonJoinResolver

Task chain: MR[Stage-1] (JOIN[5]) | MR[Stage-2] (JOIN[8], GBY[12]) | MR[Stage-3] (GBY[14]) | MoveWork[Stage-0] | StatsWork[Stage-4]

• Traverse the Task Tree depth-first
• Find each JoinOperator and compare the data sizes of its left and right tables
• Small table + big table => MapJoinTask
• Small/big table + intermediate table => ConditionalTask
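The compile-time decision in the last two bullets can be sketched as follows. `plan_join` and the `small_table_bytes` threshold are hypothetical, an input size of `None` stands for an intermediate table whose size is not known at compile time, and the fallback to a common join for two large tables is an assumption that the slide does not list.

```python
from typing import Optional


def plan_join(left_bytes: Optional[int], right_bytes: Optional[int],
              small_table_bytes: int) -> str:
    """Pick a task type for one JoinOperator from its input sizes."""
    if left_bytes is None or right_bytes is None:
        # An intermediate table is involved: defer the choice to run time.
        return "ConditionalTask"
    if min(left_bytes, right_bytes) <= small_table_bytes:
        return "MapJoinTask"
    return "CommonJoinTask"  # assumption: neither input is small enough


print(plan_join(10_000, 5_000_000_000, small_table_bytes=25_000_000))
print(plan_join(None, 10_000, small_table_bytes=25_000_000))
```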

Pages 162-165: The Hive SQL Compilation Process

CommonJoinResolver

MR[Stage-2]
map: TS[20] | RS[6] and TS[du] | RS[7]
reduce: JOIN[8] | SEL[10] | GBY[12] | FS[21]

One input of JOIN[8] is marked as the big table.

deepCopy gives MR[Stage-7]
map: TS[23] | RS[24] and TS[25] | RS[26]
reduce: JOIN[34] | SEL[35] | GBY[36] | FS[37]

which is converted to MRTask[Stage-7]
LocalWork: FetchWork[$INTNAME]
Map Only MR: TS[23], TS[25] | MAPJOIN[44] | SEL[35] | GBY[36] | FS[37]

Pages 166-169: The Hive SQL Compilation Process

CommonJoinResolver

MR[Stage-2]
map: TS[20] | RS[6] and TS[du] | RS[7]
reduce: JOIN[8] | SEL[10] | GBY[12] | FS[21]

The other input of JOIN[8] is now marked as the big table.

...deepCopy gives MRTask[Stage-8]
LocalWork: FetchWork[du]
Map Only MR: TS[45], TS[47] | MAPJOIN[66] | SEL[57] | GBY[36] | FS[37]

Pages 170-172: The Hive SQL Compilation Process

CommonJoinResolver

Before: MR[Stage-1] (JOIN[5]) | MR[Stage-2] (JOIN[8], GBY[12]) | MR[Stage-3] (GBY[14]) | MoveWork[Stage-0] | StatsWork[Stage-4]

After:
MR[Stage-10] (MAPJOIN) | ConditionalTask[Stage-9]
ConditionalTask[Stage-9] | MR[Stage-7] (MAPJOIN) | MR[Stage-3]
ConditionalTask[Stage-9] | MR[Stage-8] (MAPJOIN) | MR[Stage-3]
ConditionalTask[Stage-9] | MR[Stage-2] (JOIN) | MR[Stage-3]
MR[Stage-3] | MoveWork[Stage-0] | StatsWork[Stage-4]

Which of the three variants runs is decided at run time.
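The run-time choice made by ConditionalTask[Stage-9] can be sketched as below. The stage names and the local-work inputs (`$INTNAME` for Stage-7, `du` for Stage-8) come from the slides; the `resolve_conditional` helper, the byte sizes, and the threshold are hypothetical, and "take the first candidate whose local inputs fit" is an assumed policy.

```python
def resolve_conditional(candidates, backup, input_bytes, small_table_bytes):
    """Pick the first MapJoin candidate whose local-work inputs are small enough.

    candidates maps a stage name to the aliases its local work must load
    into memory; backup is the common-join stage used when none qualifies.
    """
    for stage, local_aliases in candidates.items():
        if sum(input_bytes[a] for a in local_aliases) <= small_table_bytes:
            return stage
    return backup


candidates = {"Stage-7": ["$INTNAME"], "Stage-8": ["du"]}

# Sizes only become known once the upstream job has produced $INTNAME.
chosen = resolve_conditional(
    candidates, "Stage-2", {"$INTNAME": 4_000, "du": 9_000_000_000}, 25_000_000
)
print(chosen)
```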

Page 173: The Hive SQL Compilation Process

MapJoinResolver

• Traverse the Task Tree and split every MapReduceTask that has local work into two tasks

Before: MRTask[Stage-10] with FetchWork[c] and MRWork

After: MRTask[Stage-13] with FetchWork[c] and HashTableSinkOperator | MRTask[Stage-10] with MRWork
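The split can be sketched with plain dictionaries. `split_local_work` is a hypothetical helper and the new stage name is passed in by the caller; the sketch only mirrors the before/after shown on the slide.

```python
def split_local_work(task, local_stage):
    """Split a task that carries local work into a local task plus an MR task."""
    if not task.get("local_work"):
        return [task]                      # nothing to split
    local_task = {
        "stage": local_stage,
        # The local task fetches the small table and writes the hash table.
        "work": task["local_work"] + ["HashTableSinkOperator"],
    }
    mr_task = {"stage": task["stage"], "work": task["work"]}
    return [local_task, mr_task]           # the local task runs first


task = {"stage": "Stage-10", "local_work": ["FetchWork[c]"], "work": ["MRWork"]}
result = split_local_work(task, "Stage-13")
for t in result:
    print(t["stage"], t["work"])
```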

Page 174: The Hive SQL Compilation Process

MapJoinResolver

Before:
MR[Stage-10] (MAPJOIN) | ConditionalTask[Stage-9]
ConditionalTask[Stage-9] | MR[Stage-7] (MAPJOIN) | MR[Stage-3]
ConditionalTask[Stage-9] | MR[Stage-8] (MAPJOIN) | MR[Stage-3]
ConditionalTask[Stage-9] | MR[Stage-2] (JOIN) | MR[Stage-3]
MR[Stage-3] | MoveWork[Stage-0] | StatsWork[Stage-4]

After:
Lock[Stage-13] | MR[Stage-10] (MAPJOIN) | ConditionalTask[Stage-9]
ConditionalTask[Stage-9] | Lock[Stage-11] | MR[Stage-7] (MAPJOIN) | MR[Stage-3]
ConditionalTask[Stage-9] | Lock[Stage-12] | MR[Stage-8] (MAPJOIN) | MR[Stage-3]
ConditionalTask[Stage-9] | MR[Stage-2] (JOIN) | MR[Stage-3]
MR[Stage-3] | MoveWork[Stage-0] | StatsWork[Stage-4]

Pages 175-181: The Hive SQL Compilation Process

Review: the SQL translation process

1. Antlr defines the SQL grammar rules and performs the lexical and syntactic analysis, turning the SQL into an abstract syntax tree (AST Tree)
2. Traverse the AST Tree and abstract out QueryBlock, the basic building block of a query
3. Traverse the QueryBlocks and translate them into the execution logic, the OperatorTree
4. The logical optimizer transforms the OperatorTree, merging ReduceSinks to reduce the amount of shuffled data
5. Traverse the OperatorTree and translate it into MapReduce tasks
6. The physical optimizer transforms the MapReduce tasks, generating Conditional Tasks that check dynamically whether a join can be converted to a MapJoin

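Page 182: The Hive SQL Compilation Process

Contents

1. How MapReduce implements Join, Group By, and Distinct
2. How SQL is translated into MapReduce
(1) Antlr && ASTTree
(2) QueryBlock, the basic building block of SQL
(3) Operator, the logical operator
(4) Logical optimizer
(5) Translating the OperatorTree into MapReduce jobs
(6) Physical optimizer; how MapJoin works
3. The Hive execution plan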

Page 183: The Hive SQL Compilation Process

Execution plan

• AST (abstract syntax tree)
• Stage Dependency
• MapReduce Plan

Pages 184-185: The Hive SQL Compilation Process

Stage Dependency

Stage-11 depends on stages: Stage-14 , consists of Stage-15, Stage-16, Stage-4

Stage-11 is a ConditionalTask that may execute one of Stage-15, Stage-16, or Stage-4. At present a ConditionalTask appears only where the decision whether a join can be converted to a MapJoin is made at execution time. Stage-4 is the common join; Stage-15 and Stage-16 are the two possible MapJoin variants.
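A dependency line of this shape can be pulled apart with a few string operations. `parse_stage_dependency` is a hypothetical helper written for exactly the line format quoted above; other line shapes in a real plan are not handled.

```python
import re


def parse_stage_dependency(line):
    """Split one 'depends on stages / consists of' line into its parts."""
    head, _, tail = line.partition("consists of")
    name, _, deps = head.partition("depends on stages:")
    return {
        "stage": name.split()[0],
        "depends_on": re.findall(r"Stage-\d+", deps),
        "consists_of": re.findall(r"Stage-\d+", tail),  # ConditionalTask branches
    }


line = "Stage-11 depends on stages: Stage-14 , consists of Stage-15, Stage-16, Stage-4"
parsed = parse_stage_dependency(line)
print(parsed)
```

A non-empty `consists_of` list is what identifies the stage as a ConditionalTask in this sketch.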

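Page 186: The Hive SQL Compilation Process

MapReduce Plan

• A ReduceSinkOperator can appear only in the Map phase, and it marks where the Map phase ends
• It combines the fields into the reduce key and value
• sort order: ascending by id, ascending by name
• partition key: rows are assigned to reducers by the hash of the partition key
• tag: identifies the table; in a Join it distinguishes which source table a row came from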
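What a ReduceSinkOperator emits can be sketched as a function from a row to (reducer, key, tag, value). This is an illustration only: `reduce_sink` is a hypothetical name, CRC32 stands in for whatever hash function is actually used, and the tags 0 and 1 are chosen for the example.

```python
import zlib


def reduce_sink(row, key_cols, value_cols, partition_cols, tag, num_reducers):
    """Build the shuffle record for one row and pick its reducer."""
    key = tuple(row[c] for c in key_cols)
    value = tuple(row[c] for c in value_cols)
    # The reducer is chosen from the hash of the partition key (CRC32 here).
    partition = "\x01".join(str(row[c]) for c in partition_cols)
    reducer = zlib.crc32(partition.encode("utf-8")) % num_reducers
    return reducer, key, tag, value


# Rows from the two sides of a join on uid; the tag records the source table.
user_rec = reduce_sink({"uid": 1, "name": "apple"}, ["uid"], ["name"], ["uid"], 0, 4)
order_rec = reduce_sink({"uid": 1, "orderid": 1001}, ["uid"], ["orderid"], ["uid"], 1, 4)

# Equal partition keys land on the same reducer; within it, records sort by key, then tag.
print(user_rec, order_rec)
print(sorted([order_rec, user_rec], key=lambda rec: (rec[1], rec[2])))
```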

Page 187: The Hive SQL Compilation Process

MapReduce Plan

• After each Operator finishes its computation it renames the fields, using the pattern _col + i; Map output fields are written as KEY/VALUE._col + i
• KEY._col1:0._col0: the "0." is the tag attached to a distinct field
• mode: the aggregation mode, one of COMPLETE, PARTIAL1, PARTIAL2, PARTIALS, FINAL, HASH, MERGEPARTIAL

Page 188: The Hive SQL Compilation Process

MapReduce Plan

• condition expression: lists the fields that each of the two tables in the join contains
• Position of Big Table: indicates that the table with tag=1 is the large table

Page 190: The Hive SQL Compilation Process

Thanks && QA