
00-04-27 1

On Computing the Data Cube. Research Report 10026, IBM Almaden Research Center, San Jose, California, 1996.

Parallel and Distributed Computing Laboratory, first-semester master's student 송지숙


Contents

Introduction
PipeSort Algorithm
PipeHash Algorithm
Comparing PipeSort and PipeHash
Conclusion


Optimization(1/2)

Smallest-parent
• Compute a group-by from the smallest of the previously computed group-bys.
Cache-results
• To reduce disk I/O, compute other group-bys from a group-by whose result is still held in memory.
Amortize-scans
• Reduce disk reads by computing as many group-bys as possible at the same time.
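The smallest-parent idea can be sketched as follows. This is a minimal illustration, not the presenter's code: the helper names `group_by` and `smallest_parent`, the sample rows, and the choice of SUM as the aggregate are all assumptions made for the example.

```python
from collections import defaultdict

def group_by(rows, attrs, keep):
    """Aggregate (key, total) rows keyed on attrs down to the attributes in keep."""
    idx = [attrs.index(a) for a in keep]
    out = defaultdict(int)
    for key, total in rows:
        out[tuple(key[i] for i in idx)] += total
    return sorted(out.items())

def smallest_parent(target, computed):
    """Pick the smallest already-computed group-by whose attributes cover target."""
    candidates = [attrs for attrs in computed if set(target) <= set(attrs)]
    return min(candidates, key=lambda attrs: len(computed[attrs]))

# Hypothetical data: keys over (A, B, C) with a SUM measure.
raw = [(("a1", "b1", "c1"), 5), (("a1", "b2", "c1"), 3),
       (("a2", "b1", "c2"), 4), (("a2", "b2", "c2"), 1)]
computed = {("A", "B", "C"): raw}
computed[("A", "B")] = group_by(raw, ("A", "B", "C"), ("A", "B"))
computed[("A", "C")] = group_by(raw, ("A", "B", "C"), ("A", "C"))

# A can be derived from ABC, AB, or AC; AC has the fewest rows here.
parent = smallest_parent(("A",), computed)
result_a = group_by(computed[parent], parent, ("A",))
```

In this sample, AC has two rows while AB and ABC have four, so A is computed from AC.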


Optimization(2/2)

Share-sorts
• Specific to sort-based algorithms
• Share the sorting cost across multiple group-bys
Share-partitions
• Specific to hash-based algorithms
• When the hash table is too large to fit in memory, partition the data so that each partition fits in memory, and aggregate each partition separately
• Share the partitioning cost across multiple group-bys
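A rough sketch of share-partitions, under assumptions: the function names, the sample rows, and the two-way split are hypothetical, and SUM stands in for the aggregate. The data is partitioned once on A, and that single partitioning pass serves every group-by that contains A.

```python
from collections import defaultdict

def partition_on(rows, attrs, part_attr, n_parts):
    """Split rows into n_parts buckets by hashing the partitioning attribute."""
    i = attrs.index(part_attr)
    parts = [[] for _ in range(n_parts)]
    for key, value in rows:
        parts[hash(key[i]) % n_parts].append((key, value))
    return parts

def aggregate(rows, attrs, keep):
    """Hash-aggregate rows onto the attributes in keep."""
    idx = [attrs.index(a) for a in keep]
    table = defaultdict(int)
    for key, value in rows:
        table[tuple(key[j] for j in idx)] += value
    return dict(table)

attrs = ("A", "B", "C")
raw = [(("a1", "b1", "c1"), 5), (("a1", "b2", "c1"), 3),
       (("a2", "b1", "c2"), 4), (("a2", "b2", "c2"), 1)]

parts = partition_on(raw, attrs, "A", n_parts=2)  # one partitioning pass
cube = {("A", "B"): {}, ("A", "C"): {}, ("A",): {}}
for part in parts:
    for keep in cube:
        # Every group containing A lies in exactly one partition,
        # so per-partition results can be combined without merging keys.
        cube[keep].update(aggregate(part, attrs, keep))
```

Group-bys that do not contain the partitioning attribute (BC, for example) cannot reuse these partitions this way, because their groups may span several partitions.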


Sort-based methods

PipeSort algorithm
• Combines the share-sorts and smallest-parent optimizations: because the two can conflict, global planning over the group-bys is used to obtain the minimum total cost.
• Also includes the cache-results and amortize-scans optimizations: disk scan cost is reduced by executing multiple group-bys in a pipelined fashion.
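The pipelined execution can be sketched as below. This is an illustrative reconstruction under assumptions (function name, SUM aggregate, sample rows), not code from the report: given rows sorted on all attributes, one scan produces every prefix group-by, and each finished tuple is handed to the next coarser level while only one running tuple per level stays in memory.

```python
def pipeline(sorted_rows, n_attrs):
    """Compute all prefix group-bys of sorted rows in one scan.

    results[k] is the group-by on the first k attributes (k = 0 is "all").
    """
    results = [[] for _ in range(n_attrs + 1)]
    running = [None] * (n_attrs + 1)  # one [key, total] pair per level

    def push(level, key, value):
        # Feed one tuple into the group-by on the first `level` attributes.
        prefix = key[:level]
        if running[level] is not None and running[level][0] == prefix:
            running[level][1] += value
            return
        flush(level)
        running[level] = [prefix, value]

    def flush(level):
        # Emit the finished tuple and pass it down to the coarser group-by.
        if running[level] is None:
            return
        key, total = running[level]
        results[level].append((key, total))
        running[level] = None
        if level > 0:
            push(level - 1, key, total)

    for key, value in sorted_rows:
        push(n_attrs, key, value)
    for level in range(n_attrs, -1, -1):
        flush(level)
    return results

# Hypothetical rows, already sorted on (A, B).
rows = [(("a1", "b1"), 2), (("a1", "b1"), 3), (("a1", "b2"), 1), (("a2", "b1"), 4)]
out = pipeline(rows, n_attrs=2)
```

A group-by whose attributes are not a prefix of the current sort order (B, in this sample) needs a different sort order, which is where the sorting cost enters the plan.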


Share-sorts and smallest-parent

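[Figure: search lattice for attributes A, B, C, D]
Level 0: all
Level 1: A B C D
Level 2: AB AC AD BC BD CD
Level 3: ABC ABD ACD BCD
Level 4: ABCD
Nodes marked in the figure: BDA, AB, ABC, A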


cache-results and amortize-scans

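[Figure: search lattice for attributes A, B, C, D]
Level 0: all
Level 1: A B C D
Level 2: AB AC AD BC BD CD
Level 3: ABC ABD ACD BCD
Level 4: ABCD
Nodes marked in the figure: ABCD, ABC, AB, A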


Algorithm PipeSort(1/2)

Input
• search lattice
  - vertex: a group-by of the cube
  - edge: i is connected to j when j can be generated from i; j has one attribute fewer than i, and i is called the parent of j.
  - cost: S is the cost of computing j from i when i is not sorted; A is the cost of computing j from i when i is sorted.
Output
• subgraph of the search lattice
  - each group-by is associated with a sort order of its attributes and is connected to the single parent from which it is computed.
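The input lattice described above can be built mechanically. The sketch below is only an illustration: `build_lattice` is a hypothetical helper, attribute names are assumed to be single characters, and the S and A edge costs are left out because they would come from size estimates.

```python
from itertools import combinations

def build_lattice(attributes):
    """Return (levels, edges) of the search lattice for single-letter attributes.

    levels[k] lists the group-bys with k attributes; each edge (parent, child)
    links a group-by to one with exactly one attribute fewer.
    """
    n = len(attributes)
    levels = {k: ["".join(c) for c in combinations(attributes, k)]
              for k in range(1, n + 1)}
    levels[0] = ["all"]
    edges = []
    for k in range(1, n + 1):
        for parent in levels[k]:
            for dropped in parent:
                child = parent.replace(dropped, "") or "all"
                edges.append((parent, child))
    return levels, edges

levels, edges = build_lattice("ABCD")
```

For four attributes this yields 16 vertices and 32 edges; a PipeSort-style planner would attach an S cost and an A cost to each edge.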


Algorithm PipeSort(2/2)

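[Figure: computing the level 1 group-bys (A, B, C) from the level 2 group-bys (AB, AC, BC) in a three-attribute lattice (levels 0 to 3, all to ABC), and the resulting minimum cost matching]
Costs per level 2 group-by:
Group-by   A()   S()
AB           2    10
AC           5    12
BC          13    20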
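The level-by-level choice in the figure can be reproduced on this small example with an exhaustive search. Note the assumptions: the pairing of the numbers with A() (2, 5, 13) and S() (10, 12, 20) is read from the figure, the function name is hypothetical, and brute force is swapped in here for a proper weighted bipartite matching, which is what a real planner would use on larger lattices. The constraint encoded is that a parent can feed at most one child through an A() edge (its current sort order), while S() edges require a re-sort and are unrestricted.

```python
from itertools import product

# Costs as read from the figure (assumed pairing).
A_COST = {"AB": 2, "AC": 5, "BC": 13}   # parent already in a usable sort order
S_COST = {"AB": 10, "AC": 12, "BC": 20}  # parent must be re-sorted first
CHILDREN = ["A", "B", "C"]

def best_plan(children, a_cost, s_cost):
    """Exhaustively pick a parent and edge type for every child."""
    options = []
    for child in children:
        parents = [p for p in a_cost if child in p]
        options.append([(p, mode) for p in parents for mode in ("A", "S")])
    best = None
    for choice in product(*options):
        pipelined = [p for p, mode in choice if mode == "A"]
        if len(pipelined) != len(set(pipelined)):
            continue  # a parent may serve only one child via an A() edge
        cost = sum(a_cost[p] if mode == "A" else s_cost[p] for p, mode in choice)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(children, choice)))
    return best

cost, plan = best_plan(CHILDREN, A_COST, S_COST)
```

With these numbers the cheapest assignment costs 17: two children use A() edges from AB and AC, and the third is computed from AB through an S() edge.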


Minimum cost sort plan

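[Figure: minimum cost sort plan for a four-attribute cube, from Raw data down to all, drawn with pipeline edges and sort edges]
Costs per group-by, in the sort order shown:
Level   Group-by   A()   S()
4       CBAD        50   160
3       CBA         10    30
3       BAD         15    40
3       ACD          5    20
3       DBC         45   130
2       CB           5    15
2       BA           5    15
2       AC           4    14
2       DB           5    15
2       AD           5    15
2       CD          10    20
1       C            2     4
1       B            5     8
1       A            4    16
1       D            4    13
Additional sort orders shown in the plan: BADC, ACDB, DBA, DBCA, ADC, CDA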


Hash-based methods

PipeHash algorithm
• Combines the cache-results and amortize-scans optimizations: this requires careful memory allocation among multiple hash tables.
• Also includes the smallest-parent optimization.
• Includes the share-partitions optimization: when the hash tables for the data being aggregated are too large to fit in memory, the data is partitioned on one or more attributes. The data partitioning cost is shared across all group-bys that contain the partitioning attribute.
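How cache-results and amortize-scans combine in a hash-based method can be sketched as follows, assuming everything fits in memory. The function name, sample rows, and SUM aggregate are assumptions for illustration: one scan of ABC fills the hash tables for AB and AC together, and while AB is still in memory it is scanned once to produce A and B.

```python
from collections import defaultdict

def hash_group_bys(rows, attrs, targets):
    """Build one hash table per target group-by in a single scan of rows."""
    positions = {t: [attrs.index(a) for a in t] for t in targets}
    tables = {t: defaultdict(int) for t in targets}
    for key, value in rows:
        for t in targets:
            tables[t][tuple(key[i] for i in positions[t])] += value
    return {t: dict(table) for t, table in tables.items()}

# Hypothetical ABC group-by with a SUM measure.
abc = [(("a1", "b1", "c1"), 5), (("a1", "b2", "c1"), 3),
       (("a2", "b1", "c2"), 4), (("a2", "b2", "c2"), 1)]

# Amortize-scans: AB and AC are filled during the same scan of ABC.
level2 = hash_group_bys(abc, ("A", "B", "C"), [("A", "B"), ("A", "C")])
# Cache-results: AB is still in memory, so A and B are computed from it directly.
level1 = hash_group_bys(level2[("A", "B")].items(), ("A", "B"), [("A",), ("B",)])
```

The memory-allocation question raised on the slide arises because all hash tables built in one scan must be resident at once.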


cache-results and amortize-scans

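[Figure: search lattice for attributes A, B, C, D]
Level 0: all
Level 1: A B C D
Level 2: AB AC AD BC BD CD
Level 3: ABC ABD ACD BCD
Level 4: ABCD
Nodes marked in the figure: AB, AC, A, B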


Algorithm PipeHash

Input
• search lattice
First step
• For each group-by, select the parent group-by with the smallest estimated total size. The result is a minimum spanning tree (MST).
Next step
• Usually there is not enough memory to compute all the group-bys in the MST together.
• Decide which group-bys to compute together, when to release the memory held by one hash table for another, and which attribute to use for data partitioning.
• To exploit the cache-results and amortize-scans optimizations, select the largest possible subtree of the MST.
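The first step can be sketched in a few lines. The size estimates below are invented for illustration, the function name is hypothetical, and the empty string stands for the "all" group-by; the later steps (choosing subtrees and partitioning attributes under a memory budget) are not covered here.

```python
def smallest_parent_tree(sizes):
    """Map each group-by to the parent (one more attribute) of smallest size."""
    tree = {}
    for node in sizes:
        parents = [p for p in sizes
                   if len(p) == len(node) + 1 and set(node) <= set(p)]
        if parents:
            tree[node] = min(parents, key=sizes.get)
    return tree

# Hypothetical estimated sizes (in tuples) for a three-attribute cube.
SIZES = {"ABC": 1000, "AB": 400, "AC": 300, "BC": 600,
         "A": 10, "B": 50, "C": 40, "": 1}
tree = smallest_parent_tree(SIZES)
```

Here A and C both hang off AC (300 tuples) rather than AB or BC, B hangs off AB, and ABC has no entry because its only parent is the raw data.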


[Figure: Minimum spanning tree]
all
A B C D
AB AC BC AD CD BD
ABC ABD ACD BCD
ABCD
Raw Data


[Figure: the minimum spanning tree split into subtrees]
First subtree, partitioned on A: Raw Data, ABCD, ABC, ABD, ACD, AB, AC, AD, A
Remaining subtrees (nodes as shown): BC, ABC, B, AB, all, A, ABCD, BCD, CD, BD, C, D


Comparing PipeSort and PipeHash(1/5)

Dataset     # of grouping attributes   # of tuples (in millions)   Size (in MB)
Dataset-A   3                          5.5                         110
Dataset-B   4                          7.5                         121
Dataset-C   5                          9                           180
Dataset-D   5                          3                           121
Dataset-E   6                          0.7                         18

Datasets

Performance results
• Both algorithms are faster than the naive methods.
• The performance of PipeHash is very close to the lower bound for hash-based algorithms.
• PipeHash is inferior to the PipeSort algorithm.


When each group-by greatly reduces the number of tuples, the hash-based method should outperform the sort-based method.

Synthetic datasets
• number of tuples, T
• number of grouping attributes, N
• ratio among the number of distinct values of each attribute, d1:d2:…:dN
• ratio of T to the total number of possible attribute value combinations, p
  - used to vary the degree of data sparsity
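To make the role of p concrete, the sketch below derives distinct-value counts from T, the ratio d1:…:dN, and p, then draws tuples. It is an assumption-laden illustration: the function names are hypothetical, values are drawn uniformly at random (the slides do not state the distribution), and rounding means the achieved p is only approximately the requested one.

```python
import math
import random

def distinct_counts(T, ratios, p):
    """Scale the ratio d1:...:dN so that T / (d1 * ... * dN) is close to p."""
    n = len(ratios)
    target_product = T / p  # total number of possible value combinations
    scale = (target_product / math.prod(ratios)) ** (1.0 / n)
    return [max(1, round(r * scale)) for r in ratios]

def generate(T, ratios, p, seed=0):
    """Draw T tuples, each attribute uniform over its distinct values (assumed)."""
    rng = random.Random(seed)
    counts = distinct_counts(T, ratios, p)
    return [tuple(rng.randrange(d) for d in counts) for _ in range(T)]

counts = distinct_counts(T=1000, ratios=[1, 2, 4], p=0.125)
rows = generate(T=1000, ratios=[1, 2, 4], p=0.125)
```

A small p means few tuples relative to the possible combinations (sparse data); raising p toward and beyond 1 makes the data denser, so each group-by collapses more tuples.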



Effect of sparseness on relative performance of the hash and sort-based algorithms for a 5 attribute synthetic dataset.



Results
• The x-axis denotes decreasing levels of sparsity.
• The y-axis denotes the ratio between the total running times of the PipeHash and PipeSort algorithms.
• As the data becomes less sparse, the hash-based method performs better than the sort-based method.
• Sparsity is therefore a predictor of the relative performance of the PipeHash and PipeSort algorithms.



Conclusion

Presented five optimizations: smallest-parent, cache-results, amortize-scans, share-sorts, and share-partitions.

The PipeHash and PipeSort algorithms combine them so as to reduce the total cost.

PipeHash does better on low sparsity data whereas PipeSort does better on high sparsity data.
