Dumbo in the Cloud Report Series, Part 2. Cloud Group. Hadoop in SIGMOD 2011.


Page 1:

Dumbo in the Cloud Report Series, Part 2

Cloud Group

Page 2:

Hadoop in SIGMOD 2011

Page 3:

Outline

Introduction

Nova: Continuous Pig/Hadoop Workflows

Apache Hadoop Goes Realtime at Facebook

Emerging Trends in the Enterprise Data Analytics

A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

Page 4:

Industrial Session in SIGMOD 2011

Data Management for Feeds and Streams (2)

Dynamic Optimization and Unstructured Content (4)

Business Analytics (2)

Support for Business Analytics and Warehousing (4)

Applying Hadoop (4)

Page 5:
Page 6:

Nova: Continuous Pig/Hadoop Workflows

By Yahoo!

Page 7:

Nova Overview

Scenarios
  Ingesting and analyzing user behavior logs
  Building and updating a search index from a stream of crawled web pages
  Processing semi-structured data

Two-layer programming model (Nova over Pig)
  Continuous processing
  Independent scheduling
  Cross-module optimization
  Manageability features

Page 8:

Workflow Model

Workflow
  Two kinds of vertices: tasks (processing steps) and channels (data containers)
  Edges connect tasks to channels and channels to tasks

Four common patterns of processing
  Non-incremental (template detection)
  Stateless incremental (shingling)
  Stateless incremental with lookup table (template tagging)
  Stateful incremental (de-duping)
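The graph structure described on this slide can be sketched in a few lines of Python. This is only an illustration, not Nova's actual API: the `Workflow` class, its methods, and the channel names are hypothetical, and the task names are borrowed from the processing patterns listed above. The one rule enforced is the one the slide states: edges run from tasks to channels or from channels to tasks, never task to task or channel to channel.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical bipartite workflow graph: tasks and channels as vertices."""
    tasks: set = field(default_factory=set)      # processing steps
    channels: set = field(default_factory=set)   # data containers
    edges: set = field(default_factory=set)      # (source, destination) pairs

    def connect(self, src, dst):
        # Only task -> channel and channel -> task edges are allowed.
        task_to_channel = src in self.tasks and dst in self.channels
        channel_to_task = src in self.channels and dst in self.tasks
        if not (task_to_channel or channel_to_task):
            raise ValueError(f"edge {src!r} -> {dst!r} must link a task and a channel")
        self.edges.add((src, dst))

    def inputs_of(self, task):
        # Channels that feed the given task.
        return sorted(s for s, d in self.edges if d == task)

    def outputs_of(self, task):
        # Channels the given task writes to.
        return sorted(d for s, d in self.edges if s == task)


# Illustrative pipeline: crawled pages are shingled, then de-duplicated.
wf = Workflow(
    tasks={"shingling", "de-duping"},
    channels={"crawled_pages", "shingled_pages", "unique_pages"},
)
wf.connect("crawled_pages", "shingling")
wf.connect("shingling", "shingled_pages")
wf.connect("shingled_pages", "de-duping")
wf.connect("de-duping", "unique_pages")
```

Keeping channels as first-class vertices, rather than labels on task-to-task edges, is what lets several tasks read from or write to the same data container.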

Page 9:

Workflow Model (Cont.)

Data and Update Model
  Blocks: a channel's data is divided into blocks

Base block
  Contains a complete snapshot of data on a channel as of some point in time
  Base blocks are assigned increasing sequence numbers ($B_0, B_1, B_2, \ldots, B_n$)

Delta block
  Used in conjunction with incremental processing
  Contains instructions for transforming a base block into a new base block: $\Delta_{i \to j}$ transforms $B_i$ into $B_j$ ($i < j$)
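A minimal sketch of how a delta block might turn one base block into the next. The slide does not say what the "instructions" in a delta block look like, so this example assumes keyed records and represents a delta as a set of upserts plus a set of deleted keys. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BaseBlock:
    seq: int        # sequence number i of base block B_i
    records: dict   # complete snapshot, assumed here to be key -> value


@dataclass
class DeltaBlock:
    start: int      # i in Delta_{i->j}
    end: int        # j in Delta_{i->j}
    upserts: dict = field(default_factory=dict)  # assumed instruction: insert or overwrite
    deletes: set = field(default_factory=set)    # assumed instruction: remove keys


def apply_delta(base, delta):
    """Return B_j obtained by applying Delta_{i->j} to B_i; the input is not modified."""
    if delta.start != base.seq:
        raise ValueError("delta does not start at this base block")
    if delta.end <= delta.start:
        raise ValueError("delta must move to a higher sequence number")
    records = dict(base.records)
    for key in delta.deletes:
        records.pop(key, None)
    # Upserts are applied after deletes, so a key present in both survives.
    records.update(delta.upserts)
    return BaseBlock(seq=delta.end, records=records)


b0 = BaseBlock(seq=0, records={"a": 1, "b": 2})
d01 = DeltaBlock(start=0, end=1, upserts={"b": 3, "c": 4}, deletes={"a"})
b1 = apply_delta(b0, d01)
```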

Page 10:

Workflow Model (Cont.)

Task/Data Interface
  Consumption mode: all or new
  Production mode: B (base block) or Δ (delta block)

Page 11:

Workflow Model (Cont.)

Workflow Programming and Scheduling
  Data-based trigger
  Time-based trigger
  Cascade trigger

Data Compaction and Garbage Collection
  If a channel has blocks $B_0$, $\Delta_{0 \to 1}$, $\Delta_{1 \to 2}$, $\Delta_{2 \to 3}$, the compaction operation computes $B_3$ and adds it to the channel.
  If compaction has added $B_3$ to the channel and the current cursor is at sequence number 2, then $B_0$, $\Delta_{0 \to 1}$, and $\Delta_{1 \to 2}$ can be garbage-collected.
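The compaction and garbage-collection example above can be replayed in a short sketch. Several things here are assumptions made for illustration and are not taken from the slides: a channel is modeled as two plain dictionaries, a delta is a pair of upserts and deleted keys, there is a single consumer cursor, and the collection rule (drop older base blocks and any delta ending at or before the cursor once a base block at or past the cursor exists) is only one rule that reproduces the slide's outcome.

```python
def compact(bases, deltas):
    """Fold the delta chain into the newest base block and add the result.

    bases:  {seq: records}                      base blocks B_seq
    deltas: {(i, j): (upserts, deleted_keys)}   delta blocks Delta_{i->j}, with i < j
    Returns the sequence number of the newly added base block.
    """
    seq = max(bases)
    records = dict(bases[seq])
    # Assumes a single chain: at most one delta starts at each sequence number.
    by_start = {i: (j, ops) for (i, j), ops in deltas.items()}
    while seq in by_start:
        next_seq, (upserts, deleted_keys) = by_start[seq]
        if next_seq <= seq:
            raise ValueError("delta must move to a higher sequence number")
        for key in deleted_keys:
            records.pop(key, None)
        records.update(upserts)
        seq = next_seq
    bases[seq] = records
    return seq


def garbage_collect(bases, deltas, cursor):
    """Drop blocks a consumer at `cursor` no longer needs (assumed rule, see lead-in)."""
    latest = max(bases)
    if latest < cursor:
        return  # no snapshot at or past the cursor yet, so keep everything
    for seq in [s for s in bases if s < latest]:
        del bases[seq]
    for key in [k for k in deltas if k[1] <= cursor]:
        del deltas[key]


# The slide's scenario: B0 plus three deltas, cursor at sequence number 2.
bases = {0: {"a": 1}}
deltas = {
    (0, 1): ({"b": 2}, set()),
    (1, 2): ({"a": 5}, set()),
    (2, 3): ({}, {"b"}),
}
new_seq = compact(bases, deltas)
garbage_collect(bases, deltas, cursor=2)
```

After both calls only the compacted base block and the delta from 2 to 3 remain, which matches the blocks the slide lists as collectable.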

Page 12:

Nova System Architecture

Page 13:

Apache Hadoop Goes Realtime at Facebook

By Facebook

Page 14:

Workload Types

Facebook Messaging
  High write throughput
  Large tables
  Data migration

Facebook Insights
  Realtime analytics
  High throughput increments

Facebook Metrics System (ODS)
  Automatic sharding
  Fast reads of recent data and table scans

Page 15:

Why Hadoop & HBase

Elasticity
High write throughput
Efficient and low-latency strong consistency semantics within a data center
Efficient random reads from disk
High availability and disaster recovery
Fault isolation
Atomic read-modify-write primitives
Range scans
Tolerance of network partitions within a single data center
Zero downtime in case of individual data center failure
Active-active serving capability across different data centers

Page 16:

Realtime HDFS

High Availability - AvatarNode

Page 17:

Realtime HDFS (Cont.)

Hadoop RPC compatibility

Block Availability: Placement Policy
  A pluggable block placement policy

Page 18:

Realtime HDFS (Cont.)

Performance Improvements for a Realtime Workload
  RPC timeout
  Reads from local replicas

New Features
  HDFS sync
  Concurrent readers

Page 19:

Production HBase

ACID Compliance (RWCC: Read Write Consistency Control)
  Atomicity (WALEdit)
  Consistency

Availability Improvements
  HBase Master rewrite: region assignment state moved from memory to ZooKeeper
  Online upgrades
  Distributed log splitting

Performance Improvements
  Compaction (minor and major)
  Read optimizations

Page 20:

Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse

By IBM

Page 21:

Motivation

1. Increasing volumes of data

2. Hadoop-based solutions in conjunction with data warehouses

Page 22:
Page 23:

A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

By Teradata

Page 24:

Motivation

ETL (Extraction, Transformation, Loading) is a critical part of a data warehouse

While data is partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node, which becomes a bottleneck

Page 25:

Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?

More disk space can be easily added
Use as an intermediate storage
MapReduce for transformation
Load data in parallel

Page 26:

Block Assignment Problem

– HDFS file $F$ on a cluster of $P$ nodes (each node is uniquely identified by an integer $i$, where $1 \le i \le P$)

– The problem is defined by assignment($X$, $Y$, $n$, $m$, $k$, $r$):
  $X$ is the set of $n$ blocks of $F$ ($X = \{1, \ldots, n\}$)
  $Y$ is the set of $m$ nodes running the PDBMS, called PDBMS nodes ($Y \subseteq \{1, \ldots, P\}$)
  $k$ is the number of copies of each block; $m$ is the number of PDBMS nodes
  $r$ is the mapping recording the replicated block locations of each block: $r(i)$ returns the set of nodes that hold a copy of block $i$

Page 27:

Block Assignment Problem (Cont.)

• An assignment $g$ from the blocks in $X$ to the nodes in $Y$ is a mapping from $X = \{1, \ldots, n\}$ to $Y$, where $g(i) = j$ ($i \in X$, $j \in Y$) means that block $i$ is assigned to node $j$.

• An even assignment $g$ is an assignment such that $\forall i \in Y, \forall j \in Y$: $\big|\, |\{x \mid 1 \le x \le n \wedge g(x) = i\}| - |\{y \mid 1 \le y \le n \wedge g(y) = j\}| \,\big| \le 1$.

• The cost of an assignment $g$ is defined as $\mathrm{cost}(g) = |\{i \mid g(i) \notin r(i),\ 1 \le i \le n\}|$, which is the number of blocks assigned to remote nodes.
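The two definitions above translate directly into code. The sketch below only checks whether a given assignment is even and counts its cost. It does not attempt to find an optimal assignment. The function names and the small instance (4 blocks, PDBMS nodes 1 and 2 in a 3-node cluster, 2 copies per block) are made up for illustration.

```python
from collections import Counter


def is_even(g, pdbms_nodes):
    """True if the block counts of any two PDBMS nodes differ by at most 1.

    g:           {block: node}, the assignment
    pdbms_nodes: the non-empty set Y of PDBMS nodes
    """
    counts = Counter(g.values())
    loads = [counts.get(node, 0) for node in pdbms_nodes]
    return max(loads) - min(loads) <= 1


def cost(g, r):
    """Number of blocks assigned to a node that holds no replica of them.

    r: {block: set of nodes holding a copy of that block}
    """
    return sum(1 for block, node in g.items() if node not in r[block])


# Hypothetical instance: n = 4 blocks, Y = {1, 2}, P = 3, k = 2 copies per block.
Y = {1, 2}
r = {1: {1, 3}, 2: {1, 2}, 3: {2, 3}, 4: {1, 3}}

g_local = {1: 1, 2: 2, 3: 2, 4: 1}    # every block lands on a node with a replica
g_skewed = {1: 1, 2: 1, 3: 1, 4: 2}   # node 1 gets three blocks, node 2 gets one
```

Here `g_local` is even with cost 0, while `g_skewed` is not even and assigns two blocks (3 and 4) to nodes without a local copy.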

Page 28:

Thank You!