Icse2013 shang

Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds

Hadi Hemmati, Bram Adams, Weiyi Shang,

Zhen Ming Jiang, Ahmed E. Hassan,

Patrick Martin

What are Big Data Analytics Applications (BDA Apps)?

BDA Apps

Many fields today rely on BDA Apps to make decisions

Software engineering research, especially Mining Software Repositories.

And…

Under the hood of BDA Apps

Hardware Infrastructure

Software Platform

BDA Apps

Discrepancy between scale of development and deployment

Small sample data and pseudo cloud

Big data and real-life cloud

Data sample

“Analysts moved back and forth from local machines to cloud-based systems.” (ACM Interactions, 2012)

Many things can go wrong when scaling

BDA App: Step 1, Step 2, …, Step n

Large-scale intermediate data generated by each step can fill up the disk space!!!

How to verify the deployment of BDA Apps?

Data sample

How to verify

Traditional approach for verifying BDA apps

Keyword scan

Many false positives!!

Large results, too much effort to manually examine
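A minimal sketch of a keyword scan, to make the limitation concrete. The keyword list and the sample log lines are assumptions for illustration (the lines follow the style of the examples elsewhere in these slides); the slides do not state which keywords were actually used.

```python
import re

# Hypothetical keyword list; not taken from the slides.
KEYWORDS = re.compile(r"\b(kill|fail|error|exception)", re.IGNORECASE)


def keyword_scan(log_lines):
    """Return every log line that contains one of the suspicious keywords."""
    return [line for line in log_lines if KEYWORDS.search(line)]


# Illustrative log lines, not real Hadoop output.
sample_log = [
    "Assign task t on node A.",
    "Assign task t on node B.",
    "Task t finished on node B.",
    "Kill task t on node A.",
]

flagged = keyword_scan(sample_log)
print(flagged)
```

The scan flags the kill line without any surrounding context, so a reviewer cannot tell from the result alone whether the kill indicates a problem.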

Limitations of traditional approach

Not all kills are bad: “speculative execution”

Slow task identified

Duplicate the task to other machines

The results of the first finished task are saved, other tasks are killed!!

A smarter approach is needed

Execution sequences provide context information of log lines

Kill task t on node A.

Assign task t on node A.

Assign task t on node B.

Task t finished on node B.
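A small sketch of how the surrounding sequence can change the reading of a kill line. The event tuples and the rule "a kill is benign if the same task finished on a different node" are simplifying assumptions for illustration, not the decision procedure used in the study.

```python
def kill_is_benign(sequence, task):
    """Treat a kill as benign when the same task finished on another node.

    `sequence` is a list of (event, task, node) tuples; this encoding is
    hypothetical and only mirrors the four example lines on the slide.
    """
    killed_on = {node for event, t, node in sequence
                 if event == "kill" and t == task}
    finished_on = {node for event, t, node in sequence
                   if event == "finish" and t == task}
    return bool(killed_on) and bool(finished_on - killed_on)


speculative = [
    ("assign", "t", "A"),
    ("assign", "t", "B"),
    ("finish", "t", "B"),
    ("kill", "t", "A"),
]
suspicious = [
    ("assign", "t", "A"),
    ("kill", "t", "A"),
]

print(kill_is_benign(speculative, "t"))
print(kill_is_benign(suspicious, "t"))
```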

Log abstraction reduces the amount of data to examine

Kill task t1 on node A.
Kill task t2 on node B.
Kill task t3 on node C.
Kill task t4 on node A.
Kill task t5 on node D.
Kill task t6 on node B.
Kill task t7 on node A.
Kill task t8 on node C.

Large results, too much effort to manually examine

Kill task $t on node $n.
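The abstraction shown above can be sketched with two regular expressions. The patterns assume the simplified line format of the slide examples; real Hadoop log lines are more varied, and the slides do not describe the actual abstraction rules.

```python
import re
from collections import Counter

# Assumed patterns for the dynamic parts of the example lines.
TASK = re.compile(r"\btask \S+")
NODE = re.compile(r"\bnode \w+")


def abstract(line):
    """Replace concrete task and node values with the placeholders $t and $n."""
    line = TASK.sub("task $t", line)
    line = NODE.sub("node $n", line)
    return line


log_lines = [
    "Kill task t1 on node A.",
    "Kill task t2 on node B.",
    "Kill task t3 on node C.",
    "Kill task t4 on node A.",
    "Kill task t5 on node D.",
    "Kill task t6 on node B.",
    "Kill task t7 on node A.",
    "Kill task t8 on node C.",
]

events = Counter(abstract(line) for line in log_lines)
print(events)
```

Eight concrete lines collapse into a single execution event, which is the size reduction the slide refers to.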

Overview of our approach

Data sample

Underlying platform

Log abstraction, log linking, sequence simplification

Execution sequences

Execution sequence delta

Step 1: Log abstraction reduces the size of logs

Log abstraction, log linking, simplifying sequences

Example of log lines

Execution events (Jiang et al., JSME 2008)

Step 2: Log linking provides context for logs

Example of log lines

Execution events
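One way to picture log linking is grouping log lines by a shared identifier so that each line appears next to the other lines about the same unit of work. Using the task ID as the linking key is an assumption here; the slides do not say which identifiers are used. The log lines are illustrative.

```python
import re
from collections import defaultdict

# Assumed: every relevant line mentions "task <id>".
TASK_ID = re.compile(r"\btask (\S+)", re.IGNORECASE)


def link_by_task(log_lines):
    """Group log lines into one ordered sequence per task ID."""
    sequences = defaultdict(list)
    for line in log_lines:
        match = TASK_ID.search(line)
        if match:
            sequences[match.group(1)].append(line)
    return dict(sequences)


log = [
    "Assign task t1 on node A.",
    "Assign task t2 on node C.",
    "Assign task t1 on node B.",
    "Task t1 finished on node B.",
    "Kill task t1 on node A.",
    "Task t2 finished on node C.",
]

linked = link_by_task(log)
for task, lines in linked.items():
    print(task, lines)
```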

Step 3: Sequence simplification deals with repeated logs

Repeated logs:
task t1 read file A.
task t1 read file A.
task t1 read file A.

Remove repetition and order of events
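Removing repetition and order can be sketched by turning each sequence into a set of events. Representing a simplified sequence as a `frozenset` is an assumption made for this illustration; the abstracted event strings are hypothetical.

```python
def simplify(sequence):
    """Drop repeated events and ignore their order."""
    return frozenset(sequence)


# Hypothetical abstracted events.
repeated = [
    "task $t read file $f.",
    "task $t read file $f.",
    "task $t read file $f.",
    "task $t finished.",
]
reordered = [
    "task $t finished.",
    "task $t read file $f.",
]

print(simplify(repeated) == simplify(reordered))
```

Sequences that differ only in how often an event repeats, or in which order events were logged, become identical after simplification, so they only need to be inspected once.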

Comparing small and large runs

Logs from testing run with small data

Event sequence: E1, E2, E3, E5, E6

Logs from run with large data

Event sequences: E1, E2, E3, E5, E6 and E1, E2, E3, E7, E5, E6

Event sequence delta: E1, E2, E3, E7, E5, E6
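The comparison above amounts to keeping the sequences that appear in the large run but not in the small one. The sketch below uses the event sequences from the slide; storing each sequence as a tuple is an assumption made for illustration.

```python
def sequence_delta(small_run, large_run):
    """Return the sequences seen in the large run but not in the small run."""
    seen = set(small_run)
    return [seq for seq in large_run if seq not in seen]


small_run = [("E1", "E2", "E3", "E5", "E6")]
large_run = [
    ("E1", "E2", "E3", "E5", "E6"),
    ("E1", "E2", "E3", "E7", "E5", "E6"),
]

delta = sequence_delta(small_run, large_run)
print(delta)
```

Only the sequence containing E7 remains, so it is the only one a developer has to inspect manually.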

Case study: subject systems

System      Source                   Domain
WordCount   official example         File processing
PageRank    developed from scratch   Social network
JACK        migrated from Perl       Log analysis

How precise is our approach?

Precision, effort reduction

How much effort reduction does our approach provide?

[Bar chart: # log lines, # unique log events, and # log sequences for WordCount, JACK, and PageRank; axis from 0 to 2000]

Our approach reduces the logs for manual inspection by over 86%

86% reduction

91% reduction

Our approach Keyword search

95% reduction

Reduce logs for manual inspection by over 86%

We manually inject 3 common failures

Machine Failure

Missing supporting library

Lack of disk space

We measure the number of log lines and log sequences caused by injected failures.

WordCount, PageRank, JACK

Kola et al., Euro-Par 2005

Our approach generates fewer false positives than the traditional approach

[Chart: false positive ratio between keyword search and our approach; axis from 10 to 40]

Reduce logs for manual inspection by over 86%

Fewer false positives and additional context information to assist in manual inspection

Physical Infrastructure

Underlying Platform

BDA Apps

Our approach can be used in migration of BDA Apps

Hadoop generates more job sequences and task sequences.

PIG automatically optimizes the application by grouping jobs and reducing tasks.

Manually browsing logs to find the differences can be time-consuming.

One of the common migrations

We use our approach to compare the execution sequences of PageRank on both platforms

One more thing …

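MapReduce: Hadoop’s programming paradigm

Counting the frequency of word lengths

Data: good, hello, fish, cat, school, night, happy, dog

Map (key = word length): good 4, hello 5, fish 4, cat 3, school 6, night 5, happy 5, dog 3

Grouped by key: 3: dog, cat; 4: fish, good; 5: hello, night, happy; 6: school

Reduce (key: number of words): 3: 2; 4: 2; 5: 3; 6: 1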
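The word-length example can be traced in plain Python. This is a single-process sketch of the map, shuffle, and reduce phases, not Hadoop code; the function names are chosen for illustration.

```python
from collections import defaultdict

words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]


def map_phase(words):
    """Emit one (key, value) pair per word, keyed by word length."""
    return [(len(word), word) for word in words]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Count how many words share each length."""
    return {key: len(values) for key, values in sorted(groups.items())}


counts = reduce_phase(shuffle(map_phase(words)))
print(counts)  # {3: 2, 4: 2, 5: 3, 6: 1}
```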

Hadoop’s architecture

Hadoop application

Attempt 1

Attempt 2

Attempt n

Not all failures are bugs: JVM failure

The JVM of an attempt has an error

Bugs and memory issues in the JVM cause attempts to be considered failed.

The attempt on that JVM is marked as FAILED

An overview of our approach

Logs from testing run with small data

Logs from run with large data

Execution sequence recovery: log abstraction, log linking, sequence simplification

Sequence comparing

Execution sequence delta

Execution sequence report

Underlying platform

BDA Apps

Physical infrastructure


Our approach has comparable precision to the traditional approach

Repeating the same abstracted problem over 1,400 times

Almost triple the precision

Similar precision

Half the precision
