Introduction to DISQL, a distributed programming framework widely used in Baidu

Post on 24-May-2015


1

Introduction to DISQL

Chen Xiaoming (陈晓鸣)

Senior Engineer, Baidu IBASE Dept. (Infrastructure Platform Department)

2

What is DISQL?

3

DISQL is a distributed programming framework

widely used in Baidu

4

Contents

Problems

Solution

Examples

Rationales

Adoption

5

Problems

6

Problems

statistical analysis of logs

extraction of fields

in order to generate reports

7

Problems

statistical analysis of features

features of web pages, web sites, ads, user preferences, etc.

in order to provide data for data mining and machine learning

8

Problems

common operations

selecting, filtering, grouping, sorting, joining, etc.
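As a rough single-machine illustration of the operations listed above, the following Python sketch applies each one to a small in-memory data set. The records, field names, and the `site_owner` side table are invented for illustration and do not come from any Baidu system.

```python
from collections import defaultdict

# Hypothetical log records; field names are illustrative only.
records = [
    {"site": "a.com", "clicks": 3},
    {"site": "b.com", "clicks": 0},
    {"site": "a.com", "clicks": 5},
    {"site": "c.com", "clicks": 2},
]
site_owner = {"a.com": "alice", "c.com": "carol"}  # hypothetical side table for joining

# Filtering: keep records with at least one click.
clicked = [r for r in records if r["clicks"] > 0]

# Selecting: project each record onto the fields of interest.
selected = [(r["site"], r["clicks"]) for r in clicked]

# Grouping: sum clicks per site.
totals = defaultdict(int)
for site, clicks in selected:
    totals[site] += clicks

# Sorting: order sites by total clicks, largest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Joining: attach the owner from the side table (inner join on site).
joined = [(site, total, site_owner[site]) for site, total in ranked if site in site_owner]

print(joined)
```

On logs of the size the slides describe, each of these steps needs a distributed counterpart, which is the part a framework can encapsulate.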

9

Solution

10

A Platform

named Log Statistical Platform, a.k.a. LSP

web-based

convenient for secondary development

convenient for task/data/rights management

11

A Programming Framework

named DIstributed SQL, a.k.a. DISQL

provide SQL-like operators which can be combined arbitrarily

encapsulate distributed algorithms

automatic code generation

12

Application Programming Interfaces

named Distributed Query, a.k.a. DQuery

DSL-style APIs embedded in well-known programming languages

PHP so far, C++/Python,… in the future

uses method chaining to provide a fluent interface

data flow expressed as a DAG composed of method chains
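To make the method-chaining idea concrete, here is a minimal Python sketch of a fluent interface whose calls build a data-flow DAG. It is not DQuery's API (DQuery is embedded in PHP, and its method names are not given in these slides). The `Node` class, the `source` helper, and the `filter`, `select`, `join`, and `run` methods are all hypothetical names chosen for illustration, and evaluation here is local rather than distributed.

```python
class Node:
    """One operator in the data-flow graph; `parents` are its input nodes."""

    def __init__(self, op, arg=None, parents=()):
        self.op = op
        self.arg = arg
        self.parents = list(parents)

    # Each method returns a new Node, which is what makes chaining possible.
    def filter(self, predicate):
        return Node("filter", predicate, [self])

    def select(self, fn):
        return Node("select", fn, [self])

    def join(self, other, key):
        # Two parents: this is where a chain becomes a DAG rather than a line.
        return Node("join", key, [self, other])

    def run(self):
        """Evaluate the graph locally, parents first (hypothetical local runtime)."""
        inputs = [p.run() for p in self.parents]
        if self.op == "source":
            return list(self.arg)
        if self.op == "filter":
            return [row for row in inputs[0] if self.arg(row)]
        if self.op == "select":
            return [self.arg(row) for row in inputs[0]]
        if self.op == "join":
            key = self.arg
            right = {}
            for row in inputs[1]:
                right.setdefault(row[key], []).append(row)
            return [{**l, **r} for l in inputs[0] for r in right.get(l[key], [])]
        raise ValueError(f"unknown operator: {self.op}")


def source(rows):
    """Wrap an in-memory list of dict rows as the root of a chain."""
    return Node("source", rows)


shows = source([{"site": "a.com", "shows": 10}, {"site": "b.com", "shows": 0}])
owners = source([{"site": "a.com", "owner": "alice"}])

result = (
    shows.filter(lambda r: r["shows"] > 0)
    .join(owners, key="site")
    .select(lambda r: (r["owner"], r["shows"]))
    .run()
)
print(result)
```

Because nothing executes until `run()` is called, the chain is only a description of the computation. That separation is what would let a framework inspect, optimize, and split the graph before generating distributed jobs.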

13

Three Edit Modes – Simple Mode

14

Three Edit Modes – DQuery Mode

15

Three Edit Modes – Complex Mode

16

Hierarchy

Linux

Hadoop,…

DISQL

DQuery

LSP

17

DISQL Architecture

[Diagram. Layer labels: Edit Modes, APIs, Translators, Runtimes. Components shown: Simple Mode, DQuery Mode, Complex Mode; PHP, C++, Python; Data-flow, Schema, Storage APIs, Computing APIs; Normalizer, Optimizer, Splitter, Planner, Coder.]

18

LSP Architecture

data presentation & monitoring

data access layer

data management layer

computing layer

storage systems computing systems

third party apps

19

Examples

20

Example 1 – word count
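As a rough single-machine illustration of the task (plain Python, not DQuery code), word count reduces to splitting each line into words and counting occurrences per word. The `word_count` function name and the sample input are assumptions made for this sketch.

```python
from collections import Counter


def word_count(lines):
    """Count occurrences of each whitespace-separated word across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


counts = word_count(["to be or not to be", "to do"])
print(counts.most_common(2))
```

In a distributed setting the per-word counting becomes a group-by-key step spread across machines, while the splitting logic stays the same.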

21

Example 2

given a log of query and ad shows

extract site field from url field

filter sites with regex

calculate the amount of query and ad shows per site

output in JSON format
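The steps above can be sketched on a single machine as follows. This is plain Python rather than DQuery code, and several details are assumptions: the log layout (one row per URL with a query count and an ad-show count), treating the URL's hostname as the "site", the `SITE_PATTERN` regex, and the `summarize` function name.

```python
import json
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical log rows: (url, query_count, ad_show_count).
log = [
    ("http://news.example.com/a?x=1", 1, 2),
    ("http://news.example.com/b", 1, 0),
    ("http://shop.example.org/item", 1, 3),
    ("http://other.test/", 1, 1),
]

# Assumed filter: keep only sites under example.com or example.org.
SITE_PATTERN = re.compile(r"\.example\.(com|org)$")


def summarize(rows, pattern):
    """Total queries and ad shows per site, for sites matching `pattern`."""
    totals = defaultdict(lambda: {"queries": 0, "ad_shows": 0})
    for url, queries, ad_shows in rows:
        site = urlsplit(url).hostname  # extract the site field from the url field
        if site is None or not pattern.search(site):
            continue  # filter sites with the regex
        totals[site]["queries"] += queries
        totals[site]["ad_shows"] += ad_shows
    return dict(totals)


report = summarize(log, SITE_PATTERN)
print(json.dumps(report, sort_keys=True))  # output in JSON format
```

Each step in the slide's list maps to one line or block in `summarize`, which is the kind of decomposition a chain of SQL-like operators would express.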

22

Code in DQuery Mode

23

Rationales

24

Use Case Driven vs. Completeness

[Diagram labels: Problem (×4), Our Solution]

25

Internal DSL vs. External DSL

take advantage of:

parsers, libraries and VMs of the host languages

users and communities

language features

different from Pig, Hive, Sawzall, etc

26

Open/Closed Principle

“open for extension, closed for modification”

open for single machine algorithms, closed for distributed algorithms

also different from Pig, Hive, Sawzall, …
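One way to picture "open for single machine algorithms, closed for distributed algorithms" is a framework that owns the partition-and-group skeleton and exposes only a plug-in point for ordinary functions. The Python sketch below simulates this in one process. `run_grouped`, its parameters, and `median_latency` are hypothetical names, and the "partitions" are plain in-memory dictionaries rather than real cluster nodes.

```python
from collections import defaultdict


def run_grouped(records, key_fn, reduce_fn, num_partitions=3):
    """Fixed skeleton: partition records by key, then reduce each group.

    Callers cannot change how data is partitioned (the "closed" part);
    they only supply key_fn and reduce_fn (the "open" part).
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for rec in records:
        key = key_fn(rec)
        partitions[hash(key) % num_partitions][key].append(rec)
    results = {}
    for part in partitions:
        for key, group in part.items():
            results[key] = reduce_fn(group)  # user-supplied single-machine logic
    return results


# Extension point: any ordinary single-machine function can be plugged in.
def median_latency(group):
    """Middle element of the sorted values (upper median for even-sized groups)."""
    values = sorted(r["ms"] for r in group)
    return values[len(values) // 2]


data = [
    {"site": "a", "ms": 30},
    {"site": "a", "ms": 10},
    {"site": "a", "ms": 20},
    {"site": "b", "ms": 5},
]
result = run_grouped(data, key_fn=lambda r: r["site"], reduce_fn=median_latency)
print(result)
```

The user-written `median_latency` knows nothing about partitions, which is the point: new per-group logic can be added without touching the distribution code.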

27

Adoption

28

Users

…… ……

29

Usage

throughput/day: hundreds of TB

tasks/day: thousands

total tasks: > 1 million

30

Q&A

You are also welcome to contact me:
• Twitter: @acumon
• Email: chenxiaoming@baidu.com
• Gmail/Gtalk: acumoncxm@gmail.com

31

The End

THANK YOU!