Is Hadoop For You?

Gwen Shapira, Solutions Architect

Introduction to Hadoop for Oracle Database professionals. Presented at E4 conference.

Page 2: Is hadoop for you

About Me

• Solution Architect @ Cloudera
• Making our customers successful
• Formerly:
  • Database consultant @ Pythian
  • Specializing in Exadata, RAC, replication
• Oracle ACED, Oak Table Member
• @gwenshap <- Hadoop tips in 140 characters

Page 3: Is hadoop for you

Agenda

Answer the question: Who needs Hadoop?

Page 4: Is hadoop for you

In more detail…

[Bar chart: % of session per topic; axis 0% to 45%]

• Getting Started
• What you need to succeed
• When to Hadoop
• Basic Hadoop Architecture
• What's so special about Hadoop

Page 5: Is hadoop for you

What’s so special about Hadoop?

Technically Speaking

Page 6: Is hadoop for you

Databases in 1999

1. Buy a really big machine
2. Install an expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another really big machine as a backup

Page 7: Is hadoop for you

Problems:

• Reliability
• Scalability
• Storage throughput
• Complex Upgrades
• Relational only

Page 8: Is hadoop for you

Exadata: State of the Art - 2007

1. Storage and compute in one rack
2. Cluster with Infiniband interconnect
3. Balanced architecture
4. Offloading
5. Parallelism
6. Compression

Page 9: Is hadoop for you

Hadoop

• Distributed File System
• Programming Framework
• Many projects on top
• Open Source (This means free)

Page 10: Is hadoop for you

Designed For:

• Reliability
• Parallel Processing
• Scalability
• Flexibility

Page 11: Is hadoop for you

Reminders:

• Disk does a seek for each I/O operation
• Seeks are expensive (~10ms)
• Big I/Os mean better throughput
• Network is fast inside rack
• Slower between racks
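A rough worked example of why big I/Os mean better throughput. It uses the ~10 ms seek from the slide; the 100 MB/s sequential transfer rate is an assumption for illustration, not a figure from the talk.

```python
SEEK_SECONDS = 0.010          # ~10 ms per seek, from the slide
TRANSFER_MB_PER_SEC = 100.0   # assumed sequential transfer rate (hypothetical)


def effective_throughput(io_size_mb: float) -> float:
    """MB/s achieved when every I/O of the given size pays for one seek."""
    transfer_seconds = io_size_mb / TRANSFER_MB_PER_SEC
    return io_size_mb / (SEEK_SECONDS + transfer_seconds)


if __name__ == "__main__":
    for size_mb in (0.008, 1.0, 64.0):   # 8 KB, 1 MB, 64 MB
        print(f"{size_mb:>7} MB per I/O -> {effective_throughput(size_mb):6.1f} MB/s")
```

Under these assumptions an 8 KB I/O spends almost all of its time seeking (under 1 MB/s), a 1 MB I/O reaches 50 MB/s, and a 64 MB I/O runs at about 98% of the sequential rate.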

Page 12: Is hadoop for you

The File System

• Files are split into 64 MB blocks
• 64 MB!!!
• Distributed
• Replicated
• Write-Once
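A minimal sketch of what splitting a file into 64 MB blocks means in numbers. The function name is hypothetical, and the replication factor of 3 is an assumption chosen to match the three DataNodes in the write-path slide; this is not the HDFS API.

```python
from typing import List, Tuple

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, from the slide
REPLICATION = 3                 # assumed replication factor (hypothetical setting)


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) for each block; only the last block may be short."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    file_size = 1024 * 1024 * 1024 + 1   # 1 GB plus one byte
    blocks = split_into_blocks(file_size)
    print(len(blocks), "blocks; last block holds", blocks[-1][1], "byte(s)")
    print("raw bytes stored with replication:", file_size * REPLICATION)
```

A 1 GB file fills 16 blocks, so one extra byte costs a 17th block entry in the metadata, which is one reason a few large files suit this design better than many small ones.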

Page 13: Is hadoop for you

HDFS Architecture

[Diagram: one NameNode and four DataNodes]

NameNode – Metadata: paths, filenames, file sizes, block locations, …

Page 14: Is hadoop for you

HDFS Architecture

[Diagram: one NameNode and four DataNodes]

DataNode – Data: blocks, checksums

Page 15: Is hadoop for you

HDFS Write Path

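[Diagram: Client, NameNode, and DataNodes DN 1 to DN 4 across Rack 1 and Rack 2]

• Client → NameNode: create("/tmp/myfile")
• NameNode → Client: Write to [DN4,DN3,DN2]
• Pipeline: Client → DN4 [DN3,DN2] → DN3 [DN2] → DN2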
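The shrinking lists on the write-path slide ([DN4,DN3,DN2], then [DN3,DN2], then [DN2]) suggest each node stores the block and forwards it to the rest of the list. A toy in-memory sketch of that idea follows; `pipeline_write` and the `storage` dictionary are hypothetical names, and real HDFS clients do this over the network with acknowledgements that are not modeled here.

```python
from typing import Dict, List


def pipeline_write(block: bytes, pipeline: List[str], storage: Dict[str, List[bytes]]) -> List[str]:
    """Store the block on the first node, then forward it down the rest of the pipeline.

    Returns the nodes in the order they stored the block.
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)            # this node keeps its replica
    return [head] + pipeline_write(block, rest, storage)  # then forwards to the next node


if __name__ == "__main__":
    storage: Dict[str, List[bytes]] = {}
    order = pipeline_write(b"hello", ["DN4", "DN3", "DN2"], storage)
    print(order)
    print(sorted(storage))
```

The client only talks to the first node in the list, so its outbound bandwidth is spent once per block rather than once per replica.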

Page 16: Is hadoop for you

HDFS Read Path

[Diagram: Client, NameNode, and DataNodes DN 1 to DN 4 across Rack 1 and Rack 2]

• Client → NameNode: open("/tmp/myfile", "r")
• NameNode → Client: Read from [DN4,DN3,DN2]
• Client → DataNode: read data

Page 17: Is hadoop for you

Map-Reduce

• Java Framework
• Works on Key-Value pairs
• Map:
  • Operate on every element
  • Filter or transform
  • Code runs where the data is stored
• Shuffle:
  • Redistribution of data
• Reduce:
  • Aggregate or Join
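The three phases can be sketched in a few lines of single-process Python. This is an illustration of the data flow only, not the Hadoop Java API; `run_job`, `sales_mapper`, the "region,amount" record format, and the rule that drops non-positive amounts are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[Pair]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    # Map: operate on every element, emitting zero or more key-value pairs.
    pairs: List[Pair] = []
    for record in records:
        pairs.extend(mapper(record))
    # Shuffle: redistribute pairs so all values for one key are together.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: aggregate the values of each key.
    return {key: reducer(key, values) for key, values in groups.items()}


def sales_mapper(line: str) -> Iterable[Pair]:
    """Transform 'region,amount' into (region, amount); filter out non-positive amounts."""
    region, amount = line.split(",")
    value = int(amount)
    return [(region, value)] if value > 0 else []


def sum_reducer(key: str, values: List[int]) -> int:
    return sum(values)


if __name__ == "__main__":
    lines = ["east,10", "west,5", "east,-3", "east,7"]
    print(run_job(lines, sales_mapper, sum_reducer))   # {'east': 17, 'west': 5}
```

On a cluster the map loop runs in parallel on the nodes that hold the input blocks, and the shuffle moves data over the network; here both are plain loops.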

Page 18: Is hadoop for you

MapReduce Architecture

[Diagram: JobTracker and NameNode, with DataNodes DN 1 to DN 4 and TaskTrackers TT 1 to TT 4 across Rack 1 and Rack 2]

JobTracker:

• Gateway for users
• Assigns tasks to TaskTrackers
• Tracks job status

Page 19: Is hadoop for you

MapReduce Architecture

[Diagram: JobTracker and NameNode, with DataNodes DN 1 to DN 4 and TaskTrackers TT 1 to TT 4 across Rack 1 and Rack 2]

• TaskTrackers execute Map and Reduce tasks assigned by the JobTracker

Page 20: Is hadoop for you

Word Count Example

Page 21: Is hadoop for you

MapReduce Architecture

[Diagram: JobTracker and NameNode, with DataNodes DN 1 to DN 4 and TaskTrackers TT 1 to TT 4 across Rack 1 and Rack 2]

• Job: wordcount(<files>)
• Tasks: M1 M2 M3 M4 R1
• Map output: [cat, 1] [dog, 1] [the, 1] [sat, 1]

Page 22: Is hadoop for you

MapReduce Architecture

[Diagram: JobTracker and NameNode, with DataNodes DN 1 to DN 4 and TaskTrackers TT 1 to TT 4 across Rack 1 and Rack 2]

• Job: wordcount(<files>)
• Tasks: M5 M6 M7 M8 R1
• Reduce output: [a, 5] [cat, 2] [dog, 1] [the, 4] [mat, 1]

Page 23: Is hadoop for you

Compare to Oracle PX

• Mappers -> Producers
• Reducers -> Consumers
• Shuffle -> Re-distribution

Page 24: Is hadoop for you

In Short

Benefits

• Reliable
• Scalable
• Infinite Flexibility
• Cheap

Challenges

• New skills
• Infinitely Flexible
• Feature-completeness
• Best practices and examples

Page 25: Is hadoop for you

Use Cases

When to Hadoop?

Page 26: Is hadoop for you

When to Hadoop?

When Relational Databases Don’t Add Benefits

Page 27: Is hadoop for you

Non-relational Data

• XML
• Logs
• Geospatial data
• Video

Page 28: Is hadoop for you

Adding to the Data Warehouse

• ETL
• History
• Some reports
• Rocket Data Science

Page 29: Is hadoop for you

What you Need to Succeed

Page 30: Is hadoop for you

A Problem

Page 31: Is hadoop for you

Right Toolset

Page 32: Is hadoop for you

Toolset

Page 33: Is hadoop for you

Toolset for DBAs

• Hive – Turn SQL into Map-Reduce
• Streaming – Map-Reduce in any language
• Pig – Write and execute execution plans
• Oozie – Coordinate workflows
• Impala – Real-time SQL
• HBase – Key-value real-time data store
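To make "Map-Reduce in any language" concrete: with Streaming, the map and reduce steps are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. A minimal word-count sketch follows. The single-script layout and the `map`/`reduce` command-line argument are hypothetical conventions chosen for this example, not something Streaming requires.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same functions can be checked locally without a cluster by sorting the mapper output and feeding it to the reducer, which is roughly what the shuffle does between the two steps.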

Page 34: Is hadoop for you

Data Model

• Partitions
• Batch processing
• Star Schema
• Materialized Views
• Sort and Compress

• De-normalize
• Tune the data
• Nested data structures

Page 35: Is hadoop for you

Right Hardware

• If possible – POC with your workload
• Sizing by storage
• You probably need to over-provision
• Machine reliability
• Big Data Appliance is a good start

Page 36: Is hadoop for you

Non-technical Advice

• Your team will have to learn a lot
• Be ready for a challenge

Page 37: Is hadoop for you

Getting Started

Page 38: Is hadoop for you

Why get started?

• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don’t need to learn Java

Page 39: Is hadoop for you

VM Cloud Cluster

Page 40: Is hadoop for you

Books

Page 41: Is hadoop for you

More Books

Page 42: Is hadoop for you

Beginner Projects

• Install a 5-node Hadoop cluster in AWS
• Load data:
  • Complete works of Shakespeare
  • MovieLens database
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case: XML ingestion, ETL process, DWH history
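For the "10 most common words" project it can help to have a single-machine baseline to compare the cluster result against. A possible sketch using only the standard library follows; the tokenization rule (lowercase letters and apostrophes) is an assumption, and a different rule will give different counts.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent words, counted case-insensitively."""
    counts: Counter = Counter()
    for line in lines:
        # Assumed tokenization: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["To be, or not to be, that is the question:"]
    print(top_words(sample, 3))
```

If the Hadoop job and this baseline disagree on a small sample file, the difference is usually in how each side splits and normalizes words rather than in the counting.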

Page 43: Is hadoop for you

Need Help?

• I can help:
  • @gwenshap
  • [email protected]
• Hadoop Community:
  • http://community.cloudera.com
  • [email protected]
  • Google group: CDH Users
