
Page 1:


Distributed and Parallel Processing Technology

Chapter 1. Meet Hadoop

Sun Jo

Page 2:

Data!
We live in the data age. Estimates put the size of the "digital universe" at 0.18 ZB in 2006, forecasting a tenfold growth by 2011 to 1.8 ZB

• 1 ZB = 10²¹ bytes = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB

The flood of data is coming from many sources
The New York Stock Exchange generates 1 TB of new trade data per day

Facebook hosts about 10 billion photos taking up 1 PB (=1,000 TB) of storage

Internet Archive stores around 2 PB, and is growing at a rate of 20 TB per month

'Big Data' also affects smaller organizations and individuals
Digital photos and individuals' interactions – phone calls, emails, documents – are captured and stored for later access

The amount of data generated by machines will be even greater than that generated by people

Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions


Page 3:

Data!
Data can be shared for anyone to download and analyze
Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org

Astrometry.net project

• Watches the astrometry group on Flickr for new photos of the night sky

• Analyzes each image and identifies which part of the sky it is from

The project shows the kinds of things that are possible when data is made available and used for something that was not anticipated by its creator

Big Data is here. We are struggling to store and analyze it.


Page 4:

Data Storage and Analysis
Storage capacities have increased, but access speeds haven't kept up

• Writing is even slower!

Solution: Read and write data in parallel to/from multiple disks

Problems
• Hardware failure: keep redundant copies of the data in case of failure (replication, as RAID does)
• Combining the data on one disk with the data on the others for analysis

What Hadoop provides
• A reliable shared storage (HDFS)
• Efficient analysis (MapReduce)


                               1990          2010
1 drive stores                 1,370 MB      1 TB
Transfer speed                 4.4 MB/s      100 MB/s
Time to read a full drive      5 minutes     2 hours and 30 minutes
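As a quick back-of-the-envelope check of the table (a sketch, not from the slides; the 100-drive parallel scenario is an illustrative assumption for the "read and write in parallel" solution above):

```java
// Rough check of the 2010 column above: how long does it take to read
// a full 1 TB drive at 100 MB/s, alone vs. in parallel across 100 drives?
// (Illustrative figures only; 100 drives is an assumed example.)
public class DriveReadTime {
    public static void main(String[] args) {
        double terabyteInMb = 1_000_000.0; // 1 TB = 1,000,000 MB
        double mbPerSecond = 100.0;        // 2010-era transfer speed

        // 10,000 s, roughly the "2 hours and 30 minutes" in the table
        double oneDriveSeconds = terabyteInMb / mbPerSecond;
        double hundredDrivesSeconds = oneDriveSeconds / 100; // 100 s

        System.out.printf("1 drive:    %.0f s (~%.1f hours)%n",
                oneDriveSeconds, oneDriveSeconds / 3600);
        System.out.printf("100 drives: %.0f s (under 2 minutes)%n",
                hundredDrivesSeconds);
    }
}
```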

Page 5:

Comparison with Other Systems – RDBMS
RDBMS
• B-Tree index: optimized for accessing and updating a small proportion of records
MapReduce
• Efficient for updating most of a large dataset, using Sort/Merge to rebuild the database
• Good when you need to analyze the whole dataset in a batch fashion

Structured vs. Semi- or Unstructured Data
• Structured data (a particular, predefined schema) suits an RDBMS
• Semi-structured or unstructured data (a looser or no particular internal structure) suits MapReduce

Normalization
• To retain integrity and remove redundancy, relational data is often normalized
• MapReduce performs high-speed streaming reads and writes, so records that are not normalized are well suited to analysis with MapReduce


Page 6:

Comparison with Other Systems – RDBMS
RDBMS vs. MapReduce
Co-evolution of RDBMS and MapReduce systems
• Relational databases have started incorporating some of the ideas from MapReduce
• Higher-level query languages are being built on MapReduce
• Making MapReduce systems more approachable to traditional database programmers


Page 7:

Comparison with Other Systems – Grid Computing
Grid Computing
• The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing
• Using APIs such as the Message Passing Interface (MPI)
HPC
• Distributes the work across a cluster of machines, which access a shared filesystem hosted by a SAN
• Works well for compute-intensive jobs
• Hits a problem when nodes need to access larger data volumes (hundreds of GB), since network bandwidth is the bottleneck and compute nodes become idle

Data locality, the heart of MapReduce
• MapReduce co-locates the data with the compute node, so data access is fast because it is local
MPI vs. MapReduce
• MPI programmers need to handle the mechanics of the data flow themselves
• MapReduce programmers think in terms of functions over key-value pairs, and the data flow is implicit (see the sketch below)
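To make the last point concrete, here is a minimal word-count sketch (not part of the original slides) written against Hadoop's org.apache.hadoop.mapreduce API: the programmer supplies only two functions over key-value pairs, and the framework handles partitioning, shuffling, and sorting between them.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (byte offset, line of text) -> (word, 1) for every word in the line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emitting a pair is all the
                                          // "data flow" the programmer writes
            }
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count). Grouping all values
// for a key is done implicitly by the framework's shuffle and sort.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

Contrast this with MPI, where the same job would also have to specify which process sends which bytes to which other process, and when.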


Page 8:

Comparison with Other Systems – Grid Computing
Partial failure
• MapReduce is a shared-nothing architecture: tasks have no dependence on one another, so the order in which the tasks run doesn't matter and failed tasks can simply be rerun
• MPI programs have to manage their own checkpointing and recovery


Page 9:

Comparison with Other Systems – Volunteer Computing
Volunteer computing projects
• Break the problem into chunks called work units
• Send them to computers around the world to be analyzed
• Results are sent back to the server when the analysis is completed, and the client gets another work unit

SETI@home analyzes radio telescope data for signs of intelligent life outside Earth

SETI@home vs. MapReduce
SETI@home

• Very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world; volunteers are donating CPU cycles, not bandwidth

• Runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality

MapReduce
• Designed to run jobs that last minutes or hours on hardware running in a single data center with very high aggregate bandwidth interconnects


Page 10:

A Brief History of Hadoop
Hadoop
• Created by Doug Cutting, the creator of Apache Lucene, the text search library
• Has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project
• 'Hadoop' was the name Doug's kid gave to a stuffed yellow elephant toy

History
• In 2002, Nutch was started
  • A working crawler and search system emerged
  • Its architecture wouldn't scale to the billions of pages on the Web
• In 2003, Google published a paper describing the architecture of Google's distributed filesystem, GFS
• In 2004, the Nutch project implemented the GFS idea as the Nutch Distributed Filesystem, NDFS
• In 2004, Google published the paper introducing MapReduce
• In 2005, Nutch had a working MapReduce implementation
  • By the middle of that year, all the major Nutch algorithms had been ported to run using MapReduce and NDFS


Page 11:

A Brief History of Hadoop
History
• In Feb. 2006, Doug Cutting started an independent subproject of Lucene, called Hadoop
  • In Jan. 2006, Doug Cutting had joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale
• In Feb. 2008, Yahoo! announced that its search index was being generated by a 10,000-core Hadoop cluster
• In Apr. 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data
• In Nov. 2008, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds
• In May 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds


Page 12:

Apache Hadoop and the Hadoop Ecosystem
The Hadoop projects covered in this book are the following:
• Common – a set of components and interfaces for filesystems and I/O
• Avro – a serialization system for RPC and persistent data storage
• MapReduce – a distributed data processing model
• HDFS – a distributed filesystem running on large clusters of machines
• Pig – a data flow language and execution environment for large datasets
• Hive – a distributed data warehouse providing a SQL-like query language
• HBase – a distributed, column-oriented database
• ZooKeeper – a distributed, highly available coordination service
• Sqoop – a tool for efficiently moving data between relational databases and HDFS
