Hadoop Overview
May 30, 2008
Agenda
• What is Hadoop?
• Hadoop Architecture
• HDFS
• MapReduce
• HBase
• MapReduce Programming
Hadoop
• History
  • 2005: Grew out of the distributed-scaling problem of the Nutch open source search engine (inspired by Google's GFS, BigTable, and MapReduce)
  • 2006: Full backing from Yahoo
  • 2008: Promoted to an Apache top-level project
  • Current release: 0.17.0
• Characteristics
  • Java-based
  • Apache License
  • Many components (HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, Mahout, etc.)
Hadoop Usage
• Nutch: Open Source Web Search Software
• Yahoo!
  • ~10,000 machines running Hadoop
  • Porting ~100 webmap applications to MapReduce
• The New York Times: Times Machine
  • EC2/S3/Hadoop
  • Large TIFF images (405,000), articles (3,300,000), metadata (405,000) -> 810,000 PNG data
Hadoop Architecture
• MapReduce: Distributed Programming Model
• HBase: Distributed Database (BigTable in Google)
• HDFS: Hadoop Distributed FileSystem (GFS in Google)
• All layers run on a commodity PC cluster
HDFS
• User-level distributed file system
• Non-standard file system interface
• Master/Slave (namenode and datanodes)
• Replication (3 copies)
• Large chunk size (64MB)
• No chunk caching (metadata is cached, however)
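The chunk size and replication figures above can be made concrete with a small sketch. This is not HDFS code: `split_into_blocks`, `place_replicas`, the round-robin placement, and the datanode names are hypothetical, chosen only to show how a file maps onto 64 MB blocks with 3 copies each.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as on the slide
REPLICATION = 3                # 3 copies, as on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    count = (file_size + block_size - 1) // block_size  # ceiling division
    return [(i * block_size, min(block_size, file_size - i * block_size))
            for i in range(count)]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Hypothetical policy for illustration only; the real namenode applies
    its own placement rules.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    return {b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)}


# A 200 MB file stored on a hypothetical 4-node cluster.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

The last block is shorter than 64 MB because the file size is not a multiple of the chunk size; only the metadata (the `placement` map here) would live on the namenode.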
Architecture: Write Operation
[Figure: a Client, the Namenode, and three DataNodes. The Client exchanges metadata with the Namenode; data reaches the DataNodes by pipelined data transfer.]
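Why pipelining helps on the write path can be shown with a tick-by-tick toy simulation. It assumes each hop takes one tick per packet and that a datanode forwards a packet on the tick after it receives it; `pipelined_write` and the node names are hypothetical and do not reflect the actual HDFS protocol.

```python
from collections import deque


def pipelined_write(packets, pipeline):
    """Simulate a client streaming packets through a chain of datanodes.

    Returns (stores, ticks): what each node ended up storing, and how many
    ticks the whole transfer took.
    """
    stores = {node: [] for node in pipeline}
    pending = deque(packets)
    arriving = [None] * len(pipeline)  # arriving[i]: packet reaching pipeline[i] this tick
    ticks = 0
    # Keep going while the client has data or a packet still has to be forwarded.
    while pending or any(p is not None for p in arriving[:-1]):
        head = pending.popleft() if pending else None
        arriving = [head] + arriving[:-1]  # every packet moves one hop down the chain
        for node, packet in zip(pipeline, arriving):
            if packet is not None:
                stores[node].append(packet)
        ticks += 1
    return stores, ticks


stores, ticks = pipelined_write(["p1", "p2", "p3", "p4"], ["dn1", "dn2", "dn3"])
```

Under these assumptions, 4 packets and 3 replicas finish in 4 + 3 - 1 = 6 ticks, whereas sending the full data to each replica in turn would take 4 x 3 = 12.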
HBase
• Distributed database modeled on Bigtable
• Column-oriented store
• Goal of billions of rows x millions of columns
• Petabytes of data across thousands of servers
• Not a SQL database
DataModel
• Table of rows x columns, with a timestamp on each cell
• Sparse table
• Column-based DB
DataModel
• Physical storage view: data is stored per column family
• Column name: <Family>:<Label>
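The data model can be pictured as a nested map: row, then `<Family>:<Label>` column, then timestamp. The sketch below is a toy in-memory model, not the HBase API; the `SparseTable` class, its methods, and the sample row key are all hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy model: row -> "family:label" -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)
        # Only cells that were written exist, which is what makes the table sparse.
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        family = column.partition(":")[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def family_view(self, family):
        """Cells of one column family, mimicking per-family physical storage."""
        prefix = family + ":"
        return {(row, col, ts): value
                for row, cols in self.rows.items()
                for col, versions in cols.items() if col.startswith(prefix)
                for ts, value in versions.items()}


table = SparseTable(["contents", "anchor"])
table.put("com.example.www", "contents:", "<html>v1", 1)
table.put("com.example.www", "contents:", "<html>v2", 2)
table.put("com.example.www", "anchor:news", "Example", 1)
```

Reading `contents:` without a timestamp returns the latest version; passing `timestamp=1` returns the older one. Rows that never wrote a column simply have no entry for it.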
MapReduce
• Programming model and implementation for parallel processing of large data sets
• Automatic parallelism & distribution
• Fault-tolerant
• Clean abstraction for programmers
• map & reduce functions
map (k1, v1) -> list (k2, v2)
reduce (k2, list (v2)) -> list (v2)
Example: Word Count
• map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");
• reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
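The word count pseudocode can be run on a single machine with a few lines of Python. This is a minimal sketch: `run_mapreduce` is a hypothetical in-memory driver that stands in for the framework (it maps, groups by key, then reduces), and has nothing to do with Hadoop's own API.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name, input_value: document contents
    for word in input_value.split():
        yield word, "1"


def reduce_fn(output_key, intermediate_values):
    # output_key: a word, intermediate_values: a list of counts (as strings)
    result = 0
    for v in intermediate_values:
        result += int(v)
    yield str(result)


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process driver: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    return {k2: list(reducer(k2, values)) for k2, values in sorted(groups.items())}


documents = [("doc1", "to be or not to be"), ("doc2", "to do")]
counts = run_mapreduce(documents, map_fn, reduce_fn)
```

The grouping step in the middle is the part the framework provides for free; the programmer writes only `map_fn` and `reduce_fn`.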
Data Processing Flow
J Dean, S Ghemawat “MapReduce: Simplified Data Processing on Large Clusters”
Parallelization
J Dean, S Ghemawat “MapReduce: Simplified Data Processing on Large Clusters”
Hadoop MapReduce Architecture
[Figure: a JobClient writes the input to HDFS and submits a job to the JobTracker, which holds the job queue (job #1, input list). TaskTrackers on Node #1 through Node #n, each co-located with HDFS, send heartbeats to the JobTracker and receive task allocations (t1, t2, t3).]
Features
• Mapper locality
• Overlap of maps, shuffle, sort
• Speculative execution
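Mapper locality means a map task is preferably run on a node that already stores its input block, so the data does not cross the network. A minimal sketch of that preference follows; `pick_task`, the task ids, and the node names are hypothetical, and real JobTracker scheduling involves considerably more than this.

```python
def pick_task(tracker_node, pending_tasks, block_locations):
    """Choose a task for the TaskTracker running on tracker_node.

    Prefers a task whose input block has a replica on that node; otherwise
    falls back to the first pending task. Returns None if nothing is pending.
    """
    for task in pending_tasks:
        if tracker_node in block_locations.get(task, ()):
            return task  # data-local: the input is read from local disk
    return pending_tasks[0] if pending_tasks else None


# Hypothetical replica locations of each task's input block.
block_locations = {"t1": ["node1", "node2"], "t2": ["node3"], "t3": ["node2"]}
local_choice = pick_task("node3", ["t1", "t2", "t3"], block_locations)
remote_choice = pick_task("node9", ["t1", "t2", "t3"], block_locations)
```

Here `node3` gets `t2` because it holds that block, while `node9` holds nothing and falls back to the head of the queue.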
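Pros and Cons
• Not suited to every job.
• Best suited to work that partitions cleanly; hard to apply when the distributed pieces need to communicate or share data with each other.
• Does not guarantee an optimal solution.
• Even so, most problems can be expressed in it, implementation is easy, and reusability is high.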
MapReduce Implementation Example
Doug Cutting “MapReduce in Nutch”
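MapReduce Implementation Example
Main Nutch modules
• Inject: converts additional URLs to crawl (seed URLs) into CrawlDB format
• Generate: selects the URLs to fetch from the CrawlDB
• Fetch: retrieves the content of the selected URLs
• Parse: parses the fetched content
• Invert links: finds the inlinks of every URL
• Index: builds the index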
Doug Cutting “MapReduce in Nutch”
Fetch
• Input: (url, CrawlDatum)
• Map: (url, CrawlDatum) → (url, (CrawlDatum, Content)); fetches the URL through the protocol module
• Reduce: identity
• Output: (url, CrawlDatum), (url, Content)
Doug Cutting “MapReduce in Nutch”
Invert Links
• For every URL, compute the URLs that point to it (inlinks)
• Input: (srcUrl, ParseData) (ParseData holds the page's outlinks)
• Map: (srcUrl, ParseData) → (destUrl, inlink)*; for every destUrl in ParseData, collect an inlink whose value is srcUrl
• Reduce: (destUrl, inlink*) → (destUrl, inlinks); merges the inlinks that share the same destUrl
• Output: (destUrl, inlinks)*
Doug Cutting “MapReduce in Nutch”
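The link inversion job can be sketched in a few lines. This is an illustration, not Nutch code: ParseData is reduced to a plain list of outlinks, and `invert_map`, `invert_reduce`, and the in-memory grouping in `invert_links` are hypothetical stand-ins for the framework.

```python
from collections import defaultdict


def invert_map(src_url, outlinks):
    # (srcUrl, ParseData) -> (destUrl, inlink)*, where the inlink is srcUrl
    for dest_url in outlinks:
        yield dest_url, src_url


def invert_reduce(dest_url, inlinks):
    # (destUrl, inlink*) -> (destUrl, inlinks): merge and de-duplicate
    return dest_url, sorted(set(inlinks))


def invert_links(pages):
    """Run map, group by destUrl, then reduce, all in memory."""
    grouped = defaultdict(list)
    for src_url, outlinks in pages.items():
        for dest_url, inlink in invert_map(src_url, outlinks):
            grouped[dest_url].append(inlink)
    return dict(invert_reduce(dest, inlinks) for dest, inlinks in grouped.items())


# Hypothetical three-page web: each key lists the pages it links to.
pages = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
inlinks = invert_links(pages)
```

Page `c` ends up with inlinks from both `a` and `b`, which is the kind of input a later ranking or anchor-text step needs.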
PageRank
• PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...
  PR(B) = PR(C)/L(C) + PR(F)/L(F) + ...
• Map: (url, (PR, outlinks)) → (outlink, PR/N)*; each page divides its PageRank score by N = L(url), its number of outlinks, and hands one share to each outlink
• Reduce: (url, PRs*) → (url, PR); each page sums the PageRank shares it received to obtain its new PR
• Iterate until convergence (requires an initial state and a damping factor)
Michael Kleber, “What is MapReduce?”
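The iteration can be sketched as repeated map and reduce passes over an in-memory graph. Several things here are assumptions not stated on the slide: the damping formula (1 - d)/n + d x sum of shares, the uniform initial state, d = 0.85, the stopping tolerance, and the requirement that every page has at least one outlink. The function names are hypothetical.

```python
from collections import defaultdict


def pagerank_map(url, pr, outlinks):
    # (url, (PR, outlinks)) -> (outlink, PR/N)*
    share = pr / len(outlinks)
    for outlink in outlinks:
        yield outlink, share


def pagerank_reduce(url, shares, damping, num_pages):
    # (url, PRs*) -> new PR; the damping term is an assumed, commonly used form
    return (1.0 - damping) / num_pages + damping * sum(shares)


def pagerank(graph, damping=0.85, tol=1e-8, max_iters=200):
    """Iterate map + reduce until the ranks stop changing (or max_iters)."""
    n = len(graph)
    ranks = {url: 1.0 / n for url in graph}  # assumed uniform initial state
    for _ in range(max_iters):
        received = defaultdict(list)
        for url, outlinks in graph.items():
            for dest, share in pagerank_map(url, ranks[url], outlinks):
                received[dest].append(share)
        new_ranks = {url: pagerank_reduce(url, received[url], damping, n)
                     for url in graph}
        delta = sum(abs(new_ranks[u] - ranks[u]) for u in graph)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


# Hypothetical graph; every page has at least one outlink.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

On a cluster each pass would be one MapReduce job whose output feeds the next, which is why the number of iterations to convergence matters for total cost.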
Q&A