20141111 Hadoop MR Programming with Python
HadoopStreaming
IT Franchise Development Team, Tae-young Lee
2014.11.11
5th study session
Developing MR with Python
• Developers and teams can write MR jobs in a language they already know
• Libraries provided by that language can be used
• Data is exchanged over standard I/O - slower than Java MR
• But what if it guarantees development productivity?
HadoopStreaming works with any language that supports standard input/output (stdio)
※ Two elements must be defined:
1. An executable Mapper file implementing the Map function
2. An executable Reducer file implementing the Reduce function
HadoopStreaming
MapReduce
1. MAP's role - read input data from standard input
2. MAP's role - emit Key, Value pairs to standard output
3. REDUCER's role - read MAP's <Key, Value> output from standard input
4. REDUCER's role - emit Key, Value pairs to standard output
[Diagram] Data input (file reads, pipes, streaming, etc.) → Python Map processing → PIPE → Python Reduce processing → MR result output
Python Installation: Standard I/O Data Mapper Example
1. Download 2.7.8 from the python site and extract it
2. In the account's home directory, create a symbolic link named python pointing to the Python directory
3. ./configure
4. make
Typing the python command is tedious too, so shorten it to py
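The abbreviation above can be done with a symbolic link. A minimal sketch; the slides put the interpreter under /home/hadoop/python, but the paths here are illustrative and the script locates whatever interpreter is installed:

```shell
# Sketch: expose the python binary under the shorter name 'py'.
# Paths are illustrative; the slides assume /home/hadoop/python.
PYBIN=$(command -v python3 || command -v python)  # locate an installed interpreter
mkdir -p "$HOME/python"
ln -sf "$PYBIN" "$HOME/python/py"                 # symlink gives the short name
export PATH="$PATH:$HOME/python"                  # make 'py' reachable from anywhere
py -c 'print("Hello BC")'
```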
hello.py
print 'Hello BC'
Typing py every time is tedious. Make the Python script itself an executable!
#!/home/hadoop/python/py
print 'Hello BC'
hello.py
[hadoop@big01 ~]$ chmod 755 hello.py
[hadoop@big01 ~]$ ./hello.py
Hello BC
[hadoop@big01 ~]$ py hello.py
Hello BC
Python Execution: Running the Hello BC Example
#! (shebang)
#!/home/hadoop/python/py
import sys
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
mapper.py
Python MAP: Standard I/O Mapper Execution Example
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1
bc	1
bc	1
bc	1
bc	1
card	1
card	1
it	1
Sorted by the first field
Python MAP: Sorting the Mapper Output
#!/home/hadoop/python/py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '{0}\t{1}'.format(current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print '{0}\t{1}'.format(current_word, current_count)
reducer.py
If it equals the current word, add to the count
If the current word is not None, print the M/R result
Set the new current word
Handles the last line
Python REDUCE: Standard I/O Reducer Example
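The grouping logic in reducer.py depends on the shuffle sort: identical keys must arrive consecutively. A small sketch of the same logic as a plain function (Python 3 print; the function name streaming_reduce is ours, not from the slides) shows what happens when the input is not sorted:

```python
# Sketch: reducer.py's grouping logic as a plain function.
# It assumes identical keys arrive consecutively - exactly what sort guarantees.
def streaming_reduce(lines):
    result = []
    current_word = None
    current_count = 0
    word = None
    for line in lines:
        word, count = line.strip().split('\t', 1)
        count = int(count)
        if current_word == word:
            current_count += count       # same key: accumulate
        else:
            if current_word:
                result.append((current_word, current_count))  # flush previous key
            current_word, current_count = word, count
    if current_word == word:
        result.append((current_word, current_count))          # flush last key
    return result

print(streaming_reduce(['bc\t1', 'bc\t1', 'card\t1']))    # [('bc', 2), ('card', 1)]
print(streaming_reduce(['bc\t1', 'card\t1', 'bc\t1']))    # 'bc' splits into two groups
```

On unsorted input the same key is emitted twice, which is why the mapper output is always sorted before reducing.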
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1 | ./reducer.py
bc	4
card	2
it	1
Python ♥ Hadoop: HadoopStreaming
1. In HadoopStreaming, the mapper/reducer must be specified as executable scripts.
[OK] hadoop jar hadoop-streaming*.jar -mapper map.py -reducer reduce.py …
[NO] hadoop jar hadoop-streaming*.jar -mapper python map.py -reducer python reduce.py …
2. Set the directory PATH so the Python scripts are accessible from anywhere
Conditions

If you don't, you get:
Caused by: java.lang.RuntimeException: configuration exception
	at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
	at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
	... 22 more
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
	... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 24 more
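What "executable" means in practice can be checked locally before submitting a job. A sketch; the env-based shebang and the file name mapper_demo.py are our own choices for portability, not from the slides:

```shell
# Sketch: a streaming script must run on its own, without a 'python' prefix.
cat > mapper_demo.py <<'EOF'
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    sys.stdout.write(line)
EOF
chmod 755 mapper_demo.py          # the executable bit HadoopStreaming relies on
echo "bc card" | ./mapper_demo.py # runs directly - this is what streaming does
```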
Python ♥ Hadoop: HadoopStreaming
hadoop jar hadoop-streaming-2.5.1.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
Location of HadoopStreaming in Hadoop 2.x
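The /bin/cat + /usr/bin/wc pair in the command above can be dry-run locally to see what such a job computes (the file name demo_input.txt is illustrative; the sort stands in for the shuffle phase):

```shell
# Sketch: simulate the cat/wc streaming job on the local shell.
# mapper = /bin/cat (identity), reducer counts via /usr/bin/wc.
printf 'bc card\nbc it\n' > demo_input.txt
cat demo_input.txt | /bin/cat | sort | /usr/bin/wc -l   # line count of the input
```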
hadoop command [genericOptions] [streamingOptions]
Python ♥ Hadoop: HadoopStreaming Command Options
Parameter | Optional/Required | Description
-input directoryname or filename | Required | Input location for mapper
-output directoryname | Required | Output location for reducer
-mapper executable or JavaClassName | Required | Mapper executable
-reducer executable or JavaClassName | Required | Reducer executable
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value | Optional | Pass environment variable to streaming commands
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose | Optional | Verbose output
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks | Optional | Specify the number of reducers
-mapdebug | Optional | Script to call when map task fails
-reducedebug | Optional | Script to call when reduce task fails
hadoop command [genericOptions] [streamingOptions]
Python ♥ Hadoop: HadoopStreaming Generic Options
Parameter | Optional/Required | Description
-conf configuration_file | Optional | Specify an application configuration file
-D property=value | Optional | Use value for given property
-fs host:port or local | Optional | Specify a namenode
-files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars | Optional | Specify comma-separated jar files to include in the classpath
-archives | Optional | Specify comma-separated archives to be unarchived on the compute machines
hadoop command [genericOptions] [streamingOptions]
Usage example
hadoop jar hadoop-streaming-2.5.1.jar \
-D mapreduce.job.reduces=2 \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
Python ♥ Hadoop: Running HadoopStreaming: WordCount
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
-input alice -output wc_alice \
-mapper mapper.py -reducer reducer.py \
-file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:43 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: number of splits:2
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416242552451_0009
14/11/11 23:51:44 INFO impl.YarnClientImpl: Submitted application application_1416242552451_0009
14/11/11 23:51:44 INFO mapreduce.Job: The url to track the job: http://big01:8088/proxy/application_1416242552451_0009/
14/11/11 23:51:44 INFO mapreduce.Job: Running job: job_1416242552451_0009
14/11/11 23:51:53 INFO mapreduce.Job: Job job_1416242552451_0009 running in uber mode : false
14/11/11 23:51:53 INFO mapreduce.Job: map 0% reduce 0%
14/11/11 23:52:05 INFO mapreduce.Job: map 100% reduce 0%
14/11/11 23:52:13 INFO mapreduce.Job: map 100% reduce 100%
14/11/11 23:52:13 INFO mapreduce.Job: Job job_1416242552451_0009 completed successfully
14/11/11 23:52:13 INFO mapreduce.Job: Counters: 49
File System Counters …..
Python ♥ Hadoop: Checking the HadoopStreaming Results
Opening part-00000 shows:
….
you'd	8
you'll	4
you're	15
you've	5
you,	25
you,'	6
you--all	1
you--are	1
you.	1
you.'	1
you:	1
you?	2
you?'	7
young	5
your	62
yours	1
yours."'	1
yourself	5
yourself!'	1
yourself,	1
yourself,'	1
yourself.'	2
youth,	3
youth,'	3
zigzag,	1
Python ♥ Hadoop: HadoopStreaming Example: Improving WordCount
#!/home/hadoop/python/py
import sys
import re

for line in sys.stdin:
    line = line.strip()
    line = re.sub('[=.#/?:$\'!,"}]', '', line)
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
Modified mapper.py
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
-input alice -output wc_alice2 \
-mapper mapper.py -reducer reducer.py \
-file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
Regular expression: strip special characters
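The character class in the revised mapper can be sanity-checked on a sample line. A sketch (the sample string is ours; Python 3 print, while the slides use Python 2 syntax):

```python
# Sketch: verify which characters the mapper's regex strips.
import re

line = "you'd said: \"Alice,\" yet?"
# Same character class as the revised mapper.py: = . # / ? : $ ' ! , " }
cleaned = re.sub('[=.#/?:$\'!,"}]', '', line)
print(cleaned)  # youd said Alice yet
```

This is why entries like you'd and you, from the first run collapse into a single youd / you count in wc_alice2.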
Python ♥ Hadoop: Checking the HadoopStreaming Results
Opening part-00000 in wc_alice2 shows:
…..
ye;	1
year	2
years	2
yelled	1
yelp	1
yer	4
yesterday	3
yet	18
yet--Oh	1
yet--and	1
yet--its	1
you	357
you)	1
you--all	1
you--are	1
youd	8
youll	4
young	5
your	62
youre	15
yours	2
yourself	10
youth	6
youve	5
zigzag	1
The end.