Introduction to Spark, Part 2
by jinho-yoo
Lightning-fast cluster computing
A quick review
Problem
Solution
MapReduce?
Turn everything into MapReduce!
But how would you turn SQL like this into MapReduce?

SELECT LAT_N, CITY, TEMP_F
FROM STATS, STATION
WHERE MONTH = 7 AND STATS.ID = STATION.ID
ORDER BY TEMP_F;
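To see why this is awkward, here is a rough plain-Python sketch of the map/shuffle/reduce steps the join above would require (the table data below is invented for illustration; in real MapReduce each phase would be a separate distributed job):

```python
# A rough sketch of the SQL join above as map/shuffle/reduce steps,
# using plain Python lists in place of HDFS files (sample data is invented).
from collections import defaultdict

stats = [  # (ID, MONTH, TEMP_F)
    (1, 7, 80.5), (2, 7, 62.1), (1, 6, 70.0),
]
station = [  # (ID, CITY, LAT_N)
    (1, "Seoul", 37.5), (2, "Busan", 35.1),
]

# Map phase: filter MONTH = 7 and tag each record with its join key (ID)
# plus a marker saying which table it came from.
mapped = [(sid, ("stats", month, temp)) for sid, month, temp in stats if month == 7]
mapped += [(sid, ("station", city, lat)) for sid, city, lat in station]

# Shuffle phase: group records by join key.
groups = defaultdict(list)
for key, record in mapped:
    groups[key].append(record)

# Reduce phase: join the two sides within each group.
rows = []
for records in groups.values():
    stats_side = [r for r in records if r[0] == "stats"]
    station_side = [r for r in records if r[0] == "station"]
    for _, month, temp in stats_side:
        for _, city, lat in station_side:
            rows.append((lat, city, temp))

# ORDER BY TEMP_F needs yet another pass (a second MapReduce job in practice).
rows.sort(key=lambda r: r[2])
print(rows)
```

Even this toy version takes three distinct phases plus an extra sort job, which is exactly the pain the slide is pointing at.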
Turn everything into MapReduce!
And what about a machine learning / data analysis task like this?
"From the nationwide real-estate transaction data published monthly since 2007, pick out the 5 meaningful variables among the 140 that could have an effect." "Oh, and the deadline is tomorrow."
And the code? This much, just to count words…

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Tools naturally get better as time goes by.
Generality
Its high-level tools let you handle all of these jobs directly.
It's easy to use.
It supports Java, Scala, and Python.
text_file = spark.textFile("hdfs://...")
text_file.flatMap(lambda line: line.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda a, b: a + b)
Word count in Spark's Python API
It runs in all kinds of distributed environments.
● It runs on Hadoop, on Mesos, standalone, or in the cloud.
● It can also pull data from HDFS, Cassandra, HBase, S3, and more.
It's fast, too.
Up to 100x faster than Hadoop MapReduce when run in memory, and 10x faster when run on disk.
Logistic regression in Hadoop and Spark
It even has its own web UI….
So, about Spark:
● It's a tool, not a library.
○ You define the jobs you want to do on top of this tool,
○ and then run them.
Starting from standalone mode
Part 2: Let's try it!
vagrant up / vagrant ssh
spark-shell
pyspark: the Python Spark shell
Word count in Scala:
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
val f = sc.textFile("README.md")

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

Reads a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

f.flatMap(l => l.split(" ")): split each line into individual words
.map(word => (word, 1)): turn each word into a (word, 1) key-value pair
.reduceByKey(_ + _): sum the values up, key by key
scala> wc.take(20)
…
finished: take at <console>:26, took 0.081425 s
res6: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1))
wc.saveAsTextFile("wc_out.txt")
Save the result to a file.
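For readers without a Spark shell at hand, the flatMap / map / reduceByKey pipeline above can be mimicked with plain Python collections. This is a sketch of the semantics only; real RDDs are partitioned across the cluster and evaluated lazily:

```python
# Plain-Python sketch of the flatMap / map / reduceByKey pipeline above.
from collections import defaultdict

lines = ["spark is fast", "spark is general"]

# flatMap(l => l.split(" ")): split each line into words, flatten the result.
words = [w for l in lines for w in l.split(" ")]

# map(word => (word, 1)): pair each word with a count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the values for each key.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```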
What if we ran the code we just wrote like this?
(Slides: Simplifying Big Data Analysis with Apache Spark, Matei Zaharia, April 27, 2015)
Let's move from disk to memory.
In other words, all we have to do is hand out the work and the commands to each node in the cluster like this.
Spark Model
● You write a program that transforms data, step by step.
● Resilient Distributed Datasets (RDDs)
○ Collections of objects, kept in memory or on disk, shipped out across the cluster
○ Built up from parallel transformations (map, filter, …)
○ Automatically rebuilt when a failure occurs
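The automatic rebuilding works through lineage: an RDD stores the chain of transformations that produced it rather than the results themselves, so any lost partition can be recomputed from the source. Here is a toy single-machine illustration of that idea (the ToyRDD class is invented for illustration, not a Spark API):

```python
# Toy illustration of RDD lineage: store the recipe (source + transformation
# chain) instead of the results, so any lost result can be recomputed.
class ToyRDD:
    def __init__(self, source, transforms=()):
        self.source = source          # the original data (or how to load it)
        self.transforms = transforms  # lineage: the chain of functions applied

    def map(self, fn):
        # Record the transformation instead of applying it immediately (laziness).
        return ToyRDD(self.source, self.transforms + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.source, self.transforms + (("filter", fn),))

    def collect(self):
        # Materialize by replaying the lineage; after a failure, the same
        # replay rebuilds any lost partition from the source data.
        data = list(self.source)
        for kind, fn in self.transforms:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())
```

Note that map and filter return new ToyRDDs without touching the data; only collect does work, which mirrors Spark's transformation/action split.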
(Slides: Making interactive Big Data Applications Fast AND Easy, Holden Karau)
Each line of code creates an RDD!
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
Supported operations
Built-in libraries
● A wide range of functionality made available as RDD operations
● Caching + the DAG model are efficient enough to run these workloads
● Bundling all the libraries into a single program is faster.
MLlib
Vectors, Matrices = RDD[Vector]
Iterative computation

points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)
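To give a feel for the iterative computation that KMeans.train performs, here is a minimal single-machine sketch of the k-means loop on toy 1-D points (MLlib's real implementation distributes these steps over RDD partitions):

```python
# Minimal single-machine sketch of the k-means loop that MLlib distributes:
# assign each point to the nearest centroid, then move each centroid to the
# mean of its assigned points. Toy 1-D data, k = 2, for illustration.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids = [0.0, 10.0]  # initial centers

for _ in range(10):  # a fixed number of iterations keeps the sketch simple
    # Assignment step: find the nearest centroid for each point.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster
    # (keep the old centroid if a cluster ends up empty).
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

print(centroids)
```

Each iteration is a full pass over the data, which is exactly why caching the points RDD in memory pays off so much compared to re-reading from disk.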
GraphX
Represents graphs as RDDs of vertices and edges.
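The "graphs as RDDs" data model can be pictured with two plain collections, one of (id, attribute) vertices and one of (src, dst) edges; for example, out-degrees fall out of the same sum-by-key pattern used for word count (a toy sketch, not GraphX's actual API):

```python
# Toy sketch of GraphX's data model: a graph is just two collections,
# one of (id, attribute) vertices and one of (src, dst) edges.
from collections import defaultdict

vertices = [(1, "alice"), (2, "bob"), (3, "carol")]
edges = [(1, 2), (1, 3), (2, 3)]  # (src, dst) pairs

# Out-degree per vertex: map each edge to (src, 1), then sum by key,
# the same reduceByKey pattern used for word count earlier.
out_degree = defaultdict(int)
for src, dst in edges:
    out_degree[src] += 1

print(dict(out_degree))
```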
Conclusion
We want to unify all of your data sources, workloads, and environments.
Q&A