20171012 found IT #9 PySparkの勘所 (Key Points of PySpark)
PySpark found IT project #9
▸ facebook: Ryuji Tamagawa
▸ Twitter: tamagawa_ryuji
▸ Spark and Hadoop
▸ PySpark
▸ Spark/Hadoop and PyData
PySpark
▸ SSD
▸ CPU
▸ Parquet / S3
https://www.slideshare.net/kumagi/ss-78765920/4
▸ groupby
▸ Spark API
Spark and Hadoop
▸ Hadoop 0.x: OS / HDFS / MapReduce
▸ Hadoop 1.x: OS / HDFS / MapReduce, with Hive, HBase, etc. layered on top
▸ Hadoop 2.x + Spark: OS / HDFS / YARN (or Mesos), running Spark (Spark Streaming, MLlib, GraphX, Spark SQL), Impala (SQL), and others
▸ Spark also runs outside Hadoop (e.g. on Windows)
▸ Amazon EMR
▸ Microsoft Azure HDInsight
▸ Cloudera Altus
▸ Databricks Community Edition (Spark)
▸ PyData + Jupyter + PySpark
Spark and Hadoop: execution models
▸ Hadoop MapReduce: each stage launches a map JVM and a reduce JVM, and intermediate results pass through HDFS between stages
▸ Spark: a long-lived Executor JVM applies a chain of functions (f1 … f7) to RDDs, touching HDFS only at the start and the end
▸ MapReduce vs. Spark: RDDs keep intermediate data in memory between steps
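The contrast between the two execution models can be sketched in plain Python (the functions f1–f3 and the data are illustrative, not from the slides): MapReduce-style processing materializes each stage's full output, the way intermediate results land on HDFS between JVMs, while Spark-style processing composes the functions and runs them in a single pass inside one long-lived process.

```python
data = range(5)
f1 = lambda x: x + 1
f2 = lambda x: x * 2
f3 = lambda x: x - 3

# MapReduce style: each stage materializes its full output
# (in Hadoop, these intermediates go through HDFS between JVMs)
stage1 = [f1(x) for x in data]
stage2 = [f2(x) for x in stage1]
mapreduce_result = [f3(x) for x in stage2]

# Spark style: transformations are composed and executed lazily,
# in one pass, with no intermediate collections
def pipeline(xs):
    for x in xs:
        yield f3(f2(f1(x)))

spark_style_result = list(pipeline(data))
```

Both produce the same result; the difference is where the intermediate data lives.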
Spark
▸ Much faster than Hadoop MapReduce
▸ The Spark API is higher-level than the MapReduce API
▸ Runs on Hadoop (YARN)
PySpark
(Py)Spark
▸ Spark
▸ PyData
▸ Spark and Hadoop
Spark 1.2 PySpark …
RDD API and DataFrame API
▸ RDD (Resilient Distributed Dataset): the distributed collection at Spark's core; its API follows Java/Scala idioms
▸ DataFrame: a tabular API built on top of RDDs, analogous to pandas / R's data.frame
▸ In Python, the DataFrame API performs on par with Scala / Java, while the RDD API does not
PySpark: prefer the DataFrame API
▸ RDD → DataFrame / Dataset
▸ MLlib → ML
▸ GraphX → GraphFrames
▸ Spark Streaming → Structured Streaming
PySpark with the RDD API: the Driver JVM coordinates Executor JVMs on the Worker nodes, and each Executor hands records to a Python VM on the same node, so data crosses the JVM / Python boundary for every operation.
PySpark with the DataFrame API: the Driver JVM sends the query plan to the Executor JVMs, which read from Storage and execute entirely inside the JVM; the Python VMs on the Worker nodes are not involved.
PySpark
▸ With the RDD API, data is serialized back and forth between the Executor JVM and a Python VM
▸ With the DataFrame API, processing stays inside the JVM
▸ Python UDFs pull data into the Python VM again
▸ Writing UDFs in Scala / Java avoids that cost
▸ With Spark 2.x, use the DataFrame API
Spark and PyData
▸ Spark
▸ Python / PyData
▸ Parquet
▸ Apache Arrow
File formats
▸ CSV / JSON
▸ Parquet: Spark reads and writes it through the DataFrame API; from Python, use fastparquet or pyarrow
▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem
Parquet
https://parquet.apache.org/documentation/latest/
A columnar storage format. Within each row block, values are laid out column by column: COLUMN #0 ROW #0 … ROW #N, then COLUMN #1 ROW #0 … ROW #N, and so on up to COLUMN #M ROW #N. Compared with zipped CSV, similar values sit next to each other, so compression works better and a reader can skip the I/O for columns it does not need.
Spark:
df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
df.write.save(filename, compression='snappy')
fastparquet:
import pandas as pd
from fastparquet import write
pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')
pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq
arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression='GZIP')
Moving DataFrames between Spark and pandas
▸ pandas, CSV, and Spark …
▸ Spark → pandas
▸ pandas → Spark …
▸ Apache Arrow
Apache Arrow
▸ A cross-language, columnar in-memory data format
▸ Driven by the PyData / OSS community
https://arrow.apache.org
Wes McKinney's blog
▸ On pandas and Apache Arrow
▸ Translated into Japanese (with Wes's OK):
▸ "Apache Arrow and the 10 Things I Hate About pandas"
https://qiita.com/tamagawa-ryuji/items/3d8fc52406706ae0c144
PySpark