
Bring Cartography to the Cloud with Apache Hadoop

Nick Dimiduk
Member of Technical Staff, HBase
FOSS4G-NA, 2013-05-23


Beginnings…

Page 2 Architecting the Future of Big Data

mapbox.com/blog/ rendering-the-world/

bmander.com/dotmap/index.html

Definitions


car•tog•ra•phy |kärˈtägrəfē| noun
1. the science or practice of drawing maps.
2. (in this talk) rendering map tiles from some kind of geographic data.

cloud |kloud| noun
1. a visible mass of condensed water vapor floating in the atmosphere, typically high above the ground.
2. (in this talk) on-demand consumption of computation and storage resources.

Background


Apache Hadoop in Review

•  Apache Hadoop Distributed Filesystem (HDFS)
   – Distributed, fault-tolerant, throughput-optimized data storage
   – Uses a filesystem analogy, not structured tables
   – The Google File System, 2003, Ghemawat et al.
   – http://research.google.com/archive/gfs.html

•  Apache Hadoop MapReduce (MR)
   – Distributed, fault-tolerant, batch-oriented data processing
   – Line- or record-oriented processing of the entire dataset *
   – “[Application] schema on read”
   – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
   – http://research.google.com/archive/mapreduce.html


* For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

MapReduce in Detail


highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/


What we care about


$ map < input | sort | reduce > output
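The pipeline above can be mimicked in a single Python process, which is a handy way to reason about what Hadoop does between the two user-supplied stages. This is only a minimal sketch: `run_pipeline`, `toy_mapper`, and `toy_reducer` are hypothetical names introduced for illustration and are not part of tilebrute or Hadoop.

```python
from itertools import groupby


def run_pipeline(records, mapper, reducer):
    """Simulate `map < input | sort | reduce > output` in one process."""
    # Map: each input record may yield any number of (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Sort: stands in for Hadoop's shuffle, bringing equal keys together.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce: called once per distinct key with all of that key's values.
    output = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        output.extend(reducer(key, [value for _, value in group]))
    return output


# Hypothetical toy job: count how many points land on each tile key.
def toy_mapper(record):
    tile, point = record
    yield tile, point


def toy_reducer(tile, points):
    yield tile, len(points)


records = [("10,22,6", "a"), ("2,5,4", "b"), ("10,22,6", "c")]
print(run_pipeline(records, toy_mapper, toy_reducer))
# → [('10,22,6', 2), ('2,5,4', 1)]
```

Keys are compared as strings here, so "10,22,6" sorts before "2,5,4". The reducer only needs equal keys to be adjacent, not numerically ordered.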

How Seamlessly?


$ git show e65731e:bin/10_simulated_hadoop.sh
gzcat "$INPUT_FILES" \
  | python "${PYTHON_DIR}/sample_shapes.py" \
  | sort \
  | python "${PYTHON_DIR}/draw_tiles.py"

$ git show e65731e:bin/11_hadoop_local.sh
hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
  -input /tmp/input.csv \
  -output "$OUTPUT_DIR" \
  -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
  -reducer "python ${PYTHON_DIR}/draw_tiles.py"

To the Code! github.com/ndimiduk/tilebrute


Our Tools

•  Python + GIS
   – GDAL
   – Shapely
   – Mapnik
•  Java
•  Apache Hadoop
•  Bash
•  MrJob


Prepare the Input


TIGER/Line Shapefiles

www.census.gov/geo/maps-data/data/tiger-line.html

$ tail -n6 bin/00_prepare_input.sh
ogr2ogr `: invoke gdal tool ogr2ogr` \
  -t_srs epsg:4326 `: reproject the data` \
  -f CSV `: in CSV format` \
  $OUTPUT `: producing output file` \
  $INPUT `: from input file` \
  -lco GEOMETRY=AS_WKT `: including geometries as WKT`

$ head -n2 /tmp/input.csv
WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
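A CSV in this shape can be read with nothing but the standard library, because the WKT geometry is quoted and so its embedded commas do not split the row. The sketch below assumes that `POP10` is the population column the later stages use. `read_features` is a hypothetical helper, and the square polygon is made-up sample data standing in for the elided geometry.

```python
import csv
import io


def read_features(lines):
    """Yield (wkt, population) pairs from CSV rows with WKT and POP10 columns."""
    for row in csv.DictReader(lines):
        # Assumption: POP10 holds the census block's population count.
        yield row["WKT"], int(row["POP10"])


# Made-up sample row; the column names come from the header shown above.
sample = io.StringIO(
    "WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10\n"
    '"POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",53,001,950100,1042,530019501001042,N,1,5\n'
)
for wkt, population in read_features(sample):
    print(population, wkt)
```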


Map: Sample Geometries


[,[WKT, population]] => mapper => ['tx,ty,z', 'px,py']

def main():
    for geom, population in read_feature(stdin):
        for lng, lat in sample_geometry(geom, population):
            for key, val in make_kv(lat, lng):
                emit(key, val)
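The `make_kv` step is not shown on the slide, so the following is a guess at its shape rather than the project's code. It assumes the common XYZ ("slippy map") tile numbering and spherical Web Mercator coordinates, and emits one key per zoom level from 4 through 17. `to_mercator` and `tile_for` are hypothetical helper names.

```python
import math

# Half the width of the Web Mercator world in meters (assumes radius 6378137 m).
ORIGIN_SHIFT = math.pi * 6378137.0


def to_mercator(lat, lng):
    """Project WGS84 degrees to spherical Web Mercator meters."""
    x = lng * ORIGIN_SHIFT / 180.0
    y = math.asinh(math.tan(math.radians(lat))) * ORIGIN_SHIFT / math.pi
    return x, y


def tile_for(lat, lng, zoom):
    """Return the XYZ tile (tx, ty) containing a point; valid for |lat| < ~85.05."""
    n = 2 ** zoom
    tx = int((lng + 180.0) / 360.0 * n)
    ty = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return tx, ty


def make_kv(lat, lng, zooms=range(4, 18)):
    """Yield ('tx,ty,z', 'px,py') once per zoom level for a single point."""
    px, py = to_mercator(lat, lng)
    for zoom in zooms:
        tx, ty = tile_for(lat, lng, zoom)
        yield "%d,%d,%d" % (tx, ty, zoom), "%.5f,%.5f" % (px, py)


# First vertex of the sample polygon from the input CSV.
print(tile_for(47.233499, -118.81473, 4))
# → (2, 5)
```

Emitting the same point once per zoom level is what lets the reducer draw every tile independently, at the cost of multiplying the intermediate data by the number of zoom levels.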

$ map < input | sort | reduce > output

Map: Sample Geometries


$ head -n1 input.csv | python -m tilebrute.sample_shapes
2,5,4 -13224181.65427 5981084.37214
5,11,5 -13224181.65427 5981084.37214
10,22,6 -13224181.65427 5981084.37214
21,44,7 -13224181.65427 5981084.37214
43,89,8 -13224181.65427 5981084.37214
87,179,9 -13224181.65427 5981084.37214
174,359,10 -13224181.65427 5981084.37214
348,718,11 -13224181.65427 5981084.37214
696,1436,12 -13224181.65427 5981084.37214
1392,2873,13 -13224181.65427 5981084.37214
2785,5746,14 -13224181.65427 5981084.37214
5571,11493,15 -13224181.65427 5981084.37214
11142,22986,16 -13224181.65427 5981084.37214
22284,45973,17 -13224181.65427 5981084.37214

$ map < input | sort | reduce > output

Sort


$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
10,22,6 -13224414.42332 5983539.01581
10,22,6 -13225723.87449 5981201.60336
10,22,6 -13225793.67181 5983127.53706
10,22,6 -13226046.70101 5983375.66839
10,22,6 -13226331.90155 5984272.31303
11138,22981,16 -13226331.90155 5984272.31303
11139,22983,16 -13225793.67181 5983127.53706
11139,22983,16 -13226046.70101 5983375.66839
11139,22986,16 -13225723.87449 5981201.60336
11141,22982,16 -13224414.42332 5983539.01581

$ map < input | sort | reduce > output

Reduce: Draw Tiles


def main():
    for tile,points in groupby(read_points(stdin), lambda x: x[0]):
        zoom = get_zoom(tile)
        map = init_map(zoom, points)
        map.zoom_all()
        im = mapnik.Image(256,256)
        mapnik.render(map,im)
        emit(tile, encode_image(im))
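The grouping and encoding around the Mapnik calls can be sketched without Mapnik at all. This illustration assumes tab-separated `key<TAB>px,py` input lines that are already sorted by key, and it swaps the real rendering for a hypothetical `render_stub` that just returns some bytes. `read_points` and `reduce_tiles` are likewise stand-ins, not the project's functions.

```python
import base64
from itertools import groupby


def read_points(lines):
    """Parse 'tx,ty,z<TAB>px,py' lines into (tile, (px, py)) pairs."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        px, py = value.split(",")
        yield key, (float(px), float(py))


def render_stub(tile, points):
    """Hypothetical stand-in for the Mapnik rendering step; returns raw bytes."""
    return ("%s:%d" % (tile, len(points))).encode("ascii")


def reduce_tiles(lines):
    """Yield (tile, base64 text) once per run of consecutive equal keys."""
    for tile, group in groupby(read_points(lines), key=lambda kv: kv[0]):
        points = [point for _, point in group]
        # Base64 keeps binary image bytes safe inside a line-oriented text stream.
        yield tile, base64.b64encode(render_stub(tile, points)).decode("ascii")


sorted_lines = ["10,22,6\t1.0,2.0\n", "10,22,6\t3.0,4.0\n", "2,5,4\t5.0,6.0\n"]
for tile, payload in reduce_tiles(sorted_lines):
    print(tile, payload)
```

`groupby` only merges adjacent equal keys, which is why the sort stage has to run before the reducer.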

$ map < input | sort | reduce > output

$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 | python -m tilebrute.draw_tiles
10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC

Write Output


public void write(Text tileId, Text tile) throws IOException {
  String[] tileIdSplits = tileId.toString().split(",");
  assert tileIdSplits.length == 3;
  String tx = tileIdSplits[0];
  String ty = tileIdSplits[1];
  String zoom = tileIdSplits[2];
  Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
  fs.mkdirs(tilePath.getParent());
  byte[] buf = Base64.decodeBase64(tile.toString());
  final FSDataOutputStream fout = fs.create(tilePath, progress);
  fout.write(buf);
  fout.close();
}
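For local experiments the same key-to-path logic can be reproduced in Python against an ordinary directory instead of a Hadoop filesystem. This is a rough equivalent under that assumption, not a replacement for the output format above. `tile_path` and `write_tile` are hypothetical names.

```python
import base64
import os
import tempfile


def tile_path(output_dir, tile_id):
    """Map a 'tx,ty,zoom' key to <output_dir>/<zoom>/<tx>/<ty>.png."""
    tx, ty, zoom = tile_id.split(",")
    return os.path.join(output_dir, zoom, tx, ty + ".png")


def write_tile(output_dir, tile_id, encoded):
    """Decode a base64 tile and write it to its zoom/x/y location."""
    path = tile_path(output_dir, tile_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fout:
        fout.write(base64.b64decode(encoded))
    return path


# Usage with placeholder bytes standing in for a rendered PNG.
with tempfile.TemporaryDirectory() as out_dir:
    payload = base64.b64encode(b"not-a-real-png").decode("ascii")
    written = write_tile(out_dir, "10,22,6", payload)
    print(os.path.relpath(written, out_dir))
```

The zoom/x/y layout matches what most web map clients expect when tiles are served as static files.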

To the Cloud!


Basic Services: EC2, S3

•  EC2: Elastic Compute Cloud
   – Virtual machines on demand
   – Different “instance types” with different hardware profiles
   – m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)

•  S3: Simple Storage Service
   – Distributed, replicated storage
   – Native Hadoop integration
   – Also exposed over http(s), easy tile hosting


Add-on Service: EMR

•  EMR: Elastic MapReduce
   – “Hadoop as a Service”
   – On-demand, pre-installed and configured Hadoop clusters
   – +1: standardized provisioning, deployment, monitoring
   – -1: “stable” (old) software


MrJob: Python for EMR


class TileBrute(MRJob):

    HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

    def mapper_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

    def reducer_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.draw_tiles')

github.com/Yelp/mrjob

Results


14z, 2624x, 5722y


How much code?


$ find -f src -f bin | egrep '\.(java|sh|py)$' | grep -v test | xargs cloc --quiet

http://cloc.sourceforge.net v 1.56  T=0.5 s (28.0 files/s, 1868.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           4             69            105            299
Bourne Shell                     8             51             85            210
Java                             2             25             16             74
-------------------------------------------------------------------------------
SUM:                            14            145            206            583
-------------------------------------------------------------------------------