Bring Cartography to the Cloud

© Hortonworks Inc. 2011
Nick Dimiduk, Member of Technical Staff (HBase), Hortonworks
FOSS4G-NA, 2013-05-23

Description

If you've used a modern, interactive map such as Google or Bing Maps, you've consumed "map tiles". Map tiles are small images, each rendering one piece of the mosaic that is the whole map. Using conventional means, rendering tiles for the whole globe at multiple resolutions is a huge data processing effort. Even highly optimized, it spans a couple of terabytes and a few days of computation. Enter Hadoop. In this talk, I'll show you how to generate your own custom tiles using Hadoop. There will be pretty pictures.
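To get a feel for the scale: in the standard XYZ ("slippy map") tiling scheme, zoom level z contains 4^z tiles, so the count grows geometrically with zoom. A quick sketch of the arithmetic (standard scheme, not specific to this talk):

```python
def tiles_through_zoom(max_zoom):
    # Total tiles across zoom levels 0..max_zoom in the XYZ scheme,
    # where level z is a 2^z x 2^z grid of 256x256-pixel images.
    return sum(4 ** z for z in range(max_zoom + 1))

print(tiles_through_zoom(10))  # 1398101 — ~1.4 million tiles through zoom 10
```

By zoom 17 (the deepest level rendered later in this talk), the cumulative count is in the tens of billions, which is why brute-force global rendering is a genuinely big batch job.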

Transcript of Bring Cartography to the Cloud

Page 1: Bring Cartography to the Cloud


Bring Cartography to the Cloud with Apache Hadoop

Nick Dimiduk Member of Technical Staff, HBase FOSS4G-NA, 2013-05-23


Page 2: Bring Cartography to the Cloud


Beginnings…


mapbox.com/blog/ rendering-the-world/

bmander.com/dotmap/index.html

Page 3: Bring Cartography to the Cloud


Definitions


car•tog•ra•phy |kärˈtägrəfē|
noun: the science or practice of drawing maps.
here: rendering map tiles from some kind of geographic data.

cloud |kloud|
noun: a visible mass of condensed water vapor floating in the atmosphere, typically high above the ground.
here: on-demand consumption of computation and storage resources.

Page 4: Bring Cartography to the Cloud


Background


Page 5: Bring Cartography to the Cloud


Apache Hadoop in Review

•  Apache Hadoop Distributed Filesystem (HDFS)
   – Distributed, fault-tolerant, throughput-optimized data storage
   – Uses a filesystem analogy, not structured tables
   – The Google File System, 2003, Ghemawat et al.
     http://research.google.com/archive/gfs.html

•  Apache Hadoop MapReduce (MR)
   – Distributed, fault-tolerant, batch-oriented data processing
   – Line- or record-oriented processing of the entire dataset *
   – “[Application] schema on read”
   – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
     http://research.google.com/archive/mapreduce.html


* For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

Page 6: Bring Cartography to the Cloud


MapReduce in Detail


highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/


Page 8: Bring Cartography to the Cloud


What we care about


$ map < input | sort | reduce > output
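The whole job reduces to that shell pipeline: a mapper emitting key/value pairs, a sort standing in for the shuffle, and a reducer seeing each key's values grouped together. The same contract can be simulated in-process — here with a toy word count rather than tilebrute's geometry code (all names illustrative):

```python
from itertools import groupby

def mapper(records):
    # Emit (key, value) pairs, one or more per input record.
    for rec in records:
        for word in rec.split():
            yield word, 1

def reducer(pairs):
    # Receive pairs sorted by key; aggregate each key's group.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

records = ["a b a", "b a"]
shuffled = sorted(mapper(records))   # stands in for `sort` (the shuffle)
counts = dict(reducer(shuffled))     # {'a': 3, 'b': 2}
```

Hadoop Streaming runs exactly this shape of program at cluster scale, with the mapper and reducer as separate OS processes fed over stdin/stdout.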

Page 9: Bring Cartography to the Cloud


How Seamlessly?


$ git show e65731e:bin/10_simulated_hadoop.sh
gzcat "$INPUT_FILES" \
  | python "${PYTHON_DIR}/sample_shapes.py" \
  | sort \
  | python "${PYTHON_DIR}/draw_tiles.py"

$ git show e65731e:bin/11_hadoop_local.sh
hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
  -input /tmp/input.csv \
  -output "$OUTPUT_DIR" \
  -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
  -reducer "python ${PYTHON_DIR}/draw_tiles.py"

Page 10: Bring Cartography to the Cloud


To the Code! github.com/ndimiduk/tilebrute


Page 11: Bring Cartography to the Cloud


Our Tools

•  Python + GIS
   – GDAL
   – Shapely
   – Mapnik
•  Java
•  Apache Hadoop
•  Bash
•  MrJob


Page 12: Bring Cartography to the Cloud


Prepare the Input


TIGER/Line Shapefiles

www.census.gov/geo/maps-data/data/tiger-line.html

$ tail -n6 bin/00_prepare_input.sh
ogr2ogr                  `: invoke gdal tool ogr2ogr` \
  -t_srs epsg:4326       `: reproject the data` \
  -f CSV                 `: in CSV format` \
  $OUTPUT                `: producing output file` \
  $INPUT                 `: from input file` \
  -lco GEOMETRY=AS_WKT   `: including geometries as WKT`

$ head -n2 /tmp/input.csv
WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
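A feature row like that can be pulled apart with nothing but the standard library: the WKT geometry is the first column and the population count is the last. A sketch using the sample row above (geometry elided, as in the output):

```python
import csv
import io

# Sample data copied from the ogr2ogr output shown above.
sample = (
    "WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10\n"
    '"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5\n'
)
rows = list(csv.DictReader(io.StringIO(sample)))
geom_wkt = rows[0]["WKT"]            # WKT string, ready for e.g. shapely.wkt.loads
population = int(rows[0]["POP10"])   # 5 people in this census block
```

The quoted WKT field contains commas, which is why a real CSV parser (not a naive split) is the right tool here.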


Page 14: Bring Cartography to the Cloud


Map: Sample Geometries


[,[WKT, population]] => mapper => ['tx,ty,z', 'px,py']

def main():
    for geom, population in read_feature(stdin):
        for lng, lat in sample_geometry(geom, population):
            for key, val in make_kv(lat, lng):
                emit(key, val)

$ map < input | sort | reduce > output

Page 15: Bring Cartography to the Cloud


Map: Sample Geometries


$ head -n1 input.csv | python -m tilebrute.sample_shapes
2,5,4 -13224181.65427 5981084.37214
5,11,5 -13224181.65427 5981084.37214
10,22,6 -13224181.65427 5981084.37214
21,44,7 -13224181.65427 5981084.37214
43,89,8 -13224181.65427 5981084.37214
87,179,9 -13224181.65427 5981084.37214
174,359,10 -13224181.65427 5981084.37214
348,718,11 -13224181.65427 5981084.37214
696,1436,12 -13224181.65427 5981084.37214
1392,2873,13 -13224181.65427 5981084.37214
2785,5746,14 -13224181.65427 5981084.37214
5571,11493,15 -13224181.65427 5981084.37214
11142,22986,16 -13224181.65427 5981084.37214
22284,45973,17 -13224181.65427 5981084.37214

$ map < input | sort | reduce > output
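Those 'tx,ty,z' keys can be checked against standard Web Mercator tile math. A stdlib sketch (illustrative; tilebrute's actual make_kv may differ in rounding and in the mercator-meter values it emits):

```python
import math

def tile_for(lat, lng, zoom):
    # Standard XYZ tile coordinates for a WGS84 point.
    n = 2 ** zoom
    tx = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    ty = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return tx, ty

# The sample polygon's first vertex (-118.8147, 47.2335) lands in
# tile 2,5 at zoom 4 and 5,11 at zoom 5 — matching the keys above.
print(tile_for(47.2335, -118.8147, 4))
```

Note the mapper emits the *same* sampled point once per zoom level, which is why one input feature fans out into fourteen output records.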

Page 16: Bring Cartography to the Cloud


Sort


$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
10,22,6 -13224414.42332 5983539.01581
10,22,6 -13225723.87449 5981201.60336
10,22,6 -13225793.67181 5983127.53706
10,22,6 -13226046.70101 5983375.66839
10,22,6 -13226331.90155 5984272.31303
11138,22981,16 -13226331.90155 5984272.31303
11139,22983,16 -13225793.67181 5983127.53706
11139,22983,16 -13226046.70101 5983375.66839
11139,22986,16 -13225723.87449 5981201.60336
11141,22982,16 -13224414.42332 5983539.01581

$ map < input | sort | reduce > output

Page 17: Bring Cartography to the Cloud


Reduce: Draw Tiles


def main():
    for tile, points in groupby(read_points(stdin), lambda x: x[0]):
        zoom = get_zoom(tile)
        map = init_map(zoom, points)
        map.zoom_all()
        im = mapnik.Image(256, 256)
        mapnik.render(map, im)
        emit(tile, encode_image(im))

$ map < input | sort | reduce > output

$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 \
    | python -m tilebrute.draw_tiles
10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
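Because Hadoop Streaming delivers the reducer's input already sorted by key, the groupby-on-sorted-input idiom is all it takes to reassemble each tile's points before rendering. A stdlib sketch of that pattern (illustrative names; Streaming's default key/value delimiter is a tab):

```python
from itertools import groupby

def read_kv(lines):
    # Streaming hands the reducer "key<TAB>value" lines, pre-sorted by key.
    for line in lines:
        key, _, val = line.rstrip("\n").partition("\t")
        yield key, val

def group_by_key(lines):
    # groupby only merges *consecutive* equal keys, which is exactly
    # right here because the shuffle already sorted the input.
    for key, group in groupby(read_kv(lines), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

sample = [
    "10,22,6\t-13224414 5983539",
    "10,22,6\t-13225723 5981201",
    "11139,22983,16\t-13225793 5983127",
]
result = dict(group_by_key(sample))
```

In the real reducer, each group's points feed one Mapnik render; the grouping logic itself is this small.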

Page 18: Bring Cartography to the Cloud


Write Output


public void write(Text tileId, Text tile) throws IOException {
  String[] tileIdSplits = tileId.toString().split(",");
  assert tileIdSplits.length == 3;
  String tx = tileIdSplits[0];
  String ty = tileIdSplits[1];
  String zoom = tileIdSplits[2];
  Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
  fs.mkdirs(tilePath.getParent());
  byte[] buf = Base64.decodeBase64(tile.toString());
  final FSDataOutputStream fout = fs.create(tilePath, progress);
  fout.write(buf);
  fout.close();
}
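The same output layout can be sketched against a local filesystem in a few lines of Python — a toy analogue of the OutputFormat above, not the project's actual code (which writes to HDFS via FSDataOutputStream):

```python
import base64
import tempfile
from pathlib import Path

def write_tile(output_dir, tile_id, b64_png):
    # Decode the base64 PNG the reducer emitted and write it to the
    # z/x/y.png layout that slippy-map clients expect.
    tx, ty, zoom = tile_id.split(",")   # key format: "tx,ty,zoom"
    tile_path = Path(output_dir) / zoom / tx / f"{ty}.png"
    tile_path.parent.mkdir(parents=True, exist_ok=True)
    tile_path.write_bytes(base64.b64decode(b64_png))
    return tile_path

out = tempfile.mkdtemp()
p = write_tile(out, "2624,5722,14", base64.b64encode(b"...png bytes...").decode())
# p is <out>/14/2624/5722.png
```

Base64-encoding the image in the reducer keeps the PNG bytes safe inside Streaming's line-oriented text protocol; the OutputFormat undoes it at write time.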

Page 19: Bring Cartography to the Cloud


To the Cloud!


Page 20: Bring Cartography to the Cloud


Basic Services: EC2, S3

•  EC2: Elastic Compute Cloud
   – Virtual machines on demand
   – Different “instance types” with different hardware profiles
   – m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)

•  S3: Simple Storage Service
   – Distributed, replicated storage
   – Native Hadoop integration
   – Also exposed over HTTP(S), making tile hosting easy


Page 21: Bring Cartography to the Cloud


Add-on Service: EMR

•  EMR: Elastic MapReduce
   – “Hadoop as a Service”
   – On-demand, pre-installed and configured Hadoop clusters
   – +1: standardized provisioning, deployment, and monitoring
   – -1: “stable” (old) software


Page 22: Bring Cartography to the Cloud


MrJob: Python for EMR


class TileBrute(MRJob):

    HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

    def mapper_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

    def reducer_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.draw_tiles')

github.com/Yelp/mrjob

Page 23: Bring Cartography to the Cloud


Results


Page 24: Bring Cartography to the Cloud


Page 25: Bring Cartography to the Cloud


14z, 2624x, 5722y


Page 27: Bring Cartography to the Cloud


How much code?


$ find -f src -f bin | egrep '\.(java|sh|py)$' | grep -v test | xargs cloc --quiet
http://cloc.sourceforge.net v 1.56  T=0.5 s (28.0 files/s, 1868.0 lines/s)
-------------------------------------------------------------------------------
Language          files      blank    comment       code
-------------------------------------------------------------------------------
Python                4         69        105        299
Bourne Shell          8         51         85        210
Java                  2         25         16         74
-------------------------------------------------------------------------------
SUM:                 14        145        206        583
-------------------------------------------------------------------------------

Page 28: Bring Cartography to the Cloud


Performance


•  1 x m1.large (2 cores)
   – 195,575 input features (WA state)
   – 3 zoom levels (6, 7, 8)
   – 1 hour

•  19 x c1.xlarge (152 cores)
   – 308,745,538 input features (all data)
   – 3 zoom levels (6, 7, 8)
   – 3 hours 15 minutes
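Back-of-envelope arithmetic on those two runs, in features processed per core-hour:

```python
# Throughput from the performance numbers above.
small_run = 195_575 / (2 * 1.0)          # 1 x m1.large for 1 hour
large_run = 308_745_538 / (152 * 3.25)   # 19 x c1.xlarge for 3h15m

print(round(small_run))  # ~98,000 features per core-hour
print(round(large_run))  # ~625,000 features per core-hour
```

The full-dataset run pushed roughly six times more features per core-hour. These are per-core numbers across different instance types and data, so this is an indication of healthy scaling, not a controlled comparison.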

Page 29: Bring Cartography to the Cloud


TODOs

•  Macro-level performance optimizations (configuration)
   – Balancing mappers and reducers, memory allocation, &c.
   – On-demand Hadoop means tuning the cluster to the application

•  Micro-level performance optimizations (code)
   – Smarter sampling logic
   – Mapnik API considerations
   – Multi-threaded S3 PUTs
     https://forums.aws.amazon.com/thread.jspa?threadID=125135

•  Write tiles in MBTiles format
•  Write tiles to HBase
•  Compression!
•  Ogrbrute?


Page 30: Bring Cartography to the Cloud


Thanks!


HBase in Action (Manning)
by Nick Dimiduk and Amandeep Khurana
Foreword by Michael Stack
hbaseinaction.com

Nick Dimiduk github.com/ndimiduk

@xefyr

n10k.com