MASSIVE Terrain Datasets: On the Importance of Efficient Algorithms
Lars Arge
Datalogisk Institut
Aarhus Universitet
Regional one-day course in computer science, 20 March 2006
Outline
1. Massive (terrain) data
2. Scalability problems (I/O bottleneck)
3. Processing massive terrain data: Flow modeling on grid terrains
4. Summary
Massive Data
Massive Data
• Massive datasets are being collected everywhere
• Storage management software is billion-$ industry
Examples (2002):
• Phone: AT&T 20TB phone call database, wireless tracking
• Consumer: WalMart 70TB database, buying patterns (supermarket checkout)
• WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day
• Geography: NASA satellites generate 1.2TB per day
Example: Satellite Images
– Terabyte image database
Example: Grid Terrain Data
• Grid terrain data increasingly available
– NASA SRTM mission acquired 30m data for around 80% of earth land mass
– US data readily available through USGS National Map Seamless Data Distribution System
• Appalachian Mountains (800km x 800km)
– 100m resolution ~ 64M cells, ~128MB raw data (~500MB when processing)
– ~ 1.2GB at 30m resolution
– ~ 12GB at 10m resolution (much of US available from USGS)
– ~ 1.2TB at 1m resolution (selected, mostly military, availability)
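The sizes quoted for the Appalachian Mountains follow from simple arithmetic. The sketch below recomputes them, assuming a square 800km x 800km grid and 2 bytes per elevation value; the 2-byte encoding is an assumption inferred from "64M cells ~ 128MB" and is not stated on the slide. The function name `grid_size` is hypothetical. The slide's figures are rounded, so the computed values agree only approximately (the 30m case comes out somewhat above 1.2GB).

```python
def grid_size(extent_m, resolution_m, bytes_per_cell=2):
    """Return (number of cells, raw size in bytes) for a square grid.

    bytes_per_cell=2 is an assumption, not a figure from the slides.
    """
    cells_per_side = extent_m / resolution_m
    cells = cells_per_side ** 2
    return cells, cells * bytes_per_cell


if __name__ == "__main__":
    # Resolutions listed on the slide, for an 800 km x 800 km area.
    for resolution in (100, 30, 10, 1):
        cells, size = grid_size(800_000, resolution)
        print(f"{resolution:>4} m: {cells / 1e6:,.0f}M cells, "
              f"{size / 1e9:,.2f} GB raw")
```

Halving the cell width quadruples the cell count, which is why the step from 100m to 1m multiplies the data volume by 10,000.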
Example: LIDAR Terrain Data
• Massive (irregular) point sets (1-10m resolution)
– Becoming relatively cheap and easy to collect
• NC floodplain mapping program: www.ncfloodmaps.com
– Collected LIDAR for all NC after Hurricane Floyd in 1999
– Still processing it
Hurricane Floyd
• Sep. 15, 1999
[Figure: two images, captioned 7 am and 3 pm]
Example: LIDAR Terrain Data
• US LIDAR data becoming available:
– www.ncfloodmaps.com
– USGS Center for LIDAR Information Coordination and Knowledge (CLICK)
– NOAA LIDAR Data Retrieval Tool (LDART)
Scalability Problems
Scalability Problems: I/O-Bottleneck
• I/O is often bottleneck when handling massive datasets
• Disk access is 10^6 times slower than main memory access
– Disk systems try to amortize large access time by transferring large contiguous blocks of data
• Need to store and access data to take advantage of blocks (locality)
[Figure: disk drive with labels track, magnetic surface, read/write arm, read/write head]
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
Scalability Problems: Block Access Matters
• Example: Reading an array from disk
– Array size N = 10 elements
– Disk block size B = 2 elements
– Main memory size M = 4 elements (2 blocks)
• Difference between N and N/B large since block size is large
– Example: N = 256 x 10^6, B = 8000, 1ms disk access time
– N I/Os take 256 x 10^3 sec = 4266 min = 71 hr
– N/B I/Os take 256/8 sec = 32 sec
[Figure: two disk layouts of the array. Algorithm 1 (layout 1 5 2 6 3 8 9 4 7 10) loads N = 10 blocks; Algorithm 2 (layout 1 2 10 9 5 6 3 4 8 7) loads N/B = 5 blocks]
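The block counts in this example can be reproduced with a small simulation. The sketch below assumes that the elements are visited in increasing order of value and that main memory evicts the least recently used block; both are assumptions, since the slide does not state the access pattern or the paging policy. The names `count_block_loads` and `io_seconds` are hypothetical.

```python
from collections import OrderedDict


def count_block_loads(layout, block_size, memory_blocks):
    """Count block loads when visiting the values in increasing order.

    `layout` lists the values in their on-disk order, so the value at
    index i lives in block i // block_size. Memory holds `memory_blocks`
    blocks and evicts the least recently used one (an assumed policy).
    """
    block_of = {value: i // block_size for i, value in enumerate(layout)}
    cache = OrderedDict()  # block id -> None, least recently used first
    loads = 0
    for value in sorted(layout):
        block = block_of[value]
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recent
        else:
            loads += 1                     # miss: one I/O fetches the block
            if len(cache) >= memory_blocks:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = None
    return loads


def io_seconds(num_ios, access_time_s=1e-3):
    """Total time for `num_ios` disk accesses at a fixed cost per access."""
    return num_ios * access_time_s


if __name__ == "__main__":
    scattered = [1, 5, 2, 6, 3, 8, 9, 4, 7, 10]
    blocked = [1, 2, 10, 9, 5, 6, 3, 4, 8, 7]
    print(count_block_loads(scattered, block_size=2, memory_blocks=2))  # 10
    print(count_block_loads(blocked, block_size=2, memory_blocks=2))    # 5

    n, b = 256 * 10**6, 8000
    print(io_seconds(n) / 3600)  # roughly 71 hours
    print(io_seconds(n // b))    # roughly 32 seconds
```

In the first layout consecutive values never share a block, so every access is a miss; in the second each block serves two consecutive accesses.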
Scalability Problems: Block Access Matters
• Most programs developed without memory considerations
– Infinite memory
– Uniform access cost
• Run on large datasets because OS moves blocks as needed
• Modern OSs utilize sophisticated paging and prefetching strategies
– But if program makes scattered accesses even good OS cannot take advantage of block access
Scalability problems!
[Figure: plot of running time against data size, with RAM marked]
Scalability: Hierarchical Memory
• Block access not only important on disk level
• Machines have complicated memory hierarchy
– Levels get larger and slower
– Block transfers on all levels
• We focus on disk level:
[Figure: plot of running time against data size, with labels L1, L2 and RAM]
Processing Massive Terrain Data: Flow
Flow on Terrains
• Modeling of water flow on terrains has many important applications
– Predict location of streams
– Predict areas susceptible to floods
– Compute watersheds
– Predict erosion
– Predict vegetation distribution
– ……
• Conceptually flow is modeled using two basic attributes
– Flow direction: The direction water flows at a point
– Flow accumulation: Amount of water flowing through a point
• Flow accumulation used to compute other hydrological attributes, e.g. drainage network, topographic convergence index…
Flow Directions on Grid Terrains
• Common terrain representation: Grid
• Flow directions: Water in each cell flows to downslope neighbor(s)
– Commonly used:
* Single flow direction (SFD or D8): Flow to downslope neighbor
* Multiple flow direction (MFD): Flow to all downslope neighbors
[Figure: SFD and MFD flow directions on a 3 x 3 grid with elevations 3 2 4 / 7 5 8 / 7 1 9]
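The two flow direction models can be sketched on the 3 x 3 elevation grid from the figure, read here as rows 3 2 4 / 7 5 8 / 7 1 9 (an assumed reading of the extracted numbers). For SFD the sketch picks the steepest downslope neighbor, with the elevation drop divided by the distance to the neighbor; that tie-breaking rule is an assumption, since the slide only says "downslope neighbor". The function names are hypothetical.

```python
import math

# Offsets (row, col) of the 8 neighbors of a grid cell.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def downslope_neighbors(grid, r, c):
    """Return [(slope, (row, col)), ...] for all strictly lower neighbors."""
    result = []
    for dr, dc in NEIGHBORS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            drop = grid[r][c] - grid[nr][nc]
            if drop > 0:
                # Diagonal neighbors are sqrt(2) cell widths away.
                result.append((drop / math.hypot(dr, dc), (nr, nc)))
    return result


def sfd(grid, r, c):
    """Single flow direction: the steepest downslope neighbor, or None."""
    lower = downslope_neighbors(grid, r, c)
    return max(lower)[1] if lower else None


def mfd(grid, r, c):
    """Multiple flow direction: all downslope neighbors."""
    return [cell for _, cell in downslope_neighbors(grid, r, c)]


if __name__ == "__main__":
    grid = [[3, 2, 4],
            [7, 5, 8],
            [7, 1, 9]]
    print(sfd(grid, 1, 1))  # (2, 1): the cell with elevation 1
    print(mfd(grid, 1, 1))  # [(0, 0), (0, 1), (0, 2), (2, 1)]
```

For the center cell (elevation 5), SFD sends all water to the cell with elevation 1, while MFD spreads it over the four lower neighbors 3, 2, 4 and 1.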
Flow Accumulation on Grid Terrains
• Flow accumulation
– Initially one unit of water in each cell
– Water distributed from each cell according to flow direction(s)
– Flow accumulation of cell is total flow through it
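The definition above can be sketched for the single flow direction case. This is a minimal in-memory illustration, not the I/O-efficient method discussed later: cells are visited from highest to lowest, so each cell has received all incoming water before it passes its total on. Choosing the steepest lower neighbor is an assumption, and the names `steepest_lower_neighbor` and `flow_accumulation` are hypothetical.

```python
import math

# Offsets (row, col) of the 8 neighbors of a grid cell.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def steepest_lower_neighbor(grid, r, c):
    """Return the steepest strictly lower neighbor of (r, c), or None."""
    best = None
    for dr, dc in NEIGHBORS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            drop = grid[r][c] - grid[nr][nc]
            if drop > 0:
                candidate = (drop / math.hypot(dr, dc), (nr, nc))
                if best is None or candidate > best:
                    best = candidate
    return None if best is None else best[1]


def flow_accumulation(grid):
    """Single-flow-direction accumulation with one unit of water per cell."""
    rows, cols = len(grid), len(grid[0])
    acc = [[1] * cols for _ in range(rows)]  # initially one unit per cell
    cells = sorted(((grid[r][c], r, c)
                    for r in range(rows) for c in range(cols)),
                   reverse=True)             # highest cells first
    for _, r, c in cells:
        target = steepest_lower_neighbor(grid, r, c)
        if target is not None:
            tr, tc = target
            acc[tr][tc] += acc[r][c]         # push all water downslope
    return acc


if __name__ == "__main__":
    grid = [[3, 2, 4],
            [7, 5, 8],
            [7, 1, 9]]
    for row in flow_accumulation(grid):
        print(row)
    # [1, 3, 1]
    # [1, 1, 1]
    # [1, 6, 1]
```

The two local minima (elevations 2 and 1) collect 3 and 6 units, which together account for all 9 cells.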
Flow Accumulation Example (Panama dataset)
Flow Modeling on Massive Grid Terrains
• Duke University environmental researchers had problems with computing flow accumulation for Appalachian Mountains
– Recall ~128MB raw data and ~500MB when processing
Running time: 14 days
• It could be much worse. Recall:
– ~ 1.2GB at 30m resolution
– ~ 12GB at 10m resolution
– ~ 1.2TB at 1m resolution
Flow Modeling on Massive Grid Terrains
• We surveyed other flow accumulation software
• GRASS (leading open-source GIS)
– Killed after 17 days on a 50MB dataset (6700 x 4300 grid)
• TARDEM (specialized hydrology software)
– Could handle 50MB dataset
– Killed after 20 days on a 240MB dataset (12000 x 10000 grid)
* CPU utilization 5%, 3GB swap file
• ArcGIS (leading commercial GIS)
– Could handle the 240MB dataset
– Sometimes very slow:
* 3 days to process 490MB dataset
* 1 day to process 560MB dataset
– Does not work for datasets larger than 2GB
Flow Accumulation Scalability Problem
• Natural algorithm may require ~N I/Os
– “Push” flow down the terrain by visiting cells in height order
– Problem since cells of same height scattered over terrain
• Natural to try “tiling” (ArcGIS?)
– But computation in different tiles not independent
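The cost of visiting cells in height order can be illustrated by counting block loads for a grid stored row by row. The sketch below uses synthetic, uncorrelated random heights, which is a worst case and an assumption; real terrain is spatially correlated and would scatter less. The grid size, block size, memory size and the least-recently-used eviction policy are likewise illustrative choices, and the function names are hypothetical.

```python
import random
from collections import OrderedDict


def block_loads(access_order, block_size, memory_blocks):
    """Count block loads for a sequence of cell indices (LRU eviction)."""
    cache = OrderedDict()  # block id -> None, least recently used first
    loads = 0
    for index in access_order:
        block = index // block_size
        if block in cache:
            cache.move_to_end(block)
        else:
            loads += 1
            if len(cache) >= memory_blocks:
                cache.popitem(last=False)
            cache[block] = None
    return loads


def compare_orders(n_cells=4096, block_size=64, memory_blocks=4, seed=0):
    """Return (loads in storage order, loads in height order)."""
    rng = random.Random(seed)
    heights = [rng.random() for _ in range(n_cells)]  # synthetic terrain
    storage_order = range(n_cells)
    height_order = sorted(range(n_cells),
                          key=heights.__getitem__, reverse=True)
    return (block_loads(storage_order, block_size, memory_blocks),
            block_loads(height_order, block_size, memory_blocks))


if __name__ == "__main__":
    sequential, by_height = compare_orders()
    print("storage order:", sequential, "block loads")  # N/B = 64
    print("height order: ", by_height, "block loads")   # close to N
```

Scanning in storage order loads each block once, N/B loads in total, while the height-order visit misses on almost every cell because consecutive heights rarely share a block.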
TerraFlow
• We developed theoretically I/O-optimal algorithms using ~N/B I/Os
• Avoiding scattered access by:
– Grid storing input: Data duplication
– Grid storing flow: “Lazy write”
• Implementation was very efficient
– Appalachian Mountains flow accumulation in 3 hours!
• Developed into comprehensive software package for flow computation on massive grids (www.cs.duke.edu/geo*/terraflow)
– Efficient: 2-1000 times faster than other software on massive grids
– Scalable: 1 billion elements! (>2GB data)
– Flexible: Different flow modeling (direction) methods
TerraFlow
• Significant speedup over ArcInfo for large datasets
– East-Coast (100m)
TerraFlow: 8.7 Hours
ArcInfo: 78 Hours
– Washington state (10m)
TerraFlow: 63 Hours
ArcInfo: %
• Incorporated in GRASS 5.0.2 and later
• Recently also extensions for ArcGIS 8 and 9
[Figure: bar chart of running time (hours, axis 0 to 90) on the datasets Hawaii (56M), Cumberlands (80M), Lower NE (256M), East-Coast (491M), Midwest (561M) and Washington (2G); series TerraFlow 512, TerraFlow 128, ArcInfo 512, ArcInfo 128; measured on 500 MHz Alpha, FreeBSD 4.0]
Denmark?
Denmark Terrain Data
• Mainly two data suppliers in Denmark
– Kort & Matrikelstyrelsen
– COWI A/S
• Grid/vector models based on paper maps/orthophotos
• LIDAR data for major cities
• Unfortunately not available online (and not free)
– But obviously increasing interest in terrain data/applications
New Project
• New (NABIIT) project: Development of algorithms and software for processing massive terrain data
– COWI A/S
* Problems processing LIDAR data during production and analysis (e.g. railroad noise)
– Spatial analysis unit, Danish Institute of Agricultural Sciences
* Use data, e.g. to comply with EU directives
– Computer science, Aarhus University
* Efficient algorithms
• Focus on
– Terrain modeling, terrain flow analysis, influence of simplification
Example Sub-Projects
• Terrain modeling, e.g.:
– Terrain models from “raw” LIDAR
Process >10G raw data in a few hours using only 128M memory
• Terrain analysis, e.g.:
– Erosion modeling (USLE factor computation)
– Watershed hierarchy computation
NC Neuse basin at 10m resolution (~400M cells) in 3 hours
Summary
Summary
• Massive datasets appear everywhere
• Leads to scalability problems
– Due to hierarchical memory and slow I/O
• I/O-efficient algorithms greatly improve scalability
• Terrain data:
– Massive grid data exists
– New technologies are creating massive and very detailed datasets
– Processing capabilities lag behind
Summary - Resources
• Google Earth: http://earth.google.com/
• USGS national map: http://seamless.usgs.gov
• USGS center for LIDAR information: http://lidar.cr.usgs.gov
• NC floodmaps: http://www.ncfloodmaps.com
• NOAA LIDAR data retrieval tool: http://www.csc.noaa.gov/crs/tcm/about_ldart.html
• TerraFlow: http://www.cs.duke.edu/geo*/terraflow
• Duke STREAM project: http://terrain.cs.duke.edu
• Kort & Matrikelstyrelsen: http://www.kms.dk
• COWI A/S: http://www.cowi.dk
• Geoforum: http://www.geoforum.dk/