Arumugam s
Transcript of Arumugam s
-
8/13/2019 Arumugam s
1/123
© 2008 Subramanian Arumugam
To my parents.
ACKNOWLEDGMENTS
First of all, I would like to thank my advisor Chris Jermaine. This dissertation would
not have been made possible had it not been for his excellent mentoring and guidance
through the years. Chris is a terrific teacher, a critical thinker and a passionate researcher.
He has served as a great role model and has helped me mature as a researcher. I cannot
thank him enough for that.
My thanks also goes to Prof. Alin Dobra. Through the years, Alin has been a patient
listener and has helped me structure and refine my ideas countless times. His excitement
for research is contagious!
I would like to take this opportunity to mention my colleagues at the database
center: Amit, Florin, Fei, Luis, Mingxi and Ravi. I have had many hours of fun discussing
interesting problems with them. Special thanks goes to my friends Manas, Srijit, Arun,
Shantanu, and Seema, for making my stay in Gainesville all the more enjoyable.
Finally, I would like to thank my parents for being a constant source of support and
encouragement throughout my studies.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Research Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
    1.2.1 Data Modeling and Database Design . . . . . . . . . . . . . . . . . 15
    1.2.2 Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    1.2.3 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    1.2.4 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    1.2.5 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
    1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories . . . 18
    1.3.2 Entity Resolution in Spatiotemporal Databases . . . . . . . . . . . . 19
    1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases . . . 19
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Spatiotemporal Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Probabilistic Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES . 25
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
    3.2.1 Moving Object Trajectories . . . . . . . . . . . . . . . . . . . . . . 27
    3.2.2 Closest Point of Approach (CPA) Problem . . . . . . . . . . . . . . 28
3.3 Join Using Indexing Structures . . . . . . . . . . . . . . . . . . . . . . . . 30
    3.3.1 Trajectory Index Structures . . . . . . . . . . . . . . . . . . . . . . 31
    3.3.2 R-tree Based CPA Join . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Join Using Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . 36
    3.4.1 Basic CPA Join using Plane-Sweeping . . . . . . . . . . . . . . . . 36
    3.4.2 Problem With The Basic Approach . . . . . . . . . . . . . . . . . . 37
    3.4.3 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Adaptive Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.5.2 Cost Associated With a Given Granularity . . . . . . . . . . . . . . 41
    3.5.3 The Basic Adaptive Plane-Sweep . . . . . . . . . . . . . . . . . . . 41
    3.5.4 Estimating Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
    3.5.5 Determining The Best Cost . . . . . . . . . . . . . . . . . . . . . . 44
    3.5.6 Speeding Up the Estimation . . . . . . . . . . . . . . . . . . . . . . 46
    3.5.7 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    3.6.1 Test Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    3.6.2 Methodology and Results . . . . . . . . . . . . . . . . . . . . . . . 50
    3.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES . . . . . . . . 58
4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Outline of Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
    4.3.1 PDF for Restricted Motion . . . . . . . . . . . . . . . . . . . . . . 64
    4.3.2 PDF for Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . 65
4.4 Learning the Restricted Model . . . . . . . . . . . . . . . . . . . . . . . . 66
    4.4.1 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . 67
    4.4.2 Learning K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Learning Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . . . . . 71
    4.5.1 Applying a Particle Filter . . . . . . . . . . . . . . . . . . . . . . . 72
    4.5.2 Handling Multiple Objects . . . . . . . . . . . . . . . . . . . . . . . 73
    4.5.3 Update Strategy for a Sample given Multiple Objects . . . . . . . . 75
    4.5.4 Speeding Things Up . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS . . 84
5.1 Problem and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
    5.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
    5.1.2 The False Positive Problem . . . . . . . . . . . . . . . . . . . . . . 87
    5.1.3 The False Negative Problem . . . . . . . . . . . . . . . . . . . . . . 90
5.2 The Sequential Probability Ratio Test (SPRT) . . . . . . . . . . . . . . . 91
5.3 The End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
    5.3.1 What's Wrong With the SPRT? . . . . . . . . . . . . . . . . . . . . 95
    5.3.2 Removing the Magic Epsilon . . . . . . . . . . . . . . . . . . . . . 96
    5.3.3 The End-Biased Algorithm . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Indexing the End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.4.2 Building the Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
    5.4.3 Processing Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
BIOGRAPHICAL SKETCH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF TABLES
Table page
4-1 Varying the number of objects and its effect on recall, precision and runtime. . . 80
4-2 Varying the number of time ticks . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-3 Varying the number of sensors fired . . . . . . . . . . . . . . . . . . . . . . . 80
4-4 Varying the standard deviation of the Gaussian cloud. . . . . . . . . . . . . . . 80
4-5 Varying the number of time ticks where EM is applied. . . . . . . . . . . . . . . 81
5-1 Running times over varying database sizes. . . . . . . . . . . . . . . . . . . . . . 109
5-2 Running times over varying query sizes. . . . . . . . . . . . . . . . . . . . . . . 109
5-3 Running times over varying object standard deviations. . . . . . . . . . . . . . . 109
5-4 Running times over varying confidence levels. . . . . . . . . . . . . . . . . . . . 109
LIST OF FIGURES
Figure page
3-1 Trajectory of an object (a) and its polyline approximation (b) . . . . . . . . . . 28
3-2 Closest Point of Approach Illustration . . . . . . . . . . . . . . . . . . . . . . 29
3-3 CPA Illustration with trajectories . . . . . . . . . . . . . . . . . . . . . . . . 29
3-4 Example of an R-tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3-5 Heuristic to speed up distance computation . . . . . . . . . . . . . . . . . . . . 34
3-6 Issues with R-trees: fast-moving object p joins with everyone . . . . . . . . . 35
3-7 Progression of plane-sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3-8 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3-9 Problem with using large granularities for bounding box approximation . . . . . 40
3-10 Adaptively varying the granularity . . . . . . . . . . . . . . . . . . . . . . . . . 42
3-11 Convexity of cost function illustration. . . . . . . . . . . . . . . . . . . . . . . . 45
3-12 Iteratively evaluating k cut points. . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-13 Speeding up the Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-14 Injection data set at time tick 2,650. . . . . . . . . . . . . . . . . . . . . . . . . 49
3-15 Collision data set at time tick 1,500 . . . . . . . . . . . . . . . . . . . . . . . . . 50
3-16 Injection data set experimental results . . . . . . . . . . . . . . . . . . . . 51
3-17 Collision data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 52
3-18 Buffer size choices for Injection data set . . . . . . . . . . . . . . . . . . . . . . 53
3-19 Buffer size choices for Collision data set . . . . . . . . . . . . . . . . . . . . . . 53
3-20 Synthetic data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 54
3-21 Buffer size choices for Synthetic data set . . . . . . . . . . . . . . . . . . . . . . 56
4-1 Mapping of a set of observations for linear motion . . . . . . . . . . . . . . . . . 60
4-2 Object path (a) and quadratic fit for varying time ticks (b-d). . . . . . . . . . . 62
4-3 Object path in a sensor field (a) and sensor firings triggered by object motion (b) 64
4-4 The baseline input set (10,000 observations) . . . . . . . . . . . . . . . . . . . . 79
4-5 The learned trajectories for the data of Figure 4-4 . . . . . . . . . . . . . . . . . 79
5-1 The SPRT in action. The middle line is the LRT statistic. . . . . . . . . . . . . 92
5-2 Two spatial queries over a database of objects with Gaussian uncertainty . . . 97
5-3 The sequence of SPRTs run by the end-biased test . . . . . . . . . . . . . . . . 98
5-4 Building the MBRs used to index the samples from the end-biased test. . . . . . 104
5-5 Using the index to speed the end-biased test . . . . . . . . . . . . . . . . . . . . 106
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT
By
Subramanian Arumugam
August 2008
Chair: Christopher Jermaine
Major: Computer Engineering
This work focuses on interesting data management problems that arise in the analysis,
modeling, and querying of large-scale spatiotemporal data. Such data naturally arise in the
context of many scientific and engineering applications that deal with physical processes
that evolve over time.
We first focus on the issue of scalable query processing in spatiotemporal databases.
In many applications that produce a large amount of data describing the paths of moving
objects, there is a need to ask questions about the interaction of objects over a long
recorded history. To aid such analysis, we consider the problem of computing joins over
moving object histories. The particular join studied is the Closest-Point-Of-Approach
join, which asks: Given a massive moving object history, which objects approached within
a distance d of one another?
Next, we study a novel variation of the classic entity resolution problem that
appears in sensor network applications. In entity resolution, the goal is to determine
whether or not various bits of data pertain to the same object. Given a large database of
spatiotemporal sensor observations that consist of (location, timestamp) pairs, our goal is
to perform an accurate segmentation of all of the observations into sets, where each set is
associated with one object. Each set should also be annotated with the path of the object
through the area.
Finally, we consider the problem of answering selection queries in a spatiotemporal
database, in the presence of uncertainty incorporated through a probabilistic model.
We propose very general algorithms that can be used to estimate the probability that a
selection predicate evaluates to true over a probabilistic attribute or attributes, where
the attributes are supplied only in the form of a pseudo-random attribute value generator.
This enables the efficient evaluation of queries such as "Find all vehicles that are in close
proximity to one another with probability p at time t" using Monte Carlo statistical
methods.
Extending modern database systems to support spatiotemporal data is challenging for
several reasons:
- Conventional databases are designed to manage static data, whereas spatiotemporal
  data describe spatial geometries that change continuously with time. This requires a
  unified approach to dealing with aspects of spatiality and temporality.

- Current databases are designed to manage data that is precise. However, uncertainty
  is often an inherent property of spatiotemporal data due to the discretization of
  continuous movement and to measurement errors. The fact that most spatiotemporal
  data sources (particularly polling- and sampling-based schemes) provide only a
  discrete snapshot of continuous movement poses new problems for query processing.
  For example, consider a conventional database record that stores the fact "John
  Smith earns $200,000" and a spatiotemporal record that stores the fact "John
  Smith walks from point A to point B" in the form of a discretized ordered pair
  (A, B). In the former case, a query such as "What is the salary of John Smith?"
  involves dealing with precise data. On the other hand, a spatiotemporal query
  such as "Did John Smith walk through point C between A and B?" requires dealing
  with information that is often not known with certainty. Further compounding the
  problem is that even the recorded observations are accurate only to within a few
  decimal places. Thus, even queries such as "Identify all objects located at point
  A" may not return meaningful results unless allowed a certain margin for error.

- Due to the presence of the time dimension, spatiotemporal applications have the
  potential to produce a large amount of data. The sheer volume of data generated
  by spatiotemporal applications presents a computational and data management
  challenge. For instance, it is not uncommon for scientific processes to produce
  spatiotemporal data on the order of terabytes or even petabytes [7]. Developing
  scalable algorithms to support query processing over tera- and petabyte-sized
  spatiotemporal data sets is a significant challenge.

- The semantics of many basic operations in a database change in the presence of
  space and time. For instance, basic operations like joins typically employ equality
  predicates in a classic relational database, whereas equality is rare between two
  arbitrary spatiotemporal objects.
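The need for a margin of error can be made concrete. The following sketch (illustrative only, not taken from this dissertation; all names are hypothetical) tests whether a discretely sampled 2-D trajectory passed within a given tolerance of a query point, assuming linear interpolation between consecutive samples:

```python
import math

def passed_near(trajectory, point, tolerance):
    """Did a discretely sampled 2-D trajectory pass within `tolerance`
    of `point`?  `trajectory` is a list of (x, y) samples in time order;
    each consecutive pair is treated as a straight-line segment."""
    px, py = point
    for (x1, y1), (x2, y2) in zip(trajectory, trajectory[1:]):
        dx, dy = x2 - x1, y2 - y1
        seg_len_sq = dx * dx + dy * dy
        if seg_len_sq == 0.0:
            # Degenerate segment: the object did not move.
            dist = math.hypot(px - x1, py - y1)
        else:
            # Project the query point onto the segment, clamped to [0, 1].
            t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / seg_len_sq))
            dist = math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))
        if dist <= tolerance:
            return True
    return False
```

With tolerance zero, almost no recorded trajectory "passes through" any exact point, which is precisely why exact-location queries over discretized movement are rarely meaningful.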
1.2 Research Landscape
Over the last decade, database researchers have begun to respond to the challenges
posed by spatiotemporal data. Most of the research effort is concentrated on supporting
either predictive or historical queries. Within this taxonomy, we can further distinguish
work based on whether it supports time-instance or time-interval queries.
In predictive queries, the focus is on the future position of the objects, and only a
limited time window of the object positions needs to be maintained. On the other hand,
for historical queries, the interest is in efficient retrieval of past history, and thus the
database needs to maintain the complete timeline of an object's past locations. Due to
these divergent requirements, techniques developed for predictive queries are often not
suitable for historical queries.
What follows is a brief tour of the major research areas in spatiotemporal data
management. For a more complete treatment of this topic, the interested reader is
referred to [1, 3].
1.2.1 Data Modeling and Database Design
Early research focused on aspects of data modeling and database design for
spatiotemporal data [8]. Conventional data types employed in existing databases are
often not suitable to represent spatiotemporal data which describe continuous time-varying
spatial geometries. Thus, there is a need for a spatiotemporal type system that can model
continuously moving data. Depending on whether the underlying spatial object has an
extent or not, abstractions have been developed to model a moving point, line, and region
in two- and three-dimensional space with time considered as the additional dimension
[8–11]. Similarly, early work has also focused on refining existing CASE tools to aid in the
design of spatiotemporal databases. Existing conceptual tools such as ER diagrams and
UML present a non-temporal view of the world, and extensions to incorporate temporal
and spatial awareness have been investigated [12, 13].
Recently, there has been interest in designing flexible type systems that can model
aspects of the uncertainty associated with an object's spatial location [14]. There has also
been an active effort toward designing SQL language extensions for spatiotemporal data
types and operations [15].
1.2.2 Access Methods
Efficient processing of spatiotemporal queries requires developing new techniques
for query evaluation, providing suitable access structures and storage mechanisms, and
designing efficient algorithms for the implementation of spatiotemporal operators.
Developing efficient access structures for spatiotemporal databases is an important
area of research. A variety of spatiotemporal index structures have been developed to
support selection queries over both predictive and historical queries, most based on
generalization of the R-tree [16] to incorporate the time dimension. Indexing structures
designed to support predictive queries typically manage object movement within a small
time window and need to handle frequent updates to object locations. A popular choice
for such applications is the TPR-tree [17] and its many variants.
On the other hand, index structures designed to support historical queries need to
manage an object's entire past movement trajectory (for this reason they can be viewed as
trajectory indexes). Depending on the time interval indexed, the sheer volume of data that
needs to be managed presents significant technical challenges for overlap-allowing indexing
schemes such as R-trees [16]. Thus, there has been interest in revisiting grid/cell-based
solutions that do not allow overlap, such as SETI [18]. Several tree-based indexing
structures have been developed such as STRIPES [19], 3D R-trees [20], TB trees [21] and
linear quad trees [22]. Further, spatiotemporal extensions of several popular queries such
as nearest-neighbor [23], top-k [24], and skyline [25] have been developed.
1.2.3 Query Processing
The development of efficient index structures has also led to a growing body of
research on different types of queries on spatiotemporal data, such as time-instant and
range queries [26–28], continuous queries, joins [29, 30], and their efficient evaluation
[31, 32]. In the same vein, there has also been some preliminary work on optimizing
spatiotemporal selection queries [33, 34].
Much of the work focuses specifically on indexing two-dimensional space and/or
supporting time-instance or short time-interval selection queries. Thus many indexing
structures often do not scale well for higher-dimensional spaces and have difficulty with
queries over long time intervals. Finally, historical data collections may be huge, and joins
over such data require new solutions, since the predicates involved are non-traditional
(such as closest point of approach, within, sometimes-possibly-inside, etc.).
1.2.4 Data Analysis
Spatiotemporal data analysis allows us to obtain interesting insights from the stored
data collection. For instance:
- In a road network database, the history of movement of various objects can be used
  to understand traffic patterns.

- In aviation, the flight paths of planes can be used in future path planning and in
  computing minimum separation constraints to avoid collisions.

- In wildlife management, one can understand animal migration patterns from the
  trajectories the animals trace.

- Pollutants can be traced to their source by studying the air flow patterns of aerosols
  stored as trajectories.
Research in this area focuses on extending traditional data mining techniques to the
analysis of large spatiotemporal data sets. Topics of interest include discovering similarities
among object trajectories [35], data classification and generalization [36], trajectory
clustering and rule mining [37–39], and supporting interactive visualization for browsing
large spatiotemporal collections [40].
1.2.5 Data Warehousing
Supporting data analysis also requires designing and maintaining large collections of
historical spatiotemporal data, which falls under the domain of data warehousing.
Conventional data warehouses are often designed around the goal of supporting
aggregate queries efficiently. However, the interesting queries in a spatiotemporal data
warehouse seek to discover the interaction patterns of moving objects and understand the
spatial and/or temporal relationships that exist between them. Facilitating such queries
in a scalable fashion over terabyte-sized spatiotemporal data warehouses is a significant
challenge. This requires extending traditional data mining techniques to the analysis
of large spatiotemporal data sets to discover spatial and temporal relationships, which
might exist at various levels of granularity involving complex data types. Research in
spatiotemporal data warehousing [41,42] is relatively new and is focused on refining
existing multidimensional models to support continuous data and defining semantics for
spatiotemporal aggregation [43,44].
1.3 Main Contributions
It is clear that extending modern database systems to support the data management
and analysis of spatiotemporal data requires addressing issues that span almost the entire
breadth of database research. A full treatment of the various issues could be the subject of
numerous dissertations! To keep the scope of this dissertation manageable, I tackle three
important problems in spatiotemporal data management. The dissertation focuses on
data produced by moving objects, since moving object databases represent the most
common application domain for spatiotemporal databases [1]. The three specific problems
considered are described briefly in the following subsections.
1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories
I first consider the scalability problem in computing joins over massive moving
object histories. In applications that produce a large amount of data describing the
paths of moving objects, there is a need to ask questions about the interaction of
objects over a long recorded history. This problem is becoming especially important
given the emergence of computational, simulation-based science (where simulations
of natural phenomena naturally produce massive databases containing data with
spatial and temporal characteristics), and the increased prevalence of tracking and
positioning devices such as RFID and GPS. The particular join that I study is the CPA
(Closest-Point-Of-Approach) join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another? I carefully consider several
obvious strategies for computing the answer to such a join, and then propose a novel,
adaptive join algorithm which naturally alters the way in which it computes the join in
response to the characteristics of the underlying data. A performance study over two
physics-based simulation data sets and a third, synthetic data set validates the utility of
my approach.
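For intuition, the CPA predicate for a single pair of linearly moving objects has a closed form: with relative position w(t) = w0 + t(u − v), the separation is minimized at t* = −(w0 · (u − v)) / |u − v|², clamped to the time interval. The sketch below (a quadratic-time baseline for illustration, not the adaptive algorithm developed in Chapter 3; all names are illustrative) applies this per segment:

```python
import math

def cpa_distance(p0, u, q0, v, t_max):
    """Minimum distance over [0, t_max] between two objects moving linearly:
    P(t) = p0 + t*u and Q(t) = q0 + t*v, with 2-D positions and velocities."""
    wx, wy = p0[0] - q0[0], p0[1] - q0[1]   # initial separation
    dx, dy = u[0] - v[0], u[1] - v[1]       # relative velocity
    dv_sq = dx * dx + dy * dy
    # Constant separation when the relative velocity is zero.
    t_cpa = 0.0 if dv_sq == 0.0 else -(wx * dx + wy * dy) / dv_sq
    t_cpa = max(0.0, min(t_max, t_cpa))     # clamp to the time interval
    return math.hypot(wx + t_cpa * dx, wy + t_cpa * dy)

def cpa_join(objects, d, t_max):
    """Brute-force CPA join: report id pairs that approach within distance d.
    `objects` maps an id to a (position, velocity) pair."""
    ids = sorted(objects)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if cpa_distance(*objects[a], *objects[b], t_max) <= d]
```

The brute-force join is O(n²) per segment, which is exactly the scalability problem the chapter addresses; the closed-form per-pair test, however, is the common core of any CPA join strategy.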
1.3.2 Entity Resolution in Spatiotemporal Databases
Next, I consider the problem of entity resolution for a large database of spatio-temporal
sensor observations. The following scenario is assumed. At each time-tick, one or more of
a large number of sensors report back that they have sensed activity at or near a specific
spatial location. For example, a magnetic sensor may report that a large metal object has
passed by. The goal is to partition the sensor observations into a number of subsets so
that it is likely that all of the observations in a single subset are associated with the same
entity, or physical object. For example, all of the sensor observations in one partition may
correspond to a single vehicle driving across the area that is monitored. The dissertation
describes a two-phase, learning-based approach to solving this problem. In the first phase,
a quadratic motion model is used to produce an initial classification that is valid for a
short portion of the timeline. In the second phase, Bayesian methods are used to learn the
long-term, unrestricted motion of the underlying objects.
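The first-phase idea can be illustrated with an ordinary least-squares quadratic fit. The sketch below (illustrative only; the dissertation's actual procedure is EM-based, and the function names are hypothetical) fits x(t) and y(t) separately and exposes a residual that could drive a classification decision:

```python
import numpy as np

def fit_quadratic_motion(times, xs, ys):
    """Fit x(t) and y(t) each with a degree-2 polynomial (least squares).
    Returns the coefficient vectors (highest degree first, as np.polyfit)."""
    cx = np.polyfit(times, xs, 2)
    cy = np.polyfit(times, ys, 2)
    return cx, cy

def fit_residual(times, xs, ys, cx, cy):
    """Mean Euclidean distance between the observations and the fitted path.
    A small residual suggests the observations belong to one object moving
    with (roughly) constant acceleration over this window."""
    ex = np.asarray(xs) - np.polyval(cx, times)
    ey = np.asarray(ys) - np.polyval(cy, times)
    return float(np.mean(np.hypot(ex, ey)))
```

A quadratic model corresponds to constant acceleration, which is why such a fit is only trustworthy over a short portion of the timeline; the second, Bayesian phase removes that restriction.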
1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases
Finally, I consider the problem of answering selection queries in the presence of
uncertainty incorporated through a probabilistic model. One way to facilitate the
representation of uncertainty in a spatiotemporal database is by allowing tuples to
have probabilistic attributes whose actual values are unknown, but are assumed
to be selected by sampling from a specified distribution. This can be supported by
including a few, pre-specified, common distributions in the database system when it is
shipped. However, to be truly general and extensible and support distributions that
cannot be represented explicitly or even integrated, it is necessary to provide an interface
that allows the user to specify arbitrary distributions by implementing a function that
produces pseudo-random samples from the desired distribution. Allowing a user to specify
uncertainty via arbitrary sampling functions creates several interesting technical challenges
during query evaluation. Specifically, evaluating time-instance selection queries such as
"Find all vehicles that are in close proximity to one another with probability p at time
t" requires the principled use of Monte Carlo statistical methods to determine whether
the query predicate holds. To support such queries, the thesis describes new methods
that draw heavily on the statistical theory of sequential estimation. I also
consider the problem of indexing for the Monte Carlo algorithms, so that samples from the
pseudo-random attribute value generator can be pre-computed and stored in a structure in
order to answer subsequent queries quickly.
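As a point of reference, the naive fixed-sample-size Monte Carlo estimator (in contrast to the sequential tests developed in Chapter 5) can be sketched as follows; `sample_attr` stands in for the user-supplied pseudo-random attribute value generator, and the names are illustrative rather than the dissertation's API:

```python
import random

def estimate_probability(sample_attr, predicate, n=10_000, seed=0):
    """Estimate P(predicate(x)) for x ~ sample_attr(rng), where sample_attr
    is an arbitrary user-supplied pseudo-random attribute value generator.
    The estimate's standard error shrinks as O(1/sqrt(n))."""
    rng = random.Random(seed)
    hits = sum(predicate(sample_attr(rng)) for _ in range(n))
    return hits / n
```

The weakness of this fixed-n scheme is that n must be chosen in advance: far too many samples are drawn for "easy" tuples whose probability is nowhere near the query threshold p, which is precisely the inefficiency that sequential methods such as the SPRT and the end-biased test are designed to remove.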
Organization. The rest of this study is organized as follows. Chapter 2 provides
a survey of work related to the problems addressed in this thesis. Chapter 3 tackles the
scalability issue when processing join queries over massive spatiotemporal databases.
Chapter 4 describes an approach to handling the entity-resolution problem in cleaning
spatiotemporal data sources. Chapter 5 describes a simple and general approach to
answering selection queries over spatiotemporal databases that incorporate uncertainty
within a probabilistic model framework (selection queries over probabilistic spatiotemporal
databases). Chapter 6 concludes the dissertation by summarizing the contributions and
identifying potential directions for future work.
To our knowledge, the only prior work on spatiotemporal joins is due to Jeong et al.
[51]. However, they only consider spatiotemporal join techniques that are straightforward
extensions to traditional spatial join algorithms. Further, they limit their scope to
index-based algorithms for objects over limited time windows.
2.2 Entity Resolution
Research in entity resolution has a long history in databases [52–55] and has focused
mainly on integrating non-geometric, string-based data from noisy external sources. Closely
related to the work in this thesis is the large body of work on target tracking that exists
in fields as diverse as signal processing, robotics, and computer vision. The goal in target
tracking [56,57] is to support the real-time monitoring and tracking of a set of moving
objects from noisy observations.
Various algorithms to classify observations among objects can be found in the
target tracking literature. They characterize the problem as one of data association (i.e.
associating observations with corresponding targets). A brief summary of the main ideas is
given below.
The seminal work is due to Reid [58], who proposed a multiple hypothesis technique
(MHT) to solve the tracking problem. In the MHT approach, a set of hypotheses is
maintained with each hypothesis reflecting the belief on the location of an individual
target. When a new set of observations arrives, the hypotheses are updated. Hypotheses
with minimal support are deleted and additional hypotheses are created to reflect new
evidence. The main drawback of the approach is that the number of hypotheses can grow
exponentially over time. Though heuristic filters [59–61] can be used to bound the search
space, this limits the scalability of the algorithm.

Target tracking has also been studied using Bayesian approaches [62]. The Bayesian
approach views tracking as a state estimation problem. Given some initial state and a
set of observations, the goal is to predict the object's next state. An optimal solution to
the problem is given by the Bayes filter [63, 64]. Bayes filters produce optimal estimates by
integrating over the complete set of observations. The formulation is often recursive and
involves complex integrals that are difficult to solve analytically. Hence, approximation
schemes such as particle filters [57] and sequential Monte Carlo techniques [63] are often
used in practice.
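One step of the approximation scheme can be sketched as a minimal bootstrap particle filter (a generic sketch, not the specific filter developed in Chapter 4; the `transition` and `likelihood` callables are assumptions of this illustration):

```python
import random

def particle_filter_step(particles, weights, transition, likelihood, obs, rng):
    """One predict-update-resample step of a bootstrap particle filter.
    particles: list of states; transition(state, rng) -> next state (motion
    model with noise); likelihood(obs, state) -> non-negative weight."""
    # Predict: propagate each particle through the motion model.
    particles = [transition(s, rng) for s in particles]
    # Update: reweight each particle by the observation likelihood.
    weights = [w * likelihood(obs, s) for w, s in zip(weights, particles)]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # Resample: draw particles with probability proportional to weight.
    particles = rng.choices(particles, weights=weights, k=len(particles))
    return particles, [1.0 / len(particles)] * len(particles)
```

Repeating this step over the observation sequence concentrates the particle population around the high-posterior states, approximating the intractable integrals of the exact Bayes filter by sampling.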
Recently, Markov chain Monte Carlo (MCMC) techniques [65,66] have been
proposed that attempt to approximate the optimal Bayes filter for multiple target
tracking. MCMC-based methods employ sequential Monte Carlo sampling and have been
shown to perform better than existing sub-optimal approaches such as MHT for tracking
objects in highly cluttered environments.
A common theme among most of the research in target tracking is its focus on
accurate tracking and detection of objects in real time in highly cluttered environments
over relatively short time periods. In a data warehouse context, the ability of techniques
such as MCMC to make fine-grained distinctions makes them ideal candidates when
performing operations such as drill-down that involve analytics over small time windows.
Their applicability is limited, however, to entity resolution in a data warehouse. In such a
context, summarization and visualization of historical trajectories smoothed over long time
intervals is often more useful. The model-based approach considered in this work seems a
more suitable candidate for such tasks.
2.3 Probabilistic Databases
Uncertainty management in spatiotemporal databases is a relatively new area of
research. Earlier work has focused on aspects of modeling uncertainty and query language
support [9,67].
In the context of query processing, one of the earliest papers in this area is the
paper by Pfoser et al. [68], where different sources of uncertainty are characterized and
a probability density function is used to model errors. Hosbond et al. [69] extended this
work by employing a hyper square uncertainty region, which expands over time to answer
queries using a TPR-tree.
Trajcevski et al. [70] study the problem from a modeling perspective. They model
trajectories by a cylindrical volume in 3D and outline semantics of fuzzy selection queries
over trajectories in both space and time. However, the approach does not specify how to
choose the dimensions of the cylindrical region which may have to change over time to
account for shrinking or expanding of the underlying uncertainty region.
Cheng et al. [71] describe algorithms for time-instant queries (probabilistic range
and nearest neighbor) using an uncertainty model where a probability density function
(PDF) and an uncertain region are associated with each point object. Given a location in
the uncertain region, the PDF returns the probability of finding the object at that location.
A similar idea is used by Tao et al. [72] to answer queries in spatial databases. To handle
time interval queries, Mokhtar et al. [73] represent uncertain trajectories as a stochastic
process with a time-parametric uniform distribution.
CHAPTER 3
SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES
In applications that produce a large amount of data describing the paths of
moving objects, there is a need to ask questions about the interaction of objects
over a long recorded history. In this chapter, the problem of computing joins over
massive moving object histories is considered. The particular join studied is the
Closest-Point-Of-Approach join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another?
3.1 Motivation
Frequently, it is of interest in applications which make use of spatial data to ask
questions about the interaction between spatial objects. A useful operation that enables
one to answer such questions is the spatial join operation. Spatial join is similar to the
classical relational join except that it is defined over two spatial relations based on a
spatial predicate. The objective of the join operation is to retrieve all object pairs that
satisfy a spatial relationship. One common predicate involves distance measures, where
we are interested in objects that were within a certain distance of each other. The query
"Find all restaurants within a distance of 10 miles from a hotel" is an example of a spatial
join.
For moving objects, the spatial join operation involves the evaluation of both a spatial
and a temporal predicate and for this reason the join is referred to as a spatiotemporal
join. For example, consider the spatial relations PLANESandTANKS, where each relation
represents accumulated trajectory data of planes and tanks from a battlefield simulation.
The query "Find all planes that are within a distance of 10 miles of a tank" is an example of a
spatiotemporal join. The spatial predicate in this case restricts the distance (10 miles) and
the temporal predicate restricts the time period to the current time instance.
In the more general case, the spatiotemporal join is issued over a moving object
history, which contains all of the past locations of the objects stored in a database. For
example, consider the query "Find all pairs of planes that came within a distance of 1,000
feet during their flight paths". Since there is no restriction on the temporal predicate,
answering this query involves an evaluation of the spatial predicate at every time instance.
The amount of data to be processed can be overwhelming. For example, in a typical
flight, the flight data recorder stores about 7 MB of data which records among other
things, the position and time of the flight for every second during its operation. Given
that on average the US Air Traffic Control handles around 30000 flights in a single day,
if all of this data were archived, it would result in 200 GB of data accumulation just
for a single day. For another example, it is not uncommon for scientific simulations to
output terabytes or even petabytes of spatiotemporal data (see Abdulla et al. [7] and the
references contained therein).
In this chapter, the spatiotemporal join problem for moving object histories in
three-dimensional space, with time considered as the fourth dimension, is investigated. The
spatiotemporal join operation considered is the CPA Join (Closest-Point-Of-Approach
Join). By closest point of approach, we refer to the position at which two moving objects
attain their closest possible distance [74]. Formally, in a CPA Join, we answer queries of
the following type: Find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that
CPA-distance(p, q) ≤ d. The goal is to retrieve all object pairs that are within a distance
d at their closest point of approach.
Surprisingly, this problem has not been studied previously. The spatial join problem
has been well-studied for stationary objects in two- and three-dimensional space [45, 47-49];
however, very little work related to spatiotemporal joins can be found in the literature.
There has been some work related to joins involving moving objects [75,76], but the work
has been restricted to objects in a limited time window and does not consider the problem
of joining object histories that may be gigabytes or terabytes in size.
The contributions can be summarized as follows:

- Three spatiotemporal join strategies for data involving moving object histories are presented.

- Simple adaptations of existing spatial join processing algorithms, based on the R-tree structure and on a plane-sweeping algorithm, are explored for spatiotemporal histories.

- To address the problems associated with straightforward extensions of these techniques, a novel join strategy for moving objects based on an extension of the basic plane-sweeping algorithm is described.

- A rigorous evaluation and benchmarking of the alternatives is provided. The performance results suggest that significant speedup in execution time can be obtained with the adaptive plane-sweeping technique.
The rest of this chapter is organized as follows: In Section 3.2, the closest point
of approach problem is reviewed. In Sections 3.3 and 3.4, two obvious alternatives for
implementing the CPA Join, using R-trees and plane-sweeping, are described. In Section
3.5, a novel adaptive plane-sweeping technique that considerably outperforms competing
techniques is presented. Results from our benchmarking experiments are given in Section
3.6. Section 3.7 outlines related work.
3.2 Background
In this Section, we discuss the motion of moving objects, and give an intuitive
description of the CPA problem. This is followed by an analytic solution to the CPA
problem over a pair of points moving in a straight line.
3.2.1 Moving Object Trajectories
Trajectories describe the motion of objects in a 2 or 3-dimensional space. Real-world
objects tend to have smooth trajectories and storing them for analysis often involves
approximation to a polyline. A polyline approximation of a trajectory connects object
positions, sampled at discrete time instances, by line segments (Figure 3-1).

In a database, the trajectory of an object can be represented as a sequence of the form
(t1, v1), (t2, v2), . . . , (tn, vn), where each vi represents the position vector of the object at
Figure 3-1. Trajectory of an object (a) and its polyline approximation (b)
time instance ti. The arity of the vector describes the dimensions of the space. For flight
simulation data, the arity would be 3, whereas for a moving car, the arity would be 2. The
position of the moving objects is normally obtained in one of several ways: by sampling or
polling the object at discrete time instances, through the use of devices like GPS, etc.
3.2.2 Closest Point of Approach (CPA) Problem
We are now ready to describe the CPA problem. Let CPA(p, q, d) over two straight-line
trajectories be evaluated as follows. Assuming the distance between the two objects is
given by mindist, we output true if mindist < d (the objects were within distance d
during their motion in space), and false otherwise. We refer to the calculation of CPA(p, q, d)
as the CPA problem.
The minimum distance mindist between two objects is the distance between the
object positions at their closest point of approach. It is straightforward to calculate
mindist once the CPA time tcpa, the time instance at which the objects reached their
closest distance, is known.
We now give an analytic solution to the CPA problem for a pair of objects on a
simple straight-line trajectory.
Calculating the CPA time tcpa. Figure 3-2 shows the trajectories of two objects p
and q in 2-dimensional space for the time period [tstart, tend]. The positions of these objects
at any time instance t are given by p(t) and q(t). Let their positions at time t = 0 be p0
and q0, and let their velocity vectors per unit of time be u and v. The motion equations for
Figure 3-2. Closest Point of Approach Illustration
Figure 3-3. CPA Illustration with trajectories
these two objects are p(t) = p0 + tu and q(t) = q0 + tv. At any time instance t, the distance
between the two objects is given by d(t) = |p(t) − q(t)|.

Using basic calculus, one can find the time instance at which the distance d(t) is
minimum (when D(t) = d(t)² is a minimum). Solving for this time, we obtain:

tcpa = −((p0 − q0) · (u − v)) / |u − v|²

Given this, mindist is given by |p(tcpa) − q(tcpa)|.
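The closed-form solution above can be sketched in a few lines of Python. The function names cpa_time and min_dist are ours, and positions and velocities are plain coordinate tuples:

```python
import math

def cpa_time(p0, u, q0, v):
    """Time at which two linearly moving points are closest.

    p0, q0 are initial positions; u, v are per-unit-time velocity vectors.
    Returns 0.0 when the relative velocity is zero (distance is constant).
    """
    w = [a - b for a, b in zip(p0, q0)]    # p0 - q0
    dv = [a - b for a, b in zip(u, v)]     # u - v
    dv2 = sum(c * c for c in dv)           # |u - v|^2
    if dv2 == 0.0:
        return 0.0
    return -sum(a * b for a, b in zip(w, dv)) / dv2

def min_dist(p0, u, q0, v):
    """Distance between the two points at their closest point of approach."""
    t = cpa_time(p0, u, q0, v)
    p = [a + t * b for a, b in zip(p0, u)]
    q = [a + t * b for a, b in zip(q0, v)]
    return math.dist(p, q)
```

For example, a point starting at the origin moving along x and a point starting at (4, 4) moving down meet at (4, 0) at t = 4, so their CPA distance is zero.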
The distance calculation that we described above is applicable only between two
objects on a straight line trajectory. To calculate the distance between two objects on a
polyline trajectory, we apply the same basic technique. For trajectories consisting of a
chain of line-segments, we find the minimum distance by first determining the distance
between each pair of line-segments and then choosing the minimum distance.
As an example, consider Figure 3-3 which shows the trajectory of two objects in
2-dimensional space with time as the third dimension. Each object is represented by
an array that stores the chain of segments comprising the trajectory. The line-segments
are labeled by the array indices. To determine the qualifying pairs, we find the CPA
distance between the line-segment pairs (p[1],q[1]), (p[1],q[2]), (p[1],q[3]), (p[2],q[1]),
(p[2],q[2]), (p[2],q[3]), (p[3],q[1]), (p[3],q[2]), (p[3],q[3]) and return the pair with the minimum
distance among all evaluated pairs. The complete code for computing CPA(p, q, d) over
multi-segment trajectories is given as Algorithm 1.
Algorithm 1 CPA(Object p, Object q, distance d)
 1: mindist = ∞
 2: for (i = 1 to p.size) do
 3:   for (j = 1 to q.size) do
 4:     tmp = CPA-Distance(p[i], q[j])
 5:     if (tmp ≤ mindist) then
 6:       mindist = tmp
 7:     end if
 8:   end for
 9: end for
10: if (mindist ≤ d) then
11:   return true
12: end if
13: return false
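As a sketch of how this might look in runnable form, the Python below represents each segment as ((t0, pos0), (t1, pos1)) with linear interpolation between sampled positions, and assumes t1 > t0 for every segment. The helper seg_cpa_distance is our own stand-in for the CPA-Distance routine: it clamps the CPA time to the segments' overlapping time window and reports infinity when the segments do not overlap in time:

```python
import math

def seg_cpa_distance(s1, s2):
    """CPA distance between two linearly interpolated trajectory segments.

    Each segment is ((t0, pos0), (t1, pos1)) with t1 > t0; positions are
    coordinate tuples. Returns math.inf when there is no temporal overlap.
    """
    (a0, p0), (a1, p1) = s1
    (b0, q0), (b1, q1) = s2
    lo, hi = max(a0, b0), min(a1, b1)
    if lo > hi:
        return math.inf                      # no temporal overlap
    u = [(x1 - x0) / (a1 - a0) for x0, x1 in zip(p0, p1)]   # velocity of s1
    v = [(x1 - x0) / (b1 - b0) for x0, x1 in zip(q0, q1)]   # velocity of s2
    def pos(x0, vel, t0, t):
        return [c + (t - t0) * w for c, w in zip(x0, vel)]
    w0 = [c - e for c, e in zip(pos(p0, u, a0, lo), pos(q0, v, b0, lo))]
    dv = [c - e for c, e in zip(u, v)]
    dv2 = sum(c * c for c in dv)
    t = lo if dv2 == 0.0 else lo - sum(c * e for c, e in zip(w0, dv)) / dv2
    t = min(max(t, lo), hi)                  # clamp CPA time to the overlap
    return math.dist(pos(p0, u, a0, t), pos(q0, v, b0, t))

def cpa(p, q, d):
    """Algorithm 1: true iff objects p and q (segment lists) came within d."""
    return min((seg_cpa_distance(s, r) for s in p for r in q),
               default=math.inf) <= d
```

The nested generator mirrors the double loop of Algorithm 1; a real implementation would skip segment pairs with disjoint time intervals before computing any distances.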
In the next two Sections, we consider two obvious alternatives for computing the
CPA Join, where we wish to discover all pairs of objects (p, q) from two relations P and
Q, where CPA(p, q, d) evaluates to true. The first technique we describe makes use of an
underlying R-tree index structure to speed up join processing. The second methodology is
based on a simple plane-sweep.
3.3 Join Using Indexing Structures
Given numerous existing spatiotemporal indexing structures, it is natural to first
consider employing a suitable index to perform the join.
Though many indexing structures exist, unfortunately most are not suitable for the
CPA Join. For example, a large number of indexing structures like the TPR-tree [17],
REXP-tree [77], and TPR*-tree [78] have been developed to support predictive queries, where
the focus is on indexing the future position of an object. However, these index structures
are generally not suitable for the CPA Join, where access to the entire history is needed.
Indexing structures like MR-tree [26], MV3R-tree [27], HR-tree [28], HR+-tree [27]
are more relevant since they are geared towards answering time instance queries (in case
of MV3R-tree also short time-interval queries), where all objects alive at a certain time
instance are retrieved. The general idea behind these index structures is to maintain a
separate spatial index for each time instance. However, such indices are meant to store
discrete snapshots of an evolving spatial database, and are not ideal for use with the CPA
Join over continuous trajectories.
3.3.1 Trajectory Index Structures
More relevant are indexing structures specific to moving object trajectory histories
like the TB-tree, STR-tree [21] and SETI [18]. TB-trees emphasize trajectory preservation
since they are primarily designed to handle topological queries where access to the entire
trajectory is desired (segments belonging to the same trajectory are stored together). The
problem with TB-trees in the context of the CPA Join is that segments from different
trajectories that are close in space or time will be scattered across nodes. Thus, retrieving
segments in a given time window will require several random I/Os. In the same paper
[21], an STR-tree is introduced that attempts to balance spatial locality with
trajectory preservation. However, as the authors point out, STR-trees turn out to be a
weak compromise that does not perform better than traditional 3D R-trees [20] or TB-trees.
More appropriate to the CPA Join is SETI [18]. SETI partitions two-dimensional space
statically into non-overlapping cells and uses a separate spatial index for each cell. SETI
might be a good candidate for the CPA Join since it preserves spatial and temporal
locality. However, there are several reasons why SETI is not the most natural choice for a CPA
Join:
- It is not clear that SETI's forest scales to a three-dimensional space. A 25 × 25 SETI grid in two dimensions becomes a sparse 25 × 25 × 25 grid with almost 20,000 cells in three dimensions.
- SETI's grid structure is an interesting idea for addressing problems with high variance in object speeds (we will use a related idea for the adaptive plane-sweep algorithm described later). However, it is not clear how to size the grid for a given data set, and sizing it for a join seems even harder. It might very well be that relation R should have a different grid for R ⋈ S compared to R ⋈ T.
- For a CPA Join over a limited history, SETI has no way of pruning the search space, since every cell will have to be searched.
3.3.2 R-tree Based CPA Join
Given these caveats, perhaps the most natural choice for the CPA Join is the R-tree
[16]. The R-tree [16] is a hierarchical, multi-dimensional index structure that is commonly
used to index spatial objects. The join problem has been studied extensively for R-trees,
and several spatial join techniques exist [45,46,79] that leverage underlying R-tree
index structures to speed up join processing. Hence, our first inclination is to consider a
spatiotemporal join strategy that is based on R-trees. The basic idea is to index object
histories using R-trees and then perform a join over these indices.
The R-Tree Index
It is a very straightforward task to adapt the R-tree to index a history of moving
object trajectories. Assuming three spatial dimensions and a fourth temporal dimension,
the four-dimensional line segments making up each individual object trajectory are simply
treated as individual spatial objects and indexed directly by the R-tree. The R-tree and
its associated insertion or packing algorithms are used to group those line segments into
disk-page sized groups, based on proximity in their four-dimensional space. These pages
make up the leaf level of the tree. As in a standard R-tree, these leaf pages are indexed
by computing the minimum bounding rectangle that encloses the set of objects stored in
each leaf page. Those rectangles are in turn grouped into disk-page sized groups which are
themselves indexed. An R-tree index for 3 line segments moving through 2-dimensional
space is depicted in Figure 3-4.
Figure 3-4. Example of an R-tree
Basic CPA Join Algorithm Using R-Trees
Assuming that the two spatiotemporal relations to be joined are organized using
R-trees, we can use one of the standard R-tree distance joins as a basis for the CPA Join.
The common approach to joins using R-trees employs a carefully controlled, synchronized
traversal of the two R-trees to be joined. The pruning power of the R-tree index arises
from the fact that if two bounding rectangles R1 and R2 do not satisfy the join predicate,
then the join predicate cannot be satisfied between any two bounding rectangles enclosed
within R1 and R2.
In a synchronized technique, both R-trees are traversed simultaneously, retrieving
object-pairs that satisfy the join predicate. To begin with, the root nodes of both the
R-trees are pushed into a queue. A pair of nodes from the queue is processed by pairing
up every entry of the first node with every entry in the second node to form the candidate
set for further expansion. Each pair in the candidate set that qualifies the join predicate is
pushed into the queue for subsequent processing. The strategy described leads to a BFS
(Breadth-First-Search) expansion of the trees. BFS-style traversal lends itself to global
optimization of the join processing steps [46] and works well in practice.
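The BFS traversal just described can be sketched as follows. The dict-based node layout ("box", "children", "entries" keys) and the function names are our own toy representation for illustration, not the structure used in an actual R-tree implementation:

```python
from collections import deque
import math

def box_mindist(b1, b2):
    """Minimum distance between axis-aligned boxes given as (low, high) corner tuples."""
    return math.sqrt(sum(max(l1 - h2, l2 - h1, 0.0) ** 2
                         for (l1, h1), (l2, h2) in zip(zip(*b1), zip(*b2))))

def sync_join(root1, root2, d):
    """BFS synchronized traversal: report leaf-entry id pairs within distance d."""
    result, queue = [], deque([(root1, root2)])
    while queue:
        n1, n2 = queue.popleft()
        if box_mindist(n1["box"], n2["box"]) > d:
            continue                          # prune: no enclosed pair can qualify
        if "entries" in n1 and "entries" in n2:
            result.extend((e1["id"], e2["id"])
                          for e1 in n1["entries"] for e2 in n2["entries"]
                          if box_mindist(e1["box"], e2["box"]) <= d)
        else:
            # pair every child of one node with every child of the other;
            # a leaf is paired as-is against the other node's children
            for c1 in n1.get("children", [n1]):
                for c2 in n2.get("children", [n2]):
                    queue.append((c1, c2))
    return result
```

A production join would add the plane-sweep and buffering refinements described next; this sketch shows only the pruning rule.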
Figure 3-5. Heuristic to speed up distance computation
The distance routine is used in evaluating the join predicate to determine the distance
between two bounding rectangles associated with a pair of nodes. A node-pair qualifies
for further expansion if the distance between the pair is less than the limiting distance d
supplied by the query.
Heuristics to Improve the Basic Algorithm
The basic join algorithm can be improved in several ways by using standard
and non-standard techniques for reducing the I/O and CPU costs of spatial joins. These
include:

- Using a plane-sweeping algorithm [45] to speed up the all-pairs distance computation when pairs of nodes are expanded and their children are checked for possible matches.

- Carefully ordering the processing of node pairs so that when each pair is considered, one or both of the nodes are in the buffer [46].

- Avoiding expensive distance computations by applying heuristic filters. Computing the distance between two 3-dimensional rectangles can be a very costly operation, since the closest points may be at arbitrary positions on the faces of the rectangles. To speed this computation, the magnitudes of the diagonals of the two rectangles (d1 and d2) can be computed first. Next, we pick an arbitrary point from each of the rectangles (points P1 and P2), and compute the distance between them, called darbit. If darbit − d1 − d2 > djoin, then the two rectangles cannot contain any points as close as djoin from one another and the pair can be discarded, as shown in Figure 3-5. This provides for immediate dismissals with only three distance computations (or one, if the diagonal distances are precomputed and stored with each rectangle).
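A minimal sketch of this filter (the function name is ours; rectangles are (low-corner, high-corner) tuples, with the low corner serving as the arbitrary point):

```python
import math

def quick_dismiss(r1, r2, d_join):
    """True if the rectangle pair can be safely discarded without a full
    rectangle-to-rectangle distance computation (the Figure 3-5 heuristic)."""
    d1 = math.dist(r1[0], r1[1])        # diagonal of r1
    d2 = math.dist(r2[0], r2[1])        # diagonal of r2
    d_arbit = math.dist(r1[0], r2[0])   # distance between the arbitrary points
    # By the triangle inequality, no two points inside the rectangles can be
    # closer than d_arbit - d1 - d2, so the pair cannot satisfy the join.
    return d_arbit - d1 - d2 > d_join
```

The filter is conservative: a False result only means the pair cannot be dismissed cheaply and the exact distance must still be computed.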
Figure 3-6. Issues with R-trees: fast-moving object p joins with everyone
In addition, there are some obvious improvements to the algorithm that can be made
which are specific to the 4-dimensional CPA Join:

- The fourth dimension, time, can be used as an initial filter. If two MBRs or line segments do not overlap in time, then the pair cannot possibly be a candidate for a CPA match.

- Since time can be used to provide immediate dismissals without Euclidean distance calculations, it is given priority over the other attributes. For example, when a plane-sweep is performed to prune an all-pairs CPA distance computation, time is always chosen as the sweeping axis. The reason is that time will usually have the greatest pruning power of any attribute, since time-based matches must always be exact, regardless of the join distance.

- In our implementation of the CPA Join for R-trees, we make use of the STR packing algorithm [80] to build the trees. Because the potential pruning power of the time dimension is greatest, we ensure that the trees are well-organized with respect to time by choosing time as the first packing dimension.
Problem With R-tree CPA Join
Unfortunately, it turns out that in practice the R-tree can be ill-suited to the problem
of computing spatiotemporal joins over moving object histories. R-trees have a problem
handling databases with a high variance in object velocities. The reason for this is that
join algorithms which make use of R-trees rely on tight and well-behaved minimum
bounding rectangles to speed the processing of the join. When the positions of a set of
moving objects are sampled at periodic intervals, fast moving objects tend to produce
larger bounding rectangles than slow moving objects.
Figure 3-7. Progression of plane-sweep
One such scenario is depicted in Figure 3-6, which shows the paths of a set of objects
on a 2-D plane for a given time period. A fast-moving object such as p will be contained
in a very large MBR, while slower objects such as q will be contained in much smaller
MBRs. When a spatial join is computed over R-trees storing these MBRs, the MBR
associated with p can overlap many smaller MBRs, and each overlap will result in an
expensive distance computation (even if the objects do not travel close to one another).
Thus, any sort of variance in object velocities can adversely affect the performance of the
join.
3.4 Join Using Plane-Sweeping
The second technique that is considered is a join strategy based on a simple plane-
sweep. Plane-sweep is a powerful technique for solving proximity problems involving
geometric objects in a plane and has previously been proposed [49] as a way to efficiently
compute the spatial join operation.
3.4.1 Basic CPA Join using Plane-Sweeping
Plane-sweep is an excellent candidate for use with the CPA join because no matter
what distance threshold is given as input into the join, two objects must overlap in the
time dimension for there to be a potential CPA match. Thus, given two spatiotemporal
relations P and Q, we could easily base our implementation of the CPA Join on a
plane-sweep along the time dimension.
We would begin a plane-sweep evaluation of the CPA join by first sorting the intervals
making up P and Q along the time dimension, as depicted in Figure 3-7. We then sweep
a vertical line along the time dimension. A sweepline data structure D is maintained
which keeps track of all line segments which are valid given the current position of the line
along the time dimension. As the sweepline progresses, D is updated with insertions (new
segments that became active) and deletions (segments whose time period has expired).
Segment pairs from both input relations that satisfy the join predicate are always present
in D, and they can be checked and reported during updates to D. Pseudo-code for the
algorithm is given below:
Algorithm 2 PlaneSweep (Relation P, Relation Q, distance d)1: Form a single list L containing segments from P and Q sorted bytstart2: Initialize sweepline data structure D3: while not IsEmpty (L)do4: Segmenttop = popFront (L)5: Insert (D,top)6: Delete from D all segmentss s.t. (s.tend < top.tstart){remove segments that donot
intersect sweepline}7: Query (D,top, d){report segments in D that are within distance dist}8: end while
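The sweep loop can be sketched in Python as follows. For clarity, D is a plain list (a real implementation would use a spatial structure, as discussed below), and the tuple layout (tstart, tend, relation_tag, point) with a single static point standing in for the segment geometry is our own simplification; the sketch expires and queries before inserting, which reports the same cross-relation pairs:

```python
import math

def plane_sweep(P, Q, d):
    """Time-ordered sweep over two segment lists; reports cross-relation
    pairs whose (simplified) geometries come within distance d while both
    are cut by the sweepline."""
    L = sorted(P + Q, key=lambda s: s[0])         # merge, sorted by tstart
    D, matches = [], []
    for top in L:
        D = [s for s in D if s[1] >= top[0]]      # expire segments that ended
        for s in D:
            if s[2] != top[2] and math.dist(s[3], top[3]) <= d:
                matches.append((s, top))
        D.append(top)                             # insert the new segment
    return matches
```

For example, with one segment per relation overlapping in time and 0.5 apart in space, a single match is reported.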
In the case of the CPA join, assuming that all moving objects at any given moment
can be stored in main memory, any of a number of data structures can be used to
implement D, such as a quad- or oct-tree, or an interval skip-list [81]. The main
requirement is that the selected data structure should make it easy to check the proximity
of objects in space.
3.4.2 Problem With The Basic Approach
Although the plane-sweep approach is simple, in practice it is usually too slow to
be useful for processing moving object queries. The problem has to do with how the
sweepline progression takes place. As the sweepline moves through the data space, it has
to stop momentarily at sample points (time instances at which object positions were
recorded) to process newly encountered segments into the data structure D. New segments
Figure 3-8. Layered Plane-Sweep
that are encountered at the sample point are added into the data structure and segments
in D that are no longer active are deleted from it.
Consequently, the sweepline pauses more often when objects with high sampling rates
are present, and the progress of the sweepline is heavily influenced by the sampling rates
of the underlying objects. For example, consider Figure 3-7 which shows the trajectory
of four objects in a given time period. In the case illustrated, object p2 controls the
progression of the sweepline. Observe that in the time-interval [tstart, tend], only new
segments from object p2 get added to D, but expensive join computations are performed
each time with the same set of line segments.
The net result is that if the sampling rate of a data set is very high relative to the
amount of object movement in the data set, then processing a multi-gigabyte object
history using a simple plane-sweeping algorithm may take a prohibitively long time.
3.4.3 Layered Plane-Sweep
One way to address this problem is to reduce the number of segment level comparisons
by comparing the regions of movement of various objects at a coarser level. For example,
reconsider the CPA join depicted in Figure 3-7. If we were to replace the many oscillations
of object p2 with a single minimum bounding rectangle which enclosed all of those
oscillations from tstart to tend, we could then use that rectangle during the plane-sweep
as an initial approximation to the path of object p2. This would potentially save many
distance computations.
This idea can be taken to its natural conclusion by constructing a minimum bounding
box that encompasses the line-segments of each object. A plane-sweep is then performed
over the bounding boxes, and only qualifying boxes are expanded further. We refer to this
technique as the Layered Plane-Sweep approach, since the plane-sweep is performed at two
layers: first at the coarser level of bounding boxes, and then at the finer level of individual
line segments.
One issue that must be considered is how much movement is to be summarized
within the bounding rectangle for each object. Since we would like to eliminate as many
comparisons as possible, one natural choice would be to let the available system memory
dictate how much movement is covered for each object. Given a fixed buffer size, the
algorithm will proceed as follows.
Algorithm 3 LayeredPlaneSweep(Relation P, Relation Q, distance d)
 1: Segments are defined by [(xstart, xend), (ystart, yend), (zstart, zend), (tstart, tend)]
 2: Assume a sorted list of object segments (by tstart) on disk
 3: while there is still some unprocessed data do
 4:   Read in enough data from P and Q to fill the buffer
 5:   Let tstart be the first time tick which has not yet been processed by the plane-sweep
 6:   Let tend be the last time tick for which no data is still on disk
 7:   Next, bound the trajectory of every object present in the buffer by an MBR
 8:   Sort the MBRs along one of the spatial dimensions and then perform a plane-sweep along that dimension
 9:   Expand the qualifying MBR pairs to get the actual trajectory data (line segments)
10:   Sort the line segments by tstart
11:   Perform a final sweep along the time dimension to get the final result set
12: end while
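The first-level filtering step of the layered approach can be sketched as follows. For brevity the sketch compares all MBR pairs directly rather than sweeping them along an axis, and the function names and the dict-of-point-lists input layout are our own:

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a set of coordinate tuples."""
    return tuple(map(min, zip(*points))), tuple(map(max, zip(*points)))

def box_dist(b1, b2):
    """Minimum distance between two axis-aligned boxes ((low…), (high…))."""
    return math.sqrt(sum(max(l1 - h2, l2 - h1, 0.0) ** 2
                         for (l1, h1), (l2, h2) in zip(zip(*b1), zip(*b2))))

def candidate_pairs(objsP, objsQ, d):
    """First-level filter: object pairs whose buffered MBRs come within d.

    objsP/objsQ map object ids to the positions buffered for the current
    time window; only the surviving pairs need a segment-level sweep.
    """
    boxesP = {o: mbr(pts) for o, pts in objsP.items()}
    boxesQ = {o: mbr(pts) for o, pts in objsQ.items()}
    return [(p, q) for p, bp in boxesP.items()
                   for q, bq in boxesQ.items() if box_dist(bp, bq) <= d]
```

In the best case, all but a handful of object pairs are eliminated here, so only a small fraction of the buffered segments ever reach the second-level sweep.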
Figure 3-8 illustrates the idea. It depicts a snapshot of object trajectories starting
at some time instance tstart. Segments in the interval [tstart, tend] represent the maximum
that can be buffered in the available memory. A first level plane-sweep is carried out over
the bounding boxes to eliminate false positives. Qualifying objects are expanded and a
second-level plane-sweep is carried out over individual line-segments. In the best case,
Figure 3-9. Problem with using large granularities for bounding box approximation
there is an opportunity to process the entire data set through just three comparisons at
the MBR level.
3.5 Adaptive Plane-Sweeping
While the layered plane-sweep typically performs far better than the basic plane-sweeping
algorithm, it may not always choose the proper level of granularity for the bounding box
approximations. This Section describes an adaptive strategy that takes into careful
consideration the underlying object interaction dynamics and adjusts this granularity
dynamically in response to the underlying data characteristics.
3.5.1 Motivation
In the simple layered plane-sweep, the granularity for the bounding box approximation
is always dictated by the available system memory. The assumption is that pruning power
increases monotonically with increasing granularity. Unfortunately, this is not always the
case. As a motivating example, consider Figure 3-9. Assume available system memory
allows us to buffer all the line segments. In this case, the layered plane-sweep performs
no better than the basic plane-sweep, due to the fact that all the object bounding
boxes overlap with each other and as a result no pruning is achieved at the first-level
plane-sweep.
However, assume we had instead fixed the granularity to correspond to the time
period [tstart, ti], as depicted in Figure 3-10. In this case, none of the bounding boxes
overlap, and there are possibly many dismissals at the first level. Though less of the
buffer is processed initially, we are able to eliminate many of the segment-level distance
comparisons compared to a technique that bounds the entire time period, thereby
potentially increasing the efficiency of the algorithm. The entire buffer can then be
processed in a piece-by-piece fashion, as depicted in Figure 3-10. In general, the efficiency
of the layered plane-sweep is tied not to the granularity of the time interval that is
processed, but the granularity that minimizes the number of distance comparisons.
3.5.2 Cost Associated With a Given Granularity
Since distance computations dominate the time required to compute a typical CPA
Join, the cost associated with a given granularity can be approximated as a function of the
number of distance comparisons that are needed to process the segments encompassed in
that granularity. Let n_MBR be the number of distance computations at the box level, let
n_seg be the number of distance calculations at the segment level, and let α be the fraction
of the time range in the buffer which is processed at once. Then the cost associated with
processing that fraction of the buffer can be estimated as:

cost_α = (n_seg + n_MBR) × (1/α)
This function reflects the fact that if we choose a very small value for α, we will have to
process many cut-points in order to process the entire buffer, which can increase the cost
of the join. As α shrinks, the algorithm becomes equivalent to the traditional plane-sweep.
On the other hand, choosing a very large value for α tends to increase (n_seg + n_MBR),
eventually yielding an algorithm which is equivalent to the simple, layered plane-sweep. In
practice, the optimal value for α lies somewhere in between the two extremes, and varies
from data set to data set.
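The trade-off above can be captured in a one-line cost model. A minimal sketch in Python (the function and parameter names here are illustrative, not from the dissertation's implementation):

```python
def cost(n_seg, n_mbr, alpha):
    """Estimated distance computations to consume the whole buffer when a
    fraction alpha (0 < alpha <= 1) of its time range is processed per pass:
    the per-pass work (n_seg + n_mbr) is repeated roughly 1/alpha times."""
    return (n_seg + n_mbr) * (1.0 / alpha)
```

Note that n_seg and n_mbr are themselves functions of α (coarser granularities produce more overlapping boxes), which is why the optimal α sits between the two extremes.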
3.5.3 The Basic Adaptive Plane-Sweep
Given this cost function, it is easy to design a greedy plane-sweep algorithm that
attempts to repeatedly minimize cost_α in order to adapt to the underlying (and potentially
[Figure: trajectories p1, p2 (from P) and q1, q2 (from Q) plotted against time and y, with cut-points t_start, t_i, t_j, and t_end marked on the time axis.]
Figure 3-10. Adaptively varying the granularity
time-varying) characteristics of the data. At every iteration, the algorithm simply chooses
to process the fraction of the buffer that appears to minimize the overall cost of the
plane-sweep in terms of the expected number of distance computations. The algorithm is
given below:
Algorithm 4 AdaptivePlaneSweep(Relation P, Relation Q, distance d)
1: while there is still some unprocessed data do
2:   Read in enough data from P and Q to fill the buffer
3:   Let t_start be the first time tick which has not yet been processed by the plane-sweep
4:   Let t_end be the last time tick for which no data is still on disk
5:   Choose α so as to minimize cost_α
6:   Perform a layered plane-sweep from time t_start to t_start + α(t_end − t_start) {steps 5-9 of procedure LayeredPlaneSweep}
7: end while
Unfortunately, there are two obvious difficulties involved with actually implementing
the above algorithm:
First, the cost cost_α associated with a given granularity is known only after the layered plane-sweep has been executed at that granularity.
Second, even if we can compute cost_α easily, it is not obvious how we can compute cost_α for all values of α from 0 to 1 so as to minimize cost_α over all α.
These two issues are discussed in detail in the next two Sections.
3.5.4 Estimating Cost
This Section describes how to efficiently estimate cost_α for a given α using a simple,
online sampling algorithm reminiscent of the algorithm of Hellerstein and Haas [82].
At a high level, the idea is as follows. To estimate cost_α, we begin by constructing
bounding rectangles for all of the objects in P considering their trajectories from time
t_start to t_start + α(t_end − t_start). These rectangles are then inserted into an in-memory index, just as
if we were going to perform a layered plane-sweep. Next, we randomly choose an object q_1
from Q, and construct a bounding box for its trajectory as well. This object is joined with
all of the objects in P by using the in-memory index to find all bounding boxes within
distance d of q_1. Then:
Let n_MBR,q1 be the number of distance computations needed by the index to compute which objects from P have bounding rectangles within distance d of the bounding rectangle for q_1, and
Let n_seg,q1 be the total number of distance computations that would have been needed to compute the CPA distance between q_1 and every object p ∈ P whose bounding rectangle is within distance d of the bounding rectangle for q_1 (this can be computed efficiently by performing a plane-sweep without actually performing the required distance computations).
Once n_MBR,q1 and n_seg,q1 have been computed for q_1, the process can be repeated for a
second randomly selected object q_2 ∈ Q, for a third object q_3, and so on. A key observation
is that after m objects from Q have been processed, the value

ĉ_m = (1/m) × Σ_{i=1}^{m} (n_MBR,qi + n_seg,qi) × |Q|

represents an unbiased estimator for (n_MBR + n_seg) at α, where |Q| denotes the number of
data objects in Q.
In practice, however, we are not only interested in ĉ_m. We would also like to know at
all times just how accurate our estimate ĉ_m is, since at the point where we are satisfied
with our guess as to the real value of cost_α, we want to stop the process of estimating
cost_α and continue with the join.
Fortunately, the central limit theorem can easily be used to estimate the accuracy
of ĉ_m. Assuming sampling with replacement from Q, for large enough m the error of our
estimate will be normally distributed around (n_MBR + n_seg) with variance

σ²_m = (1/m) × σ²(Q),

where σ²(Q) is defined as

σ²(Q) = (1/|Q|) × Σ_{i=1}^{|Q|} {(n_MBR,qi + n_seg,qi) × |Q| − (n_MBR + n_seg)}²

Since in practice we cannot know σ²(Q), it must be estimated via the expression

σ̂²(Q_m) = (1/(m − 1)) × Σ_{i=1}^{m} {(n_MBR,qi + n_seg,qi) × |Q| − ĉ_m}²

(Q_m denotes the sample of Q that is obtained after m objects have been randomly
retrieved from Q). Substituting into the expression for σ²_m, we can treat ĉ_m as a normally
distributed random variable with variance σ̂²(Q_m)/m.
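A sketch of this online sampling loop in Python, stopping once the two-standard-deviation band falls within a chosen relative error. The helper `count_for`, which returns n_MBR,q + n_seg,q for one sampled object, and all other names here are assumptions for illustration:

```python
import random
import statistics

def estimate_cost(Q, count_for, rel_err=0.1, min_samples=30, max_samples=10_000):
    """Estimate (n_MBR + n_seg) by sampling objects q from Q with replacement;
    each sample scales one object's work up to all of Q.  Stop once
    2 * sqrt(var/m) <= rel_err * estimate (roughly 95% confidence)."""
    samples = []
    for _ in range(max_samples):
        q = random.choice(Q)
        samples.append(count_for(q) * len(Q))
        m = len(samples)
        if m >= min_samples:
            est = statistics.fmean(samples)
            var = statistics.variance(samples)  # sample variance, 1/(m-1) normalization
            if 2.0 * (var / m) ** 0.5 <= rel_err * est:
                return est
    return statistics.fmean(samples)
```

In a real implementation `count_for` would probe the in-memory index and run the segment-level plane-sweep bookkeeping described above, rather than a user-supplied function.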
In our implementation of the adaptive plane-sweep, we continue the sampling
process until our estimate for cost_α is accurate to within ±10% at 95% confidence. Since
95% of the standard normal distribution falls within two standard deviations of the mean,
this is equivalent to sampling until 2 × sqrt(σ̂²(Q_m)/m) is less than ĉ_m × 0.1.
3.5.5 Determining The Best Cost
We now address the second issue: how to compute cost_α for all values of α from 0 to
1 so as to minimize cost_α over all α.
Calculating cost_α for all possible values of α is prohibitively expensive and hence not
feasible in practice. Fortunately, we do not have to evaluate every value of α to
determine the best one. This is due to the following interesting fact: if we plot all possible
values of α against their respective associated costs, we would observe that the graph is not
linear, but exhibits a certain concavity. The concave region of the graph represents a sweet
spot and represents the feasible region for the best cost.
As an example consider Figure 3-11, which shows the plot of the cost function for
various fractions α for one of the experimental data sets from Section 3.6. Given this
[Figure: estimated number of distance computations plotted against the percentage of the buffer processed, for k = 20.]
Figure 3-11. Convexity of cost function illustration.
fact, we identify the feasible region by evaluating cost_αi for a small number, k, of α_i
values. Given k, the number of allowed cut-points, the fraction α_1 can be determined as
follows:

α_1 = r^(1/k) / r

where r = (t_end − t_start) is the time range described by the buffer (the above formula
assumes that r is greater than one; if not, then the time range should be scaled accordingly).
In the general case, the fraction of the buffer considered by any α_i (1 ≤ i ≤ k) is given by:

α_i = (r^(1/k))^i / r

Note that since the growth rate of each subsequent α_i is exponential, we can cover
the entire buffer with just a small k and still guarantee that we will consider some value
of α_i that is within a factor of r^(1/k) of the optimal. After computing α_1, α_2, ..., α_k, we
successively evaluate increasing buffer fractions α_1, α_2, α_3, and so on, and determine their
associated costs. From these k costs we determine the α_i with the minimum cost.
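The geometric schedule of candidate fractions is straightforward to compute; a sketch (the function name is illustrative):

```python
def candidate_fractions(r, k):
    """k buffer fractions alpha_i = r**(i/k) / r, for i = 1..k: consecutive
    candidates differ by the constant factor r**(1/k), and alpha_k = 1
    covers the whole buffer (assumes the time range r > 1)."""
    return [r ** (i / k) / r for i in range(1, k + 1)]
```

For example, r = 16 and k = 4 yield the fractions 1/8, 1/4, 1/2, and 1.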
Note that if we choose α based on the evaluation of a small k, then it is possible that
the optimal choice of α may lie outside the feasible region. However, there is a simple
approach to solving this issue. After an initial evaluation of k granularities, consider
just the region starting before and ending after the best α_i and recursively reapply the
evaluation described above just in this region.
[Figure: the buffered time range (t_start, t_end) divided by cut-points α_1 through α_5; at each iteration the minimum-cost cut-point is located and k new cut-points are evaluated between its neighbors.]
Figure 3-12. Iteratively evaluating k cut points
For instance, assume we chose α_i after evaluation of k cut-points in the time range r.
To further tune this α_i, we consider the time range defined between the adjacent cut-points
α_{i−1} and α_{i+1} and recursively apply cost estimation in this interval (i.e., evaluate k points
in the time range (t_start + α_{i−1} × r, t_start + α_{i+1} × r)). Figure 3-12 illustrates the idea. This
approach is simple and very effective in considering a large number of choices of α.
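The refinement step can be sketched as follows. For simplicity this sketch spaces the re-evaluated candidates evenly inside the bracketing interval rather than geometrically; `cost_at` stands in for the sampling-based estimator, and all names are illustrative:

```python
def best_fraction(cost_at, fractions, rounds=3):
    """Greedy zoom: pick the cheapest candidate fraction, then re-evaluate
    k fresh candidates between its two neighbors, and repeat."""
    k = len(fractions)
    best = fractions[0]
    for _ in range(rounds):
        costs = [cost_at(a) for a in fractions]
        i = min(range(k), key=costs.__getitem__)
        best = fractions[i]
        lo = fractions[i - 1] if i > 0 else best / 2
        hi = fractions[i + 1] if i < k - 1 else best
        # k new candidates inside the interval bracketing the current winner
        fractions = [lo + (hi - lo) * (j + 1) / k for j in range(k)]
    return best
```

Each round narrows the search interval around the current minimum, so a handful of rounds considers a large number of choices of α while estimating only k costs per round.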
3.5.6 Speeding Up the Estimation
Restricting the number of candidate cut-points can help keep the time required to
find a suitable value for α manageable. However, if the estimation is not implemented
carefully, the time required to consider the cost at each of the k possible time periods can
still be significant.
The most obvious method for estimating cost_α for each of the k granularities would
be to simply loop through each of the associated time periods. For each time period,
we would build bounding boxes around each of the trajectories of the objects in P, and
then sample objects from Q as described in Section 3.5 until the cost was estimated with
sufficient accuracy.
However, this simple algorithm results in a good deal of repeated work for each time
period, and can actually decrease the overall speed of the adaptive plane-sweep compared
to the layered plane-sweep. A more intelligent implementation can speed the optimization
process considerably.
In our implementation, we maintain a table of all the objects in P and Q, organized
on the ID of each object. Each entry in the table points to a linked list that contains a
chain of MBRs for the associated object. Each MBR in the list bounds the trajectory
of the object for one of the k time periods considered during the optimization, and the
MBRs in each list are sorted from the coarsest of the k granularities to the finest. The
data structure is depicted in Figure 3-13.
Given this structure, we can estimate cost_α for each of the k values of α in
parallel, with only a small hit in performance associated with an increased value for k.
Any object pair (p ∈ P, q ∈ Q) that needs to be evaluated during the sampling process
described in Section 3.5 is first evaluated at the coarsest granularity, corresponding to α_k.
If the two MBRs are within distance d of one another, then the cost estimate for α_k is
updated, and the evaluation is then repeated at the second coarsest granularity, α_{k−1}. If
there is again a match, then the cost estimate for α_{k−1} is updated as well. The process is
repeated until there is not a match. As soon as we find a granularity at which the MBRs
for p and q are not within a distance d of one another, we can stop the process,
because if the MBRs for p and q are not within distance d for the time period associated
with α_i, then they cannot be within this distance for any time period α_j where j < i.
The benefit of this approach is that in cases where the data are well-behaved and
the optimization process tends to choose a value for α that causes the entire buffer to be
processed at once, a quick check of the distance between the outer-most MBRs of p and q
is the only geometric computation needed to process p and q, no matter what value of k is
chosen.
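The early-out walk over the MBR chains can be sketched in one dimension, with interval arithmetic standing in for full bounding boxes (all names here are illustrative):

```python
def interval_dist(a, b):
    """Distance between closed intervals a = (lo, hi) and b; 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0.0)

def matched_levels(chain_p, chain_q, d):
    """Chains are ordered coarsest granularity first.  Count the levels at
    which p and q match; stop at the first miss, since the MBR for a finer
    (shorter) time period is contained in the coarser one, so a miss at one
    level rules out every finer level after it."""
    count = 0
    for bp, bq in zip(chain_p, chain_q):
        if interval_dist(bp, bq) > d:
            break
        count += 1
    return count
```

In the well-behaved case the loop exits after the single coarsest comparison, which is exactly the cheap path described above.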
The bounding box approximations themselves can be formed while the system buffer
is being filled with data from disk. As trajectory data are being read from disk, we grow
the MBRs for each α_i progressively. Since each α_i represents a fraction of the buffer, the
updates to its MBR can be stopped as soon as that fraction of the buffer has been
filled. Similar logic can be used to shrink the MBRs when some fraction of the buffer is
consumed and expand them when the buffer is refilled.
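The progressive construction can be sketched in one dimension, with scalar points standing in for trajectory samples (all names are illustrative):

```python
def build_chains(points, fractions):
    """Build one 1-D MBR per candidate fraction while 'filling the buffer':
    the MBR for fraction alpha_i only absorbs the first alpha_i share of the
    points, so each MBR stops growing once its share of the buffer is full.
    Returned chains are ordered coarsest fraction first."""
    n = len(points)
    chains = []
    for a in sorted(fractions, reverse=True):
        prefix = points[: max(1, round(a * n))]
        chains.append((min(prefix), max(prefix)))
    return chains
```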
[Figure: the table of objects p_1, p_2, ..., p_n, each pointing to a chain of MBRs MBR(α_5), MBR(α_4), MBR(α_3), MBR(α_2), MBR(α_1) bounding the trajectory over successively finer time periods.]
Figure 3-13. Speeding up the Optimizer
3.5.7 Putting It All Together
In our implementation of the adaptive plane-sweep, data are fetched in blocks
and stored in the system buffer. Then an optimizer routine is called which evaluates
k granularities and returns the granularity with the minimum cost. Data in the
granularity chosen by the optimizer is then evaluated using the LayeredPlaneSweep
routine (procedure described in Section 3.4). When the LayeredPlaneSweep routine returns,
the buffer is refilled and the process is repeated. The techniques described in the previous
Section are utilized to make the optimizer implementation fast and efficient.
3.6 Benchmarking
This section presents experimental results comparing the various methods discussed so
far for computing a spatiotemporal CPA Join: an R-tree, a simple plane-sweep, a layered
plane-sweep, and an adaptive plane-sweep with several parameter settings. The Section is
organized as follows. First, a description of the three three-dimensional temporal data sets
used to test the algorithms is given. This is followed by the actual experimental results
and a detailed discussion analyzing the experimental data.
3.6.1 Test Data Sets
The first two data sets that we use to test the various algorithms result from two
physics-based, N-body simulations. In both data sets, constituent records occupy 80B on
disk (80B is the storage required to record the object ID, time information, as well as the
[Figure: a 3-D view of object positions in the simulation.]
Figure 3-15. Collision data set at time tick 1,500
strong gravitational interaction. A small sample of the galaxies in the simulation is depicted above in Figure 3-14, at time tick 1,500.
In addition, we test a third data set created using a simple, 3-dimensional random walk.
We call this the Synthetic data set (this data set was again about 50GB in size). The
speed of the various objects varies considerably during the walk. The purpose of including
this data is to rigorously test the adaptability of the adaptive plane-sweep, by creating
a synthetic data set where there are significant fluctuations in the amount of interaction
among objects as a function of time.
3.6.2 Methodology and Results
All experiments were conducted on a 2.4GHz Intel Xeon PC with 1GB of RAM. The
experimental data sets were each stored on an 80GB, 15,000 RPM Seagate SCSI disk.
For all three of the data sets, we tested an R-tree-based CPA Join (implemented as
described in Section 3.3; we used the STR R-tree packing algorithm [80] to construct an
R-tree for each input relation), a simple plane-sweep (implemented as described in Section
3.4), and a layered plane-sweep (implemented as described in Section 3.5).
We also tested the adaptive plane-sweep algorithm, implemented as described
in Section 3.5. For the adaptive plane-sweep, we also wanted to test the effect of the
[Figure: time taken vs. percentage of the join completed for the CPA-Join over the Injection data set, with curves for the R-Tree, Simple Sweep, and Layered Sweep algorithms.]