Arumugam s
Transcript of Arumugam s
-
8/13/2019 Arumugam s
1/123
© 2008 Subramanian Arumugam
To my parents.
ACKNOWLEDGMENTS
First of all, I would like to thank my advisor Chris Jermaine. This dissertation would
not have been made possible had it not been for his excellent mentoring and guidance
through the years. Chris is a terrific teacher, a critical thinker and a passionate researcher.
He has served as a great role model and has helped me mature as a researcher. I cannot
thank him enough for that.
My thanks also goes to Prof. Alin Dobra. Through the years, Alin has been a patient
listener and has helped me structure and refine my ideas countless times. His excitement
for research is contagious!
I would like to take this opportunity to mention my colleagues at the database
center: Amit, Florin, Fei, Luis, Mingxi and Ravi. I have had many hours of fun discussing
interesting problems with them. Special thanks goes to my friends Manas, Srijit, Arun,
Shantanu, and Seema, for making my stay in Gainesville all the more enjoyable.
Finally, I would like to thank my parents for being a constant source of support and
encouragement throughout my studies.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Research Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
    1.2.1 Data Modeling and Database Design . . . . . . . . . . . . . . . . . 15
    1.2.2 Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    1.2.3 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    1.2.4 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    1.2.5 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
    1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories . . . 18
    1.3.2 Entity Resolution in Spatiotemporal Databases . . . . . . . . . . . . 19
    1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases . . . 19
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Spatiotemporal Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Probabilistic Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES . 25
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
    3.2.1 Moving Object Trajectories . . . . . . . . . . . . . . . . . . . . . . 27
    3.2.2 Closest Point of Approach (CPA) Problem . . . . . . . . . . . . . . 28
3.3 Join Using Indexing Structures . . . . . . . . . . . . . . . . . . . . . . . . 30
    3.3.1 Trajectory Index Structures . . . . . . . . . . . . . . . . . . . . . . 31
    3.3.2 R-tree Based CPA Join . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Join Using Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . 36
    3.4.1 Basic CPA Join using Plane-Sweeping . . . . . . . . . . . . . . . . 36
    3.4.2 Problem With The Basic Approach . . . . . . . . . . . . . . . . . . 37
    3.4.3 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Adaptive Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.5.2 Cost Associated With a Given Granularity . . . . . . . . . . . . . . 41
    3.5.3 The Basic Adaptive Plane-Sweep . . . . . . . . . . . . . . . . . . . 41
    3.5.4 Estimating Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
    3.5.5 Determining The Best Cost . . . . . . . . . . . . . . . . . . . . . . 44
    3.5.6 Speeding Up the Estimation . . . . . . . . . . . . . . . . . . . . . . 46
    3.5.7 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    3.6.1 Test Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    3.6.2 Methodology and Results . . . . . . . . . . . . . . . . . . . . . . . 50
    3.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES . . . . . . . . 58
4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Outline of Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
    4.3.1 PDF for Restricted Motion . . . . . . . . . . . . . . . . . . . . . . 64
    4.3.2 PDF for Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . 65
4.4 Learning the Restricted Model . . . . . . . . . . . . . . . . . . . . . . . . 66
    4.4.1 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . 67
    4.4.2 Learning K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Learning Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . . . . . 71
    4.5.1 Applying a Particle Filter . . . . . . . . . . . . . . . . . . . . . . . 72
    4.5.2 Handling Multiple Objects . . . . . . . . . . . . . . . . . . . . . . . 73
    4.5.3 Update Strategy for a Sample given Multiple Objects . . . . . . . . 75
    4.5.4 Speeding Things Up . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS . . 84
5.1 Problem and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
    5.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
    5.1.2 The False Positive Problem . . . . . . . . . . . . . . . . . . . . . . 87
    5.1.3 The False Negative Problem . . . . . . . . . . . . . . . . . . . . . . 90
5.2 The Sequential Probability Ratio Test (SPRT) . . . . . . . . . . . . . . . 91
5.3 The End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
    5.3.1 What's Wrong With the SPRT? . . . . . . . . . . . . . . . . . . . . 95
    5.3.2 Removing the Magic Epsilon . . . . . . . . . . . . . . . . . . . . . 96
    5.3.3 The End-Biased Algorithm . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Indexing the End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.4.2 Building the Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
    5.4.3 Processing Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
BIOGRAPHICAL SKETCH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF TABLES
Table page
4-1 Varying the number of objects and its effect on recall, precision and runtime. . . 80
4-2 Varying the number of time ticks . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-3 Varying the number of sensors fired . . . . . . . . . . . . . . . . . . . . . . . 80
4-4 Varying the standard deviation of the Gaussian cloud. . . . . . . . . . . . . . . 80
4-5 Varying the number of time ticks where EM is applied. . . . . . . . . . . . . . . 81
5-1 Running times over varying database sizes. . . . . . . . . . . . . . . . . . . . . . 109
5-2 Running times over varying query sizes. . . . . . . . . . . . . . . . . . . . . . . 109
5-3 Running times over varying object standard deviations. . . . . . . . . . . . . . . 109
5-4 Running times over varying confidence levels. . . . . . . . . . . . . . . . . . . . 109
LIST OF FIGURES
Figure page
3-1 Trajectory of an object (a) and its polyline approximation (b) . . . . . . . . . . 28
3-2 Closest Point of Approach Illustration . . . . . . . . . . . . . . . . . . . . . . 29
3-3 CPA Illustration with trajectories . . . . . . . . . . . . . . . . . . . . . . . . 29
3-4 Example of an R-tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3-5 Heuristic to speed up distance computation . . . . . . . . . . . . . . . . . . . . 34
3-6 Issues with R-trees: fast-moving object p joins with everyone . . . . . . . . . 35
3-7 Progression of plane-sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3-8 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3-9 Problem with using large granularities for bounding box approximation . . . . . 40
3-10 Adaptively varying the granularity . . . . . . . . . . . . . . . . . . . . . . . . . 42
3-11 Convexity of cost function illustration. . . . . . . . . . . . . . . . . . . . . . . . 45
3-12 Iteratively evaluating k cut points. . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-13 Speeding up the Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-14 Injection data set at time tick 2,650. . . . . . . . . . . . . . . . . . . . . . . . . 49
3-15 Collision data set at time tick 1,500 . . . . . . . . . . . . . . . . . . . . . . . . . 50
3-16 Injection data set experimental results . . . . . . . . . . . . . . . . . . . . 51
3-17 Collision data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 52
3-18 Buffer size choices for Injection data set . . . . . . . . . . . . . . . . . . . . . . 53
3-19 Buffer size choices for Collision data set . . . . . . . . . . . . . . . . . . . . . . 53
3-20 Synthetic data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 54
3-21 Buffer size choices for Synthetic data set . . . . . . . . . . . . . . . . . . . . . . 56
4-1 Mapping of a set of observations for linear motion . . . . . . . . . . . . . . . . . 60
4-2 Object path (a) and quadratic fit for varying time ticks (b-d). . . . . . . . . . . 62
4-3 Object path in a sensor field (a) and sensor firings triggered by object motion (b) 64
4-4 The baseline input set (10,000 observations) . . . . . . . . . . . . . . . . . . . . 79
4-5 The learned trajectories for the data of Figure 4-4 . . . . . . . . . . . . . . . . . 79
5-1 The SPRT in action. The middle line is the LRT statistic. . . . . . . . . . . . . 92
5-2 Two spatial queries over a database of objects with Gaussian uncertainty . . . 97
5-3 The sequence of SPRTs run by the end-biased test . . . . . . . . . . . . . . . . 98
5-4 Building the MBRs used to index the samples from the end-biased test. . . . . . 104
5-5 Using the index to speed the end-biased test . . . . . . . . . . . . . . . . . . . . 106
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT
By
Subramanian Arumugam
August 2008
Chair: Christopher Jermaine
Major: Computer Engineering
This work focuses on interesting data management problems that arise in the analysis,
modeling, and querying of large-scale spatiotemporal data. Such data naturally arise in the
context of many scientific and engineering applications that deal with physical processes
that evolve over time.
We first focus on the issue of scalable query processing in spatiotemporal databases.
In many applications that produce a large amount of data describing the paths of moving
objects, there is a need to ask questions about the interaction of objects over a long
recorded history. To aid such analysis, we consider the problem of computing joins over
moving object histories. The particular join studied is the Closest-Point-Of-Approach
join, which asks: Given a massive moving object history, which objects approached within
a distance d of one another?
Next, we study a novel variation of the classic entity resolution problem that
appears in sensor network applications. In entity resolution, the goal is to determine
whether or not various bits of data pertain to the same object. Given a large database of
spatiotemporal sensor observations that consist of (location, timestamp) pairs, our goal is
to perform an accurate segmentation of all of the observations into sets, where each set is
associated with one object. Each set should also be annotated with the path of the object
through the area.
Finally, we consider the problem of answering selection queries in a spatiotemporal
database, in the presence of uncertainty incorporated through a probabilistic model.
We propose very general algorithms that can be used to estimate the probability that a
selection predicate evaluates to true over a probabilistic attribute or attributes, where
the attributes are supplied only in the form of a pseudo-random attribute value generator.
This enables the efficient evaluation of queries such as "Find all vehicles that are in close
proximity to one another with probability p at time t" using Monte Carlo statistical
methods.
Extending modern database systems to support spatiotemporal data is challenging for
several reasons:
- Conventional databases are designed to manage static data, whereas spatiotemporal
  data describe spatial geometries that change continuously with time. This requires a
  unified approach to dealing with aspects of spatiality and temporality.

- Current databases are designed to manage data that is precise. However, uncertainty
  is often an inherent property of spatiotemporal data due to the discretization of
  continuous movement and to measurement errors. The fact that most spatiotemporal
  data sources (particularly polling- and sampling-based schemes) provide only a
  discrete snapshot of continuous movement poses new problems for query processing.
  For example, consider a conventional database record that stores the fact "John
  Smith earns $200,000" and a spatiotemporal record that stores the fact "John
  Smith walks from point A to point B" in the form of a discretized ordered pair
  (A, B). In the former case, a query such as "What is the salary of John Smith?"
  involves dealing with precise data. On the other hand, a spatiotemporal query
  such as "Did John Smith walk through point C between A and B?" requires dealing
  with information that is often not known with certainty. Further compounding the
  problem is that even the recorded observations are accurate only to within a few
  decimal places. Thus, even queries such as "Identify all objects located at point
  A" may not return meaningful results unless allowed a certain margin for error.

- Due to the presence of the time dimension, spatiotemporal applications have the
  potential to produce a large amount of data. The sheer volume of data generated
  by spatiotemporal applications presents a computational and data management
  challenge. For instance, it is not uncommon for scientific processes to produce
  spatiotemporal data on the order of terabytes or even petabytes [7]. Developing
  scalable algorithms to support query processing over tera- and petabyte-sized
  spatiotemporal data sets is a significant challenge.

- The semantics of many basic operations in a database change in the presence of
  space and time. For instance, basic operations like joins typically employ equality
  predicates in a classic relational database, whereas equality is rare between two
  arbitrary spatiotemporal objects.
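The need for a margin of error can be made concrete. The following sketch (illustrative only, not taken from this dissertation; all names are hypothetical) tests whether a discretely sampled 2-D trajectory passed within a given tolerance of a query point, assuming linear interpolation between consecutive samples:

```python
import math

def passed_near(trajectory, point, tolerance):
    """Did a discretely sampled 2-D trajectory pass within `tolerance`
    of `point`?  `trajectory` is a list of (x, y) samples in time order;
    each consecutive pair is treated as a straight-line segment."""
    px, py = point
    for (x1, y1), (x2, y2) in zip(trajectory, trajectory[1:]):
        dx, dy = x2 - x1, y2 - y1
        seg_len_sq = dx * dx + dy * dy
        if seg_len_sq == 0.0:
            # Degenerate segment: the object did not move.
            dist = math.hypot(px - x1, py - y1)
        else:
            # Project the query point onto the segment, clamped to [0, 1].
            t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / seg_len_sq))
            dist = math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))
        if dist <= tolerance:
            return True
    return False
```

With tolerance zero, almost no recorded trajectory "passes through" any exact point, which is precisely why exact-location queries over discretized movement are rarely meaningful.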
1.2 Research Landscape
Over the last decade, database researchers have begun to respond to the challenges
posed by spatiotemporal data. Most of the research effort is concentrated on supporting
either predictive or historical queries. Within this taxonomy, we can further distinguish
work based on whether it supports time-instance or time-interval queries.
In predictive queries, the focus is on the future position of the objects, and only a
limited time window of the object positions needs to be maintained. On the other hand,
for historical queries, the interest is in efficient retrieval of past history, and thus the
database needs to maintain the complete timeline of an object's past locations. Due to
these divergent requirements, techniques developed for predictive queries are often not
suitable for historical queries.
What follows is a brief tour of the major research areas in spatiotemporal data
management. For a more complete treatment of this topic, the interested reader is
referred to [1, 3].
1.2.1 Data Modeling and Database Design
Early research focused on aspects of data modeling and database design for
spatiotemporal data [8]. Conventional data types employed in existing databases are
often not suitable to represent spatiotemporal data which describe continuous time-varying
spatial geometries. Thus, there is a need for a spatiotemporal type system that can model
continuously moving data. Depending on whether the underlying spatial object has an
extent or not, abstractions have been developed to model a moving point, line, and region
in two- and three-dimensional space with time considered as the additional dimension
[8–11]. Similarly, early work has also focused on refining existing CASE tools to aid in the
design of spatiotemporal databases. Existing conceptual tools such as ER diagrams and
UML present a non-temporal view of the world, and extensions to incorporate temporal
and spatial awareness have been investigated [12, 13].
Recently, there has been interest in designing flexible type systems that can model
aspects of the uncertainty associated with an object's spatial location [14]. There has also
been an active effort toward designing SQL language extensions for spatiotemporal data
types and operations [15].
1.2.2 Access Methods
Efficient processing of spatiotemporal queries requires developing new techniques
for query evaluation, providing suitable access structures and storage mechanisms, and
designing efficient algorithms for the implementation of spatiotemporal operators.
Developing efficient access structures for spatiotemporal databases is an important
area of research. A variety of spatiotemporal index structures have been developed to
support selection queries over both predictive and historical queries, most based on
generalization of the R-tree [16] to incorporate the time dimension. Indexing structures
designed to support predictive queries typically manage object movement within a small
time window and need to handle frequent updates to object locations. A popular choice
for such applications is the TPR-tree [17] and its many variants.
On the other hand, index structures designed to support historical queries need to
manage an object's entire past movement trajectory (for this reason they can be viewed as
trajectory indexes). Depending on the time interval indexed, the sheer volume of data that
needs to be managed presents significant technical challenges for overlap-allowing indexing
schemes such as R-trees [16]. Thus, there has been interest in revisiting grid/cell-based
solutions that do not allow overlap, such as SETI [18]. Several tree-based indexing
structures have been developed such as STRIPES [19], 3D R-trees [20], TB trees [21] and
linear quad trees [22]. Further, spatiotemporal extensions of several popular queries such
as nearest-neighbor [23], top-k [24], and skyline [25] have been developed.
1.2.3 Query Processing
The development of efficient index structures has also led to a growing body of
research on different types of queries on spatiotemporal data, such as time-instant and
range queries [26–28], continuous queries, joins [29, 30], and their efficient evaluation
[31, 32]. In the same vein, there has also been some preliminary work on optimizing
spatiotemporal selection queries [33, 34].
Much of the work focuses specifically on indexing two-dimensional space and/or
supporting time-instance or short time-interval selection queries. Thus many indexing
structures often do not scale well for higher-dimensional spaces and have difficulty with
queries over long time intervals. Finally, historical data collections may be huge, and joins
over such data require new solutions, since the predicates involved are non-traditional
(such as closest point of approach, within, sometimes-possibly-inside, etc.).
1.2.4 Data Analysis
Spatiotemporal data analysis allows us to obtain interesting insights from the stored
data collection. For instance:
- In a road network database, the history of movement of various objects can be used
  to understand traffic patterns.

- In aviation, the flight paths of planes can be used in future path planning and in
  computing minimum separation constraints to avoid collisions.

- In wildlife management, one can understand animal migration patterns from the
  trajectories the animals trace.

- Pollutants can be traced to their source by studying the air flow patterns of aerosols
  stored as trajectories.
Research in this area focuses on extending traditional data mining techniques to the
analysis of large spatiotemporal data sets. Topics of interest include discovering similarities
among object trajectories [35], data classification and generalization [36], trajectory
clustering and rule mining [37–39], and supporting interactive visualization for browsing
large spatiotemporal collections [40].
1.2.5 Data Warehousing
Supporting data analysis also requires designing and maintaining large collections of
historical spatiotemporal data, which falls under the domain of data warehousing.
Conventional data warehouses are often designed around the goal of supporting
aggregate queries efficiently. However, the interesting queries in a spatiotemporal data
warehouse seek to discover the interaction patterns of moving objects and understand the
spatial and/or temporal relationships that exist between them. Facilitating such queries
in a scalable fashion over terabyte-sized spatiotemporal data warehouses is a significant
challenge. This requires extending traditional data mining techniques to the analysis
of large spatiotemporal data sets to discover spatial and temporal relationships, which
might exist at various levels of granularity involving complex data types. Research in
spatiotemporal data warehousing [41,42] is relatively new and is focused on refining
existing multidimensional models to support continuous data and defining semantics for
spatiotemporal aggregation [43,44].
1.3 Main Contributions
It is clear that extending modern database systems to support the data management
and analysis of spatiotemporal data requires addressing issues that span almost the entire
breadth of database research. A full treatment of the various issues could be the subject of
numerous dissertations! To keep the scope of this dissertation manageable, I tackle three
important problems in spatiotemporal data management. The dissertation focuses on
data produced by moving objects, since moving object databases represent the most
common application domain for spatiotemporal databases [1]. The three specific problems
considered are described briefly in the following subsections.
1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories
I first consider the scalability problem in computing joins over massive moving
object histories. In applications that produce a large amount of data describing the
paths of moving objects, there is a need to ask questions about the interaction of
objects over a long recorded history. This problem is becoming especially important
given the emergence of computational, simulation-based science (where simulations
of natural phenomena naturally produce massive databases containing data with
spatial and temporal characteristics), and the increased prevalence of tracking and
positioning devices such as RFID and GPS. The particular join that I study is the CPA
(Closest-Point-Of-Approach) join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another? I carefully consider several
obvious strategies for computing the answer to such a join, and then propose a novel,
adaptive join algorithm which naturally alters the way in which it computes the join in
response to the characteristics of the underlying data. A performance study over two
physics-based simulation data sets and a third, synthetic data set validates the utility of
my approach.
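For intuition, the CPA predicate for a single pair of linearly moving objects has a closed form: with relative position w(t) = w0 + t(u − v), the separation is minimized at t* = −(w0 · (u − v)) / |u − v|², clamped to the time interval. The sketch below (a quadratic-time baseline for illustration, not the adaptive algorithm developed in Chapter 3; all names are illustrative) applies this per segment:

```python
import math

def cpa_distance(p0, u, q0, v, t_max):
    """Minimum distance over [0, t_max] between two objects moving linearly:
    P(t) = p0 + t*u and Q(t) = q0 + t*v, with 2-D positions and velocities."""
    wx, wy = p0[0] - q0[0], p0[1] - q0[1]   # initial separation
    dx, dy = u[0] - v[0], u[1] - v[1]       # relative velocity
    dv_sq = dx * dx + dy * dy
    # Constant separation when the relative velocity is zero.
    t_cpa = 0.0 if dv_sq == 0.0 else -(wx * dx + wy * dy) / dv_sq
    t_cpa = max(0.0, min(t_max, t_cpa))     # clamp to the time interval
    return math.hypot(wx + t_cpa * dx, wy + t_cpa * dy)

def cpa_join(objects, d, t_max):
    """Brute-force CPA join: report id pairs that approach within distance d.
    `objects` maps an id to a (position, velocity) pair."""
    ids = sorted(objects)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if cpa_distance(*objects[a], *objects[b], t_max) <= d]
```

The brute-force join is O(n²) per segment, which is exactly the scalability problem the chapter addresses; the closed-form per-pair test, however, is the common core of any CPA join strategy.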
1.3.2 Entity Resolution in Spatiotemporal Databases
Next, I consider the problem of entity resolution for a large database of spatio-temporal
sensor observations. The following scenario is assumed. At each time-tick, one or more of
a large number of sensors report back that they have sensed activity at or near a specific
spatial location. For example, a magnetic sensor may report that a large metal object has
passed by. The goal is to partition the sensor observations into a number of subsets so
that it is likely that all of the observations in a single subset are associated with the same
entity, or physical object. For example, all of the sensor observations in one partition may
correspond to a single vehicle driving across the area that is monitored. The dissertation
describes a two-phase, learning-based approach to solving this problem. In the first phase,
a quadratic motion model is used to produce an initial classification that is valid for a
short portion of the timeline. In the second phase, Bayesian methods are used to learn the
long-term, unrestricted motion of the underlying objects.
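The first-phase idea can be illustrated with an ordinary least-squares quadratic fit. The sketch below (illustrative only; the dissertation's actual procedure is EM-based, and the function names are hypothetical) fits x(t) and y(t) separately and exposes a residual that could drive a classification decision:

```python
import numpy as np

def fit_quadratic_motion(times, xs, ys):
    """Fit x(t) and y(t) each with a degree-2 polynomial (least squares).
    Returns the coefficient vectors (highest degree first, as np.polyfit)."""
    cx = np.polyfit(times, xs, 2)
    cy = np.polyfit(times, ys, 2)
    return cx, cy

def fit_residual(times, xs, ys, cx, cy):
    """Mean Euclidean distance between the observations and the fitted path.
    A small residual suggests the observations belong to one object moving
    with (roughly) constant acceleration over this window."""
    ex = np.asarray(xs) - np.polyval(cx, times)
    ey = np.asarray(ys) - np.polyval(cy, times)
    return float(np.mean(np.hypot(ex, ey)))
```

A quadratic model corresponds to constant acceleration, which is why such a fit is only trustworthy over a short portion of the timeline; the second, Bayesian phase removes that restriction.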
1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases
Finally, I consider the problem of answering selection queries in the presence of
uncertainty incorporated through a probabilistic model. One way to facilitate the
representation of uncertainty in a spatiotemporal database is by allowing tuples to
have probabilistic attributes whose actual values are unknown, but are assumed
to be selected by sampling from a specified distribution. This can be supported by
including a few, pre-specified, common distributions in the database system when it is
shipped. However, to be truly general and extensible and support distributions that
cannot be represented explicitly or even integrated, it is necessary to provide an interface
that allows the user to specify arbitrary distributions by implementing a function that
produces pseudo-random samples from the desired distribution. Allowing a user to specify
uncertainty via arbitrary sampling functions creates several interesting technical challenges
during query evaluation. Specifically, evaluating time-instance selection queries such as
"Find all vehicles that are in close proximity to one another with probability p at time
t" requires the principled use of Monte Carlo statistical methods to determine whether
the query predicate holds. To support such queries, the thesis describes new methods
that draw heavily on the statistical theory of sequential estimation. I also
consider the problem of indexing for the Monte Carlo algorithms, so that samples from the
pseudo-random attribute value generator can be pre-computed and stored in a structure in
order to answer subsequent queries quickly.
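As a point of reference, the naive fixed-sample-size Monte Carlo estimator (in contrast to the sequential tests developed in Chapter 5) can be sketched as follows; `sample_attr` stands in for the user-supplied pseudo-random attribute value generator, and the names are illustrative rather than the dissertation's API:

```python
import random

def estimate_probability(sample_attr, predicate, n=10_000, seed=0):
    """Estimate P(predicate(x)) for x ~ sample_attr(rng), where sample_attr
    is an arbitrary user-supplied pseudo-random attribute value generator.
    The estimate's standard error shrinks as O(1/sqrt(n))."""
    rng = random.Random(seed)
    hits = sum(predicate(sample_attr(rng)) for _ in range(n))
    return hits / n
```

The weakness of this fixed-n scheme is that n must be chosen in advance: far too many samples are drawn for "easy" tuples whose probability is nowhere near the query threshold p, which is precisely the inefficiency that sequential methods such as the SPRT and the end-biased test are designed to remove.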
Organization. The rest of this study is organized as follows. Chapter 2 provides
a survey of work related to the problems addressed in this thesis. Chapter 3 tackles the
scalability issue when processing join queries over massive spatiotemporal databases.
Chapter 4 describes an approach to handling the entity-resolution problem in cleaning
spatiotemporal data sources. Chapter 5 describes a simple and general approach to
answering selection queries over spatiotemporal databases that incorporate uncertainty
within a probabilistic model framework (selection queries over probabilistic spatiotemporal
databases). Chapter 6 concludes the dissertation by summarizing the contributions and
identifying potential directions for future work.
To our knowledge, the only prior work on spatiotemporal joins is due to Jeong et al.
[51]. However, they only consider spatiotemporal join techniques that are straightforward
extensions to traditional spatial join algorithms. Further, they limit their scope to
index-based algorithms for objects over limited time windows.
2.2 Entity Resolution
Research in entity resolution has a long history in databases [52–55] and has focused
mainly on integrating non-geometric, string-based data from noisy external sources. Closely
related to the work in this thesis is the large body of work on target tracking that exists
in fields as diverse as signal processing, robotics, and computer vision. The goal in target
tracking [56,57] is to support the real-time monitoring and tracking of a set of moving
objects from noisy observations.
Various algorithms to classify observations among objects can be found in the
target tracking literature. They characterize the problem as one of data association (i.e.
associating observations with corresponding targets). A brief summary of the main ideas is
given below.
The seminal work is due to Reid [58], who proposed a multiple hypothesis technique
(MHT) to solve the tracking problem. In the MHT approach, a set of hypotheses is
maintained with each hypothesis reflecting the belief on the location of an individual
target. When a new set of observations arrives, the hypotheses are updated. Hypotheses
with minimal support are deleted and additional hypotheses are created to reflect new
evidence. The main drawback of the approach is that the number of hypotheses can grow
exponentially over time. Though heuristic filters [59–61] can be used to bound the search
space, this limits the scalability of the algorithm.

Target tracking has also been studied using Bayesian approaches [62]. The Bayesian
approach views tracking as a state estimation problem. Given some initial state and a
set of observations, the goal is to predict the object's next state. An optimal solution to
the problem is given by the Bayes filter [63, 64]. Bayes filters produce optimal estimates by
integrating over the complete set of observations. The formulation is often recursive and
involves complex integrals that are difficult to solve analytically. Hence, approximation
schemes such as particle filters [57] and sequential Monte Carlo techniques [63] are often
used in practice.
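One step of the approximation scheme can be sketched as a minimal bootstrap particle filter (a generic sketch, not the specific filter developed in Chapter 4; the `transition` and `likelihood` callables are assumptions of this illustration):

```python
import random

def particle_filter_step(particles, weights, transition, likelihood, obs, rng):
    """One predict-update-resample step of a bootstrap particle filter.
    particles: list of states; transition(state, rng) -> next state (motion
    model with noise); likelihood(obs, state) -> non-negative weight."""
    # Predict: propagate each particle through the motion model.
    particles = [transition(s, rng) for s in particles]
    # Update: reweight each particle by the observation likelihood.
    weights = [w * likelihood(obs, s) for w, s in zip(weights, particles)]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # Resample: draw particles with probability proportional to weight.
    particles = rng.choices(particles, weights=weights, k=len(particles))
    return particles, [1.0 / len(particles)] * len(particles)
```

Repeating this step over the observation sequence concentrates the particle population around the high-posterior states, approximating the intractable integrals of the exact Bayes filter by sampling.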
Recently, Markov chain Monte Carlo (MCMC) techniques [65,66] have been
proposed that attempt to approximate the optimal Bayes filter for multiple target
tracking. MCMC-based methods employ sequential Monte Carlo sampling and have been
shown to perform better than existing sub-optimal approaches such as MHT for tracking
objects in highly cluttered environments.
A common theme among most of the research in target tracking is its focus on
accurate tracking and detection of objects in real time in highly cluttered environments
over relatively short time periods. In a data warehouse context, the ability of techniques
such as MCMC to make fine-grained distinctions makes them ideal candidates when
performing operations such as drill-down that involve analytics over small time windows.
Their applicability is limited, however, to entity resolution in a data warehouse. In such a
context, summarization and visualization of historical trajectories smoothed over long time
intervals is often more useful. The model-based approach considered in this work seems a
more suitable candidate for such tasks.
2.3 Probabilistic Databases
Uncertainty management in spatiotemporal databases is a relatively new area of
research. Earlier work has focused on aspects of modeling uncertainty and query language
support [9,67].
In the context of query processing, one of the earliest papers in this area is the
paper by Pfoser et al. [68], where different sources of uncertainty are characterized and
a probability density function is used to model errors. Hosbond et al. [69] extended this
work by employing a hyper square uncertainty region, which expands over time to answer
queries using a TPR-tree.
Trajcevski et al. [70] study the problem from a modeling perspective. They model
trajectories by a cylindrical volume in 3D and outline semantics of fuzzy selection queries
over trajectories in both space and time. However, the approach does not specify how to
choose the dimensions of the cylindrical region which may have to change over time to
account for shrinking or expanding of the underlying uncertainty region.
Cheng et al. [71] describe algorithms for time-instant queries (probabilistic range
and nearest neighbor) using an uncertainty model where a probability density function
(PDF) and an uncertain region are associated with each point object. Given a location in
the uncertain region, the PDF returns the probability of finding the object at that location.
A similar idea is used by Tao et al. [72] to answer queries in spatial databases. To handle
time interval queries, Mokhtar et al. [73] represent uncertain trajectories as a stochastic
process with a time-parametric uniform distribution.
CHAPTER 3
SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES
In applications that produce a large amount of data describing the paths of
moving objects, there is a need to ask questions about the interaction of objects
over a long recorded history. In this chapter, the problem of computing joins over
massive moving object histories is considered. The particular join studied is the
Closest-Point-Of-Approach join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another?
3.1 Motivation
Frequently, it is of interest in applications which make use of spatial data to ask
questions about the interaction between spatial objects. A useful operation that enables
one to answer such questions is the spatial join operation. Spatial join is similar to the
classical relational join except that it is defined over two spatial relations based on a
spatial predicate. The objective of the join operation is to retrieve all object pairs that
satisfy a spatial relationship. One common predicate involves distance measures, where
we are interested in objects that were within a certain distance of each other. The query
"Find all restaurants within a distance of 10 miles from a hotel" is an example of a spatial
join.
For moving objects, the spatial join operation involves the evaluation of both a spatial
and a temporal predicate and for this reason the join is referred to as a spatiotemporal
join. For example, consider the spatial relations PLANESandTANKS, where each relation
represents accumulated trajectory data of planes and tanks from a battlefield simulation.
The query "Find all planes that are within a distance of 10 miles of a tank" is an example of a
spatiotemporal join. The spatial predicate in this case restricts the distance (10 miles) and
the temporal predicate restricts the time period to the current time instance.
In the more general case, the spatiotemporal join is issued over a moving object
history, which contains all of the past locations of the objects stored in a database. For
example, consider the query "Find all pairs of planes that came within a distance of 1,000
feet during their flight paths". Since there is no restriction on the temporal predicate,
answering this query involves an evaluation of the spatial predicate at every time instance.
The amount of data to be processed can be overwhelming. For example, in a typical
flight, the flight data recorder stores about 7 MB of data which records among other
things, the position and time of the flight for every second during its operation. Given
that on average the US Air Traffic Control handles around 30000 flights in a single day,
if all of this data were archived, it would result in 200 GB of data accumulation just
for a single day. For another example, it is not uncommon for scientific simulations to
output terabytes or even petabytes of spatiotemporal data (see Abdulla et al. [7] and the
references contained therein).
In this chapter, the spatiotemporal join problem for moving object histories in
three-dimensional space, with time considered as the fourth dimension, is investigated. The
spatiotemporal join operation considered is the CPA Join (Closest-Point-Of-Approach
Join). By closest point of approach, we refer to the position at which two moving objects
attain their closest possible distance [74]. Formally, in a CPA Join, we answer queries of
the following type: Find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that
CPA-distance(p, q) ≤ d. The goal is to retrieve all object pairs that are within a distance
d at their closest point of approach.
Surprisingly, this problem has not been studied previously. The spatial join problem
has been well-studied for stationary objects in two- and three-dimensional space [45, 47-49];
however, very little work related to spatiotemporal joins can be found in the literature.
There has been some work related to joins involving moving objects [75,76], but the work
has been restricted to objects in a limited time window and does not consider the problem
of joining object histories that may be gigabytes or terabytes in size.
The contributions can be summarized as follows:

- Three spatiotemporal join strategies for data involving moving object histories are presented.

- Simple adaptations of existing spatial join processing algorithms, based on the R-tree structure and on a plane-sweeping algorithm, are explored for spatiotemporal histories.

- To address the problems associated with straightforward extensions of these techniques, a novel join strategy for moving objects based on an extension of the basic plane-sweeping algorithm is described.

- A rigorous evaluation and benchmarking of the alternatives is provided. The performance results suggest that significant speedup in execution time can be obtained with the adaptive plane-sweeping technique.
The rest of this chapter is organized as follows: In Section 3.2, the closest point
of approach problem is reviewed. In Sections 3.3 and 3.4, two obvious alternatives for
implementing the CPA Join, using R-trees and plane-sweeping, are described. In Section
3.5, a novel adaptive plane-sweeping technique that considerably outperforms competing
techniques is presented. Results from our benchmarking experiments are given in Section
3.6. Section 3.7 outlines related work.
3.2 Background
In this Section, we discuss the motion of moving objects, and give an intuitive
description of the CPA problem. This is followed by an analytic solution to the CPA
problem over a pair of points moving in a straight line.
3.2.1 Moving Object Trajectories
Trajectories describe the motion of objects in a 2 or 3-dimensional space. Real-world
objects tend to have smooth trajectories and storing them for analysis often involves
approximation to a polyline. A polyline approximation of a trajectory connects object
positions, sampled at discrete time instances, by line segments (Figure 3-1).

In a database, the trajectory of an object can be represented as a sequence of the form
(t1, v1), (t2, v2), . . . , (tn, vn), where each vi represents the position vector of the object at
Figure 3-1. Trajectory of an object (a) and its polyline approximation (b)
time instance ti. The arity of the vector describes the dimensions of the space. For flight
simulation data, the arity would be 3, whereas for a moving car, the arity would be 2. The
position of the moving objects is normally obtained in one of several ways: by sampling or
polling the object at discrete time instances, through the use of devices like GPS, etc.
3.2.2 Closest Point of Approach (CPA) Problem
We are now ready to describe the CPA problem. Let CPA(p, q, d) over two straight-line
trajectories be evaluated as follows. Assuming the distance between the two objects is
given by mindist, we output true if mindist < d (the objects were within distance d
during their motion in space), and false otherwise. We refer to the calculation of CPA(p, q, d)
as the CPA problem.
The minimum distance mindist between two objects is the distance between the
object positions at their closest point of approach. It is straightforward to calculate
mindist once the CPA time tcpa, the time instance at which the objects reached their
closest distance, is known.
We now give an analytic solution to the CPA problem for a pair of objects on a
simple straight-line trajectory.
Calculating the CPA time tcpa. Figure 3-2 shows the trajectories of two objects p
and q in 2-dimensional space for the time period [tstart, tend]. The positions of these objects
at any time instance t are given by p(t) and q(t). Let their positions at time t = 0 be p0
and q0, and let their velocity vectors per unit of time be u and v. The motion equations for
Figure 3-2. Closest Point of Approach Illustration
Figure 3-3. CPA Illustration with trajectories
these two objects are p(t) = p0 + tu and q(t) = q0 + tv. At any time instance t, the distance
between the two objects is given by d(t) = |p(t) − q(t)|.

Using basic calculus, one can find the time instance at which the distance d(t) is
minimum (when D(t) = d(t)² is a minimum). Solving for this time, we obtain:

tcpa = −((p0 − q0) · (u − v)) / |u − v|²

Given this, mindist is given by |p(tcpa) − q(tcpa)|.
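The closed-form solution above can be sketched in a few lines of Python. The function names cpa_time and min_dist are ours, and positions and velocities are plain coordinate tuples:

```python
import math

def cpa_time(p0, u, q0, v):
    """Time at which two linearly moving points are closest.

    p0, q0 are initial positions; u, v are per-unit-time velocity vectors.
    Returns 0.0 when the relative velocity is zero (distance is constant).
    """
    w = [a - b for a, b in zip(p0, q0)]    # p0 - q0
    dv = [a - b for a, b in zip(u, v)]     # u - v
    dv2 = sum(c * c for c in dv)           # |u - v|^2
    if dv2 == 0.0:
        return 0.0
    return -sum(a * b for a, b in zip(w, dv)) / dv2

def min_dist(p0, u, q0, v):
    """Distance between the two points at their closest point of approach."""
    t = cpa_time(p0, u, q0, v)
    p = [a + t * b for a, b in zip(p0, u)]
    q = [a + t * b for a, b in zip(q0, v)]
    return math.dist(p, q)
```

For example, a point starting at the origin moving along x and a point starting at (4, 4) moving down meet at (4, 0) at t = 4, so their CPA distance is zero.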
The distance calculation that we described above is applicable only between two
objects on a straight line trajectory. To calculate the distance between two objects on a
polyline trajectory, we apply the same basic technique. For trajectories consisting of a
chain of line-segments, we find the minimum distance by first determining the distance
between each pair of line-segments and then choosing the minimum distance.
As an example, consider Figure 3-3 which shows the trajectory of two objects in
2-dimensional space with time as the third dimension. Each object is represented by
an array that stores the chain of segments comprising the trajectory. The line-segments
are labeled by the array indices. To determine the qualifying pairs, we find the CPA
distance between the line-segment pairs (p[1],q[1]), (p[1],q[2]), (p[1],q[3]), (p[2],q[1]),
(p[2],q[2]), (p[2],q[3]), (p[3],q[1]), (p[3],q[2]), (p[3],q[3]) and return the pair with the minimum
distance among all evaluated pairs. The complete code for computing CPA(p, q, d) over
multi-segment trajectories is given as Algorithm 1.
Algorithm 1 CPA(Object p, Object q, distance d)
 1: mindist = ∞
 2: for (i = 1 to p.size) do
 3:   for (j = 1 to q.size) do
 4:     tmp = CPA-Distance(p[i], q[j])
 5:     if (tmp ≤ mindist) then
 6:       mindist = tmp
 7:     end if
 8:   end for
 9: end for
10: if (mindist ≤ d) then
11:   return true
12: end if
13: return false
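As a sketch of how this might look in runnable form, the Python below represents each segment as ((t0, pos0), (t1, pos1)) with linear interpolation between sampled positions, and assumes t1 > t0 for every segment. The helper seg_cpa_distance is our own stand-in for the CPA-Distance routine: it clamps the CPA time to the segments' overlapping time window and reports infinity when the segments do not overlap in time:

```python
import math

def seg_cpa_distance(s1, s2):
    """CPA distance between two linearly interpolated trajectory segments.

    Each segment is ((t0, pos0), (t1, pos1)) with t1 > t0; positions are
    coordinate tuples. Returns math.inf when there is no temporal overlap.
    """
    (a0, p0), (a1, p1) = s1
    (b0, q0), (b1, q1) = s2
    lo, hi = max(a0, b0), min(a1, b1)
    if lo > hi:
        return math.inf                      # no temporal overlap
    u = [(x1 - x0) / (a1 - a0) for x0, x1 in zip(p0, p1)]   # velocity of s1
    v = [(x1 - x0) / (b1 - b0) for x0, x1 in zip(q0, q1)]   # velocity of s2
    def pos(x0, vel, t0, t):
        return [c + (t - t0) * w for c, w in zip(x0, vel)]
    w0 = [c - e for c, e in zip(pos(p0, u, a0, lo), pos(q0, v, b0, lo))]
    dv = [c - e for c, e in zip(u, v)]
    dv2 = sum(c * c for c in dv)
    t = lo if dv2 == 0.0 else lo - sum(c * e for c, e in zip(w0, dv)) / dv2
    t = min(max(t, lo), hi)                  # clamp CPA time to the overlap
    return math.dist(pos(p0, u, a0, t), pos(q0, v, b0, t))

def cpa(p, q, d):
    """Algorithm 1: true iff objects p and q (segment lists) came within d."""
    return min((seg_cpa_distance(s, r) for s in p for r in q),
               default=math.inf) <= d
```

The nested generator mirrors the double loop of Algorithm 1; a real implementation would skip segment pairs with disjoint time intervals before computing any distances.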
In the next two Sections, we consider two obvious alternatives for computing the
CPA Join, where we wish to discover all pairs of objects (p, q) from two relations P and
Q, where CPA(p, q, d) evaluates to true. The first technique we describe makes use of an
underlying R-tree index structure to speed up join processing. The second methodology is
based on a simple plane-sweep.
3.3 Join Using Indexing Structures
Given numerous existing spatiotemporal indexing structures, it is natural to first
consider employing a suitable index to perform the join.
Though many indexing structures exist, unfortunately most are not suitable for the
CPA Join. For example, a large number of indexing structures like the TPR-tree [17],
REXP-tree [77], and TPR*-tree [78] have been developed to support predictive queries, where
the focus is on indexing the future position of an object. However, these index structures
are generally not suitable for the CPA Join, where access to the entire history is needed.
Indexing structures like MR-tree [26], MV3R-tree [27], HR-tree [28], HR+-tree [27]
are more relevant since they are geared towards answering time instance queries (in case
of MV3R-tree also short time-interval queries), where all objects alive at a certain time
instance are retrieved. The general idea behind these index structures is to maintain a
separate spatial index for each time instance. However, such indices are meant to store
discrete snapshots of an evolving spatial database, and are not ideal for use with the CPA
Join over continuous trajectories.
3.3.1 Trajectory Index Structures
More relevant are indexing structures specific to moving object trajectory histories
like the TB-tree, STR-tree [21] and SETI [18]. TB-trees emphasize trajectory preservation
since they are primarily designed to handle topological queries where access to the entire
trajectory is desired (segments belonging to the same trajectory are stored together). The
problem with TB-trees in the context of the CPA Join is that segments from different
trajectories that are close in space or time will be scattered across nodes. Thus, retrieving
segments in a given time window will require several random I/Os. In the same paper
[21], an STR-tree is introduced that attempts to balance spatial locality with
trajectory preservation. However, as the authors point out, STR-trees turn out to be a
weak compromise that does not perform better than traditional 3D R-trees [20] or TB-trees.
More appropriate to the CPA Join is SETI [18]. SETI partitions two-dimensional space
statically into non-overlapping cells and uses a separate spatial index for each cell. SETI
might be a good candidate for the CPA Join since it preserves spatial and temporal
locality. However, there are several reasons why SETI is not the most natural choice for a CPA
Join:
- It is not clear that SETI's forest scales to a three-dimensional space. A 25 × 25 SETI grid in two dimensions becomes a sparse 25 × 25 × 25 grid with almost 20,000 cells in three dimensions.
- SETI's grid structure is an interesting idea for addressing problems with high variance in object speeds (we will use a related idea for the adaptive plane-sweep algorithm described later). However, it is not clear how to size the grid for a given data set, and sizing it for a join seems even harder. It might very well be that relation R should have a different grid for R ⋈ S compared to R ⋈ T.
- For a CPA Join over a limited history, SETI has no way of pruning the search space, since every cell will have to be searched.
3.3.2 R-tree Based CPA Join
Given these caveats, perhaps the most natural choice for the CPA Join is the R-tree
[16]. The R-tree [16] is a hierarchical, multi-dimensional index structure that is commonly
used to index spatial objects. The join problem has been studied extensively for R-trees,
and several spatial join techniques exist [45,46,79] that leverage underlying R-tree
index structures to speed up join processing. Hence, our first inclination is to consider a
spatiotemporal join strategy that is based on R-trees. The basic idea is to index object
histories using R-trees and then perform a join over these indices.
The R-Tree Index
It is a very straightforward task to adapt the R-tree to index a history of moving
object trajectories. Assuming three spatial dimensions and a fourth temporal dimension,
the four-dimensional line segments making up each individual object trajectory are simply
treated as individual spatial objects and indexed directly by the R-tree. The R-tree and
its associated insertion or packing algorithms are used to group those line segments into
disk-page sized groups, based on proximity in their four-dimensional space. These pages
make up the leaf level of the tree. As in a standard R-tree, these leaf pages are indexed
by computing the minimum bounding rectangle that encloses the set of objects stored in
each leaf page. Those rectangles are in turn grouped into disk-page sized groups which are
themselves indexed. An R-tree index for 3 line segments moving through 2-dimensional
space is depicted in Figure 3-4.
Figure 3-4. Example of an R-tree
Basic CPA Join Algorithm Using R-Trees
Assuming that the two spatiotemporal relations to be joined are organized using
R-trees, we can use one of the standard R-tree distance joins as a basis for the CPA Join.
The common approach to joins using R-trees employs a carefully controlled, synchronized
traversal of the two R-trees to be joined. The pruning power of the R-tree index arises
from the fact that if two bounding rectangles R1 and R2 do not satisfy the join predicate,
then the join predicate cannot be satisfied between any two bounding rectangles enclosed
within R1 and R2.
In a synchronized technique, both R-trees are traversed simultaneously, retrieving
object-pairs that satisfy the join predicate. To begin with, the root nodes of both the
R-trees are pushed into a queue. A pair of nodes from the queue is processed by pairing
up every entry of the first node with every entry in the second node to form the candidate
set for further expansion. Each pair in the candidate set that qualifies the join predicate is
pushed into the queue for subsequent processing. The strategy described leads to a BFS
(Breadth-First-Search) expansion of the trees. BFS-style traversal lends itself to global
optimization of the join processing steps [46] and works well in practice.
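The BFS traversal just described can be sketched as follows. The dict-based node layout ("box", "children", "entries" keys) and the function names are our own toy representation for illustration, not the structure used in an actual R-tree implementation:

```python
from collections import deque
import math

def box_mindist(b1, b2):
    """Minimum distance between axis-aligned boxes given as (low, high) corner tuples."""
    return math.sqrt(sum(max(l1 - h2, l2 - h1, 0.0) ** 2
                         for (l1, h1), (l2, h2) in zip(zip(*b1), zip(*b2))))

def sync_join(root1, root2, d):
    """BFS synchronized traversal: report leaf-entry id pairs within distance d."""
    result, queue = [], deque([(root1, root2)])
    while queue:
        n1, n2 = queue.popleft()
        if box_mindist(n1["box"], n2["box"]) > d:
            continue                          # prune: no enclosed pair can qualify
        if "entries" in n1 and "entries" in n2:
            result.extend((e1["id"], e2["id"])
                          for e1 in n1["entries"] for e2 in n2["entries"]
                          if box_mindist(e1["box"], e2["box"]) <= d)
        else:
            # pair every child of one node with every child of the other;
            # a leaf is paired as-is against the other node's children
            for c1 in n1.get("children", [n1]):
                for c2 in n2.get("children", [n2]):
                    queue.append((c1, c2))
    return result
```

A production join would add the plane-sweep and buffering refinements described next; this sketch shows only the pruning rule.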
Figure 3-5. Heuristic to speed up distance computation
The distance routine is used in evaluating the join predicate to determine the distance
between two bounding rectangles associated with a pair of nodes. A node-pair qualifies
for further expansion if the distance between the pair is less than the limiting distance d
supplied by the query.
Heuristics to Improve the Basic Algorithm
The basic join algorithm can be improved in several ways by using standard
and non-standard techniques for reducing the I/O and CPU costs of spatial joins. These
include:

- Using a plane-sweeping algorithm [45] to speed up the all-pairs distance computation when pairs of nodes are expanded and their children are checked for possible matches.

- Carefully ordering the processing of node pairs so that when each pair is considered, one or both of the nodes are in the buffer [46].

- Avoiding expensive distance computations by applying heuristic filters. Computing the distance between two 3-dimensional rectangles can be a very costly operation, since the closest points may be at arbitrary positions on the faces of the rectangles. To speed this computation, the magnitudes of the diagonals of the two rectangles (d1 and d2) can be computed first. Next, we pick an arbitrary point from each of the rectangles (points P1 and P2), and compute the distance between them, called darbit. If darbit − d1 − d2 > djoin, then the two rectangles cannot contain any points as close as djoin from one another and the pair can be discarded, as shown in Figure 3-5. This provides for immediate dismissals with only three distance computations (or one, if the diagonal distances are precomputed and stored with each rectangle).
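A minimal sketch of this filter (the function name is ours; rectangles are (low-corner, high-corner) tuples, with the low corner serving as the arbitrary point):

```python
import math

def quick_dismiss(r1, r2, d_join):
    """True if the rectangle pair can be safely discarded without a full
    rectangle-to-rectangle distance computation (the Figure 3-5 heuristic)."""
    d1 = math.dist(r1[0], r1[1])        # diagonal of r1
    d2 = math.dist(r2[0], r2[1])        # diagonal of r2
    d_arbit = math.dist(r1[0], r2[0])   # distance between the arbitrary points
    # By the triangle inequality, no two points inside the rectangles can be
    # closer than d_arbit - d1 - d2, so the pair cannot satisfy the join.
    return d_arbit - d1 - d2 > d_join
```

The filter is conservative: a False result only means the pair cannot be dismissed cheaply and the exact distance must still be computed.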
Figure 3-6. Issues with R-trees: fast-moving object p joins with everyone
In addition, there are some obvious improvements to the algorithm that can be made
which are specific to the 4-dimensional CPA Join:

- The fourth dimension, time, can be used as an initial filter. If two MBRs or line segments do not overlap in time, then the pair cannot possibly be a candidate for a CPA match.

- Since time can be used to provide immediate dismissals without Euclidean distance calculations, it is given priority over the other attributes. For example, when a plane-sweep is performed to prune an all-pairs CPA distance computation, time is always chosen as the sweeping axis. The reason is that time will usually have the greatest pruning power of any attribute, since time-based matches must always be exact, regardless of the join distance.

- In our implementation of the CPA Join for R-trees, we make use of the STR packing algorithm [80] to build the trees. Because the potential pruning power of the time dimension is greatest, we ensure that the trees are well-organized with respect to time by choosing time as the first packing dimension.
Problem With R-tree CPA Join
Unfortunately, it turns out that in practice the R-tree can be ill-suited to the problem
of computing spatiotemporal joins over moving object histories. R-trees have a problem
handling databases with a high variance in object velocities. The reason for this is that
join algorithms which make use of R-trees rely on tight and well-behaved minimum
bounding rectangles to speed the processing of the join. When the positions of a set of
moving objects are sampled at periodic intervals, fast moving objects tend to produce
larger bounding rectangles than slow moving objects.
Figure 3-7. Progression of plane-sweep
One such scenario is depicted in Figure 3-6, which shows the paths of a set of objects
on a 2-D plane for a given time period. A fast-moving object such as p will be contained
in a very large MBR, while slower objects such as q will be contained in much smaller
MBRs. When a spatial join is computed over R-trees storing these MBRs, the MBR
associated with p can overlap many smaller MBRs, and each overlap will result in an
expensive distance computation (even if the objects do not travel close to one another).
Thus, any sort of variance in object velocities can adversely affect the performance of the
join.
3.4 Join Using Plane-Sweeping
The second technique that is considered is a join strategy based on a simple plane-
sweep. Plane-sweep is a powerful technique for solving proximity problems involving
geometric objects in a plane and has previously been proposed [49] as a way to efficiently
compute the spatial join operation.
3.4.1 Basic CPA Join using Plane-Sweeping
Plane-sweep is an excellent candidate for use with the CPA join because no matter
what distance threshold is given as input into the join, two objects must overlap in the
time dimension for there to be a potential CPA match. Thus, given two spatiotemporal
relations P and Q, we could easily base our implementation of the CPA Join on a
plane-sweep along the time dimension.
We would begin a plane-sweep evaluation of the CPA join by first sorting the intervals
making up P and Q along the time dimension, as depicted in Figure 3-7. We then sweep
a vertical line along the time dimension. A sweepline data structure D is maintained
which keeps track of all line segments which are valid given the current position of the line
along the time dimension. As the sweepline progresses, D is updated with insertions (new
segments that became active) and deletions (segments whose time period has expired).
Segment pairs from both input relations that satisfy the join predicate are always present
in D, and they can be checked and reported during updates to D. Pseudo-code for the
algorithm is given below:
Algorithm 2 PlaneSweep (Relation P, Relation Q, distance d)1: Form a single list L containing segments from P and Q sorted bytstart2: Initialize sweepline data structure D3: while not IsEmpty (L)do4: Segmenttop = popFront (L)5: Insert (D,top)6: Delete from D all segmentss s.t. (s.tend < top.tstart){remove segments that donot
intersect sweepline}7: Query (D,top, d){report segments in D that are within distance dist}8: end while
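The sweep loop can be sketched in Python as follows. For clarity, D is a plain list (a real implementation would use a spatial structure, as discussed below), and the tuple layout (tstart, tend, relation_tag, point) with a single static point standing in for the segment geometry is our own simplification; the sketch expires and queries before inserting, which reports the same cross-relation pairs:

```python
import math

def plane_sweep(P, Q, d):
    """Time-ordered sweep over two segment lists; reports cross-relation
    pairs whose (simplified) geometries come within distance d while both
    are cut by the sweepline."""
    L = sorted(P + Q, key=lambda s: s[0])         # merge, sorted by tstart
    D, matches = [], []
    for top in L:
        D = [s for s in D if s[1] >= top[0]]      # expire segments that ended
        for s in D:
            if s[2] != top[2] and math.dist(s[3], top[3]) <= d:
                matches.append((s, top))
        D.append(top)                             # insert the new segment
    return matches
```

For example, with one segment per relation overlapping in time and 0.5 apart in space, a single match is reported.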
In the case of the CPA join, assuming that all moving objects at any given moment
can be stored in main memory, any of a number of data structures can be used to
implement D, such as a quad- or oct-tree, or an interval skip-list [81]. The main
requirement is that the selected data structure should make it easy to check the proximity
of objects in space.
3.4.2 Problem With The Basic Approach
Although the plane-sweep approach is simple, in practice it is usually too slow to
be useful for processing moving object queries. The problem has to do with how the
sweepline progression takes place. As the sweepline moves through the data space, it has
to stop momentarily at sample points (time instances at which object positions were
recorded) to process newly encountered segments into the data structure D. New segments
Figure 3-8. Layered Plane-Sweep
that are encountered at the sample point are added into the data structure and segments
in D that are no longer active are deleted from it.
Consequently, the sweepline pauses more often when objects with high sampling rates
are present, and the progress of the sweepline is heavily influenced by the sampling rates
of the underlying objects. For example, consider Figure 3-7 which shows the trajectory
of four objects in a given time period. In the case illustrated, object p2 controls the
progression of the sweepline. Observe that in the time-interval [tstart, tend], only new
segments from object p2 get added to D, but expensive join computations are performed
each time with the same set of line segments.
The net result is that if the sampling rate of a data set is very high relative to the
amount of object movement in the data set, then processing a multi-gigabyte object
history using a simple plane-sweeping algorithm may take a prohibitively long time.
3.4.3 Layered Plane-Sweep
One way to address this problem is to reduce the number of segment level comparisons
by comparing the regions of movement of various objects at a coarser level. For example,
reconsider the CPA join depicted in Figure 3-7. If we were to replace the many oscillations
of object p2 with a single minimum bounding rectangle which enclosed all of those
oscillations from tstart to tend, we could then use that rectangle during the plane-sweep
as an initial approximation to the path of object p2. This would potentially save many
distance computations.
This idea can be taken to its natural conclusion by constructing a minimum bounding
box that encompasses the line-segments of each object. A plane-sweep is then performed
over the bounding boxes, and only qualifying boxes are expanded further. We refer to this
technique as the Layered Plane-Sweep approach, since the plane-sweep is performed at two
layers: first at the coarser level of bounding boxes, and then at the finer level of individual
line segments.
One issue that must be considered is how much movement is to be summarized
within the bounding rectangle for each object. Since we would like to eliminate as many
comparisons as possible, one natural choice would be to let the available system memory
dictate how much movement is covered for each object. Given a fixed buffer size, the
algorithm will proceed as follows.
Algorithm 3 LayeredPlaneSweep(Relation P, Relation Q, distance d)
 1: Segments are defined by [(xstart, xend), (ystart, yend), (zstart, zend), (tstart, tend)]
 2: Assume a sorted list of object segments (by tstart) on disk
 3: while there is still some unprocessed data do
 4:   Read in enough data from P and Q to fill the buffer
 5:   Let tstart be the first time tick which has not yet been processed by the plane-sweep
 6:   Let tend be the last time tick for which no data is still on disk
 7:   Next, bound the trajectory of every object present in the buffer by an MBR
 8:   Sort the MBRs along one of the spatial dimensions and then perform a plane-sweep along that dimension
 9:   Expand the qualifying MBR pairs to get the actual trajectory data (line segments)
10:   Sort the line segments by tstart
11:   Perform a final sweep along the time dimension to get the final result set
12: end while
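The first-level filtering step of the layered approach can be sketched as follows. For brevity the sketch compares all MBR pairs directly rather than sweeping them along an axis, and the function names and the dict-of-point-lists input layout are our own:

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a set of coordinate tuples."""
    return tuple(map(min, zip(*points))), tuple(map(max, zip(*points)))

def box_dist(b1, b2):
    """Minimum distance between two axis-aligned boxes ((low…), (high…))."""
    return math.sqrt(sum(max(l1 - h2, l2 - h1, 0.0) ** 2
                         for (l1, h1), (l2, h2) in zip(zip(*b1), zip(*b2))))

def candidate_pairs(objsP, objsQ, d):
    """First-level filter: object pairs whose buffered MBRs come within d.

    objsP/objsQ map object ids to the positions buffered for the current
    time window; only the surviving pairs need a segment-level sweep.
    """
    boxesP = {o: mbr(pts) for o, pts in objsP.items()}
    boxesQ = {o: mbr(pts) for o, pts in objsQ.items()}
    return [(p, q) for p, bp in boxesP.items()
                   for q, bq in boxesQ.items() if box_dist(bp, bq) <= d]
```

In the best case, all but a handful of object pairs are eliminated here, so only a small fraction of the buffered segments ever reach the second-level sweep.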
Figure 3-8 illustrates the idea. It depicts a snapshot of object trajectories starting
at some time instance tstart. Segments in the interval [tstart, tend] represent the maximum
that can be buffered in the available memory. A first level plane-sweep is carried out over
the bounding boxes to eliminate false positives. Qualifying objects are expanded and a
second-level plane-sweep is carried out over individual line-segments. In the best case,
Figure 3-9. Problem with using large granularities for bounding box approximation
there is an opportunity to process the entire data set through just three comparisons at
the MBR level.
3.5 Adaptive Plane-Sweeping
While the layered plane-sweep typically performs far better than the basic plane-sweeping
algorithm, it may not always choose the proper level of granularity for the bounding box
approximations. This Section describes an adaptive strategy that takes into careful
consideration the underlying object interaction dynamics and adjusts this granularity
dynamically in response to the underlying data characteristics.
3.5.1 Motivation
In the simple layered plane-sweep, the granularity for the bounding box approximation
is always dictated by the available system memory. The assumption is that pruning power
increases monotonically with increasing granularity. Unfortunately, this is not always the
case. As a motivating example, consider Figure 3-9. Assume available system memory
allows us to buffer all the line segments. In this case, the layered plane-sweep performs
no better than the basic plane-sweep, due to the fact that all the object bounding
boxes overlap with each other and as a result no pruning is achieved at the first-level
plane-sweep.
However, assume we had instead fixed the granularity to correspond to the time
period [tstart, ti], as depicted in Figure 3-10. In this case, none of the bounding boxes
overlap, and there are possibly many dismissals at the first level. Though less of the
buffer is processed initially, we are able to eliminate many of the segment-level distance
comparisons compared to a technique that bounds the entire time period, thereby
potentially increasing the efficiency of the algorithm. The entire buffer can then be
processed in a piece-by-piece fashion, as depicted in Figure 3-10. In general, the efficiency
of the layered plane-sweep is tied not to the granularity of the time interval that is
processed, but the granularity that minimizes the number of distance comparisons.
3.5.2 Cost Associated With a Given Granularity
Since distance computations dominate the time required to compute a typical CPA
Join, the cost associated with a given granularity can be approximated as a function of the
number of distance comparisons that are needed to process the segments encompassed in
that granularity. Let n_MBR be the number of distance computations at the box level, let
n_seg be the number of distance calculations at the segment level, and let α be the fraction
of the time range in the buffer which is processed at once. Then the cost associated with
processing that fraction of the buffer can be estimated as:

cost_α = (n_seg + n_MBR) × (1/α)
This function reflects the fact that if we choose a very small value for α, we will have to
process many cut-points in order to process the entire buffer, which can increase the cost
of the join. As α shrinks, the algorithm becomes equivalent to the traditional plane-sweep.
On the other hand, choosing a very large value for α tends to increase (n_seg + n_MBR),
eventually yielding an algorithm which is equivalent to the simple, layered plane-sweep. In
practice, the optimal value for α lies somewhere in between the two extremes, and varies
from data set to data set.
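The trade-off above can be captured in a one-line cost model. A minimal sketch in Python (the function and parameter names here are illustrative, not from the dissertation's implementation):

```python
def cost(n_seg, n_mbr, alpha):
    """Estimated distance computations to consume the whole buffer when a
    fraction alpha (0 < alpha <= 1) of its time range is processed per pass:
    the per-pass work (n_seg + n_mbr) is repeated roughly 1/alpha times."""
    return (n_seg + n_mbr) * (1.0 / alpha)
```

Note that n_seg and n_mbr are themselves functions of α (coarser granularities produce more overlapping boxes), which is why the optimal α sits between the two extremes.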
3.5.3 The Basic Adaptive Plane-Sweep
Given this cost function, it is easy to design a greedy plane-sweep algorithm that
attempts to repeatedly minimize cost_α in order to adapt to the underlying (and potentially
[Figure: trajectories p1, p2 (from P) and q1, q2 (from Q) plotted against time and y, with cut-points t_start, t_i, t_j, and t_end marked on the time axis.]
Figure 3-10. Adaptively varying the granularity
time-varying) characteristics of the data. At every iteration, the algorithm simply chooses
to process the fraction of the buffer that appears to minimize the overall cost of the
plane-sweep in terms of the expected number of distance computations. The algorithm is
given below:
Algorithm 4 AdaptivePlaneSweep(Relation P, Relation Q, distance d)
1: while there is still some unprocessed data do
2:   Read in enough data from P and Q to fill the buffer
3:   Let t_start be the first time tick which has not yet been processed by the plane-sweep
4:   Let t_end be the last time tick for which no data is still on disk
5:   Choose α so as to minimize cost_α
6:   Perform a layered plane-sweep from time t_start to t_start + α(t_end − t_start) {steps 5-9 of procedure LayeredPlaneSweep}
7: end while
Unfortunately, there are two obvious difficulties involved with actually implementing
the above algorithm:
First, the cost cost_α associated with a given granularity is known only after the layered plane-sweep has been executed at that granularity.
Second, even if we can compute cost_α easily, it is not obvious how we can compute cost_α for all values of α from 0 to 1 so as to minimize cost_α over all α.
These two issues are discussed in detail in the next two Sections.
3.5.4 Estimating Cost
This Section describes how to efficiently estimate cost_α for a given α using a simple,
online sampling algorithm reminiscent of the algorithm of Hellerstein and Haas [82].
At a high level, the idea is as follows. To estimate cost_α, we begin by constructing
bounding rectangles for all of the objects in P considering their trajectories from time
t_start to t_start + α(t_end − t_start). These rectangles are then inserted into an in-memory index, just as
if we were going to perform a layered plane-sweep. Next, we randomly choose an object q_1
from Q, and construct a bounding box for its trajectory as well. This object is joined with
all of the objects in P by using the in-memory index to find all bounding boxes within
distance d of q_1. Then:
Let n_MBR,q1 be the number of distance computations needed by the index to compute which objects from P have bounding rectangles within distance d of the bounding rectangle for q_1, and
Let n_seg,q1 be the total number of distance computations that would have been needed to compute the CPA distance between q_1 and every object p ∈ P whose bounding rectangle is within distance d of the bounding rectangle for q_1 (this can be computed efficiently by performing a plane-sweep without actually performing the required distance computations).
Once n_MBR,q1 and n_seg,q1 have been computed for q_1, the process can be repeated for a
second randomly selected object q_2 ∈ Q, for a third object q_3, and so on. A key observation
is that after m objects from Q have been processed, the value

ĉ_m = (1/m) × Σ_{i=1}^{m} (n_MBR,qi + n_seg,qi) × |Q|

represents an unbiased estimator for (n_MBR + n_seg) at α, where |Q| denotes the number of
data objects in Q.
In practice, however, we are not only interested in ĉ_m. We would also like to know at
all times just how accurate our estimate ĉ_m is, since at the point where we are satisfied
with our guess as to the real value of cost_α, we want to stop the process of estimating
cost_α and continue with the join.
Fortunately, the central limit theorem can easily be used to estimate the accuracy
of ĉ_m. Assuming sampling with replacement from Q, for large enough m the error of our
estimate will be normally distributed around (n_MBR + n_seg) with variance

σ²_m = (1/m) × σ²(Q),

where σ²(Q) is defined as

σ²(Q) = (1/|Q|) × Σ_{i=1}^{|Q|} {(n_MBR,qi + n_seg,qi) × |Q| − (n_MBR + n_seg)}²

Since in practice we cannot know σ²(Q), it must be estimated via the expression

σ̂²(Q_m) = (1/(m − 1)) × Σ_{i=1}^{m} {(n_MBR,qi + n_seg,qi) × |Q| − ĉ_m}²

(Q_m denotes the sample of Q that is obtained after m objects have been randomly
retrieved from Q). Substituting into the expression for σ²_m, we can treat ĉ_m as a normally
distributed random variable with variance σ̂²(Q_m)/m.
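A sketch of this online sampling loop in Python, stopping once the two-standard-deviation band falls within a chosen relative error. The helper `count_for`, which returns n_MBR,q + n_seg,q for one sampled object, and all other names here are assumptions for illustration:

```python
import random
import statistics

def estimate_cost(Q, count_for, rel_err=0.1, min_samples=30, max_samples=10_000):
    """Estimate (n_MBR + n_seg) by sampling objects q from Q with replacement;
    each sample scales one object's work up to all of Q.  Stop once
    2 * sqrt(var/m) <= rel_err * estimate (roughly 95% confidence)."""
    samples = []
    for _ in range(max_samples):
        q = random.choice(Q)
        samples.append(count_for(q) * len(Q))
        m = len(samples)
        if m >= min_samples:
            est = statistics.fmean(samples)
            var = statistics.variance(samples)  # sample variance, 1/(m-1) normalization
            if 2.0 * (var / m) ** 0.5 <= rel_err * est:
                return est
    return statistics.fmean(samples)
```

In a real implementation `count_for` would probe the in-memory index and run the segment-level plane-sweep bookkeeping described above, rather than a user-supplied function.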
In our implementation of the adaptive plane-sweep, we continue the sampling
process until our estimate for cost_α is accurate to within ±10% at 95% confidence. Since
95% of the standard normal distribution falls within two standard deviations of the mean,
this is equivalent to sampling until 2 × sqrt(σ̂²(Q_m)/m) is less than ĉ_m × 0.1.
3.5.5 Determining The Best Cost
We now address the second issue: how to compute cost_α for all values of α from 0 to
1 so as to minimize cost_α over all α.
Calculating cost_α for all possible values of α is prohibitively expensive and hence not
feasible in practice. Fortunately, we do not have to evaluate every value of α to
determine the best one. This is due to the following interesting fact: if we plot all possible
values of α against their respective associated costs, we would observe that the graph is not
linear, but exhibits a certain concavity. The concave region of the graph represents a sweet
spot and represents the feasible region for the best cost.
As an example consider Figure 3-11, which shows the plot of the cost function for
various fractions α for one of the experimental data sets from Section 3.6. Given this
[Figure: estimated number of distance computations plotted against the percentage of the buffer processed, for k = 20.]
Figure 3-11. Convexity of cost function illustration.
fact, we identify the feasible region by evaluating cost_αi for a small number, k, of α_i
values. Given k, the number of allowed cut-points, the fraction α_1 can be determined as
follows:

α_1 = r^(1/k) / r

where r = (t_end − t_start) is the time range described by the buffer (the above formula
assumes that r is greater than one; if not, then the time range should be scaled accordingly).
In the general case, the fraction of the buffer considered by any α_i (1 ≤ i ≤ k) is given by:

α_i = (r^(1/k))^i / r

Note that since the growth rate of each subsequent α_i is exponential, we can cover
the entire buffer with just a small k and still guarantee that we will consider some value
of α_i that is within a factor of r^(1/k) of the optimal. After computing α_1, α_2, ..., α_k, we
successively evaluate increasing buffer fractions α_1, α_2, α_3, and so on, and determine their
associated costs. From these k costs we determine the α_i with the minimum cost.
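The geometric schedule of candidate fractions is straightforward to compute; a sketch (the function name is illustrative):

```python
def candidate_fractions(r, k):
    """k buffer fractions alpha_i = r**(i/k) / r, for i = 1..k: consecutive
    candidates differ by the constant factor r**(1/k), and alpha_k = 1
    covers the whole buffer (assumes the time range r > 1)."""
    return [r ** (i / k) / r for i in range(1, k + 1)]
```

For example, r = 16 and k = 4 yield the fractions 1/8, 1/4, 1/2, and 1.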
Note that if we choose α based on the evaluation of a small k, then it is possible that
the optimal choice of α may lie outside the feasible region. However, there is a simple
approach to solving this issue. After an initial evaluation of k granularities, consider
just the region starting before and ending after the best α_i and recursively reapply the
evaluation described above just in this region.
[Figure: the buffered time range (t_start, t_end) divided by cut-points α_1 through α_5; at each iteration the minimum-cost cut-point is located and k new cut-points are evaluated between its neighbors.]
Figure 3-12. Iteratively evaluating k cut points
For instance, assume we chose α_i after evaluation of k cut-points in the time range r.
To further tune this α_i, we consider the time range defined between the adjacent cut-points
α_{i−1} and α_{i+1} and recursively apply cost estimation in this interval (i.e., evaluate k points
in the time range (t_start + α_{i−1} × r, t_start + α_{i+1} × r)). Figure 3-12 illustrates the idea. This
approach is simple and very effective in considering a large number of choices of α.
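The refinement step can be sketched as follows. For simplicity this sketch spaces the re-evaluated candidates evenly inside the bracketing interval rather than geometrically; `cost_at` stands in for the sampling-based estimator, and all names are illustrative:

```python
def best_fraction(cost_at, fractions, rounds=3):
    """Greedy zoom: pick the cheapest candidate fraction, then re-evaluate
    k fresh candidates between its two neighbors, and repeat."""
    k = len(fractions)
    best = fractions[0]
    for _ in range(rounds):
        costs = [cost_at(a) for a in fractions]
        i = min(range(k), key=costs.__getitem__)
        best = fractions[i]
        lo = fractions[i - 1] if i > 0 else best / 2
        hi = fractions[i + 1] if i < k - 1 else best
        # k new candidates inside the interval bracketing the current winner
        fractions = [lo + (hi - lo) * (j + 1) / k for j in range(k)]
    return best
```

Each round narrows the search interval around the current minimum, so a handful of rounds considers a large number of choices of α while estimating only k costs per round.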
3.5.6 Speeding Up the Estimation
Restricting the number of candidate cut-points can help keep the time required to
find a suitable value for α manageable. However, if the estimation is not implemented
carefully, the time required to consider the cost at each of the k possible time periods can
still be significant.
The most obvious method for estimating cost_α for each of the k granularities would
be to simply loop through each of the associated time periods. For each time period,
we would build bounding boxes around each of the trajectories of the objects in P, and
then sample objects from Q as described in Section 3.5 until the cost was estimated with
sufficient accuracy.
However, this simple algorithm results in a good deal of repeated work for each time
period, and can actually decrease the overall speed of the adaptive plane-sweep compared
to the layered plane-sweep. A more intelligent implementation can speed the optimization
process considerably.
In our implementation, we maintain a table of all the objects in P and Q, organized
on the ID of each object. Each entry in the table points to a linked list that contains a
chain of MBRs for the associated object. Each MBR in the list bounds the trajectory
of the object for one of the k time periods considered during the optimization, and the
MBRs in each list are sorted from the coarsest of the k granularities to the finest. The
data structure is depicted in Figure 3-13.
Given this structure, we can estimate cost_α for each of the k values of α in
parallel, with only a small hit in performance associated with an increased value for k.
Any object pair (p ∈ P, q ∈ Q) that needs to be evaluated during the sampling process
described in Section 3.5 is first evaluated at the coarsest granularity, corresponding to α_k.
If the two MBRs are within distance d of one another, then the cost estimate for α_k is
updated, and the evaluation is then repeated at the second coarsest granularity, α_{k−1}. If
there is again a match, then the cost estimate for α_{k−1} is updated as well. The process is
repeated until there is not a match. As soon as we find a granularity at which the MBRs
for p and q are not within a distance d of one another, we can stop the process,
because if the MBRs for p and q are not within distance d for the time period associated
with α_i, then they cannot be within this distance for any time period α_j where j < i.
The benefit of this approach is that in cases where the data are well-behaved and
the optimization process tends to choose a value for α that causes the entire buffer to be
processed at once, a quick check of the distance between the outer-most MBRs of p and q
is the only geometric computation needed to process p and q, no matter what value of k is
chosen.
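The early-out walk over the MBR chains can be sketched in one dimension, with interval arithmetic standing in for full bounding boxes (all names here are illustrative):

```python
def interval_dist(a, b):
    """Distance between closed intervals a = (lo, hi) and b; 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0.0)

def matched_levels(chain_p, chain_q, d):
    """Chains are ordered coarsest granularity first.  Count the levels at
    which p and q match; stop at the first miss, since the MBR for a finer
    (shorter) time period is contained in the coarser one, so a miss at one
    level rules out every finer level after it."""
    count = 0
    for bp, bq in zip(chain_p, chain_q):
        if interval_dist(bp, bq) > d:
            break
        count += 1
    return count
```

In the well-behaved case the loop exits after the single coarsest comparison, which is exactly the cheap path described above.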
The bounding box approximations themselves can be formed while the system buffer
is being filled with data from disk. As trajectory data are being read from disk, we grow
the MBRs for each α_i progressively. Since each α_i represents a fraction of the buffer, the
updates to its MBR can be stopped as soon as that fraction of the buffer has been
filled. Similar logic can be used to shrink the MBRs when some fraction of the buffer is
consumed and expand them when the buffer is refilled.
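The progressive construction can be sketched in one dimension, with scalar points standing in for trajectory samples (all names are illustrative):

```python
def build_chains(points, fractions):
    """Build one 1-D MBR per candidate fraction while 'filling the buffer':
    the MBR for fraction alpha_i only absorbs the first alpha_i share of the
    points, so each MBR stops growing once its share of the buffer is full.
    Returned chains are ordered coarsest fraction first."""
    n = len(points)
    chains = []
    for a in sorted(fractions, reverse=True):
        prefix = points[: max(1, round(a * n))]
        chains.append((min(prefix), max(prefix)))
    return chains
```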
[Figure: the table of objects p_1, p_2, ..., p_n, each pointing to a chain of MBRs MBR(α_5), MBR(α_4), MBR(α_3), MBR(α_2), MBR(α_1) bounding the trajectory over successively finer time periods.]
Figure 3-13. Speeding up the Optimizer
3.5.7 Putting It All Together
In our implementation of the adaptive plane-sweep, data are fetched in blocks
and stored in the system buffer. Then an optimizer routine is called which evaluates
k granularities and returns the granularity with the minimum cost. Data in the
granularity chosen by the optimizer is then evaluated using the LayeredPlaneSweep
routine (procedure described in Section 3.4). When the LayeredPlaneSweep routine returns,
the buffer is refilled and the process is repeated. The techniques described in the previous
Section are utilized to make the optimizer implementation fast and efficient.
3.6 Benchmarking
This section presents experimental results comparing the various methods discussed so
far for computing a spatiotemporal CPA Join: an R-tree, a simple plane-sweep, a layered
plane-sweep, and an adaptive plane-sweep with several parameter settings. The Section is
organized as follows. First, a description of the three three-dimensional temporal data sets
used to test the algorithms is given. This is followed by the actual experimental results
and a detailed discussion analyzing the experimental data.
3.6.1 Test Data Sets
The first two data sets that we use to test the various algorithms result from two
physics-based, N-body simulations. In both data sets, constituent records occupy 80B on
disk (80B is the storage required to record the object ID, time information, as well as the
[Figure: a 3-D view of object positions in the simulation.]
Figure 3-15. Collision data set at time tick 1,500
strong gravitational interaction. A small sample of the galaxies in the simulation is depicted above in Figure 3-14, at time tick 1,500.
In addition, we test a third data set created using a simple, 3-dimensional random walk.
We call this the Synthetic data set (this data set was again about 50GB in size). The
speed of the various objects varies considerably during the walk. The purpose of including
this data is to rigorously test the adaptability of the adaptive plane-sweep, by creating
a synthetic data set where there are significant fluctuations in the amount of interaction
among objects as a function of time.
3.6.2 Methodology and Results
All experiments were conducted on a 2.4GHz Intel Xeon PC with 1GB of RAM. The
experimental data sets were each stored on an 80GB, 15,000 RPM Seagate SCSI disk.
For all three of the data sets, we tested an R-tree-based CPA Join (implemented as
described in Section 3.3; we used the STR R-tree packing algorithm [80] to construct an
R-tree for each input relation), a simple plane-sweep (implemented as described in Section
3.4), and a layered plane-sweep (implemented as described in Section 3.5).
We also tested the adaptive plane-sweep algorithm, implemented as described
in Section 3.5. For the adaptive plane-sweep, we also wanted to test the effect of the
[Figure: time taken vs. percentage of the join completed for the CPA-Join over the Injection data set, with curves for the R-Tree, Simple Sweep, and Layered Sweep algorithms.]