﻿ABSTRACT
Approximation Techniques for Spatial Data ∗
Abhinandan Das
Cornell University
asdas@cs.cornell.edu
Spatial Database Management Systems (SDBMS), e.g., Geographical
Information Systems, that manage spatial objects
such as points, lines, and hyper-rectangles, often have
very high query processing costs. Accurate selectivity estimation
during query optimization therefore is crucially important
for finding good query plans, especially when spatial
joins are involved. Selectivity estimation has been studied
for relational database systems, but to date has only received
little attention in SDBMS. In this paper, we introduce novel
methods that permit high-quality selectivity estimation for
spatial joins and range queries. Our techniques can be constructed
in a single scan over the input, handle inserts and
deletes to the database incrementally, and hence they can
also be used for processing of streaming spatial data. In contrast
to previous approaches, our techniques return approximate
results that come with provable probabilistic quality
guarantees. We present a detailed analysis and experimentally
demonstrate the efficacy of the proposed techniques.
1. INTRODUCTION
In recent years, data management for spatial applications
such as Geographic Information Systems, the earth sciences,
and environmental monitoring has gained significant importance.
In spatial data management, records in the database
have a spatial extent, and users can pose expressive queries
such as a spatial join between two relations (join all objects
that overlap or are within certain distance of each other)
or a range query (report all objects in a selected range, or
return an aggregate over the selected objects). Due to the
spatial extent of the objects, executing spatial queries can
be very expensive.
When selecting different query plans for a spatial query,
∗ The authors are supported by NSF grants IIS-0330201,
CCF-0205452, and IIS-0133481, and by a gift from Microsoft.
Any opinions, findings, conclusions, or recommendations
expressed in this paper are those of the authors and
do not necessarily reflect the views of the sponsors.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage, and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD 2004 June 13-18, 2004, Paris, France.
Copyright 2004 ACM 1-58113-859-8/04/06 . . . $5.00.
Johannes Gehrke
Cornell University
johannes@cs.cornell.edu
Mirek Riedewald
Cornell University
mirek@cs.cornell.edu
we need accurate estimates of the cost of different execution
strategies. Thus as in relational systems, we need accurate
selectivity (cardinality) estimates. Note that these
estimates can also be used for fast approximation of an aggregate
query, e.g., a classical example is the approximate
range aggregate; another interesting example is to use approximate
join cardinality for correlation analysis between
data sets.
The current state of the art in estimating the sizes of
spatial joins uses either sampling or histograms, where recent
work has shown that histograms are superior for a wide
range of possible query classes [5, 26]. However, existing
histogram-based techniques still have significant drawbacks.
First, they either give no or only very conservative worstcase
error guarantees that are usually overly pessimistic in
practice. Due to the lack of error guarantees, it is not possible
to quantify the tradeoff between storage and result accuracy
(except by performing extensive experimental evaluation
with a priori known data distributions). Second, existing
histogram techniques are designed for static datasets,
and require usually several passes over the data during construction.
The only exception are histograms that use a
fixed partitioning of the space (e.g., equi-width). These can
be constructed in a single pass and can be maintained incrementally,
but they cannot adapt to skewed or changing
data distributions.
Our Contributions. In this paper, we introduce a novel
framework for spatial data that is based on randomized projections.
Our framework can handle a large set of challenging
spatial operators, and it can handle objects with spatial
extent. Our approach is grounded on randomizing techniques
for computing small, pseudo-random sketches of the
spatial dataset. The basic sketching technique was originally
introduced for on-line self-join size estimation by Alon,
Matias, and Szegedy in their seminal paper [4] and, as we
demonstrate in our work, can be generalized to provide approximate
answers to spatial join queries with explicit and
tunable performance guarantees on the approximation error.
The main challenge in using sketches for spatial data is to
design selectivity estimation algorithms which are based on
the right way of counting events like points being contained
in intervals. The algorithm should not only determine the
correct estimate on expectation, but its results should also
have low variance.
More concretely, the key contributions of our work can be
summarized as follows:
• We introduce the first summary data structure for spatial
data that allows estimation of the selectivity of
spatial queries with provable performance guarantees.
It permits a graceful tradeoff between space consumption
and the quality of the resulting estimation. We
apply our spatial sketch technique to spatial joins of
intervals and rectangles (Section 4). All of our techniques
are accompanied by a thorough analytical evaluation
in which we show probabilistic bounds on the
quality of the resulting summary data structures.
• We show the generality of our technique by discussing
several extensions: Joins of hyper-rectangles in a ddimensional
space where d > 2, spatial joins with other
join predicates like “contained”, ǫ-joins, and range
queries (Section 6).
• In a thorough experimental evaluation, we compare
our techniques with the best previously known estimation
techniques for spatial joins, the Geometric and
the Euler histograms (Section 7).
Note that, even though we develop our sketching algorithms
in the classic database context of stored relations, our techniques
are more generally applicable to scenarios where only
a single pass over the data is possible — from streams of
spatial data to huge Terabyte databases where performing
multiple passes over the data for the exact computation of
query results may be prohibitively expensive.
The rest of the paper is organized as follows. Section 2
defines spatial queries and discusses the sketching approach
of [4]. In Section 3 we introduce the basic atomic sketches
which will be used to construct selectivity estimators. Selectivity
estimation for joins of interval sets and rectangle
sets are discussed in Section 4. Sections 5 and 6 show how
to generalize the technique to virtually any practical setting
and how to extend our techniques to the d-dimensional
case, other join conditions, and other spatial operators. Experimental
results are discussed in Section 7, and Section 8
reviews related work. Section 9 concludes this article.
2. BACKGROUND
2.1 Queries
We focus mostly on selectivity estimation for spatial joins,
more precisely spatial joins of sets of hyper-rectangles and
ε-joins of point sets. These are the most common types of
joins in spatial DBMS. In the following, let N be a (discrete)
metric space and let r = r(1) × r(2) × · · · × r(d)
be a hyper-rectangle which is defined by range r(i) in dimension
i, 1 ≤ i ≤ d. A range r(i) is defined by its
lower and upper endpoint l(r(i)) and u(r(i)) such that
r(i) = {x ∈ N |l(r(i)) ≤ x ≤ u(r(i))}. Hence r = {x =
(x1, x2, . . . , xd) ∈ N d |∀1 ≤ i ≤ d : l(r(i)) ≤ xi ≤ u(r(i))}.
Note that our techniques easily generalize to multidimensional
data spaces with different domains for the dimensions.
In Section 5.1 we discuss how to handle real-valued domains.
Definition 1. Let R and S be two sets of hyper-rectangles
in d-dimensional space N d . Functionoverlap for two hyperrectangles
r(1) ×r(2) × · · · ×r(d) ∈ R and s(1) ×s(2) × · · · ×
s(d) ∈ S returns true if ∀1 ≤ i ≤ d : l(r(i)) < l(s(i)) <
u(r(i)) ∨ l(r(i)) < u(s(i)) < u(r(i)) ∨ l(s(i)) < l(r(i)) <
u(s(i)) ∨ l(s(i)) < u(r(i)) < u(s(i)), and false otherwise.
The spatial join of R and S then is defined as
R ⊲⊳o S = {(r, s)|r ∈ R ∧ s ∈ S ∧overlap(r,s)} ,
and hence its selectivity is |R⊲⊳oS|
|R|·|S| .
According to this definition objects that only “touch at their
boundaries” are not part of the join result. We will examine
possible generalizations in Section 6.
Definition 2. Let A and B be two sets of points in ddimensional
space N d and dist be a function that returns
the distance between a point in A and a point in B. The
ε-join of A and B then is defined as
A ⊲⊳ε B = {(a, b)|a ∈ A ∧ b ∈ B ∧dist(a, b) ≤ ε} .
Its selectivity is the cardinality of the result divided by the
total number of point-pairs, i.e., |A⊲⊳εB|
|A|·|B| .
Typically the distance dist(a, b) of two points a =
(a1, . . . , ad) and b = (b1, . . . , bd) is computed by using an
Li-distance distL i , or disti for short, where
disti(a,b) = (
including the L∞ distance
d�
i=1
|ai − bi| i ) 1 i ,
dist∞(a, b) = max{|a1 − b1|, . . . , |ad − bd|} .
We also examine selectivity estimation for range queries.
Definition 3. Let R be a set of d-dimensional hyperrectangles
and q = (q(1) × q(2) × · · · × q(d)) be the query
hyper-rectangle such that range q(i) is selected in dimension
i. This range query selects the set Q(q, R) = {r ∈
R|overlap(r,q)}. The selectivity is defined as |Q(q,R)|
|R|
Notice that for all introduced query types the challenge is
to compute the result cardinality. Knowing the cardinality,
we can compute the selectivity by simply keeping track
of the input cardinalities. Hence in the following we are
concerned with estimating the cardinality of the results of
spatial queries.
2.2 AMS Sketches
Sketches were proposed for processing data streams. Consider
a simple scenario where the goal is to estimate the size
of the self-join SJ(A) of relation R over one of its attributes
R.A as the tuples of R are streaming in; that is, we seek
to approximate the result of query Q = COUNT(R ⊲⊳A R).
W.l.o.g. let the domain of join attribute A be dom(A) =
{0, 1, · · · , |dom(A)| − 1}, where |dom(A)| denotes the size of
the domain. Using f(i) to denote the frequency of attribute
value i in R.A, we can rewrite query Q as Q = SJ(A) =
�
i∈dom(A) f(i)2 (i.e., the second moment of A). In their
seminal paper, Alon, Matias, and Szegedy [4] prove that
any deterministic algorithm that produces a tight approximation
to SJ(A) requires at least Ω(|dom(A)|) bits of storage,
rendering such solutions impractical for a data-stream
setting. Instead, they propose a randomized technique that
offers strong probabilistic guarantees on the quality of the
resulting SJ(A) approximation while using only logarithmic
space in |dom(A)|.
The basic idea of their scheme is to define a random variable
Z that can be easily computed over the streaming values
of R.A, such that (1) Z is an unbiased estimator for
SJ(A), i.e., E[Z] = SJ(A); and (2) Z has sufficiently small
variance Var(Z) to provide strong probabilistic guarantees
for the quality of the estimate. This random variable Z
is constructed on-line from the streaming values of R.A as
follows:
• Select a family of four-wise independent binary random
variables {ξi : i = 1, . . . , |dom(A)|}, where each ξi ∈
{−1, +1} and Pr[ξi = +1] = Pr[ξi = −1] = 1/2 (i.e.,
E[ξi] = 0). Informally, the four-wise independence
condition means that for any 4-tuple of ξi variables
and for any 4-tuple of {−1,+1} values, the probability
that the values of the variables coincide with those in
the {−1, +1} 4-tuple is exactly 1/16 (the product of
the equality probabilities for each individual ξi).
• Define Z = X 2 , where X = �
i∈dom(A) f(i)ξi. Note
that X is simply a randomized linear projection (inner
product) of the frequency vector of Ri.A with the
vector of ξi’s that can be efficiently generated from the
streaming values of A as follows: Start with X = 0 and
simply add ξi to X whenever the i th value of A is observed
in the stream.
The crucial point here is that, by employing known tools
(e.g., orthogonal arrays) for the explicit construction of
small sample spaces supporting four-wise independent random
variables, such families can be efficiently constructed
on-line using only O(log |dom(A)|) space [4]. More precisely,
we do not explicitly store the ξi. Instead we store a single
seed (for ξi with i of length k bits, the seed has length 2k+1
bits) for the whole ξ-family. Whenever a ξi is needed for the
computation, its value is generated on-the-fly from the seed
in time linear in the seed size.
2.3 Boosting Accuracy
To improve the quality of the estimation guarantees one
can use a standard boosting technique (see [4]) that maintains
several independent identically-distributed (i.i.d.) instantiations
of a random variable and uses averaging and
median-selection operators to boost accuracy and probabilistic
confidence. The i.i.d. instances can be constructed
by simply selecting independent random seeds for generating
the families of four-wise independent ξi’s for each instance.
Let {Zi,j}, i ∈ {1, 2, . . . , k1} and j ∈ {1, 2, . . . , k2}, be
a set of k1 · k2 of such i.i.d. instances of an estimator Z
for a quantity Q, such that E[Z] = Q. Each Zi,j is a randomized
linear projection of the data stream as discussed
in Section 2.2. We use the term atomic sketch to describe
such a single projection, and the term sketch for the overall
synopsis consisting of k1 ·k2 i.i.d. instances of these random
projections.
Figure 1 shows a sketch and illustrates how the boosting
works. For each row j we compute the average ¯Zj =
�k1 i=1
Zi,j. Then we output the median of these averages
{ ¯ Z1, ¯ Z2, . . . , ¯ Zk2 }. The following lemma establishes the
quality guarantees.
Lemma 1. [4] Using 16 Var[Z]
ε2E[Z] 2 lg 1
independent copies of
φ
Z we can guarantee that the computed estimate ¯ Z of E[Z]
satisfies Pr[| ¯ Z − E[Z]| > εE[Z]] ≤ φ.
Proof. By setting k1 = 8
ε2 Var[Z]
E[Z] 2 and k2 = 2 lg(1/φ) the
lemma follows from Chebychev’s inequality and Chernoff
bounds.
Note that determining the number of instances of Z actually
requires knowledge of E[Z], the very value we would like
Sketch
Atomic sketch
Z(1,1)
Atomic sketch
Z(1,2)
Atomic sketch
Z(1,k2)
Atomic sketch
Z(2,1)
Atomic sketch
Z(2,2)
Atomic sketch
Z(2,k2)
Atomic sketch
Z(k1,1)
Atomic sketch
Z(k1,2)
Atomic sketch
Z(k1,k2)
Boosted estimate of E[Z]:
Average
Z(1)
Average
Z(2)
Average
Z(k2)
Median of the
Z
Figure 1: Boosting the accuracy
to estimate! This is a common problem shared by previous
sketching techniques (see Section 8) and Online Aggregation
[19, 18], in fact also by any experiment in the natural
sciences where we want to estimate the error in measuring
an unknown quantity. We can generally use simple approximation
techniques that provide a lower bound on E[Z]
(“sanity bounds” as discussed in [3]), or use historic data,
e.g., previously computed exact answers, to predict future
values of E[Z]. The tradeoff here is that the tighter these
bounds are, the stronger the quality guarantees returned by
the technique.
3. SKETCHES FOR SPATIAL DATA
Techniques designed for relational DBMS often do not easily
extend to spatial data. This is mainly caused by the fact
that SDBMS manage data with extent. A typical operation
during the processing of a spatial query is to check if two
objects overlap, or if they are within a certain distance ε of
each other. Our techniques address this problem for spatial
queries over collections of hyper-rectangles (including
points, lines, and rectangles as special cases). Most SDBMS
manage hyper-rectangles like geographical maps, environmental
measurements for points on the earth’s surface, and
so on. (Hyper-)rectangles are also used as bounding boxes
of more complex objects (e.g., polygons, polylines) since it is
often more efficient to first compute a super-set of the final
result based on these bounding boxes, and then to filter out
the false positives in a final filtering step.
For processing hyper-rectangles, we identified that all of
the targeted spatial queries rely on the same basic operation:
determine if a point lies within an interval. Hence by constructing
a corresponding sketch, we have the basic ingredient
for selectivity estimation for spatial queries. The actual
challenge then is to determine how to use these sketches to
obtain a good query result cardinality estimate with high
probability, using only small space.
3.1 Basic Sketches for One Dimension
For simplicity assume our input data set R is a onedimensional
set of intervals. We use [a, b] to denote an
interval with lower endpoint a and upper endpoint b. To
construct our sketch we use a family of four-wise independent
random variables {ξi} (as introduced in Section 2.2),
such that variable ξi, i ∈ {0, . . . , n−1} = N, corresponds to
coordinate i ∈ N. An interval [a, b] ∈ R intuitively is represented
by the coordinates of the points it contains, i.e., we
use ξa +ξa+1+ · · ·+ξb to sketch it. Now assume we want to
test if a point c ∈ N lies within the interval. Observe that
since the ξ-variables are four-wise (and hence also pairwise)
independent we have
a ≤ c ≤ b ⇔ E[(ξa + ξa+1 + · · · + ξb)ξc] = 1
c < a ∨ c > b ⇔ E[(ξa + ξa+1 + · · · + ξb)ξc] = 0
This forms the basis of our spatial sketches. For data set R
define the standard atomic spatial sketches
VI = �
(ξa + ξa+1 + · · · + ξb) and
[a,b]∈R
VE = �
[a,b]∈R
(ξa + ξb) (1)
Intuitively VI keeps track of the complete intervals, while VE
summarizes only their endpoints.
Notice that the update cost of VI depends linearly on the
interval length since we have to add the corresponding ξi for
each point i ∈ N that is contained in the interval. For large
coordinate domains, which are not uncommon in spatial applications,
this quickly becomes costly and also results in
high variance of the estimates. Hence we introduce dyadic
spatial sketches. Similar to [11, 17], we partition the domain
N into intervals of size 2 i . For simplicity let n = |N | be a
power of 2, say 2 h for some positive integer h. 1 For each
level 0 ≤ i ≤ h we partition N into 2 h−i intervals of size 2 i
each. Hence for level i = 0 we have point “intervals”, each
corresponding to a single domain value, while for level i = h
we have a single interval covering the whole domain. Let D
be the set of all dyadic intervals of all levels over N. Our
technique is based on the following lemmata.
Lemma 2. Let [a, b] be an interval. Its dyadic cover,
D([a, b]), is defined to be the smallest set of dyadic intervals
{δ1, δ2, . . . , δm} such that δi = [di−1, di], 1 ≤ i ≤ m,
and d0 = a and dm = b. Then m ≤ 2log 2 n.
Lemma 3. For each point a ∈ N, let D([a]) denote its
dyadic point cover, which is the set of all dyadic intervals in
D containing a. There are exactly log 2 n+1 dyadic intervals
in D that contain this point. Each of these dyadic intervals
is at a different level.
Lemma 4. A point c ∈ N is contained in an interval [a, b]
iff there is exactly one dyadic interval δ ∈ D such that δ ∈
D([a, b]) and δ ∈ D([c]).
Proof. See [13].
Instead of having a ξ-variable for each coordinate in N
we will use a ξ-variable for each dyadic interval over N. For
data set R, the corresponding atomic sketches for intervals
and endpoints then are defined as:
XI = � �
ξδ and
XE = �
[a,b]∈R δ∈D([a,b])
�
[a,b]∈R δ∈D([a])∪D([b])
Figure 2 shows an example for intervals and their dyadic
covers. To simplify notation we will henceforth use
¯ξ[a,b] = �
ξδ and ¯ ξ[a] = �
(3)
δ∈D([a,b])
ξδ
δ∈D([a])
1 Otherwise we just pad the domain with additional values.
ξδ
(2)
Domain:
Dyadic Intervals:
Input intervals: r
δ1
δ2 δ3
δ4 δ5 δ6 δ7
s
Dyadic covers:
D(r): {δ2,δ6}
D(l(r)): {δ4,δ2,δ1}
D(u(r)): {δ6,δ3,δ1}
D(s): {δ5,δ3}
D(l(s)): {δ5,δ2,δ1}
D(u(s)): {δ7,δ3,δ1}
Figure 2: Intervals and their dyadic covers
With this notation we can rewrite the dyadic atomic sketches
as
XI = �
¯ξ[a,b]; XE = �
[a,b]∈R
[a,b]∈R
( ¯ ξ[a] + ¯ ξ[b]) (4)
Overall, using dyadic sketches the update cost is reduced to
O(log2 n) per atomic sketch. Since the atomic sketches are
defined over the set of dyadic intervals, which has cardinality
2n − 1, we need O(log2 (2n − 1)) = O(log2 n) space for
generating each family of the ξ-variables.
For variance analysis it will often be more convenient to
rewrite the dyadic sketch definition into a different format.
Note that XI is computed by adding one or more ξδi for
each interval [a, b] ∈ R, where δi ∈ D is a dyadic interval
over N. For each dyadic interval δ ∈ D let fI(δ) = |{[a, b] ∈
R|δ ∈ D([a, b])}| be the number of intervals in R whose
cover contains δ. Similarly, let fE(δ) = |{a|a ∈ δ ∧(∃b ∈ N :
[a, b] ∈ R ∨ [b, a] ∈ R)}| be the number of interval endpoints
in R whose cover contains dyadic interval δ. Then we can
write
XI = �
fI(δ)ξδ; XE = �
fE(δ)ξδ (5)
δ∈D
δ∈D
For instance for interval r in Figure 2 we have fI(δ2) = 1,
fI(δ6) = 1, and fI(δi) = 0 for i ∈ {1, 3,4, 5,7}.
For an atomic sketch X we define its self-join size SJ(X)
as E[X 2 ]. For example, for sketch XI we have
SJ(XI) = E[X 2 I ] = E[( �
fI(δ)ξδ) 2 ] = �
f 2 I (δ) .
δ∈D
δ∈D
The last step follows from linearity of expectation and
the four-wise (and hence two-wise) independence of the ξvariables.
3.2 Sketches for Multidimensional Data
We discuss sketches for d = 2, i.e., sets of rectangles. The
generalization to higher dimensionality then is straightforward.
As introduced in Section 2.1, we write a rectangle
as the cross-product of its ranges in each dimension, e.g.,
r = [a, b] × [c, d].
To keep track of two-dimensional objects, we need an independent
ξ(i) family of four-wise independent random variables
(see Section 2.2) for each dimension i ∈ {1, 2}. More
precisely, any ξ(1)j is independent of any ξ(2)k for any j, k.
The main observation for sketching a rectangle [a, b] × [c, d]
is that we (1) have to make sure that the sketch “remembers”
that [a, b] and [c, d] “belong together”, and that (2) we
have to keep track not only of rectangles and points, but also
of horizontal and vertical lines, as will become clear below.
The latter is important for correct join size estimation.
The dyadic atomic sketches for a two-dimensional data set
R are defined as follows:
XII = �
[a,b]×[c,d]∈R
XIE = �
[a,b]×[c,d]∈R
XEI = �
[a,b]×[c,d]∈R
XEE = �
[a,b]×[c,d]∈R
¯ξ(1)[a,b] ¯ ξ(2)[c,d]
¯ξ(1)[a,b]( ¯ ξ(2)[c] + ¯ ξ(2)[d])
( ¯ ξ(1)[a] + ¯ ξ(1)[b]) ¯ ξ(2)[c,d]
( ¯ ξ(1)[a] + ¯ ξ(1)[b])( ¯ ξ(2)[c] + ¯ ξ(2)[d])
The ¯ ξ are sums over the ξ-variables for dyadic intervals that
cover the corresponding interval or endpoint as defined in
the previous section. Intuitively, XII keeps track of the
whole rectangles, XIE and XEI of their horizontal and vertical
edges, respectively, and XEE of the rectangles’ corner
points. The standard atomic sketches VII, VIE, VEI, and VEE
are defined similarly.
Similar to Equation 5, we can rewrite XII as
XII =
�
fII(δ(1), δ(2))ξ(1)δ(1)ξ(2)δ(2) (6)
δ(1)×δ(2)∈D 2
Here variable fII(δ(1), δ(2)) denotes the number of rectangles
in R whose dyadic cover contains rectangle δ(1) × δ(2),
defined by dyadic intervals δ(1) and δ(2) in dimension 1 and
2, respectively. We can similarly rewrite the other atomic
sketches. As in Section 3.1 we define SJ(X) = E[X 2 ], hence
we obtain in a similar manner
SJ(XII) =
�
f 2 II(δ(1), δ(2))
δ(1)×δ(2)∈D 2
The self-join size for the other atomic sketches is similar.
The generalization to higher dimensionality follows the
same pattern. Let IE d = {I,E} d be the set of all strings
of length d over letters I and E. For w ∈ IE d the term w[i]
refers to the i-th letter in w. For a hyper-rectangle r let r(i)
be its range in dimension i. With N d we denote the data
space. The atomic sketches for R are then defined for all
w ∈ IE d as:
Xw =
�
r(1)×r(2)×···×r(d)∈R
ξ(1)r(1)ξ(2)r(2) · · · ξ(d)r(d)
such that for each dimension i, ξ(i)r(i) = ¯ ξ(i)[l(r(i)),u(r(i))]
if w[i] = I, and ξ(i)r(i) = ¯ ξ(i)[l(r(i))] + ¯ ξ(i)[u(r(i))] otherwise
(i.e., if w[i] = E). As before, each family ξ(i) for dimension
i is a family of four-wise independent {−1, 1} random variables.
For any i �= j the variables in ξ(i) are independent
from the variables in ξ(j).
4. SPATIAL JOINS
4.1 Spatial Join of Intervals
We describe a counting procedure that will be used to
compute the cardinality of the join of two sets of intervals.
Since intervals only overlap if their intersection is a
non-empty one-dimensional object (see Definition 1), we can
safely assume that the data sets do not contain any degenerate
objects, in this case point objects, since these would
not contribute to the join result anyway. We also assume
that the data domain is finite. We will discuss in Section 5.1
how to handle real valued coordinates.
r s
s
(1) disjunct (4) contain
r
(2) meet
s
r
r
s
s
(3) overlap (6) identical
r
r r
s s
(5) contain, meet
Figure 3: Spatial relationships between intervals
4.1.1 Spatial Relationships
Figure 3 shows all cases of spatial relationships between
an interval in R and a potential join partner in S (cases
which can be obtained by swapping r and s are omitted for
simplicity). According to our definition, in cases (1) and (2)
the intervals do not overlap, while intervals in cases (3)-(6)
are overlapping. If two intervals r ∈ R and s ∈ S fall into
case (i), we say that they have the spatial relationship (i).
4.1.2 Counting Intersections
Finding an appropriate sketch that correctly counts all
overlaps, but does not count cases (1) and (2) is a non-trivial
problem. None of the previously proposed approaches (cf.
Section 8) can be directly applied to compute a function
like “add 1 for each pair of intervals (r,s) for which one
endpoint of r is between the two endpoints of s”. In order
to use sketches, we need a different way of counting interval
intersections. As before, let l(r) and u(r) be the lower and
upper endpoints of interval r, respectively (similar for s).
Intuitively, for each pair of intervals r ∈ R and s ∈ S
we want to add 0 to the join result size if they have spatial
relationship (1) or (2), and 1 otherwise. We obtain a first
approximation by counting for each pair (r, s) how many
of their endpoints are contained in the other interval. For
instance, in case (3) r.u is covered by s, and s.l is covered
by r, hence the count is 2. Overall we obtain counts 0, 2, 2,
2, 3, and 4 for cases (1) to (6), respectively. If we divide this
result by 2, we obtain counts 0, 1, 1, 1, 1.5, 2. However, for
correct join selectivity estimation the counts should be 0, 0,
1, 1, 1, 1 (add 1 only for cases (3) to (6)).
Notice that the “problem cases” with incorrect counts are
only those cases where r and s have endpoints in common.
Hence for the remainder of this section we will assume the
following.
Assumption 1. None of the intervals in R has an endpoint
in common with any of the intervals in S.
With the assumption it is easy to see that cases (2), (5), and
(6) are eliminated, and since our technique counts correctly
for cases (1), (3), and (4), we have a simple method for
calculating the join size. We will show in Section 5.2 how to
generalize our algorithm if the assumption does not hold.
4.1.3 An Atomic Sketch for Intervals
The simple counting procedure can be directly implemented
with interval and endpoint sketches as introduced
in Section 3.1. Here we use the dyadic sketches, i.e., we
construct atomic sketches XI and XE for R, and the corresponding
sketches YI and YE for S.
We want to count how many times an endpoint of an interval
in S is contained in an interval in R, and vice versa.
To do this we just multiply the corresponding sketches, i.e.,
we define a random variable Z = (XIYE + XEYI)/2. This
random variable is an unbiased estimator for the join cardinality.
Lemma 5. Random variable Z has the expected value
E[Z] = |R ⊲⊳o S|.
Proof. See [13].
For our example in Figure 2 we have XI = ξ2 + ξ6, XE =
2ξ1 + ξ2 + ξ3 + ξ4 + ξ6, YI = ξ3 + ξ5, and YE = 2ξ1 + ξ2 +
ξ3 + ξ5 + ξ7. For Z we therefore obtain
Z = ((ξ2 + ξ6)(2ξ1 + ξ2 + ξ3 + ξ5 + ξ7)
+(2ξ1 + ξ2 + ξ3 + ξ4 + ξ6)(ξ3 + ξ5))/2 .
Multiplying these terms results in a formula that is the summation
of terms of the form ξi·ξj. Recall that the ξ-variables
are four-wise (and hence also pairwise) independent, therefore
it holds that E[ξiξj] = E[ξi]E[ξj] = 0 if i �= j. Furthermore,
because each ξ is either 1 or -1, it holds that E[ξ 2 i ] = 1.
Thus from the above formula we obtain
E[Z] = (E[ξ 2 2] + E[ξ 2 3])/2 = 1 ,
which, as expected, is the correct value for the number of
intersecting intervals.
4.1.4 Variance Analysis
Using standard transformations we obtain:
Var[Z] = Var[(XIYE + XEYI)/2]
= 1/4(Var[XIYE] + Var[XEYI]
+2Cov[XIYE, XEYI])
Since in general, for any random variables X and Y ,
|Cov[X, Y ]| ≤ � Var[X] � Var[Y ]:
Var[Z] ≤ 1/4(Var[XIYE] + Var[XEYI]
+2 � Var[XIYE] � Var[XEYI])
= 1/4( � Var[XIYE] + � Var[XEYI]) 2
To analyze the terms of Equation 7 we use the definition of
the atomic sketches in the form of Equation 5. For this type
of sketches (tug-of-war sketch) it can be shown that their
variance is bounded by twice the product of their self-join
size [3]:
Var[XIYE] ≤ 2SJ(XI)SJ(YE) and
Var[XEYI] ≤ 2SJ(XE)SJ(YI) .
Together with Equation 7 we have
Var[Z] ≤ 1/2( � SJ(XI)SJ(YE) + � SJ(XE)SJ(YI)) 2 .
Since XI and XE together account for all dyadic intervals
that cover the intervals of R and their endpoints (similarly
the Y sketches for S), we define
SJ(R) = SJ(XI) + SJ(XE) and SJ(S) = SJ(YI) + SJ(YE) .
Using the Cauchy-Schwarz inequality it holds that
( � SJ(XI) � SJ(YE) + � SJ(XE) � SJ(YI)) 2 ≤ (SJ(XI) +
SJ(XE))(SJ(YE) + SJ(YI)), hence:
(7)
Var[Z] ≤ 1/2 SJ(R)SJ(S) . (8)
4.1.5 The Overall Technique
Having constructed the appropriate random variable Z
and computed an upper bound on its variance, we can boost
its accuracy as described in Section 2.3. The following theorem
summarizes the result.
Theorem 1. Let R = {r1, r2, . . . , r|R|} and S =
{s1, s2, . . . , s|S|} be two sets of intervals that satisfy Assumption
1. Furthermore let X (i) �
I = [a,b]∈R ¯ ξ (i)
, X(i)
[a,b] E =
�
(i)
), Y I = �
[c,d]∈S ¯ ξ (i)
(i)
[c,d] , and Y E =
�
[c,d]∈S (¯ ξ (i)
), 1 ≤ i ≤ 8SJ(R)SJ(S) , be atomic
[a,b]∈R (¯ ξ (i)
[a] + ¯ ξ (i)
[b]
[c] + ¯ ξ (i)
[d] ε2E[Z] 2 lg 1
φ
sketches such that Zi = (X (i) (i)
(i)
I Y E + X(i)
E Y I )/2. The ¯ ξvariables
are defined as in Equation 3 over ξ (i) -families of
four-wise independent {−1, 1} random variables, which are
generated from seeds s (i) of size O(log2 n) bits and where
each s (i) is independently chosen. By computing the median
of 2 lg(1/φ) averages over groups of 4 SJ(R)SJ(S)
ε2E[Z] 2 estimators
Zi we obtain an estimate ¯ Z that with probability
1 − φ is within ε relative error of the true expected value
E[Z] = |R ⊲⊳o S|.
Notice that the atomic sketch Z for estimating the join
size only stores five values: a seed of size O(log n) bits for
generating the ξ-variables on-the-fly (during updates), and
four counters for keeping track of the current value of XI,
XE, YI, and YE. As mentioned before, when an interval is
inserted into R (deleted from R), we simply generate the
corresponding ξ-variables for its interval and endpoint cover
from the seed and add (subtract) them from XI and XE,
respectively (similarly for S).
The cost of generating a ξ-variable from the seed is linear
in the seed size, and each interval or endpoint cover consists
of O(log n) dyadic intervals. Hence the total update
cost for a single instance of Z is O(log 2 n). To compute
Z’s estimate we simply combine the counters, resulting in
a constant overhead. Since we use 8 SJ(R)SJ(S)
ε 2 E[Z] 2 lg 1
φ independent
instances of Z, the total query, update, and stor-
age costs of our interval join sketch are O( SJ(R)SJ(S)
ε 2 E[Z] 2 log 1
φ ),
O( SJ(R)SJ(S)
ε 2 E[Z] 2 log 1
φ log2 n), and O( SJ(R)SJ(S)
ε 2 E[Z] 2 log 1
φ
spectively.
log n), re-
4.2 Spatial Join of Rectangles
The spatial relationships between intervals (cf. Section
4.1.1) generalize naturally to higher dimensionality by
examining the one-dimensional axis-parallel projections of
the hyper-rectangles (these projections are intervals). Figure
4 shows selected examples. 2 The spatial relationship of
two hyper-rectangles r and s is a d-tuple (i1, . . . , id) where
ij is the spatial relationship of the projection in dimension
j, as per Figure 3. Hence r and s overlap iff ij ∈ {3, 4,5, 6}
for all dimensions j.
4.2.1 Rectangle Sketches
The simple counting procedure for interval overlaps generalizes
to two-dimensional rectangles as follows. Let r and
s be rectangles from R and S, respectively. The counting
procedure adds the following values: number of corners of
2 The spatial relationships are slightly different from the ones
used in [24, 25] since we selected them to correspond to the
sketches we use.
(3, 3)
overlap
(3, 4)
overlap
(2, 3)
no overlap
(4, 5)
overlap
Figure 4: Spatial relationships between rectangles
r which are covered by s, number of horizontal edges of r
which intersect vertical edges of s, number of vertical edges
of r which intersect horizontal edges of s, and number of
corners of s which are covered by r. The reader might easily
convince herself that under the assumption that r and
s have no common corner coordinates in any dimension, we
will always obtain a count of 4 for intersecting rectangles
(and 0 otherwise).
As it turns out, we can combine the two-dimensional
sketches (see Section 3.2) to obtain the correct count; using
atomic sketches XII, XIE, XEI, and XEE for R, and the
analogously defined YII, YIE, YEI, and YEE for S:
Lemma 6. Let Z be a random variable such that Z =
(XIIYEE + XIEYEI + XEIYIE + XEEYII)/4; and R and S
be two sets of rectangles that satisfy Assumption 1 in
each dimension. Then E[Z] = |R ⊲⊳o S| and Var[Z] ≤
1/16(8SJ(R)SJ(S)) = 1/2 SJ(R)SJ(S).
Proof. It is easy to show that E[Z] = |R ⊲⊳o S|,
with the analysis being similar to the 1-dimensional case.
The variance analysis, on the other hand, differs from
the 1-dimensional case mainly because the set of random
variables {ξ(1)δ(1)ξ(2)δ(2) : δ(1) × δ(2) ∈ D 2 }
does not have the 4-wise independence property. This
might come as a surprise since the set of ξ(1) and the
set of ξ(2)-variables are both 4-wise independent, and
the variables in one set are independent of those in the
other. However, notice that for instance for the 4-tuple
{x1 = ξ(1)δ(1)1ξ(2)δ(2)1 , x2 = ξ(1)δ(1)1ξ(2)δ(2)2 , x3 =
ξ(1)δ(1)2ξ(2)δ(2)1 , x4 = ξ(1)δ(1)2ξ(2)δ(2)2 } it is easy to show
that E[x1x2x3x4] = 1 �= 0 = E[x1]E[x2]E[x3]E[x4]. Hence
we cannot directly use variance bounds from earlier results.
We can obtain a bound on the variance of Z as follows.
In the following, for w ∈ IE 2 let ¯w be the string obtained
from w by replacing I with E and vice versa. Then
Var[Z] = 1 �
(
16
w∈IE2 Var[XwY ¯w]
+ �
w1�=w2
≤ 1 �
(
16
w∈IE2 Cov[Xw1Yw1, ¯ Xw2Y ¯w2])
� Var[XwY ¯w]) 2
The inequality follows since, for any random variables X
and Y , |Cov[X, Y ]| ≤ � Var[X] � Var[Y ].
With some involved analysis we can show that for any
w ∈ IE 2 , Var[XwY ¯w] ≤ 8SJ(Xw)SJ(Y ¯w). With the above
inequality we have:
Var[Z] ≤ 1/16 · 8 · ( � �
2
SJ(Xw)SJ(Y ¯w))
w∈IE 2
Using Cauchy-Schwarz and the fact that SJ(R) =
�
w∈IE 2 SJ(Xw) and SJ(S) = �
w∈IE 2 SJ(Yw) yields:
For details see [13].
Var[Z] ≤ 1/16 (8SJ(R)SJ(S)) .
Here SJ(R) = SJ(XII) + SJ(XIE) + SJ(XEI) + SJ(XEE),
similarly for SJ(S).
4.2.2 The Overall Technique
As for interval sketches, we boost the accuracy of the
atomic rectangle sketch as described in Section 2.3 to obtain
accurate estimates of |R ⊲⊳o S| with quality guarantees.
Theorem 2. Let R = {r1, r2, . . . , r|R|} and
S = {s1, s2, . . . , s|S|} be two sets of rectangles whose
endpoints have no coordinates in common in any dimension.
Furthermore let XII = �
[a,b]×[c,d]∈R ¯ ξ(1)[a,b] ¯ ξ(2)[c,d],
�
XIE =
[a,b]×[c,d]∈R ¯ ξ(1)[a,b]( ¯ ξ(2)[c] + ¯ ξ(2)[d]),
�
XEI =
[a,b]×[c,d]∈R (¯ ξ(1)[a] + ¯ ξ(1)[b]) ¯ ξ(2)[c,d],
XEE = �
[a,b]×[c,d]∈R (¯ ξ(1)[a] + ¯ ξ(1)[b])( ¯ ξ(2)[c] +
¯ξ(2)[d]),
�
YII =
[a,b]×[c,d]∈S ¯ ξ(1)[a,b] ¯ ξ(2)[c,d],
�
YIE =
[a,b]×[c,d]∈S ¯ ξ(1)[a,b]( ¯ ξ(2)[c] + ¯ ξ(2)[d]),
YEI = �
[a,b]×[c,d]∈S (¯ ξ(1)[a] + ¯ ξ(1)[b]) ¯ ξ(2)[c,d], and
YEE = �
[a,b]×[c,d]∈S (¯ ξ(1)[a] + ¯ ξ(1)[b])( ¯ ξ(2)[c] + ¯ ξ(2)[d]),
be atomic sketches such that Z = (XIIYEE + XIEYEI +
XEIYIE + XEEYII)/4. Using 8 SJ(R)SJ(S)
ε 2 E[Z] 2 lg 1
φ independent
instances of Z we obtain an estimate ¯ Z that with probability
1 − φ is within ε relative error of the true expected value
E[Z] = |R ⊲⊳o S|.
Note that the number of instances of Z for the one- and
two-dimensional case incidentally is the same (cf. Theorem
1). However, the actual storage cost can be quite different.
First, the number of atomic sketches per instance of Z
has doubled for d = 2. Second, the self-join size for R and S
will be larger than for the one-dimensional case. Since update
and query cost depend approximately linearly on the
storage, those costs will be larger as well. In Section 6.1
we will show that our technique suffers from the “curse of
dimensionality”, like any other estimation or indexing technique.
5. GENERALIZATION
In this section we discuss how to remove the restricting
assumptions we made for our technique.
5.1 Real-Valued Data Domains
Note that for our techniques to work, domain N should be
finite. This is essential because a seed of size 2k+1 limits us
to a domain of size 2 k . Theoretically this prevents us from
using our technique for real-valued domains.
However, in practice our technique will work just fine.
There is no spatial application we know of that uses coordinates
of unbounded precision. Typically real-valued coordinates
are stored as 32 or 64 bit size floating point numbers—
clearly a finite domain.
Our sketch-based estimation technique can handle such
domains very well. Recall that the storage requirement of
an atomic sketch is logarithmic in the domain size. Hence
our technique scales very well with increasing domain size.
In contrast, histograms traditionally suffer from the problem
that large data domains result in large buckets and hence
poor approximation. Histogram-based techniques therefore
typically quantize the data space and make assumptions
about the distribution within a bucket in order to obtain
good estimates [23, 26].
5.2 Spatial Join with Common Endpoints
The counting algorithms used so far relied on Assumption
1 for correctness. We can make this assumption hold
for any data set as follows. Recall that the interval endpoints
are drawn from domain N = {0, 1, . . . , n −1}. First, we create
a new domain M which contains all values from N and
in addition for each pair of consecutive values i, i + 1 ∈ N
two values i+ and (i + 1)− with i < i+ < (i + 1)− < (i + 1).
Then we replace each interval s ∈ S by a new interval s ′
such that for l(s) = i, we set l(s ′ ) = i+ and for u(s) = j we
set u(s ′ ) = j−, i.e., we shrink the interval “a little”. For
instance if all interval endpoints are integer coordinates,
we could augment the domain by i − 0.1 and i + 0.1 for
each integer in the domain and thus shrink the S-intervals
by 0.1 at each end. The spatial join size is not affected
by this transformation: For each case we can easily verify
overlap(r,s) ⇔ overlap(r, s ′ ). All this transformation does
is increase the dimension’s domain size by at most a factor
of 3. The asymptotic cost of our technique is not affected
by this transformation (cf. Section 4.1.5 where n now becomes
3n). We can apply the same transformation to each
dimension of the data space.
Instead of using the endpoint transformation, we can alternatively
adapt our original simple counting procedure to
explicitly keep track of common endpoints [13].
6. EXTENSIONS
6.1 Join of Hyper-Rectangles
The one-and two-dimensional spatial join estimators generalize
naturally to higher dimensionality d as the following
theorem shows.
Theorem 3. For d-dimensional data sets R and S let
{Xw} w∈IEd and {Yw} w∈IEd be two families of atomic
sketches over the same ξ(i) families of four-wise independent
random variables, such that families ξ(i) and ξ(j),
1 ≤ i, j ≤ d, for i �= j are independent of each other. Set
Z = 2
−d �
w∈IE d XwY ¯w. Then it holds that E[Z] = |R ⊲⊳o
S| and Var[Z] ≤ 3d −1
4 d SJ(R)SJ(S).
Proof. The proof is fairly involved and can be found
in [13].
The atomic sketches are as defined in Section 3.2. As before,
for w ∈ IE d we define ¯w as the string that is obtained from
w by replacing I with E and vice versa (the “complement”
of w).
At a first glance the variance might appear to be decreasing
with increasing dimensionality d due to the (3/4) d factor.
Unfortunately this is not the case. The self-join terms for
R and S each have 2 d contributing sums. Also note that to
compute Z we have to maintain 2 d atomic sketches for each
stream.
6.2 Other Join Conditions
We can extend our algorithm to estimate the cardinality
for a slightly different notion of overlap, where case (2) (cf.
Figure 3) is also counted as overlap. We can also support
other join predicates like containment joins by maintaining
the appropriate spatial sketches. Due to space constraints
we defer the details to [13].
6.3 ε-Joins
For simplicity we will first discuss ε-joins for twodimensional
data. Let A and B be two-dimensional sets
of data points over space N 2 . In the following we assume
that the L∞ distance is used. We can estimate
the ε-join cardinality based on the following observation.
Let B ′ be a data set obtained from B by replacing each
point b ∈ B with the spatial object b ′ which is defined as
b ′ = {x|x ∈ N 2 ∧dist∞(x, b) ≤ ε}. Notice that for the L∞
distance this object b ′ is a square of sidelength 2ε with b as
its center.
According to definition, it holds that dist∞(a, b) ≤ ε ⇔
a is contained in b ′ . Hence we can compute the cardinality
of the ε-join by counting how many points of A are contained
in the squares of B ′ . This turns out to be a special
case of the rectangle join where endpoints might have equal
coordinates. Since the objects in A are points, we do not
need most of the two-dimensional atomic sketches, except
for:
XEE = �
YII =
(a1,a2)∈A
�
[e,f]×[g,h]∈B ′
¯ξ(1)[a1] ¯ ξ(2)[a2]
¯ξ(1)[e,f] ¯ ξ(2)[g,h] ,
where the ¯ ξ-variables are as defined in Equation 3.
Lemma 7. For random variable Z = XEEYII it holds that
E[Z] = |A ⊲⊳ε B| and Var[Z] ≤ 8SJ(XEE)SJ(YII).
The generalization to any dimensionality follows the same
pattern as for the spatial join. The following lemma summarizes
the result:
Lemma 8. For d-dimensional point sets A and B
let B ′ = {b ′ |b ′ = {x|x ∈ N d ∧ dist∞(x,b) ≤ ε}}
be the set of hyper-cubes of sidelength 2ε around
the points of B. For atomic sketches XE �
=
(a1,a2,...,ad)∈A ¯ ξ(1)[a1] ¯ ξ(2)[a2] · · · ¯ ξ(d)[ad] and YI =
�
l1×l2×···×ld∈B ′ ¯ ξ(1)[l(l1),u(l1)] ¯ ξ(2)[l(l2),u(l2)] · · · ¯ ξ(d)[l(ld),u(ld)] let Z = XEYI. Then E[Z] = |A ⊲⊳ε B| and
Var[Z] ≤ (3 d − 1)SJ(XE)SJ(YI).
Here SJ(XE) = �
δ(1)×δ(2)×···×δ(d)∈Dd f 2 (δ(1), δ(2), . . . , δ(d))
and SJ(YI) = �
δ(1)×δ(2)×···×δ(d)∈Dd g 2 (δ(1), δ(2), . . . , δ(d))
where f(δ(1), δ(2), . . . , δ(d)) is the number of points in A
which are covered by the hyper-rectangle spanned by the
dyadic intervals δ(i). Similarly, g(δ(1), δ(2), . . . , δ(d)) is the
number of hyper-cubes in B ′ whose dyadic cover contains
δ(1) × δ(2) × · · · × δ(d).
Notice that the constant factor of the variance bound
seems surprisingly high, compared to the bound for the more
general spatial join (see Theorem 3). However, here the selfjoin
sizes will be much lower (1 term versus 2 d terms added
for the spatial join), and the number of atomic sketches is
much lower (2 per instance of Z versus 2 · 2 d for the spatial
join).
6.4 Range Queries
The range query is a special case of the spatial join,
where the second “data set” consists only of a single hyperrectangle—the
query (see Definition 3). Hence we could
readily apply the spatial join estimation technique. The following
optimization can further improve performance. We
illustrate it for the case of R being a one-dimensional set of
intervals.
Assume we want to estimate how many intervals are selected
by range query q = [u, v]. The reader might convince
herself that an interval [a, b] ∈ R is selected iff either its
upper endpoint lies in [u, v] or if v lies in [a, b]. Notice that
both conditions are mutually exclusive and exhaust all possible
cases of intervals being selected. To implement this
counting procedure we only need two atomic sketches—one
that keeps track of the intervals as a whole (XI), and one
for their upper endpoints (XU):
XI = �
¯ξ[a,b]; XU = �
[a,b]∈R
[a,b]∈R
¯ξ[b] .
Lemma 9. For interval set R let XI and XU be atomic
sketches as defined above. For range query q = [u, v] let Z =
¯ξ[u,v]XU + ¯ ξ[v]XI. Then E[Z] = |Q([u, v], R)| and Var[Z] ≤
2 ·(3 log 2 n+1) ·SJ(R) where n is the size of the domain N.
The logarithmic factor is caused by the fact that the dyadic
cover of [u, v] can consist of up to 2log 2 n intervals and the
dyadic point cover of v contains log 2 n + 1 intervals. Compared
to the result for the spatial join (see Section 4.1.4)
the self-join size here does not depend on the lower interval
endpoints and hence will be smaller. The generalization to
d dimensions follows the same pattern as for the spatial join
(just replace XE with XU).
6.5 Taking Data Properties into Account
In Section 3.1 we presented two different atomic sketch approaches
for summarizing intervals: standard sketches and
dyadic sketches. For each interval [a, b] ∈ R, the standard
sketch adds the ξ-variables for all points in [a, b] to the interval
sketch VI, a cost of O(n). If the domain is large and
the data set contains many long intervals, then the dyadic
atomic sketches are clearly preferable since they guarantee
O(log n) update cost. However, if most intervals are very
short, then the standard sketch might deliver better estimates
for the same storage cost. Recall that the dyadic
endpoint sketch keeps track of all dyadic intervals covering
an endpoint, hence it will add the ξ-variable for the dyadic
interval that covers the whole domain on each interval insertion.
For sets with mostly short intervals this might lead to
a higher self-join size than if standard atomic sketches were
used.
We therefore propose an adaptive approach. Based on
statistics about the interval length distribution, the algorithm
determines the maximum level, maxLevel. When computing
the cover of an interval or endpoint it only uses dyadic
intervals from levels up to maxLevel. Note that the input
stream can have intervals of length greater than 2 maxLevel .
The lower maxLevel, the lower the self-join size for XE. On
the other hand, long intervals will require more dyadic intervals
than before (since the longer dyadic intervals on the
upper levels above maxLevel can not be used any more).
Notice that the more small intervals the data set contains,
the lower maxLevel, in the extreme the dyadic sketch turns
into the standard sketch for maxLevel = 0.
7. EXPERIMENTS
We evaluate the performance of our spatial join size estimation
techniques (henceforth referred to as SKETCH) on
both synthetic and real life 1D and 2D spatial datasets. We
compare our algorithms with recently proposed techniques
based on generalized Euler Histograms [26] (henceforth referred
to as EH) and with the Geometric Histograms technique
[5] (henceforth referred to as GH). These are currently
the best known techniques for spatial join selectivity estimation.
3
The EH approach partitions the data space using a grid of
a given level L. A grid of level L partitions each dimension
into 2 L equi-width cells. An Euler histogram allocates buckets
not only for grid cells, but also for grid edges and grid
vertices. Besides storing object counts in a cell, the generalized
Euler histogram [26] also stores information such as the
average height, width and area of the intersection regions
between the objects and the cell. In terms of storage space,
a generalized Euler histogram of level L uses 9·2 2L −6·2 L +1
units (words) of memory.
Geometric Histograms also partition the data space using
a grid of given level L, similar to the generalized Euler
Histograms. The information stored in each cell is the total
number of corner points, the sum of the areas of the objects,
the sum of the lengths of the vertical edges and the sum of
the lengths of the horizontal edges of objects intersecting
the cell. Thus a Geometric Histogram of level L uses 4 L+1
units of memory.
In the following, all references to memory allocation refer
to the memory allocated per dataset to the SKETCH, GH
or EH techniques.
7.1 Effect of Input Size and Skew
None of the existing techniques provides guaranteed error
bounds (probabilistic or otherwise) for spatial join size
estimation. Hence, for comparison purposes, we provide
SKETCH, GH and EH with the same space and compare
the actual relative errors on their respective estimates as
the size of the spatial data set increases, for datasets with
different degrees of skew.
We use synthetic two-dimensional datasets, with intervals
along each dimension i generated independently according
to a Zipfian distribution with Zipf parameter zi. The average
length of an object along a dimension is O( √ di) where
di is the size of the domain along the i th dimension. We
used generalized Euler histograms with grid level L = 6,
which corresponds to about 36K units of memory. The same
space was allocated to SKETCH and GH, and the actual relative
errors are shown in Figures 5 and 6. Since our sketch
techniques are based on randomization, the relative errors
reported are averages over multiple independent runs.
Figure 5 plots the relative error as the size of the dataset
increases from 30K to 0.5 million. Both the joining datasets
had the same size, as specified on the x-axis. The datasets
used here have rectangles generated on the two streams with
Zipf parameter z = 0 (uniform). As can be seen from the
figure, the error for SKETCH (and GH) remains fairly stable
as the input size increases, if the distribution of the datasets
remains the same. For uniform data (z = 0), the SKETCH
and GH techniques perform similarly, with an average rel-
3 The authors would like to thank Chengyu Sun for providing
the original EH code and the data sets used in [26].
Relative Error
0.6
0.5
0.4
0.3
0.2
0.1
Relative Error Vs Dataset size [ zipf=0 (uniform)]
SK
EH
GH
0
0 100 200 300 400 500
Dataset Size (K)
Figure 5: Zipf = 0
ative error which is much lower than the error of the EH
technique. Similar results were observed on other datasets
with low skew in the distribution of the spatial objects over
the data space. Note that EH is more general than GH, and
hence its performance is more sensitive to data distribution
and grid level, which explains the poor performance for the
given data set.
We would like to point out that for a given grid level,
the performance of both EH and GH depends heavily on
the domain size, since that determines the granularity of
the grid. Thus, for example, if the domain size is doubled
along each dimension, the relative errors for both EH and
GH increase considerably, even if the input is not changed!
For the SKETCH technique, on the other hand, if the maximum
dyadic level is not changed (see Section 6.5), the relative
error remains the same for the same input, even if the
underlying domain is ‘conceptually’ (i.e. without changing
the input datasets) doubled.
Figure 6 shows the relative error for datasets with a fairly
high degree of skew in the distributions of the spatial objects.
For these datasets, the projection of the objects
along each dimension is distributed with Zipf parameter
z = 1. Here the performance of EH is comparable to GH
and SKETCH, with SKETCH faring marginally better than
the other two techniques. For other data sets we observed
the general trend that SKETCH does comparatively better
than the grid-based histogram techniques for skewed input.
7.2 Error Guarantees and Space Requirements
Our next set of experiments shows how the actual relative
error and space requirement for our technique varies for a
given guaranteed relative error bound for spatial datasets of
different sizes. In Figures 8 and 7 we consider interval joins
of two datasets with intervals uniformly distributed over domains
of sizes ranging from 16384 to 65536. Figure 8 shows
the space requirement and Figure 7 shows the actual relative
error for a guaranteed relative error bound of 0.3 at the 99%
confidence level. As can be seen from the figure, the space
required remains almost constant at around 63K as the size
of the dataset increases. The reason is that the distribution
of the objects does not change significantly. Note that for a
d-dimensional dataset of size N, the total space required to
store the dataset completely is 2d · N. Thus the size of the
SKETCH summary structure as a fraction of dataset size
varies from around 60% for smaller datasets, to about 6%
for larger datasets. As can be seen from Figure 7, the actual
relative error is much smaller than the guaranteed error
bound.
Relative Error
0.6
0.5
0.4
0.3
0.2
0.1
Relative Error Vs Dataset size [zipf skew=1]
SK
EH
GH
0
0 100 200 300 400 500
Dataset Size (K)
Figure 6: Zipf = 1
Sketch size (1000’s words)
70
69
68
67
66
65
64
63
Relative Error
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
Actual Relative Error Vs Dataset size for epsilon=0.3 phi=0.01 (1D)
True Error
Guaranteed Error Bound
0
0 100 200 300 400 500
Dataset Size (K)
Figure 7: Relative error guarantee
Space Requirement Vs Dataset size for epsilon=0.3 phi=0.01 (1D)
SKETCH
62
0 100 200 300 400 500
Dataset Size (1000’s objects) [1 object = 2 words]
Figure 8: Space requirement
7.3 Real Life Datasets
In this section, we compare the performance of EH, GH
and SKETCH on the three real life 2-dimensional spatial
datasets used in [26]. The descriptions of the datasets are
as follows:
• LANDO: This dataset contains land cover ownership
and management information for the state of Wyoming
at a 1 : 10 6 scale. Number of objects = 33860.
• LANDC: This dataset contains land cover information
such as vegetation types for the state of Wyoming
at a 1 : 10 6 scale. Number of objects = 14731.
• SOIL: This dataset represents soils of Wyoming at a
1 : 10 6 scale. Number of objects = 29662.
We consider all the 3 combinations of joins on these spatial
datasets. Figures 9, 10 and 11 show the relative errors obtained
on LANDO ⊲⊳o LANDC, LANDC ⊲⊳o SOIL and
LANDO ⊲⊳o SOIL respectively, as the allocated space is
varied.
The three graphs are similar in nature. The most surprising
result is the unpredictability of the EH estimates:
Whereas the estimates provided by SKETCH improve as the
space allocated to it is increased, histogram-based methods
(with no quality guarantees) such as EH might sometimes
produce worse estimates. Similar to the results in [26], EH
provides good estimates with small memory allocated to it,
but the relative error increases rapidly with finer grid partitioning.
This behavior is related to the probabilistic model
used for estimation by EH — it might produce small perbucket
errors which add up if the actual data distribution
differs from the model assumption. The GH technique is
more robust than EH in this regard since it uses a simpler
model. However, it fares well only for higher amounts
Relative Error
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
Relative Error for the LANDC+LANDO spatial join
SKETCH
EH
GH
0
0 5 10 15 20 25 30 35 40
Space Allocated (K words)
Figure 9: LANDC ⊲⊳o LANDO
of memory, and is mostly outperformed by SKETCH (by
a small margin). The relative error of SKETCH shows a
steady decline with increasing space in all the three graphs.
This is expected, since SKETCH is an unbiased estimator
and therefore the more independent instances of this estimator
are used, the better the expected results.
7.4 Discussion of Experimental Results
We presented a comparison of SKETCH, GH and EH on
synthetic and real life datasets. Overall, the performance of
GH and SKETCH is comparable, with SKETCH performing
slightly better. In general the experiments highlight the
practicality of our randomized estimation technique. It not
only provides quality guarantees, but also has predictable
behavior: The more space we allocate, the better the estimate.
The histogram techniques might sometimes produce
better estimates, sometimes they do much worse. Without
knowing the data distribution in advance, their behavior and
best storage allocation cannot be determined. For instance,
the EH technique does very well for small space, but produces
high errors when the number of buckets is increased.
Since in practice robustness of an estimation technique is
an important factor, we believe that SKETCH in general
is preferable over techniques whose behavior cannot be predicted
a priori.
There are essentially only two parameters that effect
SKETCH’s performance: the self-join sizes of the data sets,
and the result size. As long as the self-join sizes are not too
large compared to the result size, SKETCH provides good
estimates and its performance is independent of the actual
object distribution (skew) in the joining spatial datasets.
This dependence on result size is a characteristic of all probabilistic
estimation techniques with provable error bounds.
8. RELATED WORK
The best known general techniques for selectivity estimation
for joins involving spatial objects are based on sampling
or on histograms. An et al. [5] examine different sampling
techniques and propose new histogram-based approaches.
Their results indicate that for achieving a comparable accuracy,
histograms will require less storage and estimation
time than sampling. Mamoulis and Papadias [23] take a
more general approach of analytically estimating the selectivity
of complex spatial queries, i.e., queries that combine
selection and join operators. Their formulas are based on
uniformity assumptions. A skewed data set is partitioned
into a regular grid of cells, and the formulas are applied for
each cell. Sun et al. [26] improve on these results with a
Relative Error
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
Relative Error for the LANDC+SOIL spatial join
SKETCH
EH
GH
0
0 5 10 15 20 25 30 35 40
Space Allocated (K words)
Figure 10: LANDC ⊲⊳o SOIL
Relative Error
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
Relative Error for the LANDO+SOIL spatial join
SKETCH
EH
GH
0
0 5 10 15 20 25 30 35 40
Space Allocated (K words)
Figure 11: LANDO ⊲⊳o SOIL
framework based on Euler histograms [25]. They also use
a regular grid partitioning of the space, but the Euler histograms
together with adaptively selected per-cell estimation
techniques provide more accurate estimates for spatial
joins with geometric selections.
Faloutsos et al. [15] and Belussi and Faloutsos [8] propose
parametric methods for estimating the selectivity of ε-(self-
) joins of point-sets. For self-similar data sets they achieve
good approximations using power laws and fractal dimensionality.
The techniques of Acharya et al. [2] estimate the
selectivity for point and range queries over two-dimensional
rectangular data. The basic idea is to partition the data
space into buckets and to use models based on uniformity
assumptions per bucket. Closely related to selectivity estimation
are approaches for estimating the cost (in terms of
CPU or I/O) of specific operator implementations. Examples
are cost models for range queries and index-supported
joins for spatial data [9, 20, 21, 22, 29, 30].
None of the previous approaches provides any result quality
guarantees. Samples are difficult to maintain in the presence
of updates, especially deletes which could remove objects
from the sample. Similarly, there is no efficient algorithm
for maintaining fractal dimensionality and power
law parameters in the presence of updates. The histogram
techniques, due to the regular grid partitioning, are easy to
maintain dynamically, however, their utility is limited when
the data is skewed. Other histograms that are more appropriate
for skewed data cannot be maintained efficiently and
introduce an additional error when the buckets of the joined
data sets do not align.
Our techniques are based on AMS sketches [4, 3] which
are dynamically maintainable and provide approximation
quality guarantees. Different types of sketches so far have
been used for efficiently maintaining aggregates over data
streams [6]. Example applications are complex aggregates
[14], quantiles [17], frequent items [10, 11], multidimensional
histograms [28], and graph statistics [7]. An upcoming
paper by Tao et al. proposes to use sketches to
improve the result quality for aggregate queries over spatiotemporal
point sets [27].
9. CONCLUSION AND FUTURE WORK
In this paper, we proposed a new framework for using spatial
sketch techniques for approximately answering spatial
queries. To the best of our knowledge, our technique is the
first method for spatial data that gives probabilistic quality
guarantees. It permits incremental construction under
insertion and deletion of records while at the same time be-
ing effective for skewed distributions and applications where
deletions of objects are frequent. Our experimental evaluation
shows that our techniques are competitive compared
to the best previously known methods, and we believe that
their additional features of quality guarantees, incremental
maintenance, and predictable behavior make them a viable
choice in practice.
Note that in principle any sketching technique that can
estimate join sizes with guaranteed error bounds (not
only AMS sketches) could be incorporated into our framework.
Thus for example one could potentially use some
of the sketching techniques proposed in upcoming publications
[12, 16] with their corresponding error guarantees
suitably adapted to our spatial framework.
In future work, we plan to extend our techniques to other
spatial queries, including non-rectangular selections [1]; and
we would like to incorporate quality boosting techniques
such as sketch partitioning into our framework. One especially
promising candidate for future work are complex spatial
queries that contain multiple operators, e.g., joins with
selections. Selectivity estimation for spatial joins with selections
can be done using our sketch-based framework. However
a straightforward application (use counting sketches
like for spatial joins, but multiply each inserted interval and
endpoint contribution with a factor that encodes the rangequery
sketch for independent ξ-families), would lead to dramatically
increased variance and hence space requirement,
compared to spatial join without selection.
10. REFERENCES
[1] A. Aboulnaga and J. F. Naughton. Accurate
estimation of the cost of spatial selections. In Proc.
ICDE, pages 123–134, 2000.
[2] S. Acharya, V. Poosala, and S. Ramaswamy.
Selectivity estimation in spatial databases. In Proc.
SIGMOD, pages 13–24, 1999.
[3] N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy.
Tracking join and self-join sizes in limited storage. In
Proc. PODS, pages 10–20, 1999.
[4] N. Alon, Y. Matias, and M. Szegedy. The space
complexity of approximating the frequency moments.
In Proc. STOC, pages 20–29, 1996.
[5] N. An, Z.-Y. Yang, and A. Sivasubramaniam.
Selectivity estimation for spatial joins. In Proc. ICDE,
pages 368–375, 2001.
[6] B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom. Models and issues in data stream systems.
In Proc. PODS, pages 1–16, 2002.
[7] Z. Bar-Yossef, R. Kumar, and D. Sivakumar.
Reductions in streaming algorithms, with an
application to counting triangles in graphs. In Proc.
SODA, pages 623–632, 2002.
[8] A. Belussi and C. Faloutsos. Self-spacial join
selectivity estimation using fractal concepts. Proc.
ACM TOIS, 16(2):161–201, 1998.
[9] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Parallel
processing of spatial joins using R-trees. In Proc.
ICDE, pages 258–265, 1996.
[10] M. Charikar, K. Chen, and M. Farach-Colton. Finding
frequent items in data streams. In Proc. ICALP, pages
693–703, 2002.
[11] G. Cormode and S. Muthukrishnan. What’s hot and
what’s not: Tracking most frequent items dynamically.
In Proc. PODS, pages 296–306, 2003.
[12] G. Cormode and S. Muthukrishnan. An improved
data stream summary: The count-min sketch and its
applications. In LATIN, pages 29–38, 2004.
[13] A. Das, J. Gehrke, and M. Riedewald. Approximation
techniques for spatial data. Technical report, Cornell
University, 2004.
http://techreports.library.cornell.edu.
[14] A. Dobra, M. N. Garofalakis, J. Gehrke, and
R. Rastogi. Processing complex aggregate queries over
data streams. In Proc. SIGMOD, pages 61–72, 2002.
[15] C. Faloutsos, B. Seeger, A. J. M. Traina, and
C. Traina. Spatial join selectivity using power laws. In
Proc. SIGMOD, pages 177–188, 2000.
[16] S. Ganguly, M. N. Garofalakis, and R. Rastogi.
Processing data-stream join aggregates using skimmed
sketches. In Proc. EDBT, pages 569–586, 2004.
[17] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and
M. Strauss. How to summarize the universe: Dynamic
maintenance of quantiles. In Proc. VLDB, pages
454–465, 2002.
[18] P. J. Haas and J. M. Hellerstein. Online query
processing. In Proc. SIGMOD, page 623, 2001.
[19] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online
aggregation. In Proc. SIGMOD, pages 171–182, 1997.
[20] Y.-W. Huang, N. Jing, and E. A. Rundensteiner. A
cost model for estimating the performance of spatial
joins using R-trees. In Proc. SSDBM, pages 30–38,
1997.
[21] J. Jin, N. An, and A. Sivasubramaniam. Analyzing
range queries on spatial data. In Proc. ICDE, pages
525–534, 2000.
[22] H.-P. Kriegel, M. Pfeifle, M. Pötke, and T. Seidl. A
cost model for interval intersection queries on
RI-trees. In Proc. SSDBM, pages 131–141, 2002.
[23] N. Mamoulis and D. Papadias. Selectivity estimation
of complex spatial queries. In Proc. SSTD, pages
155–174, 2001.
[24] D. Papadias, Y. Theodoridis, T. K. Sellis, and M. J.
Egenhofer. Topological relations in the world of
minimum bounding rectangles: A study with R-trees.
In Proc. SIGMOD, pages 92–103, 1995.
[25] C. Sun, D. Agrawal, and A. El Abbadi. Exploring
spatial datasets with histograms. In Proc. ICDE,
pages 93–102, 2002.
[26] C. Sun, D. Agrawal, and A. El Abbadi. Selectivity
estimation for spatial joins with geometric selections.
In Proc. EDBT, pages 609–626, 2002.
[27] Y. Tao, G. Kollios, J. Considine, F. Li, and
D. Papadias. Spatio-temporal aggregation using
sketches. In Proc. ICDE, 2004. to appear.
[28] N. Thaper, S. Guha, P. Indyk, and N. Koudas.
Dynamic multidimensional histograms. In Proc.
SIGMOD, 2002.
[29] Y. Theodoridis and T. K. Sellis. A model for the
prediction of R-tree performance. In Proc. PODS,
pages 161–171, 1996.
[30] Y. Theodoridis, E. Stefanakis, and T. K. Sellis.
Efficient cost models for spatial queries using R-trees.
IEEE TKDE, 12(1), 2000.
