﻿Privacy-Preserving Top-k Queries
Jaideep Vaidya
Rutgers University and CIMIC
180 University Avenue
Newark, NJ 07102-1803
jsvaidya@rbs.rutgers.edu
Abstract
Processing of top-k queries has been attracting considerable
attention. Much of the work assumes distributed
data, with each site holding a different set of attributes for
the same set of entities. These methods assume that all sites
are happy with revealing the local order or scores of the
entities. Privacy/security concerns faced by the individual
sites can prohibit such disclosure. We present a mechanism
through which only the result of the top-k query (i.e., the
top-k entities) is returned while minimizing disclosure of
other information.
1. Introduction
The increased volume and variety of information stored
in databases has lead to an increase in the desire for ranked
and “best match” queries. A particularly important instantiation
of this is the top-k query: Finding the k closest
matches to a query “point”.
One problem addressed by the community is distributed
processing of top-k queries. In particular, vertically partitioned
data (each site has different attributes about a common
set of entities) raises interesting questions. Computing
the distance for any entity requires cooperation of all the
sites. Fagin made the observation that if each site produces
a list of its closest entities, and the lists are grown until they
have k items in common, then the union of the lists is guaranteed
to contain the top-k. He used this observation to
develop an algorithm which can work with O(k) communication
cost [7], although worst case could be O(n). 1
Top-k queries are particularly relevant when dealing
with privacy-sensitive information. For example, profiling
has potential anti-terrorism applications (e.g., finding
visa applicants who are likely terrorists). Profiles, however,
1 To see the inherent difficulty of this problem, assume an entity is close
at most sites, but the farthest entity at one site. The total distance could
place it in the top-k, but the one site would not expect this.
Chris Clifton
Purdue University
250 N. University St.
W. Lafayette, IN 47907-2066
clifton@cs.purdue.edu
rarely match individuals exactly - a prime example of the
need for similarity or best match queries. However, profiles
frequently match innocent people (or transactions) as well,
subjecting them to unwarranted scrutiny. This raises privacy
concerns, potentially restricting valuable applications.
The concept of k-anonymity has been proposed as a way
of balancing privacy concerns with the value gained by such
applications[19]. In the context of profiling, k-anonymity
would require that any individual that “matches” be indistinguishable
from at least k − 1 other individuals – in other
words, all k would match. If k is large enough, then any
use of the results will have to assume that most of the k are
innocent (or good credit risks, or ...), and individuals will
not be mistreated simply because they match the profile.
If we replace a “best match” query with a top-k query
that discloses no information other than the best k matches
to the query, we can maintain k-anonymity while meeting
the profiling goals. This paper presents an algorithm to
solve this problem.
The basic idea is to replicate Fagin’s algorithm[9], while
limiting disclosure to that inherently required to meet the
efficiency goals of the algorithm. For example, if a site is
required to test if an entity that is not locally close is in the
top-k, it learns that this entity must be somewhat close at another
site – preventing this disclosure would require testing
items that Fagin’s algorithm would not, hurting efficiency.
This disclosure is unlikely to raise privacy issues, and enables
an algorithm whose typical cost is in terms of k rather
than n. We prove the disclosure properties of our algorithm
under the semi-honest model. In the semi-honest model, a
party cannot deviate from the protocol, but it can later try to
infer information from the data it sees during the protocol.
Our algorithm actually has stronger security properties. A
dishonest party may cause incorrect results but is unable to
obtain significant information on the private values of honest
parties. We further discuss this issue in Section 4.
We next describe the specific models of privacy, distribution,
and querying used in the paper; in the process discussing
related work. Section 3 gives the algorithm, start-
ing top-down with an outline based on Fagin’s algorithm,
then discussing key secure components. This also includes
a discussion of the correctness of the algorithms. The security
properties are discused in Section 4, and communication/complexity
issues in Section 5.
2. Model and related work
We assume a vertically partitioned data distribution
model (i.e., heterogeneous schemas). Assume k parties,
P1, . . . , Pk, each with has access to a separate database that
give information about different features. Data is collected
for the same set of entities. This is the data model assumed
in several distributed best-match query papers[9, 8, 5, 4].
We include the added twist that none of the parties trust
each other completely; privacy concerns prevent them from
disclosing scores or local ordering of the entities.
Our data distribution model assumes that all sites collect
data for the exact same set of entities. Relaxing this assumption
would mean computing top-K over missing values,
which might not be very meaningful. However, most
of the current work is also based on this assumption, so we
are not adding any extra constraints on to the problem.
There has been a lot of work on the top-k query/selection
problem since Fagin first proposed his A0 algorithm [7].
Fagin also proposed other algorithms (TA and NRA) [10]
that perform better than his original algorithm. However,
because of its amenability to secure evaluation we start with
the original A0 algorithm.
There has also been work on top-k queries that assumes
all of the attributes of every object are readily available to
the query processor [5, 6, 14]. In [4], the authors introduce
an algorithm for evaluating top-K selection queries
over web accessible databases.
This paper draws heavily on work in privacy-preserving
data mining[17] and secure multiparty computation[13]. Of
particular relevance is a method for determining the kth element
in a distributed list[1]; this can be used for privacypreserving
top-k queries in horizontally partitioned data.
A key contribution of this paper is to demonstrate how
components developed in privacy-preserving data mining
and secure multiparty computation can be composed to
solve secure querying problems. Additional work in this
area will be referenced and discussed as the components
are described. Simply using secure components, however,
is not enough; they most be cleverly composed in ways that
achieve both efficiency and enable proof that the complete
algorithm is secure.
3. Algorithm
We follow the basic structure of Fagin’s A0 algorithm[9].
We first give a brief description of Fagin’s algorithm
(adapted from [9]) and then discuss our modifications.
Please refer to the original paper [9] for more details. The
A0 algorithm consists of 3 basic phases:
Sorted access phase: Each site i is accessed under sorted
access, such that it starts outputting the graded tuples
ordered on the local grade. This phase continues until
there are at least k common objects in the output of
all of the subsystems/sites. (i.e., the idea is to locally
query and obtain sorted pairs from each site until there
are at least k common objects in the output from all
sites). Thus, if we represent the local set at site i as
Oi, | �
i Oi| ≥ k. Under a monotone grading function,
it is shown that the top-k objects are guaranteed to be
within the union of all the objects [9].
Random access phase: For each object seen in the sorted
access phase (i.e., ∀o ∈ �
i Oi), random access is done
to each subsystem to find the local score of that object.
Computation phase: Compute the global score for each
object that has been seen (i.e., ∀o ∈ �
i Oi). The top-
K objects from this list are the required top-K objects.
Our modification of Fagin’s algorithm is as follows: First
we show how we can effectively simulate the operation of
the sorted access phase. We describe a mechanism through
which we can check if there are at least k objects in common
to all the sites, as well as giving the union of the sets without
revealing the effective order of the objects at any site or even
revealing which object came from which site.
Each site performs the random access phase independently.
In the computation phase, the sites first partially
compute scores such that two sites are left with (random)
shares of the score for each object. The two sites then identify
a cutoff score that separates the kth item from those
below it; this performs a binary search for the appropriate
threshold while use secure comparisons to guide the search
without revealing the score of individual items. Knowing
this, the sites can (securely) compare each item with the
score to determine if it is in the top-k or not.
What information is disclosed by this process? Since the
answer discloses the top-k, the knowledge of which items
are closer or farther than the kth can be inferred from the
result, so the final stage discloses nothing. The largest disclosure
is the set of candidates - the union of all items in
the set required to get k intersecting items. While this is
significant, we argue that it is necessary: Simply observing
which items are accessed in the random access phase
would reveal this information, so to prevent disclosing the
candidates each site would need to access every item (losing
the efficiency of the algorithm of [9].) The information
disclosed is discussed further in Section 4.1.
The algorithm does place some restrictions on the scoring
functions to ensure that they can be easily computed
in a distributed and secure manner (specifically – the score
function should be expressible as a sum of locally computed
functions). Fortunately, most distance functions meet these
restrictions, in particular the following from [5]:
Definition 1 Consider a relation R = (A1, . . . , An).
A1, . . . , An are real-valued attributes ranging between 0
and 1. Then, given a query q = (q1, . . . , qn) and a tuple
t = (t1, . . .,tn) from R, we define the score of t for q using
any of the following two scoring functions:
n� |qi − ti|
Sum(q, t) = 1 −
n
i=1
�
�
�
Euclidean(q, t) = 1 − � n � (qi − ti) 2
n
i=1
For ease of presentation, we assume the scores are inverted
(we are looking for the lowest k rather than the highest).
For Sum (equation 1) this lets us convert the metric
to a sum of scores computed independently at the different
sites. Since the goal is to determine the top-k rather than the
scores of the top-k, we can remove the square root (and division
by n); the ranking stays unchanged. This again gives
us a sum of locally computed values. We thus generalize the
score as �
i fi(qi, ti): a sum of locally-computed functions.
Assuming such a formulation of the score, we go through
the following basic protocol. Each site Pj, j = 0, . . .,p−2
selects a random rj uniformly over the field F . For
each candidate transaction ti, site Pj locally computes the
weighted score f(qj, tij) for the ith transaction, and sends
f(qj, tij) + rj to site Pp−1. Site Pp−1 adds these (and its
own) to get �p−1 i=0 rj (mod F). Each
i=0 f(qj, tij) + � p−2
site Pj (j = 0, . . .,p − 3) also send the random rj to site
Pp−2. Pp−2 now computes R = �p−2 2
j=0 rj.
This is performed (in parallel) for each of the candidate
elements. Parties Pp−1 and Pp−2 then go through a binary
search to find a threshold score separating the top-k elements,
this is described in Section 3.1. Parties Pp−1 and
Pp−2 compare each candidate with this threshold to determine
which are in the top-k.
In the basic algorithm, any party can play the role of
Pp−1 and Pp−2. Thus, the order of the parties can be completely
random and we assume that the ordering is chosen
such that all parties are amenable to the choice of Pp−1 and
Pp−2.
Details of the algorithm are given in Algorithm 1.
Correctness follows directly from the correctness of Fagin’s
A0 algorithm, assuming that the component intersection,
2 Note that there is an obvious problem if Pp−1 and Pp−2 collude;
using a secure summation protocol instead of sending values directly limits
the damage of such collusion to revealing scores for items rather than
individual attribute values.
(1)
(2)
Algorithm 1 Finding the top-K elements
Require: Parameter m ≥ k (initial number of items to
check)
Require: p sites, P0, . . .,Pp−1
1: {Sorted access phase}
2: numcommon ← 0
3: while numcommon < k do
4: Each site Pi finds the m closest local items according
to the local distance metric wiSi {Once the first m
items have been found, each site can simply find the
next k items in sorted order}
5: Set IDi is formed by site Pi from the ids found in
6:
the previous step
numcommon ←
3.3)
� �
|ID0 . . . IDp−1| (Section
7: m ← m + k
8: end while
9: {Random access / Computation
� �
phase}
10: Candidates ← ID0 . . . IDp−1 (Section 3.2)
11: for id ∈ Candidates do
12: for all sites j = 0, . . .,p − 2 do
13: Pj: Generate a random rj uniformly from the field
F
14: Pj: Compute the local weighted score wjSj for
candidate id
15: Pj: inpj ← wjSj + rj (mod F)
16: Pj: Send inpj to site Pp−1
17: Pj: Send −rj (mod F) to site Pp−2
18: end for
19: Pp−1: Compute spid = �p−2 j=0 inpj + wp−1Sp−1
(mod F)
20: Pp−2: Compute sp ′ id = �p−2 j=0 −rj (mod F)
21: end for
22: {Pp−1, Pp−2 hold shares of the scores for all candidates}
23: Find the score sc of the kth element in Candidates using
kth element protocol (see Section 3.1)
24: for id ∈ Candidates do
25: Use Yao’s secure comparison[21] to find if spid +
sp ′ id > sc. If so, id is a part of the result.
26: end for
union, kth element score computation, and comparison protocols
are correct.
3.1 Finding the kth element
We now describe how to find a threshold score separating
the top k elements. The problem can be stated as follows:
There is a list of n elements, e1, . . . , en. We want to find the
score separating the kth largest element in the list from the
k + 1st. We consider the case where the elements are vertically
partitioned. i.e., two parties have random shares of all
of the elements. Specifically, there are two parties A and B
having n elements (a1, . . . , an and b1, . . . , bn respectively)
such that ∀i, ai + bi = di (mod F).
Aggarwal et. al solved the problem for horizontally partitioned
data[1], i.e., two parties A and B have na and
nb items (a1, . . . , ana and b1, . . . , bnb respectively). Our
method shares the same basic idea: a binary search for an
appropriate threshold score. Use of secure comparisons enables
the search to proceed without revealing anything.
The key idea is that as long as we know the range of the
elements (the field F ), binary search can be used to identify
the value of kth element. The algorithm starts off with an
initial guess (say F/2). Now, for every element, the two
sites run a secure comparison protocol that compares the
magic number to the sum of the random shares with the
sites. The output for each site are also random shares of
1 (if the element is greater) or 0 (if the element is equal
or smaller). Once all the elements are compared, the two
sites can independently sum up their shares and run another
secure comparison to find out if the total of these sums is
greater than or less than k. Depending on the result, the
estimate is updated to be lower or higher. Within log |F |
rounds the correct value is identified. Algorithm 2 gives the
detailed algorithm.
3.2 Set Union
Secure Union has also been solved; we sketch the
method from [16] (another solution can be found in [3].)
The problem can be stated as follows: There exist p sites
P1, . . . , Pp. Party Pi has set Si where the elements of the
set come from a common global universe G. Together they
want to compute the union of the local sets S = S1∪. . .∪Sp
securely. The union of items can be evaluated using SMC
methods if the domain of the items is small. Each party creates
a binary vector where 1 in the i th entry represents that
the party has the i th item. After this point, a simple circuit
that or’s the corresponding vectors can be built and it can be
securely evaluated using general secure multi-party circuit
evaluation protocols. However, in data mining the domain
of the items is usually large. To overcome this problem a
simple approach based on commutative encryption can be
Algorithm 2 Finding the kth largest element
Require: Sites A and B having shares a1, . . .,an and
b1, . . .,bn respectively
1: lbound ← 0
2: ubound ← |F |
3: while true do
4: result ← ⌈(lbound + ubound)/2⌉
5: if result = ubound then
6: return result {Tie for kth element}
7: end if
8: for i = 1, . . .,n do
9: Use Yao’s secure comparison with input ai, bi and
output la, lb such that la + lb = 1 (mod F) if
ai+bi < result (mod F) otherwise la+lb = 0
(mod F)
10: Use Yao’s secure comparison with input ai, bi and
output ga, gb such that ga + gb = 1 (mod F) if
ai+bi > result (mod F) otherwise ga+gb = 0
(mod F)
11: end for
12: A: lla ← � n
i=1 la (mod F)
13: A: gga ← � n
i=1 ga (mod F)
14: B: llb ← � n
i=1 lb (mod F)
15: B: ggb ← � n
i=1 gb (mod F)
16: if lla+llb (mod F) > k {estimate too high} then
17: ubound ← result
18: else if gga +ggb (mod F) < k {estimate too low}
then
19: lbound ← result {estimate too low}
20: else
21: return result {Found it}
22: end if
23: end while
used. An encryption algorithm is commutative if given encryption
keys K1, . . . , Kn ∈ K, for any m in domain M,
and for any permutation i, j, the following two equations
hold:
EKi (. . . EKin 1 (M)...) = EKj (. . . EKjn (M)...) (3)
1
∀M1, M2 ∈ M such that M1 �= M2 and for given k, ǫ < 1
2 k
Pr(EKi 1 (. . . EKin (M1)...) = EKj 1 (. . . EKjn (M2)...)) < ǫ
(4)
With shared p the Pohlig-Hellman encryption scheme[18]
satisfies the above equations, but any other commutative encryption
scheme can be used.
The key idea is that every site creates its own instance of
a public-private key pair. Now every site will encrypt all of
its items. Every site will also encrypt all the items from all
of the other sites. When all of the items are encrypted by all
of the sites, since equation 3 holds, duplicates in the original
items will be duplicates in the encrypted items, and can
be deleted. (Due to equation 4, only the duplicates will be
deleted.) In addition, the decryption can occur in any order,
so by permuting the encrypted items we prevent sites from
tracking the source of an item. The algorithm for evaluating
the union of the items is given in Algorithm 3.
Clearly algorithm 3 finds the union without revealing
which item belongs to which site. It is not, however, secure
under the definitions of secure multi-party computation.
It reveals the number of items that exist commonly
in two sites, e.g. if k sites have an item in common, there
will be an (encrypted) item duplicated k times. This does
not reveal which items these are, but a truly secure computation
(as good as each site giving its input to a “trusted
party”) could not reveal even this count. Allowing innocuous
information leakage (the number of items that is owned
by two sites) allows an algorithm that is sufficiently secure
with much lower cost than a fully secure approach.
It can be proven that other than the size of intersections
and the final result, nothing is revealed. By assuming that
the count of duplicated items is part of the final result, a
Secure Multiparty Computation proof is possible. Refer to
[16] for details.
3.3 Size of Set Intersection
There are several solutions to securely finding the size
of the intersection of sets[20, 2, 12], we sketch the method
of [20]. Consider several parties having their own sets of
items from a common domain. The problem is to securely
compute the cardinality/size of the intersection of these local
sets.
Formally, given p parties P1 . . . Pp having local sets
S1 . . .Sp, we wish to securely compute |S1 ∩ . . . ∩ Sp|. We
can do this is using a parametric commutative one way hash
Algorithm 3 Finding secure union of items
Require: p sites, P1, . . .,Pp
Union set ← ∅ {Encryption of all the rules by all sites}
for each site i do
for each X ∈ Si do
M ← newarray[p]
Xp ← encrypt(X, ei)
M[i] ← 1
Union set ← Union Set � (Xp, M)
end for
end for{Site i encrypts its items and adds them to the
global set. Each site then encrypts the items it has not
encrypted before}
for each site i do
for each tuple (r,M) ∈ Union set do
if M[i] = 0 then
rp ← encrypt(r, ei)
M[i] ← 1
Mp ← M
Union set ← (Union set −
{(r, M)}) � {(rp, Mp)}
end if
end for
end for
for (r,M) ∈ Union set and (rp,Mp ) ∈ Union set do
{check for duplicates}
if r=rp then
Union set ← Union set − {(r, M)} {Eliminate
duplicate items before decryption}
end if
end for
for each site i do {Each site decrypts every item to get
the union of items}
for all (r,M) ∈ Union set do
rd ← decrypt(r, di)
Union set ← (Union set − {(r, M)}) � {(rd)}
end for
permute elements in the Union set
end for
return Union set
function. One way of getting such a hash function is to use
commutative public key encryption, such as Pohlig Hellman,
and throw away the decryption keys. Commutative
encryption has already been described in detail in Section
3.2.
All p parties locally generate their public key-pair
(Ei, Di) for a commutative encryption scheme. (They can
throw away their decryption keys since these will never be
used.) Each party encrypts its items with its key and passes
it along to the other parties. On receiving a set of (encrypted)
items, a party encrypts each item and permutes the
order before sending it to the next party. This is repeated
until every item has been encrypted by every party. Since
encryption is commutative, the resulting values from two
different sets will be equal if and only if the original values
were the same (i.e., the item was present in both sets). Thus,
we need only count the number of values that are present
in all of the encrypted itemsets. This can be done by any
party. None of the parties is able to know which of the items
are present in the intersection set because of the encryption.
The complete protocol is shown in algorithm 4. For details
and proof see [20].
4. Security Analysis
The Secure Multiparty Computation community has developed
a method for demonstrating that protocols do not
leak information known as proof by simulation. The key
idea is that if a party can simulate what it sees during execution
of a protocol knowing only its own input and the
result, then it hasn’t learned anything from the execution of
the protocol.
One challenge is that a party’s view in this protocol isn’t
exactly determined by the inputs – the choice of random
numbers (e.g., encryption keys) by other parties may change
the actual bits seen by a party, even if the inputs (and thus
output) are the same. In this context, to simulate a party’s
view actually means to simulate the distribution over a number
of runs with the same inputs. Formal definitions, and a
proof that the ability to simulate a protocol does ensure information
is not disclosed, can be found in [13].
We include the formal definition for semi-honest behavior
below. Semi-honest means that all parties follow the
protocol, but may try to infer private information from what
is seen during execution.
Definition 2 Privacy with respect to semi-honest behavior
[13].
Let f : {0, 1} ∗ × {0, 1} ∗ ↦−→ {0, 1} ∗ × {0, 1} ∗ be probabilistic,
polynomial-time functionality, where f1 (x, y) (respectively,
f2 (x, y)) denotes the first (resp., second) element
of f (x, y)); and let Π be two-party protocol for computing
f.
Algorithm 4 Securely computing size of intersection set
Require: p sites
Require: each site has a local set Si
Generate the commutative encryption key-pair (Ei, Di)
{Throw away the decryption keys, since they will not be
needed.}
M ← Si
for p − 1 steps do
M ′ ← newarray[|M|]
j ← 0
for each X ∈ M do
M ′ [j + +] ← encrypt(X, Ei)
end for
permute the array M ′ in some random order
send the array M ′ to site i + 1 mod k
receive array M from site i − 1 mod k
end for
{Final Encryption}
M ′ ← newarray[|M|]
j ← 0
for each X ∈ M do
M ′ [j + +] ← encrypt(X, Ei)
end for
permute the array M ′ in some random order
send M ′ to site i mod 2 {This prevents a site from seeing
it’s own encrypted items}
sites 0 and 1 produce array I0 and I1 containing only
(encrypted) items present in all arrays received.
site 1 sends I1 to site 0
site 0 broadcasts the result |I0 ∪ I1|
Let the view of the first (resp., second) party during
an execution of protocol Π on (x, y), denoted
view Π 1 (x, y) (resp., viewΠ 2 (x, y)), be (x, r1, m1, . . . , mt)
(resp., (y, r2, m1, . . . , mt)). r1 represent the outcome of
the first (resp., r2 the second) party’s internal coin tosses,
and mi represent the i th message it has received.
The output of the first (resp., second) party during an
execution of Π on (x, y) is denoted outputΠ 1 (x, y) (resp.,
outputΠ 2 (x, y)) and is implicit in the party’s view of the
execution.
Π privately computes f if there exist probabilistic polynomial
time algorithms, denoted S1, S2 such that
{(S1 (x, f1 (x, y)) , f2 (x, y))} x,y∈{0,1} ∗ ≡ C
�� view Π 1 (x, y),output Π 2
(x, y)��
x,y∈{0,1} ∗
{(f1 (x, y) , S2 (x, f1 (x, y)))} x,y∈{0,1} ∗ ≡ C
�� output Π 1 (x, y),view Π 2 (x, y) ��
x,y∈{0,1} ∗
where ≡ C denotes computational indistinguishability.
The full malicious model, which is based on equivalence to
all parties giving their input to a trusted third party to compute
the result, results in solutions with significantly higher
complexity. As we shall see, our protocol gives somewhat
better protection than semi-honest: a dishonest party may
cause incorrect results, but is unable to obtain significant
information on the private values of honest parties.
We also rely heavily on the composition theorem of [13],
which states that if a protocol incorporates known secure
protocol, and the outer protocol can be proven secure in the
way it uses the inner protocol, then the overall protocol is
secure.
Theorem 1 (Composition Theorem for the semi-honest
model): Suppose that g is privately reducible to f and that
there exists a protocol for privately computing f. Then there
exists a protocol for privately computing g.
PROOF. Refer to [13].
Theorem 2 Algorithm 1 disclose only an unranked list of
the top-k elements, an estimate of the distance to the kth
element (somewhere between the kth and k + 1st element),
and
1. the candidate items (the union of the lm closest items
at each site, where l is chosen so that lists have at least
k items in common),
2. the number of common items in lists of the top im items
from each site, i = 1 . . .l, and
3. ranges for the distance to the kth and k+1st elements.
PROOF. First, we note that much of the algorithm can
be simulated by simply running the algorithm on the local
inputs. The exceptions are Step 7 (the intersection), Step
10 (Union), Steps 16 on (where sites Pp−1 and Pp−2 must
simulate the completion of the protocol), and the number of
times each loop must iterate.
The results of the intersection at step 7 are exactly the
information disclosed in 2 above – by acknowledging this
information is disclosed, we are able to use it to simulate the
step. Assuming the set intersection protocol is secure, the
composition theorem allows us to conclude that the protocol
is secure to this point.
The number of intersection sets disclosed by 2 gives us
the number of iterations of the loop at steps 3-8, completing
the simulation of this first part of the protocol.
Likewise, assuming knowledge of the candidate set (disclosure
1 allows us to simulate the result of step 10 and the
number of iterations of the for loop of steps 11-21. Assuming
the set union is generated without disclosing anything
except the items in that union, the composition theorem allows
us to conclude step 10 is secure.
Simulating the values received by Pp−1 and Pp−2 at lines
16 and 17 is more difficult. Here we rely on simulating the
distribution of values. We start with step 17, and show that
Pp−2 can simulate rj by simply selecting a random value r
uniformly from the field F :
Pr[−rj (mod F) = x] = Pr[rj = −x (mod F)]
= 1/|F |
= Pr[r = x]
In other words, the probability that rj equals any particular
value x is the same for all x (since rj is chosen uniformly
from F ), as is the probability that r = x.
The same simulation (select a random r uniformly from
F ) works for Step 16 as well, but the proof of this is slightly
more complex:
Pr[inpj = x] = Pr[wjSj + rj (mod F) = x]
= Pr[rj = x − wjSj (mod F)]
= 1/|F |
= P[r = x]
Since rj is chosen uniformly from F , the likelihood that it
is equal to any particular value in the field is 1/|F |.
To simulate steps 19 and 20, we must show that the distribution
of the sums of the inpj and rj equals the distribu-
tion of spid and sp ′ id
. Step 20 simply relies on the fact that
the sum modF of two numbers chosen randomly from a
uniform distribution of F is itself uniformly distributed of
F :
Pr[r1 + r2 (mod F) = x] = Pr[r1 = x − r2 (mod F)]
= 1/|F |
By induction, we can show that the sum at step 20 is uniformly
distributed over F . Likewise, since the actual and
simulated inpj are both uniformly distributed over F , the
sum at step 19 is uniformly distributed over F . Thus summing
the simulated inpj and −rj effectively simulates spid
and sp ′ id .
Step 23 finds a distance score sc that separates the topk
items from other items; since this score is presumed to
be part of the result we simulate it with the known value.
Algorithm 2 does disclose a lower bound on the distance
between the kth and k + 1st element; see Theorem 3. Accepting
this disclosure, we can use the composition theorem
to conclude the algorithm is secure through this step.
The comparisons in Step 25 can also be simulated using
the result: If id is in the top-k items, then the comparison
result is that spid + sp ′ id < sc is true; otherwise the comparison
result is false.
Since we have shown that the (distribution of) the view
of any of the parties can be simulated knowing only the result
and the specific disclosures 1 and 2, we have shown that
Algorithm 1 is secure except for those disclosures.
We now separately prove the security of Algorithm 2,
then discuss the innocuousness of the disclosed information.
Theorem 3 Algorithm 2 discloses a score that separates
the kth and k + 1st element, along with (possibly) bounds
on the locations of the kth and k + 1st elements.
PROOF. The key to proving Algorithm 2 is simulating
the comparison results. Knowing the final score, we can
compute a unique binary search path leading to that score.
The problem is that the simulator knows a valid score, not
necessarily the final result. The steps taken (the sequence
of intermediate values of result) will progress toward the
known score in a predetermined (and thus simulatable) binary
search; the key question is when to stop? The algorithm
will stop at the first valid score. If we have a range
of possible values for the kth and k + 1st elements, then
the first point in the binary search that leads to something
between these ranges is the final score.
Once we know this actual final score, the algorithm can
be simulated.
Steps 1-8 are performed locally, and thus easily simulated.
Steps 9 and 10 give random shares la (lb) and ga
(lb); while the outputs sum to either 0 or 1, independently
each can be simulated by choosing a random r uniformly
from the field F . (The secure comparison works by setting
one result la equal to a random uniformly chosen from the
field F , and the other result to 1 − la (mod F) or 0 − la
(mod F).) The same argument as used in the proof in Theorem
2 of steps 16-20 of Algorithm 1 shows the effectiveness
of this simulation.
Likewise, the simulation of steps 12-15 is simply a sum
of the simulated la, etc. The proof that this is valid follows
that of the proof of steps 19-20 in Theorem 2.
The comparison result of steps 16 and 18 are simulated
knowing the final score: If the current simulated value of
result is greater than score, then the comparison at line 16
is true; otherwise it is false. The opposite holds for step
18.
It is possible to find the kth largest element without disclosing
ranges for the kth and k + 1st elements, however this
requires continuing the search for the maximum log(|F |)
steps every time. In addition, the conditions in lines 16-
22 must be replaced with a single comparison that gives a
random result in the case of equality (< or >, hiding the
fact that equality was reached.) Full discussion and proof is
beyond the scope of this paper; as discussed below the disclosures
of the algorithm presented here do not violate the
privacy policy we are assuming.
4.1 Disclosure Significance
How bad is the disclosure of information from 1, 2, and
3? We argue that disclosing the candidates is necessary to
achieve even a moderately efficient protocol. One option to
achieve higher security would be to have every site treat every
item as a candidate; this is clearly inefficient. However,
if a site knows that it does not have to participate in computing
the score for an item, then it knows that the sum of
the distances from the other sites is large enough to prevent
the item from being in the top-k. Failure to be in the actual
candidate list (disclosure 1) reveals slightly more: It reveals
that the score for the item and some site is sufficient to prevent
it from being in the top-k. More precisely, it reveals
that k items have lower distances at all sites than items that
are not candidates.
The information revealed by the intermediate intersection
sizes (disclosure 2) is similar – it reveals how many
points have relatively low scores at all sites. However, this
has little privacy impact. Individual entities are not revealed,
except to the extent that they are known to be part
of the candidate set.
The ranges of disclosure 3 are also divorced from the
particular entities that they apply to. Knowing the range of
scores for the top-k items, and how well they are separated
from the remaining items, is likely to be desirable anyway.
However, the key reason these disclosures are not an issue
comes back to the privacy policy of k-anonymity. Note
that the candidate set grows in groups of size k. If the first
attempt at the candidate set is exactly the top-k, then no
entities are disclosed except the top-k. Otherwise, at least
k additional entities are brought in for the second round.
Since the protocol protects against distinguishing individual
entities except for revealing the candidate set and the
top-k, any information disclosed (such as score ranges for
the kth and k + 1st elements) could apply with equal probability
to any of at least k entities. Thus privacy policy is
met.
5. Computation and Communication Analysis
Our analysis is based on the number of candidate items
that need to be checked. Assume that x is the total number
of candidates. Obviously this is an upper bound on the size
of the candidate list generated by each site.
A single invocation of the size of set intersection protocol
will have the following cost[20]: For p parties and maximum
set size x, there are p 2 ∗ x encryptions. Parallelization
is possible and the real computation time cost is that of
p ∗ x encryptions. In terms of communication cost, there
are O(p 2 ) messages, O(p 2 x) bits and p rounds. Actual experimental
cost can be found in [20]. Other set intersection
algorithms also exist, some of which may be more efficient
in certain cases [2, 12].
Depending on the protocol used for set union, detailed
complexity analysis can be found in the relevant papers
[16, 3]. For now, we just give a rough estimate based on
the protocol in [16]. Effectively, the set union protocol simply
requires the same number of encryptions as the set intersection
protocol as well as requiring decryptions by all
parties. Thus O(p 2 x) encryptions and O(px) decryptions
are required. Communication complexity is comparable to
that of size of set intersection protocol. As a general note,
for these protocols the computation time tends dominate the
communication time in terms of overall time required.
As a basis for reference, our experiments showed that
1 Million encryptions with a key size of 512 bits require
4660 seconds on a SUN Blade 1000 workstation (900 MHz
processor, 1 GB RAM). The transfer time on a 100 Mbps
network is < 1 second.
The sorted access phase requires a number of intersections
until the while condition (line 3) is satisfied. In the
random access/computation phase, first a single invocation
of the set union protocol is necessary to identify the candidates.
Once the candidates are identified, lines 12-18 consist
of the random access part. This requires at most O(px)
numbers to be transferred with negligible computation cost.
Line 23 requires a single invocation of the kth element finding
protocol to find the kth element. This protocol requires
at most log |F | rounds. Each round requires secure comparisons
with all of the candidates. Thus O(x ∗ log |F |)
secure comparisons are required. Once the kth element is
identified, a final set of comparisons are required with all
the candidates to identify the top-K results. This requires
another x secure comparisons.
The secure comparison [21] requires O(size) oblivious
transfers. More efficient comparison protocols also exist
[11, 15]. Ioannidis et. al [15] report that the computation
cost for comparing two number of size 15 bits is 217.5ms
on a Pentium III/450 MHz machine. The communication
overhead is < 4ms. Using this data we can estimate that for
1000 candidates and field size of a million, the random access
phase would require approximately 2.4 hours on such
slow machines.
We have given cost estimates in terms of the number
of candidates. Fagin [9] showed that the worst-case number
of candidates is O(N (p−1)/p k 1/p ); where N is the total
database size, p is the number of parties, and the top-k results
are required. Typically the number of candidates is
much lower.
6. Conclusion
The primary contribution of this paper is to propose a
secure method for doing top-k selection from vertically
partitioned data. This has particular relevance to privacysensitive
searches, and meshes well with privacy policies
such as k-anonymity.
Another contribution is a demonstration of how secure
primitives from the literature can be composed with efficient
query processing algorithms, with the result having
provable security properties. This lends credence to the hypothesis
that many secure data management problems can
be solved by putting together secure primitives in a novel
way. There remain many open problems in developing secure
solutions based on efficient non-secure query processing
algorithms. Further, the paper shows that there a tradeoff
between efficiency and the amount of information which
is disclosed. It is worthwhile to explore whether one could
have a suite of algorithms (or a configurable algorithm) so
that applications can choose the goal they want to optimize.
For example, an application may be willing to accept lower
performance as long as the amount of information that is
disclosed is minized. We hope that this paper can ignite
further research into this problem.
References
[1] G. Aggarwal, N. Mishra, and B. Pinkas. Secure computation
of the k th -ranked element. In Proceedings of IACR
Eurocrypt (EUROCRYPT04), Interlaken, Switzerland, May
2-6 2004.
[2] R. Agrawal, A. Evfimievski, and R. Srikant. Information
sharing across private databases. In Proceedings of
ACM SIGMOD International Conference on Management
of Data, San Diego, California, June 9-12 2003.
[3] M. Bawa, R. J. Bayardo Jr., and R. Agrawal. Privacypreserving
indexing of documents on the network. In Proceedings
of the 29th International Conference on Very Large
Data Bases, pages 922–933, Berlin, Germany, Sept.9-12
2003. Morgan Kaufmann Publishers Inc.
[4] N. Bruno, L. Gravano, and A. Marian. Evaluating top-k
queries over web-accessible databases. In Proceedings of the
18th International Conference on Data Engineering, pages
369–380, San Jose, California, Feb. 26 - Mar. 1 2002.
[5] S. Chaudhuri and L. Gravano. Evaluating top-k selection
queries. In M. P. Atkinson, M. E. Orlowska, P. Valduriez,
S. B. Zdonik, and M. L. Brodie, editors, Proceedings of 25th
International Conference on Very Large Data Bases, pages
397–410, Edinburgh, Scotland, Sept. 7-10 1999. VLDB,
Morgan Kaufmann.
[6] D. Donjerkovic and R. Ramakrishnan. Probabilistic optimization
of top n queries. In Proceedings of the 25th International
Conference on Very Large Data Bases, pages 411–
422. Morgan Kaufmann Publishers Inc., 1999.
[7] R. Fagin. Combining fuzzy information from multiple
systems. In Proceedings of the Fifteenth ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database
Systems, pages 216–226, Montreal, Canada, June 3-5 1996.
ACM Press.
[8] R. Fagin. Fuzzy queries in multimedia database systems.
In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems,
pages 1–10, Seattle, Washington, June 1-3 1998. ACM
Press.
[9] R. Fagin. Combining fuzzy information from multiple systems.
Journal of Computer and System Sciences, pages 83–
99, 1999. (Special issue for selected papers from the 1996
ACM Symposium on Principles of Database Systems).
[10] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms
for middleware. In Proceedings of the Twentieth
ACM SIGACT-SIGMOD-SIGART Symposium on Principles
of Database Systems, pages 102–113, Santa Barbara, California,,
May 21-23 2001. ACM.
[11] M. Fischlin. A cost-effective pay-per-multiplication comparison
method for millionaires. In RSA Security 2001 Cryptographer’s
Track Lecture Notes in Computer Science, volume
2020, pages 457–471. Springer-Verlag, 2001.
[12] M. J. Freedman, K. Nissim, and B. Pinkas. Efficient private
matching and set intersection. In Eurocrypt 2004, Interlaken,
Switzerland, May 2-6 2004. International Association
for Cryptologic Research (IACR).
[13] O. Goldreich. The Foundations of Cryptography, volume 2,
chapter General Cryptographic Protocols. Cambridge University
Press, 2004.
[14] I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Supporting
top-k join queries in relational databases. In Proceedings of
the 29th International Conference on Very Large Databases,
pages 754–765, Berlin, Germany, Sept.9-12 2003. Morgan
Kaufmann Publishers Inc.
[15] I. Ioannidis and A. Grama. An efficient protocol for yao’s
millionaires’ problem. In Hawaii International Conference
on System Sciences (HICSS-36), pages 205–210, Waikoloa
Village, Hawaii, Jan. 6-9 2003.
[16] M. Kantarcıoˇglu and C. Clifton. Privacy-preserving distributed
mining of association rules on horizontally partitioned
data. IEEE Transactions on Knowledge and Data
Engineering, 16(9):1026–1037, Sept. 2004.
[17] Special section on privacy and security. SIGKDD Explorations,
4(2):i–48, Jan. 2003.
[18] S. C. Pohlig and M. E. Hellman. An improved algorithm
for computing logarithms over GF(p) and its cryptographic
significance. IEEE Transactions on Information Theory, IT-
24:106–110, 1978.
[19] P. Samarati and L. Sweeney. Protecting privacy when
disclosing information: k-anonymity and its enforcement
through generalization and suppression. In Proceedings of
the IEEE Symposium on Research in Security and Privacy,
Oakland, CA, May 1998.
[20] J. Vaidya and C. Clifton. Secure set intersection cardinality
with application to association rule mining. Journal of
Computer Security, to appear.
[21] A. C. Yao. How to generate and exchange secrets. In Proceedings
of the 27th IEEE Symposium on Foundations of
Computer Science, pages 162–167. IEEE, 1986.
