Dynamic Authenticated Index Structures for Outsourced
†CS Department, Boston University, USA.‡AT&T Labs-Research, USA.
(lifeifei, gkollios, reyzin)@cs.bu.edu, firstname.lastname@example.org
Technical Report BUCS-TR-2006-004
April 22, 2006
In outsourced database (ODB) systems the database owner publishes its data through a
number of remote servers, with the goal of enabling clients at the edge of the network to access
and query the data more efficiently. As servers might be untrusted or can be compromised,
query authentication becomes an essential component of ODB systems. Existing solutions for
this problem concentrate mostly on static scenarios and are based on idealistic properties for
certain cryptographic primitives. In this work, first we define a variety of essential and prac-
tical cost metrics associated with ODB systems. Then, we analytically evaluate a number of
different approaches, in search for a solution that best leverages all metrics. Most importantly,
we look at solutions that can handle dynamic scenarios, where owners periodically update the
data residing at the servers. Finally, we discuss query freshness, a new dimension in data au-
thentication that has not been explored before. A comprehensive experimental evaluation of
the proposed and existing approaches is used to validate the analytical models and verify our
claims. Our findings exhibit that the proposed solutions improve performance substantially
over existing approaches, both for static and dynamic environments.
Database outsourcing  is a new paradigm that has been proposed recently and received con-
siderable attention. The basic idea is that data owners delegate their database needs and func-
tionalities to a third-party that provides services to the users of the database. Since the third
party can be untrusted or can be compromised, security concerns must be addressed before this
There are three main entities in the Outsourced Database (ODB) model: the data owner,
the database service provider (a.k.a. server) and the client. In general, many instances of each
entity may exist. In practice, usually there is a single or a few data owners, a few servers, and
many clients. The data owner first creates the database, along with the associated index and
authentication structures and uploads it to the servers. It is assumed that the data owner may
update the database periodically or ocasionally, and that the data management and retrieval
happens only at the servers. Clients submit queries about the owner’s data to the servers and get
back results through the network.
It is much cheaper to maintain ordinary servers than to maintain truly secure ones, particularly
in the distributed setting. To guard against malicious/compromised servers, the owner must give
the clients the ability to authenticate the answers they receive without having to trust the servers.
In that respect, query authentication has three important dimensions: correctness, completeness
and freshness. Correctness means that the client must be able to validate that the returned records
do exist in the owner’s database and have not been modified in any way. Completeness means
that no answers have been omitted from the result. Finally, freshness means that the results are
based on the most current version of the database, that incorporates the latest owner updates. It
should be stressed here that query freshness is an important dimension of query authentication
that has not been extensively explored in the past, since it is a requirement arising from updates
to the ODB systems, an aspect that has not been sufficiently studied yet.
There are a number of important costs pertaining to the aforementioned model, relating to
the construction, query, and update phases. In particular, in this work the following metrics are
considered: 1. The computation overhead for the owner, 2. The owner-server communication
cost, 3. The storage overhead for the server, 4. The computation overhead for the server, 5. The
client-server communication cost, and 6. The computation cost for the client (for verification).
Previous work has addressed the problem of query authentication mostly for static scenarios,
where owners never issue data updates. In addition, existing solutions take into account only
a subset of the metrics proposed here, and hence are optimized only for particular scenarios
and not the general case. Finally, previous work was mostly of theoretical nature, analyzing
the performance of the proposed techniques using analytical cost formulas, and not taking into
account the fact that certain cryptographic primitives do not feature idealistic characteristics
in practice. For example, trying to minimize the I/O cost associated with the construction of
an authenticated structure does not take into account the fact that generating signatures using
popular public signature schemes is two times slower than a random disk page access on today’s
computers. To the best of our knowledge, no previous work ever conducted empirical evaluations
on a working prototype of existing techniques.
Our Contributions. In this work, we: 1. Conduct a methodical analysis of existing approaches
over all six metrics, 2. Propose a novel authenticated structure that best leverages all metrics,
3. Formulate detailed cost models for all techniques that take into account not only the usual
structural maintenance overheads, but the cost of cryptographic operations as well, 4. Discuss
the extensions of the proposed techniques for dynamic environments (where data is frequently
updated), 5. Consider possible solutions for guaranteeing query freshness, 6. Implement a fully
working prototype and perform a comprehensive experimental evaluation and comparison of all
We would like to point out that there are other security issues in ODB systems that are
orthogonal to the problems considered here. Examples include: privacy-preservation issues [14, 1,
10], secure query execution , security in conjunction with access control requirements [20, 29, 5]
and query execution assurance . In particular, query execution assurance of  does not
provide authentication: the server could pass the challenges and yet still return false query results.
The rest of the paper is organized as follows. Section 2 presents background on essential
cryptography tools, and a brief review of related work. Section 3 discusses the authenticated
index structures for static ODB scenarios. Section 4 extends the discussion to the dynamic case
and Section 5 addresses query freshness. Finally, the empirical evaluation is presented in Section
6. Section 7 concludes the paper.
Figure 1: Example of a Merkle hash tree.
The basic idea of the existing solutions to the query authentication problem is the following. The
owner creates a specialized data structure over the original database that is stored at the servers
together with the database. The structure is used by a server to provide a verification object VO
along with the answers, which the client can use for authenticating the results. Verification usually
occurs by means of using collision-resistant hash functions and digital signature schemes. Note
that in any solution, some information that is authentic to the owner must be made available
to the client; else, from the client’s point of view, the owner cannot be differentiated from a
(potentially malicious) server. Examples of such information include the owner’s public signature
verification key or a token that in some way authenticates the database. Any successful scheme
must make it computationally infeasible for a malicious server to find incorrect query results
and verification object that will be accepted by a client who has the appropriate authentication
information from the owner.
Collision-resistant hash functions.
computable function that takes a variable-length input x to a fixed-length output y = h(x).
Collision resistance states that it is computationally infeasible to find two inputs, x1?= x2, such
that h(x1) = h(x2). Collision-resistant hash functions can be built provably based on various
cryptographic assumptions, such as hardness of discrete logarithms . However, in this work
we concentrate on using heuristic hash functions, which have the advantage of being very fast to
evaluate, and specifically focus on SHA1 , which takes variable-length inputs to 160-bit (20-
byte) outputs. SHA1 is currently considered collision-resistant in practice; we also note that any
eventual replacement to SHA1 developed by the cryptographic community can be used instead of
SHA1 in our solution.
For our purposes, a hash function h is an efficiently
Public-key digital signature schemes. A public-key digital signature scheme, formally defined
in , is a tool for authenticating the integrity and ownership of the signed message. In such
a scheme, the signer generates a pair of keys (SK,PK), keeps the secret key SK secret, and
publishes the public key PK associated with her identity. Subsequently, for any message m that
she sends, a signature smis produced by: sm= S(SK,m). The recipient of smand m can verify
smvia V(PK,m,sm) that outputs “valid” or “invalid.” A valid signature on a message assures
the recipient that the owner of the secret key intended to authenticate the message, and that
the message has not been changed. The most commonly used public digital signature scheme is
RSA . Existing solutions [26, 27, 21, 23] for the query authentication problem chose to use this
scheme, hence we adopt the common 1024-bit (128-byte) RSA. Its signing and verification cost is
one hash computation and one modular exponentiation with 1024-bit modulus and exponent.
Aggregating several signatures.
m1,...,mtsigned by the same signer need to be verified all at once, certain signature schemes
allow for more efficient communication and verification than t individual signatures. Namely, for
RSA it is possible to combine the t signatures into a single aggregated signature s1,tthat has the
same size as an individual signature and that can be verified (almost) as fast as an individual
signature. This technique is called Condensed-RSA . The combining operation can be done by
anyone, as it does not require knowledge of SK; moreover, the security of the combined signature
is the same as the security of individual signatures. In particular, aggregation of t RSA signatures
can be done at the cost of t−1 modular multiplications, and verification can be performed at the
cost of t−1 multiplications, t hashing operations, and one modular exponentiation (thus, the com-
putational gain is that t − 1 modular exponentiations are replaced by modular multiplications).
Note that aggregating signatures is possible only for some digital signature schemes.
In the case when t signatures s1,...,st on t messages
The Merkle Hash Tree. An improvement on the straightforward solution for authenticating
a set of data values is the Merkle hash tree (see Figure 1), first proposed by . It solves the
simplest form of the query authentication problem for point queries and datasets that can fit in
main memory. The Merkle hash tree is a binary tree, where each leaf contains the hash of a
data value, and each internal node contains the hash of the concatenation of its two children.
Verification of data values is based on the fact that the hash value of the root of the tree is
authentically published (authenticity can be established by a digital signature). To prove the
authenticity of any data value, all the prover has to do is to provide the verifier, in addition to the
data value itself, with the values stored in the siblings of the path that leads from the root of the
tree to that value. The verifier, by iteratively computing all the appropriate hashes up the tree,
at the end can simply check if the hash she has computed for the root matches the authentically
published value. The security of the Merkle hash tree is based on the collision-resistance of the
hash function used: it is computationally infeasible for a malicous prover to fake a data value,
since this would require finding a hash collision somewhere in the tree (because the root remains
the same and the leaf is different—hence, there must be a collision somewhere in between). Thus,
the authenticity of any one of n data values can be proven at the cost of providing and computing
log2n hash values, which is generally much cheaper than storing and verifying one digital signature
per data value. Furthermore, the relative position (leaf number) of any of the data values within
the tree is authenticated along with the value itself.
Cost models for SHA1, RSA and Condensed-RSA. Since all existing authenticated struc-
tures are based on SHA1 and RSA, it is imperative to evaluate the relative cost of these operations
in order to be able to draw conclusions about which is the best alternative in practice. Based
on experiments with two widely used cryptography libraries, Crypto++  and OpenSSL ,
we obtained results for hashing, signing, verifying and performing modulo multiplications. Evi-
dently, one hashing operation on our testbed computer takes approximately 2 to 3 µs. Modular
multiplication, signing and verifying are, respectively, approximately 100, 10,000 and 1,000 times
slower than hashing (verification is faster than signing due to the fact that the public verification
exponent can be fixed to a small value).
Thus, it is clear that multiplication, signing and verification operations are very expensive,
and comparable to random disk page accesses. The cost of these operations needs to be taken into
account in practice, for the proper design of authenticated structures. In addition, since the cost
of hashing is orders of magnitude smaller than that of singing, it is essential to design structures
that use as few signing operations as possible, and hashing instead.
Table 1: Notation used.
A database record
A B+-tree key
A B+-tree pointer
A hash value
Size of object x
Total number of database records
Total number of query results
Node fanout of structure x
Height of structure x
A hash operation on input x of length l
A signing operation on input x of length l
A verifying operation on input x of length l
Cost of operation x
The verification object
There are several notable works that are related to our problem. A good survey is provided in ;
our review here is brief. The first set of attempts to address query authentication problems in
ODB systems appeared in [9, 8, 16]. The focus of these works is on designing solutions for query
correctness only, creating structures that are based on Merkle trees. The work of  generalized
the Merkle hash tree ideas to work with any DAG (directed acyclic graph) structure. With
similar techniques, the work of  uses the Merkle tree to authenticate XML documents in the
ODB model. The work of  further extended the idea and introduced the VB-tree which was
suitable for structures stored on secondary storage. However, this approach is expensive and was
later subsumed by . Several proposals for signature-based approaches addressing both query
correctness and completeness appear in [23, 21, 26]. We are not aware of work that specifically
addresses the query freshness issue.
Hardware support for secure data access is investigated in [5, 4]. It offers a promising research
direction for designing query authentication scheme with special hardware support if it is available.
Lastly, distributed content authentication has been addressed in , where distributed version
of Merkle hash tree is applied.
3The Static Case
In this section we illustrate three approaches for query correctness and completeness: a signature-
based approach similar to the ones described in [26, 23], a Merkle-tree-like approach based on
the ideas presented in , and our novel embedded tree approach. We present them for the
static scenario where no data updates occur between the owner and the servers on the outsourced
database. We also present analytical cost models for all techniques, given a variety of performance
metrics. As already mentioned, detailed analytical modeling was not considered in related liter-