
Dynamic Authenticated Index Structures for Outsourced Databases

Feifei Li†    Marios Hadjieleftheriou‡    George Kollios†    Leonid Reyzin†

†CS Department, Boston University, USA. ‡AT&T Labs-Research, USA.
(lifeifei, gkollios, reyzin)@cs.bu.edu, marioh@research.att.com

Technical Report BUCS-TR-2006-004

April 22, 2006

Abstract

In outsourced database (ODB) systems the database owner publishes its data through a number of remote servers, with the goal of enabling clients at the edge of the network to access and query the data more efficiently. As servers might be untrusted or can be compromised, query authentication becomes an essential component of ODB systems. Existing solutions for this problem concentrate mostly on static scenarios and are based on idealistic properties for certain cryptographic primitives. In this work, first we define a variety of essential and practical cost metrics associated with ODB systems. Then, we analytically evaluate a number of different approaches, in search of a solution that best leverages all metrics. Most importantly, we look at solutions that can handle dynamic scenarios, where owners periodically update the data residing at the servers. Finally, we discuss query freshness, a new dimension in data authentication that has not been explored before. A comprehensive experimental evaluation of the proposed and existing approaches is used to validate the analytical models and verify our claims. Our findings show that the proposed solutions improve performance substantially over existing approaches, both for static and dynamic environments.

1 Introduction

Database outsourcing [13] is a new paradigm that has been proposed recently and has received considerable attention. The basic idea is that data owners delegate their database needs and functionalities to a third party that provides services to the users of the database. Since the third party can be untrusted or can be compromised, security concerns must be addressed before this delegation.

There are three main entities in the Outsourced Database (ODB) model: the data owner,

the database service provider (a.k.a. server) and the client. In general, many instances of each

entity may exist. In practice, usually there is a single or a few data owners, a few servers, and

many clients. The data owner first creates the database, along with the associated index and

authentication structures and uploads it to the servers. It is assumed that the data owner may

update the database periodically or occasionally, and that the data management and retrieval

happens only at the servers. Clients submit queries about the owner’s data to the servers and get

back results through the network.

It is much cheaper to maintain ordinary servers than to maintain truly secure ones, particularly

in the distributed setting. To guard against malicious/compromised servers, the owner must give


the clients the ability to authenticate the answers they receive without having to trust the servers.

In that respect, query authentication has three important dimensions: correctness, completeness

and freshness. Correctness means that the client must be able to validate that the returned records

do exist in the owner’s database and have not been modified in any way. Completeness means

that no answers have been omitted from the result. Finally, freshness means that the results are

based on the most current version of the database, one that incorporates the latest owner updates. It should be stressed here that query freshness is an important dimension of query authentication that has not been extensively explored in the past: it becomes a requirement only once owners issue updates to the ODB system, an aspect that itself has not been sufficiently studied yet.

There are a number of important costs pertaining to the aforementioned model, relating to

the construction, query, and update phases. In particular, in this work the following metrics are

considered: 1. The computation overhead for the owner, 2. The owner-server communication

cost, 3. The storage overhead for the server, 4. The computation overhead for the server, 5. The

client-server communication cost, and 6. The computation cost for the client (for verification).

Previous work has addressed the problem of query authentication mostly for static scenarios,

where owners never issue data updates. In addition, existing solutions take into account only

a subset of the metrics proposed here, and hence are optimized only for particular scenarios

and not the general case. Finally, previous work was mostly of theoretical nature, analyzing

the performance of the proposed techniques using analytical cost formulas, and not taking into

account the fact that certain cryptographic primitives do not feature idealistic characteristics

in practice. For example, trying to minimize the I/O cost associated with the construction of

an authenticated structure overlooks the fact that generating signatures using popular public-key signature schemes is about two times slower than a random disk page access on today's

computers. To the best of our knowledge, no previous work ever conducted empirical evaluations

on a working prototype of existing techniques.

Our Contributions. In this work, we: 1. Conduct a methodical analysis of existing approaches

over all six metrics, 2. Propose a novel authenticated structure that best leverages all metrics,

3. Formulate detailed cost models for all techniques that take into account not only the usual

structural maintenance overheads, but the cost of cryptographic operations as well, 4. Discuss

the extensions of the proposed techniques for dynamic environments (where data is frequently

updated), 5. Consider possible solutions for guaranteeing query freshness, 6. Implement a fully

working prototype and perform a comprehensive experimental evaluation and comparison of all

alternatives.

We would like to point out that there are other security issues in ODB systems that are

orthogonal to the problems considered here. Examples include: privacy-preservation issues [14, 1,

10], secure query execution [12], security in conjunction with access control requirements [20, 29, 5]

and query execution assurance [30]. In particular, the query execution assurance of [30] does not provide authentication: the server could pass the challenges and yet still return false query results.

The rest of the paper is organized as follows. Section 2 presents background on essential

cryptography tools, and a brief review of related work. Section 3 discusses the authenticated

index structures for static ODB scenarios. Section 4 extends the discussion to the dynamic case

and Section 5 addresses query freshness. Finally, the empirical evaluation is presented in Section

6. Section 7 concludes the paper.


Figure 1: Example of a Merkle hash tree over records r1, ..., r4. Leaves store h1 = H(r1), ..., h4 = H(r4); internal nodes store h12 = H(h1|h2) and h34 = H(h3|h4); the root stores hroot = H(h12|h34), which is signed as stree = S(hroot).

2 Preliminaries

The basic idea of the existing solutions to the query authentication problem is the following. The

owner creates a specialized data structure over the original database that is stored at the servers

together with the database. The structure is used by a server to provide a verification object VO

along with the answers, which the client can use for authenticating the results. Verification usually

occurs by means of using collision-resistant hash functions and digital signature schemes. Note

that in any solution, some information that is authentic to the owner must be made available

to the client; else, from the client’s point of view, the owner cannot be differentiated from a

(potentially malicious) server. Examples of such information include the owner’s public signature

verification key or a token that in some way authenticates the database. Any successful scheme

must make it computationally infeasible for a malicious server to find incorrect query results and a verification object that will be accepted by a client who has the appropriate authentication

information from the owner.

2.1 Cryptography essentials

Collision-resistant hash functions. For our purposes, a hash function h is an efficiently computable function that takes a variable-length input x to a fixed-length output y = h(x). Collision resistance states that it is computationally infeasible to find two inputs x1 ≠ x2 such that h(x1) = h(x2). Collision-resistant hash functions can be built provably based on various cryptographic assumptions, such as hardness of discrete logarithms [17]. However, in this work we concentrate on using heuristic hash functions, which have the advantage of being very fast to evaluate, and specifically focus on SHA1 [24], which takes variable-length inputs to 160-bit (20-byte) outputs. SHA1 is currently considered collision-resistant in practice; we also note that any eventual replacement to SHA1 developed by the cryptographic community can be used instead of SHA1 in our solution.
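The fixed-length output of SHA1 is easy to observe directly; the following Python sketch uses the standard hashlib module (the record strings are arbitrary examples of ours):

```python
import hashlib

# SHA1 maps inputs of any length to a fixed 160-bit (20-byte) digest.
for msg in [b"", b"record r1", b"x" * 10_000]:
    digest = hashlib.sha1(msg).digest()
    assert len(digest) == 20

# Distinct inputs yield distinct digests (with overwhelming probability);
# collision resistance says that deliberately finding two inputs with the
# same digest is computationally infeasible.
d1 = hashlib.sha1(b"record r1").hexdigest()
d2 = hashlib.sha1(b"record r2").hexdigest()
assert d1 != d2
```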

Public-key digital signature schemes. A public-key digital signature scheme, formally defined in [11], is a tool for authenticating the integrity and ownership of the signed message. In such a scheme, the signer generates a pair of keys (SK, PK), keeps the secret key SK secret, and publishes the public key PK associated with her identity. Subsequently, for any message m that she sends, a signature s_m is produced by s_m = S(SK, m). The recipient of s_m and m can verify s_m via V(PK, m, s_m), which outputs "valid" or "invalid." A valid signature on a message assures the recipient that the owner of the secret key intended to authenticate the message, and that the message has not been changed. The most commonly used public digital signature scheme is RSA [28]. Existing solutions [26, 27, 21, 23] for the query authentication problem chose to use this scheme; hence we adopt the common 1024-bit (128-byte) RSA. Its signing and verification cost is one hash computation and one modular exponentiation with a 1024-bit modulus and exponent.


Aggregating several signatures. In the case when t signatures s_1, ..., s_t on t messages m_1, ..., m_t signed by the same signer need to be verified all at once, certain signature schemes allow for more efficient communication and verification than t individual signatures. Namely, for RSA it is possible to combine the t signatures into a single aggregated signature s_{1,t} that has the same size as an individual signature and that can be verified (almost) as fast as an individual signature. This technique is called Condensed-RSA [22]. The combining operation can be done by anyone, as it does not require knowledge of SK; moreover, the security of the combined signature is the same as the security of individual signatures. In particular, aggregation of t RSA signatures can be done at the cost of t − 1 modular multiplications, and verification can be performed at the cost of t − 1 multiplications, t hashing operations, and one modular exponentiation (thus, the computational gain is that t − 1 modular exponentiations are replaced by modular multiplications). Note that aggregating signatures is possible only for some digital signature schemes.
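To make the aggregation arithmetic concrete, here is a minimal Python sketch of the combine and verify steps. It uses textbook full-domain-hash RSA with small hard-coded primes for readability; these parameters and the helper names (H, sign) are illustrative stand-ins of ours, not the padded 1024-bit RSA used by the actual schemes:

```python
import hashlib
from math import prod

# Toy RSA parameters (illustrative only; real Condensed-RSA uses
# 1024-bit or larger moduli and a proper padding scheme).
p, q = 1000000007, 998244353
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent

def H(m: bytes) -> int:
    # "Full-domain" hash: reduce SHA1(m) into Z_n.
    return int.from_bytes(hashlib.sha1(m).digest(), "big") % n

def sign(m: bytes) -> int:
    return pow(H(m), d, n)           # one modular exponentiation per message

msgs = [b"m1", b"m2", b"m3"]
sigs = [sign(m) for m in msgs]

# Aggregation: t-1 modular multiplications; no secret key required,
# so anyone (e.g. the server) can combine the signatures.
s_agg = prod(sigs) % n

# Verification: t hashes, t-1 multiplications, and ONE exponentiation,
# instead of the t exponentiations needed to verify signatures one by one.
assert pow(s_agg, e, n) == prod(H(m) for m in msgs) % n
```

Note that s_agg is a single element of Z_n, i.e. the same size as one individual signature, which is where the communication savings come from.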

The Merkle Hash Tree. An improvement on the straightforward solution for authenticating a set of data values is the Merkle hash tree (see Figure 1), first proposed in [18]. It solves the simplest form of the query authentication problem for point queries and datasets that can fit in main memory. The Merkle hash tree is a binary tree, where each leaf contains the hash of a data value, and each internal node contains the hash of the concatenation of its two children. Verification of data values is based on the fact that the hash value of the root of the tree is authentically published (authenticity can be established by a digital signature). To prove the authenticity of any data value, all the prover has to do is to provide the verifier, in addition to the data value itself, with the values stored in the siblings of the path that leads from the root of the tree to that value. The verifier, by iteratively computing all the appropriate hashes up the tree, at the end can simply check if the hash she has computed for the root matches the authentically published value. The security of the Merkle hash tree is based on the collision resistance of the hash function used: it is computationally infeasible for a malicious prover to fake a data value, since this would require finding a hash collision somewhere in the tree (because the root remains the same and the leaf is different; hence, there must be a collision somewhere in between). Thus, the authenticity of any one of n data values can be proven at the cost of providing and computing log2 n hash values, which is generally much cheaper than storing and verifying one digital signature per data value. Furthermore, the relative position (leaf number) of any of the data values within the tree is authenticated along with the value itself.
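The build/prove/verify cycle described above can be sketched in a few lines of Python, using SHA1 as the hash and assuming, for brevity, a power-of-two number of values (the function names are ours):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build_tree(values):
    """Return all levels of the Merkle tree, leaves first.

    Assumes len(values) is a power of two, for brevity.
    """
    level = [H(v) for v in values]
    levels = [level]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels                     # levels[-1][0] is the root hash

def prove(levels, leaf_idx):
    """Sibling hashes along the leaf-to-root path (the verification object)."""
    proof, idx = [], leaf_idx
    for level in levels[:-1]:
        sibling = idx ^ 1             # sibling index at this level
        proof.append((level[sibling], sibling < idx))
        idx //= 2
    return proof

def verify(value, proof, root):
    h = H(value)
    for sibling, sibling_is_left in proof:
        h = H(sibling + h) if sibling_is_left else H(h + sibling)
    return h == root                  # compare to the published root

records = [b"r1", b"r2", b"r3", b"r4"]
levels = build_tree(records)
root = levels[-1][0]                  # published authentically (e.g. signed)
proof = prove(levels, 2)              # authenticate r3 with log2(n) hashes
assert verify(b"r3", proof, root)
assert not verify(b"r4", proof, root) # wrong value fails verification
```

Note that the proof for n leaves has exactly log2 n entries, matching the cost stated above.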

Cost models for SHA1, RSA and Condensed-RSA. Since all existing authenticated structures are based on SHA1 and RSA, it is imperative to evaluate the relative cost of these operations in order to be able to draw conclusions about which is the best alternative in practice. Based on experiments with two widely used cryptography libraries, Crypto++ [7] and OpenSSL [25], we obtained results for hashing, signing, verifying and performing modular multiplications. Evidently, one hashing operation on our testbed computer takes approximately 2 to 3 µs. Modular multiplication, signing and verifying are, respectively, approximately 100, 10,000 and 1,000 times slower than hashing (verification is faster than signing due to the fact that the public verification exponent can be fixed to a small value).

Thus, it is clear that multiplication, signing and verification operations are very expensive, and comparable to random disk page accesses. The cost of these operations needs to be taken into account in practice, for the proper design of authenticated structures. In addition, since the cost of hashing is orders of magnitude smaller than that of signing, it is essential to design structures that use as few signing operations as possible, using hashing instead.
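The gap is easy to reproduce in rough form; the Python sketch below times one SHA1 hash against one 1024-bit modular exponentiation (the modulus here is an arbitrary stand-in rather than a real RSA key, and the absolute numbers will differ from the Crypto++/OpenSSL measurements reported above):

```python
import hashlib
import time

def avg_time(fn, runs):
    # Average wall-clock time of fn over the given number of runs.
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs

msg = b"x" * 64
n = (1 << 1023) | 12345               # stand-in 1024-bit odd modulus
exponent = (1 << 1023) | 1            # full-size, private-key-style exponent

t_hash = avg_time(lambda: hashlib.sha1(msg).digest(), runs=10000)
t_exp = avg_time(lambda: pow(3, exponent, n), runs=20)

# The exponentiation (one "signing-like" operation) should come out several
# orders of magnitude slower than one hash.
print(f"hash: {t_hash * 1e6:.2f} us, 1024-bit modexp: {t_exp * 1e6:.2f} us")
assert t_exp > t_hash
```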


Table 1: Notation used.

Symbol   Description
r        A database record
k        A B+-tree key
p        A B+-tree pointer
h        A hash value
s        A signature
|x|      Size of object x
ND       Total number of database records
NR       Total number of query results
P        Page size
fx       Node fanout of structure x
dx       Height of structure x
Hl(x)    A hash operation on input x of length l
Sl(x)    A signing operation on input x of length l
Vl(x)    A verifying operation on input x of length l
Cx       Cost of operation x
VO       The verification object

2.2 Previous work

There are several notable works that are related to our problem. A good survey is provided in [23];

our review here is brief. The first set of attempts to address query authentication problems in

ODB systems appeared in [9, 8, 16]. The focus of these works is on designing solutions for query

correctness only, creating structures that are based on Merkle trees. The work of [16] generalized

the Merkle hash tree ideas to work with any DAG (directed acyclic graph) structure. With

similar techniques, the work of [3] uses the Merkle tree to authenticate XML documents in the

ODB model. The work of [27] further extended the idea and introduced the VB-tree which was

suitable for structures stored on secondary storage. However, this approach is expensive and was

later subsumed by [26]. Several proposals for signature-based approaches addressing both query

correctness and completeness appear in [23, 21, 26]. We are not aware of work that specifically

addresses the query freshness issue.

Hardware support for secure data access is investigated in [5, 4]; it offers a promising research direction for designing query authentication schemes when special hardware support is available. Lastly, distributed content authentication has been addressed in [31], where a distributed version of the Merkle hash tree is applied.

3 The Static Case

In this section we illustrate three approaches for query correctness and completeness: a signature-

based approach similar to the ones described in [26, 23], a Merkle-tree-like approach based on

the ideas presented in [16], and our novel embedded tree approach. We present them for the

static scenario where no data updates occur between the owner and the servers on the outsourced

database. We also present analytical cost models for all techniques, given a variety of performance

metrics. As already mentioned, detailed analytical modeling was not considered in related liter-
