# ACM Transactions on Database Systems

Print ISSN: 0362-5915
## Publications
Ontology-based data access is concerned with querying incomplete data sources in the presence of domain-specific knowledge provided by an ontology. A central notion in this setting is that of an ontology-mediated query, which is a database query coupled with an ontology. In this paper, we study several classes of ontology-mediated queries, where the database queries are given as some form of conjunctive query and the ontologies are formulated in description logics or other relevant fragments of first-order logic, such as the guarded fragment and the unary-negation fragment. The contributions of the paper are three-fold. First, we characterize the expressive power of ontology-mediated queries in terms of fragments of disjunctive datalog. Second, we establish intimate connections between ontology-mediated queries and constraint satisfaction problems (CSPs) and their logical generalization, MMSNP formulas. Third, we exploit these connections to obtain new results regarding (i) first-order rewritability and datalog-rewritability of ontology-mediated queries, (ii) P/NP dichotomies for ontology-mediated queries, and (iii) the query containment problem for ontology-mediated queries.

Numerous generalization techniques have been proposed for privacy-preserving data publishing. Most existing techniques, however, implicitly assume that the adversary knows little about the anonymization algorithm adopted by the data publisher. Consequently, they cannot guard against privacy attacks that exploit various characteristics of the anonymization mechanism. This paper provides a practical solution to the above problem. First, we propose an analytical model for evaluating disclosure risks when an adversary knows everything in the anonymization process except the sensitive values. Based on this model, we develop a privacy principle, transparent l-diversity, which ensures privacy protection against such powerful adversaries. We identify three algorithms that achieve transparent l-diversity, and verify their effectiveness and efficiency through extensive experiments with real data.

Differential privacy is a promising privacy-preserving paradigm for statistical query processing over sensitive data. It works by injecting random noise into each query result, such that it is provably hard for the adversary to infer the presence or absence of any individual record from the published noisy results. The main objective in differentially private query processing is to maximize the accuracy of the query results, while satisfying the privacy guarantees. Previous work, notably [LHR+10], has suggested that with an appropriate strategy, processing a batch of correlated queries as a whole achieves considerably higher accuracy than answering them individually. However, to our knowledge there is currently no practical solution to find such a strategy for an arbitrary query batch; existing methods either return strategies of poor quality (often worse than naive methods) or require prohibitively expensive computations for even moderately large domains. Motivated by this, we propose low-rank mechanism (LRM), the first practical differentially private technique for answering batch linear queries with high accuracy. LRM works for both exact (i.e., $\epsilon$-) and approximate (i.e., ($\epsilon$, $\delta$)-) differential privacy definitions. We derive the utility guarantees of LRM, and provide guidance on how to set the privacy parameters given the user's utility expectation. Extensive experiments using real data demonstrate that our proposed method consistently outperforms state-of-the-art query processing solutions under differential privacy, by large margins.
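As context for the accuracy/privacy trade-off above, the naive per-query baseline that batch strategies such as LRM aim to improve on simply adds independent Laplace noise to each query answer. A minimal sketch, assuming counting queries of sensitivity 1 (the function name and interface are ours, not from the paper):

```python
import math
import random

def laplace_mechanism(true_answers, sensitivity, epsilon, rng=None):
    """Answer a batch of queries under epsilon-differential privacy by
    adding independent Laplace(sensitivity/epsilon) noise to each result.
    This is the naive baseline, not the LRM strategy itself."""
    rng = rng or random.Random()
    b = sensitivity / epsilon  # scale of the Laplace distribution

    def noise():
        # Inverse-CDF sampling of Laplace(0, b).
        u = rng.random() - 0.5
        return -b * math.copysign(math.log(1 - 2 * abs(u)), u)

    return [a + noise() for a in true_answers]

# Hypothetical example: three counting queries, each with sensitivity 1.
noisy = laplace_mechanism([120.0, 45.0, 300.0], sensitivity=1.0, epsilon=0.5)
```

Smaller epsilon means stronger privacy and larger noise, which is why answering correlated queries jointly under a good strategy can pay off.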

In this paper, we show how to model and verify a data manager whose algorithm is based on ARIES. The work uses I/O automata as the formal model, and the definition of correctness is one based on the user's view of the database.

Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, a variant of Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding by up to a factor of 3, whereas we can improve prefix coding by up to 80%; these gains come on top of the gains due to lexicographically sorting the table. In a few cases, we prove that the new row reordering is within 10% of optimal at minimizing the runs of identical values within columns.
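To make the run-minimization objective concrete, here is a toy sketch (ours, not one of the paper's heuristics) showing how lexicographic sorting reduces the total number of runs of identical values across columns, which is exactly what run-length encoding rewards:

```python
from itertools import groupby

def runs_per_column(table):
    """Total number of runs of identical values, summed over all columns;
    fewer runs means better run-length encoding."""
    if not table:
        return 0
    ncols = len(table[0])
    return sum(
        sum(1 for _ in groupby(row[c] for row in table))
        for c in range(ncols)
    )

# Toy table of (string, int) rows; lexicographic sorting shortens the runs.
table = [("b", 2), ("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 1)]
before = runs_per_column(table)          # 10 runs in the original order
after = runs_per_column(sorted(table))   # 6 runs after lexicographic sort
```

The TSP-based heuristics in the abstract go further by searching for row orders with even fewer (or longer) runs than the sorted order.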

We present a simple geometric framework for the relational join. Using this framework, we design an algorithm that achieves the fractional hypertree-width bound, which generalizes classical and recent worst-case algorithmic results on computing joins. In addition, we use our framework and the same algorithm to show a series of what are colloquially known as beyond worst-case results. The framework allows us to prove results for data stored in B-trees, multidimensional data structures, and even multiple indices per table. A key idea in our framework is formalizing the inference one does with an index as a type of geometric resolution, transforming the algorithmic problem of computing joins into a geometric problem. Our notion of geometric resolution can be viewed as a geometric analog of logical resolution. In addition to the geometry and logic connections, our algorithm can also be thought of as backtracking search with memoization.

We study two succinct representation systems for relational data based on relational algebra expressions with unions, Cartesian products, and singleton relations: f-representations, which employ algebraic factorisation using distributivity of product over union, and d-representations, which are f-representations where further succinctness is brought by explicit sharing of repeated subexpressions. In particular we study such representations for results of conjunctive queries. We derive tight asymptotic bounds for representation sizes and present algorithms to compute representations within these bounds. We compare the succinctness of f-representations and d-representations for results of equi-join queries, and relate them to fractional edge covers and fractional hypertree decompositions of the query hypergraph. Recent work showed that f-representations can significantly boost the performance of query evaluation in centralised and distributed settings and of machine learning tasks.
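The succinctness gap the abstract refers to can be seen already for a single Cartesian product: a flat relation materializes every combination of tuples, while an f-representation stores one union of singletons per side, joined by a single product. A toy illustration (the relation names are ours):

```python
from itertools import product

A = ["x1", "x2", "x3"]
B = ["y1", "y2", "y3", "y4"]

# Flat representation: materialize every tuple of A x B.
flat = list(product(A, B))            # |A| * |B| = 12 tuples

# f-representation: (x1 u x2 u x3) x (y1 u y2 u y3 u y4), stored as two
# unions of singletons combined by one product node, by distributivity
# of product over union.
f_rep_singletons = len(A) + len(B)    # |A| + |B| = 7 singletons
```

For join results the same idea applies along a hypertree decomposition of the query, which is where the fractional edge cover and hypertree-width bounds in the abstract come from; d-representations additionally share repeated subexpressions.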

The distributed transaction commit problem requires reaching agreement on whether a transaction is committed or aborted. The classic Two-Phase Commit protocol blocks if the coordinator fails. Fault-tolerant consensus algorithms also reach agreement, but do not block whenever any majority of the processes are working. Running a Paxos consensus algorithm on the commit/abort decision of each participant yields a transaction commit protocol that uses 2F+1 coordinators and makes progress if at least F+1 of them are working. In the fault-free case, this algorithm requires one extra message delay but has the same stable-storage write delay as Two-Phase Commit. The classic Two-Phase Commit algorithm is obtained as the special F = 0 case of the general Paxos Commit algorithm.

SPARQL is the W3C candidate recommendation query language for RDF. In this paper we address systematically the formal study of SPARQL, concentrating on its graph pattern facility. For this study we consider a fragment without literals and a simple version of filters which encompasses all the main issues yet is simple to formalize. We provide a compositional semantics, prove the existence of normal forms, prove complexity bounds (among them, that the evaluation of SPARQL patterns is PSPACE-complete), compare our semantics to an alternative operational semantics, give simple and natural conditions under which both semantics coincide, and discuss optimization procedures.

This paper studies the complexity of evaluating functional query languages for complex values such as monad algebra and the recursion-free fragment of XQuery. We show that monad algebra with equality restricted to atomic values is complete for the class TA[2^{O(n)}, O(n)] of problems solvable in linear exponential time with a linear number of alternations. The monotone fragment of monad algebra with atomic value equality but without negation is complete for nondeterministic exponential time. For monad algebra with deep equality, we establish TA[2^{O(n)}, O(n)] lower and exponential-space upper bounds. Then we study a fragment of XQuery, Core XQuery, that seems to incorporate all the features of a query language on complex values that are traditionally deemed essential. A close connection between monad algebra on lists and Core XQuery (with "child" as the only axis) is exhibited, and it is shown that these languages are expressively equivalent up to representation issues. We show that Core XQuery is just as hard as monad algebra w.r.t. combined complexity, and that it is in TC0 if the query is assumed fixed.

We prove exponential lower bounds on the running time of the state-of-the-art exact model counting algorithms—algorithms for exactly computing the number of satisfying assignments, or the satisfying probability, of Boolean formulas. These algorithms can be seen, either directly or indirectly, as building Decision-Decomposable Negation Normal Form (decision-DNNF) representations of the input Boolean formulas. Decision-DNNFs are a special case of d-DNNFs where d stands for deterministic. We show that any knowledge compilation representations from a class (called DLDDs in this article) that contain decision-DNNFs can be converted into equivalent Free Binary Decision Diagrams (FBDDs), also known as Read-Once Branching Programs, with only a quasi-polynomial increase in representation size. Leveraging known exponential lower bounds for FBDDs, we then obtain similar exponential lower bounds for decision-DNNFs, which imply exponential lower bounds for model-counting algorithms. We also separate the power of decision-DNNFs from d-DNNFs and a generalization of decision-DNNFs known as AND-FBDDs. We then prove new lower bounds for FBDDs that yield exponential lower bounds on the running time of these exact model counters when applied to the problem of query evaluation in tuple-independent probabilistic databases—computing the probability of an answer to a query given independent probabilities of the individual tuples in a database instance. This approach to the query evaluation problem, in which one first obtains the lineage for the query and database instance as a Boolean formula and then performs weighted model counting on the lineage, is known as grounded inference. A second approach, known as lifted inference or extensional query evaluation, exploits the high-level structure of the query as a first-order formula. 
Although it has been widely believed that lifted inference is strictly more powerful than grounded inference on the lineage alone, no formal separation has previously been shown for query evaluation. In this article, we show such a formal separation for the first time. In particular, we exhibit a family of database queries for which polynomial-time extensional query evaluation techniques were previously known but for which query evaluation via grounded inference using the state-of-the-art exact model counters requires exponential time.

Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints which derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints which is well-suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog+/- family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process so as to produce UCQ rewritings for an input query that are as small and cost-effective as possible.

XML data projection (or pruning) is a natural optimization for main memory query engines: given a query Q over a document D, the subtrees of D that are not necessary to evaluate Q are pruned, thus producing a smaller document D'; the query Q is then executed on D', avoiding allocating and processing nodes that will never be reached by Q. In this article, we propose a new approach, based on types, that greatly improves current solutions. Besides providing comparable or greater precision and far lower pruning overhead, our solution, unlike current approaches, takes backward axes and predicates into account, and can be applied to multiple queries rather than just to single ones. A side contribution is a new type system for XPath able to handle backward axes. The soundness of our approach is formally proved. Furthermore, we prove that the approach is also complete (i.e., yields the best possible type-driven pruning) for a relevant class of queries and schemas. We further validate our approach using the XMark and XPathMark benchmarks and show that pruning not only improves the performance of the main memory query engine (as expected) but also that of state-of-the-art native XML databases.

We study the problem of validating XML documents of size N against general DTDs in the context of streaming algorithms. The starting point of this work is a well-known space lower bound: there are XML documents and DTDs for which p-pass streaming algorithms require Ω(N/p) space. We show that when allowing access to external memory, there is a deterministic streaming algorithm that solves this problem with memory space O(log²N), a constant number of auxiliary read/write streams, and O(log N) total number of passes on the XML document and auxiliary streams. An important intermediate step of this algorithm is the computation of the First-Child-Next-Sibling (FCNS) encoding of the initial XML document in a streaming fashion. We study this problem independently, and we also provide memory efficient streaming algorithms for decoding an XML document given in its FCNS encoding. Furthermore, validating XML documents encoding binary trees in the usual streaming model without external memory can be done with sublinear memory: there is a one-pass algorithm using O(√N log N) space, and a bidirectional two-pass algorithm using O(log²N) space performing this task.
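The First-Child-Next-Sibling encoding mentioned above maps an ordered unranked tree to a binary tree: each node keeps a pointer to its first child and to its next sibling. A minimal (non-streaming) sketch, assuming trees are given as (label, children) pairs; the representation is ours, for illustration only:

```python
def fcns(node):
    """Encode an ordered tree (label, [children]) into its binary
    First-Child-Next-Sibling form (label, first_child, next_sibling)."""
    def enc(nodes):
        # Encode a sibling list: its head becomes a binary node whose left
        # pointer is the head's first child and whose right pointer is the
        # remaining siblings.
        if not nodes:
            return None
        label, children = nodes[0]
        return (label, enc(children), enc(nodes[1:]))
    return enc([node])

# a has children b and c; c has child d.
tree = ("a", [("b", []), ("c", [("d", [])])])
binary = fcns(tree)
# -> ("a", ("b", None, ("c", ("d", None, None), None)), None)
```

The streaming algorithms in the abstract compute (and invert) this encoding under the much harder constraint of sequential passes and limited memory.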

Physical database design tools rely on a DBA-provided workload to pick an “optimal” set of indexes and materialized views. Such an approach fails to capture scenarios where DBAs are unable to produce a succinct workload for an automated tool but still able to suggest an ideal physical design based on their broad knowledge of the database usage. Unfortunately, in many cases such an ideal design violates important constraints (e.g., space) and needs to be refined. In this paper, we focus on the important problem of physical design refinement, which addresses the above and other related scenarios. We propose to solve the physical refinement problem by using a transformational architecture that is based upon two novel primitive operations, called merging and reduction. These operators help refine a configuration, treating indexes and materialized views in a unified way, as well as succinctly explain the refinement process to DBAs.

Personal data has value both to its owner and to institutions that would like to analyze it. Privacy mechanisms protect the owner's data while releasing noisy versions of aggregate query results to analysts. But such strict protections of the individual's data have not yet found wide use in practice. Instead, Internet companies, for example, commonly provide free services in return for valuable sensitive information from users, which they exploit and sometimes sell to third parties. As awareness of the value of personal data increases, so does the drive to compensate the end-user for her private information. The idea of monetizing private data can improve over the narrower view of hiding private data, since it empowers individuals to control their data through financial means. In this article we propose a theoretical framework for assigning prices to noisy query answers as a function of their accuracy, and for dividing the price amongst data owners who deserve compensation for their loss of privacy. Our framework adopts and extends key principles from both differential privacy and query pricing in data markets. We identify essential properties of the pricing function and micropayments, and characterize valid solutions.

Database replication has traditionally been used as a basic mechanism to increase the availability (by allowing fail-over configurations) and the performance (by eliminating the need to access remote sites) of distributed databases. In spite of the large number of existing protocols which provide data consistency and fault-tolerance [Bernstein et al. 1987], few of these ideas have ever been used in commercial products. There is a strong belief among database designers that most existing solutions are not feasible due to their complexity, poor performance, and lack of scalability. As a result, current products adopt a very pragmatic approach: copies are not kept consistent, updates...

We propose a fine-grained, proxy-based approach for caching dynamic content, deployable in either reverse proxy or forward proxy mode. The dynamic proxy caching technique enables granular, proxy-based caching in a fully distributed mode and allows both the content and the layout to be dynamic. The approach yields significant reductions in bandwidth and response times, up to 3x in real-world dynamic Web applications.

In this paper, we present an access control model in which periodic temporal intervals are associated with authorizations. An authorization is automatically granted in the specified intervals and revoked when such intervals expire. Deductive temporal rules with periodicity and order constraints are provided to derive new authorizations based on the presence or absence of other authorizations in specific periods of time. We provide a solution to the problem of ensuring the uniqueness of the global set of valid authorizations derivable at each instant, and we propose an algorithm to compute this set. Moreover, we address issues related to the efficiency of access control by adopting a materialization approach. The resulting model provides a high degree of flexibility and supports the specification of several protection requirements that cannot be expressed in traditional access control models.

This article proposes a scalable protocol for replication management in large-scale replicated systems. The protocol organizes sites and data replicas into a tree-structured, hierarchical cluster architecture. The basic idea of the protocol is to accomplish the complex task of updating replicated data with a very large number of replicas by a set of related but independently committed transactions. Each transaction is responsible for updating replicas in exactly one cluster and invoking additional transactions for member clusters. Primary copies (one from each cluster) are updated by a cross-cluster transaction. Then each cluster is independently updated by a separate transaction. This decoupled update propagation process results in possible multiple views of replicated data in a cluster. Compared to other replicated data management protocols, the proposed protocol has several unique advantages. First, because each transaction needs to atomically update only the replicas within a single cluster, the protocol significantly reduces the transaction abort rate, which tends to soar in large transactional systems. Second, the protocol improves user-level transaction response time, as top-level update transactions are allowed to commit before all replicas have been updated. Third, read-only queries have the flexibility to see database views of different degrees of consistency and data currency, ranging from global, most up-to-date, and consistent views, to local, consistent, but potentially old views, to local, nearest to users but potentially inconsistent views. Fourth, the protocol maintains its scalability by allowing dynamic system reconfiguration as it grows, by splitting a cluster into two or more smaller ones. Fifth, autonomy of the clusters is preserved as no speci...

A query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or "top" # pages for the query. This top-# query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. For example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. A user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. Processing such top-# queries efficiently is challenging for a number of reasons. One critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. In this paper, we study how to process top- # queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. We present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data.
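The restaurant example above can be sketched in a few lines once all attributes are locally available; the hard part the paper addresses is that, on the web, each attribute probe may be a remote form access. A toy scoring sketch (the weights and field names are ours, purely illustrative):

```python
import heapq
import math

def top_k(restaurants, user_loc, target_price, k=10):
    """Rank restaurants by a combined score of proximity, closeness to the
    target price, and food rating. Weights are illustrative only."""
    def score(r):
        dist = math.dist(user_loc, r["loc"])
        price_gap = abs(r["price"] - target_price)
        return r["rating"] - 0.5 * dist - 0.1 * price_gap
    return heapq.nlargest(k, restaurants, key=score)

restaurants = [
    {"name": "A", "loc": (0.0, 1.0), "price": 20, "rating": 4.5},
    {"name": "B", "loc": (5.0, 5.0), "price": 15, "rating": 4.8},
    {"name": "C", "loc": (0.5, 0.5), "price": 40, "rating": 4.9},
]
best = top_k(restaurants, user_loc=(0.0, 0.0), target_price=20, k=2)
```

The algorithms in the paper aim to return such a top-k answer while minimizing the number of expensive probes to the external sources that hold each attribute.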

ACTA is a comprehensive transaction framework that facilitates the formal description of properties of extended transaction models. Specifically, using ACTA, one can specify and reason about (1) the effects of transactions on objects and (2) the interactions between transactions. This paper presents ACTA as a tool for the synthesis of extended transaction models, one which supports the development and analysis of new extended transaction models in a systematic manner. Here, this is demonstrated by deriving new transaction definitions (1) by modifying the specifications of existing transaction models, (2) by combining the specifications of existing models and (3) by starting from first principles. To exemplify the first, new models are synthesized from atomic transactions and join transactions. To illustrate the second, we synthesize a model that combines aspects of the nested and split transaction models. We demonstrate the latter by deriving the specification of an open nested transac...

This paper addresses the performance of distributed database systems. Specifically, we present an algorithm for dynamic replication of an object in distributed systems. The algorithm is adaptive in the sense that it changes the replication scheme of the object (i.e., the set of processors at which the object is replicated) as changes occur in the read-write pattern of the object (i.e., the number of reads and writes issued by each processor). The algorithm continuously moves the replication scheme towards an optimal one. We show that the algorithm can be combined with the concurrency control and recovery mechanisms of a distributed database management system. The performance of the algorithm is analyzed theoretically and experimentally. Along the way, we provide a lower bound on the performance of any dynamic replication algorithm.

Dyreson and Snodgrass have drawn attention to the fact that, in many temporal database applications, there is often uncertainty about the start time of events, the end time of events, and the duration of events. When the granularity of time is small (e.g., milliseconds), a statement such as "Packet p was shipped sometime during the first 5 days of January, 1998" leads to a massive amount of uncertainty (5×24×60×60×1000 = 432,000,000 possibilities). As noted in Zaniolo et al. [1997], past attempts to deal with uncertainty in databases have been restricted to relatively small amounts of uncertainty in attributes. Dyreson and Snodgrass have taken an important first step towards solving this problem. In this article, we first introduce the syntax of Temporal-Probabilistic (TP) relations and then show how they can be converted to an explicit, significantly more space-consuming form, called Annotated Relations. We then present a theoretical annotated temporal algebra (TATA). Being explicit, TATA is convenient for specifying how the algebraic operations should behave, but is impractical to use because annotated relations are overwhelmingly large. Next, we present a temporal probabilistic algebra (TPA). We show that our definition of the TP-algebra provides a correct implementation of TATA despite the fact that it operates on implicit, succinct TP-relations instead of overwhelmingly large annotated relations. Finally, we report on timings for an implementation of the TP-Algebra built on top of ODBC.

An active database system is a conventional database system extended with a facility for managing active rules (or triggers). Incorporating active rules into a conventional database system has raised considerable interest both in the scientific community and in the commercial world: a number of prototypes that incorporate active rules into relational and object-oriented database system...

Our experimental analysis of several popular XPath processors reveals a striking fact: Query evaluation in each of the systems requires time exponential in the size of queries in the worst case. We show that XPath can be processed much more efficiently, and propose main-memory algorithms for this problem with polynomial-time combined query evaluation complexity. Moreover, we show how the main ideas of our algorithm can be profitably integrated into existing XPath processors. Finally, we present two fragments of XPath for which linear-time query processing algorithms exist and another fragment with linear-space/quadratic-time query processing.

The great commercial success of database systems is partly due to the development of sophisticated query optimization technology: users pose queries in a declarative way using SQL or OQL, and the optimizer of the database system finds a good way (i.e., plan) to execute these queries. The optimizer, for example, determines which indices should be used to execute a query and in which order the operations of a query (e.g., joins and group-bys) should be executed. To this end, the optimizer enumerates alternative plans, estimates the cost of every plan using a cost model, and chooses the p...

First-order formulas allow natural descriptions of queries and rules. Van Gelder's alternating fixpoint semantics extends the well-founded semantics of normal logic programs to general logic programs with arbitrary first-order formulas in rule bodies. However, an implementation of general logic programs through the standard translation into normal logic programs does not preserve the alternating fixpoint semantics. This paper presents a direct method for goal-oriented query evaluation of general logic programs. Every general logic program is first transformed into a normal form where the body of each rule is either an existential conjunction of literals or a universal disjunction of literals. Techniques of memoing and loop checking are incorporated so that termination and polynomial time data complexity are guaranteed for deductive databases (or function-free programs). Soundness and search-space completeness results are established.

Advances in distributed computing and object-orientation have combined to bring about the development of a new class of database systems. These systems employ a client-server computing model to provide both responsiveness to users and support for complex, shared data in a distributed environment. Current relational DBMS products are based on a query-shipping approach in which most query processing is performed at servers; clients are primarily used to manage the user interface. In contrast, object-oriented database systems (OODBMS), whi...

In this paper, we propose a query-planning framework to answer queries in the presence of limited access patterns. In the framework, a query and source descriptions are translated into a recursive datalog program. We then solve optimization problems in this framework, including how to decide whether accessing off-query sources is necessary, how to choose useful sources for a query, and how to test query containment. We develop algorithms to solve these problems, and thus construct an efficient program to answer a query.

1. INTRODUCTION Performance requirements can force an application developer to decompose some transactions into smaller logical units or steps, especially for long-lived transactions.

Many commercial database systems maintain histograms to summarize the contents of large relations and permit efficient estimation of query result sizes for use in query optimizers. Delaying the propagation of database updates to the histogram often introduces errors in the estimation. This paper presents new sampling-based approaches for incremental maintenance of approximate histograms. By scheduling updates to the histogram based on the updates to the database, our techniques are the first to maintain histograms effectively up-to-date at all times and avoid computing overheads when unnecessary. Our techniques provide highly-accurate approximate histograms belonging to the equi-depth and Compressed classes. Experimental results show that our new approaches provide orders of magnitude more accurate estimation than previous approaches. An important aspect employed by these new approaches is a backing sample, an up-to-date random sample of the tuples currently in a relation. We provide efficient solutions for maintaining a uniformly random sample of a relation in the presence of updates to the relation. The backing sample techniques can be used for any other application that relies on random samples of data. 1
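The backing sample described above must remain a uniformly random sample of the relation as tuples arrive. For insert-only workloads, a standard way to achieve this is reservoir sampling (Vitter's Algorithm R); the sketch below is illustrative (the names `Reservoir` and `insert` are not from the paper, and the paper's handling of deletions is omitted):

```python
import random

class Reservoir:
    """Maintain a uniform random sample of size k over an insert-only stream
    (Vitter's Algorithm R). Deletions, which a full backing-sample scheme
    must also handle, are omitted in this sketch."""
    def __init__(self, k, seed=None):
        self.k = k
        self.n = 0              # tuples seen so far
        self.sample = []
        self.rng = random.Random(seed)

    def insert(self, tup):
        self.n += 1
        if len(self.sample) < self.k:
            self.sample.append(tup)
        else:
            # Keep the new tuple with probability k/n, evicting a random victim.
            j = self.rng.randrange(self.n)
            if j < self.k:
                self.sample[j] = tup

r = Reservoir(k=100, seed=42)
for t in range(10_000):
    r.insert(t)
print(len(r.sample))   # sample size stays at k = 100
```

Each of the 10,000 tuples ends up in the sample with equal probability, which is exactly the property an approximate histogram needs from its backing sample.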

Metric access methods (MAMs), such as the M-tree, are powerful index structures for supporting similarity queries on metric spaces, which represent a common abstraction for those search problems that arise in many modern application areas, such as multimedia, data mining, decision support, pattern recognition, and genomic databases. As compared to multi-dimensional (spatial) access methods (SAMs), MAMs are more general, yet they are reputed to lose in flexibility, since it is commonly deemed that they can only answer queries using the same distance function used to build the index. In this paper we show that this limitation is only apparent -- thus MAMs are far more flexible than believed -- and extend the M-tree so as to be able to support user-defined distance criteria, approximate distance functions to speed up query evaluation, as well as dissimilarity functions which are not metrics. The so-extended M-tree, also called QIC-M-tree, can deal with three distinct distances at a time: 1) a query (user-defined) distance, 2) an index distance (used to build the tree), and 3) a comparison (approximate) distance (used to quickly discard from the search uninteresting parts of the tree). We develop an analytical cost model that accurately characterizes the performance of the QIC-M-tree and validate such model through extensive experimentation on real metric data sets. In particular, our analysis is able to predict the best evaluation strategy (i.e., which distances to use) under a variety of configurations, by properly taking into account relevant factors such as the distribution of distances, the cost of computing distances, and the actual index structure. We also prove that the overall saving in CPU search costs when using an approximate distance can be estimated by using information on the data set only -- thus...
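The safe use of a cheap comparison distance rests on the filter-and-refine principle: an approximate distance that lower-bounds the query distance can discard objects without false dismissals. A minimal sketch of that principle (not the QIC-M-tree itself; the names `cheap_lb` and `range_query` are illustrative):

```python
import math

def exact_dist(p, q):
    """Query distance: Euclidean in the plane (stands in for an expensive metric)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def cheap_lb(p, q):
    """Comparison distance: L_inf, which lower-bounds Euclidean distance,
    so discarding on it can never lose a true result."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def range_query(data, q, radius):
    result = []
    for p in data:
        if cheap_lb(p, q) > radius:     # filter: safe to discard
            continue
        if exact_dist(p, q) <= radius:  # refine: exact check only on survivors
            result.append(p)
    return result

pts = [(0, 0), (1, 1), (3, 0), (0, 4)]
print(range_query(pts, (0, 0), 1.5))   # → [(0, 0), (1, 1)]
```

The QIC-M-tree applies the same idea at the level of index subtrees, using the comparison distance to prune whole regions before the query distance is ever computed.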

We present an archiving technique for hierarchical data with key structure. Our approach is based on the notion of timestamps whereby an element appearing in multiple versions of the database is stored only once along with a compact description of versions in which it appears. The basic idea of timestamping was discovered by Driscoll et al. in the context of persistent data structures, where one wishes to track the sequences of changes made to a data structure. We extend this idea to develop an archiving tool for XML data that is capable of providing meaningful change descriptions and can also efficiently support a variety of basic functions concerning the evolution of data, such as retrieval of any specific version from the archive and querying the temporal history of any element. This is in contrast to diff-based approaches, where such operations may require undoing a large number of changes or significant reasoning with the deltas. Surprisingly, our archiving technique does not incur any significant space overhead when contrasted with other approaches. Our experimental results support this and also show that the compacted archive file interacts well with other compression techniques. Finally, another useful property of our approach is that the resulting archive is also in XML and hence can directly leverage existing XML tools.
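The core timestamping idea can be sketched on flat key-value data: each element is stored once, annotated with the set of versions in which it appears, so both version retrieval and per-element history are direct lookups. This is only an illustration of the principle, not the paper's XML tool; all names below are hypothetical:

```python
# Minimal sketch of key-based timestamped archiving: each (key, value) pair is
# stored once, annotated with the set of versions in which it appears.

class Archive:
    def __init__(self):
        self.entries = {}    # (key, value) -> set of version numbers
        self.versions = 0

    def add_version(self, snapshot):
        """snapshot: dict mapping key -> value for one database version."""
        self.versions += 1
        for key, value in snapshot.items():
            self.entries.setdefault((key, value), set()).add(self.versions)

    def retrieve(self, version):
        """Reconstruct a specific version directly, with no delta replay."""
        return {k: v for (k, v), vs in self.entries.items() if version in vs}

    def history(self, key):
        """Temporal history of one element: value -> versions in which it held."""
        return {v: sorted(vs) for (k, v), vs in self.entries.items() if k == key}

arch = Archive()
arch.add_version({"a": 1, "b": 2})
arch.add_version({"a": 1, "b": 3})
print(arch.retrieve(2))        # → {'a': 1, 'b': 3}
print(arch.history("b"))       # → {2: [1], 3: [2]}
```

Note that the unchanged element `("a", 1)` is stored once with timestamp set `{1, 2}`, which is where the space savings over storing full snapshots come from.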

Previous work on superimposed coding is characterized by two aspects. First, it is in general assumed that signatures are generated from logical text blocks of the same size. That is, each block contains the same number of unique terms after stopword and duplicate removal. We call this approach the fixed size block (FSB) method, since each text block has the same size as measured by the number of unique terms contained in it. Second, with only a few exceptions [7,8,9,16], most previous work has assumed that each term in the text contributes the same number of 1's to the signature (i.e., the weight of the term signatures is fixed). The main objective of this paper is to derive an optimal weight assignment which assigns weights to document terms according to their occurrence and query frequencies to minimize the false drop probability. The optimal scheme can account for both uniform and nonuniform occurrence and query frequencies, and the signature generation method is still based on ha...
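In superimposed coding, each term hashes to a few bit positions of a fixed-length signature, the per-block signatures are OR-ed together, and a query matches a block only if all of its query bits are set; a term's "weight" is the number of 1s it contributes, which the paper optimizes per term. A hedged sketch with a fixed weight parameter (the signature length, hash scheme, and all names here are illustrative, not the paper's):

```python
import hashlib

SIG_BITS = 64   # signature length in bits (illustrative choice)

def term_signature(term, weight):
    """Set `weight` bit positions for a term. The paper derives an optimal
    per-term weight from occurrence and query frequencies; here it is just
    a parameter."""
    sig = 0
    for i in range(weight):
        h = hashlib.sha256(f"{term}:{i}".encode()).digest()
        bit = int.from_bytes(h[:4], "big") % SIG_BITS
        sig |= 1 << bit
    return sig

def block_signature(terms, weight=3):
    """Superimpose (bitwise-OR) the term signatures of one text block."""
    sig = 0
    for t in terms:
        sig |= term_signature(t, weight)
    return sig

def may_contain(block_sig, query_terms, weight=3):
    """Signature test: all query bits must be set in the block signature.
    False drops are possible (a hit does not guarantee the terms occur);
    false dismissals are not."""
    q = block_signature(query_terms, weight)
    return block_sig & q == q

sig = block_signature(["database", "index", "query"])
print(may_contain(sig, ["index"]))      # → True
```

Raising a term's weight lowers its false-drop contribution but fills the signature faster, which is precisely the trade-off the optimal weight assignment balances.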

This paper suggests that more powerful data base systems (DBMS) can be built by supporting data base procedures as full fledged data base objects. In particular, allowing fields of a data base to be a collection of queries in the query language of the system is shown to allow complex data relationships to be naturally expressed. Moreover, many of the features present in object-oriented systems and semantic data models can be supported by this facility. In order to implement this construct, extensions to a typical relational query language must be made and considerable work on the execution engine of the underlying DBMS must be accomplished. This paper reports on the extensions for one particular query language and data manager and then gives performance figures for a prototype implementation. Even though the performance of the prototype is competitive with that of a conventional system, suggestions for improvement are presented. 1. INTRODUCTION Most current data base systems store inf...

Communication behavior represents dynamic evolution and cooperation of a group of objects in accomplishing a task. It is an important feature in object-oriented systems. We propose the concept of activity as a basic building block for declarative specification of communication behavior in object-oriented database systems, including the temporal ordering of message exchanges within object communication and the behavioral relationships between activity executions. We formally introduce two kinds of activity composition mechanisms: activity specialization and activity aggregation for abstract implementation of communication behavior. The former is suited for behavioral refinement of existing activities into specialized activities. The latter is used for behavioral composition of simpler activities into complex activities, and ultimately, into the envisaged database system. We use First-Order Temporal Logic as an underlying formalism for specification of communication constraints. The well...

Mobile computing has the potential for managing information globally. Data management issues in mobile computing have received some attention in recent times, and the design of adaptive broadcast protocols has been posed as an important problem. Such protocols are employed by database servers to decide on the content of broadcasts dynamically, in response to client mobility and demand patterns. In this paper we design such protocols and also propose efficient retrieval strategies that may be employed by clients to download information from broadcasts. The goal is to design cooperative strategies between server and client to provide access to information in such a way as to minimize energy expenditure by clients. We evaluate the performance of our protocols both analytically and through simulation. General Terms: Algorithms, Performance.

Emerging distributed query-processing systems support flexible execution strategies in which each query can be run using a combination of data shipping and query shipping. As in any distributed environment, these systems can obtain tremendous performance and availability benefits by employing dynamic data caching. When flexible execution and dynamic caching are combined, however, a circular dependency arises: Caching occurs as a by-product of query operator placement, but query operator placement decisions are based on (cached) data location. The practical impact of this dependency is that query optimization decisions that appear valid on a per-query basis can actually cause suboptimal performance for all queries in the long run. To address this problem, we developed Cache Investment - a novel approach for integrating query optimization and data placement that looks beyond the performance of a single query. Cache Investment sometimes intentionally generates a “suboptimal” plan for a particular query in the interest of effecting a better data placement for subsequent queries. Cache Investment can be integrated into a distributed database system without changing the internals of the query optimizer. In this paper, we propose Cache Investment mechanisms and policies and analyze their performance. The analysis uses results from both an implementation on the SHORE storage manager and a detailed simulation model. Our results show that Cache Investment can significantly improve the overall performance of a system and demonstrate the trade-offs among various alternative policies.

This paper concentrates on query unnesting (also known as query decorrelation), an optimization that, even though it improves performance considerably, is not treated properly (if at all) by most OODB systems. Our framework generalizes many unnesting techniques proposed recently in the literature and is capable of removing any form of query nesting using a very simple and efficient algorithm. The simplicity of our method is due to the use of the monoid comprehension calculus as an intermediate form for OODB queries. The monoid comprehension calculus treats operations over multiple collection types, aggregates, and quantifiers in a similar way, resulting in a uniform way of unnesting queries, regardless of their type of nesting.

Traditional database systems provide a user with the ability to query and manipulate one database state, namely the current database state. However, in several emerging applications, the ability to analyze "what-if" scenarios in order to reason about the impact of an update (before committing that update) is of paramount importance. Example applications include hypothetical database access, active database management systems, and version management, to name a few. The central thesis of the Heraclitus paradigm is to provide flexible support for applications such as these by elevating deltas, which represent updates proposed against the current database state, to be first-class citizens. Heraclitus[Alg,C] is a database programming language that extends C to incorporate the relational algebra and deltas. Operators are provided that enable the programmer to explicitly construct, combine, and access deltas. Most interesting is the when operator, that supports hypothetical access ...

Database design commonly assumes, explicitly or implicitly, that instances must belong to classes. This can be termed the assumption of inherent classification. We argue that the extent and complexity of problems in schema integration, schema evolution, and interoperability are, to a large extent, consequences of inherent classification. Furthermore, we make the case that the assumption of inherent classification violates philosophical and cognitive guidelines on classification and is, therefore, inappropriate in view of the role of data modeling in representing knowledge about application domains. As an alternative, we propose a layered appro...

A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). Schema mappings play a key role in numerous areas of database systems, including database design, information integration, and model management. A fundamental problem in this context is composing schema mappings: given two successive schema mappings, derive a schema mapping between the source schema of the first and the target schema of the second that has the same effect as applying successively the two schema mappings.
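The composition requirement can be made concrete on a toy instance: applying the two mappings in succession must give the same result as applying the composed mapping once. Below, mappings are modeled simply as Python functions over relational instances; the paper itself studies mappings specified by logical dependencies, so this is only an illustration of the problem statement, with all relation names invented:

```python
# Two successive schema mappings m1 (source -> intermediate) and
# m2 (intermediate -> target), and a hand-derived composition that must
# have the same effect in a single step.

def m1(src):
    # Copy Emp(name, dept) into an intermediate EmpDept relation.
    return {"EmpDept": [(name, dept) for name, dept in src["Emp"]]}

def m2(mid):
    # Project away the department into the target WorksIn relation.
    return {"WorksIn": [(name,) for name, _dept in mid["EmpDept"]]}

def composed(src):
    # The composed mapping: source schema of m1 straight to target schema of m2.
    return {"WorksIn": [(name,) for name, _dept in src["Emp"]]}

src = {"Emp": [("alice", "db"), ("bob", "pl")]}
assert m2(m1(src)) == composed(src)
print(composed(src))   # → {'WorksIn': [('alice',), ('bob',)]}
```

The hard part the paper addresses is doing this derivation at the level of mapping specifications, where a composition expressible in the same dependency language need not even exist.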

Schema evolution is a problem that is faced by long-lived data. When a schema changes, existing persistent data can become inaccessible unless the database system provides mechanisms to access data created with previous versions of the schema. Existing systems that support schema evolution focus on changes local to individual types within the schema, thereby limiting the changes that the database maintainer can perform. We have developed a model of type changes incorporating changes local to individual types as well as compound changes involving multiple types. The model describes both type changes and their impact on data by defining derivation rules to initialize new data based on the existing data. The derivation rules can describe local and non-local changes to types to capture the intent of a large class of type change operations. We have built a system called Tess (Type Evolution Software System) that uses this model to recognize type changes by comparing schemas and then produces a transformer that can update data in a database to correspond to a newer version of the schema.

Galileo, a programming language for database applications, is presented. Galileo is a strongly-typed, interactive programming language designed specifically to support semantic data model features (classification, aggregation, and specialization), as well as the abstraction mechanisms of modern programming languages (types, abstract types, and modularization). The main contributions of Galileo are (a) a flexible type system to model database structure and semantic integrity constraints; (b) the inclusion of type hierarchies to support the specialization abstraction mechanisms of semantic data models; (c) a modularization mechanism to structure data and operations into interrelated units; and (d) the integration of abstraction mechanisms into an expression-based language that allows interactive use of the database without resorting to a new stand-alone query language. Galileo will be used in the immediate future as a tool for database design and, in the long term, as a high-level interface for DBMSs.

Gifford's basic Quorum Consensus algorithm for data replication is generalized to accommodate nested transactions and transaction failures (aborts). A formal description of the generalized algorithm is presented using the new Lynch-Merritt input/output automaton model for nested transaction systems. This formal description is used to construct a complete (yet simple) proof of correctness that uses standard assertional techniques and is based on a natural correctness condition. The presentation and proof treat issues of data replication entirely separately from issues of concurrency control and recovery.
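Gifford's basic scheme can be sketched in a few lines: writes go to a write quorum of w replicas with an incremented version number, reads consult a read quorum of r replicas and return the highest-versioned value, and r + w > n guarantees the quorums intersect. A minimal single-object sketch under those assumptions (the nested transactions and aborts that the paper adds are omitted, and all names are illustrative):

```python
import random

# Basic quorum consensus on one replicated object. With R + W > N every read
# quorum intersects every write quorum, so a read always sees the latest write.
N, R, W = 5, 3, 3
assert R + W > N and W + W > N   # write quorums also overlap each other

replicas = [{"version": 0, "value": None} for _ in range(N)]
rng = random.Random(7)

def write(value):
    quorum = rng.sample(range(N), W)
    # Overlapping write quorums ensure this max sees the latest version.
    new_version = max(replicas[i]["version"] for i in quorum) + 1
    for i in quorum:
        replicas[i] = {"version": new_version, "value": value}

def read():
    quorum = rng.sample(range(N), R)
    best = max(quorum, key=lambda i: replicas[i]["version"])
    return replicas[best]["value"]

write("x = 1")
write("x = 2")
print(read())   # → 'x = 2'
```

Whichever replicas the read quorum happens to hit, the intersection property forces at least one of them to carry the version-2 value, which is the invariant the paper's correctness proof generalizes to nested transactions.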

The concept of serializability has been the traditionally accepted correctness criterion in database systems. However, in multidatabase systems (MDBSs), ensuring global serializability is a difficult task. The difficulty arises due to the heterogeneity of the concurrency control protocols used by the participating local database management systems (DBMSs), and the desire to preserve the autonomy of the local DBMSs. In general, solutions to the global serializability problem result in executions with a low degree of concurrency. The alternative, relaxed serializability, may result in data inconsistency. In this article, we introduce a systematic approach to relaxing the serializability requirement in MDBS environments. Our approach exploits the structure of the integrity constraints and the nature of transaction programs to ensure consistency without requiring executions to be serializable. We develop a simple yet powerful classification of MDBSs based on the nature of integrity constraints and transaction programs. For each of the identified models we show how consistency can be preserved by ensuring that executions are two-level serializable (2LSR). 2LSR is a correctness criterion for MDBS environments weaker than serializability. What makes our approach interesting is that unlike global serializability, ensuring 2LSR in MDBS environments is relatively simple and protocols to ensure 2LSR permit a high degree of concurrency. Furthermore, we believe the range of models we consider covers many practical MDBS environments to which the results of this article can be applied to preserve database consistency.

Continuous queries often require significant runtime state over arbitrary data streams. However, streams may exhibit certain data or arrival patterns, or constraints, that can be detected and exploited to reduce state considerably without compromising correctness. Rather than requiring constraints to be satisfied precisely, which can be unrealistic in a data streams environment, we introduce k-constraints, where k is an adherence parameter specifying how closely a stream adheres to the constraint. (Smaller values of k are closer to strict adherence and offer better memory reduction.) We present a query processing architecture, called K-Mon, that detects useful k-constraints automatically and exploits the constraints to reduce run-time state for a wide range of continuous queries. Experimental results show dramatic state reduction, while only modest computational overhead is incurred for our constraint monitoring and query execution algorithms.

Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. Given a source instance, there may be many solutions to the data exchange problem, that is, many target instances that satisfy the constraints of the data exchange problem. In an earlier paper, we identified a special class of solutions that we call universal. A universal solution has homomorphisms into every possible solution, and hence is a "most general possible" solution. Nonetheless, given a source instance, there may be many universal solutions. This naturally raises the question of whether there is a "best" universal solution, and hence a best solution for data exchange.
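The defining property of a universal solution is the existence of a homomorphism into every other solution: labeled nulls may be renamed, constants must stay fixed, and every tuple must be preserved. On tiny instances this condition can be checked by brute force; the sketch below is an illustration of the definition, not an algorithm from the paper (nulls are modeled, by convention here, as strings starting with "N"):

```python
from itertools import product

def is_null(term):
    """Labeled nulls are placeholder values introduced during data exchange."""
    return isinstance(term, str) and term.startswith("N")

def homomorphism_exists(source, target):
    """source, target: sets of tuples over one relation. A homomorphism maps
    each null to some term of the target, keeps constants fixed, and sends
    every source tuple to a target tuple."""
    nulls = sorted({t for tup in source for t in tup if is_null(t)})
    domain = sorted({t for tup in target for t in tup})  # candidate images
    for images in product(domain, repeat=len(nulls)):
        h = dict(zip(nulls, images))
        mapped = {tuple(h.get(t, t) for t in tup) for tup in source}
        if mapped <= set(target):
            return True
    return False

# The null N1 can be renamed to the constant dept7, so the first instance
# maps homomorphically into the second.
universal = {("alice", "N1"), ("N1", "paris")}
solution  = {("alice", "dept7"), ("dept7", "paris"), ("bob", "dept9")}
print(homomorphism_exists(universal, solution))   # → True
```

Because the universal instance commits only to "alice works in some department located in paris", it maps into any solution that realizes that fact, which is what makes it "most general".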

In this paper we present our research on defining a correct semantics for a class of update rule (UR) programs, and discuss implementing these programs in a DBMS environment. Update rules execute by updating relations in a database which may cause the further execution of rules. A correct semantics must guarantee that the execution of the rules will terminate and that it will produce a minimal updated database. The class of UR programs is syntactically identified, based upon a concept that is similar to stratification. We extend the strict definition of stratification and allow a relaxed criterion for partitioning of the rules in the UR program. This relaxation allows a limited degree of non-determinism in rule execution. We define an execution semantics based upon a monotonic fixpoint operator T_UR, resulting in a set of fixpoints for UR. The monotonicity of the operator is maintained by explicitly representing the effect of asserting and retracting tuples in the database. A declarat...
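The basic mechanism behind a monotonic fixpoint operator such as T_UR is iteration of an immediate-consequence step until nothing new is derived. A minimal sketch for positive propositional rules (the paper's explicit representation of retractions is omitted, and the rule encoding here is invented for illustration):

```python
# Naive fixpoint iteration of an immediate-consequence operator for positive
# rules. Rules are (head, body) pairs over propositional atoms.

def fixpoint(facts, rules):
    """Iterate T(db) = db ∪ {head | body ⊆ db} until no change.
    Monotonicity of T guarantees the iteration reaches a least fixpoint."""
    db = set(facts)
    while True:
        derived = {head for head, body in rules if set(body) <= db}
        if derived <= db:
            return db
        db |= derived

rules = [
    ("reach_b", ["edge_ab", "reach_a"]),
    ("reach_c", ["edge_bc", "reach_b"]),
]
print(sorted(fixpoint({"reach_a", "edge_ab", "edge_bc"}, rules)))
# → ['edge_ab', 'edge_bc', 'reach_a', 'reach_b', 'reach_c']
```

The termination argument is the same one the paper relies on: each iteration only adds facts from a finite universe, so a monotone operator must reach a fixpoint.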
