Philip A. Bernstein
Microsoft · Microsoft Research
About
302 Publications
61,468 Reads
28,257 Citations
Publications (302)
Every five years, a group of the leading database researchers meets to reflect on the community's impact on the computing industry and to examine current research challenges.
Approximately every five years, a group of database researchers meets for a self-assessment of our community, including reflections on our impact on the industry as well as challenges facing our research community. This report summarizes the discussion and conclusions of the ninth such meeting, held October 9-10, 2018 in Seattle.
Many of today’s interactive server applications are implemented using actor-oriented programming frameworks. Such applications treat actors as a distributed in-memory object-oriented database. However, actor programming frameworks offer few if any database system features, leaving application developers to fend for themselves. It is challenging to...
In large enterprises, data discovery is a common problem faced by users who need to find relevant information in relational databases. In this scenario, schema annotation is a useful tool to enrich a database schema with descriptive keywords. In this paper, we demonstrate Barcelos, a system that automatically annotates corporate databases. Unlik...
Scaling-out a database system typically requires partitioning the database across multiple servers. If applications do not partition perfectly, then transactions accessing multiple partitions end up being distributed, which has well-known scalability challenges. To address them, we describe a high-performance transaction mechanism that uses optimis...
A sequence of storage devices of a data store may include one or more stripesets for storing data stripes of different lengths and of different types. Each data stripe may be stored in a prefix or other portion of a stripeset. Each data stripe may be identified by an array of addresses that identify each page of the data stripe on each included sto...
We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require ne...
Aspects of the subject matter described herein relate to automating evolution of schemas and mappings. In aspects, mappings between a conceptual model and a store model are updated automatically in response to a change that occurs to the conceptual model. For example, when a change occurs to the conceptual model, a local scope of the change is dete...
Architecture that includes an ordered and shared log of indexed transaction records represented as multi-version data structures of nodes and node pointers. The log is a sole monolithic source of datastore state and is used for enforcing concurrency control. The architecture also includes a transaction processing component that appends transaction...
Abstract
Every few years a group of database researchers meets to discuss the state of database research, its impact on practice, and important new directions. This report summarizes the discussion and conclusions of the eighth such meeting, held October 14-15, 2013 in Irvine, California. It observes that Big Data has now become a defining challenge...
A method and system for increasing server cluster availability by requiring at a minimum only one node and a quorum replica set of replica members to form and operate a cluster. Replica members maintain cluster operational data. A cluster operates when one node possesses a majority of replica members, which ensures that any new or surviving cluster...
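The majority requirement described above can be sketched as a simple check (a minimal illustration of the quorum rule, not the patented mechanism; the function name is hypothetical):

```python
def has_quorum(replicas_possessed: int, total_replicas: int) -> bool:
    """A cluster may operate only when one node possesses a majority
    of the replica members that maintain the cluster operational data.
    A strict majority ensures any two quorums overlap, so no two
    disjoint sub-clusters can both operate."""
    return replicas_possessed > total_replicas // 2
```

For example, with a replica set of five members, three form a majority and two do not, so at most one partition of the cluster can ever hold a quorum.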
Computers are provided with a totally ordered, durable shared log. Shared storage is used and can be directly accessed by the computers over a network. Append-log operations are made atomic in the face of failures by committing provisional append ordering information onto a log. The log may comprise multiple flash packages or non-volatile memory de...
In an object-to-relational mapping system (ORM), mapping expressions explain how to expose relational data as objects and how to store objects in tables. If mappings are sufficiently expressive, then it is possible to define lossy mappings. If a user updates an object, stores it in the database based on a lossy mapping, and then retrieves the objec...
We describe a software architecture we have developed for a constructive containment checker of Entity SQL queries defined over extended ER schemas expressed in Microsoft's Entity Data Model. Our application of interest is compilation of object-to-relational mappings for Microsoft's ADO.NET Entity Framework, which has been shipping since 2007. The...
There has been a resurgence of work on replicated, distributed database systems to meet the demands of intermittently-connected clients and of disaster-tolerant databases that span data centers. Many systems weaken the criteria for replica-consistency or isolation, and in some cases add new mechanisms, to improve partition-tolerance, availability,...
Architecture that addresses the efficient detection of conflicts and the merging of data structures such as trees, when possible. The process of detecting conflicts and merging the trees is a meld operation. Confluent trees offer transactional consistency with some degree of isolation, and scaling out a concurrent system based on confluent trees ca...
A shared storage system is described herein that is based on an append-only model of updating a storage device to allow multiple computers to access storage with lighter-weight synchronization than traditional systems and to reduce wear on flash-based storage devices. Appending data allows multiple computers to write to the same storage device with...
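The append-only model above can be sketched minimally: each append is assigned the next offset, so concurrent writers never overwrite one another and nothing is rewritten in place (an illustrative sketch, not the described system's API; class and method names are assumptions):

```python
class AppendLog:
    """Append-only storage: writers never rewrite data in place; each
    append lands at the next offset, which is returned to the caller."""

    def __init__(self):
        self._pages = []

    def append(self, page: bytes) -> int:
        self._pages.append(page)
        return len(self._pages) - 1  # offset where this page landed

    def read(self, offset: int) -> bytes:
        return self._pages[offset]
```

Because no page is ever rewritten, synchronization reduces to agreeing on append order, and flash wear from in-place updates is avoided.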
This paper argues that an algebraic approach to regular languages, such as using monoids, can yield efficient algorithms on strings and trees.
XML is commonly supported by SQL database systems. However, existing mappings of XML to tables can only deliver satisfactory query performance for limited use cases. In this paper, we propose a novel mapping of XML data into one wide table whose columns are sparsely populated. This mapping provides good performance for document types and queries th...
This paper describes a new optimistic concurrency control algorithm for tree-structured data called meld. Each transaction executes on a snapshot of a multiversion database and logs a record with its intended updates. Meld processes log records in log order on a cached partial-copy of the last committed state to determine whether each transaction c...
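The validation step of such an optimistic scheme can be sketched as follows (a simplified illustration of optimistic concurrency control in general, not the meld algorithm itself; meld additionally merges non-conflicting tree updates rather than only aborting):

```python
def validate(txn_read_versions: dict, committed_versions: dict) -> bool:
    """A transaction that executed on a snapshot may commit only if
    nothing it read was changed by a transaction that committed after
    the snapshot was taken."""
    return all(committed_versions.get(key) == version
               for key, version in txn_read_versions.items())
```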
In a paper published in the 2001 VLDB Conference, we proposed treating generic schema matching as an independent problem. We developed a taxonomy of existing techniques, a new schema matching algorithm, and an approach to comparative evaluation. Since then, the field has grown into a major research topic. We briefly summarize the new techniques tha...
Cloud SQL Server is a relational database system designed to scale-out to cloud computing workloads. It uses Microsoft SQL Server as its core. To scale out, it uses a partitioned database on a shared-nothing system architecture. Transactions are constrained to execute on one partition, to avoid the need for two-phase commit. The database is replica...
This article is a summary of the technology issues and challenges of data-intensive science and cloud computing as discussed in the Data-Intensive Science (DIS) workshop in Seattle, September 19-20, 2010.
Hyder supports reads and writes on indexed records within classical multi-step transactions. It is designed to run on a cluster of servers that have shared access to a large pool of network-addressable raw flash chips. The flash chips store the indexed records as a multiversion log-structured database. Log-structuring leverages the high random I/O...
Schema evolution is an unavoidable consequence of the application development lifecycle. The two primary schemas in an application, the conceptual model and the persistent database model, must co-evolve or risk quality, stability, and maintainability issues. We study application-driven scenarios, where the conceptual model changes and the database...
This paper presents an object-oriented representation of the core structural and constraint-related features of XML Schema. The structural features are represented within the limitations of object-oriented type systems, including particles (elements and groups) and type hierarchies (simple and complex types and type derivations). The applicability o...
Modern storage solutions, such as non-volatile solid-state devices, offer unprecedented speed of access over high-bandwidth interconnects. An array of flash memory chips attached directly to a 1-10 GB fiber switch can support up to 100K page writes per second. While no single host can drive such throughput, the combined power of a large group of cl...
This paper presents algorithms that make it possible to process XML data that conforms to XML Schema (XSD) in a mainstream object-oriented programming language. These algorithms are based on our object-oriented view of the core of XSD. The novelty of this view is that it is intellectually manageable for object-oriented programmers while still captu...
Schema evolution is an unavoidable consequence of the application development lifecycle. The two primary schemas in an application, the client conceptual object model and the persistent database model, must co-evolve or risk quality, stability, and maintainability issues. We present MoDEF, an extension to Visual Studio that supports automatic evolu...
Object-relational mapping systems have become often-used tools to provide application access to relational databases. In a database-first development scenario, the onus is on the developer to construct a meaningful object layer for the application, as shipping ORM tools only include database reverse-engineering tools that generate object...
Solid-state disks are currently based on NAND flash and expose a standard disk interface. To accommodate limitations of the medium, solid-state disk implementations avoid rewriting data in place, instead exposing a logical remapping of the physical storage. We present an alternative way to use flash storage, where an append interface is exposed dir...
Publisher Summary: Replication is the technique of using multiple copies of a server or a resource for better availability and performance. Each copy is called a replica. The main goal of replication is to improve availability, since a service is available even if some of its replicas are not. This helps mission-critical services, such as many financ...
Transaction processing (TP) systems often are expected to be available 24 hours per day, 7 days per week, to support around-the-clock business operations. Two factors affect their availability: the mean time between failures (MTBF) and the mean time to repair (MTTR). Improving availability requires increasing MTBF, decreasing MTTR, or both. Compute...
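The relationship between the two factors can be made concrete: steady-state availability is the fraction of time the system is up, MTBF / (MTBF + MTTR), so availability rises by increasing MTBF, decreasing MTTR, or both.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of elapsed time the system
    is operating, i.e. MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# 1,000 hours between failures with a 1-hour repair time gives
# 1000 / 1001, roughly "three nines" (99.9%) of availability.
```

Note that halving MTTR improves availability just as effectively as doubling MTBF, which is why fast repair (e.g., failover to a backup) is central to high-availability TP system design.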
Although transaction processing principles have remained fairly constant during the past 20 years or so, the technologies that implement the principles have been evolving. Recent changes starting to impact transactional middleware products include cloud computing, highly scalable computing designs, solid state memory, and streaming event processing...
The two-phase commit protocol ensures that a transaction either commits at all the resource managers that it accessed or aborts at all of them. It avoids the undesirable outcome that the transaction commits at one resource manager and aborts at another. The protocol is driven by a coordinator, which communicates with participants, which together in...
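The protocol's decision rule can be sketched as follows: the coordinator first collects a vote from every participant, then commits only if all voted yes (a minimal sketch of the two-phase commit decision logic, ignoring timeouts, logging, and failure recovery; the class and method names are illustrative):

```python
class Participant:
    """A resource manager that votes in phase 1 and applies the
    coordinator's decision in phase 2."""

    def __init__(self, vote_yes: bool):
        self.vote_yes = vote_yes
        self.outcome = None

    def prepare(self) -> bool:
        return self.vote_yes

    def finish(self, decision: str) -> None:
        self.outcome = decision


def two_phase_commit(participants) -> str:
    # Phase 1: ask every participant to prepare and vote.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted yes; otherwise abort everywhere.
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        p.finish(decision)
    return decision
```

The all-or-nothing decision is what rules out the undesirable mixed outcome: every participant receives the same verdict.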
An important property of transactions is that they are isolated, which means that the execution of transactions has the same effect as running the transactions serially, one after another, in sequence, with no overlap in executing any two of them. Such an execution is called serializable and this gives each user the easy-to-understand illusion that...
A business process is a set of related tasks that lead to a particular goal. Some business processes automate the execution or tracking of tasks using software. The term workflow is a commonly used synonym for the concept of a business process. The term business transaction is sometimes used as a synonym for a business process or a step within a bu...
Transactional middleware products meet the requirements of multitier transaction processing (TP) applications. Twenty years ago, transactional middleware was delivered to market as a single product category, the TP (or OLTP) monitor. Many of these products are still in production, but the most popular transactional middleware environments are now d...
A transaction processing (TP) application is a serial processor of requests. It is a server that appears to execute an infinite loop whose body is an ACID (atomicity, consistency, isolation, durability) transaction. The processing of simple requests involves receiving a request, routing it to the appropriate application program, and then executing...
This chapter covers major software abstractions needed to make it easy to build reliable transaction processing (TP) applications with good performance: transaction bracketing, threads, remote procedure calls, state management, and scalability techniques. Transaction bracketing offers the programmer commands to start, commit, and abort a transactio...
Queued transaction processing (TP) is an alternative to direct TP that uses a persistent queue between client and server programs. The client enqueues requests and dequeues replies. The server dequeues a request, processes the request, enqueues a reply, and commits; if the transaction aborts, the request is replaced in the queue and can be retried....
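The server side of this queued model can be sketched as a dequeue-process-reply step that puts the request back on abort (a minimal illustration; in a real queued TP system the dequeue, the processing, and the enqueue of the reply execute as one atomic transaction):

```python
from collections import deque

def serve_one(requests: deque, replies: deque, handler) -> None:
    """Dequeue one request, process it, and enqueue the reply.
    If processing aborts, the request is replaced in the queue
    so it can be retried later."""
    request = requests.popleft()
    try:
        replies.append(handler(request))   # process and enqueue reply
    except Exception:
        requests.append(request)           # abort: replace in queue
```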
This chapter provides an overview of transaction processing application and system structure. A transaction is the execution of a program that performs an administrative function by accessing a shared database. Transactions can execute online, while a user is waiting, or off-line (in batch mode) if the execution takes longer than a user can wait fo...
We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine's data warehouse. Our experiments show that traditional schema-...
Developers need to programmatically access persistent XML data. Object-oriented access is often the preferred method. Translating XML data into objects or vice-versa is a hard problem due to the data model mismatch and the difficulty of query translation. We propose a framework that addresses this problem by transforming object-based queries and up...
Many of the largest database-driven web sites use custom webscale data managers (WDMs). On the surface, these WDMs are being applied to problems that are well-suited for relational database systems. Some examples are the following: Map-Reduce [5], Hadoop [7], and Dryad [9] are used to process queries on large data sets using sequential scan and a...
Many of the largest database-driven web sites use custom web-scale data managers (WDMs). On the surface, these WDMs are being applied to problems that are well-suited for relational database systems. Some examples are the following:
• Map-Reduce [5], Hadoop [7], and Dryad [9] are used to process queries on large data sets using sequential scan and...
A group of database researchers, architects, users, and pundits met in May 2008 at the Claremont Resort in Berkeley, CA, to discuss the state of database research and its effects on practice. This was the seventh meeting of this sort over the past 20 years and was distinguished by a broad consensus that the database community is at a turning point...
A model is a formal description of a complex application artifact, such as a database schema, an application interface, a UML model, an ontology, or a message format. The problem of merging such models lies at the core of many metadata applications, such as view integration, mediated schema creation for data integration, and ontology merging. This...
Translating data and data access operations between applications and databases is a longstanding data management problem. We present a novel approach to this problem, in which the relationship between the application data and the persistent storage is specified using a declarative mapping, which is compiled into bidirectional views that drive the d...
We discuss a proposal for the implementation of the model management operator ModelGen, which translates schemas from one model to another, for example from object-oriented to SQL or from SQL to XML schema descriptions. The operator can be used to generate database wrappers (e.g., object-oriented or XML to relational), default user interfaces (e.g....
At the 2008 Computing Research Association Conference at Snowbird, the authors participated in a panel addressing the issue of paper and proposal reviews. This short paper summarizes the panelists' presentations and audience commentary. It concludes with some observations and suggestions on how we might address this issue in the near-term future.
This paper surveys software integration problems, describes information integration tools used in practice, reviews the core technologies of integration tools, and identifies future integration trends. The solution of an integration problem is provided by programs that align data instances, as data formats of the extracted text are identical to thos...
Developers need to access persistent XML data programmatically. Object-oriented access is often the preferred method. Translating XML data into objects or vice-versa is a hard problem due to the data model mismatch and the difficulty of query translation. Our prototype addresses this problem by transforming object-based queries and updates into que...
Model management is a high-level programming language designed to efficiently manipulate schemas and mappings. It is comprised of robust operators that combined in short programs can solve complex metadata-oriented problems in a compact way. For instance, countless enterprise data integration scenarios can be easily expressed in this high-level lan...
We address the problem of generating a mediated schema from a set of relational data source schemas and conjunctive queries that specify where those schemas overlap. Unlike past approaches that generate only the mediated schema, our algorithm also generates view definitions, i.e., source-to-mediated schema mappings. Our main goal is to understand...
This paper presents the first object-oriented interfaces that capture the essence of the structural complexity of XML Schema. We develop two such interfaces: a lightweight object-oriented interface that hides some of the complexity of XML Schema by simplifying the particle and type hierarchies, and a more complete but more complex interface that co...
This paper describes a rule-based algorithm to derive a relational schema from an extended entity-relationship model. Our work is based on an approach by Atzeni and Torlone in which the source EER model is imported into a universal metamodel, a series of transformations are performed to eliminate constructs not appearing in the relational metamodel...
To compare, through their results, two distinct approaches applied to aligning two representations of anatomy.
Both approaches use a combination of lexical and structural techniques. In addition, the first approach takes advantage of domain knowledge, while the second approach treats alignment as a special case of schema matching....
ANSI SQL-92 defines Isolation Levels in terms of phenomena: Dirty Reads, Non-Repeatable Reads, and Phantoms. This paper shows that these phenomena and the ANSI SQL definitions fail to characterize several popular isolation levels, including the standard locking implementations of the levels. Investigating the ambiguities of the phenomena leads to c...
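The first of those phenomena can be illustrated concretely: a Dirty Read occurs when one transaction observes another transaction's uncommitted write (a toy sketch with an explicit uncommitted-write buffer, not a real DBMS; all names are illustrative):

```python
class Store:
    """Committed state plus a buffer of per-transaction uncommitted writes."""

    def __init__(self):
        self.committed = {"x": 0}
        self.uncommitted = {}          # txn id -> pending writes

    def write(self, txn, key, value):
        self.uncommitted.setdefault(txn, {})[key] = value

    def dirty_read(self, key):
        # Returns the latest write even if its transaction has not yet
        # committed -- the Dirty Read phenomenon.
        for pending in self.uncommitted.values():
            if key in pending:
                return pending[key]
        return self.committed[key]

    def rollback(self, txn):
        self.uncommitted.pop(txn, None)


db = Store()
db.write("T1", "x", 99)        # T1 writes x but has not committed
seen = db.dirty_read("x")      # T2 observes the uncommitted value 99
db.rollback("T1")              # T1 aborts: T2 has read data that never existed
```

After the rollback, the committed value of x is still 0, so T2 acted on a value that was never part of any committed state.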
Model management is a generic approach to solving problems of data programmability where precisely engineered mappings are required. Applications include data warehousing, e-commerce, object-to-relational wrappers, enterprise information integration, database portals, and report generators. The goal is to develop a model management engine that can...
Translating data and data access operations between applications and databases is a longstanding data management problem. We present a novel approach to this problem, in which the relationship between the application data and the persistent storage is specified using a declarative mapping, which is compiled into bidirectional views that drive the d...
We present an overview of a tutorial on model management—an approach to solving data integration problems, such as data warehousing, e-commerce, object-to-relational mapping, schema evolution and enterprise information integration. Model management defines a small set of operations for manipulating schemas and mappings, such as Match, Compose,...
We briefly motivate and present a new online bibliography on schema evolution, an area which has recently gained much interest in both research and practice.
Mapping composition is a fundamental operation in metadata driven applications. Given a mapping over schemas S1 and S2 and a mapping over schemas S2 and S3, the composition problem is to compute an equivalent mapping over S1 and S3. We describe a new composition algorithm that targets practical applications. It incorporates view unfolding. It elimi...
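For simple function-like correspondences, the composition operation can be sketched as chaining lookups (a toy illustration only; the mappings in the paper are expressed as constraints over schemas, where composition is substantially harder than this):

```python
def compose(m12: dict, m23: dict) -> dict:
    """Given a mapping from schema S1 to S2 and a mapping from S2 to S3,
    produce the equivalent mapping from S1 to S3, keeping only elements
    whose image in S2 actually maps onward to S3."""
    return {s1: m23[s2] for s1, s2 in m12.items() if s2 in m23}
```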
We describe MIDST, an implementation of the model management operator ModelGen, which translates schemas from one model to another, for example from OO to SQL or from SQL to XSD. It extends past approaches by translating database instances, not just their schemas. The operator can be used to generate database wrappers (e.g. OO or XML to relational)...
This paper discusses technical problems that arise in supporting large-scale 24×7 web services based on experience at MSN with Windows Live™ services. Issues covered include multi-tier architecture, costs of commodity vs. premium servers, managing replicas, managing sessions, use of materialized views, and controlling checkpointing. We finish wi...
Many applications, such as e-commerce, routinely use copies of data that are not in sync with the database due to heuristic caching strategies used to enhance performance. We study concurrency control for a transactional model that allows update transactions to read out-of-date copies. Each read operation carries a "freshness constraint" that spe...
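A freshness constraint of this kind can be sketched as a simple staleness check on a cached copy (an illustrative sketch; the function name and the seconds-based bound are assumptions, not the paper's formal model):

```python
def read_ok(copy_age_seconds: float, max_staleness_seconds: float) -> bool:
    """A read may use a cached copy only if the copy is no more
    out-of-date than the read's freshness constraint allows."""
    return copy_age_seconds <= max_staleness_seconds
```

A bound of zero degenerates to requiring an up-to-date read, while a looser bound trades staleness for cache-hit performance.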
The goal of schema matching is to identify correspondences between the elements of two schemas. Most schema matching systems calculate and display the entire set of correspondences in a single shot. Invariably, the result presented to the engineer includes many false positives, especially for large schemas. The user is often overwhelmed by al...
This paper defines a collection of metrics on manuscript reviewing and presents historical data for ACM Transactions on Database Systems and The VLDB Journal.
We discuss the main features of a multilevel dictionary based on a metamodel approach. The application is an implementation of ModelGen, the model management operator that translates schemas from one model to another, for example from ER to relational or from XSD to object. The dictionary manages schemas and, at a metalevel, a description of th...
Schema matching is the problem of identifying corresponding elements in different schemas. Discovering these correspondences or matches is inherently difficult to automate. Past solutions have proposed a principled combination of multiple algorithms. However, these solutions sometimes perform rather poorly due to the lack of sufficient evidence in...
A customizable and extensible tool is proposed to implement ModelGen, the model management operator that translates a schema from one model to another. A wide family of models is handled, by using a metamodel in which models can be succinctly and precisely described. The approach is novel because the tool exposes the dictionary that stores models,...
Database research conducted at Lowell, focusing on the integration of text, data, and code, the fusion of information from heterogeneous data sources, and information privacy, is discussed. The object-oriented (OO) and object-relational (OR) database management systems (DBMS) showed how text and other data types can be added to a DBMS. Several goals ment...
Model management is an approach to simplify the programming of metadata-intensive applications. It offers developers powerful operators, such as Compose, Diff, and Merge, that are applied to models, such as database schemas or interface specifications, and to mappings between models. Prior model management solutions focused on a simple class of map...
This paper is a short introduction to an industrial session on the use of meta data to address data integration problems in large enterprises. The main topics are data discovery, version and configuration management, and mapping development.
We demonstrate a prototype that translates schemas from a source metamodel (e.g., OO, relational, XML) to a target metamodel. The prototype is integrated with Microsoft Visual Studio 2005 to generate relational schemas from an object-oriented design. It has four novel features. First, it produces instance mappings to round-trip the data between the...
Composition of mappings between schemas is essential to support schema evolution, data exchange, data integration, and other data management tasks. In many applications, mappings are given by embedded dependencies. In this article, we study the issues involved in composing such mappings.
Our algorithms and results extend those of Fagin et al. [2004...
Schema matching identifies elements of two given schemas that correspond to each other. Although there are many algorithms for schema matching, little has been written about building a system that can be used in practice. We describe our initial experience building such a system, a customizable schema matcher called Protoplasm.
This paper describes a research program that exploits a large corpus of database schemas, possibly with associated data and meta-data, to build tools that facilitate the creation, querying and sharing of structured data. The key insight is that given a large corpus, we can discover patterns concerning how designers create structures for representin...
...instead of manipulating data directly. Metadata-based solutions all involve models (schemas) and mappings. Mappings are data transformations, queries, dependencies, ... Model, manipulate, and generate them...
This paper describes how we used generic schema matching algorithms to align the Foundational Model of Anatomy (FMA) and the GALEN Common Reference Model (CRM), two large models of human anatomy. We summarize the generic schema matching algorithms we used to identify correspondences. We present sample results that highlight the similarities and dif...
This paper describes how we used generic schema matching algorithms to align the Foundational Model of Anatomy (FMA) and the GALEN Common Reference Model (CRM), two large models of human anatomy. We summarize the generic schema matching algorithms we used to identify correspondences. We present sample results that highlight the similarities and dif...
Several forces, with impacts so fundamental that they are akin to tectonic plate movements, are driving the commercial database marketplace. First is hardware commoditization: arrays of low-priced computers with high-speed interconnects which yield the new ...
Database system architectures are undergoing revolutionary changes. Most importantly, algorithms and data are being unified by integrating programming languages with the database system. This gives an extensible object-relational system where non-procedural ...
Model management is an approach to simplify the programming of metadata-intensive applications. It offers developers powerful operators, such as Compose, Extract, and Merge, that are applied to models, such as database schemas or interface specifications, and to mappings between models. To be used in practice, these operators need to be implemented...
The difficulty inherent in schema matching has led to the development of several generic match algorithms. We describe how we adapted general approaches to the specific task of aligning two ontologies of human anatomy, the Foundational Model of Anatomy and the GALEN Common Reference Model. Our approach consists of three phases: lexical, structural...
The difficulty inherent in schema matching has led to the development of several generic match algorithms. This paper describes how we adapted general approaches to the specific task of aligning two ontologies of human anatomy, the Foundational Model of Anatomy and the GALEN Common Reference Model. Our approach consists of three phases: lexical, st...