Conference Paper

Data Integration Systems for Scientific Applications


Abstract

The integration of data stemming from heterogeneous sources is an issue that has challenged computer science research for years, if not decades. Consequently, many methods, frameworks, and tools have been and are still being developed, all of which promise to solve the data integration problem. This work describes those we consider most promising and relates them to each other. Since our focus is on scientific applications, we examine properties that are important in this domain, such as data provenance. Aspects such as the extensibility of an approach are also considered.


Conference Paper
Full-text available
Co-creating innovations with external stakeholders, such as customers, is gaining popularity among companies as a way to address the competitive and market pressures they face. To this end, research has brought forward a notable number of customer integration methods. The selection of a particular method is governed by various organizational constraints; there is, however, a paucity of research providing decision support for practitioners in terms of when to use which customer integration method. Using the design science approach, our research addresses this research gap by implementing a decision support system to assist practitioners in the selection of appropriate customer integration methods. We elicit requirements from literature and expert interviews, and subsequently design, implement, and evaluate a prototype of the system. Based on identified requirements, the prototype is implemented as a web-based tool (HTML5). The DSS tool aims to acquaint practitioners with use cases and experiences with different customer integration methods.
Article
Full-text available
Heterogeneous and dirty data is abundant. It is stored under different, often opaque schemata, it represents identical real-world objects multiple times, causing duplicates, and it has missing values and conflicting values. The Humboldt Merger (HumMer) is a tool that allows ad-hoc, declarative fusion of such data using a simple extension to SQL. Guided by a query against multiple tables, HumMer proceeds in three fully automated steps: First, instance-based schema matching bridges schematic heterogeneity of the tables by aligning corresponding attributes. Next, duplicate detection techniques find multiple representations of identical real-world objects. Finally, data fusion and conflict resolution merges duplicates into a single, consistent, and clean representation.
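The three-step pipeline described above can be pictured roughly as follows. This is a minimal, illustrative Python sketch under assumed toy heuristics, not HumMer's actual implementation or its SQL extension; all function names and data are hypothetical.

```python
# Minimal sketch of a three-step fusion pipeline in the spirit of HumMer:
# (1) instance-based schema matching, (2) duplicate detection, (3) data fusion.
# The heuristics are deliberately simplistic stand-ins for the real techniques.

def match_schemas(rel_a, rel_b):
    """Align attributes of two relations by the overlap of their value sets."""
    mapping = {}
    for col_a in rel_a[0]:
        vals_a = {row[col_a] for row in rel_a}
        best, best_overlap = None, 0
        for col_b in rel_b[0]:
            overlap = len(vals_a & {row[col_b] for row in rel_b})
            if overlap > best_overlap:
                best, best_overlap = col_b, overlap
        if best is not None:
            mapping[col_a] = best
    return mapping

def detect_duplicates(rel_a, rel_b, mapping, key):
    """Pair up tuples that agree on a (mapped) key attribute."""
    index = {row[mapping[key]]: row for row in rel_b}
    return [(row, index[row[key]]) for row in rel_a if row[key] in index]

def fuse(pairs, mapping):
    """Merge each duplicate pair, preferring non-null values of the first source."""
    fused = []
    for a, b in pairs:
        merged = {}
        for col, value in a.items():
            if value is None and col in mapping:
                value = b[mapping[col]]
            merged[col] = value
        fused.append(merged)
    return fused

cds = [{"title": "Kind of Blue", "artist": "Miles Davis", "year": 1959},
       {"title": "Blue Train", "artist": "John Coltrane", "year": None}]
albums = [{"name": "Blue Train", "by": "John Coltrane", "released": 1957},
          {"name": "Kind of Blue", "by": "Miles Davis", "released": 1959}]

m = match_schemas(cds, albums)                   # e.g. {'title': 'name', ...}
duplicates = detect_duplicates(cds, albums, m, key="title")
print(fuse(duplicates, m))
```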
Conference Paper
Full-text available
In today’s integrating information systems, data fusion, i.e., the merging of multiple tuples about the same real-world object into a single tuple, is left to ETL tools and other specialized software. While much attention has been paid to architecture, query languages, and query execution, the final step of actually fusing data from multiple sources into a consistent and homogeneous set is often ignored. This paper states the formal problem of data fusion in relational databases and discusses which parts of the problem can already be solved with standard SQL. To bridge the final gap, we propose the SQL Fuse By statement and define its syntax and semantics. A first implementation of the statement in a prototypical database system shows the usefulness and feasibility of the new operator.
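The core idea of merging duplicate tuples with per-attribute conflict resolution can be sketched outside the database as well. The snippet below is an illustrative Python rendering of grouping tuples by an object identifier and applying a resolution function per column; it mimics the spirit of a Fuse By-style operation but does not reproduce the paper's actual SQL syntax or semantics.

```python
from collections import defaultdict
from statistics import mean

def fuse_by(rows, key, resolution):
    """Group rows by `key` and resolve value conflicts column by column.
    `resolution` maps a column name to a function over the non-null values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    fused = []
    for key_value, group in groups.items():
        merged = {key: key_value}
        for col in group[0]:
            if col == key:
                continue
            values = [r[col] for r in group if r.get(col) is not None]
            resolve = resolution.get(col, lambda vs: vs[0])
            merged[col] = resolve(values) if values else None
        fused.append(merged)
    return fused

# Two sources report the same city with a missing and a conflicting value.
rows = [
    {"city": "Berlin", "population": 3_400_000, "country": None},
    {"city": "Berlin", "population": 3_500_000, "country": "Germany"},
    {"city": "Hamburg", "population": 1_800_000, "country": "Germany"},
]
print(fuse_by(rows, key="city",
              resolution={"population": mean, "country": lambda vs: vs[0]}))
```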
Conference Paper
Full-text available
Data intensive applications in Life Sciences extensively use the Hidden Web as a platform for information sharing. Access to these heterogeneous Hidden Web resources is limited through the use of predefined web forms and interactive interfaces that users navigate manually, and assume responsibility for reconciling schema heterogeneity, mediating missing information, extracting information and piping, transforming formats and so on in order to implement desired query sequences or scientific workflows. In this paper, we present a new data management system, called LifeDB, in which we offer support for currency without view materialization and autonomous reconciliation of schema heterogeneity in one single platform through a declarative query language called BioFlow. In our approach, schema heterogeneity is resolved at run time by treating the hidden web resources as a virtual warehouse, and by supporting a set of primitives for data integration on-the-fly, for extracting information and piping to other resources, and for manipulating data in a way similar to traditional database systems to respond to application demands. We also describe BioFlow's support for workflow design and application design using a visual interface called VizBuilder.
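One way to picture "treating hidden web resources as a virtual warehouse" is to wrap each web form as a function that accepts query parameters and yields relational tuples, so results are extracted and piped onward without materializing a local copy. The Python sketch below is purely illustrative; the stubbed resources, field names, and extraction step are hypothetical and do not correspond to LifeDB's or BioFlow's actual API.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
FormWrapper = Callable[[Dict[str, str]], Iterable[Record]]

def make_form_wrapper(submit, extract) -> FormWrapper:
    """Turn a web form into a 'virtual table': submit parameters,
    extract records from the returned page, never store a local copy."""
    def query(params: Dict[str, str]) -> Iterable[Record]:
        page = submit(params)          # e.g. an HTTP form submission in a real system
        yield from extract(page)       # e.g. wrapper-based information extraction
    return query

def pipe(records: Iterable[Record], target: FormWrapper, key: str) -> List[Record]:
    """Feed the output of one resource into the form of another (piping)."""
    results: List[Record] = []
    for rec in records:
        results.extend(target({"id": rec[key]}))
    return results

# Stubbed resources standing in for two hidden-web databases.
gene_db = make_form_wrapper(
    submit=lambda p: f"<row gene='{p['symbol']}' id='G42'/>",
    extract=lambda page: [{"gene": "TP53", "id": "G42"}],
)
pathway_db = make_form_wrapper(
    submit=lambda p: f"<row id='{p['id']}' pathway='apoptosis'/>",
    extract=lambda page: [{"id": "G42", "pathway": "apoptosis"}],
)

print(pipe(gene_db({"symbol": "TP53"}), pathway_db, key="id"))
```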
Conference Paper
Full-text available
Semantic integration in the hidden Web is an emerging area of research where traditional assumptions do not always hold. Frequent changes, conflicts and the sheer size of the hidden Web demand vastly different integration techniques that rely on autonomous detection and heterogeneity resolution, correspondence establishment, and information extraction strategies. In this paper, we present an algebraic language, called Integra, as a foundation for another SQL-like query language called BioFlow, for the integration of Life Sciences data on the hidden Web. The algebra presented here adopts the view that web forms can be treated as user defined functions and the response they generate from the back end databases can be considered as traditional relations or tables. These assumptions allow us to extend the traditional relational algebra to include integration primitives such as schema matching, wrappers, form submission, and object identification as a family of database functions. These functions are then incorporated into the traditional relational algebra operators to extend them in the direction of semantic data integration. To support the well known concepts of horizontal and vertical integration, we also propose two new operators called link and combine. We show that this family of functions can be designed from existing literature and that their implementation is completely orthogonal to our language, in the same way many database technologies are (such as the relational join operation). Finally, we show that for traditional relations without integration, our algebra reduces to classical relational algebra, establishing the latter as a special case of Integra.
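To make horizontal and vertical integration more concrete, the sketch below gives one possible reading of "link" (extend each tuple with attributes about the same object from another resource) and "combine" (append tuples of the same kind from another resource under matched schemas). The operator names come from the abstract, but their signatures and semantics here are simplified assumptions, not the Integra definitions.

```python
def link(left, right, match):
    """Horizontal integration: extend each left tuple with the attributes
    of the first right tuple that `match` identifies as the same object."""
    linked = []
    for l in left:
        extension = next((r for r in right if match(l, r)), {})
        linked.append({**l, **extension})
    return linked

def combine(left, right, correspondences):
    """Vertical integration: append right tuples to left after renaming
    their attributes according to the schema correspondences."""
    renamed = [{correspondences.get(k, k): v for k, v in r.items()} for r in right]
    return left + renamed

proteins_a = [{"accession": "P04637", "name": "Cellular tumor antigen p53"}]
functions_b = [{"accession": "P04637", "function": "tumor suppressor"}]
proteins_c = [{"acc": "P38398", "label": "BRCA1"}]

# link: same object, more attributes (wider tuples)
print(link(proteins_a, functions_b,
           match=lambda l, r: l["accession"] == r["accession"]))
# combine: same kind of object from another source (more tuples)
print(combine(proteins_a, proteins_c,
              correspondences={"acc": "accession", "label": "name"}))
```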
Conference Paper
Full-text available
Model management is a generic approach to solving problems of data programmability where precisely engineered mappings are required. Applications include data warehousing, e-commerce, object-to-relational wrappers, enterprise information integration, database portals, and report generators. The goal is to develop a model management engine that can support tools for all of these applications. The engine supports operations to match schemas, compose mappings, diff schemas, merge schemas, translate schemas into different data models, and generate data transformations from mappings. Much has been learned about model management since it was proposed seven years ago. This leads us to a revised vision that differs from the original in two main respects: the operations must handle more expressive mappings, and the runtime that executes mappings should be added as an important model management component. We review what has been learned from recent experience, explain the revised model management vision based on that experience, and identify the research problems that the revised vision opens up.
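An interface for the operators listed above (match, compose, diff, merge, translate, and transformation generation) might look roughly like the skeleton below. It is a hypothetical sketch meant only to show how such an engine could be organized; it is not the API of any existing model management system.

```python
from abc import ABC, abstractmethod
from typing import Any

# Placeholder types: a Schema describes structure, a Mapping relates two schemas.
Schema = Any
Mapping = Any

class ModelManagementEngine(ABC):
    """Skeleton of a generic model management engine (hypothetical interface)."""

    @abstractmethod
    def match(self, s1: Schema, s2: Schema) -> Mapping:
        """Derive a correspondence (mapping) between two schemas."""

    @abstractmethod
    def compose(self, m1: Mapping, m2: Mapping) -> Mapping:
        """Chain a mapping s1->s2 with a mapping s2->s3 into a mapping s1->s3."""

    @abstractmethod
    def diff(self, s: Schema, m: Mapping) -> Schema:
        """Return the part of schema s not covered by mapping m."""

    @abstractmethod
    def merge(self, s1: Schema, s2: Schema, m: Mapping) -> Schema:
        """Unify two schemas along a mapping into one schema."""

    @abstractmethod
    def translate(self, s: Schema, target_model: str) -> Schema:
        """Re-express a schema in another data model (e.g. XML to relational)."""

    @abstractmethod
    def generate_transformation(self, m: Mapping) -> str:
        """Emit an executable data transformation (e.g. a query) from a mapping."""
```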
Article
Full-text available
The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
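The three goals named in the abstract can be illustrated on two tiny relations: a complete result keeps all objects and all attributes of both sources, a concise result represents each object and attribute only once, and a consistent result resolves conflicting values. The Python toy below uses a simple "prefer the first source" policy and is not one of the relational operators surveyed in the article.

```python
def full_outer_fuse(src_a, src_b, key):
    """Fuse two relations with different schemas on a shared key.
    Completeness: keep every object and every attribute from both sources.
    Conciseness: one tuple per object, one column per attribute.
    Consistency: conflicts resolved here by simply preferring source A."""
    by_key = {}
    for row in src_a + src_b:
        merged = by_key.setdefault(row[key], {})
        for attr, value in row.items():
            if attr not in merged or merged[attr] is None:
                merged[attr] = value        # keep the earlier (preferred) value
    return list(by_key.values())

movies_a = [{"imdb": "tt0111161", "title": "The Shawshank Redemption", "year": 1994}]
movies_b = [{"imdb": "tt0111161", "title": "Shawshank Redemption", "rating": 9.3},
            {"imdb": "tt0068646", "title": "The Godfather", "rating": 9.2}]
print(full_outer_fuse(movies_a, movies_b, key="imdb"))
```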
Article
Full-text available
Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. The survey culminates with an identification of open research problems in the field.
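The main axes of such a taxonomy (why provenance is recorded, what it describes, how it is represented and stored, and how it is disseminated) can be captured as a small descriptor for classifying a provenance system. The data structure below is only an illustrative reading of those axes; the field values are assumptions, not the taxonomy's actual vocabulary.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProvenanceSystemProfile:
    """Classifies a provenance system along four survey-style axes."""
    name: str
    why: List[str] = field(default_factory=list)             # e.g. reproducibility, trust, attribution
    what: List[str] = field(default_factory=list)            # e.g. data products, workflow steps, parameters
    how: List[str] = field(default_factory=list)             # e.g. annotations, inversion, storage format
    dissemination: List[str] = field(default_factory=list)   # e.g. queries, lineage graph browsing, APIs

# Example: profiling a hypothetical workflow system.
profile = ProvenanceSystemProfile(
    name="ExampleWorkflowEngine",
    why=["reproducibility", "result interpretation"],
    what=["derivation history of each data product"],
    how=["annotations stored alongside the data"],
    dissemination=["provenance queries", "lineage graph browsing"],
)
print(profile)
```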
Article
Full-text available
A federated database system (FDBS) is a collection of cooperating database systems that are autonomous and possibly heterogeneous. In this paper, we define a reference architecture for distributed database management systems from system and schema viewpoints and show how various FDBS architectures can be developed. We then define a methodology for developing one of the popular architectures of an FDBS. Finally, we discuss critical issues related to developing and operating an FDBS.
Conference Paper
Full-text available
Scientific workflows in life sciences are usually complex, and use many online databases, analysis tools, publication repositories and customized computation intensive desktop software in a coherent manner to respond to investigative queries. These investigative queries are generally ad hoc, ill-formed, and often, used only once to test a single hypothesis. In such cases, developing customized workflows becomes a major undertaking, rendering the effort truly expensive, prohibitive and resource intensive. Such high development costs often act as deterrents to many interesting queries and promising on-time scientific discoveries. In this paper, we introduce a new query language that combines workflow features for scientific applications, called BioFlow, that exploits many recent developments in internet communication, databases, wrapper and mediator technologies, ontology, and data integration. BioFlow is a declarative language that abstracts these features to help hide most procedural aspects of mediation, data integration, communication protocols, data extraction and workflow details. We will demonstrate that fairly complex workflows can be effortlessly and declaratively expressed in BioFlow in an ad hoc fashion at minimal costs. We also report a prototype implementation of BioFlow in Windows VB .NET that includes most of its powerful and representative features as proof of feasibility of our proposal.
Conference Paper
Full-text available
Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering duplicates among data sets with unaligned schemas is more difficult than in the usual setting, because it is not clear which fields in one object should be compared with which fields in the other. We have developed a new algorithm that efficiently finds the most likely duplicates in such a setting. Now, our schema matching algorithm is able to identify corresponding attributes by comparing data values within those duplicate records. An experimental study on real-world data shows the effectiveness of this approach.
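The idea of exploiting duplicates for schema matching can be sketched as follows: once a pair of records is known (or suspected) to describe the same object, columns whose values agree across that pair become candidate correspondences. This Python toy ignores the paper's efficient duplicate detection step and simply scores all column pairs over given duplicate pairs; names and data are hypothetical.

```python
from collections import Counter
from itertools import product

def match_by_duplicates(duplicate_pairs):
    """Given pairs of records (a, b) believed to describe the same object,
    count value agreements for every column pair and keep the best match
    per column of the first schema."""
    votes = Counter()
    for a, b in duplicate_pairs:
        for col_a, col_b in product(a, b):
            if a[col_a] is not None and a[col_a] == b[col_b]:
                votes[(col_a, col_b)] += 1
    mapping = {}
    for (col_a, col_b), count in votes.most_common():
        if col_a not in mapping and col_b not in mapping.values():
            mapping[col_a] = col_b
    return mapping

# Opaque column names on both sides; the duplicate pairs reveal the alignment.
pairs = [
    ({"c1": "Jane Doe", "c2": "Berlin"},  {"x": "Berlin", "y": "Jane Doe"}),
    ({"c1": "John Roe", "c2": "Hamburg"}, {"x": "Hamburg", "y": "John Roe"}),
]
print(match_by_duplicates(pairs))   # expected: {'c1': 'y', 'c2': 'x'}
```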
Article
Life sciences research is based on individuals, often with diverse skills, assembled into research groups. These groups use their specialist expertise to address scientific problems. The in silico experiments undertaken by these research groups can be represented as workflows involving the co-ordinated use of analysis programs and information repositories that may be globally distributed. With regards to Grid computing, the requirements relate to the sharing of analysis and information resources rather than sharing computational power. The myGrid project has developed the Taverna Workbench for the composition and execution of workflows for the life sciences community. This experience paper describes lessons learnt during the development of Taverna. A common theme is the importance of understanding how workflows fit into the scientists' experimental context. The lessons reflect an evolving understanding of life scientists' requirements on a workflow environment, which is relevant to other areas of data intensive and exploratory science. Copyright © 2005 John Wiley & Sons, Ltd.
Article
Fusionplex is a system for integrating multiple heterogeneous and autonomous information sources that uses data fusion to resolve factual inconsistencies among the individual sources. To accomplish this, the system relies on source features, which are meta-data on the merits of each information source; for example, the recentness of the data, its accuracy, its availability, or its cost. The fusion process is controlled with several parameters: (1) with a vector of feature weights, each user defines an individual notion of data utility; (2) with thresholds of acceptance, users ensure minimal performance of their data, excluding from the fusion process data that are too old, too costly, or lacking in authority, or numeric data that are too high, too low, or obvious outliers; and, ultimately, (3) in naming a particular fusion function to be used for each attribute (for example, average, maximum, or simply any) users implement their own interpretation of fusion. Several simple extensions to SQL are all that is needed to allow users to state these resolution parameters, thus ensuring that the system is easy to use. Altogether, Fusionplex provides its users with powerful and flexible, yet simple, control over the fusion process. In addition, Fusionplex supports other critical integration requirements, such as information source heterogeneity, dynamic evolution of the information environment, quick ad-hoc integration, and intermittent source availability. The methods described in this paper were implemented in a prototype system that provides complete Web-based integration services for remote clients.
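The control parameters described above (feature weights, acceptance thresholds, per-attribute fusion functions) lend themselves to a compact sketch: compute a utility score per source from weighted features, drop sources below a threshold, then fuse attribute by attribute. The snippet illustrates that control flow under assumed feature names; it is not Fusionplex's SQL extension.

```python
from statistics import mean

def fuse_with_features(candidates, weights, min_utility, fusion_functions):
    """candidates: list of (value_record, feature_record) pairs, one per source.
    Utility = weighted sum of source features; sources below the threshold
    are excluded before the per-attribute fusion functions are applied."""
    def utility(features):
        return sum(weights.get(f, 0.0) * v for f, v in features.items())

    accepted = [values for values, features in candidates
                if utility(features) >= min_utility]
    if not accepted:
        return None
    fused = {}
    for attr in accepted[0]:
        values = [rec[attr] for rec in accepted if rec.get(attr) is not None]
        resolve = fusion_functions.get(attr, lambda vs: vs[0])
        fused[attr] = resolve(values) if values else None
    return fused

# Hypothetical feature vectors: recentness and accuracy in [0, 1].
candidates = [
    ({"price": 19.99, "vendor": "A"}, {"recentness": 0.9, "accuracy": 0.8}),
    ({"price": 24.99, "vendor": "B"}, {"recentness": 0.2, "accuracy": 0.9}),
    ({"price": 18.50, "vendor": "C"}, {"recentness": 0.8, "accuracy": 0.3}),
]
weights = {"recentness": 0.6, "accuracy": 0.4}
print(fuse_with_features(candidates, weights, min_utility=0.6,
                         fusion_functions={"price": mean, "vendor": lambda vs: vs[0]}))
```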
Article
The need to understand and manage provenance arises in almost every scientific application. In many cases, information about provenance constitutes the proof of correctness of results that are generated by scientific applications. It also determines the quality and amount of trust one places on the results. For these reasons, the knowledge of provenance of a scientific result is typically regarded to be as important as the result itself. In this paper, we provide an overview of research in provenance in databases and discuss some future research directions. The content of this paper is largely based on the tutorial presented at SIGMOD 2007 (11).
Article
We describe a provenance model tailored to scientific workflows based on the collection-oriented modeling and design paradigm. Our implementation within the Kepler scientific workflow system captures the dependencies of data and collection creation events on preexisting data and collections, and embeds these provenance records within the data stream. A provenance query engine operates on self-contained workflow traces representing serializations of the output data stream for particular workflow runs. We demonstrate this approach in our response to the first provenance challenge.
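Embedding provenance in the data stream can be pictured as interleaving "creation event" records, which reference the items they were derived from, with the data items themselves; a lineage query can then be answered directly over the serialized trace. The sketch below is a loose illustration of that idea under assumed token and trace formats, not the Kepler implementation.

```python
def run_step(stream, step_name, transform):
    """Consume a token stream, emit transformed data tokens, and interleave
    provenance tokens recording which inputs each output was derived from."""
    out = []
    for token in stream:
        out.append(token)                          # pass the existing trace through
        if token["kind"] != "data":
            continue
        new_id = f"{step_name}:{token['id']}"
        out.append({"kind": "data", "id": new_id, "value": transform(token["value"])})
        out.append({"kind": "provenance", "output": new_id,
                    "derived_from": [token["id"]], "step": step_name})
    return out

def lineage(trace, data_id):
    """Answer a lineage query over the self-contained trace."""
    deps = {t["output"]: t["derived_from"] for t in trace if t["kind"] == "provenance"}
    frontier, ancestors = list(deps.get(data_id, [])), []
    while frontier:
        current = frontier.pop()
        ancestors.append(current)
        frontier.extend(deps.get(current, []))
    return ancestors

source = [{"kind": "data", "id": "raw1", "value": 3}]
trace = run_step(run_step(source, "double", lambda v: v * 2), "inc", lambda v: v + 1)
print(lineage(trace, "inc:double:raw1"))   # ['double:raw1', 'raw1']
```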
Article
This paper presents a framework for data integration currently under development in the Squirrel project. The framework is based on a special class of mediators, called Squirrel integration mediators. These mediators can support the traditional virtual and materialized approaches, and also hybrids of them. In the Squirrel mediators, a relation in the integrated view can be supported as (a) fully materialized, (b) fully virtual, or (c) partially materialized (i.e., with some attributes materialized and other attributes virtual). In general, (partially) materialized relations of the integrated view are maintained by incremental updates from the source databases. Squirrel mediators provide two approaches for doing this: (1) materialize all needed auxiliary data, so that data sources do not have to be queried when processing the incremental updates; or (2) leave some or all of the auxiliary data virtual, and query selected source databases when processing incremental updates. The paper p...
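A partially materialized view in this sense keeps some attributes locally and fetches the rest from the sources at query time, while incremental updates from the sources maintain the materialized part. The Python sketch below is a rough, hypothetical rendering of that behaviour, not the Squirrel mediator architecture; the attribute names and source callback are invented for illustration.

```python
class PartiallyMaterializedView:
    """A mediator-style view: some attributes are stored locally (materialized),
    the rest are fetched from the source only when a query asks for them."""

    def __init__(self, materialized_attrs, fetch_virtual):
        self.materialized_attrs = set(materialized_attrs)
        self.fetch_virtual = fetch_virtual      # callback into the source database
        self.store = {}                         # key -> materialized attribute values

    def apply_source_update(self, key, changes):
        """Incrementally maintain the materialized part from a source update."""
        row = self.store.setdefault(key, {})
        for attr, value in changes.items():
            if attr in self.materialized_attrs:
                row[attr] = value

    def query(self, key, attrs):
        """Serve materialized attributes locally; query the source for virtual ones."""
        row = dict(self.store.get(key, {}))
        virtual = [a for a in attrs if a not in self.materialized_attrs]
        if virtual:
            row.update(self.fetch_virtual(key, virtual))
        return {a: row.get(a) for a in attrs}

# Hypothetical callback standing in for a query against the source database.
def fetch_from_source(key, attrs):
    source_row = {"abstract": "...", "full_text_url": "http://example.org/" + key}
    return {a: source_row.get(a) for a in attrs}

view = PartiallyMaterializedView({"title", "year"}, fetch_from_source)
view.apply_source_update("paper-1", {"title": "Squirrel", "year": 1996, "abstract": "..."})
print(view.query("paper-1", ["title", "year", "abstract"]))
```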
R. Hull, G. Zhou: A Framework for Supporting Data Integration using the Materialized and Virtual Approaches. SIGMOD Rec.