Book

Uncertain Schema Matching

Authors: Avigdor Gal

Abstract

Schema matching is the task of providing correspondences between concepts describing the meaning of data in various heterogeneous, distributed data sources. Schema matching is one of the basic operations required by the process of data and schema integration, and thus has a great effect on its outcomes, whether these involve targeted content delivery, view integration, database integration, query rewriting over heterogeneous sources, duplicate data elimination, or automatic streamlining of workflow activities that involve heterogeneous data sources. Although schema matching research has been ongoing for over 25 years, more recently a realization has emerged that schema matchers are inherently uncertain. Since 2003, work on the uncertainty in schema matching has picked up, along with research on uncertainty in other areas of data management. This lecture presents various aspects of uncertainty in schema matching within a single unified framework. We introduce basic formulations of uncertainty and provide several alternative representations of schema matching uncertainty. Then, we cover two common methods that have been proposed to deal with uncertainty in schema matching, namely ensembles, and top-K matchings, and analyze them in this context. We conclude with a set of real-world applications. Table of Contents: Introduction / Models of Uncertainty / Modeling Uncertain Schema Matching / Schema Matcher Ensembles / Top-K Schema Matchings / Applications / Conclusions and Future Work
... The most widely covered application areas include schema matching (see e.g. [7,17,18]) and ontology matching (e.g. [19,20,21]). ...
... Given an optimal alignment σ between an activity set A and a sentence set S, a predictor quantifies the probability that σ does not contain correct correspondences. This notion of a predictor is inspired by corresponding notions used to analyze alignments in the context of schema and process model matching [17,40]. The core premise underlying the predictors is that the similarity scores in the optimal alignments have different characteristics for consistent and inconsistent model-text pairs. ...
... Therefore, we also define predictors that consider a similarity score relative to other scores in a similarity matrix between A and S. The underlying notion is commonly applied in schema matching in the form of a dominates property (see e.g. [17,42]). These predictors build on the premise that a sentence is more likely to describe an activity a if it is more similar to a than other sentences are. ...
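To make the dominance idea above concrete, here is a minimal Python sketch (assuming NumPy and a precomputed activity-by-sentence similarity matrix); the function name and the normalization by row/column maxima are illustrative choices, not the exact predictors of [17,42].

```python
import numpy as np

def dominance_predictor(sim: np.ndarray) -> np.ndarray:
    """Score how strongly each (activity, sentence) similarity dominates
    the other entries in its row and column.

    sim[i, j] is the similarity between activity i and sentence j.
    Returns a matrix of the same shape with values in [0, 1]:
    1.0 means the entry is the clear maximum of both its row and column.
    """
    row_max = sim.max(axis=1, keepdims=True)
    col_max = sim.max(axis=0, keepdims=True)
    # Avoid division by zero for all-zero rows/columns.
    denom = np.maximum(np.maximum(row_max, col_max), 1e-12)
    return sim / denom

if __name__ == "__main__":
    sim = np.array([[0.9, 0.2, 0.1],
                    [0.3, 0.4, 0.35]])
    print(dominance_predictor(sim).round(2))
```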
Article
Many organizations maintain textual process descriptions alongside graphical process models. The purpose is to make process information accessible to various stakeholders, including those who are not familiar with reading and interpreting the complex execution logic of process models. Despite this merit, there is a clear risk that model and text become misaligned when changes are not applied to both descriptions consistently. For organizations with hundreds of different processes, the effort required to identify and clear up such conflicts is considerable. To support organizations in keeping their process descriptions consistent, we present an approach to automatically identify inconsistencies between a process model and a corresponding textual description. Our approach detects cases where the two process representations describe activities in different orders and detects process model activities not contained in the textual description. A quantitative evaluation with 53 real-life model-text pairs demonstrates that our approach accurately identifies inconsistencies between model and text.
... Schema matching research has been going on for more than 30 years now, focusing on designing high quality matchers, automatic tools for identifying correspondences among database attributes. Initial heuristic attempts (e.g., COMA [11]) were followed by theoretical grounding (e.g., see [16,5]). ...
... , b_m}, respectively. A matching process matches S and S′ by aligning their attributes using matchers that utilize matching cues such as attribute names, instance data, and schema structure (see surveys, e.g., [6] and books, e.g., [16]). A matcher's output is conceptualized as a similarity matrix M(S, S′) (M for short), whose entry m_{i,j} (typically a real number in [0, 1]) represents a degree of similarity between a_i ∈ S and b_j ∈ S′. ...
... Matchers can be separated into first-line matchers (1LMs), which are applied directly to the problem and return a similarity matrix, and second-line matchers (2LMs), which are applied to the outcome of other matchers, receiving similarity matrices and returning a similarity matrix. [16] compares attribute names to identify syntactically similar attributes (e.g., using edit distance and soundex). WordNet uses abbreviation expansion and tokenization methods to generate a set of related words for matching attribute names. ...
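To illustrate the 1LM/2LM distinction, the following Python sketch uses a name-based first-line matcher (difflib's SequenceMatcher as a stand-in for edit-distance-style similarity) and a simple thresholding second-line matcher; both matcher choices and the example schemas are illustrative assumptions.

```python
from difflib import SequenceMatcher
import numpy as np

def name_similarity_1lm(attrs_s, attrs_t):
    """First-line matcher: compare attribute names and return a similarity matrix."""
    m = np.zeros((len(attrs_s), len(attrs_t)))
    for i, a in enumerate(attrs_s):
        for j, b in enumerate(attrs_t):
            m[i, j] = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return m

def threshold_2lm(sim, tau=0.6):
    """Second-line matcher: receive a similarity matrix and return another one,
    here by zeroing out entries below a threshold."""
    return np.where(sim >= tau, sim, 0.0)

if __name__ == "__main__":
    S = ["cardNum", "city", "arrivalDay"]      # hypothetical source attributes
    T = ["clientNum", "city", "checkInDate"]   # hypothetical target attributes
    sim = name_similarity_1lm(S, T)
    print(threshold_2lm(sim).round(2))
```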
Chapter
The schema matching problem is at the basis of integrating structured and semi-structured data. Being investigated in the fields of databases, AI, semantic Web and data mining for many years, the core challenge still remains the ability to create quality matchers, automatic tools for identifying correspondences among data concepts (e.g., database attributes). In this work, we investigate human matcher behavior using a new concept termed match consistency and introduce a novel use of cognitive models to explain human matcher performance. Using empirical evidence, we further show that human matchers suffer from predictable biases when matching schemata, which prevent them from providing consistent matchings.
... Human effort is usually needed to determine the right matches between concepts or to correct semi-automatic or automatic concept alignments. Schema mappings are inherently uncertain [4]. Alignment systems turn out to be uncertain when concepts are neither completely similar nor dissimilar [5]. ...
... Schema matching is an important operation that provides correspondences between concepts of various heterogeneous sources like relational databases, XML schemas, catalogs, directories, etc. [4]. An excellent review of schema matching is presented in [9]. ...
... Data integration of heterogeneous sources has become relevant in the last decades for health, E-business, and Semantic Web domains among others. The process of finding schema mappings is inherently uncertain [4]; mappings identified by automatic or semi-automatic tools can never be 100% certain [3]. Dong et al. [10] state that data integration systems must handle uncertainty at three levels: uncertain schema mapping, uncertain data, and uncertain queries. ...
Conference Paper
Full-text available
The need to share and reuse information has grown in the new era of the Internet of things and ubiquitous computing. Researchers in ontology and schema matching use mapping approaches in order to achieve interoperability between heterogeneous sources. The use of multiple similarity measures that take into account lexical, structural and semantic properties of the concepts is often found in schema matching for the purpose of data integration, sharing and reusing. Mappings identified by automatic or semi-automatic tools can never be certain. In this paper, we present a fuzzy-based approach to combine different similarity measures to deal with scenarios where ambiguity of terms hinders the process of alignment and adds uncertainty to the match.
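As a rough illustration of combining lexical, structural, and semantic similarity scores with fuzzy aggregation operators, the sketch below uses the minimum as a t-norm, the maximum as a t-conorm, and an illustrative weighted blend; the weights and the aggregation rule are assumptions, not the membership functions of the cited approach.

```python
def fuzzy_and(*scores):
    """T-norm (minimum): all measures must agree for a strong match."""
    return min(scores)

def fuzzy_or(*scores):
    """T-conorm (maximum): one confident measure is enough."""
    return max(scores)

def combined_similarity(lexical, structural, semantic, w=(0.5, 0.2, 0.3)):
    """Weighted fuzzy combination of three similarity measures in [0, 1]."""
    weighted = w[0] * lexical + w[1] * structural + w[2] * semantic
    # Temper the weighted average with the pessimistic and optimistic views.
    return (fuzzy_and(lexical, structural, semantic) + weighted
            + fuzzy_or(lexical, structural, semantic)) / 3

if __name__ == "__main__":
    # Ambiguous term: high lexical similarity, weak structural/semantic evidence.
    print(round(combined_similarity(0.9, 0.3, 0.4), 2))
```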
... Schema matching research has been going on for more than 30 years now, focusing on designing high quality matchers, automatic tools for identifying correspondences among database attributes. Initial heuristic attempts (e.g., COMA [20] and Similarity Flooding [43]) were followed by theoretical grounding (e.g., see [13,21,28]). ...
... , respectively. A matching model matches S and S′ by aligning their attributes using matchers that utilize matching cues such as attribute names, instances, schema structure, etc. (see surveys, e.g., [15] and books, e.g., [28]). ...
... 1LMs receive (typically two) schemata and return a matching matrix, in which each entry captures the similarity between attributes a_i and b_j. 2LMs receive (one or more) matching matrices and return a matching matrix using some function [28]. Among the 2LMs, we term decision makers those that return a binary matrix as output, from which a match is derived by maximizing this function, as a solution to Problem 1. ...
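The decision-maker notion can be sketched as follows in Python, using SciPy's Hungarian algorithm (linear_sum_assignment) to enforce a 1:1 constraint while maximizing the sum of selected similarities; the sum objective is one possible instantiation of the aggregation function, chosen here purely for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def decision_maker_2lm(sim: np.ndarray) -> np.ndarray:
    """Return a binary matrix selecting a 1:1 match that maximizes
    the sum of similarities of the chosen correspondences."""
    rows, cols = linear_sum_assignment(sim, maximize=True)
    binary = np.zeros_like(sim, dtype=int)
    binary[rows, cols] = 1
    return binary

if __name__ == "__main__":
    sim = np.array([[0.8, 0.7, 0.1],
                    [0.9, 0.6, 0.2]])
    print(decision_maker_2lm(sim))
    # Greedily picking 0.9 and then 0.7 yields the same assignment here,
    # but the optimal and greedy strategies can differ on other matrices.
```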
Preprint
Full-text available
Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure) with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch, which makes use of a deep learning mechanism to calibrate and filter human matching decisions, adhering to the quality of a match; these are then combined with algorithmic matching to generate better match results. We provide empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
... Table 1a is the outcome of a matching process, using some matcher. The similarity matrix represented by Table 1b is a binary similarity matrix, generated as the outcome of a decision maker matcher [21]. This matcher enforces a binary decision while requiring participation of each attribute in at most one correspondence. ...
... Several classifications of matching steps have been proposed over the years (see e.g., [12,25]). Following [21], we separate matchers into those that are applied directly to the problem [first-line matchers (1LM)] and those that are applied to the outcome of other matchers [second-line matchers (2LM)]. 1LMs receive a matched pair and return a similarity matrix. ...
... Overall/Accuracy [33] and HSR [15] assume that matching problems are followed by a manual effort to validate correspondences and are therefore suggested as better measures of post-match effort. A thorough comparison of the use of precision, recall, and their derivatives in the context of schema matching can be found in [21, Ch. 3.4]. ...
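For reference, a small Python sketch of the set-based measures mentioned above, computed over correspondence sets. Overall follows its usual definition Recall·(2 − 1/Precision), HSR is omitted, and the example correspondences are made up.

```python
def evaluate(match: set, reference: set):
    """Compare a produced correspondence set against an exact (reference) match.

    Returns precision, recall, F1, and Overall (a post-match-effort measure
    that penalizes false positives more heavily than precision does).
    """
    correct = len(match & reference)
    precision = correct / len(match) if match else 0.0
    recall = correct / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    overall = recall * (2 - 1 / precision) if precision > 0 else 0.0
    return precision, recall, f1, overall

if __name__ == "__main__":
    produced = {("cardNum", "clientNum"), ("city", "city"), ("day", "checkInDate")}
    exact = {("cardNum", "clientNum"), ("city", "city")}
    print(evaluate(produced, exact))  # (0.67, 1.0, 0.8, 0.5), rounded
```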
Article
Full-text available
The evolution of data accumulation, management, analytics, and visualization has led to the coining of the term big data, which challenges the task of data integration. This task, common to any matching problem in computer science, involves generating alignments between structured data in an automated fashion. Historically, set-based measures, based upon binary similarity matrices (match/non-match), have dominated evaluation practices of matching tasks. However, in the presence of big data, such measures no longer suffice. In this work, we propose evaluation methods for non-binary matrices as well. Non-binary evaluation is formally defined together with several new, non-binary measures using a vector space representation of matching outcomes. We provide empirical analyses of the usefulness of non-binary evaluation and show its superiority over its binary counterparts in several problem domains.
... In its origin, schema matching was regarded as a preliminary step to schema mapping in the data integration process [1]. However, schema matching (and the related field of ontology alignment) has proved to hold both theoretic and practical appeal and enjoys a continued interest by researchers and practitioners (See surveys [2], [3], [4] and books [5], [6] for an overview). In recent years, several challenges and opportunities have presented themselves. ...
... al. [12]. Uncertain schema matching [6] concedes that schema matching is an inherently uncertain process and maintains the uncertainty recorded by the schema matching process. ...
Conference Paper
Full-text available
Schema matching problems have historically been defined as a semi-automated task in which correspondences are generated by matching algorithms and subsequently validated by a single human expert. Emerging alternative models are based upon piecemeal human validation of algorithmic results and the usage of crowd-based validation. We propose an alternative model in which human and algorithmic matchers are given more symmetric roles. Under this model, better insight into the respective strengths and weaknesses of human and algorithmic matchers is required. We present initial insights from a pilot study we conducted and outline future work in this area.
... To this end, we introduce an approach that automatically relates a textual PPI description to the relevant parts of a process model. We shall refer to this relation as an alignment, following the terminology used to describe relations between concepts from different artifacts in contexts such as schema matching [5] and process model matching [4]. An alignment consists of a number of pair-wise correspondences between the PPI and process model elements. ...
... The process model describes the request for change process as implemented by the IT Department of the Andalusian Health Service. The process starts when a requester submits a request for change (RFC). Then, the planning & quality manager analyzes the request in order to make a decision on its approval. ...
Conference Paper
Full-text available
To determine whether strategic goals are met, organizations must monitor how their business processes perform. Process Performance Indicators (PPIs) are used to specify relevant performance requirements. The formulation of PPIs is typically a managerial concern. Therefore, considerable effort has to be invested to relate PPIs, described by management, to the exact operational and technical characteristics of business processes. This work presents an approach to support this task, which would otherwise be a laborious and time-consuming endeavor. The presented approach can automatically establish links between PPIs, as formulated in natural language, and operational details, as described in process models. To do so, we employ machine learning and natural language processing techniques. A quantitative evaluation on the basis of a collection of 173 real-world PPIs demonstrates that the proposed approach works well.
... Mapping. Given matching scores between pairs of columns, we need to find the schema mapping Γ(m_S, m_T) between S and T. We find this schema mapping using integer linear programming [26], much as in prior work [16]. Formally, we denote x_ij as the binary variable indicating if s_i ∈ S is matched to t_j ∈ T, according to some attribute similarity function that may take schema or data into account [27,39]. ...
... Here, we assume that each attribute in S can be matched to at most one attribute in T and vice versa. Thus, the objective is to find a mapping Γ(m_S, m_T) satisfying these constraints that maximizes the total matching score (Eq. 16). Finally, we use a greedy algorithm proposed by Papadimitriou [35] to get the best schema mapping. ...
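A greedy 1:1 mapping step of the kind referred to above can be sketched as follows in Python: repeatedly select the highest remaining pairwise score whose source and target attributes are still unmatched. This is an illustrative stand-in under the stated 1:1 assumption, not the exact algorithm of [35].

```python
import numpy as np

def greedy_one_to_one(score: np.ndarray):
    """Greedy approximation of the maximum-score 1:1 mapping.

    score[i, j] is the matching score between attribute s_i of S and
    attribute t_j of T. Returns a list of (i, j) pairs in which every
    row and column appears at most once.
    """
    mapping = []
    used_rows, used_cols = set(), set()
    # Visit candidate pairs from the highest score downward.
    for flat in np.argsort(score, axis=None)[::-1]:
        i, j = np.unravel_index(flat, score.shape)
        if i in used_rows or j in used_cols:
            continue
        mapping.append((int(i), int(j)))
        used_rows.add(i)
        used_cols.add(j)
    return mapping

if __name__ == "__main__":
    score = np.array([[0.8, 0.7, 0.1],
                      [0.9, 0.6, 0.2]])
    print(greedy_one_to_one(score))  # [(1, 0), (0, 1)]
```

A minimum-score threshold could be added to stop the loop early and avoid forcing low-confidence correspondences into the mapping.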
Conference Paper
Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.
... There are two main reasons for that. Automatic matchers (e.g., COMA [5] and BigGorilla [4]) are unable to overcome the inherent uncertainty in the matching process due to ambiguity and heterogeneity of data description concepts [6]. Challenged by an abundance of possible correspondences, their choices may be erroneous. ...
... A matching process applies matching algorithms to align schemata terms using similarity functions that are deduced from data source characteristics, e.g., term labels and domain constraints. Such output is conceptualized as a similarity matrix, whose entry M_ij (typically a real number in [0, 1]) represents the similarity between a_i ∈ S and b_j ∈ S′ [6]. From a similarity matrix, a candidate correspondences set (candidate set for short) can be derived to represent a final match or to be validated by the crowd. ...
Conference Paper
Full-text available
We present InCognitoMatch, the first cognitive-aware crowdsourcing application for matching tasks. InCognitoMatch provides a handy tool to validate, annotate, and correct correspondences using the crowd whilst accounting for human matching biases. In addition, InCognitoMatch enables system administrators to control the context information visible to workers and analyze their performance accordingly. For crowd workers, InCognitoMatch is an easy-to-use application that may be accessed from multiple crowdsourcing platforms. In addition, workers completing a task are offered suggestions for follow-up sessions according to their performance in the current session. For this demo, the audience will be able to experience InCognitoMatch through three use-cases, interacting with the system as workers and as administrators.
... Then, agents exchange partial mappings with each other (through costly communication) and gradually converge to a new mapping of improved quality through aggregation and filtering of partial mappings. In the aggregation and filtering process, we assume that each agent, upon receiving a partial mapping, is able (using one of the quality metrics in [17]) to evaluate whether replacing its corresponding local partial mapping leads to an improvement in the schema mapping quality. Moreover, a certain partial mapping might violate a 1-to-1 constraint in the local schema mapping or lead to an inconsistency by creating incorrect attribute-correspondence circles. ...
... The received instance of a mapping μ is taken into consideration by a rational agent if it results in a schema mapping of higher quality, as compared to the one currently used (by employing one of the quality metrics in [17], e.g., average string similarity); other- ...
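One simple instantiation of such a quality metric, average string similarity over a partial mapping's correspondences, is sketched below in Python; the accept-or-keep rule and the name-pair representation are illustrative assumptions.

```python
from difflib import SequenceMatcher

def average_string_similarity(mapping):
    """Quality metric for a (partial) schema mapping, given as a list of
    (source_attribute, target_attribute) name pairs."""
    if not mapping:
        return 0.0
    sims = [SequenceMatcher(None, a.lower(), b.lower()).ratio() for a, b in mapping]
    return sum(sims) / len(sims)

def accept_received(local, received):
    """Replace the local partial mapping only if the received one scores higher."""
    if average_string_similarity(received) > average_string_similarity(local):
        return received
    return local

if __name__ == "__main__":
    local = [("clientName", "custName"), ("zip", "postalCode")]
    received = [("clientName", "customerName"), ("zipCode", "postalCode")]
    print(accept_received(local, received))
```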
Article
Full-text available
Today’s complex online applications often require the interaction of multiple (web) services that belong to potentially different business entities. Interoperability is a core element of such an environment, yet not a straightforward one due to the lack of common data semantics. The problem is often approached by means of standardization procedures in a top-down manner with limited adoption in practice. (De facto) standards for semantic interoperability most commonly emerge in a bottom-up approach, i.e., involving the interaction and information exchange among self-interested industrial agents. In this paper, we argue that the emergence of semantic interoperability can be seen as an economic process among rational agents and, although interoperability can be mutually beneficial for the involved parties, it may also be costly and might fail to emerge. As a sample scenario, we consider the emergence of semantic interoperability among rational web service agents in service-oriented architectures (SOAs), and we analyze their individual economic incentives with respect to utility, risk and cost. We model this process as a positive-sum game and study its equilibrium and evolutionary dynamics. According to our analysis, which is also experimentally verified, certain conditions on the communication cost, the cost of technological adaptation, the expected mutual benefit from interoperability, as well as the expected loss from isolation, drive the process.
... Recently, some new ideas have been introduced into the schema matching area, like crowd and uncertainty [18], etc. Chen et al. [47] think that humans are good at understanding data represented in various forms, and crowdsourcing platforms are making the human annotation process more affordable. Thus, they show how to utilize the crowd to find the right matching. ...
... In the probabilistic mapping, an attribute can be matched to several attributes, and each correspondence will be associated with a probability. Avigdor Gal et al. [18] address various aspects of uncertainty in schema matching within a single unified framework and present basic formulations of uncertainty in schema matching. Moreover, they also introduce some real-world applications related to uncertain schema matching. ...
Article
Full-text available
Schema matching is a critical step in numerous database applications such as web data source integration, data warehouse loading and information exchange among several authorities. In this paper, we propose to exploit the similarities of the SQL statements in the query logs to find the correspondences between attributes in the schemas to be matched. We discover three kinds of similarities which benefit schema matching, that is, the similarity of the clauses themselves, the similarity of the frequency of clauses occurring in different SQL statements and the similarity of statistics about the relationship among clauses. We combine the clauses related to the similarities into a graph, and then transform the task of matching attributes into the problem of matching the graphs. Through matching the graphs, we obtain a set of attribute sequence pairs with similarity scores. Actually, each sequence pair represents a set of correspondences. Next, we exploit techniques from the quadratic programming field to decompose the sequence pairs into correspondences, that is, to obtain the similarity score of each correspondence. Finally, an efficient method is used to choose the best correspondence for each attribute from the candidate set. The experimental study shows that the proposed approach is effective and its combination with other matchers has good performance.
... Note that even though many matchers generate solely simple one-to-one correspondences, our formulation does not preclude handling of one-to-many or many-to-many relations, which may be represented by the Cartesian product of the respective elements. We refer to correspondences generated by automatic matchers as candidate correspondences since there is no guarantee that they are indeed correct [35,38]. Given two distinct models (A, R), (A′, R′) ∈ S, we write C_{A,A′} ⊆ A × A′ to denote the set of all candidate correspondences returned by automatic matchers. ...
... In this paper, we use results from matching tools as input for our approach. Uncertain matching for conceptual models has been studied in many works, see, for instance, [25,76,35,75,41]. Yet, our approach is the first to consider integrity constraints defined for a matching network to assess the correctness probabilities of correspondences and guide the reconciliation by an expert user. ...
Technical Report
Full-text available
Conceptual models such as database schemas, ontologies or process models have been established as a means for effective engineering of information systems. Yet, for complex systems, conceptual models are created by a variety of stakeholders, which calls for techniques to manage consistency among the different views on a system. Techniques for model matching generate correspondences between the elements of conceptual models, thereby supporting effective model creation, utilization, and evolution. Although various automatic matching tools have been developed for different types of conceptual models, their results are often incomplete or erroneous. Automatically generated correspondences, therefore, need to be reconciled, i.e., validated by a human expert. We analyze the reconciliation process in a network setting, where a large number of conceptual models need to be matched. Then, the network induced by the generated correspondences shall meet consistency expectations in terms of mutually reinforcing relations between the correspondences. We develop a probabilistic model to identify the most uncertain correspondences in order to guide the expert's validation work. We also show how to construct a set of high-quality correspondences, even if the expert does not validate all generated correspondences. We demonstrate the efficiency of our techniques for real-world datasets in the domains of schema matching and ontology alignment.
... Recently, there have been a few attempts to tackle process model matching [1,3,4,5]. Typically, the developed matchers relied on the rich literature of schema and ontology matching [6,7] with emphasis on string comparison and graph matching. Evaluating the outcome of these works shows that the empirical quality is subject to high variability even within a single dataset. ...
... For two process models with A_1 and A_2 as their sets of activities, process model matching aims at identifying activity correspondences that represent the same behaviour in both models. Following Gal [6], we subdivide the matching process into first and second line matching. A first line matcher operates on the process models, compares some of their attributes such as activity labels or the process structure, and produces a similarity matrix M(A_1, A_2) over activities with |A_1| rows and |A_2| columns. ...
Chapter
Full-text available
Process model matching refers to the task of creating correspondences among activities of different process models. This task is crucial whenever comparison and alignment of process models are called for. In recent years, there have been a few attempts to tackle process model matching. Yet, evaluating the obtained sets of correspondences reveals high variability in the results. Addressing this issue, we propose a method for predicting the quality of results derived by process model matchers. As such, prediction serves as a case-by-case decision making tool in estimating the amount of trust one should put into automatic matching. This paper proposes a model of prediction for process matching based on both process properties and preliminary match results.
... It specifically focuses on discovering subsumption correspondences. SMB (Schema Matcher Boosting) is an approach to combining matchers into ensembles (Gal, 2011). It is based on a machine learning technique called boosting, which is able to select (presumably the most appropriate) matchers that participate in an ensemble. ...
... Although these tools comprise a significant step towards fulfilling the vision of automated schema matching, it has become obvious that the user must accept a degree of imperfection in this process. A prime reason for this is the enormous ambiguity and heterogeneity of data description concepts: It is unrealistic to expect a single mapping engine to identify the correct mapping for any possible concept in a set [3], [4]. Schema matching and mapping concepts are often confused and discussed under the single name "Schema Matching". ...
Conference Paper
Full-text available
Schema matching and mapping are important tasks for many applications, such as data integration, data warehousing and e-commerce. Many algorithms and approaches were proposed to deal with the problem of automatic schema matching and mapping. In this work, we describe how the schema matching problem can be modelled and simulated with agents, where each agent learns, reasons and acts to find the best match among the attributes of the other schema. Many differences exist between our approach and existing practice in schema matching. First and foremost, our approach is based on the Agent-Based Modeling and Simulation (ABMS) paradigm, while, as far as we know, none of the current methods use the ABMS paradigm. Second, the agent's decision-making and reasoning process leverages probabilistic (Bayesian) models for matching prediction and action selection (planning). The results we obtained so far are very encouraging and reinforce our belief that many intrinsic properties of our model, such as simulations, stochasticity and emergence, contribute efficiently to the increase of matching quality and thus the decrease of matching uncertainty.
... Based on Def. 3, we do not try to combine the results from multiple automatic matchers with a math formula to get a synthesized result to weigh the similarity between the elements. It is because although combination of the outcomes from multiple automatic matchers can be utilized in one matching process to get better results, such combination is also uncertain [8]. It also cannot assure the correctness of combination results. ...
Article
In recent years, the rapid development of the Internet of Things has received wide attention from social and academic circles. However, if there is no unified standard to store and process the huge amounts of data, the systems remain highly independent and interconnection is difficult to realize. This paper researches the design and implementation of a data platform based on Internet of Things technology. We first analyze the data sources and features to understand the platform requirements. Then we propose the data platform scheme with the function and performance requirements considered. It focuses on the problems of resource identification and addressing, resource description and management, and data storage, processing and analysis. With the data platform, the resources in the Internet of Things system are managed in a unified way, which improves the system's openness, access and transmission capability and thus makes the system more flexible and open. However, the current design scheme can be improved in performance and safety in future research.
... Although the proposed schema matching tools comprise a significant step towards fulfilling the vision of automated schema matching, it has become obvious that the user must accept a degree of imperfection in this process. A prime reason for this is the enormous ambiguity and heterogeneity of data description concepts: It is unrealistic to expect a single mapping engine to identify the correct mapping for any possible concept in a set [4], [5]. ...
Conference Paper
Full-text available
Many algorithms and approaches were proposed to deal with the problem of automatic schema matching and mapping. Yet, managing uncertainty and complexity in schema matching still remains an open question. The challenges and difficulties caused by the complexity characterising the process of schema matching motivated us to investigate how the application of a bio-inspired emerging paradigm can lead us to understand, manage and ultimately overcome the inherent uncertainty in this process. The central idea of our work is to consider the process of matching as a complex adaptive system and model it using the approach of agent-based modeling and simulation. The aim is to exploit the intrinsic properties of agent-based models, such as emergence, stochasticity and self-organization, to help provide answers to better manage the complexity and uncertainty of schema matching. Keywords— schema matching; schema mapping; complex adaptive systems; agent-based modelling and simulation; bayesian networks; machine learning
... Approximate computing has been investigated in the literature as a response to various problems: time efficiency, as in approximation algorithms where finding an optimal solution can take combinatorial time [27], and full integration, as in uncertain schema matching with the realization that matchers are inherently uncertain [15]. We proposed in previous work an approximate approach to event processing that leverages probabilistic matching of events [21,24]. ...
Conference Paper
Event-based systems follow an interaction model based on three decoupling dimensions: space, time, and synchronization. However, event producers and consumers are tightly coupled by event semantics: types, attributes, and values. That limits scalability in large-scale heterogeneous environments with significant variety, such as the Internet of Things (IoT), due to difficulties in establishing semantic agreements at such scales. This paper studies this problem and investigates the suitability of different traditional and emerging approaches for tackling the issue.
... Although these tools comprise a significant step towards fulfilling the vision of automated schema matching, it has become obvious that the user must accept a degree of imperfection in this process. A prime reason for this is the enormous ambiguity and heterogeneity of data description concepts: It is unrealistic to expect a single mapping engine to identify the correct mapping for any possible concept in a set [1], [2]. In this paper, we propose a novel Agent-based Modeling and Simulation approach for the Schema Matching problem called "Schema Matching Agent-based Simulation" (SMAS). ...
... In many systems finding such a schema matching is an early step in building a schema mapping. Although these tools comprise a significant step towards fulfilling the vision of automated schema matching, it has become obvious that the user must accept a degree of imperfection in this process [1]. ...
Conference Paper
Full-text available
In this demo, we present the implementation of a novel Agent-based Modelling and Simulation approach for the Schema Matching problem called "Schema Matching Agent-based Simulation" (SMAS). Our solution aims at generating high quality schema matchings with minimum uncertainty. As far as we know, there is no previous literature describing a solution that approaches the Automatic Schema Matching and Mapping problem from the angle of Agent-Based Modelling and Simulation.
... In recent years, uncertain schema-matching research has gained more attention with the realization that matchers are inherently uncertain [Gal 2011]. Statistically monotonic matchers may assign a slightly lower similarity than they should to mappings of a given precision; matching with top-K mappings thus becomes a potential solution to this [Gal 2006]. ...
Article
Event processing follows a decoupled model of interaction in space, time, and synchronization. However, another dimension of semantic coupling also exists and poses a challenge to the scalability of event processing systems in highly semantically heterogeneous and dynamic environments such as the Internet of Things (IoT). Current state-of-the-art approaches of content-based and concept-based event systems require a significant agreement between event producers and consumers on event schema or an external conceptual model of event semantics. Thus, they do not address the semantic coupling issue. This article proposes an approach where participants only agree on a distributional statistical model of semantics represented in a corpus of text to derive semantic similarity and relatedness. It also proposes an approximate model for relaxing the semantic coupling dimension via an approximation-enabled rule language and an approximate event matcher. The model is formalized as an ensemble of semantic and top-k matchers along with a probability model for uncertainty management. The model has been empirically validated on large sets of events and subscriptions synthesized from real-world smart city and energy management systems. Experiments show that the proposed model achieves more than 95% F1-score of effectiveness and thousands of events/sec of throughput for medium degrees of approximation while not requiring users to have complete prior knowledge of event semantics. In semantically loosely-coupled environments, one approximate subscription can compensate for hundreds of exact subscriptions to cover all possibilities in environments which require complete prior knowledge of event semantics. Results indicate that approximate semantic event processing could play a promising role in the IoT middleware layer.
... In computer science, there is an abundant literature on schema matching, with many approaches and tools described, e.g. [18]-[21]. An early review of the literature is [22], and an updated version is [23]. However, our focus is different, in that it applies insights from schema matching and machine learning to the problem of enterprise integration. ...
Conference Paper
Full-text available
Today, enterprise integration and cross-enterprise collaboration are becoming ever more important. The Internet of Things, digitization and globalization are pushing continuous growth in the integration market. However, setting up integration systems today is still largely a manual endeavour. Most probably, future integration will need to leverage more automation in order to keep up with demand. This paper presents a first version of a system that uses tools from artificial intelligence and machine learning to ease the integration of information systems, aiming to automate parts of it. Three models are presented and evaluated for precision and recall using data from real, past integration projects. The results show that it is possible to obtain F0.5 scores in the order of 80% for models trained on a particular kind of data, and in the order of 60%-70% for less specific models trained on several kinds of data. Such models would be valuable enablers for integration brokers to keep up with demand and obtain a competitive advantage. Future work includes fusing the results from the different models, and enabling continuous learning from an operational production system.
... • Concept or mapping discovery: identify a new concept or a new mapping using inductive reasoning and techniques from schema matching, taking into account aspects of uncertainty [15]. Based on the result, the domain ontology and the mappings can be augmented. ...
Article
A data ecosystem (DE) offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. In this work, we focus on requirements and challenges that DEs face when ensuring data transparency. Requirements are derived from the data and organizational management, as well as from broader legal and ethical considerations. We propose a novel knowledge-driven DE architecture, providing the pillars for satisfying the analyzed requirements. We illustrate the potential of our proposal in a real-world scenario. Last, we discuss and rate the potential of the proposed architecture in the fulfillment of these requirements.
... The foundations of our approach can be found in the fields of ontology and schema matching [29,30]. ...
Article
In recent years, a considerable number of process model matching techniques have been proposed. The goal of these techniques is to identify correspondences between the activities of two process models. However, the results from the Process Model Matching Contest 2015 reveal that there is still no universally applicable matching technique and that each technique has particular strengths and weaknesses. It is hard or even impossible to choose the best technique for a given matching problem. We propose to cope with this problem by running an ensemble of matching techniques and automatically selecting a subset of the generated correspondences. To this end, we propose a Markov Logic based optimization approach that automatically selects the best correspondences. The approach builds on an adaption of a voting technique from the domain of schema matching and combines it with process model specific constraints. Our experiments show that our approach is capable of generating results that are significantly better than alternative approaches.
... We formally define process model matching based upon notions from [11]. For any pair of event class sets {E_1, E_2}, a matching task creates an n × n similarity matrix M(E_1, E_2) over E_1 × E_2. ...
Conference Paper
Full-text available
Process model matching provides the basis for many process analysis techniques such as inconsistency detection and process querying. The matching task refers to the automatic identification of correspondences between activities in two process models. Numerous techniques have been developed for this purpose, all of which share a focus on process-level information. In this paper we introduce instance-based process matching, which specifically focuses on information related to instances of a process. In particular, we introduce six similarity metrics that each use a different type of instance information stored in the event logs associated with processes. The proposed metrics can be used as standalone matching techniques or to complement existing process model matching techniques. A quantitative evaluation on real-world data demonstrates that the use of information from event logs is essential in identifying a considerable amount of correspondences.
... A plethora of matching techniques have been developed and applied in various fields, including schema matching (cf. [13], [51], [52]), ontology alignment (cf. [15], [53], [54]), and process model matching (cf. [20], [55], [56]). ...
... Following Gal [14], we can subdivide the matching process into first line matching and second line matching. A first line matcher takes the sets of activities A_1 and A_2 from the process models as input and produces a similarity matrix M(A_1, A_2) with |A_1| rows and |A_2| columns. ...
Conference Paper
Full-text available
Process model matching techniques aim at automatically identifying activity correspondences between two process models that represent the same or similar behavior. By doing so, they provide essential input for many advanced process model analysis techniques such as process model search. Despite their importance, the performance of process model matching techniques is not yet convincing and several attempts to improve the performance have not been successful. This raises the question of whether it is really not possible to further improve the performance of process model matching techniques. In this paper, we aim to answer this question by conducting two consecutive analyses. First, we review existing process model matching techniques and give an overview of the specific technologies they use to identify similar activities. Second, we analyze the correspondences of the Process Model Matching Contest 2015 and reflect on the suitability of the identified technologies to identify the missing correspondences. As a result of these analyses, we present a list of three specific recommendations to improve the performance of process model matching techniques in the future.
... This fragmentation might require a repeated alignment of information from all relevant parties operating on the blockchain. Work on matching could represent a promising starting point to solve this problem [Cayoglu et al. 2014;Euzenat and Shvaiko 2013;Gal 2011]. There is both the risk and opportunity of conducting process mining on blockchain data. ...
Article
Full-text available
(Note that we have updated the paper to the accepted version on 23 Jan 2018) Blockchain technology offers a sizable promise to rethink the way inter-organizational business processes are managed because of its potential to realize execution without a central party serving as a single point of trust (and failure). To stimulate research on this promise and the limits thereof, in this paper we outline the challenges and opportunities of blockchain for Business Process Management (BPM). We structure our commentary alongside two established frameworks, namely the six BPM core capabilities and the BPM lifecycle, and detail seven research directions for investigating the application of blockchain technology to BPM.
... This fragmentation might require a repeated alignment of information from all relevant parties operating on the blockchain. Work on matching could represent a promising starting point to solve this problem (Cayoglu et al. 2014;Euzenat and Shvaiko 2013;Gal 2011). There is both the risk and opportunity of conducting process mining on blockchain data. ...
Chapter
Blockchain technology bears the potential to support the execution of inter-organizational business processes in an efficient way. Furthermore, it addresses various notorious problems of collaboratively designing choreographies and overcoming lack of trust. In this paper, we discuss this potential in more detail and highlight several research challenges that future research has to address towards generic blockchain support for inter-organizational business processes in various application scenarios.
... In fact, the task of establishing event-to-activity mappings is conceptually equivalent to matching tasks found in the fields of schema matching and process matching. Such matching tasks have been shown to be inherently uncertain [14,28]. Due to this uncertainty, the goal of mapping techniques becomes choosing the best mapping from a number of potential ones [18]. ...
Conference Paper
Full-text available
A crucial requirement for compliance-checking techniques is that observed behavior, captured in event traces, can be mapped to the process models that specify allowed behavior. Without a mapping, it is not possible to determine if observed behavior is compliant or not. A considerable problem in this regard is that establishing a mapping between events and process model activities is an inherently uncertain task. Since the use of a particular mapping directly influences the compliance of a trace to a specification, this uncertainty represents a major issue for compliance checking. To overcome this issue, we introduce a probabilistic compliance-checking method that can deal with uncertain mappings. Our method avoids the need to select a single mapping, but rather works on a spectrum of possible mappings. A quantitative evaluation demonstrates that our method can be applied on a considerable number of real-world processes where traditional compliance-checking methods fail.
Conference Paper
Schema matching is a prime problem in the data integration domain. Some well-known automated tools have been provided to accomplish the task of schema matching, but the results generated by these tools are often uncertain. The uncertainty is universally inherent because of the inability of schemas to fully capture the semantics of the represented data. We propose a semi-automatic approach to reduce the uncertainty of schema matching. Our experimental results show that the approach is able to reduce the uncertainty and improve precision and recall.
Conference Paper
Schema and ontology matching is a process of establishing correspondences between schema attributes and ontology concepts, for the purpose of data integration. Various commercial and academic tools have been developed to support this task. These tools provide impressive results on some datasets. However, as the matching is inherently uncertain, the developed heuristic techniques give rise to results that are not completely correct. In practice, post-matching human expert effort is needed to obtain a correct set of correspondences. We study this post-matching phase with the goal of reducing the costly human effort. We formally model this human-assisted phase and introduce a process of matching reconciliation that incrementally leads to identifying the correct correspondences. We achieve the goal of reducing the involved human effort by exploiting a network of schemas that are matched against each other. We express the fundamental matching constraints present in the network in a declarative formalism, Answer Set Programming, which in turn enables reasoning about the necessary user input. We demonstrate empirically that our reasoning and heuristic techniques can indeed substantially reduce the necessary human involvement.
Article
The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: Global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration—where the system lazily discovers associations enabling it to join together matches to keywords, and return ranked results. The user is expected to understand the data domain and provide feedback about answers' quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few "top-k" results: This result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper, we show how to predict the uncertainty associated with a query result's score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.
Chapter
Schema matching aims to identify the correspondences among attributes of database schemas. It is frequently considered the most challenging and decisive stage in many contemporary web semantics and database systems. Low-quality algorithmic matchers fail to provide improvement while manual annotation consumes extensive human effort. Further complications arise from data privacy in certain domains such as healthcare, where only schema-level matching should be used to prevent data leakage. For this problem, we propose SMAT, a new deep learning model based on state-of-the-art natural language processing techniques to obtain semantic mappings between source and target schemas using only the attribute name and description. SMAT avoids directly encoding domain knowledge about the source and target systems, which allows it to be more easily deployed across different sites. We also introduce a new benchmark dataset, OMAP, based on real-world schema-level mappings from the healthcare domain. Our extensive evaluation of various benchmark datasets demonstrates the potential of SMAT to help automate schema-level matching tasks.
Article
To integrate data on the Internet, we often have to deal with uncertainties when matching data schemas from different sources. The paper proposes an approach called Mashroom+ to support human-machine interactive data mashup, which can better handle uncertainties during the semantic matching process. To improve the correctness of matching results, an interactive matching algorithm is proposed to synthesize the matching results from multiple automatic matchers based on user feedback. Meanwhile, to avoid placing too much burden on users, we utilize entropy from information theory to measure and quantify the ambiguities of different matchers and calculate the best times for users to participate. An interactive integration environment is developed based on our approach with operator recommendation capability to support on-demand data integration. Experiments show that the Mashroom+ approach can achieve a good balance between high correctness of matching results and low user burden with real data.
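The entropy idea can be sketched as follows in Python, assuming each source attribute's row of similarity scores is normalized into a probability distribution; the normalization and the suggestion to involve the user on high-entropy rows are illustrative, not the exact formulation of Mashroom+.

```python
import numpy as np

def row_ambiguity(similarities) -> float:
    """Normalized entropy (0..1) of one attribute's similarity scores over
    all candidate targets. Values near 1 mean the matcher is ambiguous,
    values near 0 mean one candidate clearly stands out."""
    s = np.asarray(similarities, dtype=float)
    if s.sum() == 0 or len(s) < 2:
        return 0.0
    p = s / s.sum()          # treat scores as a probability distribution
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return float(entropy / np.log(len(s)))

if __name__ == "__main__":
    print(row_ambiguity([0.95, 0.02, 0.01]))  # low: clear winner
    print(row_ambiguity([0.40, 0.35, 0.38]))  # high: a good time to ask the user
```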
Chapter
Full-text available
Engineering large-scale systems requires collaboration among experts who use different modeling languages and create multiple models. Due to their independent creation and evolution, these models may exhibit discrepancies in terms of the domain concepts they represent. To help re-align the models without an explicit synchronization, we propose a technique that provides the modelers with suggested concepts that they may be interested in adding to their own models. The approach is modeling-language agnostic since it processes only the text in the models, such as the labels of elements and relationships. In this paper, we focus on determining the similarity of compound nouns, which are frequently used in conceptual models. We propose two algorithms that make use of word embeddings and domain models, respectively. We report an early validation that assesses the effectiveness of our similarity algorithms against state-of-the-art machine learning algorithms with respect to human judgment.
Article
Full-text available
Oceanographic research is a multidisciplinary endeavor that involves the acquisition of an increasing amount of in-situ and remotely sensed data. A large and growing number of studies and data repositories are now available on-line. However, manually integrating different datasets is a tedious and grueling process leading to a rising need for automated integration tools. A key challenge in oceanographic data integration is to map between data sources that have no common schema and that were collected, processed, and analyzed using different methodologies. Concurrently, artificial agents are becoming increasingly adept at extracting knowledge from text and using domain ontologies to integrate and align data. Here, we deconstruct the process of ocean science data integration, providing a detailed description of its three phases: discover, merge, and evaluate/correct. In addition, we identify the key missing tools and underutilized information sources currently limiting the automation of the integration process. The efforts to address these limitations should focus on (i) development of artificial intelligence-based tools for assisting ocean scientists in aligning their schema with existing ontologies when organizing their measurements in datasets; (ii) extension and refinement of conceptual coverage of – and conceptual alignment between – existing ontologies, to better fit the diverse and multidisciplinary nature of ocean science; (iii) creation of ocean-science-specific 'entity resolution' benchmarks to accelerate the development of tools utilizing ocean science terminology and nomenclature; (iv) creation of ocean-science-specific schema matching and mapping benchmarks to accelerate the development of matching and mapping tools utilizing semantics encoded in existing vocabularies and ontologies; (v) annotation of datasets, and development of tools and benchmarks for the extraction and categorization of data quality and preprocessing descriptions from scientific text; and (vi) creation of large-scale word embeddings trained upon ocean science literature to accelerate the development of information extraction and matching tools based on artificial intelligence.
Article
There are millions of searchable data sources on the Web and to a large extent their contents can only be reached through their own query interfaces. There is an enormous interest in making the data in these sources easily accessible. There are primarily two general approaches to achieve this objective. The first is to surface the contents of these sources from the deep Web and add the contents to the index of regular search engines. The second is to integrate the searching capabilities of these sources and support integrated access to them. In this book, we introduce the state-of-the-art techniques for extracting, understanding, and integrating the query interfaces of deep Web data sources. These techniques are critical for producing an integrated query interface for each domain. The interface serves as the mediator for searching all data sources in the concerned domain. While query interface integration is only relevant for the deep Web integration approach, the extraction and understanding of query interfaces are critical for both deep Web exploration approaches. This book aims to provide in-depth and comprehensive coverage of the key technologies needed to create high quality integrated query interfaces automatically. The following technical issues are discussed in detail in this book: query interface modeling, query interface extraction, query interface clustering, query interface matching, query interface attribute integration, and query interface integration. Table of Contents: Introduction / Query Interface Representation and Extraction / Query Interface Clustering and Categorization / Query Interface Matching / Query Interface Attribute Integration / Query Interface Integration / Summary and Future Research
Conference Paper
Schema matching is the process of establishing correspondences between the attributes of database schemas for data integration purposes. Although several schema matching tools have been developed, their results are often incomplete or erroneous. To obtain correct attribute correspondences, in practice, human experts edit the mapping results and fix the mapping problems. As the scale and complexity of data integration tasks have increased dramatically in recent years, this reconciliation phase has become more and more of a bottleneck. Moreover, one often needs to establish correspondences not only between two schemas but across a whole network of schemas simultaneously. In such reconciliation settings, it is desirable to involve several experts. In this paper, we propose a tool that supports a group of experts in collaboratively reconciling a set of matched correspondences. The experts might have conflicting views on whether a given correspondence is correct or not. As one expects global consistency conditions in the network, conflict resolution might require discussion and negotiation among the experts to resolve such disagreements. We have developed techniques and a tool that allow this reconciliation phase to be approached in a systematic way. We represent the experts' views as arguments to enable formal reasoning on their assertions. We detect complex dependencies in their arguments, guide them, and present the possible consequences of their decisions. These techniques can thus greatly help them oversee the complex cases and work more effectively.
Article
Schema matching is an essential task in data integration scenarios. The proposed schema matchers can be classified into several categories, one of which is the category of linguistic matchers, which evaluate relatedness by comparing the textual elements of the schemata. There are several linguistic matchers, and their performance may vary from scenario to scenario. In order to eliminate incorrect similarity values, we have implemented a voting-based approach. Another contribution presented in this paper is the proposed fuzzy linguistic matching, which involves the transformation of similarity values into membership functions.
Conference Paper
Ontology and schema matching predictors assess the quality of matchers in the absence of an exact match. We propose MCD (Match Competitor Deviation), a new diversity-based predictor that compares the strength of a matcher's confidence in the correspondence of a concept pair with respect to other correspondences that involve either concept. We also propose to use MCD as a regulator to optimally control the balance between precision and recall, and to use it towards 1:1 matching by combining it with a similarity measure that is based on solving a maximum weight bipartite graph matching (MWBM) problem. Optimizing the combined measure is known to be an NP-hard problem. Therefore, we propose CEM, an approximation to an optimal match that efficiently scans multiple possible matches using rare event estimation. Using a thorough empirical study over several benchmark real-world datasets, we show that MCD outperforms other state-of-the-art predictors and that CEM significantly outperforms existing matchers.
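To make this comparison concrete, the following Python sketch computes a simple competitor-deviation style score from a similarity matrix and extracts a 1:1 matching via maximum-weight bipartite matching with scipy's linear_sum_assignment; the function names and the scoring formula are illustrative assumptions, not the paper's actual MCD or CEM implementation.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def competitor_deviation(sim, i, j):
        # How much sim[i, j] stands out from competing entries in row i and column j.
        competitors = np.concatenate((np.delete(sim[i, :], j), np.delete(sim[:, j], i)))
        return sim[i, j] - competitors.mean()

    def one_to_one_match(sim):
        # Extract the 1:1 matching that maximizes total similarity (MWBM).
        rows, cols = linear_sum_assignment(-sim)
        return list(zip(rows.tolist(), cols.tolist()))

    sim = np.array([[0.9, 0.2, 0.1],
                    [0.3, 0.8, 0.4],
                    [0.2, 0.3, 0.7]])
    match = one_to_one_match(sim)
    print(match, {pair: round(competitor_deviation(sim, *pair), 2) for pair in match})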
Article
Schema matching is used for data integration, mediation, and conversion between heterogeneous sources. Nevertheless, mappings identified by an automatic or semi-automatic process can never be completely certain. In a process of concept alignment, it is necessary to manage uncertainty. In this paper, we present a fuzzy-based process of concept alignment for uncertainty management in the schema matching problem. The ultimate goal is to enable interoperability between different electronic health records. Data integration of health information is done through the mediation of our ubiquitous user model framework. Results look promising, and fuzzy theory proved to be a good fit for modeling uncertain schema matching. Fuzzy combined similarities can handle uncertainty in the schema matching process to enable interoperability between electronic health records, improving the quality of mappings and reducing the human error involved in verifying them.
Chapter
In a distributed and open system, such as the semantic web and many of the applications presented in the previous chapter, heterogeneity cannot be avoided. Different actors have different interests and habits, use different tools and knowledge, and, most often, work at different levels of detail. These various causes lead to diverse forms of heterogeneity and should therefore be carefully taken into consideration.
Conference Paper
In this work we explore relationships between human and algorithmic schema matchers. We introduce a novel notion of similarity between schema matchers, termed coordinated matchers, and use it to predict future human matching choices. We show through a comprehensive analysis that human matchers are usually coordinated with intuitive algorithms, e.g., those based on attribute name similarity, and frequently do not assign lower confidence levels, which indicates overconfidence in their choices. Finally, we show that human choices can be reasonably predicted using collaborative algorithmic opinions based on matcher coordination.
Chapter
Data integration has been the focus of research for many years now. At the heart of the data integration process is a schema matching problem whose outcome is a collection of correspondences between different representations of the same real-world construct. In recent years, data integration has been facing new challenges as a result of the presence of big data. These challenges require the development of a set of methods to support a matching process using uncertainty management tools to quantify the inherent uncertainty in the process. This chapter is devoted to the introduction of uncertain schema matching. It also discusses existing and future research, as well as possible applications.
Chapter
This chapter is an overview of matchers which have emerged during the last decades. There have already been some comparisons of matching systems, in particular in (Parent and Spaccapietra 2000; Rahm and Bernstein 2001; Do et al. 2002; Kalfoglou and Schorlemmer 2003b; Noy 2004a; Doan and Halevy 2005; Shvaiko and Euzenat 2005; Choi et al. 2006; Bellahsene et al. 2011). Our purpose here is not to compare them in full detail, though we give some comparisons, but rather to show their variety, in order to demonstrate in how many different ways the methods presented in the previous chapters have been practically exploited.
Conference Paper
Despite the profusion of approaches that have been proposed to deal with the problem of automatic schema matching, the challenges and difficulties caused by the complexity and uncertainty characterizing both the process and the outcome of schema matching motivated us to investigate how an emerging bio-inspired paradigm can help with understanding, managing, and ultimately overcoming those challenges. In this paper, we explain how we approached schema matching as a Complex Adaptive System (CAS) and how we modeled it using Agent-Based Modeling and Simulation (ABMS), giving birth to a new prototype tool for schema matching called Reflex-SMAS. This prototype was subjected to a set of experiments that aimed to demonstrate the viability of our approach along two main dimensions: (i) effectiveness (increasing the quality of the found matchings) and (ii) efficiency (reducing the effort required by the matching process). The results demonstrate the viability of our approach in terms of both effectiveness and efficiency.
Chapter
The basic techniques presented in Chap. 5 and the global techniques provided in Chap. 6 are the building blocks on which a matching system is built. Once the similarity or dissimilarity between ontology entities is available, the alignment remains to be computed. This involves more comprehensive treatments. In particular, the following aspects of building a working matching system are considered in this chapter: preparing, if necessary, to handle large scale ontologies (Sect. 7.1.1), organising the combination of various similarities or matching algorithms (Sect. 7.2), exploiting background knowledge sources (Sect. 7.3), aggregating the results of the basic methods in order to compute the compound similarity between entities (Sect. 7.4), learning matchers from data (Sect. 7.5) and tuning them (Sect. 7.6), extracting alignments from the resulting (dis)similarity: indeed, different alignments with different characteristics may be extracted from the same (dis)similarity (Sect. 7.7), improving alignments through disambiguation, debugging and repair (Sect. 7.8).
Conference Paper
One of the key challenges in the development of open semantic-based systems is enabling the exchange of meaningful information across applications that may use autonomously developed schemata. One of the typical solutions to this problem is the definition of a mapping between pairs of schemas, namely a set of point-to-point relations between the elements of different schemas. Many (semi-)automatic methods for generating such mappings have been proposed. In this paper we provide a preliminary investigation of the notion of correctness for schema matching methods. In particular, we define different notions of soundness, depending strictly on which dimension (syntactic, semantic, pragmatic) of the language the mappings are defined over. Finally, we discuss some preliminary conditions under which two different notions of soundness (semantic and pragmatic) can be related.
Conference Paper
The Web has been rapidly "deepened" by myriad searchable databases online, where data are hidden behind query interfaces. As an essential task toward integrating these massive "deep Web" sources, large scale schema matching (i.e., discovering semantic correspondences of attributes across many query interfaces) has been actively studied recently. In particular, many works have emerged to address this problem by "holistically" matching many schemas at the same time and thus pursuing "mining" approaches in nature. However, while holistic schema matching has built its promise upon the large quantity of input schemas, it also suffers from a robustness problem caused by noisy data quality. Such noise often inevitably arises in the automatic extraction of schema data, which is mandatory in large scale integration. For holistic matching to be viable, it is thus essential to make it robust against noisy schemas. To tackle this challenge, we propose a data-ensemble framework with sampling and voting techniques, which is inspired by bagging predictors. Specifically, our approach creates an ensemble of matchers by randomizing input schema data into many independently downsampled trials, executing the same matcher on each trial and then aggregating their ranked results by majority voting. As a principled basis, we provide analytic justification of the effectiveness of this data-ensemble framework. Further, empirically, our experiments on real Web data show that the "ensemblization" indeed significantly boosts the matching accuracy under noisy schema input, and thus maintains the desired robustness of a holistic matcher.
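The sampling-and-voting idea can be sketched in a few lines; the toy matcher, the number of trials and the sampling ratio below are placeholder assumptions, not the framework evaluated in the paper.

    import random
    from collections import Counter

    def ensemble_match(schemas, base_matcher, trials=20, sample_ratio=0.7, seed=0):
        rng = random.Random(seed)
        votes = Counter()
        for _ in range(trials):
            sample = rng.sample(schemas, max(1, int(sample_ratio * len(schemas))))
            for correspondence in base_matcher(sample):
                votes[correspondence] += 1
        # Keep only correspondences supported by a majority of the trials.
        return {c for c, v in votes.items() if v > trials / 2}

    def toy_matcher(sample):
        # Placeholder matcher: pair attributes sharing a name across sampled schemas.
        names = [a for schema in sample for a in schema]
        return {(a, a) for a in names if names.count(a) > 1}

    schemas = [{"title", "author"}, {"title", "writer"}, {"title", "author", "isbn"}]
    print(ensemble_match(schemas, toy_matcher))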
Conference Paper
Schema mappings are high-level specifications that describe the relationship between two database schemas. Two operators on schema mappings, namely the composition operator and the inverse operator, are regarded as especially important. Progress on the study of the inverse operator was not made until very recently, as even finding the exact semantics of this operator turned out to be a fairly delicate task. Furthermore, this notion is rather restrictive, since it is rare that a schema mapping possesses an inverse. In this paper, we introduce and study the notion of a quasi-inverse of a schema mapping. This notion is a principled relaxation of the notion of an inverse of a schema mapping; intuitively, it is obtained from the notion of an inverse by not differentiating between instances that are equivalent for data-exchange purposes. For schema mappings specified by source-to-target tuple-generating dependencies (s-t tgds), we give a necessary and sufficient combinatorial condition for the existence of a quasi-inverse, and then use this condition to obtain both positive and negative results about the existence of quasi-inverses. In particular, we show that every LAV (local-as-view) schema mapping has a quasi-inverse, but that there are schema mappings specified by full s-t tgds that have no quasi-inverse. After this, we study the language needed to express quasi-inverses of schema mappings specified by s-t tgds, and we obtain a complete characterization. We also characterize the language needed to express inverses of schema mappings, and thereby solve a problem left open in the earlier study of the inverse operator. Finally, we show that quasi-inverses can be used in many cases to recover the data that was exported by the original schema mapping when performing data exchange.
Conference Paper
Information seeking is the process in which human beings have recourse to information resources in order to increase their level of knowledge with respect to their goals. In this paper we offer a methodology for automating the evolution of ontologies and share the results of our experiments in supporting a user in seeking information using interactive systems. The main conclusion of our experiments is that if one narrows down the scope of the domain, ontologies can be extracted with a very high level of precision (more than 90% in some cases). The paper is a step toward providing theoretical, as well as practical, foundations for automatic ontology generation. It is our belief that such a process would allow the creation of flexible tools to manage metadata, either as an aid to a designer or as an independent system (“smart agent”) for time critical missions.
Conference Paper
Automatic classification may be used in object knowledge bases in order to suggest hypothesis about the structure of the available object sets. Yet its direct application meets some difficulties due to the way data is represented: attributes relating objects, multi-valued attributes, non-standard and external data types used in object descriptions. We present here an approach to the automatic classification of objects based on a specific dissimilarity model. The topological measure, presented in a previous paper, accounts for both object relations and the variety of available data types. In this paper, the extension of the topological measure on multi-valued object attributes, e.g. lists or sets, is presented. The resulting dissimilarity is completely integrated in the knowledge model Tropes which enables the definition of a classification strategy for an arbitrary knowledge base built on top of Tropes.
Conference Paper
This paper introduces the first formal framework for learning mappings between heterogeneous schemas which is based on logics and probability theory. This task, also called "schema matching", is a crucial step in integrating heterogeneous collections. As schemas may have different granularities, and as schema attributes do not always match precisely, a general-purpose schema mapping approach requires support for uncertain mappings, and mappings have to be learned automatically. The framework combines different classifiers for finding suitable mapping candidates (together with their weights), and selects that set of mapping rules which is the most likely one. Finally, the framework with different variants has been evaluated on two different data sets.
Conference Paper
The core of a model theory for generic schema management is developed. This theory has two distinctive features: it applies to a variety of categories of schemas, and it applies to transformations of both the schema structure and its integrity constraints. A subtle problem of schema integration is considered in its general form, not bound to any particular category of schemas. The proposed solution, as well as the overall theory, is based entirely on schema morphisms that carry both structural and semantic properties. Duality results that apply to the schema and the data levels are established. These results lead to the main contribution of this paper: a formal schema and data management framework for generic schema management. Implications of this theory are established that apply to integrity problems in schema integration. The theory is illustrated by a particular category of schemas with object-oriented features along with typical database integrity constraints.
Conference Paper
Schema matching is the task of finding semantic correspondences between elements of two schemas. It is needed in many database applications, such as integration of web data sources, data warehouse loading and XML message mapping. To reduce the amount of user effort as much as possible, automatic approaches combining several match techniques are required. While such match approaches have found considerable interest recently, the problem of how to best combine different match algorithms still requires further work. We have thus developed the COMA schema matching system as a platform to combine multiple matchers in a flexible way. We provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions. We use COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas. The results obtained so far show the superiority of combined match approaches and indicate the high value of reuse-oriented strategies.
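One common way of combining matchers is to compute one similarity matrix per matcher, aggregate the matrices, and select pairs above a threshold; the sketch below uses maximum aggregation, two toy matchers and a 0.5 threshold as illustrative assumptions and is not COMA's actual matcher library or combination strategy.

    import numpy as np
    from difflib import SequenceMatcher

    def name_equality(a, b):
        return 1.0 if a.lower() == b.lower() else 0.0

    def edit_similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def combined_similarity(src, tgt, matchers):
        mats = [np.array([[m(s, t) for t in tgt] for s in src]) for m in matchers]
        return np.max(mats, axis=0)   # maximum aggregation; average is a common alternative

    src, tgt = ["CustomerName", "Phone"], ["custName", "phoneNumber", "fax"]
    sim = combined_similarity(src, tgt, [name_equality, edit_similarity])
    pairs = [(src[i], tgt[j]) for i in range(len(src))
             for j in range(len(tgt)) if sim[i, j] >= 0.5]
    print(np.round(sim, 2))
    print(pairs)   # [('CustomerName', 'custName'), ('Phone', 'phoneNumber')]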
Conference Paper
An XML-to-relational mapping scheme consists of a procedure for shredding documents into relational databases, a procedure for publishing databases back as documents, and a set of constraints the databases must satisfy. In previous work, we defined two notions of information preservation for mapping schemes: losslessness, which guarantees that any document can be reconstructed from its corresponding database; and validation, which requires every legal database to correspond to a valid document. We also described one information-preserving mapping scheme, called Edge++, and showed that, under reasonable assumptions, losslessness and validation are both undecidable. This leads to the question we study in this paper: how to design mapping schemes that are information-preserving. We propose to do it by starting with a scheme known to be information-preserving and applying to it equivalence-preserving transformations written in weakly recursive ILOG. We study an instance of this framework, the LILO algorithm, and show that it provides significant performance improvements over Edge++ and introduces constraints that are efficiently enforced in practice.
Conference Paper
The goal of schema matching is to identify correspondences between the elements of two schemas. Most schema matching systems calculate and display the entire set of correspondences in a single shot. Invariably, the result presented to the engineer includes many false positives, especially for large schemas. The user is often overwhelmed by all of the edges, annoyed by the false positives, and frustrated at the inability to see second- and third-best choices. We demonstrate a tool that circumvents these problems by doing the matching interactively. The tool suggests candidate matches for a selected schema element and allows convenient navigation between the candidates. The ranking of match candidates is based on lexical similarity, schema structure, element types, and the history of prior matching actions. The technical challenges are to make the match algorithm fast enough for incremental matching in large schemas and to devise a user interface that avoids overwhelming the user. The tool has been integrated with a prototype version of Microsoft BizTalk Mapper, a visual programming tool for generating XML-to-XML mappings.
Conference Paper
The paper provides a conceptual framework for designing and executing business processes using semantic Web services. We envision a world in which a designer defines a "virtual" Web service as part of a business process, while requiring the system to seek actual Web services that match the specifications of the designer and can be invoked whenever the virtual Web service is activated. Taking a conceptual modeling approach, the relationships between ontology concepts and syntactic Web services are identified. We then propose a generic algorithm for ranking the top-K Web services in decreasing order of their benefit vis-à-vis the semantic Web service. We conclude with an extension of the framework to handle uncertainty as a result of concept mismatch and the desired properties of a schema matching algorithm to support Web service identification.
Conference Paper
Schema integration is the activity of providing a unified representation of multiple data sources. The core problems in schema integration are: schema matching, i.e. the identification of correspondences, or mappings, between schema objects, and schema merging, i.e. the creation of a unified schema based on the identified mappings. Existing schema matching approaches attempt to identify a single mapping between each pair of objects, for which they are 100% certain of its correctness. However, this is impossible in general, thus a human expert always has to validate or modify it. In this paper, we propose a new schema integration approach where the uncertainty in the identified mappings that is inherent in the schema matching process is explicitly represented, and that uncertainty propagates to the schema merging process, and finally it is depicted in the resulting integrated schema.
Conference Paper
Model management aims at reducing the amount of programming needed for the development of metadata-intensive applications. We present a first complete prototype of a generic model management system, in which high-level operators are used to manipulate models and mappings between models. We define the key conceptual structures: models, morphisms, and selectors, and describe their use and implementation. We specify the semantics of the known model-management operators applied to these structures, suggest new ones, and develop new algorithms for implementing the individual operators. We examine the solutions for two model-management tasks that involve manipulations of relational schemas, XML schemas, and SQL views.
Conference Paper
One significant part of today's Web is Web databases, which can dynamically provide information in response to user queries. To help users submit queries to and collect query results from different Web databases, the query interface matching problem needs to be addressed. To solve this problem, we propose a new complex schema matching approach, Holistic Schema Matching (HSM). By examining the query interfaces of real Web databases, we observe that attribute matchings can be discovered from attribute-occurrence patterns. For example, First Name often appears together with Last Name while it is rarely co-present with Author in the Books domain. Thus, we design a count-based greedy algorithm to identify which attributes are more likely to be matched in the query interfaces. In particular, HSM can identify both simple matching and complex matching, where the former refers to 1:1 matching between attributes and the latter refers to 1:n or m:n matching between attributes. Our experiments show that HSM can discover both simple and complex matchings accurately and efficiently on real data sets.
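The attribute co-occurrence intuition can be sketched as follows: attributes that rarely appear in the same interface but occur with similar sets of other attributes are synonym candidates. The score below is an illustrative stand-in, not HSM's actual count-based greedy algorithm.

    from itertools import combinations
    from collections import Counter

    interfaces = [
        {"title", "author", "isbn"},
        {"title", "first name", "last name"},
        {"title", "author", "publisher"},
        {"title", "first name", "last name", "publisher"},
    ]

    co = Counter()                      # co-presence counts over all interfaces
    for schema in interfaces:
        for a, b in combinations(sorted(schema), 2):
            co[(a, b)] += 1

    def synonym_score(a, b):
        together = co[tuple(sorted((a, b)))]
        ctx_a = {x for s in interfaces if a in s for x in s} - {a, b}
        ctx_b = {x for s in interfaces if b in s for x in s} - {a, b}
        overlap = len(ctx_a & ctx_b) / max(1, len(ctx_a | ctx_b))
        return overlap / (1 + together)   # high shared context, low co-presence

    print(synonym_score("author", "last name"), synonym_score("author", "publisher"))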
Conference Paper
Despite advances in machine learning technologies, a schema matching result between two database schemas (e.g., those derived from COMA++) is likely to be imprecise. In particular, numerous instances of "possible mappings" between the schemas may be derived from the matching result. In this paper, we study the problem of managing possible mappings between two heterogeneous XML schemas. We observe that for XML schemas, their possible mappings have a high degree of overlap. We hence propose a novel data structure, called the block tree, to capture the commonalities among possible mappings. The block tree is useful for representing the possible mappings in a compact manner, and can be generated efficiently. Moreover, it supports the evaluation of the probabilistic twig query (PTQ), which returns the probability of portions of an XML document that match the query pattern. For users who are interested only in answers with the k highest probabilities, we also propose the top-k PTQ, and present an efficient solution for it. The second challenge we have tackled is to efficiently generate possible mappings for a given schema matching. While this problem can be solved by existing algorithms, we show how to improve the performance of the solution by using a divide-and-conquer approach. An extensive evaluation on realistic datasets shows that our approaches significantly improve the efficiency of generating, storing, and querying possible mappings.
Conference Paper
A key aspect of any data integration endeavor is establishing a transformation that translates instances of one or more source schemata into instances of a target schema. This schema integration task must be tackled regardless of the integration architecture or mapping formalism. In this paper we provide a task model for schema integration. We use this breakdown to motivate a workbench for schema integration in which multiple tools share a common knowledge repository. In particular, the workbench facilitates the interoperation of research prototypes for schema matching (which automatically identify likely semantic correspondences) with commercial schema mapping tools (which help produce instance-level transformations). Currently, each of these tools provides its own ad hoc representation of schemata and mappings; combining these tools requires aligning these representations. The workbench provides a common representation so that these tools can more rapidly be combined.
Conference Paper
Recent interest in managing uncertainty in data integration has led to the introduction of probabilistic schema mappings and the use of probabilistic methods to answer queries across multiple databases using two semantics: by-table and by-tuple. In this paper, we develop three possible semantics for aggregate queries: the range, distribution, and expected value semantics, and show that these three semantics combine with the by-table and by-tuple semantics in six ways. We present algorithms to process COUNT, AVG, SUM, MIN, and MAX queries under all six semantics and develop results on the complexity of processing such queries under all six semantics. We show that computing COUNT is in PTIME for all six semantics and computing SUM is in PTIME for all but the by-tuple/distribution semantics. Finally, we show that AVG, MIN, and MAX are PTIME computable for all by-table semantics and for the by-tuple/range semantics. We developed a prototype implementation and experimented with both real-world traces and simulated data. We show that, as expected, naive processing of aggregates does not scale beyond small databases with a small number of mappings. The results also show that the polynomial time algorithms are scalable up to several million tuples as well as with a large number of mappings.
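A minimal illustration of the range and expected-value semantics for a by-table COUNT query; the toy source rows, the two candidate mappings and their probabilities are made-up assumptions.

    def count_under_mapping(source_rows, mapping, attr, value):
        # mapping: source attribute -> mediated attribute
        hits = 0
        for row in source_rows:
            mediated = {mapping[a]: v for a, v in row.items() if a in mapping}
            hits += mediated.get(attr) == value
        return hits

    source_rows = [{"addr": "NYC", "ph": "555"}, {"addr": "LA", "ph": "777"}]
    candidate_mappings = [                 # (probability, mapping), probabilities sum to 1
        (0.6, {"addr": "city", "ph": "phone"}),
        (0.4, {"addr": "phone", "ph": "city"}),
    ]

    counts = [(p, count_under_mapping(source_rows, m, "city", "NYC"))
              for p, m in candidate_mappings]
    range_semantics = (min(c for _, c in counts), max(c for _, c in counts))
    expected_value = sum(p * c for p, c in counts)
    print(range_semantics, expected_value)   # (0, 1) 0.6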
Conference Paper
Schema matching is the task of matching between concepts describing the meaning of data in various heterogeneous, distributed data sources. With many heuristics to choose from, several tools have enabled the use of schema matcher ensembles, combining principles by which different schema matchers judge the similarity between concepts. In this work, we investigate means of estimating the uncertainty involved in schema matching and harnessing it to improve an ensemble outcome. We propose a model for schema matching, based on simple probabilistic principles. We then propose the use of machine learning in determining the best mapping and discuss its pros and cons. Finally, we provide a thorough empirical analysis, using both real-world and synthetic data, to test the proposed technique. We conclude that the proposed heuristic performs well, given an accurate modeling of uncertainty in matcher decision making.
Conference Paper
Schema matching is recognized to be one of the basic operations required by the process of data and schema integration, and thus has a great impact on its outcome. We propose a new approach to combining matchers into ensembles, called Schema Matcher Boosting (SMB). This approach is based on a well-known machine learning technique called boosting. We present a boosting algorithm for schema matching with a unique ensembler feature, namely the ability to choose the matchers that participate in an ensemble. SMB introduces a new promise for schema matcher designers. Instead of trying to design a perfect schema matcher that is accurate for all schema pairs, a designer can focus on finding better than random schema matchers. We provide thorough comparative empirical results in which we show that SMB outperforms, on average, any individual matcher. In our experiments we compared SMB with more than 30 other matchers over real-world data of 230 schemata and several ensembling approaches, including the Meta-Learner of LSD. Our empirical analysis shows that SMB improves, on average, over the performance of individual matchers. Moreover, SMB is shown to be consistently dominant, far beyond any other individual matcher. Finally, we observe that SMB performs better than the Meta-Learner in terms of precision, recall and F-Measure.
Article
One of the fundamental principles of the database approach is that a database allows a nonredundant, unified representation of all data managed in an organization. This is achieved only when methodologies are available to support integration across organizational and application boundaries. Methodologies for database design usually perform the design activity by separately producing several schemas, representing parts of the application, which are subsequently merged. Database schema integration is the activity of integrating the schemas of existing or proposed databases into a global, unified schema. The aim of the paper is to provide first a unifying framework for the problem of schema integration, then a comparative review of the work done thus far in this area. Such a framework, with the associated analysis of the existing approaches, provides a basis for identifying strengths and weaknesses of individual methodologies, as well as general guidelines for future improvements and extensions.
Article
Most recent schema matching systems assemble multiple components, each employing a particular matching technique. The domain user must then tune the system: select the right components to be executed and correctly adjust their numerous “knobs” (e.g., thresholds, formula coefficients). Tuning is skill and time intensive, but (as we show) without it the matching accuracy is significantly inferior. We describe eTuner, an approach to automatically tune schema matching systems. Given a schema S, we match S against synthetic schemas, for which the ground truth mapping is known, and find a tuning that demonstrably improves the performance of matching S against real schemas. To efficiently search the huge space of tuning configurations, eTuner works sequentially, starting with tuning the lowest level components. To increase the applicability of eTuner, we develop methods to tune a broad range of matching components. While the tuning process is completely automatic, eTuner can also exploit user assistance (whenever available) to further improve the tuning quality. We employed eTuner to tune four recently developed matching systems on several real-world domains. The results show that eTuner produced tuned matching systems that achieve higher accuracy than using the systems with currently possible tuning methods.
Article
Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
Article
The introduction of the Semantic Web vision and the shift toward machine understandable Web resources has unearthed the importance of automatic semantic reconciliation. Consequently, new tools for automating the process were proposed. In this work we present a formal model of semantic reconciliation and analyze in a systematic manner the properties of the process outcome, primarily the inherent uncertainty of the matching process and how it reflects on the resulting mappings. An important feature of this research is the identification and analysis of factors that impact the effectiveness of algorithms for automatic semantic reconciliation, leading, it is hoped, to the design of better algorithms by reducing the uncertainty of existing algorithms. Against this background we empirically study the aptitude of two algorithms to correctly match concepts. This research is both timely and practical in light of recent attempts to develop and utilize methods for automatic semantic reconciliation.
Article
In this paper, we propose to extend current practice in schema matching with the simultaneous use of top-K schema mappings rather than a single best mapping. This is a natural extension of existing methods (which can be considered to fall into the top-1 category), taking into account the imprecision inherent in the schema matching process. The essence of this method is the simultaneous generation and examination of the K best schema mappings to identify useful mappings. The paper discusses efficient methods for generating top-K mappings and proposes a generic methodology for the simultaneous utilization of top-K mappings. We also propose a concrete heuristic that aims at improving precision at the cost of recall. We have tested the heuristic on real as well as synthetic data and analyze the empirical results. The novelty of this paper lies in the robust extension of existing methods for schema matching, one that can gracefully accommodate less-than-perfect scenarios in which the exact mapping cannot be identified in a single iteration. Our proposal represents a step forward in achieving fully automated schema matching, which is currently semi-automated at best.
Article
Clio addresses the complex tasks of heterogeneous data transformation and integration. In Clio, we have collected together a powerful set of data management techniques that have proven invaluable in tackling these difficult problems. In this paper, we present the underlying themes of our approach and give a brief case study.
Article
Schema matching identifies elements of two given schemas that correspond to each other. Although there are many algorithms for schema matching, little has been written about building a system that can be used in practice. We describe our initial experience building such a system, a customizable schema matcher called Protoplasm.
Article
This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, the data from the sources may be extracted using information extraction techniques and so may yield erroneous data. Third, queries to the system may be posed with keywords rather than in a structured form. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we do not know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of probabilistic schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting. Finally, we consider using probabilistic mappings in the scenario of data exchange.
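Under the by-table semantics, an answer tuple's probability is the total probability of the candidate mappings that produce it, and the top-k answers are the k most probable tuples; the sketch below illustrates this for a single-attribute projection, with made-up rows and mapping probabilities.

    from collections import defaultdict

    def by_table_answers(source_rows, candidate_mappings, project_attr):
        probs = defaultdict(float)
        for p, mapping in candidate_mappings:
            answers = set()
            for row in source_rows:
                mediated = {mapping[a]: v for a, v in row.items() if a in mapping}
                if project_attr in mediated:
                    answers.add(mediated[project_attr])
            for t in answers:            # each answer inherits this mapping's probability
                probs[t] += p
        return dict(probs)

    def top_k(probs, k):
        return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

    source_rows = [{"addr": "NYC", "ph": "555"}, {"addr": "LA", "ph": "777"}]
    candidate_mappings = [(0.6, {"addr": "city"}), (0.4, {"ph": "city"})]
    print(top_k(by_table_answers(source_rows, candidate_mappings, "city"), 2))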
Article
We review the Lawler-Murty [24,20] procedure for finding the K best solutions to combinatorial optimization problems. Then we introduce an alternative algorithm which is based on a binary search tree procedure. We apply both algorithms to the problems of finding the K best bases in a matroid, perfect matchings, and best cuts in a network.
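The following Python sketch illustrates the Lawler-Murty partitioning idea for the K best assignments (maximum-weight 1:1 matchings), solving each constrained subproblem with scipy's assignment solver; encoding forced and forbidden cells through a large negative constant is an implementation convenience assumed here, not the procedure of the paper verbatim.

    import heapq
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    NEG = -1e9   # effectively forbids a cell in a maximization problem

    def solve(weights, forced, forbidden):
        w = weights.astype(float)
        for i, j in forbidden:
            w[i, j] = NEG
        for i, j in forced:
            w[i, :] = NEG
            w[:, j] = NEG
            w[i, j] = weights[i, j]
        rows, cols = linear_sum_assignment(-w)          # maximize total weight
        assignment = list(zip(rows.tolist(), cols.tolist()))
        if any(w[i, j] <= NEG for i, j in assignment):  # constraints are infeasible
            return None
        return float(w[rows, cols].sum()), assignment

    def k_best_assignments(weights, K):
        score, assignment = solve(weights, [], [])
        heap, results = [(-score, assignment, [], [])], []
        while heap and len(results) < K:
            neg_score, assignment, forced, forbidden = heapq.heappop(heap)
            results.append((-neg_score, assignment))
            # Partition the remaining solution space around the popped solution.
            for idx, pair in enumerate(assignment):
                if pair in forced:
                    continue
                child = solve(weights, forced + assignment[:idx], forbidden + [pair])
                if child is not None:
                    heapq.heappush(heap, (-child[0], child[1],
                                          forced + assignment[:idx], forbidden + [pair]))
        return results

    w = np.array([[4, 1, 3], [2, 0, 5], [3, 2, 2]])
    for score, assignment in k_best_assignments(w, 3):
        print(score, assignment)   # 11.0, 9.0 and 7.0 with their assignments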
Conference Paper
We discuss, compare and relate some old and some new models for incomplete and probabilistic databases. We characterize the expressive power of c-tables over infinite domains and we introduce a new kind of result, algebraic completion, for studying less expressive models. By viewing probabilistic models as incompleteness models with additional probability information, we define completeness and closure under query languages of general probabilistic database models and we introduce a new such model, probabilistic c-tables, that is shown to be complete and closed under the relational algebra.
Article
In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in R^n. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.
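The multiplicative weight-update rule itself fits in a few lines: each option's weight is multiplied by beta raised to its loss after every round, so poorly performing options quickly lose influence; the losses and the value of beta below are made-up numbers for illustration.

    def hedge(losses_per_round, beta=0.8):
        weights = [1.0] * len(losses_per_round[0])
        for losses in losses_per_round:                  # per-option losses in [0, 1]
            total = sum(weights)
            mixture_loss = sum(w / total * l for w, l in zip(weights, losses))
            weights = [w * beta ** l for w, l in zip(weights, losses)]
            print(round(mixture_loss, 3), [round(w, 3) for w in weights])
        return weights

    # Three options (e.g. individual predictors); the second performs best throughout.
    hedge([[1.0, 0.0, 0.5], [0.8, 0.1, 0.6], [1.0, 0.0, 0.4]])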
Article
Developing intelligent tools for the integration of information extracted from multiple heterogeneous sources is a challenging issue to effectively exploit the numerous sources available on-line in global information systems. In this paper, we propose intelligent, tool-supported techniques to information extraction and integration from both structured and semistructured data sources. An object-oriented language, with an underlying Description Logic, called ODLI3, derived from the standard ODMG is introduced for information extraction. ODLI3 descriptions of the source schemas are exploited first to set a Common Thesaurus for the sources. Information integration is then performed in a semiautomatic way by exploiting the knowledge in the Common Thesaurus and ODLI3 descriptions of source schemas with a combination of clustering techniques and Description Logics. This integration process gives rise to a virtual integrated view of the underlying sources for which mapping rules and integrity constraints are specified to handle heterogeneity. Integration techniques described in the paper are provided in the framework of the MOMIS system based on a conventional wrapper/mediator architecture.
Article
In the K-best perfect matching problem (KM) one wants to find K pairwise different perfect matchings M1, …, MK such that w(M1) ≥ w(M2) ≥ ⋯ ≥ w(MK) ≥ w(M) for every other perfect matching M ∉ {M1, …, MK}. The procedure discussed in this paper is based on a binary partitioning of the matching solution space. We survey different algorithms to perform this partitioning. The best complexity bound of the resulting algorithms discussed is O(Kn^3), where n is the number of nodes in the graph.
Article
Assume that each object in a database has m grades, or scores, one for each of m attributes. For example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. For each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). Each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. To determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. Fagin has given an algorithm (“Fagin's Algorithm”, or FA) that is much more efficient. For some monotone aggregation functions, FA is optimal with high probability in the worst case. We analyze an elegant and remarkably simple algorithm (“the threshold algorithm”, or TA) that is optimal in a much stronger sense than FA. We show that TA is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. Unlike FA, which requires large buffers (whose size may grow unboundedly as the database size grows), TA requires only a small, constant-size buffer. TA allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. We distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of object in a list, and obtains it in one step). We consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well.
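A compact sketch of the threshold algorithm with sum as the monotone aggregation function: sorted access proceeds down every list in parallel, random access completes the grades of each newly seen object, and the algorithm stops once k objects have an overall grade at least the threshold formed from the last grades seen under sorted access; the lists and grades are toy data.

    import heapq

    def threshold_algorithm(sorted_lists, grades, k, aggregate=sum):
        # sorted_lists: per attribute, (object, grade) pairs sorted by grade, highest first
        # grades: per attribute, a dict for random access to any object's grade
        seen, top = {}, []
        for depth in range(max(len(lst) for lst in sorted_lists)):
            last = []
            for attr, lst in enumerate(sorted_lists):
                obj, g = lst[min(depth, len(lst) - 1)]
                last.append(g)
                if obj not in seen:
                    seen[obj] = aggregate(grades[a][obj] for a in range(len(sorted_lists)))
                    heapq.heappush(top, (seen[obj], obj))
                    if len(top) > k:
                        heapq.heappop(top)       # keep only the k best objects seen so far
            threshold = aggregate(last)
            if len(top) == k and top[0][0] >= threshold:
                break                            # no unseen object can beat the k-th best
        return sorted(top, reverse=True)

    color = [("a", 0.9), ("b", 0.8), ("c", 0.3)]
    shape = [("b", 0.95), ("c", 0.7), ("a", 0.2)]
    print(threshold_algorithm([color, shape], [dict(color), dict(shape)], k=1))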
Article
A formal analysis of probabilities, possibilities, and fuzzy sets is presented in this paper. A number of theorems proved show the above measures have equal representational power when their domains are infinite. However, for finite domains, it is proved that probabilities have a higher representational power than both possibilities and fuzzy sets. The cost of this increased power is high computational complexity and reduced computational efficiency. The resulting trade-off of high complexity and representational power versus computational efficiency is discussed under the spectrum of experimental systems and applications.
Article
A fuzzy set is a class of objects with a continuum of grades of membership. Such a set is characterized by a membership (characteristic) function which assigns to each object a grade of membership ranging between zero and one. The notions of inclusion, union, intersection, complement, relation, convexity, etc., are extended to such sets, and various properties of these notions in the context of fuzzy sets are established. In particular, a separation theorem for convex fuzzy sets is proved without requiring that the fuzzy sets be disjoint.
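A small illustration of membership grades together with the standard extensions of union (maximum), intersection (minimum) and complement (one minus the grade); the sets and grades are made up for the example.

    tall = {"ann": 0.9, "bob": 0.6, "carl": 0.2}
    fast = {"ann": 0.4, "bob": 0.7, "carl": 0.8}

    def fuzzy_union(a, b):
        return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

    def fuzzy_intersection(a, b):
        return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

    def fuzzy_complement(a):
        return {x: round(1.0 - m, 3) for x, m in a.items()}

    print(fuzzy_union(tall, fast))          # ann 0.9, bob 0.7, carl 0.8
    print(fuzzy_intersection(tall, fast))   # ann 0.4, bob 0.6, carl 0.2
    print(fuzzy_complement(tall))           # ann 0.1, bob 0.4, carl 0.8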
Article
Schema matching is the problem of finding correspondences (mapping rules, e.g. logical formulae) between heterogeneous schemas, e.g. in the data exchange domain, or for distributed IR in federated digital libraries. This paper introduces a probabilistic framework, called sPLMap, for automatically learning schema mapping rules, based on given instances of both schemas. Different techniques, mostly from the IR and machine learning fields, are combined for finding suitable mapping candidates. Our approach gives a probabilistic interpretation of the prediction weights of the candidates, selects the rule set with the highest matching probability, and outputs probabilistic rules which are capable of dealing with the intrinsic uncertainty of the mapping process. Our approach with different variants has been evaluated on several test sets.
Conference Paper
This panel brings together researchers and practitioners from academia, government and industry to address the challenges of disaster data management (DisDM) and discuss solution approaches.
Conference Paper
Fundamental notions of relative information capacity between database structures are studied in the context of the relational model. Four progressively less restrictive formal definitions of "dominance" between pairs of relational database schemata are given. Each of these is shown to capture intuitively appealing, semantically meaningful properties which are natural for measures of relative information capacity between schemata. Relational schemata, both with and without key dependencies, are studied using these notions. A significant intuitive conclusion concerns the informal notion of relative information capacity often suggested in the conceptual database literature, which is based on accessibility of data via queries. Results here indicate that this notion is too general to accurately measure whether an underlying semantic connection exists between database schemata. Another important result of the paper shows that under any natural notion of information capacity equivalence, two relational schemata (with no dependencies) are equivalent if and only if they are identical (up to re-ordering of the attributes and relations). The approach and definitions used here can form part of the foundation for a rigorous investigation of a variety of important database problems involving data relativism, including those of schema integration and schema translation.
Conference Paper
Most virtual database systems are suitable for environments in which the set of member information sources is small and stable. Consequently, present virtual database systems do not scale up very well. The main reason is the complexity and cost of incorporating new information sources into the virtual database. In this paper we describe a system, called Autoplex, which uses machine learning techniques for automating the discovery of new content for virtual database systems. Autoplex assumes that several information sources have already been incorporated ("mapped") into the virtual database system by human experts (as done in standard virtual database systems). Autoplex learns the features of these examples. It then applies this knowledge to new candidate sources, trying to infer views that "resemble" the examples. In this paper we report initial results from the Autoplex project.
Conference Paper
Uncertainty management at the core of data integration was motivated by new approaches to data management, such as dataspaces [2], and fully automatic schema matching takes an increasingly prominent role in this field. Recent works suggested the parallel use of several alternative schema matchings as an uncertainty management tool [3,1]. In this work we offer OntoMatcher, an extension of the OntoBuilder [4] schema matching tool that supports the management of multiple (top-K) schema matching alternatives.
Conference Paper
Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant up-front effort of creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a pay-as-you-go approach to integration. With that approach, a system starts with very few (or inaccurate) semantic mappings and these mappings are improved over time as deemed necessary. This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced of a starting point we can provide a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.
Conference Paper
In this paper we discuss aspects of cardinality constraints in schema matching. A new cardinality classification is proposed, emphasizing the challenges in schema matching that evolve from cardinality constraints. We also offer a new research direction for automating schema matching to manage cardinality constraints.
Conference Paper
In applications that accomplish XML data integration and XML instance querying, the problem of XML path matching plays a central role. This paper presents an approach for matching XML paths that consists of (1) PathSim, a similarity function specifically designed for matching XML paths and (2) a set of pre-processing functions to be applied to XML paths that are to be compared by a similarity function. The reported experiments demonstrate that PathSim achieves matches of higher quality than a similarity function for XML paths found in literature. The experiments further show that matches of higher quality are achieved when the proposed pre-processing functions are employed.
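As a rough illustration of the pre-processing idea only (this is not the paper's PathSim function), the sketch below splits XML paths into steps, strips namespace prefixes, splits camelCase tokens, lowercases them, and scores two paths by token overlap.

    import re

    def preprocess(path):
        steps = [s.split(":")[-1] for s in path.strip("/").split("/")]   # drop namespaces
        tokens = []
        for step in steps:
            tokens += re.findall(r"[A-Za-z][a-z]*|\d+", step)            # split camelCase
        return [t.lower() for t in tokens]

    def path_similarity(p1, p2):
        t1, t2 = set(preprocess(p1)), set(preprocess(p2))
        return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

    print(path_similarity("/ns:PurchaseOrder/ShipTo/City", "/order/shippingAddress/city"))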
Article
This paper surveys the techniques used for designing the most efficient algorithms for finding a maximum cardinality or weighted matching in (general or bipartite) graphs. It also lists some open problems concerning possible improvements in existing algorithms and the existence of fast parallel algorithms for these problems.
Article
Schema matching is the task of providing correspondences between concepts describing the meaning of data in various heterogeneous, distributed data sources. It is recognized to be one of the basic operations required by the process of data and schema integration, and its outcome serves in many tasks such as targeted content delivery and view integration. Schema matching research has been going on for more than 25 years now. An interesting research topic that was largely left untouched involves the automatic selection of schema matchers into an ensemble, a set of schema matchers. To the best of our knowledge, none of the existing algorithmic solutions offer such a selection feature. In this paper we provide a thorough investigation of this research topic. We introduce a new heuristic, Schema Matcher Boosting (SMB). We show that SMB has the ability to choose among schema matchers and to tune their importance. As such, SMB introduces a new promise for schema matcher designers: instead of trying to design a perfect schema matcher, a designer can focus on finding schema matchers that perform better than random.