Article

Learning to Rerank Schema Matches

Abstract

Schema matching is at the heart of integrating structured and semi-structured data, with applications in data warehousing, data analysis recommendations, Web table matching, etc. Schema matching is known to be an uncertain process, and a common method for overcoming this uncertainty presents a human expert with a ranked list of possible schema matches to choose from, known as top-K matching. In this work we propose a learning algorithm that utilizes an innovative set of features to rerank a list of schema matches and improve upon the ranking of the best match. We provide a bound on the size of an initial match list, tying the number of matches to a desired level of confidence in finding the best match. We also propose the use of matching predictors as features in a learning task, and tailor nine new matching predictors for this purpose. The proposed algorithm assists the matching process by introducing a quality set of alternative matches to a human expert. It also serves as a step towards eliminating the involvement of human experts as decision makers in a matching process altogether. A large-scale empirical evaluation with a real-world benchmark shows the effectiveness of the proposed algorithmic solution.

... A major challenge in data integration is a matching task, which creates correspondences between model elements, may they be schema attributes, ontology concepts, model entities, or process activities. Matching research has been a focus for multiple disciplines including Databases [33], Artificial Intelligence [7], Semantic Web [12], Process Management [26], and Data Mining [16]. Most studies have focused on designing high quality matchers, automatic tools for identifying correspondences. ...
... A matching predictor is a function that quantifies the quality of a match (given as a matching matrix). For example, a dominants predictor measures the proportion of dominant ... Recently, matching predictors were suggested as features in learning to rerank schema matches (LRSM) [16]. We use the matching matrix (computed from the history H, see Section II-A2) to generate matching predictor features, denoted as $\Phi_{LRSM}(H)$. ...
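As a rough illustration of a matching predictor used as a feature, the sketch below counts dominant entries of a matching matrix. The excerpt does not give the exact formula, so the definition used here (an entry is dominant when it is positive and is the maximum of both its row and its column) and the normalization by the smaller matrix dimension are assumptions made for illustration. The function name `dominants` is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def dominants(m: Matrix) -> float:
    """Share of dominant entries in a matching matrix.

    ASSUMPTION: an entry is "dominant" if it is positive and equals the
    maximum of both its row and its column; the count is normalized by
    min(rows, cols). This is an illustrative definition, not one taken
    from the cited work.
    """
    if not m or not m[0]:
        return 0.0
    n_rows, n_cols = len(m), len(m[0])
    row_max = [max(row) for row in m]
    col_max = [max(m[i][j] for i in range(n_rows)) for j in range(n_cols)]
    count = sum(
        1
        for i in range(n_rows)
        for j in range(n_cols)
        if m[i][j] > 0 and m[i][j] == row_max[i] and m[i][j] == col_max[j]
    )
    return count / min(n_rows, n_cols)


# A clean matrix: every source attribute has one clear best partner.
clean = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4]]
# An ambiguous matrix: both rows prefer the same column.
ambiguous = [[0.9, 0.8], [0.85, 0.1]]
```

Under these assumptions `dominants(clean)` is 1.0 and `dominants(ambiguous)` is 0.5, so the value could serve as one numeric feature per candidate match. Tied maxima would each be counted, which a real predictor may want to handle differently.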
... Specifically, predictors that capture negative characteristics such as uncertainty, diversity, and variability were shown to correlate with recall (and negatively correlate with precision). For example, matrix norm predictors [16] are used to quantify the amount of error in the matching matrix, which can be attributed to uncertainty. ...
Preprint
Matching is a task at the heart of any data integration process, aimed at identifying correspondences among data elements. Matching problems were traditionally solved in a semi-automatic manner, with correspondences being generated by matching algorithms and outcomes subsequently validated by human experts. Human-in-the-loop data integration has been recently challenged by the introduction of big data and recent studies have analyzed obstacles to effective human matching and validation. In this work we characterize human matching experts, those humans whose proposed correspondences can mostly be trusted to be valid. We provide a novel framework for characterizing matching experts that, accompanied with a novel set of features, can be used to identify reliable and valuable human experts. We demonstrate the usefulness of our approach using an extensive empirical evaluation. In particular, we show that our approach can improve matching results by filtering out inexpert matchers.
... Learning to Rerank Matches: Choosing the best match from a top-K list is basically a cognitive task, usually done by humans. A recent work offers an algorithmic replacement for humans in selecting the best match [3,4], a task traditionally reserved for human verifiers. The novelty of this work is in the use of similarity matrices as a basis for learning features, creating feature-rich datasets that fit learning and enrich algorithmic matching beyond that of human matching. ...
Preprint
Data integration has been recently challenged by the need to handle large volumes of data, arriving at high velocity from a variety of sources, which demonstrate varying levels of veracity. This challenging setting, often referred to as big data, renders many of the existing techniques, especially those that are human-intensive, obsolete. Big data also produces technological advancements such as Internet of things, cloud computing, and deep learning, and accordingly, provides a new, exciting, and challenging research agenda. Given the availability of data and the improvement of machine learning techniques, this blog discusses the respective roles of humans and machines in achieving cognitive tasks in matching, aiming to determine whether traditional roles of humans and machines are subject to change. Such investigation, we believe, will pave a way to better utilize both human and machine resources in new and innovative manners. We shall discuss two possible modes of change, namely humans out and humans in. Humans out aims at exploring out-of-the-box latent matching reasoning using machine learning algorithms when attempting to overpower human matcher performance. Pursuing out-of-the-box thinking, machine and deep learning can be involved in matching. Humans in explores how to better involve humans in the matching loop by assigning human matchers a role symmetric to that of algorithmic matchers in the matching process.
... Schema matching research originated in the database community [52] and has been a focus for other disciplines as well, from artificial intelligence [32], to semantic web [24], to data mining [29,34]. Schema matching research has been going on for more than 30 years now, focusing on designing high quality matchers, automatic tools for identifying correspondences among database attributes. ...
Preprint
Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch that makes use of a deep learning mechanism to calibrate and filter human matching decisions adhering to the quality of a match, which are then combined with algorithmic matching to generate better match results. We provide empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
... Reference matches in these datasets were manually constructed by domain experts and considered as ground truth for our purposes. Experiments are performed per dataset, consistent with existing schema matching papers [17,31,37]. For each dataset, 80% was used to train the initial prediction model, another 10% was used to further tune the weights, and the remaining 10% was used to evaluate the experiments. ...
Chapter
Schema matching aims to identify the correspondences among attributes of database schemas. It is frequently considered the most challenging and decisive stage in many contemporary web semantics and database systems. Low-quality algorithmic matchers fail to provide improvement, while manual annotation consumes extensive human effort. Further complications arise from data privacy in certain domains such as healthcare, where only schema-level matching should be used to prevent data leakage. For this problem, we propose SMAT, a new deep learning model based on state-of-the-art natural language processing techniques to obtain semantic mappings between source and target schemas using only the attribute name and description. SMAT avoids directly encoding domain knowledge about the source and target systems, which allows it to be more easily deployed across different sites. We also introduce a new benchmark dataset, OMAP, based on real-world schema-level mappings from the healthcare domain. Our extensive evaluation of various benchmark datasets demonstrates the potential of SMAT to help automate schema-level matching tasks.
... This can be done through matchers (such as string similarity matchers [17]) that employ attribute names, instance data, schema structure, etc. To separate genuine relationships from false positives generated by poor matchers, ranking techniques have to be employed [19]. ...
Preprint
Virtual Knowledge Graphs (VKG) constitute one of the most promising paradigms for integrating and accessing legacy data sources. A critical bottleneck in the integration process involves the definition, validation, and maintenance of mappings that link data sources to a domain ontology. To support the management of mappings throughout their entire lifecycle, we propose a comprehensive catalog of sophisticated mapping patterns that emerge when linking databases to ontologies. To do so, we build on well-established methodologies and patterns studied in data management, data analysis, and conceptual modeling. These are extended and refined through the analysis of concrete VKG benchmarks and real-world use cases, and considering the inherent impedance mismatch between data sources and ontologies. We validate our catalog on the considered VKG scenarios, showing that it covers the vast majority of patterns present therein.
Article
Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch that makes use of a deep learning mechanism to calibrate and filter human matching decisions adhering to the quality of a match, which are then combined with algorithmic matching to generate better match results. We provide empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
Article
Schema matching is a process that serves in integrating structured and semi-structured data. Being a handy tool in multiple contemporary business and commerce applications, it has been investigated in the fields of databases, AI, Semantic Web, and data mining for many years. The core challenge still remains the ability to create quality algorithmic matchers, automatic tools for identifying correspondences among data concepts (e.g., database attributes). In this work, we offer a novel post-processing step to schema matching that improves the final matching outcome without human intervention. We present a new mechanism, similarity matrix adjustment, to calibrate a matching result and propose an algorithm (dubbed ADnEV) that uses deep neural networks to manipulate similarity matrices created by state-of-the-art algorithmic matchers. ADnEV learns two models that iteratively adjust and evaluate the original similarity matrix. We empirically demonstrate the effectiveness of the proposed algorithmic solution for improving matching results, using real-world benchmark ontology and schema sets. We show that ADnEV can generalize into new domains without the need to learn the domain terminology, thus allowing cross-domain learning. We also show ADnEV to be a powerful tool in handling schemata whose matching is particularly challenging. Finally, we show the benefit of using ADnEV in a related integration task of ontology alignment.
Chapter
Historically, matching problems (including process matching, schema matching, and entity resolution) were considered semiautomated tasks in which correspondences are generated by matching algorithms and subsequently validated by human expert(s). The role of humans as validators is diminishing, in part due to the amount and size of matching tasks. Our vision for the changing role of humans in matching is divided into two main approaches, namely Humans Out and Humans In. The former questions the inherent need for humans in the matching loop, while the latter focuses on overcoming human cognitive biases via algorithmic assistance. Above all, we observe that matching requires unconventional thinking demonstrated by advanced machine learning methods to complement (and possibly take over) the role of humans in matching.
Chapter
Process model matching refers to the task of creating correspondences among activities of different process models. This task is crucial whenever comparison and alignment of process models are called for. In recent years, there have been a few attempts to tackle process model matching. Yet, evaluating the obtained sets of correspondences reveals high variability in the results. Addressing this issue, we propose a method for predicting the quality of results derived by process model matchers. As such, prediction serves as a case-by-case decision making tool in estimating the amount of trust one should put into automatic matching. This paper proposes a model of prediction for process matching based on both process properties and preliminary match results.
Article
LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning To Rank Challenge. The details of these algorithms are spread across several papers and reports, and so here we give a self-contained, detailed and complete description of them.
Conference Paper
We present the Auto Mapping Core (AMC), a new framework that supports fast construction and tuning of schema matching approaches for specific domains such as ontology alignment, model matching or database-schema matching. Distinctive features of our framework are new visualisation techniques for modelling matching processes, stepwise tuning of parameters, intermediate result analysis and performance-oriented rewrites. Furthermore, existing matchers can be plugged into the framework to comparatively evaluate them in a common environment. This allows deeper analysis of behaviour and shortcomings in existing complex matching systems.
Conference Paper
Information seeking is the process in which human beings recourse to information resources in order to increase their level of knowledge with respect to their goals. In this paper we offer a methodology for automating the evolution of ontologies and share the results of our experiments in supporting a user in seeking information using interactive systems. The main conclusion of our experiments is that if one narrows down the scope of the domain, ontologies can be extracted with a very high level of precision (more than 90% in some cases). The paper is a step in providing theoretical, as well as practical, foundation for automatic ontology generation. It is our belief that such a process would allow the creation of flexible tools to manage metadata, either as an aid to a designer or as an independent system (“smart agent”) for time critical missions.
Conference Paper
Schema matching is the task of finding semantic correspondences between elements of two schemas. It is needed in many database applications, such as integration of web data sources, data warehouse loading and XML message mapping. To reduce the amount of user effort as much as possible, automatic approaches combining several match techniques are required. While such match approaches have found considerable interest recently, the problem of how to best combine different match algorithms still requires further work. We have thus developed the COMA schema matching system as a platform to combine multiple matchers in a flexible way. We provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions. We use COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas. The results obtained so far show the superiority of combined match approaches and indicate the high value of reuse-oriented strategies.
Conference Paper
In the field of information retrieval, one is often faced with the problem of computing the correlation between two ranked lists. The most commonly used statistic that quantifies this correlation is Kendall's τ. Oftentimes, in the information retrieval community, discrepancies among those items having high rankings are more important than those among items having low rankings. The Kendall's τ statistic, however, does not make such distinctions and equally penalizes errors both at high and low rankings. In this paper, we propose a new rank correlation coefficient, AP correlation ($\tau_{ap}$), that is based on average precision and has a probabilistic interpretation. We show that the proposed statistic gives more weight to the errors at high rankings and has nice mathematical properties which make it easy to interpret. We further validate the applicability of the statistic using experimental data.
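The contrast between the two statistics can be sketched in a few lines. The code below assumes two rankings of the same items without ties, and uses the commonly cited form of AP correlation, $\tau_{ap} = \frac{2}{N-1}\sum_{i=2}^{N}\frac{C(i)}{i-1} - 1$, where $C(i)$ is the number of items above rank $i$ in the evaluated list that the reference list also places above the item at rank $i$. The abstract itself does not state the formula, so treat this as an assumption; the function names are illustrative.

```python
from itertools import combinations
from typing import Hashable, Sequence


def kendall_tau(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Kendall's tau for two tie-free rankings of the same items."""
    pos = {item: rank for rank, item in enumerate(truth)}
    concordant = discordant = 0
    # combinations() keeps list order, so `a` is ranked above `b` in pred.
    for a, b in combinations(pred, 2):
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def tau_ap(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """AP correlation of `pred` against the reference ranking `truth`."""
    pos = {item: rank for rank, item in enumerate(truth)}
    n = len(pred)
    total = 0.0
    for i in range(1, n):  # 0-based index i corresponds to rank i + 1
        # Items above position i that truth also ranks above pred[i].
        correct = sum(1 for a in pred[:i] if pos[a] < pos[pred[i]])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0


truth = ["a", "b", "c", "d"]
top_swap = ["b", "a", "c", "d"]      # one error at the top of the list
bottom_swap = ["a", "b", "d", "c"]   # one error at the bottom of the list
```

With these inputs Kendall's τ is 2/3 for both lists, while `tau_ap` gives 1/3 for `top_swap` and 7/9 for `bottom_swap`, reflecting the heavier penalty for errors at high rankings.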
Conference Paper
The paper describes a prototype tool, named DIXSE, which supports the integration of XML Document Type Definitions (DTDs) into a common conceptual schema. The mapping from each individual DTD into the common schema is used to automatically generate wrappers for XML documents, which conform to a given DTD. These wrappers are used to populate the common conceptual schema thereby achieving data integration for XML documents.
Conference Paper
Schema integration is the problem of creating a unified target schema based on a set of existing source schemas and based on a set of correspondences that are the result of matching the source schemas. Previous methods for schema integration rely on the exploration, implicit or explicit, of the multiple design choices that are possible for the integrated schema. Such exploration relies heavily on user interaction; thus, it is time consuming and labor intensive. Furthermore, previous methods have ignored the additional information that typically results from the schema matching process, that is, the weights and in some cases the directions that are associated with the correspondences. In this paper, we propose a more automatic approach to schema integration that is based on the use of directed and weighted correspondences between the concepts that appear in the source schemas. A key component of our approach is a novel top-k ranking algorithm for the automatic generation of the best candidate schemas. The algorithm gives more weight to schemas that combine the concepts with higher similarity or coverage. Thus, the algorithm makes certain decisions that otherwise would likely be taken by a human expert. We show that the algorithm runs in polynomial time and moreover has good performance in practice.
Conference Paper
Schema matching is the task of matching between concepts describing the meaning of data in various heterogeneous, distributed data sources. With many heuristics to choose from, several tools have enabled the use of schema matcher ensembles, combining principles by which different schema matchers judge the similarity between concepts. In this work, we investigate means of estimating the uncertainty involved in schema matching and harnessing it to improve an ensemble outcome. We propose a model for schema matching, based on simple probabilistic principles. We then propose the use of machine learning in determining the best mapping and discuss its pros and cons. Finally, we provide a thorough empirical analysis, using both real-world and synthetic data, to test the proposed technique. We conclude that the proposed heuristic performs well, given an accurate modeling of uncertainty in matcher decision making.
Article
We define two generalized types of a priority queue by allowing some forms of changing the priorities of the elements in the queue. We show that they can be implemented efficiently. Consequently, each operation takes O(log n) time. We use these generalized priority queues to construct an O(EV log V) algorithm for finding a maximal weighted matching in general graphs. Key words: matching, augmenting path, blossoms, generalized priority queues, primal dual algorithm, time complexity. Introduction. We are given a graph $G = (V, E)$ with vertex set $V$ and edge set $E$. Each edge $(i, j) \in E$ has a weight $w_{ij}$ associated with it. A matching is a subset of the edges, no two of which have a common vertex. We want to find a matching with the maximal total weight. In this paper we deal with the general problem. There are three restricted versions
Article
We present a new ranking algorithm that combines the strengths of two previous methods: boosted tree classification, and LambdaRank, which has been shown to be empirically optimal for a widely used information retrieval measure. Our algorithm is based on boosted regression trees, although the ideas apply to any weak learners, and it is significantly faster in both train and test phases than the state of the art, for comparable accuracy. We also show how to find the optimal linear combination for any two rankers, and we use this method to solve the line search problem exactly during boosting. In addition, we show that starting with a previously trained model, and boosting using its residuals, furnishes an effective technique for model adaptation, and we give significantly improved results for a particularly pressing problem in web search—training rankers for markets for which only small amounts of labeled data are available, given a ranker trained on much more data from a larger market.
Article
The introduction of the Semantic Web vision and the shift toward machine understandable Web resources has unearthed the importance of automatic semantic reconciliation. Consequently, new tools for automating the process were proposed. In this work we present a formal model of semantic reconciliation and analyze in a systematic manner the properties of the process outcome, primarily the inherent uncertainty of the matching process and how it reflects on the resulting mappings. An important feature of this research is the identification and analysis of factors that impact the effectiveness of algorithms for automatic semantic reconciliation, leading, it is hoped, to the design of better algorithms by reducing the uncertainty of existing algorithms. Against this background we empirically study the aptitude of two algorithms to correctly match concepts. This research is both timely and practical in light of recent attempts to develop and utilize methods for automatic semantic reconciliation.
Conference Paper
While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document in a given position has always the same gain and discount independently of the documents shown above it. Inspired by the "cascade" user model, we present a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents. More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. This can be seen as an extension of the classical reciprocal rank to the graded relevance case and we call this metric Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with clicks metrics than other editorial metrics.
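The cascade idea can be made concrete with a short sketch. It assumes the commonly cited mapping from a relevance grade $g$ to a satisfaction probability, $R(g) = (2^g - 1)/2^{g_{max}}$, which the abstract does not spell out; the function name and the example grades are illustrative only.

```python
from typing import Sequence


def expected_reciprocal_rank(grades: Sequence[int], max_grade: int) -> float:
    """ERR under a cascade user model.

    ASSUMPTION: a document with grade g satisfies the user with
    probability (2**g - 1) / 2**max_grade.
    """
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, g in enumerate(grades, start=1):
        p_satisfied = (2 ** g - 1) / (2 ** max_grade)
        # The user stops here with probability p_reach * p_satisfied.
        err += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return err


best_first = [3, 0, 0]   # highly relevant document at rank 1
best_second = [0, 3]     # same document pushed to rank 2
two_best = [3, 3]        # second document adds little after a strong first
```

With `max_grade=3`, `best_first` scores 0.875, `best_second` scores 0.4375, and `two_best` scores 0.9296875: the second highly relevant document contributes far less once a very relevant document sits above it, which is the implicit discounting described above.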
Article
To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this article takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this “deep Web,” query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preprocessing, dual mining of positive and negative correlations, and finally matching construction. We evaluate the DCM framework on manually extracted interfaces and the results show good accuracy for discovering complex matchings. Further, to automate the
Article
A family of scaling corrections aimed to improve the chi-square approximation of goodness-of-fit test statistics in small samples, large models, and nonnormal data was proposed in Satorra and Bentler (1994). For structural equations models, Satorra-Bentler's (SB) scaling corrections are available in standard computer software. Often, however, the interest is not on the overall fit of a model, but on a test of the restrictions that a null model, say $M_0$, implies on a less restricted one, $M_1$. If $T_0$ and $T_1$ denote the goodness-of-fit test statistics associated to $M_0$ and $M_1$, respectively, then typically the difference $T_d = T_0 - T_1$ is used as a chi-square test statistic with degrees of freedom equal to the difference on the number of independent parameters estimated under the models $M_0$ and $M_1$. As in the case of the goodness-of-fit test, it is of interest to scale the statistic $T_d$ in order to improve its chi-square approximation in realistic, that is, nonasymptotic and nonnormal, applications. In a recent paper, Satorra (2000) shows that the difference between two SB scaled test statistics for overall model fit does not yield the correct SB scaled difference test statistic. Satorra developed an expression that permits scaling the difference test statistic, but his formula has some practical limitations, since it requires heavy computations that are not available in standard computer software. The purpose of the present paper is to provide an easy way to compute the scaled difference chi-square statistic from the scaled goodness-of-fit test statistics of models $M_0$ and $M_1$. A Monte Carlo study is provided to illustrate the performance of the competing statistics.
Article
This chapter describes gene expression analysis by Singular Value Decomposition (SVD), emphasizing initial characterization of the data. We describe SVD methods for visualization of gene expression data, representation of the data using a smaller number of variables, and detection of patterns in noisy gene expression data. In addition, we describe the precise relation between SVD analysis and Principal Component Analysis (PCA) when PCA is calculated using the covariance matrix, enabling our descriptions to apply equally well to either method. Our aim is to provide definitions, interpretations, examples, and references that will serve as resources for understanding and extending the application of SVD and PCA to gene expression analysis.
Article
This paper proposes evaluation methods based on the use of non-dichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable from the user point of view in modern large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) two novel measures computing the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. We then demonstrate the use of these evaluation methods in a case study on the effectiveness of query types, based on combinations of query structures and expansion, in retrieving documents of various degrees of relevance. The test was run with a best match retrieval system (InQuery I) in a text database consisting of newspaper articles. The results indicate that the tested strong query structures are most effective in retrieving highly relevant documents. The differences between the query types are practically essential and statistically significant. More generally, the novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
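The two cumulative-gain measures mentioned in (2) can be sketched as running sums over graded relevance scores. The discounting rule below (divide the gain at rank $i$ by $\log_b i$ once $i \ge b$, and leave earlier ranks undiscounted) is the form usually associated with these measures and is stated here as an assumption, since the abstract gives no formula; the function names and the sample gain vector are illustrative.

```python
import math
from typing import List, Sequence


def cumulative_gain(gains: Sequence[float]) -> List[float]:
    """CG[i]: total gain collected by examining ranks 1..i."""
    out: List[float] = []
    total = 0.0
    for g in gains:
        total += g
        out.append(total)
    return out


def discounted_cumulative_gain(gains: Sequence[float], base: float = 2.0) -> List[float]:
    """DCG[i]: like CG, but late ranks contribute less.

    ASSUMPTION: gain at rank i is divided by log_base(i) when i >= base,
    and is left undiscounted for earlier ranks.
    """
    out: List[float] = []
    total = 0.0
    for rank, g in enumerate(gains, start=1):
        discount = math.log(rank, base) if rank >= base else 1.0
        total += g / discount
        out.append(total)
    return out


# Graded relevance (0 = irrelevant .. 3 = highly relevant) of a ranked result.
gains = [3, 2, 3, 0, 1]
cg = cumulative_gain(gains)               # [3.0, 5.0, 8.0, 8.0, 9.0]
dcg = discounted_cumulative_gain(gains)   # third value is 5 + 3 / log2(3)
```

Reading either vector at position $i$ gives the credit a system earns for the first $i$ results, so two systems can be compared at any cutoff while still rewarding highly relevant documents more than marginally relevant ones.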
Conference Paper
Ontology & schema matching predictors assess the quality of matchers in the absence of an exact match. We propose MCD (Match Competitor Deviation), a new diversity-based predictor that compares the strength of a matcher confidence in the correspondence of a concept pair with respect to other correspondences that involve either concept. We also propose to use MCD as a regulator to optimally control a balance between Precision and Recall and use it towards 1:1 matching by combining it with a similarity measure that is based on solving a maximum weight bipartite graph matching (MWBM). Optimizing the combined measure is known to be an NP-Hard problem. Therefore, we propose CEM, an approximation to an optimal match by efficiently scanning multiple possible matches, using rare event estimation. Using a thorough empirical study over several benchmark real-world datasets, we show that MCD outperforms other state-of-the-art predictors and that CEM significantly outperforms existing matchers.
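To make the MWBM building block concrete, the sketch below selects a 1:1 match from a small similarity matrix by exhaustive search. It is a brute-force illustration only (factorial time, suitable for toy inputs), not the CEM procedure or the MCD predictor described above; the function name and the example matrix are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def max_weight_matching(sim: List[List[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Exhaustive maximum weight bipartite matching.

    Each row (source attribute) is paired with a distinct column
    (target attribute). Requires rows <= columns.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    if n_rows > n_cols:
        raise ValueError("expected at most as many rows as columns")
    best_weight = float("-inf")
    best_pairs: List[Tuple[int, int]] = []
    # Try every way of assigning distinct columns to the rows.
    for cols in permutations(range(n_cols), n_rows):
        weight = sum(sim[i][j] for i, j in enumerate(cols))
        if weight > best_weight:
            best_weight = weight
            best_pairs = list(enumerate(cols))
    return best_weight, best_pairs


# Greedily taking the 0.9 entry forces the weak 0.1 pairing (total 1.0);
# the optimal 1:1 match pairs row 0 with column 1 and row 1 with column 0.
sim = [[0.9, 0.8],
       [0.85, 0.1]]
weight, pairs = max_weight_matching(sim)
```

For realistic schema sizes a polynomial-time assignment solver would replace the permutation loop; SciPy's `scipy.optimize.linear_sum_assignment` with `maximize=True` is one such option.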
Article
Many organizations maintain textual process descriptions alongside graphical process models. The purpose is to make process information accessible to various stakeholders, including those who are not familiar with reading and interpreting the complex execution logic of process models. Despite this merit, there is a clear risk that model and text become misaligned when changes are not applied to both descriptions consistently. For organizations with hundreds of different processes, the effort required to identify and clear up such conflicts is considerable. To support organizations in keeping their process descriptions consistent, we present an approach to automatically identify inconsistencies between a process model and a corresponding textual description. Our approach detects cases where the two process representations describe activities in different orders and detects process model activities not contained in the textual description. A quantitative evaluation with 53 real-life model-text pairs demonstrates that our approach accurately identifies inconsistencies between model and text.
Conference Paper
Data analysis may be a difficult task, especially for non-expert users, as it requires deep understanding of the investigated domain and the particular context. In this demo we present REACT, a system that hooks to the analysis UI and provides the users with personalized recommendations of analysis actions. By matching the current user session to previous sessions of analysts working with the same or other data sets, REACT is able to identify the potentially best next analysis actions in the given user context. Unlike previous work that mainly focused on individual components of the analysis work, REACT provides a holistic approach that captures a wider range of analysis action types by utilizing novel notions of similarity in terms of the individual actions, the analyzed data and the entire analysis workflow. We demonstrate the functionality of REACT, as well as its effectiveness through a digital forensics scenario where users are challenged to detect cyber attacks in real-life data obtained from honeypot servers.
Article
The number of research papers published nowadays on ontology matching is remarkable, and we believe that it reflects the growing interest of the research community. However, for new practitioners approaching the field, this amount of information might seem overwhelming. Therefore, the purpose of this work is to help new practitioners get a general idea of the state of the field and to determine possible research lines. To do so, we first perform a literature review of the field in the last decade by means of an online search. The articles retrieved are sorted using a classification framework that we propose, and the different categories are reviewed and analyzed. The information in this review is extended and supported by the results of a survey that we designed and conducted among practitioners.
Article
Web search engines are increasingly deploying many features, combined using learning to rank techniques. However, various practical questions remain concerning the manner in which learning to rank should be deployed. For instance, a sample of documents with sufficient recall is used, such that re-ranking of the sample by the learned model brings the relevant documents to the top. However, the properties of the document sample such as when to stop ranking—i.e. its minimum effective size—remain unstudied. Similarly, effective listwise learning to rank techniques minimise a loss function corresponding to a standard information retrieval evaluation measure. However, the appropriate choice of how to calculate the loss function—i.e. the choice of the learning evaluation measure and the rank depth at which this measure should be calculated—are as yet unclear. In this paper, we address all of these issues by formulating various hypotheses and research questions, before performing exhaustive experiments using multiple learning to rank techniques and different types of information needs on the ClueWeb09 and LETOR corpora. Among many conclusions, we find, for instance, that the smallest effective sample for a given query set is dependent on the type of information need of the queries, the document representation used during sampling and the test evaluation measure. As the sample size is varied, the selected features markedly change—for instance, we find that the link analysis features are favoured for smaller document samples. Moreover, despite reflecting a more realistic user model, the recently proposed ERR measure is not as effective as the traditional NDCG as a learning loss function. Overall, our comprehensive experiments provide the first empirical derivation of best practices for learning to rank deployments.
Article
Web-scale data integration involves fully automated efforts which lack knowledge of the exact match between data descriptions. In this paper, we introduce schema matching prediction, an assessment mechanism to support schema matchers in the absence of an exact match. Given attribute pair-wise similarity measures, a predictor predicts the success of a matcher in identifying correct correspondences. We present a comprehensive framework in which predictors can be defined, designed, and evaluated. We formally define schema matching evaluation and schema matching prediction using similarity spaces and discuss a set of four desirable properties of predictors, namely correlation, robustness, tunability, and generalization. We present a method for constructing predictors, supporting generalization, and introduce prediction models as means of tuning prediction toward various quality measures. We define the empirical properties of correlation and robustness and provide concrete measures for their evaluation. We illustrate the usefulness of schema matching prediction by presenting three use cases: We propose a method for ranking the relevance of deep Web sources with respect to given user needs. We show how predictors can assist in the design of schema matching systems. Finally, we show how prediction can support dynamic weight setting of matchers in an ensemble, thus improving upon current state-of-the-art weight setting methods. An extensive empirical evaluation shows the usefulness of predictors in these use cases and demonstrates the usefulness of prediction models in increasing the performance of schema matching.
Conference Paper
Web services technology and the corresponding software have been widely acknowledged in recent years. However, some new features of Web service-based software, such as heterogeneity and loose coupling, make its later maintenance and comprehension difficult. Complexity analysis of such systems helps address this. At present, existing research mainly concerns complexity metrics for control flow. In this paper, we give a data complexity metric set as an effective complement. Data complexity can be measured from two perspectives: data traffic and data dependency. For the first, the volume of data flow is scaled by analyzing the service's parameters and their data types. The second is implemented by analyzing the definition and use of variables in the BPEL program dependence graph. Based on the def-use pairs, the metric subsets about degree, def-use chain and entropy are addressed. Based on the proposed metric set, we can more fully understand a Web service-based system. In addition, it can also facilitate performance or defect analysis for this type of system.
Article
Methods for constructing simultaneous confidence intervals for all possible linear contrasts among several means of normally distributed variables have been given by Scheffé and Tukey. In this paper the possibility is considered of picking in advance a number (say m) of linear contrasts among k means, and then estimating these m linear contrasts by confidence intervals based on a Student t statistic, in such a way that the overall confidence level for the m intervals is greater than or equal to a preassigned value. It is found that for some values of k, and for m not too large, intervals obtained in this way are shorter than those using the F distribution or the Studentized range. When this is so, the experimenter may be willing to select the linear combinations in advance which he wishes to estimate in order to have m shorter intervals instead of an infinite number of longer intervals.
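The guarantee described above follows from the Bonferroni inequality. The notation below is assumed for illustration and is not taken from the abstract: $\psi_j$ is the $j$-th preselected contrast, $I_j$ its interval, $\nu$ the error degrees of freedom, and $1-\alpha$ the preassigned overall level.

```latex
% If each interval I_j misses its contrast with probability \alpha/m,
% the union bound gives the overall coverage:
P\Bigl(\bigcap_{j=1}^{m} \{\psi_j \in I_j\}\Bigr)
  \;\ge\; 1 - \sum_{j=1}^{m} \frac{\alpha}{m}
  \;=\; 1 - \alpha,
\qquad
I_j \;=\; \hat{\psi}_j \,\pm\, t_{\nu,\,1-\alpha/(2m)}\;\hat{\sigma}_{\hat{\psi}_j}.
```

Because the critical value grows only slowly with $m$, these intervals can be shorter than intervals that cover all possible contrasts when $m$ is small, which is the trade-off the abstract reports.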
Conference Paper
Pseudo-feedback-based automatic query expansion yields effective retrieval performance on average, but results in performance inferior to that of using the original query for many information needs. We address an important cause of this robustness issue, namely, the query drift problem, by fusing the results retrieved in response to the original query and to its expanded form. Our approach posts performance that is significantly better than that of retrieval based only on the original query and more robust than that of retrieval using the expanded query.
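A rough sketch of fusing two result lists, one from the original query and one from its expanded form. The abstract does not name a fusion rule, so the min-max normalization and score summation (a CombSUM-style rule) used here are stand-in assumptions, as are the function names and the toy scores.

```python
def min_max_normalize(scores):
    """Rescale retrieval scores to [0, 1] so two lists are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_comb_sum(original_scores, expanded_scores):
    """Rank documents by the sum of their normalized scores in both lists.

    A document missing from one list contributes 0 for that list, so
    documents retrieved by both queries are favoured.
    """
    a = min_max_normalize(original_scores)
    b = min_max_normalize(expanded_scores)
    fused = {doc: a.get(doc, 0.0) + b.get(doc, 0.0) for doc in set(a) | set(b)}
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical scores for the original and the expanded query.
original = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
expanded = {"d2": 5.0, "d4": 4.0, "d1": 3.0}
ranking = fuse_comb_sum(original, expanded)
```

Here `d4`, which only the expanded query retrieves, cannot overtake documents supported by both lists, which illustrates how fusion can dampen query drift.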
Conference Paper
Uncertainty management at the core of data integration was motivated by new approaches to data management, such as dataspaces [2], and the use of fully automatic schema matching takes an increasingly prominent role in this field. Recent works suggested the use, in parallel, of several alternative schema matchings as an uncertainty management tool [3,1]. We offer in this work OntoMatcher, an extension of the OntoBuilder [4] schema matching tool to support the management of multiple (top-K) schema matching alternatives.
Article
This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.
Article
Although Pseudo-Relevance Feedback (PRF) is a widely used technique for enhancing average retrieval performance, it may actually hurt performance for around one-third of a given set of topics. To enhance the reliability of PRF, Flexible PRF has been proposed, which adjusts the number of pseudo-relevant documents and/or the number of expansion terms for each topic. This paper explores a new, inexpensive Flexible PRF method, called Selective Sampling, which is unique in that it can skip documents in the initial ranked output to look for more “novel” pseudo-relevant documents. While Selective Sampling is only comparable to Traditional PRF in terms of average performance and reliability, per-topic analyses show that Selective Sampling outperforms Traditional PRF almost as often as Traditional PRF outperforms Selective Sampling. Thus, treating the top P documents as relevant is often not the best strategy. However, predicting when Selective Sampling outperforms Traditional PRF appears to be as difficult as predicting when a PRF method fails. For example, our per-topic analyses show that even the proportion of truly relevant documents in the pseudo-relevant set is not necessarily a good performance predictor.
Article
We discuss the following problem: given a random sample $X = (X_1, X_2, \ldots, X_n)$ from an unknown probability distribution $F$, estimate the sampling distribution of some prespecified random variable $R(X, F)$, on the basis of the observed data $x$. (Standard jackknife theory gives an approximate mean and variance in the case $R(X, F) = \theta(\hat{F}) - \theta(F)$, $\theta$ some parameter of interest.) A general method, called the “bootstrap”, is introduced, and shown to work satisfactorily on a variety of estimation problems. The jackknife is shown to be a linear approximation method for the bootstrap. The exposition proceeds by a series of examples: variance of the sample median, error rates in a linear discriminant analysis, ratio estimation, estimating regression parameters, etc.
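As an illustration of the first example mentioned above (variability of the sample median), here is a minimal nonparametric bootstrap sketch. The function name, the number of resamples, the seed, and the data are assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_std_error(sample, statistic, n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling.

    Each replicate draws len(sample) observations from the sample with
    replacement and recomputes the statistic; the spread of the
    replicates approximates the statistic's sampling variability.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    replicates = [
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    ]
    return statistics.stdev(replicates)


# Hypothetical observations.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.0]
se_median = bootstrap_std_error(data, statistics.median)
```

The same function works for any statistic that maps a list of observations to a number, for example `statistics.mean` or a trimmed mean.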
Article
Self-regulation is a complex process that involves consumers’ persistence, strength, motivation, and commitment in order to be able to override short-term impulses. In order to pursue their long-term goals, consumers typically need to forgo immediate pleasurable experiences that are detrimental to reaching their overarching goals. Although this sometimes involves resisting simple and small temptations, it is not always easy, since the lure of momentary temptations is pervasive. In addition, consumers’ beliefs play an important role in determining the strategies and behaviors that consumers consider acceptable to engage in, affecting how they act and plan actions to attain their goals. This dissertation investigates the adequacy of some beliefs typically shared by consumers about the appropriate behaviors to exert self-regulation, analyzing to what extent these indeed contribute to the enhancement of consumers’ ability to exert self-regulation.
Conference Paper
Schema matching is the problem of identifying corresponding elements in different schemas. Discovering these correspondences or matches is inherently difficult to automate. Past solutions have proposed a principled combination of multiple algorithms. However, these solutions sometimes perform rather poorly due to the lack of sufficient evidence in the schemas being matched. In this paper we show how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched, so they can be matched better. Such a corpus typically contains multiple schemas that model similar concepts and hence enables us to learn variations in the elements and their properties. We exploit such a corpus in two ways. First, we increase the evidence about each element being matched by including evidence from similar elements in the corpus. Second, we learn statistics about elements and their relationships and use them to infer constraints that we use to prune candidate mappings. We also describe how to use known mappings to learn the importance of domain and generic constraints. We present experimental results that demonstrate corpus-based matching outperforms direct matching (without the benefit of a corpus) in multiple domains.
Z. Bellahsène, A. Bonifati, and E. Rahm. Schema Matching and Mapping. Data-Centric Systems and Applications. Springer, 2011.
X. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. VLDB Journal, 18:469-500, 2009.
D. Ritze and C. Bizer. Matching web tables to dbpedia-a feature utility study. context, 42(41):19, 2017.