Article

Stitching web tables for improving matching quality

Authors: Oliver Lehmberg, Christian Bizer

Abstract

HTML tables on web pages ("web tables") cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the high heterogeneity and the small size of the tables. Though it is known that the majority of web tables are very small, the gold standards that are used to compare web table matching systems mostly consist of larger tables. In this experimental paper, we evaluate T2K Match, a web table to knowledge base matching system, and COMA, a standard schema matching tool, using a sample of web tables that is more realistic than the gold standards that were previously used. We find that both systems fail to produce correct results for many of the very small tables in the sample. As a remedy, we propose to stitch (combine) the tables from each web site into larger ones and match these enlarged tables to the knowledge base or base table afterwards. For this stitching process, we evaluate different schema matching methods in combination with holistic correspondence refinement. Limiting the stitching procedure to web tables from the same web site decreases the heterogeneity and allows us to stitch tables with very high precision. Our experiments show that applying table stitching before running the actual matching method improves the matching results by 0.38 in F1-measure for T2K Match and by 0.14 for COMA. Also, stitching the tables allows us to reduce the number of tables in our corpus from 5 million original web tables to as few as 100,000 stitched tables.
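
The following sketch illustrates the stitching idea from the abstract on toy data: group the web tables of one web site, align their columns with a simple schema-based signal (matching headers) plus an instance-based signal (value overlap), and union the aligned tables into one larger table that is then matched as a whole. The table layout, the overlap threshold, and the two similarity signals are simplifying assumptions; they stand in for the schema matching methods and holistic correspondence refinement evaluated in the paper.

```python
def value_overlap(col_a, col_b):
    """Jaccard overlap of the distinct values of two columns."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def stitch_site_tables(tables, overlap_threshold=0.3):
    """tables: list of {header -> list of cell values}, all from one web site.
    Returns (schema, rows) of a single stitched table."""
    schema, rows = [], []                                   # rows: list of {header -> value}
    for table in tables:
        mapping = {}
        for header, values in table.items():
            # schema-based signal: same header (case-insensitive)
            target = next((h for h in schema if h.lower() == header.lower()), None)
            # instance-based signal: value overlap with an already stitched column
            if target is None and schema:
                score, best = max((value_overlap(values, [r.get(h) for r in rows]), h)
                                  for h in schema)
                target = best if score >= overlap_threshold else None
            target = target or header                       # otherwise: new stitched column
            if target not in schema:
                schema.append(target)
            mapping[header] = target
        height = max((len(v) for v in table.values()), default=0)
        for i in range(height):
            rows.append({mapping[h]: v[i] for h, v in table.items() if i < len(v)})
    return schema, rows

t1 = {"State": ["Bavaria", "Hesse"], "Capital": ["Munich", "Wiesbaden"]}
t2 = {"state": ["Bavaria"], "Capital city": ["Munich"]}     # same site, slightly different schema
print(stitch_site_tables([t1, t2]))
# (['State', 'Capital'], [... three rows, all mapped onto the same two stitched columns ...])
```

Restricting the input to tables from a single web site is what keeps even such a simple alignment precise, which is the core observation of the paper.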

... The full results are reported on the challenge websites. Table 6 depicts the CEA results, in F1 score and precision, of MTab4D compared to the other systems on the original version of SemTab 2019. Because of the high data inconsistencies in Rounds 1 and 2, MTab4D could not provide results comparable to the original MTab system. ...
... The tabular data annotation tasks can be categorized into structural and semantic annotation. Structural annotation covers table type prediction [17], datatype prediction, table header annotation, subject column prediction, and holistic matching across tables [10]. In SemTab 2019, most tables are of the horizontal relational type; headers are located in the first row of the tables, and the subject column is the first table column. ...
... - MTab4D assumes that the input table is of the horizontal relational type, as in Assumption 2. To make MTab4D work for other table types, e.g., vertical relational, we need to perform further preprocessing steps to identify table types and transform the table data into horizontal relational form. - Many tables could have a shared schema, e.g., tables on the Web could be split across many web pages; therefore, we can expect improved matching performance by stitching tables from the same web page (or domain) [10,21]. ...
Article
Full-text available
Semantic annotation of tabular data is the process of matching table elements with knowledge graphs. As a result, the table contents could be interpreted or inferred using knowledge graph concepts, enabling them to be useful in downstream applications such as data analytics and management. Nevertheless, semantic annotation tasks are challenging due to insufficient tabular data descriptions, heterogeneous schema, and vocabulary issues. This paper presents an automatic semantic annotation system for tabular data, called MTab4D, to generate annotations with DBpedia in three annotation tasks: 1) matching table cells to entities, 2) matching columns to entity types, and 3) matching pairs of columns to properties. In particular, we propose an annotation pipeline that combines multiple matching signals from different table elements to address schema heterogeneity, data ambiguity, and noisiness. Additionally, this paper provides insightful analysis and extra resources on benchmarking semantic annotation with knowledge graphs. Experimental results on the original and adapted datasets of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) show that our system achieves an impressive performance for the three annotation tasks. MTab4D’s repository is publicly available at https://github.com/phucty/mtab4dbpedia.
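
As a rough illustration of how cell-level and column-level signals can be combined (a toy sketch, not MTab4D's actual probabilistic aggregation), the snippet below links cells to entities of a small in-memory knowledge graph by string similarity (CEA) and then lets the linked entities vote for the column type (CTA); the toy KG, the similarity measure, and the 0.7 threshold are assumptions.

```python
from difflib import SequenceMatcher
from collections import Counter

KG = {  # toy stand-in for DBpedia: entity -> set of types
    "Munich": {"City"}, "Berlin": {"City"}, "Bavaria": {"State"},
}

def link_cell(cell):
    """CEA: map a cell string to the best-matching entity, or None to abstain."""
    score, entity = max((SequenceMatcher(None, cell.lower(), e.lower()).ratio(), e)
                        for e in KG)
    return entity if score > 0.7 else None

def column_type(cells):
    """CTA: let the entities linked to the cells vote for the column's type."""
    votes = Counter()
    for cell in cells:
        entity = link_cell(cell)
        if entity:
            votes.update(KG[entity])
    return votes.most_common(1)[0][0] if votes else None

print(column_type(["Munchen", "Berlin"]))   # 'City' (the misspelled cell is still linked)
```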
... The ultimate goal of dataset discovery is to augment a dataset with information previously unknown to the user. There are many flavors of dataset discovery: i) searching for tables that can be joined [1], [2], [6], ii) augmenting a given table with more data entries or extra attributes [3]- [5], [9], frequently for improving the accuracy of machine learning models [10], [11], and iii) finding similar tables to a given one using different similarity measures [7], [8]. ...
... Table I (schema matching techniques implemented in Valentine and the match types they cover, where each match type is marked with the discovery methods requiring it): the match types are Attribute Overlap [3], [6], [9]; Value Overlap [1], [3], [6]-[9], [11]; Semantic Overlap [2], [8]; Data Type [7]; Distribution [7], [9]; and Embeddings [7]-[9]. The implemented matchers are Cupid [14], Similarity Flooding [15], COMA [16], Distribution-based [17], SemProp [18], EmbDI [19], and a Jaccard-Levenshtein baseline. ...
Preprint
Full-text available
Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.
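
A minimal matcher in the spirit of the Jaccard-Levenshtein baseline listed in Table I above: every source/target column pair is scored by combining Levenshtein similarity of the column names with Jaccard overlap of the column values. The equal weighting and the 0.5 threshold are illustrative assumptions.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def name_similarity(a, b):
    return 1 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)

def value_similarity(va, vb):
    sa, sb = set(va), set(vb)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def match(source, target, threshold=0.5):
    """source, target: {column name -> list of values}. Returns ranked matches."""
    pairs = []
    for sc, sv in source.items():
        for tc, tv in target.items():
            score = 0.5 * name_similarity(sc, tc) + 0.5 * value_similarity(sv, tv)
            if score >= threshold:
                pairs.append((round(score, 3), sc, tc))
    return sorted(pairs, reverse=True)

source = {"country": ["Germany", "France"], "capital": ["Berlin", "Paris"]}
target = {"Country": ["Germany", "Spain"], "city": ["Berlin", "Madrid"]}
print(match(source, target))   # [(0.667, 'country', 'Country')]
```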
... It is considered as one of the main limitations of linking tabular mentions to DBpedia. To overcome this, Lehmberg and Bizer [39] stitch tables, i.e., merge tables from the same website into a single large table, to improve entity linking performance. (A survey table of relation extraction approaches follows, listing knowledge base, technique, and relation scope: [48] DBpedia, lookup-based, any pair of entities in the same row; Mulwad et al. [49] DBpedia, semantic passing, any pair of columns; Mulwad et al. [50] DBpedia, utilizing CTI and EL, any pair of columns; Sekhavat et al. [63] YAGO/PATTY, probabilistic, any pair of entities in the same row; Venetis et al. [68] IS-A database, frequency-based, core + attribute columns; Zhang [87] Freebase, optimization, any pair of columns.) ...
... The T2K method utilizes the T2D dataset to execute iterative steps between candidate matching and property matching, to find proper entities/schemas in DBpedia for table rows/columns. However, T2D mainly focuses on large tables and does not work that well for small-sized tables [39]. To counter this problem, Lehmberg and Bizer [39] propose to combine tables from each website into larger tables for table matching, building on the intuition that tables from the same website are created in a similar fashion. ...
Article
Full-text available
Tables are powerful and popular tools for organizing and manipulating data. A vast number of tables can be found on the Web, which represent a valuable knowledge resource. The objective of this survey is to synthesize and present two decades of research on web tables. In particular, we organize existing literature into six main categories of information access tasks: table extraction, table interpretation, table search, question answering, knowledge base augmentation, and table augmentation. For each of these tasks, we identify and describe seminal approaches, present relevant resources, and point out interdependencies among the different tasks.
... It is considered as one of the main limitations of methods matching tables to DBpedia. To overcome this, Lehmberg and Bizer [2017] stitch tables, i.e., merge tables from the same website into a single large table, in order to improve the matching quality. ...
... The T2K method utilizes the T2D dataset to execute iterative steps between candidate matching and property matching, to find proper entities/schemas in DBpedia for table rows/columns. However, T2D mainly focuses on large tables and does not work that well for small-sized tables [Lehmberg and Bizer, 2017]. To counter this problem, Lehmberg and Bizer [2017] propose to combine tables from each website into larger tables for table matching, building on the intuition that tables from the same website are created in a similar fashion. ...
Thesis
Full-text available
PhD Thesis
... It is considered as one of the main limitations of linking tabular mentions to DBpedia. To overcome this, Lehmberg and Bizer [38] stitch tables, i.e., merge tables from the same website into a single large table, in order to improve entity linking performance. ...
... The T2K method utilizes the T2D dataset to execute iterative steps between candidate matching and property matching, to find proper entities/schemas in DBpedia for table rows/columns. However, T2D mainly focuses on large tables and does not work that well for small-sized tables [38]. To counter this problem, Lehmberg and Bizer [38] propose to combine tables from each website into larger tables for table matching, building on the intuition that tables from the same website are created in a similar fashion. ...
Preprint
Full-text available
Tables are a powerful and popular tool for organizing and manipulating data. A vast number of tables can be found on the Web, which represents a valuable knowledge resource. The objective of this survey is to synthesize and present two decades of research on web tables. In particular, we organize existing literature into six main categories of information access tasks: table extraction, table interpretation, table search, question answering, knowledge base augmentation, and table augmentation. For each of these tasks, we identify and describe seminal approaches, present relevant resources, and point out interdependencies among the different tasks.
... Alternatively, a dataset search may involve a set of related observations that are organized for a particular purpose by the searcher themselves. This pattern of behavior is particularly marked in data lakes [62,156], data markets [13,70], and tabular search [114,198]. Example 2 illustrates this kind of data search. ...
... In the context of constructive dataset search, the Mannheim Search Join Engine [114,115] and WikiTables [20] use a table similarity approach for table extension but also look at the unconstrained task. In both cases, a similarity ranking between the input and augmentation tables is used to decide which columns should be added. ...
... These are attributes independent of a dataset's content or topical relevance which can influence whether a user is actually able to engage with the dataset. (Table entry for tabular search: [63,196], [114,115,190].) ...
Article
Full-text available
Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts to data marketplaces, open data portals and data communities. Google recently beta-released a search service for datasets, which allows users to discover data stored in various online repositories via keyword queries. These developments foreshadow an emerging research field around dataset search or retrieval that broadly encompasses frameworks, methods and tools that help match a user data need against a collection of datasets. Here, we survey the state of the art of research and commercial systems and discuss what makes dataset search a field in its own right, with unique challenges and open questions. We look at approaches and implementations from related areas dataset search is drawing upon, including information retrieval, databases, entity-centric and tabular search in order to identify possible paths to tackle these questions as well as immediate next steps that will take the field forward.
... For the analysis in this paper, we combine established methods for web table extraction [9,18], web table understanding [16,25], and functional dependency discovery [11,15]. A corpus of 5 million web tables is matched to the DBpedia knowledge base to determine the main topic (knowledge base class) of each web table and align the columns with existing properties. ...
... Further, all web tables from the same web site are extended with context attributes, which are extracted from the web page surrounding the web table, and matched among each other. Using the matching, the individual web tables are stitched (combined) [16] into larger relations. We then apply functional dependency discovery [11,15] on these relations and determine a primary key for each relation [23] which is guaranteed to contain the subject column. ...
... This additional information is (step 1 in Figure 1) extracted from the pages into context attributes (here: "Page Title" and "URL 1"). These context attributes, in combination with the stitched [16] version (step 2) of the original web tables, enable the discovery of the key {State, Page Title, URL 1} (step 3), which allows downstream applications and end-users to correctly understand the meaning of the wage values. ...
Conference Paper
The Web contains millions of relational HTML tables, which cover a multitude of different, often very specific topics. This rich pool of data has motivated a growing body of research on methods that use web table data to extend local tables with additional attributes or add missing facts to knowledge bases. Nearly all existing approaches for these tasks build upon the assumption that web table data consists of binary relations, meaning that an attribute value depends on a single key attribute, and that the key attribute value is contained in the HTML table. Inspecting randomly chosen tables on the Web, however, quickly reveals that both assumptions are wrong for a large fraction of the tables. In order to better understand the potential of non-binary web table data for downstream applications, this paper analyses a corpus of 5 million web tables originating from 80 thousand different web sites with respect to how many web table attributes are non-binary, what composite keys are required to correctly interpret the semantics of the non-binary attributes, and whether the values of these keys are found in the table itself or need to be extracted from the page surrounding the table. The profiling of the corpus shows that at least 38% of the relations are non-binary. Recognizing these relations requires information from the title or the URL of the web page in 50% of the cases. We find that different websites use keys of varying length for the same dependent attribute, e.g. one cluster of websites presents employment numbers depending on time, another cluster presents them depending on time and profession. By identifying these clusters, we lay the foundation for selecting Web data sources according to the specificity of the keys that are used to determine specific attributes.
... In this paper, we present a method to synthesize n-ary relations from web tables for the use case of knowledge base extension. The method is able to exploit information from the page around the table and stitches [28] multiple tables from the same website together in order to enable the data-driven discovery of functional dependencies to identify n-ary relations. Our method builds on approaches for web table understanding [28,34,41], holistic schema matching [5,23,24], functional dependency discovery [17,25], and normalisation [38]. ...
... The method is able to exploit information from the page around the table and stitches [28] multiple tables from the same website together in order to enable the data-driven discovery of functional dependencies to identify n-ary relations. Our method builds on approaches for web table understanding [28,34,41], holistic schema matching [5,23,24], functional dependency discovery [17,25], and normalisation [38]. Ultimately, it produces an integrated schema for each web site, which extends the schema of the knowledge base with the n-ary relations that can be populated from the web tables. ...
... This process is based on the assumption that all web tables that originate from one web site were generated by queries to the same database, which allows us to reconstruct a schema from which these web tables could have been generated. This has been shown to be reasonable for web tables from the same web site [28], hence, the method processes all web tables from a single web site at a time. Our implementation of the SNoW method is based on the WInte.r ...
Conference Paper
The Web contains a large number of relational HTML tables, which cover a multitude of different, often very specific topics. This rich pool of data has motivated a growing body of research on methods that use web table data to extend local tables with additional attributes or add missing facts to knowledge bases. Nearly all existing approaches for these tasks are limited to the extraction of binary relations from web tables, e.g. an unemployment number may only depend on the state. Inspecting randomly chosen tables on the Web quickly reveals that many relations in the tables are non-binary, e.g. unemployment numbers also depend on the point in time and the profession. Treating such n-ary relations as binary leads to data that cannot be interpreted correctly. The extraction of n-ary relations from web tables is complicated by two factors: 1. important attributes might be stated outside of the table; 2. relational web tables are usually too small for functional dependency discovery. This paper presents a method to synthesize n-ary relations from web tables for the use case of knowledge base extension. The method exploits information from the page around the table and stitches (combines) multiple tables from the same website. We apply the method to a corpus of 5 million web tables originating from 80 thousand different web sites and find that 38% of the synthesized relations are non-binary. We find different relations for the same dependent attribute, e.g. relations providing unemployment numbers based on time, location, or profession. By identifying groups of websites which provide these relations, we lay the foundation for applications in knowledge base augmentation and data search, which allow for a specific selection of relations that determine an attribute according to the applications' data requirements.
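
The binary / non-binary distinction discussed above boils down to a functional dependency test: an attribute is binary with respect to a key column if that single column already determines it, and non-binary if a composite key (possibly including context attributes from the surrounding page) is needed. A minimal check on toy rows, mirroring the unemployment example from the abstract; the row layout is an assumption for illustration.

```python
def fd_holds(rows, determinant, dependent):
    """rows: list of dicts. True iff `determinant` (tuple of columns) -> dependent."""
    seen = {}
    for row in rows:
        key = tuple(row.get(c) for c in determinant)
        if key in seen and seen[key] != row.get(dependent):
            return False
        seen[key] = row.get(dependent)
    return True

rows = [
    {"State": "Bavaria", "Year": 2015, "Unemployment": 3.6},
    {"State": "Bavaria", "Year": 2016, "Unemployment": 3.5},
    {"State": "Hesse",   "Year": 2015, "Unemployment": 5.4},
]
print(fd_holds(rows, ("State",), "Unemployment"))          # False: 'State' alone is not enough (non-binary)
print(fd_holds(rows, ("State", "Year"), "Unemployment"))   # True: the composite key {State, Year} determines it
```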
... The existing literature widely recognizes the importance of discovering related tables on the Web or in data lakes for enriching the conveyed information. Many approaches have therefore been proposed for the efficient detection of unionable tables [15,41,50], joinable tables [18,27,30,63,64], or more generally related tables [13,19,62], also providing insights about their subsequent integration [39]. Furthermore, representation learning has started to be used also for tables [4,25,26,56,60], and several large table corpora [29,35,42,59] have been generated over the years and are widely adopted in these scenarios. ...
... Table Discovery. As stated in Section 1, a plethora of algorithms have been designed for the efficient discovery of unionable [15,41,50], joinable [18,23,27,61,64], and related tables [13,19,62]. These algorithms cannot compute the actual value of the largest overlap, but in some cases they can produce an upper bound for it; thus, they can be exploited to scale to large table corpora, passing to Sloth only the most promising pairs. ...
Article
Full-text available
Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.
... Existing solutions. In the data management research, the problem of finding relationships among datasets has been investigated in three different contexts (more details in Section 3): i) schema matching, with a multitude of automated methods [9,21,30,33,43,52]; ii) related-dataset search [5,6,17,19,27,41,53,54]; and iii) column-type detection [26,50]. In short, traditional schema matching methods are a) computationally and resource expensive; b) they cannot always be employed in the setting of data silos as they require colocating all datasets to calculate similarities; c) they do not leverage existing knowledge within silos. ...
... Scalability Issues. Most importantly, applying schema matching solutions [9,21,30,33,43,52] requires, in the worst case, computation of similarities between all pairs of columns. As the number of columns n increases beyond the thousands (a small number considering the size of data lakes and commercial databases), computing O(n²) similarities is impractical. ...
Preprint
Full-text available
Virtually every sizable organization nowadays is building a form of a data lake. In theory, every department or team in the organization would enrich their datasets with metadata, and store them in a central data lake. Those datasets can then be combined in different ways and produce added value to the organization. In practice, though, the situation is vastly different: each department has its own privacy policies, data release procedures, and goals. As a result, each department maintains its own data lake, leading to data silos. For such data silos to be of any use, they need to be integrated. This paper presents SiMa, a method for federating data silos that consistently finds more correct relationships than the state-of-the-art matching methods, while minimizing wrong predictions and requiring 20x to 1000x less time to execute. SiMa leverages Graph Neural Networks (GNNs) to learn from the existing column relationships and automated data profiles found in data silos. Our method makes use of the trained GNN to perform link prediction and find new column relationships across data silos. Most importantly, SiMa can be trained incrementally on the column relationships within each silo individually, and does not require consolidating all datasets into one place.
... MENTOR [2] leverages patterns occurring in headers of Wikipedia tables to consistently discover DBpedia relations. Lehmberg et al. [15] tackle the problem of small web tables with table stitching, i.e., they combine several small tables with a similar context (e.g., same page or domain and matching schema) to a larger one, making it easier to extract facts from it. ...
... But we identify two other influencing factors: Wikipedia has very specific guidelines for editing species, especially with regard to standardization and formatting rules. In addition to that, the genus relation is functional and hence trivially fulfills the PCA. As our approach relies strongly on this assumption, and it potentially inhibits the mining of practical rules for non-functional predicates (like, for example, for artist), we plan on investigating this relationship further. ...
Preprint
Full-text available
Knowledge about entities and their interrelations is a crucial factor of success for tasks like question answering or text summarization. Publicly available knowledge graphs like Wikidata or DBpedia are, however, far from being complete. In this paper, we explore how information extracted from similar entities that co-occur in structures like tables or lists can help to increase the coverage of such knowledge graphs. In contrast to existing approaches, we do not focus on relationships within a listing (e.g., between two entities in a table row) but on the relationship between a listing's subject entities and the context of the listing. To that end, we propose a descriptive rule mining approach that uses distant supervision to derive rules for these relationships based on a listing's context. Extracted from a suitable data corpus, the rules can be used to extend a knowledge graph with novel entities and assertions. In our experiments we demonstrate that the approach is able to extract up to 3M novel entities and 30M additional assertions from listings in Wikipedia. We find that the extracted information is of high quality and thus suitable to extend Wikipedia-based knowledge graphs like DBpedia, YAGO, and CaLiGraph. For the case of DBpedia, this would result in an increase of covered entities by roughly 50%.
... The structure annotation contains many sub-tasks, such as table type prediction [23], data type prediction, table header annotation, core attribute prediction, and holistic matching across tables [14]. In this paper, we assume that the input tables are of the vertical relational type and that they are independent of each other. ...
... For example, tables from the same Web domain, or tables in data resources across different versions of Open Data Portals, may share a schema. Stitching those tables also improves the performance of entity annotation [14]. ...
Preprint
Full-text available
In the Open Data era, a large number of table resources have been made available on the Web and data portals. However, it is difficult to directly utilize such data due to the ambiguity of entities, name variations, heterogeneous schema, and missing or incomplete metadata. To address these issues, we propose a novel approach, namely TabEAno, to semantically annotate table rows toward knowledge graph entities. Specifically, we introduce a "two-cells" lookup strategy based on the assumption that there is an existing logical relation in the knowledge graph between two close cells in the same row of the table. Despite the simplicity of the approach, TabEAno outperforms the state-of-the-art approaches on the two standard datasets, T2D and Limaye, as well as on the large-scale Wikipedia tables dataset.
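
An illustrative reading of the "two-cells" lookup described above: for a pair of cells in the same row, look for a knowledge-graph relation that connects an entity candidate of one cell with an entity candidate of the other; a hit disambiguates both cells at once and suggests the relation between their columns. The in-memory triple list and the exact-label candidate lookup are simplifying assumptions.

```python
TRIPLES = [
    ("Munich", "capitalOf", "Bavaria"),
    ("Dresden", "capitalOf", "Saxony"),
]

def candidates(cell):
    # stand-in for a label lookup service: exact label match against the KG
    labels = {s for s, _, _ in TRIPLES} | {o for _, _, o in TRIPLES}
    return {cell} & labels

def two_cell_lookup(cell_a, cell_b):
    hits = [(s, p, o) for s, p, o in TRIPLES
            if s in candidates(cell_a) and o in candidates(cell_b)]
    return hits   # each hit links both cells and names the relation between their columns

print(two_cell_lookup("Munich", "Bavaria"))   # [('Munich', 'capitalOf', 'Bavaria')]
```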
... normalization [12,13,34], but FDs also support query optimization [22,35], data integration [23,28], and data translation [8,10] activities. Furthermore, when used as integrity constraints, FDs improve data cleaning applications to detect and resolve data inconsistencies [9,20,25]. ...
Article
Full-text available
Functional dependencies (FDs) are among the most important integrity constraints in databases. They serve to normalize datasets and thus resolve redundancies, they contribute to query optimization, and they are frequently used to guide data cleaning efforts. Because the FDs of a particular dataset are usually unknown, automatic profiling algorithms are needed to discover them. These algorithms have made considerable advances in the past few years, but they still require a significant amount of time and memory to process datasets of practically relevant sizes. We present FDHits, a novel FD discovery algorithm that finds all valid, minimal FDs in a given relational dataset. FDHits is based on several discovery optimizations that include a hybrid validation approach, effective hitting set enumeration techniques, one-pass candidate validations, and parallelization. Our experiments show that FDHits, even without parallel execution, has a median speedup of 8.1 compared to state-of-the-art FD discovery algorithms while using significantly less memory. This allows the discovery of all FDs even on datasets that could not be processed by the current state-of-the-art.
... This operation can be defined for both single-column (i.e., unary) and multi-column (i.e., n-ary or composite) join keys; ii) union search [7,9,17,23,25,30], where the user aims to increase the number of rows in the input dataset by finding tables that contain complementing information in a similar schema to augment it vertically; and iii) correlated feature discovery [5,12,15,32], where the objective is to discover tables based on their potential to increase the accuracy of a downstream ML model. ...
Article
Full-text available
Data discovery is an iterative and incremental process that necessitates the execution of multiple data discovery queries to identify the desired tables from large and diverse data lakes. Current methodologies concentrate on single discovery tasks such as join, correlation, or union discovery. However, in practice, a series of these approaches and their corresponding index structures are necessary to enable the user to discover the desired tables. This paper presents Blend, a comprehensive data discovery system that empowers users to develop ad-hoc discovery tasks without the need to develop new algorithms or build a new index structure. To achieve this goal, we introduce a general index structure capable of addressing multiple discovery queries. We develop a set of lower-level operators that serve as the fundamental building blocks for more complex and sophisticated user tasks. These operators are highly efficient and enable end-to-end efficiency. To enhance the execution of the discovery pipeline, we rewrite the search queries into optimized SQL statements to push the data operators down to the database. We demonstrate that our holistic system is able to achieve comparable effectiveness and runtime efficiency to the individual state-of-the-art approaches specifically designed for a single task.
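
A hedged sketch of the "push the operators down to the database" idea: keep a generic inverted index as a relational table and express one discovery primitive (rank columns by how many distinct query values they contain) as a plain SQL aggregation. The index schema and contents are illustrative assumptions, not Blend's actual design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inv (token TEXT, table_id TEXT, column_id INTEGER)")
con.executemany("INSERT INTO inv VALUES (?, ?, ?)", [
    ("bavaria", "t1", 0), ("hesse", "t1", 0), ("munich", "t1", 1),
    ("bavaria", "t2", 2), ("saxony", "t2", 2),
])

query_values = ["bavaria", "hesse", "saxony"]        # the key set we want to join/union on
placeholders = ",".join("?" * len(query_values))
rows = con.execute(f"""
    SELECT table_id, column_id, COUNT(DISTINCT token) AS overlap
    FROM inv WHERE token IN ({placeholders})
    GROUP BY table_id, column_id
    ORDER BY overlap DESC, table_id, column_id""", query_values).fetchall()
print(rows)   # [('t1', 0, 2), ('t2', 2, 2)]
```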
... In dataset discovery, it is crucial to find related tables in data lakes. The table union search task has recently received considerable attention (Ling et al., 2013; Lehmberg and Bizer, 2017; Khatiwada and Fan, 2023). The initial attempt was made by Nargesian et al. (2018). ...
... Remarks on future directions. Data lake exploration could benefit greatly by taking into account recent results on web table exploration [18], [84], [113], data wrangling [51], [143], or external applications upon the data lake. With the existing works mainly focusing on evaluating the accuracy of similarity computation, or the performance (query processing time), deep analysis and further improvement on the accuracy and completeness of the top-k result set are still rare. ...
Article
Full-text available
Data lakes are becoming increasingly prevalent for big data management and data analytics. In contrast to traditional ‘schema-on-write’ approaches such as data warehouses, data lakes are repositories storing raw data in its original formats and providing a common access interface. Despite the strong interest raised from both academia and industry, there is a large body of ambiguity regarding the definition, functions and available technologies for data lakes. A complete, coherent picture of data lake challenges and solutions is still missing. This survey reviews the development, architectures, and systems of data lakes. We provide a comprehensive overview of research questions for designing and building data lakes. We classify the existing approaches and systems based on their provided functions for data lakes, which makes this survey a useful technical reference for designing, implementing and deploying data lakes. We hope that the thorough comparison of existing solutions and the discussion of open research challenges in this survey will motivate the future development of data lake research and practice.
... Other prominent ideas are also proposed for entity linking on web tables, e.g., [Wu et al., 2016] considers multiple knowledge bases for entity linking, leveraging the "sameAs" relation to reduce errors and ensure good coverage; [Lehmberg and Bizer, 2017] merge tables from the same web page into a single large table to improve entity linking performance; etc. ...
... The table union search problem has been well explored recently. Ling et al. [31] and Lehmberg et al. [24] illustrated the importance of finding unionable tables. Nargesian et al. [39] proposed the first definition and comprehensive solution for the table union search problem. ...
Preprint
Full-text available
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical evaluation results on real table benchmark datasets show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
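
The unionability scoring described above can be illustrated with placeholder vectors standing in for Starmie's contrastively trained column embeddings: the column-level score is the cosine similarity of two embeddings, and the table-level score here is a simple greedy best-match average, only one of the aggregation choices the framework explores.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def table_unionability(query_cols, candidate_cols):
    """Both arguments: lists of column embedding vectors (np.ndarray)."""
    used, total = set(), 0.0
    for q in query_cols:
        scores = [(cosine(q, c), j) for j, c in enumerate(candidate_cols) if j not in used]
        if not scores:
            break
        best, j = max(scores)       # greedily pair each query column with its best unused match
        used.add(j)
        total += best
    return total / max(len(query_cols), 1)

rng = np.random.default_rng(0)
q = [rng.normal(size=8) for _ in range(3)]
c = [v + 0.05 * rng.normal(size=8) for v in q]   # near-duplicates of the query columns
print(round(table_unionability(q, c), 3))        # close to 1.0
```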
... Lehmberg et al. [20] use attribute labels and value overlap between attributes to determine table matching. They build on work from Ling et al. [24] that relies on web tables having identical or similar schemas and uses value overlap. ...
Preprint
Full-text available
Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationship between pairs of columns. The first uses an existing knowledge base (KB), the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating a synthesized KBs from data lakes with limited KB coverage and using them for union search.
... This work heavily relies on the identical schemas of the tables. Building on [44], Lehmberg and Bizer create union tables from stitched tables using schema-based and instance-based matching techniques [31]. ...
Article
Full-text available
Data nowadays are an extremely valuable resource. Data can come from different sources: it can originate from the government of a country, an organization, a company, or just a normal person. Furthermore, the content of data is varied: the data could be about primary education in the U.K., it could be about medical care in the U.S., or it could be about agriculture in Vietnam, etc. It is reasonable to assume that among those datasets, some datasets would be about the same topic. Moreover, those datasets could have the same structures, or at least similar structures. It is beneficial if we can union those datasets into more meaningful datasets: the unioned datasets would contain the collective information of the original datasets, and users and scientists would not have to spend a lot of time searching for and combining the datasets themselves. In this paper, we propose a data union method based on hierarchical clustering and Set Unionability for JSON-format data. Besides, we also perform some experiments to evaluate our method and prove its feasibility.
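
A sketch of the two ingredients named in the abstract, with simplified definitions: a set-unionability score between two flat datasets (here, the average best Jaccard match per attribute) and hierarchical clustering over the resulting distances to decide which datasets to union. The exact unionability definition and the clustering threshold are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def set_unionability(d1, d2):
    """d1, d2: dicts {attribute -> list of values} (flattened JSON records)."""
    if not d1 or not d2:
        return 0.0
    return sum(max(jaccard(v1, v2) for v2 in d2.values()) for v1 in d1.values()) / len(d1)

datasets = [
    {"name": ["anna", "ben"], "city": ["munich", "hesse"]},
    {"fullname": ["anna", "carl"], "town": ["munich", "dresden"]},
    {"species": ["oak", "pine"], "height": ["10", "12"]},
]
n = len(datasets)
dist = np.ones((n, n))
for i in range(n):
    dist[i, i] = 0.0
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = 1.0 - set_unionability(datasets[i], datasets[j])

labels = fcluster(linkage(squareform(dist), method="average"), t=0.7, criterion="distance")
print(labels)   # the first two datasets fall into one cluster, the third stands alone
```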
... However, domain-specific dataset search approaches are designed for searching a set of related observations organized for a particular domain by searchers. This pattern of behavior is particularly marked in data lakes [15,31], data markets [2,18], and tabular search [23]. ...
Chapter
Full-text available
Dataset repositories publish a significant number of datasets continuously within the context of a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from various disciplines that are unfamiliar to them. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support researchers with cross-domain dataset discovery based on their contents. However, as datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Accordingly, effective dataset indexing to increase Findability, Accessibility, Interoperability, and Reusability (FAIRness) based on such contextual information can be impossible. This paper presents an indexing pipeline to extend the contextual information of datasets based on their scientific domains by using topic modeling and a set of suggested rules and domain keywords (such as essential variables in environmental science) based on domain experts’ suggestions. The pipeline relies on an open ecosystem, where dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile such metadata, providing a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach on a manually created gold standard and a user study. Keywords: Dataset indexing, Dataset discovery, Inverted indexing, Metadata standard, Data repository.
... We need three steps to achieve table enrichment: table retrieval, table join, and ML evaluation. There are many related works on table retrieval [4,30] with different end tasks such as column and row extension [17,20,22,28], schema matching [16,18], table filling [26,27], and knowledge graph building [13,23]. ...
Preprint
Data scientists are constantly facing the problem of how to improve prediction accuracy with insufficient tabular data. We propose a table enrichment system that enriches a query table by adding external attributes (columns) from data lakes and improves the accuracy of machine learning predictive models. Our system has four stages, join row search, task-related table selection, row and column alignment, and feature selection and evaluation, to efficiently create an enriched table for a given query table and a specified machine learning task. We demonstrate our system with a web UI to show the use cases of table enrichment.
... DL exploration could benefit greatly by taking into account recent results on web table exploration [76,19,105], data wrangling [124,46], or external applications upon the data lake. With existing works mainly focusing on evaluating the accuracy of similarity computation or the performance (query processing time), deep analysis and further improvement of the accuracy and completeness of the top-k result set are still rare. ...
Preprint
Full-text available
Although big data has been discussed for some years, it still has many research challenges, especially the variety of data. It poses a huge difficulty to efficiently integrate, access, and query the large volume of diverse data in information silos with the traditional 'schema-on-write' approaches such as data warehouses. Data lakes have been proposed as a solution to this problem. They are repositories storing raw data in its original formats and providing a common access interface. This survey reviews the development, definition, and architectures of data lakes. We provide a comprehensive overview of research questions for designing and building data lakes. We classify the existing data lake systems based on their provided functions, which makes this survey a useful technical reference for designing, implementing and applying data lakes. We hope that the thorough comparison of existing solutions and the discussion of open research challenges in this survey would motivate the future development of data lake research and practice.
... Data lakes are usually seen as vast repositories of company, government or Web data (e.g., [11], [12]). Previous work has considered dataset discovery in the guise of table augmentation and stitching (e.g., [13], [14]), unionability discovery, or joinability discovery (e.g., [9], [10], [15], [16]). We add to this work with a focus on a notion of relatedness, defined in the next section, construed as unionability and/or joinability. ...
Preprint
Full-text available
Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
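
One way to picture the hash-based value features mentioned above (a simplified stand-in, not the system's actual feature set): summarize each column's value set with a MinHash signature, so that the fraction of agreeing signature positions estimates the Jaccard overlap and can serve as a value-based relatedness distance.

```python
import hashlib

def minhash(values, num_perm=64):
    """One salted hash per 'permutation'; keep the minimum per salt."""
    return [min(int(hashlib.sha1(f"{i}|{v}".encode()).hexdigest(), 16) for v in values)
            for i in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"bavaria", "hesse", "saxony", "berlin"}
b = {"bavaria", "hesse", "saxony", "bremen"}
print(estimated_jaccard(minhash(a), minhash(b)))   # roughly the true Jaccard of 0.6
```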
... - Assumption 3: In reality, some tables could have a shared schema. For example, tables on the Web could be split across many web pages; therefore, we can expect improved matching performance by stitching those tables from the same web page (or domain) [3], [7]. Hence, performing holistic matching could help improve MTab's performance. ...
Preprint
Full-text available
This paper presents the design of our system, namely MTab, for Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019). MTab combines the voting algorithm and the probability models to solve critical problems of the matching tasks. Results on SemTab 2019 show that MTab obtains promising performance for the three matching tasks.
... This work is not intended to scale to large sets. Lehmberg et al. [13] found that stitching small web tables can help match them with knowledge bases. Nargesian et al. [21] proposed techniques to search for unionable tables from data lakes. ...
Conference Paper
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as intersection between sets. Although set similarity search is well-studied in the field of approximate string search (e.g., fuzzy keyword search), the solutions are designed for and evaluated over sets of relatively small size (average set size rarely much over 100 and maximum set size in the low thousands) with modest dictionary sizes (the total number of distinct values in all sets is only a few million). We observe that modern data lakes typically have massive set sizes (with maximum set sizes that may be tens of millions) and dictionaries that include hundreds of millions of distinct values. Our new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets. We show that JOSIE completely outperforms the state-of-the-art overlap set similarity search techniques on data lakes. More surprisingly, we also consider a state-of-the-art approximate algorithm and show that our new exact search algorithm performs almost as well, and even in some cases better, on real data lakes.
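
For contrast with JOSIE's optimized search, here is the naive version of the problem it solves: given a query column as a set of values, probe an inverted index from value to column id and rank candidates by exact intersection size. JOSIE's contribution is to minimize the index probes and set reads that this brute-force sketch performs exhaustively; the toy columns below are assumptions.

```python
from collections import Counter, defaultdict

def build_index(columns):
    """columns: dict {column id -> set of values}."""
    index = defaultdict(set)
    for cid, values in columns.items():
        for v in values:
            index[v].add(cid)
    return index

def top_k_overlap(query, index, k=2):
    counts = Counter()
    for v in query:
        for cid in index.get(v, ()):
            counts[cid] += 1          # one shared distinct value
    return counts.most_common(k)

columns = {"lake/a.csv:city": {"munich", "dresden", "berlin"},
           "lake/b.csv:town": {"munich", "hamburg"},
           "lake/c.csv:name": {"paris", "lyon"}}
index = build_index(columns)
print(top_k_overlap({"munich", "berlin", "cologne"}, index))
# [('lake/a.csv:city', 2), ('lake/b.csv:town', 1)]
```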
... Further methods focused on named entities, by annotating table cells with entities and classes from the knowledge base [1,3,8,10,11]. Table data fusion for search and schema inference was studied in [7,13-15]. ...
Conference Paper
Web pages and other documents often contain tables to provide numerical details in a structured manner. Typically, the text explains and highlights important quantities, often using approximate numbers and aggregates such as totals, averages or ratios. For a human reader, it is crucial to navigate between text and tables to understand the key information in its context, drill down into tables when details are needed, and obtain explanations on specific figures from the accompanying text. In this demo, we present ExQuisiTe: a system to align quantity mentions in the text with related quantity mentions in tables, and enable extractive summarization that considers table contents. ExQuisiTe links quantity mentions in the text with quantities in tables, to support user-friendly exploration. ExQuisiTe handles exact single-cell references as well as rounded or truncated numbers and aggregations such as row or column totals.
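
A small sketch of the alignment task described above: match a quantity mentioned in the text to table cells while tolerating rounding, and also consider simple aggregates such as row and column totals. The relative tolerance and the purely numeric table layout are illustrative assumptions.

```python
def matches(mention, cell, rel_tol=0.05):
    """True if the mentioned quantity equals the cell up to a relative tolerance."""
    return (cell != 0 and abs(mention - cell) / abs(cell) <= rel_tol) or mention == cell

def align(mention, table):
    """table: list of rows (lists of numbers). Returns candidate (kind, position) pairs."""
    hits = []
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            if matches(mention, cell):
                hits.append(("cell", (i, j)))
        if matches(mention, sum(row)):
            hits.append(("row total", i))
    for j in range(len(table[0])):
        if matches(mention, sum(row[j] for row in table)):
            hits.append(("column total", j))
    return hits

table = [[120, 80], [230, 95]]
print(align(350, table))      # [('column total', 0)]  -> 350 = 120 + 230
print(align(200, table))      # [('row total', 0)]     -> 200 = 120 + 80
```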
Article
Matching tabular data to a knowledge graph (KG) is critical for understanding the semantic column types, column relationships, and entities of a table. Existing matching approaches rely heavily on core columns that represent primary subject entities on which other columns in the table depend. However, discovering these core columns before understanding the table’s semantics is challenging. Most prior works use heuristic rules, such as the leftmost column, to discover a single core column, while an insightful discovery of the core column set that accurately captures the dependencies between columns is often overlooked. To address these challenges, we introduce Dependency-aware Core Column Set Discovery (DaCo), an iterative method that uses a novel rough matching strategy to identify both inter-column dependencies and the core column set. Additionally, DaCo can be seamlessly integrated with pre-trained language models, as proposed in the optimization module. Unlike other methods, DaCo does not require labeled data or contextual information, making it suitable for real-world scenarios. In addition, it can identify multiple core columns within a table, which is common in real-world tables. We conduct experiments on six datasets, including five datasets with single core columns and one dataset with multiple core columns. Our experimental results show that DaCo outperforms existing core column set detection methods, further improving the effectiveness of table understanding tasks.
Article
This paper studies entity linking (EL) in Web tables, which aims to link the string mentions in table cells to their referent entities in a knowledge base. Two main problems exist in previous studies: 1) contextual information is not well utilized in mention-entity similarity computation; 2) the assumption on entity coherence that all entities in the same row or column are highly related to each other is not always correct. In this paper, we propose NPEL, a new Neural Paired Entity Linking framework, to overcome the above problems. In NPEL, we design a deep learning model with different neural networks and an attention mechanism, to model different kinds of contextual information of mentions and entities, for mention-entity similarity computation in Web tables. NPEL also relaxes the above assumption on entity coherence by a new paired entity linking algorithm, which iteratively selects two mentions with the highest confidence for EL. Experiments on real-world datasets exhibit that NPEL has the best performance compared with state-of-the-art baselines in different evaluation metrics.
Chapter
In a relational table, core columns represent the primary subject entities that other columns in the table depend on. While discovering core columns is crucial for understanding a table’s semantic column types, column relations, and entities, it is often overlooked. Previous methods typically rely on heuristic rules or contextual information, which can fail to accurately capture the dependencies between columns and make it difficult to preserve their relationships. To address these challenges, we introduce Dependency-aware Core Column Discovery (DaCo), an iterative method that uses a novel rough matching strategy to identify both inter-column dependencies and core columns. Unlike other methods, DaCo does not require labeled data or contextual information, making it suitable for practical scenarios. Additionally, it can identify multiple core columns within a table, which is common in real-world tables. Our experimental results demonstrate that DaCo outperforms existing core column discovery methods, substantially improving the efficiency of table understanding tasks.
Article
Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
Article
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
Article
We have made tremendous strides in providing tools for data scientists to discover new tables useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called Full Disjunction, was proposed in the 1980's, but there has been little progress in using it for data science to integrate tables culled from data lakes. We provide ALITE, the first proposal for scalable integration of tables that may have been discovered using join, union or related table search. We empirically show that ALITE can outperform previous algorithms for computing the Full Disjunction. ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. To evaluate ALITE, we develop and share three new benchmarks for integration that use real data lake tables.
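
For exactly two tables, the Full Disjunction mentioned above coincides with a full outer join on the shared attributes, which pandas computes directly; the hard case ALITE addresses is integrating three or more tables, where nesting outer joins can lose tuples depending on the join order. Column names and values below are illustrative.

```python
import pandas as pd

cities = pd.DataFrame({"City": ["Munich", "Dresden"], "State": ["Bavaria", "Saxony"]})
mayors = pd.DataFrame({"City": ["Munich", "Hamburg"], "Mayor": ["A", "B"]})

# Full disjunction of two tables == full outer join on their shared attribute(s):
# Munich is fully joined, while Dresden and Hamburg are padded with NaN.
print(cities.merge(mayors, on="City", how="outer"))
```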
Article
A core operation in data discovery is to find joinable tables for a given table. Real-world tables include both unary and n-ary join keys. However, existing table discovery systems are optimized for unary joins and are ineffective and slow in the presence of n-ary keys. In this paper, we introduce Mate, a table discovery system that leverages a novel hash-based index that enables n-ary join discovery through a space-efficient super key. We design a filtering layer that uses a novel hash, Xash. This hash function encodes the syntactic features of all column values and aggregates them into a super key, which allows the system to efficiently prune tables with non-joinable rows. Our join discovery system is able to prune up to 1,000x more false positives and leads to over 60x faster table discovery in comparison to the state of the art.
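The sketch below illustrates the general idea of a hash-based super key for pruning non-joinable rows; it is not the Xash function itself (which encodes syntactic features much more carefully) and uses Python's built-in hash() purely as a placeholder.

def fingerprint(values, bits=64):
    # Fold one hash bit per cell value into a fixed-width row fingerprint ("super key").
    fp = 0
    for v in values:
        fp |= 1 << (hash(str(v)) % bits)
    return fp

def may_join(row_fingerprint, key_values, bits=64):
    # False: the row provably cannot match the n-ary key; True: must still be verified.
    key_fp = fingerprint(key_values, bits)
    return (row_fingerprint & key_fp) == key_fp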
Chapter
Data is nowadays an extremely valuable resource. The owner of the data could be an organization, a company, the government, or an individual. Because of that, the content of the datasets coming from these sources varies: it could be about primary education, about medical care in the U.S., or about agriculture in Vietnam, among many other topics. Intuitively, some datasets are about the same topic and therefore have the same, or at least similar, structures, and it is beneficial to union such datasets into a more meaningful dataset. In this paper, we propose a data union method based on hierarchical clustering and Set Unionability. For simplicity, the method uses the JSON data format as its input data type.
Article
We propose methods for extracting triples from Wikipedia’s HTML tables using a reference knowledge graph. Our methods use a distant-supervision approach to find existing triples in the knowledge graph for pairs of entities on the same row of a table, postulating the corresponding relation for pairs of entities from other rows in the corresponding columns, thus extracting novel candidate triples. Binary classifiers are applied on these candidates to detect correct triples and thus increase the precision of the output triples. We extend this approach with a preliminary step where we first group and merge similar tables, thereafter applying extraction on the larger merged tables. More specifically, we propose an observed schema for individual tables, which is used to group and merge tables. We compare the precision and number of triples extracted with and without table merging, where we show that with merging, we can extract a larger number of triples at a similar precision. Ultimately, from the tables of English Wikipedia, we extract 5.9 million novel and unique triples for Wikidata at an estimated precision of 0.718.
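A compressed sketch of the distant-supervision step described above: for a pair of columns, collect the relations that the reference knowledge graph already asserts for some entity pairs on the same rows, then postulate the majority relation for the remaining pairs as novel candidate triples (the subsequent classification step is omitted). The kb_relations lookup is a hypothetical stand-in for the knowledge graph.

from collections import Counter

def candidate_triples(subject_column, object_column, kb_relations):
    # kb_relations: dict mapping an (entity, entity) pair to the set of known relations.
    observed = Counter()
    for s, o in zip(subject_column, object_column):
        for relation in kb_relations.get((s, o), ()):
            observed[relation] += 1
    if not observed:
        return []
    relation, _ = observed.most_common(1)[0]
    return [(s, relation, o)
            for s, o in zip(subject_column, object_column)
            if relation not in kb_relations.get((s, o), ())]   # keep only novel triples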
Article
Full-text available
http://www.semantic-web-journal.net/content/effective-and-efficient-semantic-table-interpretation-using-tableminer-0
Conference Paper
Full-text available
Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually unknown and almost impossible to discover manually. For this reason, database research has proposed various algorithms for functional dependency discovery. None, however, are able to process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present a hybrid discovery algorithm called HyFD, which combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets.
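HyFD interleaves sampling-based approximation with efficient validation; the snippet below shows only the elementary validation idea on which such approaches rest: a candidate functional dependency X -> A holds exactly when no two rows agree on X but disagree on A.

def fd_holds(rows, lhs, rhs):
    # rows: list of dicts; lhs: list of attribute names; rhs: a single attribute name.
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False          # two rows agree on lhs but differ on rhs
        seen[key] = row[rhs]
    return True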
Article
Full-text available
While much work has focused on efficient processing of Big Data, little work considers how to understand them. In this paper, we describe Helix, a system for guided exploration of Big Data. Helix provides a unified view of sources, ranging from spreadsheets and XML files with no schema, all the way to RDF graphs and relational data with well-defined schemas. Helix users explore these heterogeneous data sources through a combination of keyword searches and navigation of linked web pages that include information about the schemas, as well as data and semantic links within and across sources. At a technical level, the paper describes the research challenges involved in developing Helix, along with a set of real-world usage scenarios and the lessons learned.
Conference Paper
Full-text available
While syntactic transformations require the application of a formula on the input values, such as unit conversion or date format conversions, semantic transformations, such as zip code to city, require a look-up in some reference data. We recently presented DataXFormer, a system that leverages Web tables, Web forms, and expert sourcing to cover a wide range of transformations. In this demonstration, we present the user-interaction with DataXFormer and show scenarios on how it can be used to transform data and explore the effectiveness and efficiency of several approaches for transformation discovery, leveraging about 112 million tables and online sources.
Conference Paper
Full-text available
We describe work on automatically inferring the intended meaning of tables and representing it as RDF linked data, making it available for improving search, interoperability and integration. We present implementation details of a joint inference module that uses knowledge from the linked open data (LOD) cloud to jointly infer the semantics of column headers, table cell values (e.g., strings and numbers) and relations between columns. The framework generates linked data by mapping column headers to classes, cell values to LOD entities (existing or new) and by identifying relations between columns. We also implement a novel Semantic Message Passing algorithm which uses LOD knowledge to improve existing message passing schemes. We evaluate our implemented techniques on tables from the Web and Wikipedia.
Article
Full-text available
The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different language editions of Wikipedia. The largest DBpedia knowledge base which is extracted from the English edition of Wikipedia consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other 110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The DBpedia project maps Wikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1,650 properties. The mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions to be combined. The project publishes releases of all DBpedia knowledge bases for download and provides SPARQL query access to 14 out of the 111 language editions via a global network of local DBpedia chapters. In addition to the regular releases, the project maintains a live knowledge base which is updated whenever a page in Wikipedia changes. DBpedia sets 27 million RDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpedia data. Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and make DBpedia one of the central interlinking hubs in the Linked Open Data (LOD) cloud. In this system report, we give an overview of the DBpedia community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and applications.
Article
Full-text available
We consider the problem of finding related tables in a large corpus of heterogeneous tables. Detecting related tables provides users a powerful tool for enhancing their tables with additional data and enables effective reuse of available public data. Our first contribution is a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union. Our second contribution is a set of algorithms for detecting related tables that can be either unioned or joined. We describe a set of experiments that demonstrate that our algorithms produce highly related tables. We also show that we can often improve the results of table search by pulling up tables that are ranked much lower based on their relatedness to top-ranked tables. Finally, we describe how to scale up our algorithms and show the results of running them on a corpus of over a million tables extracted from Wikipedia.
Conference Paper
Full-text available
To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this "deep Web," query interfaces generally form complex matchings between attribute groups (e.g., [author] corresponds to [first name, last name] in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., [first name, last name]) tend to be co-present in query interfaces and are thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preparation, dual mining of positive and negative correlations, and finally matching selection. Unlike previous correlation mining algorithms, which mainly focus on finding strong positive correlations, our algorithm considers both positive and negative correlations, especially the subtlety of negative correlations, due to their special importance in schema matching. This leads to the introduction of a new correlation measure, H-measure, distinct from those proposed in previous work. We evaluate our approach extensively and the results show good accuracy for discovering complex matchings.
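The toy computation below is meant only to illustrate the co-occurrence intuition, not the H-measure or the full DCM framework: for every attribute pair it compares the observed co-occurrence count with what independent occurrence would predict, so values well above 1 hint at grouping attributes and values near 0 for individually frequent attributes hint at synonym candidates.

from collections import Counter
from itertools import combinations

def correlation_scores(interfaces):
    # interfaces: list of attribute-name sets, one per query interface.
    n = len(interfaces)
    single = Counter(a for schema in interfaces for a in schema)
    pair = Counter(frozenset(p) for schema in interfaces
                   for p in combinations(sorted(schema), 2))
    scores = {}
    for a, b in combinations(sorted(single), 2):
        observed = pair.get(frozenset((a, b)), 0)
        expected = single[a] * single[b] / n
        scores[(a, b)] = observed / expected if expected else 0.0
    return scores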
Conference Paper
Full-text available
Schema matching is the task of finding semantic correspondences between elements of two schemas. It is needed in many database applications, such as integration of web data sources, data warehouse loading and XML message mapping. To reduce the amount of user effort as much as possible, automatic approaches combining several match techniques are required. While such match approaches have found considerable interest recently, the problem of how to best combine different match algorithms still requires further work. We have thus developed the COMA schema matching system as a platform to combine multiple matchers in a flexible way. We provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions. We use COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas. The results obtained so far show the superiority of combined match approaches and indicate the high value of reuse-oriented strategies.
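As a hedged, minimal sketch of the composite-matcher idea (COMA itself ships far more matchers plus dedicated aggregation and selection strategies), the code below runs two simple string matchers over all attribute pairs, averages their scores and keeps the best correspondence per attribute above a threshold; both matcher functions are illustrative choices, not COMA's.

from difflib import SequenceMatcher

def name_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def prefix_similarity(a, b):
    a, b = a.lower(), b.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0

def combined_matches(schema1, schema2, matchers=(name_similarity, prefix_similarity),
                     threshold=0.6):
    def aggregate(a, b):
        return sum(m(a, b) for m in matchers) / len(matchers)   # average the matcher scores
    matches = []
    for a in schema1:
        best = max(schema2, key=lambda b: aggregate(a, b))
        score = aggregate(a, best)
        if score >= threshold:
            matches.append((a, best, round(score, 2)))
    return matches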
Conference Paper
Full-text available
We demonstrate the schema and ontology matching tool COMA++. It extends our previous prototype COMA utilizing a composite approach to combine different match algorithms [3]. COMA++ implements significant improvements and offers a comprehensive infrastructure to solve large real-world match problems. It comes with a graphical interface enabling a variety of user interactions. Using a generic data representation, COMA++ uniformly supports schemas and ontologies, e.g. the powerful standard languages W3C XML Schema and OWL. COMA++ includes new approaches for ontology matching, in particular the utilization of shared taxonomies. Furthermore, different match strategies can be applied including various forms of reusing previously determined match results and a so-called fragment-based match approach which decomposes a large match problem into smaller problems. Finally, COMA++ can not only be used to solve match problems but also to comparatively evaluate the effectiveness of different match algorithms and strategies.
Conference Paper
Full-text available
One significant part of today's Web is Web databases, which can dynamically provide information in response to user queries. To help users submit queries to and collect query results from different Web databases, the query interface matching problem needs to be addressed. To solve this problem, we propose a new complex schema matching approach, Holistic Schema Matching (HSM). By examining the query interfaces of real Web databases, we observe that attribute matchings can be discovered from attribute-occurrence patterns. For example, First Name often appears together with Last Name while it is rarely co-present with Author in the Books domain. Thus, we design a count-based greedy algorithm to identify which attributes are more likely to be matched in the query interfaces. In particular, HSM can identify both simple matching and complex matching, where the former refers to 1:1 matching between attributes and the latter refers to 1:n or m:n matching between attributes. Our experiments show that HSM can discover both simple and complex matchings accurately and efficiently on real data sets.
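A rough sketch of the counting intuition only: attributes that are individually frequent across interfaces but almost never co-occur are treated as synonym candidates and picked greedily, strongest pair first. HSM's actual scoring and its handling of complex (1:n, m:n) matchings are more involved than this.

from collections import Counter
from itertools import combinations

def greedy_synonym_matching(interfaces, min_support=3):
    # interfaces: list of attribute-name sets, one per query interface.
    frequency = Counter(a for schema in interfaces for a in schema)
    co_occurrence = Counter(frozenset(p) for schema in interfaces
                            for p in combinations(sorted(schema), 2))
    candidates = []
    for a, b in combinations(sorted(frequency), 2):
        if frequency[a] >= min_support and frequency[b] >= min_support:
            score = frequency[a] * frequency[b] / (1 + co_occurrence.get(frozenset((a, b)), 0))
            candidates.append((score, a, b))
    matched, result = set(), []
    for score, a, b in sorted(candidates, reverse=True):   # greedy: strongest pair first
        if a not in matched and b not in matched:
            matched.update((a, b))
            result.append((a, b))
    return result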
Article
Full-text available
In a paper published in the 2001 VLDB Conference, we proposed treating generic schema matching as an independent problem. We developed a taxonomy of existing techniques, a new schema matching algorithm, and an approach to comparative evaluation. Since then, the field has grown into a major research topic. We briefly summarize the new techniques that have been developed and applications of the techniques in the commercial world. We conclude by discussing future trends and recommendations for further work. 1.
Article
Full-text available
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WEBTABLES system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus? First, we develop new techniques for keyword search over a corpus of tables, and show that they can achieve substantially higher relevance than solutions based on a traditional search engine. Second, we introduce a new object derived from the database corpus: the attribute correlation statistics database (AcsDB) that records corpus-wide statistics on co-occurrences of schema elements. In addition to improving search relevance, the AcsDB makes possible several novel applications: schema auto-complete, which helps a database designer to choose schema elements; attribute synonym finding, which automatically computes attribute synonym pairs for schema matching; and join-graph traversal, which allows a user to navigate between extracted schemas using automatically-generated join links.
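To give a feel for how corpus-wide co-occurrence statistics of the AcsDB kind can drive schema auto-complete, here is a small, assumption-laden sketch: it counts how often attribute names appear together across table schemas and then suggests the attributes that co-occur most often with a partially specified schema.

from collections import Counter
from itertools import combinations

def build_cooccurrence(schemas):
    # schemas: list of attribute-name lists, one per extracted table.
    co = Counter()
    for schema in schemas:
        for a, b in combinations(sorted(set(schema)), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

def autocomplete(partial_schema, co, top_k=3):
    scores = Counter()
    for (a, b), count in co.items():
        if a in partial_schema and b not in partial_schema:
            scores[b] += count
    return [attribute for attribute, _ in scores.most_common(top_k)]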
Article
Full-text available
Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
Conference Paper
Full-text available
Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering duplicates among data sets with unaligned schemas is more difficult than in the usual setting, because it is not clear which fields in one object should be compared with which fields in the other. We have developed a new algorithm that efficiently finds the most likely duplicates in such a setting. Now, our schema matching algorithm is able to identify corresponding attributes by comparing data values within those duplicate records. An experimental study on real-world data shows the effectiveness of this approach.
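A minimal sketch of the second step described above (the duplicate detection step itself is omitted): once pairs of records describing the same real-world entity are known, attributes are aligned by how often their values agree across those duplicates, even when the column names are opaque. The exact-match comparison used here is a simplifying assumption.

from collections import Counter

def match_by_duplicates(duplicate_pairs, attributes1, attributes2, min_agreement=0.8):
    # duplicate_pairs: list of (record1, record2) dicts referring to the same entity.
    agreement = Counter()
    for r1, r2 in duplicate_pairs:
        for a in attributes1:
            for b in attributes2:
                if str(r1.get(a, "")).strip().lower() == str(r2.get(b, "")).strip().lower():
                    agreement[(a, b)] += 1
    n = len(duplicate_pairs)
    return [(a, b) for (a, b), count in agreement.items() if n and count / n >= min_agreement]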
Article
Full-text available
Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework MGS for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, targeting at synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains and the results show good accuracy.
Conference Paper
Cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph have gained increasing attention over the last years and are starting to be deployed within various use cases. However, the content of such knowledge bases is far from being complete, far from always being correct, and suffers from deprecation (i.e. population numbers become outdated after some time). Hence, there are efforts to leverage various types of Web data to complement, update and extend such knowledge bases. A source of Web data that potentially provides a very wide coverage are millions of relational HTML tables that are found on the Web. The existing work on using data from Web tables to augment cross-domain knowledge bases reports only aggregated performance numbers. The actual content of the Web tables and the topical areas of the knowledge bases that can be complemented using the tables remain unclear. In this paper, we match a large, publicly available Web table corpus to the DBpedia knowledge base. Based on the matching results, we profile the potential of Web tables for augmenting different parts of cross-domain knowledge bases and report detailed statistics about classes, properties, and instances for which missing values can be filled using Web table data as evidence. In order to estimate the potential quality of the new values, we empirically examine the Local Closed World Assumption and use it to determine the maximal number of correct facts that an ideal data fusion strategy could generate. Using this as ground truth, we compare three data fusion strategies and conclude that knowledge-based trust outperforms PageRank- and voting-based fusion.
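Of the three fusion strategies compared above, the simplest one (voting) can be sketched in a few lines: for every (entity, property) pair, pick the value asserted by the most web tables; PageRank- and knowledge-based-trust fusion would weight these votes by source quality instead. The triple representation below is an assumption made for illustration.

from collections import Counter, defaultdict

def fuse_by_voting(observations):
    # observations: iterable of (entity, property, value) triples extracted from web tables.
    votes = defaultdict(Counter)
    for entity, prop, value in observations:
        votes[(entity, prop)][value] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}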
Conference Paper
Attribute synonyms are important ingredients for keyword-based search systems. For instance, web search engines recognize queries that seek the value of an entity on a specific attribute (referred to as e+a queries) and provide direct answers for them using a combination of knowledge bases, web tables and documents. However, users often refer to an attribute in their e+a query differently from how it is referred in the web table or text passage. In such cases, search engines may fail to return relevant answers. To address that problem, we propose to automatically discover all the alternate ways of referring to the attributes of a given class of entities (referred to as attribute synonyms) in order to improve search quality. The state-of-the-art approach that relies on attribute name co-occurrence in web tables suffers from low precision. Our main insight is to combine positive evidence of attribute synonymity from query click logs, with negative evidence from web table attribute name co-occurrences. We formalize the problem as an optimization problem on a graph, with the attribute names being the vertices and the positive and negative evidences from query logs and web table schemas as weighted edges. We develop a linear programming based algorithm to solve the problem that has bi-criteria approximation guarantees. Our experiments on real-life datasets show that our approach has significantly higher precision and recall compared with the state-of-the-art.
Conference Paper
Relational Web tables have become an important resource for applications such as factual search and entity augmentation. A major challenge for an automatic identification of relevant tables on the Web is the fact that many of these tables have missing or non-informative column labels. Research has focused largely on recovering the meaning of columns by inferring class labels from the instances using external knowledge bases. The table context, which often contains additional information on the table's content, is frequently considered as an indicator for the general content of a table, but not as a source for column-specific details. In this paper, we propose a novel approach to identify and extract column-specific information from the context of Web tables. In our extraction framework, we consider different techniques to extract directly as well as indirectly related phrases. We perform a number of experiments on Web tables extracted from Wikipedia. The results show that column-specific information extracted using our simple heuristic significantly boosts precision and recall for table and column search.
Conference Paper
Relational tables collected from HTML pages ("web tables") are used for a variety of tasks including table extension, knowledge base completion, and data transformation. Most of the existing algorithms for these tasks assume that the data in the tables has the form of binary relations, i.e., relates a single entity to a value or to another entity. Our exploration of a large public corpus of web tables, however, shows that web tables contain a large fraction of non-binary relations which will likely be misinterpreted by the state-of-the-art algorithms. In this paper, we propose a categorisation scheme for web table columns which distinguishes the different types of relations that appear in tables on the Web and may help to design algorithms which better deal with these different types. Designing an automated classifier that can distinguish between different types of relations is non-trivial, because web tables are relatively small, contain a high level of noise, and often miss partial key values. In order to be able to perform this distinction, we propose a set of features which goes beyond probabilistic functional dependencies by using the union of multiple tables from the same web site and from different web sites to overcome the problem that single web tables are too small for the reliable calculation of functional dependencies.
Conference Paper
Web tables form a valuable source of relational data. The Web contains an estimated 154 million HTML tables of relational data, with Wikipedia alone containing 1.6 million high-quality tables. Extracting the semantics of Web tables to produce machine-understandable knowledge has become an active area of research. A key step in extracting the semantics of Web content is entity linking (EL): the task of mapping a phrase in text to its referent entity in a knowledge base (KB). In this paper we present TabEL, a new EL system for Web tables. TabEL differs from previous work by weakening the assumption that the semantics of a table can be mapped to pre-defined types and relations found in the target KB. Instead, TabEL enforces soft constraints in the form of a graphical model that assigns higher likelihood to sets of entities that tend to co-occur in Wikipedia documents and tables. In experiments, TabEL significantly reduces error when compared to current state-of-the-art table EL systems, including a 75% error reduction on Wikipedia tables and a 60% error reduction on Web tables. We also make our parsed Wikipedia table corpus and test datasets publicly available for future work.
Conference Paper
Millions of HTML tables containing structured data can be found on the Web. With their wide coverage, these tables are potentially very useful for filling missing values and extending cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph. As a prerequisite for being able to use table data for knowledge base extension, the HTML tables need to be matched with the knowledge base, meaning that correspondences between table rows/columns and entities/schema elements of the knowledge base need to be found. This paper presents the T2D gold standard for measuring and comparing the performance of HTML table to knowledge base matching systems. T2D consists of 8,700 schema-level and 26,100 entity-level correspondences between the WebDataCommons Web Tables Corpus and the DBpedia knowledge base. In contrast to related work on HTML table to knowledge base matching, the Web Tables Corpus (147 million tables), the knowledge base, as well as the gold standard are publicly available. The gold standard is used afterward to evaluate the performance of T2K Match, an iterative matching method which combines schema and instance matching. T2K Match is designed for the use case of matching large quantities of mostly small and narrow HTML tables against large cross-domain knowledge bases. The evaluation using the T2D gold standard shows that T2K Match discovers table-to-class correspondences with a precision of 94%, row-to-entity correspondences with a precision of 90%, and column-to-property correspondences with a precision of 77%.
Article
A Search Join is a join operation which extends a user-provided table with additional attributes based on a large corpus of heterogeneous data originating from the Web or corporate intranets. Search Joins are useful within a wide range of application scenarios: Imagine you are an analyst having a local table describing companies and you want to extend this table with attributes containing the headquarters, turnover, and revenue of each company. Or imagine you are a film enthusiast and want to extend a table describing films with attributes like director, genre, and release date of each film. This article presents the Mannheim Search Join Engine which automatically performs such table extension operations based on a large corpus of Web data. Given a local table, the Mannheim Search Join Engine searches the corpus for additional data describing the entities contained in the input table. The discovered data are joined with the local table and are consolidated using schema matching and data fusion techniques. As result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. We evaluate the Mannheim Search Join Engine using heterogeneous data originating from over one million different websites. The data corpus consists of HTML tables, as well as Linked Data and Microdata annotations which are converted into tabular form. Our experiments show that the Mannheim Search Join Engine achieves a coverage close to 100% and a precision of around 90% for the tasks of extending tables describing cities, companies, countries, drugs, books, films, and songs.
Article
Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft's Satori, and Google's Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
Conference Paper
Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics [Cafarella et al., 2008a; Elmeleegy et al., 2009; Limaye et al., 2010; Venetis et al., 2011]. As a result, hundreds of millions of high quality structured data tables can now be explored by the users. In this paper, we argue that those efforts only scratch the surface of the true value of structured data on the Web, and study the challenging problem of synthesizing tables from the Web, i.e., producing never-before-seen tables from raw tables on the Web. Table synthesis offers an important semantic advantage: when a set of related tables are combined into a single union table, powerful mechanisms, such as temporal or geographical comparison and visualization, can be employed to understand and mine the underlying data holistically. We focus on one fundamental task of table synthesis, namely, table stitching. Within a given site, many tables with identical schemas can be scattered across many pages. The task of table stitching involves combining such tables into a single meaningful union table and identifying extra attributes and values for its rows so that rows from different original tables can be distinguished. Specifically, we first define the notion of stitchable tables and identify collections of tables that can be stitched. Second, we design an effective algorithm for extracting hidden attributes that are essential for the stitching process and for aligning values of those attributes across tables to synthesize new columns. We also assign meaningful names to these synthesized columns. Experiments on real world tables demonstrate the effectiveness of our approach.
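A heavily simplified sketch of the stitching step (the paper additionally recovers hidden attributes from the surrounding pages and aligns their values): within one web site, tables whose headers are identical are concatenated into a single union table, and the source page is kept as an extra column so that rows from different original tables remain distinguishable.

from collections import defaultdict
import pandas as pd

def stitch_site_tables(tables):
    # tables: list of (page_url, DataFrame) pairs extracted from a single web site.
    groups = defaultdict(list)
    for url, df in tables:
        groups[tuple(df.columns)].append(df.assign(_source_page=url))
    return [pd.concat(group, ignore_index=True) for group in groups.values()]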
Conference Paper
This paper presents a framework that attempts to harvest useful knowledge from the rich corpus of relational data on the Web: HTML tables. Through a multi-phase algorithm, and with the help of a universal probabilistic taxonomy called Probase, the framework is capable of understanding the entities, attributes and values in many tables on the Web. With this knowledge, we built two interesting applications: a semantic table search engine which returns relevant tables from keyword queries, and a tool to further expand and enrich Probase. Our experiments indicate generally high performance in both table search results and taxonomy expansion. This shows that the proposed framework practically benefits knowledge discovery and semantic search.
Article
The Web contains a vast corpus of HTML tables, specifically entity attribute tables. We present three core operations, namely entity augmentation by attribute name, entity augmentation by example and attribute discovery, that are useful for "information gathering" tasks (e.g., researching for products or stocks). We propose to use web table corpus to perform them automatically. We require the operations to have high precision and coverage, have fast (ideally interactive) response times and be applicable to any arbitrary domain of entities. The naive approach that attempts to directly match the user input with the web tables suffers from poor precision and coverage. Our key insight is that we can achieve much higher precision and coverage by considering indirectly matching tables in addition to the directly matching ones. The challenge is to be robust to spuriously matched tables: we address it by developing a holistic matching framework based on topic sensitive pagerank and an augmentation framework that aggregates predictions from multiple matched tables. We propose a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Our experiments on real-life datasets and 573M web tables show that our approach has (i) significantly higher precision and coverage and (ii) four orders of magnitude faster response times compared with the state-of-the-art approach.
Conference Paper
In a Web database that dynamically provides information in response to user queries, two distinct schemas, interface schema (the schema users can query) and result schema (the schema users can browse), are presented to users. Each partially reflects the actual schema of the Web database. Most previous work only studied the problem of schema matching across query interfaces of Web databases. In this paper, we propose a novel schema model that distinguishes the interface and the result schema of a Web database in a specific domain. In this model, we address two significant Web database schema-matching problems: intra-site and inter-site. The first problem is crucial in automatically extracting data from Web databases, while the second problem plays a significant role in meta-retrieving and integrating data from different Web databases. We also investigate a unified solution to the two problems based on query probing and instance-based schema matching techniques. Using the model, a cross validation technique is also proposed to improve the accuracy of the schema matching. Our experiments on real Web databases demonstrate that the two problems can be solved simultaneously with high precision and recall.
Article
Tables are a universal idiom to present relational data. Billions of tables on Web pages express entity references, attributes and relationships. This representation of relational world knowledge is usually considerably better than completely unstructured, free-format text. At the same time, unlike manually-created knowledge bases, relational information mined from "organic" Web tables need not be constrained by availability of precious editorial time. Unfortunately, in the absence of any formal, uniform schema imposed on Web tables, Web search cannot take advantage of these high-quality sources of relational information. In this paper we propose new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns seek to express. We propose a new graphical model for making all these labeling decisions for each table simultaneously, rather than make separate local decisions for entities, types and relations. Experiments using the YAGO catalog, DBpedia, tables from Wikipedia, and over 25 million HTML tables from a 500 million page Web crawl uniformly show the superiority of our approach. We also evaluate the impact of better annotations on a prototype relational Web search tool. We demonstrate clear benefits of our annotations beyond indexing tables in a purely textual manner.
Article
The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables. To recover semantics of tables, we leverage a database of class labels and relationships automatically extracted from the Web. The database of classes and relationships has very wide coverage, but is also noisy. We attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and analogously for binary relationships. We describe a formal model for reasoning about when we have seen sufficient evidence for a label, and show that it performs substantially better than a simple majority scheme. We describe a set of experiments that illustrate the utility of the recovered semantics for table search and show that it performs substantially better than previous approaches. In addition, we characterize what fraction of tables on the Web can be annotated using our approach.
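The simple majority scheme that the formal model is compared against can be sketched as follows: attach a class label to a column when a sufficient fraction of its values carry that label in the label database; the label_db mapping is a hypothetical stand-in for the extracted database of class labels.

from collections import Counter

def majority_label(column_values, label_db, min_fraction=0.5):
    # label_db: dict mapping a cell value to the set of class labels observed for it.
    counts = Counter(label for value in column_values for label in label_db.get(value, ()))
    if not counts:
        return None
    label, hits = counts.most_common(1)[0]
    return label if hits / len(column_values) >= min_fraction else None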
Article
The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tools. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, due to the vastness of the corpus, a user can never know all of the potentially-relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information regarding the data is only present in its enclosing web page and needs to be extracted appropriately. This paper describes Octopus, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself, and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.