ABSTRACT: Many data sharing communities create data standards ("hub" schemata) to speed information integration by increasing reuse of both data definitions and mappings. Unfortunately, creating these standards and the mappings to the enterprise's implemented systems is both time consuming and expensive. This paper presents Unity, a novel tool for speeding the development of a community vocabulary, which includes both a standard schema and the necessary mappings. We present Unity's scalable algorithms for creating vocabularies and its novel human-computer interface, which gives the integrator a powerful environment for refining the vocabulary. We then describe Unity's extensive reuse of data structures and algorithms from the OpenII information integration framework, which not only sped the construction of Unity but also results in reuse of the artifacts produced by Unity: vocabularies serve as the basis of information exchanges, and can also be reused as thesauri by other tools within the OpenII framework. Unity has been applied to real U.S. government information integration challenges.
Proceedings of the IEEE International Conference on Information Reuse and Integration, IRI 2011, 3-5 August 2011, Las Vegas, Nevada, USA; 01/2011
ABSTRACT: Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies. Data profiling deserves a fresh look for two reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, more and more data beyond the traditional relational databases are being created and beg to be profiled. The article proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
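The column-level metadata mentioned in the abstract (data types, completeness, uniqueness) can be illustrated with a minimal profiler. This is a hedged sketch, not any tool's actual implementation: the table representation (a list of dicts), the function name `profile_column`, and the naive int-or-string type inference are all illustrative assumptions.

```python
# Minimal sketch of single-column data profiling over an in-memory table
# represented as a list of dicts (an illustrative assumption, not a
# commercial tool's data model).

def profile_column(rows, column):
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    # Completeness: fraction of rows with a non-null value.
    completeness = len(non_null) / len(values) if values else 0.0
    # Uniqueness: fraction of distinct values among the non-null ones.
    uniqueness = len(set(non_null)) / len(non_null) if non_null else 0.0

    # Naive type inference: "int" only if every non-null value parses as int.
    def is_int(v):
        try:
            int(v)
            return True
        except (TypeError, ValueError):
            return False

    inferred = "int" if non_null and all(is_int(v) for v in non_null) else "str"
    return {"completeness": completeness,
            "uniqueness": uniqueness,
            "type": inferred}

rows = [{"id": 1, "city": "Berlin"},
        {"id": 2, "city": "Potsdam"},
        {"id": 3, "city": None},
        {"id": 4, "city": "Berlin"}]

print(profile_column(rows, "id"))    # fully complete, all values distinct, int
print(profile_column(rows, "city"))  # one null, one duplicate value, str
```

A real profiler would scan columns in a single pass and add value patterns, key candidates, and dependency discovery on top of these per-column statistics.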
ACM SIGMOD Record 02/2014; 42(4):40-49.
ABSTRACT: Generating new knowledge from scientific databases, fusing product information across business companies, or computing the overlap between various data collections are a few examples of applications that require data integration. A crucial step in this integration process is the discovery of correspondences between the data sources and the evaluation of their quality. For this purpose, the overall metric was designed to compute the post-match effort, but it suffers from major drawbacks. Thus, we present in this paper two related metrics for computing this effort. The first, called post-match effort, measures the amount of work that the user must provide to correct the correspondences discovered by the tool. The second measures human-spared resources, i.e., the rate of automation gained by using a matching tool.
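The two metrics in the abstract can be sketched under a deliberately simplified cost model: correspondences are (source, target) attribute pairs, and deleting a wrong pair or adding a missing one each costs one unit of work. The function names and the unit-cost assumption are illustrative; the paper's actual formulations are more refined.

```python
# Simplified sketch of post-match effort and human-spared resources (HSR),
# assuming set-valued correspondences and unit cost per user correction.
# This is an illustration, not the paper's exact metric definitions.

def post_match_effort(discovered, ground_truth):
    false_pos = discovered - ground_truth   # pairs the user must delete
    false_neg = ground_truth - discovered   # pairs the user must add
    return len(false_pos) + len(false_neg)

def human_spared_resources(discovered, ground_truth):
    # Fully manual matching would cost one unit per correct pair;
    # HSR is the fraction of that work the matching tool saved.
    manual = len(ground_truth)
    if manual == 0:
        return 1.0
    effort = post_match_effort(discovered, ground_truth)
    return max(0.0, 1.0 - effort / manual)

truth = {("name", "fullname"), ("addr", "address"), ("zip", "postcode")}
found = {("name", "fullname"), ("addr", "address"), ("id", "postcode")}

print(post_match_effort(found, truth))        # 1 deletion + 1 addition = 2
print(human_spared_resources(found, truth))   # 1 - 2/3 of the manual work spared
```

Under this model a perfect tool yields effort 0 and HSR 1.0, while a tool that finds nothing yields HSR 0.0, which matches the intuition of "rate of automation gained".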
On the Move to Meaningful Internet Systems: OTM 2011 - Confederated International Conferences: CoopIS, DOA-SVI, and ODBASE 2011, Hersonissos, Crete, Greece, October 17-21, 2011, Proceedings, Part I; 01/2011