ABSTRACT: Schema matching and value mapping across two information sources, such as databases, are critical information aggregation tasks. Before data can be integrated from multiple tables, the columns and values within the tables must be matched. The complexity of both problems grows quickly with the number of attributes to be matched and with the multiple possible semantics of data values. Traditional research has mostly tackled schema matching and value mapping independently, and for categorical (discrete-valued) attributes. We propose novel methods that leverage value mappings to enhance schema matching in the presence of opaque column names, for schemas consisting of both continuous and discrete-valued attributes. An additional source of complexity is that a discrete-valued attribute in one schema may in fact be a quantized, encoded version of a continuous-valued attribute in the other schema. In our approach, which can tackle both “onto” and bijective schema matching, the fitness objective for matching a pair of attributes from two schemas exploits the statistical distribution over values within the two attributes. Suitable fitness objectives are based on Euclidean distance and the data log-likelihood, both of which are applied in our experimental study. Our matching methods use a heuristic local-descent optimization strategy that applies two-opt switching to optimize attribute matches while simultaneously embedding value mappings. Our experiments show that the proposed techniques matched mixed continuous and discrete-valued attribute schemas with high accuracy and thus should be a useful addition to a framework of (semi-)automated tools for data alignment.
ACM Transactions on Database Systems (TODS). 04/2013; 38(1).
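The distribution-based fitness idea above can be sketched as follows. This is a toy illustration, not the paper's actual algorithm: it compares the sorted frequency profiles of two columns with Euclidean distance, so the score ignores the (possibly opaque or re-encoded) value labels themselves. All function names are hypothetical.

```python
from collections import Counter

def value_distribution(column):
    """Empirical probability of each distinct value in a column."""
    counts = Counter(column)
    total = len(column)
    return {v: c / total for v, c in counts.items()}

def euclidean_fitness(col_a, col_b):
    """Euclidean distance between the sorted frequency profiles of two
    columns. Sorting makes the comparison independent of the actual
    value labels; lower distance = better candidate match."""
    p = sorted(value_distribution(col_a).values(), reverse=True)
    q = sorted(value_distribution(col_b).values(), reverse=True)
    n = max(len(p), len(q))
    p += [0.0] * (n - len(p))
    q += [0.0] * (n - len(q))
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
```

With this measure, a column holding "M"/"F" values and a column holding 0/1 codes with the same 3:1 skew score a distance of zero, even though no value appears in both.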
ABSTRACT: In this paper, we introduce a web-enabled geovisual analytics approach to leveraging Twitter in support of crisis management. The approach is implemented in a map-based, interactive web application that enables information foraging and sensemaking using "tweet" indexing and display based on place, time, and concept characteristics. In this paper, we outline the motivation for the research, review selected background briefly, describe the web application we have designed and implemented, and discuss our planned next steps.
ABSTRACT: Geographically-grounded situational awareness (SA) is critical to crisis management and is essential in many other decision-making domains that range from infectious disease monitoring, through regional planning, to political campaigning. Social media are becoming an important information input to support situational assessment (to produce awareness) in all domains. Here, we present a geovisual analytics approach to supporting SA for crisis events using one source of social media, Twitter. Specifically, we focus on leveraging explicit and implicit geographic information for tweets, on developing place-time-theme indexing schemes that support overview+detail methods and that scale analytical capabilities to relatively large tweet volumes, and on providing visual interface methods to enable understanding of the place, time, and theme components of evolving situations. Our approach is user-centered, using scenario-based design methods that include formal scenarios to guide design and validate implementation, as well as a systematic claims analysis to justify design choices and provide a framework for future testing. The work is informed by a structured survey of practitioners, and the end product of Phase-I development is demonstrated and validated through implementation in SensePlace2, a map-based web application initially focused on tweets but extensible to other media.
2011 IEEE Conference on Visual Analytics Science and Technology, VAST 2011, Providence, Rhode Island, USA, October 23-28, 2011; 01/2011
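A place-time-theme indexing scheme of the kind described above can be sketched minimally as follows. This is a hypothetical illustration, not SensePlace2's implementation: each tweet is filed under every (place, day, theme) combination it mentions, which supports both overview counts and detail retrieval.

```python
from collections import defaultdict
from datetime import datetime

class TweetIndex:
    """Toy place-time-theme index. Tweets are filed under each
    (place, day, theme) cell they mention."""

    def __init__(self):
        self.index = defaultdict(list)

    def add(self, tweet_id, place, timestamp, themes):
        day = timestamp.date().isoformat()
        for theme in themes:
            self.index[(place, day, theme)].append(tweet_id)

    def overview(self, theme):
        """Overview: count of matching tweets per (place, day) cell."""
        return {(p, d): len(ids)
                for (p, d, t), ids in self.index.items() if t == theme}

    def details(self, place, day, theme):
        """Detail: tweet ids filed under one (place, day, theme) cell."""
        return self.index.get((place, day, theme), [])
```

An overview+detail interface would drive a map or timeline from `overview()` and fetch individual tweets via `details()` when the user drills down.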
ABSTRACT: In emergencies (e.g., earthquakes, flooding), rapid responses are needed to address victims' requests for help. Social media used around crises involves self-organizing behavior that can produce accurate results, often in advance of official communications. This allows the affected population to send tweets or text messages and, hence, make themselves heard. The ability to classify tweets and text messages automatically, together with the ability to deliver the relevant information to the appropriate personnel, is essential for enabling personnel to work in a timely and efficient manner to address the most urgent needs and to better understand the emergency situation. In this study, we developed a reusable information technology infrastructure called Enhanced Messaging for the Emergency Response Sector (EMERSE). The components of EMERSE are: (i) an iPhone application; (ii) a Twitter crawler component; (iii) machine translation; and (iv) automatic message classification. While each component is important in itself and deserves a detailed analysis, in this paper we focus on the automatic classification component, which classifies and aggregates tweets and text messages about the Haiti disaster relief so that they can be easily accessed by non-governmental organizations, relief workers, people in Haiti, and their friends and families.
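Automatic message classification of the kind EMERSE performs can be sketched with a multinomial naive Bayes classifier. This is a minimal stand-in, not EMERSE's actual classifier (the abstract does not name the algorithm used); the class name, labels, and training data below are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-label word counts
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            s = math.log(self.label_counts[label])
            for w in text.lower().split():
                # Add-one smoothing keeps unseen words from zeroing the score.
                s += math.log((self.word_counts[label][w] + 1)
                              / (total + len(self.vocab)))
            return s
        return max(self.label_counts, key=log_score)
```

A production pipeline would add tokenization beyond whitespace splitting, handle the multilingual messages that motivated EMERSE's machine-translation component, and be trained on labeled crisis messages.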
ABSTRACT: Schema matching and value mapping across two heterogeneous information sources are critical tasks in applications involving data integration, data warehousing, and federation of databases. Before data can be integrated from multiple tables, the columns and the values appearing in the tables must be matched. The complexity of the problem grows quickly with the number of data attributes/columns to be matched and due to the multiple semantics of data values. Traditional research has tackled schema matching and value mapping independently. We propose a novel method that optimizes embedded value mappings to enhance schema matching in the presence of opaque data values and column names. In this approach, the fitness objective for matching a pair of attributes from two schemas depends on the value mapping function for each of the two attributes. Suitable fitness objectives include the Euclidean distance measure, which we use in our experimental study, as well as relative (cross) entropy. We propose a heuristic local descent optimization strategy that uses sorting and two-opt switching to jointly optimize value mappings and attribute matches. Our experiments show that our proposed technique outperforms earlier uninterpreted schema matching methods, and thus should form a useful addition to a suite of (semi-)automated tools for resolving structural heterogeneity.
IEEE Transactions on Knowledge and Data Engineering 01/2010; 22(2):291-304.
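The two-opt switching step mentioned above can be sketched as local descent over a bijective matching: starting from some assignment, repeatedly swap the targets of two source attributes whenever the swap lowers total cost. This is a toy version under the assumption of a precomputed pairwise cost matrix; the real method jointly re-optimizes value mappings at each step, which is omitted here.

```python
def two_opt_match(cost):
    """Local descent on a bijective attribute matching.

    cost[i][j] = fitness of matching source attribute i to target
    attribute j (lower is better). Returns match, where match[i] is
    the target assigned to source i."""
    n = len(cost)
    match = list(range(n))  # start from the identity assignment
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                old = cost[i][match[i]] + cost[j][match[j]]
                new = cost[i][match[j]] + cost[j][match[i]]
                if new < old:  # two-opt switch: exchange the two targets
                    match[i], match[j] = match[j], match[i]
                    improved = True
    return match
```

Like any local descent, this can stop at a local optimum; the papers' heuristics (sorting-based initialization, embedded value mapping) are aimed at making such descents effective in practice.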
ABSTRACT: Linguists and geographers are more and more interested in route direction documents because they contain interesting motion descriptions and language patterns. A large number of such documents can be easily found on the Internet. A challenging task is to automatically extract meaningful route parts, i.e. destinations, origins and instructions, from route direction documents. However, no work exists on this issue. In this paper, we introduce our effort toward this goal. Based on our observation that sentences are the basic units for route parts, we extract sentences from HTML documents using both natural language knowledge and HTML tag information. Additionally, we study the sentence classification problem in route direction documents and its sequential nature. Several machine learning methods are compared and analyzed. The impacts of different sets of features are studied. Based on the obtained insights, we propose to use sequence labelling models such as CRFs and MEMMs, and they yield a high accuracy in route part extraction. The approach is evaluated on over 10,000 hand-tagged sentences in 100 documents. The experimental results show the effectiveness of our method. The above techniques have been implemented and published as the first module of the GeoCAM system, which will also be briefly introduced in this paper.
12th International Workshop on the Web and Databases, WebDB 2009, Providence, Rhode Island, USA, June 28, 2009; 01/2009
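Sequence labelling models such as the CRFs and MEMMs mentioned above all rely on Viterbi decoding to pick the most likely label sequence. The sketch below is an HMM-style toy, not the paper's model: it shows only the decoding step, with made-up states (sentence labels) and observations (per-sentence feature symbols).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Viterbi decoding: most likely state sequence for the observations.

    start_p[s]: prior of starting in state s.
    trans_p[a][b]: probability of moving from state a to state b.
    emit_p[s][o]: probability of state s emitting observation o."""
    # V[t][s] = (best probability of any path ending in s at step t, that path)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs],
                 V[-1][prev][1] + [s])
                for prev in states)
        V.append(row)
    return max(V[-1].values())[1]
```

In the route-direction setting, the observations would be feature vectors extracted per sentence, and the sequential structure captures regularities such as instructions rarely preceding the origin.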
ABSTRACT: This article describes research in the ongoing search for better semantic similarity tools: such methods are important when attempting to reconcile or integrate knowledge, or knowledge-related resources such as ontologies and database schemas. We describe an extensible, open platform for experimenting with different measures of similarity for ontologies and concept maps. The platform is based around three different types of similarity, which we ground in cognitive principles, and provides a taxonomy and structure by which new similarity methods can be integrated and used. The platform supports a variety of specific similarity methods, to which researchers can add others of their own. It also provides flexible ways to combine the results from multiple methods, and some graphic tools for visualizing and communicating multi-part similarity scores. Details of the system, which forms part of the ConceptVista open codebase, are described, along with associated details of the interfaces by which users can add new methods, choose which methods are used, and select how multiple similarity scores are aggregated. We offer this as a community resource, since many similarity methods have been proposed but there is still much confusion about which one(s) might work well for different geographical problems; hence a test environment that all can access and extend would seem to be of practical use. We also provide some examples of the platform in use.
Transactions in GIS 01/2008; 12:713-732.
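The simplest way to combine results from multiple similarity methods, as the platform above allows, is a weighted average of per-method scores. This is a generic sketch (not the ConceptVista aggregation code); the function name and argument shapes are assumptions.

```python
def combine_similarities(scores, weights=None):
    """Weighted average of per-method similarity scores in [0, 1].

    scores:  maps method name -> similarity score.
    weights: maps method name -> non-negative weight; defaults to
             equal weighting across the supplied methods."""
    if weights is None:
        weights = {method: 1.0 for method in scores}
    total = sum(weights[method] for method in scores)
    return sum(scores[method] * weights[method] for method in scores) / total
```

Keeping the aggregation separate from the individual measures is what lets users swap in new methods or re-weight existing ones without touching the rest of the pipeline.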
ABSTRACT: Chem XSeer is a digital library and a data repository for the chemistry domain. The data deposited into our repository is linked with digital documents to create aggregates of resources representing the links between the data and the articles in which the data is reported. Chem XSeer enables the user to annotate the data using a metadata capturing tool. The metadata is indexed and searched to return relevant datasets to the user. Chem XSeer extracts chemical formulae and chemical names, disambiguates them, and indexes them to allow for domain-knowledge enhanced search capabilities. As search engines mature, we foresee such vertical search engines, employing domain-specific knowledge to perform information extraction and indexing, especially for scientific domains, becoming more popular. Though substantial research has been pursued on information extraction from text, extracting information from tables and figures has received little attention. In the Chem XSeer project, we are building tools that allow automatic extraction of tables and figures.
Semantic Scientific Knowledge Integration, Papers from the 2008 AAAI Spring Symposium, Technical Report SS-08-05, Stanford, California, USA, March 26-28, 2008; 01/2008
ABSTRACT: TexPlorer is an integrated system for exploring and analyzing large collections of text documents. The data processing modules of TexPlorer consist of named entity extraction, entity relation extraction, hierarchical clustering, and text summarization tools. Using a timeline tool, tree-view, table-view, and concept maps, TexPlorer provides an analytical interface for exploring a set of text documents from different perspectives and allows users to explore large document collections efficiently.
Proceedings of the IEEE Symposium on Visual Analytics Science and Technology, IEEE VAST 2007, Sacramento, California, USA, October 30-November 1, 2007; 01/2007
ABSTRACT: TexPlorer is an integrated system for exploring and analyzing large collections of text documents. The data processing modules of TexPlorer consist of named entity extraction, entity relation extraction, hierarchical clustering, and text summarization tools. Using a timeline tool, tree-view, table-view, and concept maps, TexPlorer provides visualizations from different perspectives and allows analysts to explore large document collections efficiently.
ABSTRACT: Increasingly, scientists are seeking to collaborate and share data among themselves. Such sharing can be readily done by publishing data on the World-Wide Web. Meaningful querying and searching on such data depends upon the availability of accurate and adequate metadata that describes the data and the sources of the data. In this paper, we outline the architecture of an implemented cyber-infrastructure for chemistry that provides tools for users to upload datasets and their metadata to a database. Our proposal combines a two-level metadata system with a centralized database repository and analysis tools to create an effective and capable data sharing infrastructure. Our infrastructure is extensible in that it can handle data in different formats and allows different analytic tools to be plugged in.
Eighth ACM International Workshop on Web Information and Data Management (WIDM 2006), Arlington, Virginia, USA, November 10, 2006; 01/2006
ABSTRACT: Discovering Web services and clustering them based on their functionalities are important problems with few existing solutions. Users may search for Web services using keywords and receive services that semantically match the keywords. Semantic Web service matchmaking, proposed to improve the precision of matchmaking based on syntactic cues alone, is generally based upon semantic service descriptions in ontology markup languages as add-ons or replacements to the underlying WSDL descriptors. Ways to improve the performance of direct matchmaking on WSDL, however, remain less studied. In this paper, we introduce a novel corpus-based method to facilitate matchmaking in WSDL files. We show that our method can identify semantically similar Web services with satisfactory recall.
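A common baseline for corpus-based matchmaking of the kind described above is TF-IDF weighting with cosine similarity over the terms extracted from service descriptors. The sketch below is that generic baseline, not the paper's specific method; it assumes the WSDL files have already been tokenised into term lists.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for tokenised documents (lists of terms).
    A term occurring in every document gets idf = log(1) = 0."""
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two weather-related service descriptions then score higher against each other than against, say, a currency-conversion service, which is the behavior a matchmaking or clustering step would build on.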
ABSTRACT: Resolving semantic heterogeneity among information sources is a central problem in information interoperation, information integration, and information sharing among websites. Ontologies express the semantics of the terminology used in these websites. Semantic heterogeneity can be resolved by mapping ontologies from diverse sources. Mapping large ontologies manually is almost impossible and results in a number of errors of omission and commission. Therefore, automated ontology mapping algorithms are a must. However, most existing ontology mapping tools do not provide exact mappings; rather, there is usually some degree of uncertainty. We describe a framework to improve existing ontology mappings using a Bayesian network. Omen, an Ontology Mapping ENhancer, uses a set of meta-rules that capture the influence of the ontology structure and the semantics of ontology relations, and matches nodes that are neighbors of already matched nodes in the two ontologies. We have implemented a prototype ontology matcher using probabilistic methods that can enhance existing matches between ontology concepts. Experiments demonstrate that Omen successfully identifies and enhances ontology mappings significantly.
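The core meta-rule intuition — a candidate match becomes more plausible when its neighbors are already confidently matched — can be sketched as a single update pass. This toy stands in for Omen's Bayesian network inference (a flat additive boost rather than proper probabilistic propagation); the function name, thresholds, and boost value are all assumptions.

```python
def enhance_matches(matches, edges_a, edges_b, boost=0.2, threshold=0.9):
    """One pass of a toy Omen-style meta-rule.

    matches: {(node_a, node_b): match probability}.
    edges_a / edges_b: adjacency, {node: set of neighbor nodes}, for
    each ontology. If a candidate pair has a pair of neighbors that is
    already a confident match, raise the candidate's probability."""
    confident = {pair for pair, prob in matches.items() if prob >= threshold}
    updated = {}
    for (a, b), prob in matches.items():
        neighbour_support = any(
            (na, nb) in confident
            for na in edges_a.get(a, ())
            for nb in edges_b.get(b, ()))
        updated[(a, b)] = min(1.0, prob + boost) if neighbour_support else prob
    return updated
```

Iterating such passes lets confidence spread outward from a few strong anchor matches, which is the effect the Bayesian network formulation achieves in a principled way.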