ABSTRACT: Schema matching and value mapping across two information sources, such as databases, are critical information aggregation tasks. Before data can be integrated from multiple tables, the columns and values within the tables must be matched. The complexity of both problems grows quickly with the number of attributes to be matched and with the multiple possible semantics of data values. Traditional research has mostly tackled schema matching and value mapping independently, and only for categorical (discrete-valued) attributes. We propose novel methods that leverage value mappings to enhance schema matching in the presence of opaque column names, for schemas consisting of both continuous and discrete-valued attributes. An additional source of complexity is that a discrete-valued attribute in one schema may in fact be a quantized, encoded version of a continuous-valued attribute in the other schema. In our approach, which can tackle both “onto” and bijective schema matching, the fitness objective for matching a pair of attributes from two schemas exploits the statistical distribution over values within the two attributes. Suitable fitness objectives are based on Euclidean distance and the data log-likelihood, both of which are applied in our experimental study. A heuristic local descent optimization strategy that uses two-opt switching to optimize attribute matches, while simultaneously embedding value mappings, is applied in our matching methods. Our experiments show that the proposed techniques matched schemas of mixed continuous and discrete-valued attributes with high accuracy and should thus be a useful addition to a framework of (semi-)automated tools for data alignment.
No preview · Article · Apr 2013 · ACM Transactions on Database Systems
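The distribution-based fitness objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bin set, the sample columns, and the `euclidean_fitness` helper are invented for the example.

```python
from collections import Counter
import math

def value_distribution(values, bins):
    # Empirical distribution of an attribute's values over a fixed bin order.
    counts = Counter(values)
    total = len(values)
    return [counts[b] / total for b in bins]

def euclidean_fitness(col_a, col_b, bins):
    # Euclidean distance between the two value distributions;
    # a lower distance indicates a better attribute match.
    p = value_distribution(col_a, bins)
    q = value_distribution(col_b, bins)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Columns drawn from similar distributions score lower (better)
# than columns drawn from dissimilar ones.
bins = ["low", "mid", "high"]
a = ["low"] * 5 + ["mid"] * 3 + ["high"] * 2
b = ["low"] * 4 + ["mid"] * 4 + ["high"] * 2
c = ["high"] * 8 + ["mid"] * 2
print(euclidean_fitness(a, b, bins) < euclidean_fitness(a, c, bins))  # → True
```

A log-likelihood objective, as the abstract notes, would score how probable one column's values are under the other column's distribution instead of comparing the distributions geometrically.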
ABSTRACT: This article focuses on integrating computational and visual methods in a system that supports analysts in identifying, extracting, mapping, and relating linguistic accounts of movement. We address two objectives: (1) build the conceptual, theoretical, and empirical framework needed to represent and interpret human-generated directions; and (2) design and implement a geovisual analytics workspace for direction document analysis. We have built a set of geo-enabled, computational methods to identify documents containing movement statements, and a visual analytics environment that uses natural language processing methods iteratively with geographic database support to extract, interpret, and map geographic movement references in context. Additionally, analysts can provide feedback to improve computational results. To demonstrate the value of this integrative approach, we have realized a proof-of-concept implementation focusing on identifying and processing documents that contain human-generated route directions. Using our visual analytic interface, an analyst can explore the results, provide feedback to improve those results, pose queries against a database of route directions, and interactively represent the route on a map.
No preview · Article · Dec 2011 · Journal of Spatial Information Science
ABSTRACT: Geographically-grounded situational awareness (SA) is critical to crisis management and is essential in many other decision making domains that range from infectious disease monitoring, through regional planning, to political campaigning. Social media are becoming an important information input to support situational assessment (to produce awareness) in all domains. Here, we present a geovisual analytics approach to supporting SA for crisis events using one source of social media, Twitter. Specifically, we focus on leveraging explicit and implicit geographic information in tweets, on developing place-time-theme indexing schemes that support overview+detail methods and that scale analytical capabilities to relatively large tweet volumes, and on providing visual interface methods to enable understanding of the place, time, and theme components of evolving situations. Our approach is user-centered, using scenario-based design methods that include formal scenarios to guide design and validate implementation, as well as a systematic claims analysis to justify design choices and provide a framework for future testing. The work is informed by a structured survey of practitioners, and the end product of Phase-I development is demonstrated and validated through implementation in SensePlace2, a map-based web application initially focused on tweets but extensible to other media.
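The place-time-theme indexing idea can be illustrated with a toy faceted index. This is a hypothetical sketch, not SensePlace2's implementation: the class, field names, and sample tweets are invented for the example.

```python
from collections import defaultdict

class TweetIndex:
    """Index tweets by place, time bucket, and theme so that a query on
    any combination of the three facets returns matching tweet ids."""
    def __init__(self):
        self.by_facet = {
            "place": defaultdict(set),
            "time": defaultdict(set),
            "theme": defaultdict(set),
        }
        self.all_ids = set()

    def add(self, tweet_id, place, time_bucket, themes):
        self.all_ids.add(tweet_id)
        self.by_facet["place"][place].add(tweet_id)
        self.by_facet["time"][time_bucket].add(tweet_id)
        for theme in themes:
            self.by_facet["theme"][theme].add(tweet_id)

    def query(self, place=None, time_bucket=None, theme=None):
        # Intersect the posting sets of whichever facets were specified.
        result = set(self.all_ids)
        if place is not None:
            result &= self.by_facet["place"][place]
        if time_bucket is not None:
            result &= self.by_facet["time"][time_bucket]
        if theme is not None:
            result &= self.by_facet["theme"][theme]
        return result

idx = TweetIndex()
idx.add(1, "Port-au-Prince", "2010-01-13", {"rescue", "medical"})
idx.add(2, "Port-au-Prince", "2010-01-14", {"water"})
idx.add(3, "Jacmel", "2010-01-13", {"rescue"})
print(idx.query(place="Port-au-Prince", theme="rescue"))  # → {1}
```

An overview+detail interface would use the same index twice: coarse facet counts for the overview, and a fully constrained query for the detail view.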
ABSTRACT: In case of emergencies (e.g., earthquakes, flooding), rapid responses are needed in order to address victims' requests for help. Social media use around crises involves self-organizing behavior that can produce accurate results, often in advance of official communications, allowing the affected population to send tweets or text messages and thus make themselves heard. The ability to classify tweets and text messages automatically, together with the ability to deliver the relevant information to the appropriate personnel, is essential for enabling personnel to work in a timely and efficient manner to address the most urgent needs and to better understand the emergency situation. In this study, we developed a reusable information technology infrastructure, called Enhanced Messaging for the Emergency Response Sector (EMERSE). The components of EMERSE are: (i) an iPhone application; (ii) a Twitter crawler component; (iii) machine translation; and (iv) automatic message classification. While each component is important in itself and deserves a detailed analysis, in this paper we focus on the automatic classification component, which classifies and aggregates tweets and text messages about the Haiti disaster relief so that they can be easily accessed by non-governmental organizations, relief workers, people in Haiti, and their friends and families.
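Automatic message classification of the kind described above can be sketched with a generic bag-of-words naive Bayes baseline. The categories and training messages below are invented for illustration; this is not EMERSE's actual classifier.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Multinomial naive Bayes over bag-of-words features."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter()            # per-class document counts
        self.vocab = set()

    def train(self, labeled_messages):
        for text, label in labeled_messages:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.class_counts:
            # Log prior plus Laplace-smoothed log likelihood of each word.
            lp = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                lp += math.log((self.word_counts[label][word] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayesClassifier()
clf.train([
    ("need water and food in port-au-prince", "food/water"),
    ("family trapped under rubble please help", "trapped"),
    ("water shortage near the hospital", "food/water"),
    ("person trapped in collapsed building", "trapped"),
])
print(clf.classify("no food or water here"))  # → food/water
```

A production pipeline would add the machine-translation step before classification, since many Haiti messages were in Haitian Creole.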
ABSTRACT: In this paper, we introduce a web-enabled geovisual analytics approach to leveraging Twitter in support of crisis management. The approach is implemented in a map-based, interactive web application that enables information foraging and sensemaking using "tweet" indexing and display based on place, time, and concept characteristics. In this paper, we outline the motivation for the research, review selected background briefly, describe the web application we have designed and implemented, and discuss our planned next steps.
ABSTRACT: Schema matching and value mapping across two heterogeneous information sources are critical tasks in applications involving data integration, data warehousing, and federation of databases. Before data can be integrated from multiple tables, the columns and the values appearing in the tables must be matched. The complexity of the problem grows quickly with the number of data attributes/columns to be matched and with the multiple possible semantics of data values. Traditional research has tackled schema matching and value mapping independently. We propose a novel method that optimizes embedded value mappings to enhance schema matching in the presence of opaque data values and column names. In this approach, the fitness objective for matching a pair of attributes from two schemas depends on the value mapping function for each of the two attributes. Suitable fitness objectives include the Euclidean distance measure, which we use in our experimental study, as well as relative (cross) entropy. We propose a heuristic local descent optimization strategy that uses sorting and two-opt switching to jointly optimize value mappings and attribute matches. Our experiments show that our proposed technique outperforms earlier uninterpreted schema matching methods and thus should form a useful addition to a suite of (semi-)automated tools for resolving structural heterogeneity.
Preview · Article · Feb 2010 · IEEE Transactions on Knowledge and Data Engineering
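The two-opt switching strategy mentioned in both schema matching abstracts can be sketched as a greedy local descent over bijective attribute matchings. The cost matrix below is a placeholder (standing in for, e.g., a Euclidean distance between value distributions); this sketch omits the embedded value-mapping optimization of the actual method.

```python
def two_opt_match(n, cost):
    """Local descent over bijective matchings: start from an initial
    assignment and repeatedly swap the targets of two source attributes
    whenever the swap lowers the total matching cost."""
    match = list(range(n))  # match[i] = target attribute for source attribute i
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                before = cost(i, match[i]) + cost(j, match[j])
                after = cost(i, match[j]) + cost(j, match[i])
                if after < before:
                    match[i], match[j] = match[j], match[i]
                    improved = True
    return match

# Toy cost matrix: costs[i][j] is the (lower-is-better) fitness of
# matching source attribute i to target attribute j.
costs = [
    [0.9, 0.1, 0.8],
    [0.2, 0.9, 0.7],
    [0.8, 0.7, 0.1],
]
print(two_opt_match(3, lambda i, j: costs[i][j]))  # → [1, 0, 2]
```

Two-opt descent only guarantees a local optimum, which is why the papers pair it with informed initialization (e.g., sorting) and distribution-aware fitness objectives.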
[Show abstract][Hide abstract] ABSTRACT: ABSTRACT Linguists and geographers are more and more interested in route direction documents because they contain interesting motion descriptions and language patterns. A large num- ber of such documents can be easily found on the Internet. A challenging task is to automatically extract meaningful route parts, i.e. destinations, origins and instructions, from route direction documents. However, no work exists on this issue. In this paper, we introduce our efiort toward this goal. Based on our observation that sentences are the ba- sic units for route parts, we extract sentences from HTML documents using both the natural language knowledge and HTML tag information. Additionally, we study the sentence classiflcation problem in route direction documents and its sequential nature. Several machine learning methods are compared,and analyzed. The impacts of difierent sets of features are studied. Based on the obtained insights, we propose to use sequence labelling models such as CRFs and MEMMs and they yield a high accuracy in route part extrac- tion. The approach is evaluated on over 10,000 hand-tagged sentences in 100 documents. The experimental results show the efiectiveness of our method. The above techniques have been implemented and published as the flrst module of the GeoCAM, system, which will also be brie∞y introduced in this paper.
ABSTRACT: This article describes research in the ongoing search for better semantic similarity tools: such methods are important when attempting to reconcile or integrate knowledge, or knowledge-related resources such as ontologies and database schemas. We describe an extensible, open platform for experimenting with different measures of similarity for ontologies and concept maps. The platform is based around three different types of similarity, which we ground in cognitive principles, and it provides a taxonomy and structure by which new similarity methods can be integrated and used. The platform supports a variety of specific similarity methods, to which researchers can add others of their own. It also provides flexible ways to combine the results from multiple methods, and graphic tools for visualizing and communicating multi-part similarity scores. Details of the system, which forms part of the ConceptVista open codebase, are described, along with details of the interfaces by which users can add new methods, choose which methods are used, and select how multiple similarity scores are aggregated. We offer this as a community resource: many similarity methods have been proposed, but there is still much confusion about which one(s) might work well for different geographical problems, so a test environment that all can access and extend would seem to be of practical use. We also provide some examples of the platform in use.
Full-text · Article · Dec 2008 · Transactions in GIS
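One simple way to combine results from multiple similarity methods, as the platform above allows, is a weighted average of per-method scores. This is a hypothetical sketch: the method names, weights, and `combine_similarities` helper are invented, not ConceptVista's API.

```python
def combine_similarities(scores, weights=None):
    """Aggregate per-method similarity scores (each in [0, 1]) into one
    overall score using a weighted average; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical per-method scores for one pair of ontology concepts.
scores = {"lexical": 0.9, "structural": 0.6, "contextual": 0.3}
print(combine_similarities(scores))  # → 0.6 (unweighted mean)
print(combine_similarities(scores, {"lexical": 2.0, "structural": 1.0, "contextual": 1.0}))
```

Keeping the per-method scores alongside the aggregate is what enables the platform's multi-part visualizations: an analyst can see which kind of similarity drove a high combined score.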
ABSTRACT: E-science and cyberinfrastructure have become crucial for scientific progress, and open source systems have greatly facilitated their design and implementation. In chemistry, the growth of data has been explosive, and timely, effective access to information and data is critical [Atkins 2003, Hey 2006]. Many have argued that cyberinfrastructures for science are domain sensitive [Snow 2006], and many have been proposed. We have proposed and are developing the ChemXSeer architecture, a portal for academic researchers in environmental chemistry that integrates the scientific literature with experimental, analytical, and simulation datasets. ChemXSeer will comprise information crawled from the web, manually submitted scientific documents and user-submitted datasets, as well as scientific documents and metadata provided by major publishers. Information crawled by ChemXSeer from the web and user-submitted data will be publicly accessible, whereas access to publisher resources can be provided by linking to their respective sites. Thus, instead of being a fully open search engine and repository, ChemXSeer will be a hybrid one, limiting access to some resources.
ABSTRACT: ChemXSeer is a digital library and a data repository for the chemistry domain. The data deposited into our repository is linked with digital documents to create aggregates of resources representing the links between the data and the articles in which the data is reported. ChemXSeer enables the user to annotate the data using a metadata capturing tool. The metadata is indexed and searched to return relevant datasets to the user. ChemXSeer extracts chemical formulae and chemical names, disambiguates them, and indexes them to allow for domain-knowledge enhanced search capabilities. As search engines mature, we foresee such vertical search engines, which employ domain-specific knowledge to perform information extraction and indexing, becoming more popular, especially for scientific domains. Though substantial research has been pursued on information extraction from text, extracting information from tables and figures has received little attention. In the ChemXSeer project, we are building tools that allow automatic extraction of tables and figures.
ABSTRACT: TexPlorer is an integrated system for exploring and analyzing large collections of text documents. The data processing modules of TexPlorer consist of named entity extraction, entity relation extraction, hierarchical clustering, and text summarization tools. Using a timeline tool, tree-view, table-view, and concept maps, TexPlorer provides an analytical interface for examining a set of text documents from different perspectives and allows users to explore vast amounts of text efficiently.
ABSTRACT: TexPlorer is an integrated system for exploring and analyzing vast amounts of text documents. The data processing modules of TexPlorer consist of named entity extraction, entity relation extraction, hierarchical clustering, and text summarization tools. Using a timeline tool, tree-view, table-view, and concept maps, TexPlorer provides visualizations from different perspectives and allows analysts to explore vast amounts of text documents efficiently.
ABSTRACT: Increasingly, scientists are seeking to collaborate and share data among themselves. Such sharing can be readily done by publishing data on the World-Wide Web. Meaningful querying and searching on such data depend upon the availability of accurate and adequate metadata describing the data and its sources. In this paper, we outline the architecture of an implemented cyberinfrastructure for chemistry that provides tools for users to upload datasets and their metadata to a database. Our proposal combines a two-level metadata system with a centralized database repository and analysis tools to create an effective and capable data sharing infrastructure. Our infrastructure is extensible in that it can handle data in different formats and allows different analytic tools to be plugged in.