Conference Paper

SQL Queries Over Unstructured Text Databases

Columbia Univ., New York, NY
DOI: 10.1109/ICDE.2007.368986
Conference: IEEE 23rd International Conference on Data Engineering (ICDE 2007)
Source: DBLP

ABSTRACT Text documents often embed data that is structured in nature. By processing a text database with information extraction systems, we can define a variety of structured "relations" over which we can then issue SQL queries. Processing SQL queries in this text-based scenario presents multiple challenges. One key challenge is efficiency: information extraction is a time-consuming process, so query processing strategies should pick efficient extraction systems whenever possible, and also minimize the number of documents that they process. Another key challenge is result quality: extraction systems might output erroneous information or miss information that they should capture; also, efficiency-related query processing decisions (e.g., to avoid processing large numbers of useless documents) may compromise result completeness. To address these challenges, we characterize SQL query processing strategies in terms of their efficiency and result quality, and discuss the (user-specific) tradeoff between these two properties.

  • Source
    ABSTRACT: In this paper, we propose a multiword-enhanced author topic model that clusters authors with similar interests and expertise, and apply it to an information retrieval system that returns a ranked list of authors related to a keyword. For example, we can retrieve Eugene Charniak via a search for "statistical parsing". Existing work on author topic modeling assumes a "bag-of-words" representation. However, many semantic atomic concepts are represented by multiwords in text documents. This paper presents a pre-computation step that discovers these multiwords in the corpus automatically and tags them in the term-document matrix. The key advantage of this method is that it retains the simplicity and the computational efficiency of the unigram model. In addition to a qualitative evaluation, we evaluate the results by using the topic models as a component in a search engine. We exhibit improved retrieval scores when the documents are represented via sets of latent topics and authors.
    06/2010
  • Source
    ABSTRACT: We investigate a graph-based semi-supervised learning approach for labeling semantic components of questions, such as topic, focus, and event, for the question understanding task. We focus on graph construction to handle learning with dense and sparse graphs, and present the Relaxed Linear Neighborhoods method, in which each node is linearly reconstructed from a varying number of its neighbors, depending on the density or sparsity of its surroundings. With the new graph representation, we show performance improvements on syntactic and real datasets, primarily due to the use of unlabeled data and relaxed graph construction.
    06/2010
  • Source
    ABSTRACT: Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process the documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers a variety of join algorithms from relational query optimization, and predicts the output quality and, of course, the execution time of the alternate execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
