Conference Paper

SQL Queries Over Unstructured Text Databases

Columbia Univ., New York, NY
DOI: 10.1109/ICDE.2007.368986
Conference: Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
Source: DBLP

ABSTRACT Text documents often embed data that is structured in nature. By processing a text database with information extraction systems, we can define a variety of structured "relations" over which we can then issue SQL queries. Processing SQL queries in this text-based scenario presents multiple challenges. One key challenge is efficiency: information extraction is a time-consuming process, so query processing strategies should pick efficient extraction systems whenever possible, and also minimize the number of documents that they process. Another key challenge is result quality: extraction systems might output erroneous information or miss information that they should capture; also, efficiency-related query processing decisions (e.g., to avoid processing large numbers of useless documents) may compromise result completeness. To address these challenges, we characterize SQL query processing strategies in terms of their efficiency and result quality, and discuss the (user-specific) tradeoff between these two properties.
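The scenario in the abstract can be illustrated with a minimal sketch, which is not the authors' system. It assumes a toy three-document corpus, a single regular expression standing in for an information extraction system, and a hypothetical relation named Headquarters; all of these names and documents are invented for illustration. The keyword filter shows the efficiency side of the tradeoff: it skips documents before extraction, but a filter of this kind can also skip documents that contain useful tuples, which would reduce result completeness.

```python
import re
import sqlite3

# Hypothetical toy corpus; the documents are invented for illustration.
DOCUMENTS = [
    "Acme Corp. is headquartered in Boston.",
    "Globex is headquartered in Springfield.",
    "The weather was pleasant all week.",
]

# A deliberately naive extractor: one regular expression standing in
# for a real (and far more expensive) information extraction system.
PATTERN = re.compile(
    r"(?P<company>[A-Z][\w. ]*?) is headquartered in (?P<city>[A-Z]\w+)"
)


def extract_headquarters(doc):
    """Return the (company, city) tuples found in one document."""
    return [(m.group("company"), m.group("city")) for m in PATTERN.finditer(doc)]


def build_relation(documents):
    """Populate an in-memory Headquarters relation from the documents.

    Returns the database connection and the number of documents that
    were actually passed to the extractor.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Headquarters (company TEXT, city TEXT)")
    processed = 0
    for doc in documents:
        # Cheap keyword filter: avoid running extraction on documents
        # that are unlikely to contribute tuples (assumed heuristic).
        if "headquartered" not in doc:
            continue
        processed += 1
        conn.executemany(
            "INSERT INTO Headquarters VALUES (?, ?)",
            extract_headquarters(doc),
        )
    return conn, processed


conn, processed = build_relation(DOCUMENTS)
rows = conn.execute(
    "SELECT company FROM Headquarters WHERE city = 'Boston'"
).fetchall()
print(rows, processed)
```

Once the relation is populated, any SQL query can be issued against it. The quality of the answers depends on what the extractor captured and on which documents the filter let through.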

  • Source
    11th International Workshop on the Web and Databases, WebDB 2008, Vancouver, BC, Canada, June 13, 2008; 01/2008
  • Source
    ABSTRACT: As the web evolves, increasing quantities of structured information are embedded in web pages in disparate formats. For example, a digital camera’s description may include its price and megapixels, whereas a professor’s description may include her name, university, and research interests. Both types of pages may include additional ambiguous information. General search engines (GSEs) do not support queries over these types of data because they ignore the web document semantics. Conversely, describing the requisite semantics through structured queries into databases populated by information extraction (IE) techniques is expensive and not easily adaptable to new domains. This paper describes a methodology for rapidly developing search engines capable of answering structured queries over unstructured corpora by utilizing machine learning to avoid explicit IE. We empirically show that, with minimal additional human effort, our system outperforms a GSE with respect to structured queries with clear object semantics.
  • Source
    ABSTRACT: We investigate a graph-based semi-supervised learning approach for labeling semantic components of questions, such as topic, focus, and event, for the question understanding task. We focus on graph construction to handle learning with dense/sparse graphs and present the Relaxed Linear Neighborhoods method, in which each node is linearly constructed from varying sizes of its neighbors based on the density/sparsity of its surroundings. With the new graph representation, we show performance improvements on syntactic and real datasets, primarily due to the use of unlabeled data and relaxed graph construction.

