Conference Paper

SQL Queries Over Unstructured Text Databases

Columbia Univ., New York, NY
DOI: 10.1109/ICDE.2007.368986
Conference: IEEE 23rd International Conference on Data Engineering (ICDE 2007)
Source: DBLP

ABSTRACT Text documents often embed data that is structured in nature. By processing a text database with information extraction systems, we can define a variety of structured "relations" over which we can then issue SQL queries. Processing SQL queries in this text-based scenario presents multiple challenges. One key challenge is efficiency: information extraction is a time-consuming process, so query processing strategies should pick efficient extraction systems whenever possible, and also minimize the number of documents that they process. Another key challenge is result quality: extraction systems might output erroneous information or miss information that they should capture; also, efficiency-related query processing decisions (e.g., to avoid processing large numbers of useless documents) may compromise result completeness. To address these challenges, we characterize SQL query processing strategies in terms of their efficiency and result quality, and discuss the (user-specific) tradeoff between these two properties.
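The idea of defining a relation through extraction and then querying it with SQL can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the paper's method: the documents, the `headquarters` relation, the regular expression standing in for an extraction system, and the helper names `extract_headquarters` and `build_relation` are all hypothetical.

```python
import re
import sqlite3

# Hypothetical toy "text database"; these documents are invented for illustration.
DOCUMENTS = [
    "Acme Corp. is headquartered in Boston.",
    "Shares rose sharply on Tuesday.",
    "Globex Inc. is headquartered in Springfield.",
]

# Assumption: a single regular expression stands in for a real
# (and far more expensive) information extraction system.
PATTERN = re.compile(
    r"(?P<company>[A-Z][\w.& ]*?) is headquartered in (?P<city>[A-Z]\w+)"
)


def extract_headquarters(doc):
    """Return the (company, city) tuples found in one document."""
    return [(m.group("company"), m.group("city")) for m in PATTERN.finditer(doc)]


def build_relation(documents):
    """Run the extractor over every document and load the tuples into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE headquarters (company TEXT, city TEXT)")
    for doc in documents:
        conn.executemany(
            "INSERT INTO headquarters VALUES (?, ?)", extract_headquarters(doc)
        )
    return conn


conn = build_relation(DOCUMENTS)

# An ordinary SQL query over the extracted relation.
rows = conn.execute(
    "SELECT company FROM headquarters WHERE city = 'Boston'"
).fetchall()
print(rows)
```

Here every document is processed, which favors completeness. Skipping documents that look unpromising would cut extraction time but could miss tuples, which is the efficiency versus quality tradeoff the abstract describes.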

  • ABSTRACT: We investigate a graph-based semi-supervised learning approach for labeling semantic components of questions, such as topic, focus, and event, for the question understanding task. We focus on graph construction to handle learning with dense/sparse graphs and present the Relaxed Linear Neighborhoods method, in which each node is linearly constructed from a varying number of its neighbors based on the density/sparsity of its surroundings. With the new graph representation, we show performance improvements on syntactic and real datasets, primarily due to the use of unlabeled data and relaxed graph construction.
    06/2010
  • ABSTRACT: In this paper, we propose a multiword-enhanced author topic model that clusters authors with similar interests and expertise, and apply it to an information retrieval system that returns a ranked list of authors related to a keyword. For example, we can retrieve Eugene Charniak via a search for "statistical parsing". Existing work on author topic modeling assumes a "bag-of-words" representation. However, many atomic semantic concepts are represented by multiwords in text documents. This paper presents a pre-computation step that discovers these multiwords in the corpus automatically and tags them in the term-document matrix. The key advantage of this method is that it retains the simplicity and computational efficiency of the unigram model. In addition to a qualitative evaluation, we evaluate the results by using the topic models as a component in a search engine. We show improved retrieval scores when the documents are represented via sets of latent topics and authors.
    06/2010
  • ABSTRACT: With the information explosion on the Internet, many users need to find precise answers efficiently. Today, search engines answer keyword queries with a ranked list of documents. Users might not always be willing to read the top-ranked documents in order to satisfy their information need. It would save a lot of time and effort if the answer to a query could be provided directly, instead of a link to a document that might contain the answer. To realize this functionality, users must be able to define their information needs precisely, e.g., by using structured queries, and the system must be able to extract information from unstructured text documents to answer these queries. To this end, we introduce a system that supports structured queries over unstructured text documents, aiming at finding structured answers to the users' information need. Our goal is to extract answers from unstructured natural text by applying various efficient techniques that allow fast query processing over text documents from the web or other heterogeneous sources. A key feature of our approach is that it does not require any upfront integration efforts such as the definition of a common data model or ontology.
    12th IEEE International Conference on Mobile Data Management, MDM 2011, Luleå, Sweden, June 6-9, 2011, Volume 2; 01/2011
