Conference Paper

SQL Queries Over Unstructured Text Databases

Columbia Univ., New York, NY
DOI: 10.1109/ICDE.2007.368986
Conference: Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
Source: DBLP

ABSTRACT: Text documents often embed data that is structured in nature. By processing a text database with information extraction systems, we can define a variety of structured "relations" over which we can then issue SQL queries. Processing SQL queries in this text-based scenario presents multiple challenges. One key challenge is efficiency: information extraction is a time-consuming process, so query processing strategies should pick efficient extraction systems whenever possible, and also minimize the number of documents that they process. Another key challenge is result quality: extraction systems might output erroneous information or miss information that they should capture; also, efficiency-related query processing decisions (e.g., to avoid processing large numbers of useless documents) may compromise result completeness. To address these challenges, we characterize SQL query processing strategies in terms of their efficiency and result quality, and discuss the (user-specific) tradeoff between these two properties.
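As a rough illustration of the idea in the abstract (running an extraction system over documents to define a relation, then issuing SQL over it), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy documents, the regex standing in for an information extraction system, the `headquarters` table, and the helper names do not come from the paper.

```python
import re
import sqlite3

# Hypothetical toy corpus; not taken from the paper.
DOCUMENTS = [
    "Acme Corp is headquartered in Boston. The company was founded in 1990.",
    "Globex is headquartered in Springfield.",
    "Nothing relevant is mentioned in this document.",
]

# Stand-in for an information extraction system: one hand-written pattern.
# A real extractor would be far slower and could miss or mis-extract tuples.
HQ_PATTERN = re.compile(
    r"([A-Z]\w*(?: [A-Z]\w*)*) is headquartered in ([A-Z]\w*)"
)


def extract_headquarters(doc_id, text):
    """Return (doc_id, company, city) tuples found in one document."""
    return [(doc_id, m.group(1), m.group(2)) for m in HQ_PATTERN.finditer(text)]


def build_relation(documents):
    """Run the extractor over every document and store the tuples in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE headquarters (doc_id INTEGER, company TEXT, city TEXT)"
    )
    for doc_id, text in enumerate(documents):
        conn.executemany(
            "INSERT INTO headquarters VALUES (?, ?, ?)",
            extract_headquarters(doc_id, text),
        )
    conn.commit()
    return conn


def companies_in(conn, city):
    """Issue an ordinary SQL query over the extracted relation."""
    rows = conn.execute(
        "SELECT company FROM headquarters WHERE city = ? ORDER BY company",
        (city,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_relation(DOCUMENTS)
    print(companies_in(conn, "Boston"))
```

This sketch processes every document, which is the expensive case the abstract warns about; skipping documents would save extraction time at the risk of missing tuples.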

    • "However, as the results are in the form of documents, structured answers still need further extraction when needed. The latter line of work, including [3], [5], [12]–[14], produces structured answers over text, i.e., the query-in-tuple-out paradigm. They first extract triples from a given collection of documents into a triple store, and then answer the structured query against the triple store."
    ABSTRACT: With the information explosion on the Internet, finding precise answers efficiently is a common requirement for many users. Today, search engines answer keyword queries with a ranked list of documents. Users might not always be willing to read the top-ranked documents in order to satisfy their information need. It would save a lot of time and effort if the answer to a query could be provided directly, instead of a link to a document that might contain the answer. To realize this functionality, users must be able to define their information needs precisely, e.g., by using structured queries, and, on the other hand, the system must be able to extract information from unstructured text documents to answer these queries. To this end, we introduce a system that supports structured queries over unstructured text documents, aiming at finding structured answers to the users' information needs. Our goal is to extract answers from unstructured natural text by applying various efficient techniques that allow fast query processing over text documents from the web or other heterogeneous sources. A key feature of our approach is that it does not require any upfront integration efforts such as the definition of a common data model or ontology.
    12th IEEE International Conference on Mobile Data Management, MDM 2011, Luleå, Sweden, June 6-9, 2011, Volume 2; 01/2011
    • "With the implementation in RDBMS, correlation query can benefit in the following aspects. First, we can utilize the optimization facilities provided by RDBMS engine, such as join order selection, to automatically achieve the optimal performance [9] [5] [18]. "
    ABSTRACT: Large-scale collections of short text records are now prevalent, such as news highlights, scientific paper citations, and posted messages in a discussion forum, which are often stored as set records in (hidden) databases. Many interesting information retrieval tasks correspondingly arise from correlation queries over these short text records, such as finding hot topics in news highlights and searching for related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly support set correlation queries. Thus, in this paper, we address both the effectiveness and efficiency issues of set correlation queries over set records in databases. First, we present a framework for set correlation queries inside databases. To the best of our knowledge, only Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. Therefore, we propose a novel correlation coefficient that extends Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies for setting the correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that, with a proper setting of the filtering threshold, we can improve query efficiency with only a small loss of effectiveness. Finally, we conduct extensive experiments to show the effectiveness and efficiency of the proposed correlation query and optimization strategies.
    Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26-30, 2010; 01/2010
    • "Finally, unlike [19] [20], we do not optimize for a pre-specified target recall, but consider the goal of balancing recall, precision, and execution time in a flexible manner. A preliminary, 3-page version of this paper appears in [21]. "
    ABSTRACT: Text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. By processing a text database with information extraction systems, we can materialize a variety of structured "relations," over which we can then issue regular SQL queries. A key challenge in processing SQL queries in this text-based scenario is efficiency: information extraction is time-consuming, so query processing strategies should minimize the number of documents that they process. Another key challenge is result quality: in the traditional relational world, all correct execution strategies for a SQL query produce the same (correct) result; in contrast, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. To address these challenges, we study a family of select-project-join SQL queries over text databases, and characterize query processing strategies on their efficiency and, critically, on their result quality as well. We optimize the execution of SQL queries over text databases in a principled, cost-based manner, incorporating this tradeoff between efficiency and result quality in a user-specific fashion. Our large-scale experiments, over real data sets and multiple information extraction systems, show that our SQL query processing approach consistently picks appropriate execution strategies for the desired balance between efficiency and result quality.
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on; 05/2008
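
The user-specific balance between efficiency and result quality mentioned in the abstracts above can be pictured with a small Python sketch. This is not the cost model of either paper: the `Strategy` fields, the F1-based quality measure, the linear weighting, and all numbers are hypothetical choices made only to show how a preference weight could change which execution strategy is picked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A candidate execution strategy with made-up estimates."""
    name: str
    est_seconds: float    # estimated execution time
    est_precision: float  # estimated fraction of returned tuples that are correct
    est_recall: float     # estimated fraction of correct tuples that are returned


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(strategy, quality_weight, max_seconds):
    """Blend quality and speed; quality_weight in [0, 1] is the user's preference."""
    quality = f1(strategy.est_precision, strategy.est_recall)
    speed = 1.0 - strategy.est_seconds / max_seconds  # slowest strategy gets 0.0
    return quality_weight * quality + (1.0 - quality_weight) * speed


def pick_strategy(strategies, quality_weight):
    """Return the strategy with the highest blended score."""
    max_seconds = max(s.est_seconds for s in strategies)
    return max(strategies, key=lambda s: score(s, quality_weight, max_seconds))


# Hypothetical candidates: process everything, or only keyword-matching documents.
CANDIDATES = [
    Strategy("scan_all_documents", 1000.0, 0.80, 0.90),
    Strategy("keyword_filtered", 100.0, 0.85, 0.50),
]

if __name__ == "__main__":
    print(pick_strategy(CANDIDATES, quality_weight=0.9).name)
    print(pick_strategy(CANDIDATES, quality_weight=0.2).name)
```

With these invented estimates, a user who weights quality heavily gets the full scan, while a user who mostly cares about speed gets the filtered strategy and accepts lower recall.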