Conference Paper

SQL Queries Over Unstructured Text Databases

Columbia Univ., New York, NY
DOI: 10.1109/ICDE.2007.368986
Conference: IEEE 23rd International Conference on Data Engineering (ICDE 2007)
Source: DBLP


Text documents often embed data that is structured in nature. By processing a text database with information extraction systems, we can define a variety of structured "relations" over which we can then issue SQL queries. Processing SQL queries in this text-based scenario presents multiple challenges. One key challenge is efficiency: information extraction is a time-consuming process, so query processing strategies should pick efficient extraction systems whenever possible, and also minimize the number of documents that they process. Another key challenge is result quality: extraction systems might output erroneous information or miss information that they should capture; also, efficiency-related query processing decisions (e.g., to avoid processing large numbers of useless documents) may compromise result completeness. To address these challenges, we characterize SQL query processing strategies in terms of their efficiency and result quality, and discuss the (user-specific) tradeoff between these two properties.
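To make the scenario concrete, the following is a minimal sketch and not the authors' system. It assumes a hypothetical toy corpus, a regular expression standing in for an information extraction system, and an in-memory SQLite table standing in for the extracted "relation". The `Headquarters` relation, the `extract_headquarters` and `build_relation` functions, and the keyword filter are all illustrative names introduced here.

```python
import re
import sqlite3

# Hypothetical toy corpus; a real text database would be far larger.
DOCUMENTS = [
    "Acme Corp is headquartered in Boston. It was founded in 1990.",
    "Reports say Globex Inc is headquartered in Springfield.",
    "The weather was pleasant all week.",
]

# A regular expression stands in for a (much costlier) extraction system.
PATTERN = re.compile(
    r"([A-Z]\w+(?: [A-Z]\w+)*) is headquartered in ([A-Z]\w+)"
)


def extract_headquarters(doc):
    """Return (company, city) tuples found in one document."""
    return PATTERN.findall(doc)


def build_relation(documents, keyword=None):
    """Populate a Headquarters relation from the documents.

    If `keyword` is given, documents that lack it are skipped without
    running the extractor: cheaper, but possibly less complete.
    Returns the connection and the number of documents processed.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Headquarters (company TEXT, city TEXT)")
    processed = 0
    for doc in documents:
        if keyword is not None and keyword not in doc:
            continue
        processed += 1
        conn.executemany(
            "INSERT INTO Headquarters VALUES (?, ?)",
            extract_headquarters(doc),
        )
    return conn, processed


if __name__ == "__main__":
    conn, processed = build_relation(DOCUMENTS, keyword="headquartered")
    rows = conn.execute(
        "SELECT company FROM Headquarters WHERE city = 'Boston'"
    ).fetchall()
    print(processed, rows)
```

In this sketch the keyword filter happens to lose nothing, because the pattern itself requires the word "headquartered". A more aggressive filter, or a noisier extractor, would trade completeness or correctness for speed, which is the tradeoff the abstract describes.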

Available from:
  • Source
    • "However, as the results are in the form of documents, structured answers still need further extraction when needed. The latter line of work, including [3], [5], [12]–[14], produce structured answers over text, i.e., query-in-tuple-out paradigm. They first extract triples from a given collection of documents into a triple store, and then answer the structured query against the triple store. "
    ABSTRACT: With the information explosion on the Internet, finding precise answers efficiently is a prevalent requirement of many users. Today, search engines answer keyword queries with a ranked list of documents. Users might not always be willing to read the top-ranked documents in order to satisfy their information need. It would save a lot of time and effort if the answer to a query could be provided directly, instead of a link to a document which might contain the answer. To realize this functionality, users must be able to define their information needs precisely, e.g., by using structured queries, and, on the other hand, the system must be able to extract information from unstructured text documents to answer these queries. To this end, we introduce a system which supports structured queries over unstructured text documents, aiming at finding structured answers to the users' information need. Our goal is to extract answers from unstructured natural text, by applying various efficient techniques that allow fast query processing over text documents from the web or other heterogeneous sources. A key feature of our approach is that it does not require any upfront integration efforts such as the definition of a common data model or ontology.
    Full-text · Conference Paper · Jan 2011
  • Source
    • "With the implementation in RDBMS, correlation query can benefit in the following aspects. First, we can utilize the optimization facilities provided by RDBMS engine, such as join order selection, to automatically achieve the optimal performance [9] [5] [18]. "
    ABSTRACT: Large collections of short text records are now prevalent, such as news highlights, scientific paper citations, and posted messages in a discussion forum, which are often stored as set records in (hidden) databases. Many interesting information retrieval tasks correspondingly arise from correlation queries over these short text records, such as finding hot topics over news highlights and searching for related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly support set correlation queries. Thus, in this paper, we address both the effectiveness and efficiency issues of set correlation queries over set records in databases. First, we present a framework for set correlation queries inside databases. To the best of our knowledge, only Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. Therefore, we propose a novel correlation coefficient to extend Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies for setting the correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that, with a proper setting of the filtering threshold, we can improve the query efficiency with little loss of effectiveness. Finally, we conduct extensive experiments to show the effectiveness and efficiency of the proposed correlation query and optimization strategies.
    Full-text · Conference Paper · Jan 2010
  • Source
    • "This is too much work for a user who inserts a document. The most relevant work in this area is the recent work of Jain et al. [21], which shows how IE systems can be combined to efficiently answer SQL queries on documents. However, they still assume that someone has created these IE systems for specific schemas. "
    ABSTRACT: Content management tools like Microsoft's SharePoint allow users of an application domain to share documents and tag them in an ad-hoc way. Similarly, Google Base allows users to define attributes for their objects or choose from predefined templates. This ad-hoc or predefined annotation of the shared data incurs problems like schema explosion or inadequate data annotation, which in turn lead to poor search and analysis capabilities. We propose CADS, a Collaborative Adaptive Data Sharing platform, where the information demand of the community–e.g., query workload–is exploited to annotate the data at insertion-time. A key novelty of CADS is that it learns with time the most important data attributes of the application, and uses this knowledge to guide the data insertion and querying. In this position paper, we present the challenges and preliminary design ideas for building a CADS platform. We use the application of CADS on the Business Continuity Information Network (BCIN) of South Florida as a motivating example.
    Preview · Article · Jan 2009