Conference Paper

SQL Queries Over Unstructured Text Databases

Columbia Univ., New York, NY
DOI: 10.1109/ICDE.2007.368986 Conference: Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
Source: DBLP

ABSTRACT Text documents often embed data that is structured in nature. By processing a text database with information extraction systems, we can define a variety of structured "relations" over which we can then issue SQL queries. Processing SQL queries in this text-based scenario presents multiple challenges. One key challenge is efficiency: information extraction is a time-consuming process, so query processing strategies should pick efficient extraction systems whenever possible, and also minimize the number of documents that they process. Another key challenge is result quality: extraction systems might output erroneous information or miss information that they should capture; also, efficiency-related query processing decisions (e.g., to avoid processing large numbers of useless documents) may compromise result completeness. To address these challenges, we characterize SQL query processing strategies in terms of their efficiency and result quality, and discuss the (user-specific) tradeoff between these two properties.
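As a rough illustration of the setting the abstract describes, the sketch below defines a structured relation by running a toy extractor over a handful of documents and then issues a SQL query over it. Everything here is an assumption made for illustration: the sample documents, the regular expression standing in for an information extraction system, and the `headquarters` relation are hypothetical and are not taken from the paper.

```python
import re
import sqlite3

# Hypothetical toy "text database"; a real one would hold many documents.
DOCUMENTS = [
    "Acme Corp is headquartered in Boston.",
    "Globex Inc is headquartered in Springfield.",
    "The weather was pleasant all week.",
]

# A regular expression standing in for a real information extraction system.
# Real extractors are far slower and can both miss tuples and emit wrong ones.
PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+(?: [A-Z]\w+)*) is headquartered in (?P<city>[A-Z]\w+)"
)


def extract_headquarters(docs):
    """Return (company, city) tuples found by the toy extractor."""
    tuples = []
    for doc in docs:
        for match in PATTERN.finditer(doc):
            tuples.append((match.group("company"), match.group("city")))
    return tuples


def query_headquarters(docs, city):
    """Load the extracted tuples into SQLite and run a SQL query over them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE headquarters (company TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO headquarters VALUES (?, ?)", extract_headquarters(docs)
    )
    rows = conn.execute(
        "SELECT company FROM headquarters WHERE city = ? ORDER BY company",
        (city,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

This sketch processes every document before answering the query. The efficiency challenge raised in the abstract concerns avoiding exactly that, at a possible cost in result completeness.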

  •
    ABSTRACT: Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process the documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers a variety of join algorithms from relational query optimization, and predicts the output quality and, of course, the execution time of the alternate execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
  •
    ABSTRACT: Databases and documents are commonly isolated from each other, controlled by Database Management Systems (DBMS) and Information Retrieval Systems (IRS), respectively. However, both systems are likely to store data about the same entities, a strong argument in favor of their integration. We propose a DBMS-IRS integration approach that uses terms in DBMS queries as keywords for IRS searches, retrieving documents strongly related to the queries. The IRS keywords are built by “expanding” an initial set of user-provided keywords with top-ranked terms found in a query result; the terms are ranked based on a measure of term diffusion over the query result. Our experiments show the effectiveness of the approach in two different domains, in comparison to other DBMS-IRS integration methods, as well as to other term-ranking methods.
    Sofsem 2015 - 41st International Conference on Current Trends in Theory and Practice of Computer Science, Pec pod Sněžkou, Czech Republic; 01/2015
  •
    ABSTRACT: A relational database is a basic repository for many businesses, with its robust data structure for retrieving, organizing, and managing data. However, despite this structure, a massive amount of the data it contains remains unstructured. This unstructured data affects query processing performance and makes it harder for users to manage or retrieve the data. Many attempts have been made to reorganize or directly process this data. This paper discusses methods of managing unstructured data in relational database management systems and shows the significance of managing this data. Furthermore, the difference in managing such data between relational and NoSQL databases is highlighted. This study will help developers and researchers in managing unstructured data and in addressing important issues that affect query processing, which would otherwise be meaningless if the data were not well managed.
    2013 IEEE Conference on Systems, Process & Control (ICSPC); 12/2013

