SQL Queries Over Unstructured Text Databases
University of Wisconsin-Madison
Abstract

Text documents often embed data that is structured in nature. By processing a text database with information extraction systems, we can define a variety of structured "relations," over which we can then issue SQL queries. Processing SQL queries in this text-based scenario presents multiple challenges. One key challenge is efficiency: information extraction is a time-consuming process, so query processing strategies should pick efficient extraction systems whenever possible, and also minimize the number of documents that they process. Another key challenge is result quality: extraction systems might output erroneous information or miss information that they should capture; also, efficiency-related query processing decisions (e.g., to avoid processing large numbers of useless documents) may compromise result completeness. To address these challenges, we characterize SQL query processing strategies in terms of their efficiency and result quality, and discuss the (user-specific) tradeoff between these two properties.
1 Introduction

Text often embeds valuable structured data, such as who recommends selling which stocks or how many people were affected by a disease outbreak. To make use of this intrinsically structured data that is embedded in natural-language text, we can leverage information extraction systems and follow an "extract-then-query" paradigm: we first apply extraction systems to the available documents, in an offline step, to uncover structured data. For example, we can use an extraction system trained for identifying disease outbreaks to extract a tuple ⟨Ebola, Zaire, 1976⟩ from a news article, meaning that an outbreak of Ebola occurred in Zaire in 1976. We can then load such structured data into a DBMS (perhaps after a data cleaning step) so that we can later answer structured queries (e.g., written in SQL) as they arrive.
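This extract-then-query workflow can be sketched with an in-memory SQLite database; the table name and schema below are our own toy choices, not the paper's:

```python
import sqlite3

# Toy extract-then-query illustration: a tuple produced offline by an
# extraction system is loaded into a DBMS and queried via SQL later.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DiseaseOutbreaks "
             "(Disease TEXT, Country TEXT, Year INTEGER)")
# The tuple <Ebola, Zaire, 1976> uncovered from the news article:
conn.execute("INSERT INTO DiseaseOutbreaks VALUES ('Ebola', 'Zaire', 1976)")
rows = conn.execute("SELECT Country, Year FROM DiseaseOutbreaks "
                    "WHERE Disease = 'Ebola'").fetchall()
print(rows)  # [('Zaire', 1976)]
```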
In some situations, the fully offline extraction approach, where all text documents in a database are processed with extraction systems in a lengthy offline step, may be undesirable or simply impossible, because the underlying data is vast or evolves quickly, or because the target information to extract is not known in advance. As an alternative, we explore a fully online extraction approach (Section 2): to process a SQL query, we define candidate execution strategies to retrieve text documents at query time and feed them to appropriate extraction systems. The choice of document retrieval strategies and extraction systems for a SQL query affects both the efficiency and the result quality of a query execution, as we will see. So we consider the user-specified desired balance between query execution efficiency and result quality, and choose query execution strategies accordingly, in a principled, cost-based manner (Section 3).
2 SQL Queries over Text Databases

Consider an archive of newspaper articles, together with an information extraction system trained to extract a Headquarters(Company, Location) relation, where a tuple ⟨c, l⟩ indicates that c is a company whose headquarters are located in l, as well as another system trained for an Executives(Company, CEO) relation, where a tuple ⟨c, e⟩ indicates that person e is the CEO of company c. We can then define a view CompanyInfo(Company, Location, CEO) by joining the two "base" relations, and express SQL queries such as:

Q1: SELECT Company, CEO FROM CompanyInfo
    WHERE Location = 'Redmond'
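A minimal SQLite sketch of the view and of query Q1, using small hypothetical instances of the two extracted base relations (the non-Microsoft tuples are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical instances of the two extracted "base" relations.
conn.execute("CREATE TABLE Headquarters (Company TEXT PRIMARY KEY, Location TEXT)")
conn.execute("CREATE TABLE Executives (Company TEXT PRIMARY KEY, CEO TEXT)")
conn.executemany("INSERT INTO Headquarters VALUES (?, ?)",
                 [("Microsoft", "Redmond"), ("Acme Corp.", "New York")])
conn.executemany("INSERT INTO Executives VALUES (?, ?)",
                 [("Microsoft", "Bill Gates"), ("Acme Corp.", "Jane Roe")])
# CompanyInfo joins the base relations on their shared key, Company.
conn.execute("""CREATE VIEW CompanyInfo AS
                SELECT h.Company, h.Location, e.CEO
                FROM Headquarters h JOIN Executives e ON h.Company = e.Company""")
rows = conn.execute("SELECT Company, CEO FROM CompanyInfo "
                    "WHERE Location = 'Redmond'").fetchall()
print(rows)  # [('Microsoft', 'Bill Gates')]
```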
To process the above query, we could sequentially feed each database document to the extraction systems, then join the extracted Headquarters and Executives relations, and finally return only those tuples that satisfy the query condition. This exhaustive query execution might be unnecessarily inefficient if only a small fraction of the documents in the database contribute to our extraction tasks.
Beyond efficiency considerations, we should also analyze the query result quality for the different execution strategies. Unlike in the relational world, where all correct plans for a query produce the same results, different strategies for a query over a text database might indeed produce different results. The exhaustive approach above, which processes all the database documents, is time-consuming but has the advantage of producing "complete" results relative to the extraction systems of choice. Faster alternatives that process fewer documents may result in the loss of some result tuples, hence hurting result completeness. Another important consideration is the characteristics of the relevant extraction systems, which often differ in their output quality and extraction efficiency. Overall, which plan is best for a query depends on the efficiency and result quality requirements for the query. Sometimes users want to receive a few result tuples quickly; at other times, users prefer to receive query results that are as complete as possible, even if it takes a relatively long time to produce them. Our SQL query processing problem is then as follows.
Problem Statement. Consider a text database D and n "base" relations R1, ..., Rn defined over D. Each base relation Ri can be extracted from D using one or more information extraction systems. We assume that all base relations R1, ..., Rn share the same primary key K and no other attributes, and define a view V = R1 ⋈ ... ⋈ Rn, the natural outerjoin of R1, ..., Rn over the K attributes. We consider SQL selection-projection queries over V with selection condition conjuncts of the form A = t, where A is a textual attribute and t is a constant. Then, given such a SQL query, our goal is to identify an execution strategy that meets the desired efficiency and result quality requirements as closely as possible.
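Since the base relations share only the key K, the view V can be sketched as a dictionary merge; this representation (one non-key attribute per relation, None for a missing tuple) is a simplification of our own, not the paper's formalism:

```python
def natural_outerjoin(relations):
    # Each base relation maps the shared key K to its single non-key
    # attribute; the view maps K to one value per relation, with None
    # marking a relation that contributes no tuple for that key.
    keys = set().union(*relations)
    return {k: tuple(r.get(k) for r in relations) for k in keys}

headquarters = {"Microsoft": "Redmond", "IBM": "Armonk"}
executives = {"Microsoft": "Bill Gates", "Oracle": "Larry Ellison"}
v = natural_outerjoin([headquarters, executives])
# Keys seen by only one relation survive with a None in the other column.
```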
To evaluate a query Q over a text database D, we need to:
(1) Select an extraction system Eij for each base relation Ri, as well as a document retrieval strategy Xi for Eij.
(2) Use strategy Xi to retrieve from database D the set of text documents Pi that Eij will process.
(3) Process the documents in Pi with extraction system Eij to obtain a relation instance ri.
(4) Apply data cleaning techniques to the extracted relations, for record linkage, and to eliminate data inconsistencies.
(5) Generate a candidate view v = r'1 ⋈ ... ⋈ r'n, where r'i is a "clean" version of ri.
(6) Execute Q over v and return the execution results.
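A hedged sketch of this six-step loop; the retrieval, extraction, and cleaning interfaces here are illustrative assumptions, not the paper's actual API:

```python
def evaluate(query_pred, database, plan, clean):
    # plan: one (retrieval_strategy, extractor) pair per base relation,
    # as chosen in Step (1). Extractors return (key, value) tuples.
    cleaned = []
    for retrieval, extractor in plan:
        docs = retrieval(database)                        # Step (2)
        tuples = {t for d in docs for t in extractor(d)}  # Step (3)
        cleaned.append(clean(tuples))                     # Step (4)
    # Step (5): natural outerjoin of the cleaned relations over the key.
    view = {}
    for i, r in enumerate(cleaned):
        for key, value in r:
            view.setdefault(key, [None] * len(cleaned))[i] = value
    rows = [(k, *vals) for k, vals in view.items()]
    return [row for row in rows if query_pred(row)]       # Step (6)

docs = ["Microsoft is based in Redmond.", "Bill Gates heads Microsoft."]
scan = lambda db: db  # Scan strategy: retrieve every document
hq = lambda d: {("Microsoft", "Redmond")} if "Redmond" in d else set()
ceo = lambda d: {("Microsoft", "Bill Gates")} if "heads" in d else set()
result = evaluate(lambda row: row[1] == "Redmond", docs,
                  [(scan, hq), (scan, ceo)], clean=lambda ts: ts)
# result == [("Microsoft", "Redmond", "Bill Gates")]
```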
The execution strategies for a SQL query differ in their Step (1) choices, which determine how Steps (2) and (3) are executed. In some scenarios, we might have more than one extraction system available for a relation; the choice of extraction systems for Step (3) will depend on their efficiency and output quality. For Step (2), we consider the following document retrieval strategies, which lead to SQL query executions with different efficiency and result quality characteristics:
Scan: We can sequentially scan the database and feed each document to an extraction system E. This strategy yields complete query results for E, at the expense of efficiency.
PromD: Alternatively, we could avoid processing all documents by identifying just the "promising" ones via querying, using QXtract [2] to derive keyword queries for each extraction system via machine learning. This query-based document retrieval strategy reduces the SQL query execution time, at the expense of answer completeness.

Figure 1. Stages in the execution of query Q1.
Const: As another alternative to reduce the number of documents that we process, we can exploit any SQL query constants to construct keyword queries. For example, the keyword "Redmond" in query Q1 can be used to retrieve only those documents with this word for the extraction of Headquarters (documents without the word "Redmond" could not contribute useful tuples for this query). The selectivity associated with the constants in the query determines the efficiency of this retrieval strategy.
PromC: We can naturally combine the PromD and Const strategies by ANDing their queries. For example, for Q1 we combine [based AND shares] from PromD with [Redmond] from Const to obtain query [based AND shares AND Redmond]. Similar to PromD, this strategy also has the advantage of reducing the SQL query execution time, but at the expense of answer completeness.
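The four strategies differ only in the keyword query they issue to the database; their composition can be sketched as follows (the function and its query syntax are illustrative, not the paper's implementation):

```python
def retrieval_query(strategy, learned_terms, query_constants):
    # learned_terms: QXtract-style keywords learned for the extraction
    # system; query_constants: constants from the SQL WHERE clause.
    if strategy == "Scan":
        return None  # no query: scan and process every document
    if strategy == "PromD":
        return " AND ".join(learned_terms)
    if strategy == "Const":
        return " AND ".join(query_constants)
    if strategy == "PromC":  # AND the PromD and Const queries together
        return " AND ".join(learned_terms + query_constants)
    raise ValueError(f"unknown strategy: {strategy}")

q = retrieval_query("PromC", ["based", "shares"], ["Redmond"])
# q == "based AND shares AND Redmond"
```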
Figure 1 shows a possible execution for query Q1, with two document retrieval strategies, PromC for Headquarters and Scan for Executives, chosen in Step (1). PromC issues keyword queries such as [based AND shares AND Redmond] to the text database to retrieve promising documents (Step (2)). After feeding each of these documents to the extraction system for Headquarters, we obtain tuples such as ⟨Microsoft, Redmond⟩ (Step (3)). To extract Executives, Scan retrieves all documents exhaustively, one at a time (Step (2)), and feeds them to the extraction system for this relation (Step (3)), to extract tuples such as ⟨Microsoft Corp., Bill Gates⟩. After extraction, we use record linkage techniques to conclude that "Microsoft" and "Microsoft Corp." refer to the same company in the two base relations, and further data cleaning resolves inconsistencies in the extracted relations, to eliminate erroneous tuples such as ⟨Microsoft Corp., New York⟩ (Step (4)). As a final step, we generate the query results (Steps (5) and (6)).
3 Query Execution Properties

To compare alternate execution strategies for a query Q over a database D, we define efficiency as follows:

Definition 3.1. (Efficiency) The efficiency of a query execution S over a text database D, E(S,D), is the inverse of the execution time of S, in seconds, over D. □
We also compare execution strategies based on their query result quality, for which we need to characterize the "ideal" result for Q over D, Ideal(Q,D). This hypothetical ideal result consists of all the correct query results for Q that could be derived from database D. Of course, Ideal(Q,D) would be prohibitively expensive to compute for any large database D, since this "computation" would necessarily involve substantial human effort (e.g., no "perfect" extraction systems exist). However, the ideal results are conceptually helpful to characterize the precision and recall of query execution strategies, as follows:

Definition 3.2. (Precision and Recall) Consider an execution strategy S for a query Q over a text database D, and let R be the results that S produces. We define the precision of S over D as P(S,D) = |R ∩ Ideal(Q,D)| / |R|, and the recall of S over D as R(S,D) = |R ∩ Ideal(Q,D)| / |Ideal(Q,D)|. □
We combine precision and recall into a single metric by computing their geometric mean, as follows:

Definition 3.3. (Quality) Consider an execution strategy S for a query Q over a text database D. We define the quality of S over D as Q(S,D) = (P(S,D) · R(S,D))^(1/2). □
As discussed above, query execution strategies over text databases exhibit a tradeoff between efficiency and result quality. Ultimately, the right balance between efficiency and quality is user-specific, and we capture it with a (user-specified) query processing parameter w, ranging in value from 0, to privilege efficiency, to 1, to privilege result quality. In turn, we use parameter w to characterize the overall goodness of a query execution, as follows:

Definition 3.4. (Goodness) The goodness of a query execution S over a text database D for user-specified parameter w is Gw(S,D) = Q(S,D)^w · E(S,D)^(1−w). □
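Definitions 3.1-3.4 translate directly into code; a sketch, with explicit result sets standing in for the (hypothetical) ideal results:

```python
def precision(results, ideal):          # P(S, D), Definition 3.2
    return len(results & ideal) / len(results) if results else 0.0

def recall(results, ideal):             # R(S, D), Definition 3.2
    return len(results & ideal) / len(ideal) if ideal else 0.0

def quality(results, ideal):            # Q(S, D), Definition 3.3
    return (precision(results, ideal) * recall(results, ideal)) ** 0.5

def goodness(results, ideal, exec_time_secs, w):
    efficiency = 1.0 / exec_time_secs   # E(S, D), Definition 3.1
    # Gw(S, D) = Q(S, D)^w * E(S, D)^(1 - w), Definition 3.4
    return quality(results, ideal) ** w * efficiency ** (1 - w)

r, ideal = {"t1", "t2", "t3"}, {"t2", "t3", "t4"}
# Here P = R = 2/3, so Q = 2/3; with w = 1 goodness reduces to quality,
# and with w = 0 it reduces to efficiency.
```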
To pick an execution strategy for a given SQL query, we consider the strategies derived from instantiating the general algorithm with the different extraction systems and document retrieval strategies. We estimate the goodness of each instantiation and choose the best option. We omit the details of our (sampling-based) methods to derive these estimates because of space limitations.
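The resulting choice is an argmax over goodness estimates; a sketch in which the sampling-based estimates are replaced by hardcoded numbers (the candidate names and values below are invented for illustration):

```python
def choose_strategy(candidates, w):
    # candidates: (name, quality_estimate, efficiency_estimate) triples,
    # as would be produced by document sampling (assumed given here).
    def estimated_goodness(c):
        name, q_est, e_est = c
        return q_est ** w * e_est ** (1 - w)  # Gw = Q^w * E^(1-w)
    return max(candidates, key=estimated_goodness)[0]

candidates = [
    ("Scan for both relations", 0.90, 0.01),  # complete but slow
    ("PromC + PromD",           0.60, 0.20),  # lossy but fast
]
# A quality-oriented user (w near 1) gets the exhaustive plan; an
# efficiency-oriented user (w near 0) gets the query-based plan.
```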
The problem of information extraction from text has received significant attention (see [4, 7] for surveys). Earlier approaches rely on hand-crafted extraction rules, but many recent efforts (e.g., [1, 3, 6]) have developed unsupervised or learning-based extraction techniques. Recent work has started to address efficiency issues, including the QXtract system [2], which we used in this paper.

Closest to this paper is the analysis in [5], which considers (among others) the problem of identifying the document retrieval strategy for a single extraction system S that reaches a pre-specified target recall in minimum time. [5] assumes that S is perfect in that it produces all, and only, correct tuples. Our current work is related to [5], but differs from it in several crucial aspects. First, we consider the more general problem of optimizing the combined execution of multiple information extraction systems and multiple document retrieval strategies, all towards the goal of efficiently answering a SQL query. Second, we remove the assumption of perfect extraction systems, and model extraction errors. Finally, we do not optimize for a pre-specified target recall, but consider the goal of balancing recall, precision, and execution time in a flexible manner.
We performed an extensive evaluation of our query processing approach, which relies on document sampling to estimate the goodness of the candidate query execution strategies. We do not report our results because of space constraints. As a summary of our conclusions, our query processing approach produced high-goodness executions for both selection and projection queries and for all values of the user-specified parameter w (see Definition 3.4). As expected, our approach privileges efficiency for low values of w and result quality for high values of w, and dynamically adapts to the user-specified efficiency-quality balance.
References

[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In DL, 2000.
[2] E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In ICDE, 2003.
[3] S. Brin. Extracting patterns and relations from the world wide web. In WebDB, 1998.
[4] R. Grishman. Information extraction: Techniques and challenges. In SCIE, 1997.
[5] P. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In SIGMOD, 2006.
[6] I. Mansuri and S. Sarawagi. A system for integrating unstructured data into relational databases. In ICDE, 2006.
[7] A. McCallum. Information extraction: distilling structured data from unstructured text. ACM Queue, 2005.