Chapter 24
Holistic Query Expansion using Graphical Models
Daniel Mahler
Cycorp, Inc.
3721 Executive Center Dr, suite 100
Austin, TX 78731
1. Introduction
In this paper we present a method for answering relationship questions, as posed
for example in the spring of 2003 evaluation exercise of the AQUAINT1
program, which has funded this research.
The goal of the exercise was to provide answers to questions requesting an
account of the relationship between two or more entities. No restriction on the
format of the answer was imposed, except that it should not consist of entire
documents and that it should list document identifiers for the passages from
which the answer was drawn or generated.
2. Relation Questions
An example question from the relationship exercise and the answer provided by
our system, QUIRK, is presented in figure 1.
Question: What does Soviet Cosmonaut Valentina Tereshkova
(Vladimirovna) and U.S. Astronaut Sally Ride have in common?
Connecting terms: first space shuttle woman
APW19990618.0223.11 In 1963, Soviet cosmonaut Valentina Tereshkova
returned to Earth after spending nearly three days as the first woman in space.
1 See http://. Participants in this program have access to two corpora: a newswire
corpus spanning 1996-2000 and a subset of the Center for Nonproliferation Studies
abstracts. The AQUAINT evaluations are carried out against these corpora, which
are referred to as the AQUAINT and the CNS corpus respectively.
APW19990115.0263.10 In 1978, NASA named 35 candidates to fly on the
space shuttle, including Sally K. Ride, who became America’s first woman in
space, and Guion S. Bluford Jr., who became America’s first black astronaut in space.
NYT19990723.0009.1 CAPE CANAVERAL, Fla. After a year of technical
delays and two frustrating waits on the launching pad, the space shuttle
Columbia blasted off from Cape Canaveral early Friday morning carrying the
$1.5 billion Chandra X-ray Observatory, and vaulting its commander, Col.
Eileen M. Collins of the Air Force into the ranks of aviation pioneers like
Amelia Earhart and the Soviet cosmonaut Valentina Tereshkova, the first
woman in space.
APW19981020.1367.6 June 18, 1983: Sally Ride becomes first American
woman in space.
APW19990521.0292.6 Anger in space, by astronauts and cosmonauts, has
been common since early in the manned space program.
APW19990525.0041.29 Former astronaut Sally K. Ride is 48.
Figure 1: A relationship question
What makes this question-answer pair interesting is the fact that no
document in the target corpus mentions both Valentina Tereshkova and Sally
Ride. This means that the passages that collectively elucidate the relationship
between the two must be retrieved from the corpus by
1. first finding terms (which we will call connecting terms) that are not
present in the question but which describe (or are evocative of) the
relationship between the terms in the question, in this case first, space,
shuttle and woman; and then
2. retrieving relevant passages for the original query expanded with these
new terms.
Identifying such connecting terms from a corpus is the key aspect of the
technique we present in this paper. This technique differs in important respects
from other well established query expansion techniques [19,10,2,6] as the
following examples are designed to show.2
2 Unlike the example in figure 1, these examples are hypothetical and designed for
pedagogical purposes. It is not claimed that any system exactly replicates any given
example. They are offered in place of the actual output of the QUIRK system to avoid
confusing the intuitions we wish to convey to the reader with the imperfections of our
Ideal examples of connecting terms for pairs of query terms are:
France, Germany → Europe, country, EU
Rushdie, Khomeini → fatwa, blasphemy
cats, dogs → pets, domestic mammals, chase, fight
In contrast, based on our understanding of the algorithms, we would expect most
other approaches to automatic query expansion to favor terms strongly related to
individual query terms independently of other terms in the query. Hypothetical
examples could be
France, Germany → French, Paris, German, Berlin
Rushdie, Khomeini → Salman, Ayatollah
cats, dogs → Poodle, Retriever, Siamese, Manx
where Paris is only related to France and Berlin is only related to Germany.
This style of expansion has the potential to loose the focus of the query. For
example expanding a query as in
dogs, training → Poodle, Retriever, Terrier, Setter, jogging, gym, weights
would be likely to cause a search engine to retrieve a mixture of documents
about dog breeds and about sports training rather than documents on dog
training, which is presumably the user’s intent. We refer to this phenomenon as
the outweighing or overpowering of the query by the expansion terms.
Identifying connecting terms such as
dogs, training → obedience, sit, heel, leash, reward
cannot always be achieved by considering the query terms in isolation.
2.1 Answer Presentation
The most effective presentation of answers to relationship questions is an issue
in itself. Relationship and cause and effect questions differ significantly from
factoid questions used in TREC Q&A track competitions, where the goal is to
identify a single phrase that gives the answer. Often the relationship is not
described in the corpus in a single concise phrase. Such a description may not be
appropriate because of the subtlety or complexity of the relationship, as several
questions in the sample in section 5 should illustrate.
current system. For example, in figure 1, shuttle is actually returned by our
system as a connecting term even though it is in fact irrelevant to the query.
Even in cases where the
connecting terms have been correctly identified, by themselves they usually do
not constitute a useful answer, because, unless one already knows the answer, it
is not obvious how the connecting terms account for the relationship between the
query terms3. Even in patent cases such as
Kennedy, Oswald → kill
if one were ignorant of modern history it would be impossible to rely on the
connecting term kill to decide between the following interpretations:
Kennedy killed Oswald;
Oswald killed Kennedy;
Kennedy and Oswald both killed someone;
Kennedy and Oswald were both killed by someone.
Thus, we believe the most effective use of the discovered connecting terms is to
expand the original query to find passages which mention the query terms
together with the connecting terms. Such passages should then be presented to
the user with the terms suitably highlighted, as in figure 1.
3. Identifying Connecting Terms
In this section we consider possible algorithms, which, given a corpus and a
query, determine expansion terms that would match our informal notion of
connecting terms. These algorithms then define a family of different formal
notions of connecting terms. We adopt the standard statistical IR assumption that
inter-term relevance is reflected in the co-occurrence statistics over the corpus of
interest. Applying this principle to the notion of connecting terms means that the
connecting terms should lie between the query terms according to some co-
occurrence based distance or similarity measure (see section 3.2) on terms [17].
This requires in turn defining what it means for a term C to be between two
terms A and B with respect to the given measure. Thus, there are three
parameters to making the original informal notion concrete:
3 This is an important point to which we return in section 5.
1. the raw co-occurrence data itself;
2. the co-occurrence measure;
3. a notion of an optimal connection, for a given set of seed nodes in a
weighted graph.4
Algorithm 1 (General form of the connecting terms expansion algorithm.)
For some choice of:
a Probabilistic5 Information Retrieval engine E;
a number N of documents to be retrieved;
a number of relevant sentences R;
a similarity measure M;
a number W of content words to consider;
a similarity measure K;
an algorithm C for retrieving nodes from a graph given a seed set of nodes;
a second number J of documents to be retrieved;
1. Compute query terms {q1,...,qn} from a query Q; in a Q&A
exercise this may involve stripping question words and stop words.
2. Submit {q1,...,qn} to a search engine E and retrieve the top N documents;
3. Split documents into sentences and remove stop words from sentences;
4. Select the R sentences {s1,...sR} that are most relevant to Q by some
measure of similarity M between {q1,...,qn} and si;
5. Build word-sentence matrix for the W words {w1,...,wW} that
occur in the greatest number of sentences in {s1,...sR};
6. Use similarity measure K to build a graph representation G of the
pairwise similarity between any two wi,wj in {w1,...,wW};
7. Use method C to compute a set {c1,...,cm} of connecting terms
for {q1,...,qn} from the graph G;
8. Submit {q1,...,qn} ∪ {c1,...,cm} to
the search engine E and retrieve the top J documents;
9. Split documents into sentences and cluster them by agglomerative
clustering [8];
10. Return a representative sentence from each cluster.
4 This would be a connected subgraph of the initial graph that contains all the seed
nodes and optimizes some property, as a shortest path or a maximum spanning tree
does. Any nodes in this subgraph, but not in the seed set, can be considered to be
between the seed nodes in the initial graph.
5 A probabilistic retrieval model enables us to obtain good partial matches for a set of
terms without us having to construct complex boolean query expressions.
These parameters define a very large design space that results from instantiating
the algorithm schema in algorithm 1. The steps that distinguish the connecting
terms expansion algorithm from other forms of query expansion are highlighted.
Information Retrieval Engine (E): Lemur
Number of documents retrieved (N): 25
Number of relevant sentences (R): 1/7 of all sentences in the N documents
Similarity measure (M): overlap
Number of content words (W): 400
Similarity measure (K): weight of evidence [9,7]
Algorithm for extracting connecting terms from graph (C): smallest connecting
subgraph of the maximum spanning tree of G that contains {q1,...,qn}
Number of documents retrieved (J): 50
Figure 2: Current parameter values for the QUIRK connecting terms feedback
Figure 2 lists the particular values currently used by the QUIRK system. Below
we discuss our choice for the values of the most significant parameters and
alternatives that would be worth exploring.
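To make the schema concrete, steps 1-6 can be sketched as a toy instantiation. This is a hypothetical sketch rather than the QUIRK implementation: the stop word list and tokenizer are stand-ins, the IR engine of steps 2-3 is replaced by a direct scan of a small in-memory corpus, and a raw co-occurrence count stands in for the weight-of-evidence measure K. Step 7, extracting the connecting terms from the resulting graph, is discussed in section 3.1.

```python
from collections import Counter
from itertools import combinations

# Toy stop word list; a real system would use a proper one (assumption).
STOP = {"the", "a", "an", "of", "to", "in", "and", "what", "do",
        "does", "have", "common", "she"}

def terms(text):
    """Step 1: strip punctuation, question words and stop words."""
    words = (w.strip(".,?").lower() for w in text.split())
    return [w for w in words if w.isalpha() and w not in STOP]

def overlap(q, s):
    """Similarity M: number of query terms the sentence contains."""
    return len(set(q) & set(s))

def cooccurrence_graph(question, corpus):
    q = terms(question)                                           # step 1
    # Steps 2-3 (toy: scan the whole corpus instead of querying an IR engine).
    sents = [terms(s) for doc in corpus for s in doc.split(".") if s.strip()]
    # Step 4: keep the sentences most relevant to the query, by overlap.
    sents = sorted(sents, key=lambda s: overlap(q, s), reverse=True)
    sents = sents[: max(1, len(sents) // 2)]
    # Step 5: the words occurring in the greatest number of sentences, plus q.
    counts = Counter(w for s in sents for w in set(s))
    words = set(w for w, _ in counts.most_common(10)) | set(q)
    # Step 6: edge weight = number of retained sentences containing both words
    # (a crude stand-in for the weight-of-evidence measure used by QUIRK).
    return q, {(a, b): sum(1 for s in sents if a in s and b in s)
               for a, b in combinations(sorted(words), 2)}
```

On the Tereshkova/Ride example of figure 1, this yields a graph in which space and woman are strongly linked while the two query terms are not directly linked at all, which is exactly the situation the connecting-terms extraction of section 3.1 is designed to handle.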
3.1 The Definition Of Connecting Node
In order to compute the list of connecting terms, we cast the problem in terms of
graphs. We construct a graph G that has as nodes the union of the original query
terms and a set of candidate expansion terms. The procedure by means of which
the set of candidate expansion terms is selected is one of the three parameters of
the connecting terms expansion algorithm schema and is explained in section 3.3
below. G is fully connected and the edge between nodes A and B is assigned as
weight the similarity between A and B as gleaned from the corpus co-occurrence
statistics, according to some measure (step 6 in algorithm 1). In the QUIRK
system we define the connecting terms for a query as all the nodes in G that can
be found on the smallest connected graph which
1. contains all query terms; and
2. is a subgraph of G’s maximum spanning tree (e.g.[14]).
The method for extracting this set corresponds to parameter C in step 7 of the
connecting terms expansion algorithm schema.
Figure 3 is meant to depict the maximum spanning tree of some graph. The
set of nodes highlighted in grey represents the smallest connected subgraph that
contains Q1 and Q2. If Q1 and Q2 represent terms from a query (as Tereshkova
and Ride might be from the example in figure 1) then C1, C2 and C3 represent
connecting terms for it, as first, space and woman might be in the same example.
Figure 3: An example of connecting terms selection
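Assuming a weighted graph of the kind described above, this definition can be sketched in code: build the maximum spanning tree with Kruskal's algorithm and a union-find, then take the union of the unique tree paths from one query term to each of the others, which in a tree is exactly the smallest connected subgraph containing all the query terms. This is an illustrative sketch, not the QUIRK code; the node names follow figure 3.

```python
def maximum_spanning_tree(weights):
    """weights: {(a, b): w} over an undirected graph. Kruskal with union-find."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    tree = {}
    for (a, b), w in sorted(weights.items(), key=lambda e: -e[1]):
        ra, rb = find(a), find(b)
        if ra != rb:                         # heaviest edge joining two components
            parent[ra] = rb
            tree.setdefault(a, set()).add(b)
            tree.setdefault(b, set()).add(a)
    return tree

def tree_path(tree, start, goal):
    """Unique path between two nodes of a tree, found by BFS."""
    prev, frontier = {start: None}, [start]
    while frontier:
        nxt = []
        for u in frontier:
            for v in tree.get(u, ()):
                if v not in prev:
                    prev[v] = u
                    nxt.append(v)
        frontier = nxt
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path

def connecting_terms(weights, query_terms):
    """Nodes of the smallest subtree of the maximum spanning tree that
    contains all query terms, minus the query terms themselves."""
    tree = maximum_spanning_tree(weights)
    qs = list(query_terms)
    covered = set()
    for q in qs[1:]:
        covered.update(tree_path(tree, qs[0], q))  # union of paths = smallest subtree
    return covered - set(qs)
```

Taking the union of the paths from one fixed query term to each of the others suffices because, in a tree, the path between any two seeds is contained in the union of their paths to a common third node.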
While the maximum spanning tree method proved to be satisfactory in the
selection of connecting terms, one could readily think of plausible alternatives.
Among them, other network algorithms such as maxflow and mincut algorithms
[14], Bayesian Propagation as described in [12,11] or algorithms based on
spreading activation such as [13,4]. Choosing one of these other methods would
result in a different definition of what it is for a node to be a connecting term for
a query, and so in a different instantiation of the connecting terms expansion
algorithm schema.
3.2 Co-occurrence Measures
A second parameter of the connecting terms expansion algorithm schema is the
particular choice of the similarity measure between two terms that appear in the
corpus of interest (the K in step 6 of algorithm 1). Widely used measures of
dependence6 [1] include χ2, mutual information, correlation coefficient, cosine,
Gini index and many others (see [5,15,16] for several insightful comparisons
among them).
6 Sometimes also called measures of (im)purity [7] or measures of interestingness.
In order to experiment with a large number of such co-occurrence measures
we have used Christian Borgelt’s INES package [1]. This package implements
over 20 such measures for which it can construct various graphical models such
as optimal spanning trees. Experimenting with such a large number of measures
we have observed that:
1. measures tend to produce trees with characteristic shape types, with width
vs. depth being a salient dimension for differentiating them;
2. deep and narrow trees tend to be inferior as a source of connecting terms, because:
(a) deep trees yield large sets of connecting terms;
(b) large sets of connecting terms can start to outweigh the original
query terms and lead the IR engine astray; and
(c) large singly connected graphs, from which the connecting terms are
derived, are more likely to contain a low weight link and thus not
be coherent units as desired.7
Shallow trees tend to be produced by the measures weight of evidence, quadratic
information gain, relief and relevance, while deep trees tend to be produced by
information gain, stochastic complexity and reduction of description length. In
addition, we have found that weight of evidence [9,7] tends to outperform the
other measures in its group because the small set of terms it produces contains
terms that one would intuitively recognize as more relevant.
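The weight-of-evidence formula itself is not reproduced here, but the flavor of such dependence measures can be illustrated with mutual information, one of the measures listed above, computed over the boolean presence/absence vectors described in section 3.3. This is a sketch for illustration; QUIRK obtains its measures from the INES package.

```python
from math import log2

def mutual_information(x, y):
    """Mutual information (in bits) between two boolean presence/absence
    vectors over the same set of sentences: sum over the four cells of the
    2x2 contingency table of p(a,b) * log2( p(a,b) / (p(a) * p(b)) )."""
    n = len(x)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            p_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b) / n
            p_a = sum(1 for xi in x if xi == a) / n
            p_b = sum(1 for yi in y if yi == b) / n
            if p_ab > 0:
                mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi
```

Two terms that always occur in the same sentences score 1 bit on balanced data, while two independently distributed terms score 0; any of the other listed measures could be dropped into the same slot of the schema.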
3.3 Selecting Initial Co-occurrence Data
Because it is impractical to compute co-occurrence statistics for all terms in the
corpus, our method includes a step in which a selection of passages is retrieved
over which such statistics are computed. This step is the most variable, and in
some sense unsatisfactory, aspect of our approach. The possible variations are
too numerous to discuss in detail, so we simply state the choices we
settled on after extensive experiments.
The co-occurrence data is selected from a small set of documents that match
the original query sufficiently well as is done in pseudo-relevance feedback and
local context analysis [19]. Currently we retrieve a fixed number of documents
(25), but until recently we were setting the cutoff dynamically based on the
differences amongst the scores8 that the IR engine assigns to the top ranking
documents. Further investigation would be needed to gather more empirical
evidence on which strategy is best.
7 This is a generalized case of a chain being only as strong as its weakest link.
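The dynamic strategy can be illustrated with a small sketch. This is a hypothetical reconstruction, not the exact rule we used: it keeps the top-ranked documents up to the largest gap in the scores assigned by the IR engine.

```python
def dynamic_cutoff(scores, max_docs=50):
    """Keep the top-ranked documents up to the largest drop in the
    engine's relevance scores (hypothetical reconstruction)."""
    scores = sorted(scores, reverse=True)[:max_docs]
    if len(scores) < 2:
        return len(scores)
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return gaps.index(max(gaps)) + 1     # cut just before the biggest gap
```

For instance, for the scores 0.9, 0.85, 0.8, 0.3, 0.2 the biggest drop comes after the third document, so three documents would be retained rather than a fixed 25.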
The retrieved documents are split into sentences, and a sentence-word matrix
is constructed.9 Only a top fraction of the sentences, as measured by their
overlap [8] with the original query, is retained. Using sentences as the basis of
co-occurrence is done because of the structure of news corpora, where
documents tend to have tens of sentences, with most being irrelevant to the
query. Having a finer grain of co-occurrence and discarding the irrelevant
sentences from the documents leads to better connecting terms. This process
may be inappropriate for a more focused corpus, such as the CNS corpus, which
seems to average only about 6 sentences per document, with highly focused content.
Repetition counts are discarded and only presence/absence information is
retained. This is done both because repetition within a single sentence is not
significant and because some dependence measures are designed for boolean
data. Only terms that occur in a sufficient number of sentences and original
query terms are retained for the computation of the connecting terms. This
reduces dimensionality, helping prevent overfitting as well as speeding up the
computation.
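The matrix construction just described can be sketched as follows; the `min_sentences` threshold is an illustrative stand-in for whatever occurrence cutoff a real instantiation would use.

```python
def boolean_matrix(sentences, query_terms, min_sentences=2):
    """Build a presence/absence sentence-term matrix: repetition within a
    sentence is discarded, and only terms occurring in at least
    `min_sentences` sentences (plus the original query terms) are kept."""
    q = set(query_terms)
    counts = {}
    for s in sentences:
        for w in set(s):                 # set(): drop within-sentence repetition
            counts[w] = counts.get(w, 0) + 1
    kept = sorted({w for w, c in counts.items() if c >= min_sentences} | q)
    return kept, [[w in set(s) for w in kept] for s in sentences]
```

The boolean rows can be fed directly to dependence measures designed for binary data, which is one of the two reasons given above for discarding repetition counts.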
3.4 Obtaining final answers
As discussed earlier, returning the connecting terms by themselves is typically of
limited use to a user. Instead, we construct a new query, consisting of the
original query terms and the connecting terms, with the original terms given
greater weight. We retrieve a number of best matching sentences. Since the
AQUAINT corpus contains large numbers of nearly identical passages, we have
found it necessary to cluster the retrieved sentences using overlap, and return
only one representative per cluster. This is designed to produce a set of
sentences that between them contain all the query terms, without too much
redundancy between the sentences.
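A greedy sketch of this final step follows. It is a simplification: true agglomerative clustering merges clusters bottom-up, whereas this leader-style pass assigns each sentence to the first earlier sentence it nearly duplicates and keeps that first (highest-ranked) member as the cluster representative.

```python
def overlap_sim(a, b):
    """Word overlap between two sentences, normalized by the shorter one."""
    a, b = set(a), set(b)
    return len(a & b) / max(1, min(len(a), len(b)))

def pick_representatives(sentences, threshold=0.8):
    """Leader-style stand-in for agglomerative clustering: a sentence is
    kept only if it is not a near-duplicate of an already kept one."""
    reps = []
    for s in sentences:
        if all(overlap_sim(s, r) < threshold for r in reps):
            reps.append(s)
    return reps
```

On the nearly identical passages that abound in the AQUAINT corpus, this collapses paraphrases of the same event into a single representative while keeping genuinely different sentences.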
4. Related Work
While our method bears superficial resemblance to local context analysis [19], it
seems conceptually closer to mining for indirect associations [17].
8 Kullback-Leibler (or KL) divergence of the retrieved document from the query.
9 We effectively treat sentences as documents in their own right.
Local context analysis expands queries with terms that are correlated to one
or more query terms. However, the scoring formula does not appear to take the
higher order dependency structure into account. In other words, given the query
terms dog and cat, local context analysis and pseudo-relevance feedback are more
likely to select terms that are strongly connected to a part of the query, like
poodle, bone, siamese, mouse, than terms that are moderately connected to all
(or most) of the query, like veterinary, pet, animal10, which our method is
designed to favor. This bias towards connecting terms is shared by Tan, Kumar
and Srivastava [17].11 However, our approach can find linear chains of terms
where each term is only correlated to its neighbors. This makes it at least in
principle possible to trace a more complex relation between Alice and Emily like
Alice lives next to Bob
Bob works with Cathy
Cathy carpools with Dick
Dick is Emily’s cousin
provided the pairwise correlations exist in the corpus.
It also extends naturally to connecting more than two query terms, since we
actually work with trees. Tan, Kumar and Srivastava only look for one-step
connections.
An experiment worth conducting would be to use our algorithm with Tan,
Kumar and Srivastava’s IS similarity measure,12 which they show to have
desirable properties for detecting associations.
There have also been other applications of graphical models to information
retrieval [3,18] as for example in the Inquery system [3]. However, they are used
at different levels of abstraction, with nodes representing entire documents and
queries, rather than individual term occurrences.
10 This example is hypothetical, intended to leverage the reader’s intuitions about the
likely strength of associations. It would be an interesting research problem to define a
protocol that would allow us to test these intuitions empirically, and to use that protocol
to test whether there really is a significant difference between the technique we describe
in this paper and the other techniques to which we are comparing it.
11 We have become aware of this work only recently.
12 This measure is not among the measures provided by the Bayes net package we
currently use. The experiment would thus require us to modify the internals of the INES
package.
5. Empirical Evaluation
Our system participated in the relationship track evaluation pilot conducted by
the AQUAINT program.13 Two assessors independently judged answers to 100
relationship questions, assigning to each answer an integer score from 0 to 4, inclusive.
Zero was the low score: it meant that the response had no value at all. Four was
the top score, and meant that the answer was completely satisfying. The other
values had no specific meaning attached. The assessors were explicitly
instructed to judge the content, not the form, of the response. Apart from the three
submissions to the pilot, the assessors were also given an answer set supplied by
the human subjects who had created the questions. The assessors were unaware
of the identities of the runs or even of the presence of a human generated run.
The following is a random sample of questions from the exercise:
1. In what country and by whom did Operation Turquoise take place?
2. What is the purpose in taking DHEA?
3. Why is the International Olympic Committee (IOC) concerned about
human growth hormone?
4. What part did ITT (International Telephone and Telegraph) and Anaconda
Copper play in the Chilean 1970 election?
5. What’s the connection between the United States Navy and Puerto Rico’s
Vieques Island?
6. Who are the leaders of the Lebanese guerrilla organization Hezbollah?
7. What is the relationship between the West Nile Virus and the St. Louis
encephalitis virus?
9. What effect did introduction of the "iMac" computer have on the fortunes
of Apple Corp.?
10. What is the connection between Jesse Ventura and Target Stores?
11. Why does Romania have a non-Slavic name?
12. What has been the relationship between Osama bin Laden and Sudan?
13. Was London’s Millennium Dome successful as a tourist attraction?
In all results presented below our QUIRK system is system A.
Run                 Total score
Human               308.5
System A (QUIRK)    165.5
System B            48.5
System C            34.5
Figure 4: Total scores averaged over assessors
13 We thank Ellen Voorhees, who organized the relationship question evaluation at
NIST, for giving us permission to use the evaluation data in this paper.
The total scores for the runs averaged between the two assessors are given in
figure 4.
Run         Assessor 1      Assessor 2
            0      >0       0      >0
human       2      98       11     89
A (QUIRK)   22     78       45     55
B           76     24       79     21
C           78     22       87     13
Figure 5: Zero vs. Nonzero scores by the assessors
Figure 5 gives the zero/non-zero score counts given by the two judges.
Run         Min   Max    Median   Mean
humans      16    1007   151.5    200.0
A (QUIRK)   32    2844   757.0    832.1
B           4     401    48.5     86.8
C           2     863    15.0     33.2
Figure 6: Answer length statistics measured in non-white-space characters
We must stress that these results must be interpreted against the background
of an important design decision. While systems B and C tried to restrict their
answers to either a single sentence or a short list of words describing the
relationship14 (comparable to our connecting terms), we decided, based on the
reasoning in section 2.1, to display them in the context of complete sentences, as
shown in figure 1. This accounts for the disparity in answer length, and makes it
harder to draw conclusions from the scores. The developers of C confirmed our
reasoning by performing their own evaluation, comparable to that in figure 5, but
also including a significantly higher scoring run of their system with answers
presented in context (Run C-J). Their results are presented in figure 7.
Run         0
A (QUIRK)   39
B           80
C           78
C-J         59
Figure 7: Team C’s reevaluation
14 Personal communications.
Run C-J had a median response size of 263 and mean of 561.
6. Conclusion and Further Work
The system, originally tuned to work on the AQUAINT corpus, and tested on the
100 questions in the AQUAINT relationship pilot as discussed above,15 appears
to work well on new corpora and unseen questions without recalibration. We use
the same settings for querying the CNS abstracts data, with different questions
appropriate to that data set, without apparent loss in performance. We
also use the query expansion method as a component in other evaluations, where we
have observed it to be beneficial. Thus it appears that we have not
overfitted to the initial question set, which was quite diverse to begin with. This
lends some further empirical support to the underlying ideas of our approach.
However, many incidental decisions had to be made to turn these ideas into a
working system. Most of these were arrived at by trial and error. More data from
comparisons with other systems is also desirable.
7. Acknowledgements
We would like to thank the other QUIRK members Stefano Bertolo and Bjoern
Aldag for their feedback on the ideas leading to this approach and extensive
editorial work on this paper. We wish to acknowledge the support of the ARDA
AQUAINT program, which has funded this research. We would also like to
thank the anonymous reviewers for their constructive suggestions. Much credit
is also due to the Open Source software community that has provided many
critical components of our system: specific tools like Lemur and INES, as well
as the ubiquitous infrastructure of Linux, GNU, Python, Perl ...
References
[1] C. Borgelt. Data Mining with Graphical Models. PhD thesis, Otto-von-
Guericke-University of Magdeburg, Germany, 2000.
[2] Chris Buckley, Gerard Salton, James Allan, and Amit Singhal. Automatic
query expansion using SMART: TREC 3. In Text REtrieval Conference, 1994.
15 See Appendix for a sample of the questions in the pilot.
[3] James P. Callan, W. Bruce Croft, and Stephen M. Harding. The INQUERY
retrieval system. In Proceedings of DEXA-92, 3rd International Conference
on Database and Expert Systems Applications, pages 78–83, 1992.
[4] Eugene Charniak. A neat theory of marker passing. AAAI-86, 1:584–588, 1986.
[5] Robert J. Hilderman and Howard J. Hamilton. Heuristic measures of
interestingness. In Principles of Data Mining and Knowledge Discovery,
pages 232–241, 1999.
[6] Jaana Kekäläinen and Kalervo Järvelin. The impact of query structure and
query expansion on retrieval performance. In Proceedings of SIGIR, 1998.
[7] Igor Kononenko. On biases in estimating multi-valued attributes. In IJCAI,
pages 1034–1040, 1995.
[8] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical
Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 1999.
[9] D. Michie. Personal models of rationality. Journal of Statistical Planning
and Inference, 1988.
[10] Mandar Mitra, Amit Singhal, and Chris Buckley. Improving automatic
query expansion. In Research and Development in Information Retrieval,
pages 206–214, 1998.
[11] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of
plausible inference. Morgan Kaufmann, revised second edition, 1994.
[12] Judea Pearl. Graphical models for probabilistic and causal reasoning. In
Dov M. Gabbay and Philippe Smets, editors, Handbook of Defeasible
Reasoning and Uncertainty Management Systems, Volume 1: Quantified
Representation of Uncertainty and Imprecision, pages 367–389. Kluwer
Academic Publishers, Dordrecht, 1998.
[13] M. R. Quillian. Semantic memory. In M. Minsky, editor, Semantic
Information Processing, pages 227–270. MIT Press, Cambridge, 1968.
[14] R. Sedgewick. Algorithms. Addison Wesley, 1983.
[15] P. Tan and V. Kumar. Interestingness measures for association patterns: A
perspective, 2000.
[16] P. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness
measure for association patterns, 2002.
[17] Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava. Indirect association:
Mining higher order dependencies in data. In Principles of Data Mining and
Knowledge Discovery, pages 632–637, 2000.
[18] Konstadinos Tzeras and Stephan Hartmann. Automatic indexing based on
bayesian inference networks. In Robert Korfhage, Edie Rasmussen, and
Peter Willett, editors, Proceedings of SIGIR-93, 16th ACM International
Conference on Research and Development in Information Retrieval, pages
22–34, Pittsburgh, US, 1993. ACM Press, New York, US.
[19] Jinxi Xu and W. Bruce Croft. Query expansion using local and global
document analysis. In Proceedings of the Nineteenth Annual International
ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 4–11, 1996.
... To the best of our knowledge, [1,2] are the major published work on answering general Relationship queries. They focus on traditional document sets rather than the Web. ...
... They focus on traditional document sets rather than the Web. The method proposed in [2] has some limitations when working with Web pages on the Web. For example, that method first forms a query Q that contains both entities' keywords and uses Q to retrieve 25 documents from a search engine. ...
Finding relationships between two or more entities on the web is a novel and challenging problem. Existing search engines excel in keyword matching and document ranking, but they are not tailored to work well with more than a single query or entity of interest. Often, it is possible that the actual information need of the user posing a query on these search engines is to find a connection between people, places or firms. The goal of this project is to address this unique information need of a web user. Initially a framework was implemented using an existing approach and its shortcomings were analyzed. Based on this, we propose a new algorithm with enhanced preprocessing and an improved ranking scheme which is capable of extracting key relationships. Further, the Lingo algorithm is used to cluster these relationships and return comprehensible labels displayed on a customized web user interface. Evaluation of the algorithm was done using the standard NDCG metric. It was found that our algorithm consistently outperformed the existing approach and gave strong results both on traditional document sets and web queries.
... Adding the terms to query using ontology may retrieve irrelevant documents, also Li et al. [42]; Macdonald et al. [43]. Mahler [44] proposed a holistic query expansion (HQE) method. Wang Zhixiao et al. [64] dispensed an approach for QE that relies on Global Analysis and Ontology (GAO) along with statistical and semantic approaches. ...
Full-text available
The use of an automatic query expansion technique is to enhance the performance of the Information Retrieval System. Selecting the candidate terms for query expansion is an essential task to make query more precise to extract the most suitable documents. This paper provides a method to select the best terms for query enhancement. Firstly, the effect of abbreviation resolution, Lexical Variation, Synonyms, n-gram pseudo-relevance feedback, Co-occurrence method on baseline approaches of query expansion is analyzed.. In this work, we used the Okapi BM25 algorithm for ranking. We used Concept-based normalization to deal with concept terms. Here our results show the improvement in results than the baseline approach. A new combined technique that integrates lexical variation, synonyms, n-gram pseudo relevance feedback for query enhancement is proposed. For experimental purpose three English written datasets CACM, CISI, and TREC-3 is used. The obtained results show improvement in the performance of query expansion concerning mean average precision, F-measure, and precision-recall curve.
... In particular, this has been observed that adding to the query with concepts from ontology may results in the retrieval of irrelevant documents [20,21]. Mahler [22] has first introduced the query drift issue as an outweighting problem. When additional terms are more related to individual query terms than to the whole query, resulting in irrelevant retrieval of results, then the resulting phenomenon is called outweighting. ...
Full-text available
A major issue in effective information retrieval is the problem of vocabulary mismatches. The method called query expansion addresses this issue by reformulating each search query with additional terms that better define the information needs of the user. Many researchers have contributed to improving the accuracy of information retrieval systems, through different approaches to query expansion. In this article, we primarily discuss statistical query expansion approaches that include document analysis, search and browse log analyses, and web knowledge analyses. In addition to proposing a comprehensive classification for these approaches, we also briefly analyse the pros and cons of each technique. Finally, we evaluate these techniques using five functional features and experimental settings such as TREC collection and results of performance metrics. An in-depth survey of different statistical query expansion approaches suggests that the selection of the best approach depends on the type of search query, the nature and availability of data resources, and performance efficiency requirements.
... An approach was proposed for answering complex questions with a random walk model, relying on question decomposition and producing answers via a multi-document summarization system [5]. A graph model was used to expand the query for answering relation queries on entities [8]. The method uses connecting terms to reconstruct the query and finds more relations. ...
Automatically discovering semantic links among documents is the basis of developing advanced applications on large-scale documentary resources. This paper proposes an approach to automatically discover semantic links in a given document set. It has the following advantages: (1) it does not rely on any predefined ontology; (2) the semantic link networks and relevant rules automatically evolve; (3) it can adapt to updates of the adopted techniques. Experiments on document sets of different types (scientific papers and Web pages on Dunhuang culture) and different scales show that the proposed approach is feasible. The approach can be used to automatically construct semantic overlays on large document sets to support advanced applications such as various relation queries on documents.
... SLNs of documents can support relational queries and query expansion. A graph model was used to expand the query for answering relational queries on entities [11]. To discover the relations between two entities on the Web, document pairs containing potential semantic links are ranked based on the connecting words between documents [12]. ...
Knowing semantic links among resources is the basis of realizing machine intelligence over large-scale resources. Discovering semantic links among resources with limited human interference is a challenging issue. This paper proposes an approach to automatically discovering and predicting semantic links in a document set based on a model of the document semantic link network (SLN). The approach has the following advantages: it supports probabilistic relational reasoning; SLNs and the relevant rules automatically evolve; and it can adapt to updates of the adopted techniques. The approach can support cyberspace applications, such as document recommendation and relational queries, on large document sets. Copyright © 2010 John Wiley & Sons, Ltd.
Purpose The Web is the largest repository of information. Personal information is usually scattered on various pages of different websites. Search engines have made it easier to find personal information. An attacker may collect a user's scattered information via search engines and infer private information from it. The authors call this kind of privacy attack "Privacy Inference Attack via Search Engines". The purpose of this paper is to provide a user-side automatic detection service for detecting privacy leakage before personal information is published. Design/methodology/approach In this paper, the authors propose a user-side automatic detection service. In the user-side service, the authors construct a user information correlation (UICA) graph to model the association between items of user information returned by search engines. The privacy inference attack is mapped into the decision problem of searching for a privacy-inferring path with maximal probability in the UICA graph, which is proved to be a nondeterministic polynomial time (NP)-complete problem by a two-step reduction. A Privacy Leakage Detection Probability (PLD-Probability) algorithm is proposed to find the privacy-inferring path: it combines two significant factors that can influence the vertices' probability in the UICA graph and uses a greedy algorithm to find the privacy-inferring path. Findings The authors reveal that privacy-inferring attacks via search engines are very serious in real life. In this paper, a user-side automatic detection service is proposed to detect the risk of privacy inference. The authors conduct three kinds of experiments to evaluate the seriousness of the privacy leakage problem and the performance of the methods proposed in this paper. The results show that the algorithm for the service is reasonable and effective.
Originality/value The paper introduces a new family of privacy attacks on the Web: privacy inferring attack via search engines and presents a privacy inferring model to describe the process and principles of personal privacy inferring attack via search engines. A user‐side automatic detection service is proposed to detect the privacy inference before publishing personal information. In this user‐side service, the authors propose a Privacy Leakage Detection Probability (PLD‐Probability) algorithm. Extensive experiments show these methods are reasonable and effective.
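The PLD-Probability approach described above searches for a maximal-probability inferring path with a greedy algorithm. A minimal greedy sketch over a toy UICA-style graph follows; edge probabilities and node names are invented for illustration, and the actual algorithm weighs additional vertex factors:

```python
def greedy_max_prob_path(graph, start, target):
    """Greedily follow the most probable outgoing edge until the
    target is reached or no unvisited neighbor remains.

    graph: {node: {neighbor: probability}}"""
    path, prob, node, visited = [start], 1.0, start, {start}
    while node != target:
        candidates = {n: p for n, p in graph.get(node, {}).items()
                      if n not in visited}
        if not candidates:
            return None, 0.0      # dead end: no inferring path found
        node = max(candidates, key=candidates.get)
        prob *= candidates[node]
        path.append(node)
        visited.add(node)
    return path, prob

# Toy question: can an attacker link "name" to "home_address"?
graph = {"name": {"employer": 0.9, "forum_id": 0.6},
         "employer": {"office_city": 0.8},
         "office_city": {"home_address": 0.5},
         "forum_id": {"home_address": 0.7}}
path, prob = greedy_max_prob_path(graph, "name", "home_address")
```

Note that on this toy graph the greedy walk finds the path via "employer" with probability 0.9 × 0.8 × 0.5 = 0.36, missing the better path via "forum_id" (0.6 × 0.7 = 0.42), which illustrates why a greedy search is only a heuristic for this NP-complete problem.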
Discovering semantic relationships between entities is a crucial problem for many data analysis tasks. Most recent studies, however, focus only on extracting predefined semantic instances, and current representations of semantic relationships are also weak. This paper presents a new method for extracting meaningful semantic relationships from unstructured natural language sources. The method is based on the maximal frequent connecting sequences extracted from the contexts of entities. To identify the semantic relationships of entities, connecting terms are found and used as seeds to discover the maximal frequent connecting sequences. Experimental results show the effectiveness of our methods.
We propose a system for finding intermediate entities from two examples by using Web search engine indices. For example, a user wants to find recipients of the Nobel Peace Prize in the thirty years between Mother Teresa in 1979 and Barack Obama in 2009; one such answer is Kofi Atta Annan. In this situation, the user wants to find something intermediate between two entities. We first describe the problem of finding entities between two examples. We then propose a system for extracting intermediate entities between two inputs by using Web search engine indices. The system focuses on the positions of terms in Web pages and extracts candidate terms that are likely to appear between the two inputs. Our system then ranks candidate terms based on term frequencies and positions. Finally, we conducted experiments to show the usefulness of our system.
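The position-based extraction described above can be sketched minimally: collect the terms appearing between the two input entities in each page and rank them by frequency. Single-token entities and first-occurrence positions are simplifying assumptions, and the toy documents are invented:

```python
from collections import Counter

def intermediate_terms(docs, a, b):
    """Rank terms whose position falls between the positions of
    entities a and b in each document, by overall frequency."""
    counts = Counter()
    for doc in docs:
        if a in doc and b in doc:
            i, j = doc.index(a), doc.index(b)    # first occurrences only
            lo, hi = min(i, j), max(i, j)
            counts.update(t for t in doc[lo + 1:hi] if t not in (a, b))
    return [t for t, _ in counts.most_common()]

docs = [["teresa", "1979", "annan", "2001", "obama", "2009"],
        ["prize", "teresa", "then", "annan", "then", "obama"],
        ["obama", "follows", "annan", "who", "follows", "teresa"]]
ranked = intermediate_terms(docs, "teresa", "obama")
```

Here "annan" appears between the two inputs in all three toy documents, so it surfaces as the top candidate.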
Finding relationships between entities on the Web, e.g., the connections between different places or the commonalities of people, is a novel and challenging problem. Existing Web search engines excel in keyword matching and document ranking, but they cannot handle many relationship queries well. This paper proposes a new method for answering relationship queries on two entities. Our method first retrieves the top Web pages for each entity from a Web search engine. It then matches these Web pages and generates an ordered list of Web page pairs, each consisting of one Web page for either entity. The top-ranked Web page pairs are likely to contain the relationships between the two entities. One main challenge in the ranking process is to effectively filter out the large amount of noise in the Web pages without losing much useful information. To achieve this, our method assigns appropriate weights to terms in Web pages and intelligently identifies the potential connecting terms that capture the relationships between the two entities. Only the top potential connecting terms with large weights are used to rank Web page pairs. Finally, the top-ranked Web page pairs are presented to the searcher. For each such pair, the query terms and the top potential connecting terms are properly highlighted so that the relationships between the two entities can be easily identified. We implemented a prototype on top of the Google search engine and evaluated it under a wide variety of query scenarios. The experimental results show that our method is effective at finding important relationships with low overhead.
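The pair-ranking idea above can be sketched in a few lines: score each page pair by the total weight of the terms the pages share, excluding the query entities themselves. The weighting (minimum shared term count), the function names, and the toy pages, which echo the Tereshkova/Ride example from the introduction, are illustrative assumptions rather than the paper's actual implementation:

```python
from collections import Counter

def rank_page_pairs(pages_a, pages_b, query_terms):
    """Rank page pairs by the total weight of shared (connecting) terms,
    excluding the query terms themselves."""
    def weights(page):
        return Counter(t for t in page if t not in query_terms)
    scored = []
    for i, pa in enumerate(pages_a):
        wa = weights(pa)
        for j, pb in enumerate(pages_b):
            wb = weights(pb)
            shared = set(wa) & set(wb)
            score = sum(min(wa[t], wb[t]) for t in shared)
            scored.append(((i, j), score, sorted(shared)))
    scored.sort(key=lambda x: -x[1])
    return scored

pages_a = [["tereshkova", "first", "woman", "space", "1963"],
           ["tereshkova", "soviet", "pilot"]]
pages_b = [["ride", "first", "american", "woman", "space"],
           ["ride", "physics", "stanford"]]
ranked = rank_page_pairs(pages_a, pages_b, {"tereshkova", "ride"})
top_pair, top_score, connecting = ranked[0]
```

The top pair is the one whose connecting terms ("first", "woman", "space") capture the commonality between the two entities, which is exactly what the highlighted output described in the abstract is meant to surface.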
The explosion of data stored in commercial or administrational databases calls for intelligent techniques to discover the patterns hidden in them and thus to exploit all available information. Therefore a new line of research has recently been established, which became known under the names "Data Mining" and "Knowledge Discovery in Databases". In this paper we study a popular technique from its arsenal of methods to do dependency analysis, namely learning inference networks (also called "graphical models") from data. We review the already well-known probabilistic networks and provide an introduction to the recently developed and closely related possibilistic networks.
Most casual users of IR systems type short queries. Recent research has shown that adding new words to these queries via ad-hoc feedback improves the retrieval effectiveness of such queries. We investigate ways to improve this query expansion process by refining the set of documents used in feedback. We start by using manually formulated Boolean filters along with proximity constraints. Our approach is similar to the one proposed by Hearst [12]. Next, we investigate a completely automatic method that makes use of term co-occurrence information to estimate word correlation. Experimental results show that refining the set of documents used in query expansion often prevents the query drift caused by blind expansion and yields substantial improvements in retrieval effectiveness, both in terms of average precision and precision in the top twenty documents. More importantly, the fully automatic approach developed in this study performs competitively with the best manual approach and requires little computational overhead.
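The automatic co-occurrence idea above can be sketched as a drift guard: keep only candidate expansion terms that co-occur well with every query term, not just one of them. The correlation estimate (a simple conditional document frequency) and the threshold are simplified assumptions, not the paper's exact formula:

```python
def cooccurrence(term, qterm, docs):
    """Fraction of feedback documents containing qterm that also contain term."""
    with_q = [d for d in docs if qterm in d]
    if not with_q:
        return 0.0
    return sum(1 for d in with_q if term in d) / len(with_q)

def expansion_terms(query, docs, threshold=0.5):
    """Keep candidate terms well correlated with EVERY query term,
    a simple guard against query drift."""
    candidates = {t for d in docs for t in d} - set(query)
    return sorted(t for t in candidates
                  if min(cooccurrence(t, q, docs) for q in query) >= threshold)

feedback = [["space", "shuttle", "astronaut", "launch"],
            ["space", "shuttle", "crew", "launch"],
            ["space", "program", "budget"]]
terms = expansion_terms(["space", "shuttle"], feedback)
```

Taking the minimum correlation over all query terms is what blocks drift here: "budget" co-occurs with "space" but not with "shuttle", so it is filtered out, while "launch" correlates with both and survives.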
This paper introduces a novel pattern called indirect association and examines its utility in various application domains. Existing algorithms for mining associations, such as Apriori, will only discover itemsets that have support above a user-defined threshold. Any itemsets with support below the minimum support requirement are filtered out. We believe that an infrequent pair of items can be useful if the items are related indirectly via some other set of items. In this paper, we propose an algorithm for deriving indirectly associated item pairs and demonstrate the potential application of these patterns in the retail, textual and stock market domains.
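A minimal sketch of indirect association: report item pairs whose direct support is below a threshold but that are each well supported together with a common mediator item. The thresholds and toy baskets are illustrative, and this brute-force version is far simpler than the paper's algorithm:

```python
from itertools import combinations

def support(itemset, baskets):
    """Fraction of baskets containing every item in itemset."""
    return sum(1 for b in baskets if set(itemset) <= set(b)) / len(baskets)

def indirect_pairs(baskets, items, lo=0.3, hi=0.5):
    """Pairs that rarely co-occur directly (support < lo) but are each
    strongly associated (support >= hi) with a common mediator."""
    result = []
    for a, b in combinations(sorted(items), 2):
        if support([a, b], baskets) < lo:
            mediators = [m for m in items if m not in (a, b)
                         and support([a, m], baskets) >= hi
                         and support([b, m], baskets) >= hi]
            if mediators:
                result.append((a, b, mediators))
    return result

baskets = [["tea", "sugar"], ["tea", "sugar"],
           ["coffee", "sugar"], ["coffee", "sugar"], ["tea", "milk"]]
pairs = indirect_pairs(baskets, {"tea", "coffee", "sugar", "milk"},
                       lo=0.3, hi=0.4)
```

"tea" and "coffee" never appear in the same basket, so Apriori-style mining would discard the pair; the indirect pattern recovers it via the mediator "sugar".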
The tuples in a generalized relation (i.e., a summary generated from a database) are unique, and therefore can be considered to be a population with a structure that can be described by some probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary to assign a single real-valued index that represents its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that for sample data sets, the order in which some of the measures rank summaries is highly correlated.
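One classic diversity measure of the kind the abstract describes is the Shannon entropy of a summary's tuple-count distribution; whether it is among the paper's sixteen heuristics is not stated here, so treat this as an illustrative example of a diversity-based interestingness index:

```python
import math

def shannon_index(counts):
    """Shannon entropy (bits) of a summary's tuple-count distribution.
    A concentrated distribution has low entropy; a uniform one has the
    maximum, log2(number of tuples)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

concentrated = [97, 1, 1, 1]    # one dominant tuple
uniform = [25, 25, 25, 25]      # counts spread evenly
```

A ranking heuristic built on such an index would order summaries by how far their distribution departs from uniformity, which is one way of operationalizing "less complex is more interesting".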
The effects of query structures and query expansion (QE) on retrieval performance were tested with a best-match retrieval system (INQUERY). Query structure means the use of operators to express the relations between search keys. Eight different structures were tested, representing weak structures (averages and weighted averages of the weights of the keys) and strong structures (e.g., queries with more elaborated search key relations). QE was based on concepts, which were first selected from a conceptual model and then expanded by semantic relationships given in the model. The expansion levels were (a) no expansion, (b) a synonym expansion, (c) a narrower concept expansion, (d) an associative concept expansion, and (e) a cumulative expansion of all other expansions. With weak structures and Boolean structured queries, QE was not very effective. The best performance was achieved with one of the strong structures at the largest expansion level.
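The contrast between weak and strong query structures can be sketched with toy scoring functions: a weak structure averages the per-key weights, while a stronger, synonym-group structure first takes the best match within each group and only then averages. The functions and data are illustrative, not the INQUERY implementation:

```python
def weak_score(doc_weights, keys):
    """Weak structure: plain average of per-key match weights."""
    return sum(doc_weights.get(k, 0.0) for k in keys) / len(keys)

def strong_score(doc_weights, synonym_groups):
    """Stronger structure: each synonym group contributes its
    best-matching member, and the groups are averaged."""
    return sum(max(doc_weights.get(k, 0.0) for k in group)
               for group in synonym_groups) / len(synonym_groups)

# A document matching "car" but not its expansion term "automobile".
doc = {"car": 0.8, "automobile": 0.0, "engine": 0.6}
flat = weak_score(doc, ["car", "automobile", "engine"])
structured = strong_score(doc, [["car", "automobile"], ["engine"]])
```

With the flat structure, the unmatched expansion term "automobile" drags the score down; the grouped structure is unaffected, which mirrors the paper's finding that expansion pays off mainly under strong structures.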
In this paper, a Bayesian inference network model for automatic indexing with index terms (descriptors) from a prescribed vocabulary is presented. It requires an indexing dictionary with rules mapping terms of the respective subject field onto descriptors and inverted lists for terms occurring in a set of documents of the subject field and descriptors manually assigned to these documents. The indexing dictionary can be derived automatically from a set of manually indexed documents. An application of the network model is described, followed by an indexing example and some experimental results about the indexing performance of the network model.
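The indexing model above learns, from manually indexed documents, a mapping from document terms to controlled-vocabulary descriptors. As a rough stand-in for the paper's inference network, a naive-Bayes sketch with add-one smoothing; the training examples and function names are invented for illustration:

```python
import math
from collections import defaultdict

def train(examples):
    """examples: list of (terms, descriptors) pairs from manually
    indexed documents; returns simple count statistics."""
    term_desc = defaultdict(lambda: defaultdict(int))
    desc_count = defaultdict(int)
    for terms, descriptors in examples:
        for d in descriptors:
            desc_count[d] += 1
            for t in terms:
                term_desc[d][t] += 1
    return term_desc, desc_count

def score(terms, term_desc, desc_count):
    """Log-probability style score of each descriptor given document
    terms (add-one smoothing); a naive-Bayes stand-in, not the paper's
    network model."""
    n = sum(desc_count.values())
    out = {}
    for d, cnt in desc_count.items():
        s = math.log(cnt / n)                              # descriptor prior
        for t in terms:
            s += math.log((term_desc[d][t] + 1) / (cnt + 2))
        out[d] = s
    return out

examples = [(["rocket", "launch", "orbit"], ["spaceflight"]),
            (["orbit", "satellite"], ["spaceflight"]),
            (["election", "vote"], ["politics"])]
td, dc = train(examples)
scores = score(["orbit", "launch"], td, dc)
best = max(scores, key=scores.get)   # descriptor to assign
```

The highest-scoring descriptors would then be assigned to the new document, mirroring the automatic indexing step the abstract describes.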