Inferring the Most Important Types of a Query: a Semantic
Approach
David Vallet
Universidad Autónoma de Madrid
Ciudad Universitaria de Cantoblanco
Madrid 28049, Spain
david.vallet@uam.es
Hugo Zaragoza
Yahoo! Research Barcelona
Ocata 1
Barcelona 08003, Spain
hugoz@es.yahoo-inc.com
ABSTRACT
In this paper we present a technique for ranking the most important types or categories for a given query. Rather than trying to find the category of the query itself, a task known as query categorization, our approach seeks to find the most important types related to the query results. The query category does not necessarily appear in this ranking of types, so our approach can be complementary to query categorization.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Retrieval Models
General Terms: Algorithms, Experimentation
Keywords: Type Ranking, Entity Ranking, Faceted Search
1. INTRODUCTION
Under the current Web search paradigm, search engines return an ordered list of snippets, each representing a web page of interest. In the past few years, new forms of search have appeared both in the academic literature and in novel commercial search services. For example, if pages contain metadata (i.e. page categories, types, properties, authors, etc.), this metadata can be used to organise the search results and allow the user to browse or filter based on it. This idea was adopted early by search engines over product databases such as those found in catalog and shopping sites (e.g. www.amazon.com, www.kelkoo.com), travel sites (e.g. www.opodo.com), etc. Search engines also allow users to select in which category their search falls (e.g. Web, Music, Video). Faceted Search [2] is a recently proposed framework to formalise these approaches. We will use this term loosely here to refer in general to search engines which expose metadata to the user for browsing. For collections without explicit metadata, faceted search can still be applied by developing automatic classifiers and extractors that process the content of the documents and extract properties.
A crucial and difficult problem in Faceted Search is to choose which facets are the most important for the query. This is a problem of facet ranking rather than document ranking. Once the most important facets are determined, they can be used to adapt the presentation of results (changing the display, clustering, providing filters, reranking, etc.).
* Work done during a short internship at Yahoo! Research Barcelona.
Copyright is held by the author/owner(s).
SIGIR’08, July 20–24, 2008, Singapore.
ACM 978-1-60558-164-4/08/07.
Entity Retrieval (ER) is a different trend that modifies the traditional search engine paradigm. Entities are phrases with an associated semantic type (e.g. “CITY:San Francisco”, “DATE:July 2008”). In ER the result to a user query is not a ranked list of snippets but rather a ranked list of entities [1, 3]. Interestingly, Entity Retrieval can provide very useful information to Faceted Search, especially when explicit metadata is not available. By analysing the entities relevant to a query, we can gain information about the types that are most interesting for this query. For example, the query ‘New York city’ retrieves, in our Wikipedia corpus, both entities related to well-known locations of the city and entities related to important dates in the city’s history. This tells us that the LOCATION and DATE types are important for this query, more so than other types such as PERSON, ORGANIZATION, etc. Similar to the problem of selecting facets, the problem here is to rank types, not entities. Once we select the most important types for a query, we can use this information in a number of ways, for example by using type-specific displays for the most important types (e.g. a map for locations, a timeline for dates) or by letting the user filter the results.
We are interested here in the use of entity retrieval for the automatic prediction of the most relevant types for a query. We call this problem entity type ranking to differentiate it from standard entity ranking. To the best of our knowledge this is the first work that studies this problem.
2. ENTITY AND TYPE RANKING
To study this problem we adopt the entity ranking setting described in [3]. We use the same corpus: a snapshot of Wikipedia with named entities automatically extracted, comprising 20.3M occurrences of 0.8M unique named entities [3]. An entity e is represented by the named entity phrase and its associated type t (64 different types and subtypes). The entity ranker works as follows. The query is executed against a standard passage retrieval algorithm, which retrieves the 500 most relevant passages and collects all the entities that appear in them. A Kullback-Leibler distance (KLD) based ranking algorithm is then applied to the entities¹:
$$\mathrm{score}_q(e) = \mathrm{KLD}(P_q(e) \,\|\, P_c(e)) = P_q(e)\,\log\!\left(\frac{P_q(e)}{P_c(e)}\right)$$

where P(e) is the prior probability of an entity being in a sentence, and P_q(e) and P_c(e) are the maximum likelihood estimates of P(e) computed over the top 500 retrieved sentences for query q and over the entire corpus, respectively.

¹ This improved slightly (3% AvgPrec on average) over the ranking methods proposed in [3].
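To make the scoring step concrete, the following is a minimal sketch (not the authors' implementation) of how an entity's KLD score could be computed from raw counts; the counting scheme and the smoothing constant are assumptions for illustration only.

```python
import math

def kld_entity_score(count_in_top, total_in_top,
                     count_in_corpus, total_in_corpus, eps=1e-12):
    """Point-wise KLD contribution of an entity, as in score_q(e) above.

    count_in_top / total_in_top: occurrences of this entity (and of all
        entities) in the top 500 retrieved sentences for the query,
        giving the maximum likelihood estimate P_q(e).
    count_in_corpus / total_in_corpus: the same counts over the whole
        corpus, giving P_c(e). The eps floor avoids log(0); this
        smoothing is an assumption, not taken from the paper.
    """
    p_q = count_in_top / total_in_top
    p_c = max(count_in_corpus / total_in_corpus, eps)
    if p_q == 0.0:
        return 0.0
    return p_q * math.log(p_q / p_c)

# Example: an entity seen 12 times among 4,000 entity occurrences in the
# top sentences, and 150 times among 20.3M occurrences in the corpus.
print(kld_entity_score(12, 4000, 150, 20_300_000))
```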
Table 1: Example of queries and associated types

  Query             Query Type     Type 1     Type 2   Type 3
  Australia         Country        Location   Date     Organization
  Hanseatic League  Organization   Location   Date     Organization
  Paris Dakar       Event          Location   Person   Vehicle
The entity ranker system returns, for a query, a ranked list of entities $E(q) = e_1, e_2, \ldots, e_{n_q}$, ordered by decreasing score. The type ranking system takes this as input and needs to produce a ranked list of entity types $T(q) = t_1, t_2, \ldots, t_{n'_q}$, from the most to the least important type related to the query. The family of type ranking functions that we have experimented with can be expressed as:
$$\mathrm{score}(t, q) = \sum_{i=1}^{n_q} \begin{cases} w_q(t, i) & \text{if } \mathrm{type}(e_i) = t \\ 0 & \text{otherwise} \end{cases}$$
where $w_q(t, i)$ is a weighting function for that type and that query at the given position $i$ of the ranked entity list $E(q)$. We tried a number of weighting functions; we report here the four which seem most interesting to us²:

$w_q(t, i) = \mathrm{count}_q(t, i) := 1$
$w_q(t, i) = \mathrm{score}_q(t, i) := \mathrm{score}_q(e_i)$
$w_q(t, i) = \mathrm{pos}_q(t, i) := (n_q - i)$
$w_q(t, i) = \mathrm{pos}^2_q(t, i) := (n_q - i)^2$

We also tried using these weighting functions only on the top $k$ entities in $E(q)$.
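As an illustration of the type ranking step, the sketch below aggregates the four weighting functions over a ranked entity list. The data structures, function names, and example entities are assumptions for illustration, not the authors' code.

```python
from collections import defaultdict

# A ranked entity list E(q): (entity phrase, type, entity score), best first.
# These entries are made-up examples.
ranked_entities = [
    ("Manhattan", "LOCATION", 3.2),
    ("1898", "DATE", 2.7),
    ("Brooklyn", "LOCATION", 2.5),
    ("New York Yankees", "ORGANIZATION", 1.1),
]

def rank_types(entities, weighting="pos2"):
    """Return types sorted by score(t, q) = sum over positions i of w_q(t, i)."""
    n_q = len(entities)
    scores = defaultdict(float)
    for i, (phrase, etype, entity_score) in enumerate(entities, start=1):
        if weighting == "count":
            w = 1.0                      # same weight at every position
        elif weighting == "score":
            w = entity_score             # weight by the entity's KLD score
        elif weighting == "pos":
            w = n_q - i                  # linear discount by rank position
        elif weighting == "pos2":
            w = (n_q - i) ** 2           # quadratic discount by rank position
        else:
            raise ValueError(weighting)
        scores[etype] += w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_types(ranked_entities, weighting="pos2"))
# [('LOCATION', 10.0), ('DATE', 4.0), ('ORGANIZATION', 0.0)]
```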
3. EXPERIMENTS
The weighting functions were evaluated with 50 queries. The type relevance assessments for these queries were created by 1) launching the queries with the entity retrieval system, 2) making relevance assessments of the returned entities, and 3) ordering the list of types by the percentage of their entities judged as relevant. Table 1 shows an example of queries, their type (or category), and the three most important types, inferred from the assessments as described above. The query type does not always coincide with the top most important types. For instance, while the Hanseatic League is an organization, the most important types are the locations (countries, cities) that were part of this trading alliance, the significant dates of the alliance’s history, and finally other related organizations. As evaluation metrics we used NDCG values (with gain values of 10, 5, 2) and P@N values. The precision values try to evaluate how effective the system would be at selecting a set of relevant types for a query. P@N is thus defined as the percentage of result types up to position N that belong to the top N relevant types from the assessments.
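As a hedged illustration of this metric (the exact assessment format is not specified in the paper), P@N for type ranking could be computed as follows; the example query data are taken from Table 1 but the relevance ordering is assumed.

```python
def precision_at_n(predicted_types, relevant_types, n):
    """P@N: fraction of the top-n predicted types that belong to the
    top-n relevant types from the assessments."""
    top_predicted = predicted_types[:n]
    top_relevant = set(relevant_types[:n])
    hits = sum(1 for t in top_predicted if t in top_relevant)
    return hits / n

# Example for the 'Hanseatic League' query of Table 1 (assumed ordering).
predicted = ["LOCATION", "ORGANIZATION", "DATE", "PERSON"]
assessed = ["LOCATION", "DATE", "ORGANIZATION"]
print(precision_at_n(predicted, assessed, 3))  # 1.0
```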
Table 2 shows the results of the evaluation. Values in parentheses are the results using only the top k = 70 entity results from E(q). This value of k led to the maximum performance over a subset of 20 queries for all type ranking weighting functions.
² We also tried other polynomial and exponential discounting functions, without improvements.
Table 2: Average NDCG and P@N

          NDCG            P@1             P@2             P@3
  count   0.651 (0.766)   0.480 (0.660)   0.540 (0.630)   0.607 (0.693)
  score   0.671 (0.628)   0.520 (0.460)   0.560 (0.490)   0.593 (0.573)
  pos     0.678 (0.769)   0.480 (0.660)   0.590 (0.640)   0.640 (0.693)
  pos²    0.733 (0.774)   0.640 (0.680)   0.610 (0.660)   0.627 (0.700)
The main difference between the four proposed approaches is the importance that the weighting function gives to the top result types. count gives the same weight regardless of position and, even though it is the most naive approach, it still achieves considerable performance. The score function gives slightly more importance to the top results, as their score values are higher, but its improvement is marginal. The weighting functions pos and pos² (to a greater degree) give more importance to the first results. pos slightly improves over the baseline approach, whereas pos² yields a greater improvement: 13% on NDCG and 33% on P@1. This suggests that the entities with the relevant types are more frequent in the upper positions of the result sets of our entity retrieval system, and that the latter function adapts better to this distribution. This hypothesis is further validated by the top k = 70 results, which show improvements ranging from 18% NDCG for the baseline approach down to 13% and 6% for the pos and pos² approaches, which already give more importance to the top positions.
4. CONCLUSION
In this work, we propose the task of entity type ranking and present a method to predict the most important types relevant to a query in an informational search task. We do this by making use of entity extraction and entity ranking systems. The proposed methods can achieve up to 70% precision on the three top inferred types. This can have direct application to faceted search systems, especially in informational search and with corpora where metadata needs to be extracted from the documents.
5. ACKNOWLEDGEMENTS
This research was partially supported by the European Commission under contract FP6-027685 MESH. The content expressed is the view of the authors and not necessarily that of the MESH project as a whole.
6. REFERENCES
[1] A. P. de Vries, J. A. Thom, A. M. Vercoustre, N. Craswell, and M. Lalmas. INEX 2007 entity ranking track guidelines. In INEX 2007 Workshop Pre-proceedings, pages 481–486, 2007.
[2] M. Tvarozek and M. Bielikova. Adaptive faceted browser for navigation in open information spaces. In WWW ’07, pages 1311–1312, New York, NY, USA, 2007. ACM.
[3] H. Zaragoza, H. Rode, P. Mika, J. Atserias, M. Ciaramita, and G. Attardi. Ranking very many typed entities on Wikipedia. In CIKM ’07, pages 1015–1018, New York, NY, USA, 2007. ACM.