Inferring the Most Important Types of a Query: a Semantic
Approach
David Vallet
Universidad Autónoma de Madrid
Ciudad Universitaria de Cantoblanco
Madrid 28049, Spain
david.vallet@uam.es
Hugo Zaragoza
Yahoo! Research Barcelona
Ocata 1
Barcelona 08003, Spain
hugoz@es.yahoo-inc.com
ABSTRACT
In this paper we present a technique for ranking the most
important types or categories for a given query. Rather than
trying to find the category of the query, known as query
categorization, our approach seeks to find the most important
types related to the query results. The query category does
not necessarily fall into this ranking of types, and therefore
our approach can be complementary to query categorization.
Categories and Subject Descriptors: H.3.3 [Information
Storage and Retrieval]: Retrieval Models
General Terms: Algorithms, Experimentation
Keywords: Type Ranking, Entity Ranking, Faceted Search
1. INTRODUCTION
Under the current Web search paradigm, search engines
return an ordered list of snippets, each representing a web
page of interest. In the past few years, new forms of search
have appeared both in the academic literature and in novel
commercial search services. For example, if pages contain
metadata (e.g. page categories, types, properties, authors),
this metadata can be used to organise the search results and
allow the user to browse or filter based on it. This idea was
adopted early by search engines over product databases such
as those found in catalog and shopping sites (e.g.
www.amazon.com, www.kelkoo.com) and travel sites (e.g.
www.opodo.com). Search engines also allow users to select
the category in which their search falls (e.g. Web, Music,
Video). Faceted Search [2] is a recently proposed framework
to formalise these approaches. We will use this term loosely
here to refer in general to search engines which expose the
metadata to the user for browsing. For collections without
explicit metadata, faceted search can still be applied by
developing automatic classifiers and extractors that process
the content of the documents and extract properties.
A crucial and difficult problem in Faceted Search is
choosing the most important facets for the query. This is a
problem of facet ranking rather than document ranking. Once
the most important facets are determined, they can be used
to adapt the presentation of results (changing the display,
clustering, providing filters, reranking, etc.).
during a short internship at Yahoo! Research Barcelona
Copyright is held by the author/owner(s).
SIGIR’08, July 20–24, 2008, Singapore.
ACM 978-1-60558-164-4/08/07.
Entity Retrieval (ER) is a different trend that modifies
the traditional search engine paradigm. Entities are phrases
with an associated semantic type (e.g. “CITY:San Francisco”,
“DATE:July 2008”). In ER the result of a user query is not a
ranked list of snippets but a ranked list of entities [1, 3].
Interestingly, Entity Retrieval can provide very useful
information to Faceted Search, especially when explicit
metadata is not available. By analysing the entities
relevant to a query, we can gain information about the types
that are most interesting for this query. For example, the
query ‘New York city’ retrieves, in our Wikipedia corpus,
both entities related to well-known locations of the city
and entities related to important dates in the city’s
history. This tells us that the LOCATION and DATE types are
important for this query, more so than other types such as
PERSON or ORGANIZATION. As with the problem of selecting
facets, the problem here is to rank types, not entities.
Once we select the most important types for a query, we
can use this information in a number of ways, for example by
using type-specific displays for the most important types
(e.g. a map for locations, a timeline for dates) or by
letting the user filter the results.
We are interested here in the use of entity retrieval for the
automatic prediction of the most relevant types for a query.
We call this problem entity type ranking to differentiate it
from standard entity ranking. To the best of our knowledge,
this is the first work that studies this problem.
2. ENTITY AND TYPE RANKING
To study this problem we adopt the entity ranking setting
described in [3]. We use the same corpus, a snapshot of
Wikipedia with named entities automatically extracted: 20.3M
occurrences of 0.8M unique named entities [3]. An entity e
is represented by the named entity phrase and its associated
type t (64 different types and subtypes). The entity ranker
works as follows. The query is executed against a standard
passage retrieval algorithm, which retrieves the 500 most
relevant passages and collects all the entities that appear
in them. A Kullback-Leibler distance (KLD) based ranking
algorithm is applied to the entities¹:
$$\mathrm{score}_q(e) = \mathrm{KLD}\big(P_q(e) \,\|\, P_c(e)\big) = P_q(e) \log \frac{P_q(e)}{P_c(e)}$$
where $P(e)$ is the prior probability of an entity being in a
sentence, and $P_q(e)$ and $P_c(e)$ are the maximum likelihood
estimates of $P(e)$ computed over the top 500 retrieved
sentences for query q and over the entire corpus, respectively.

¹ This improved slightly (3% AvgPrec on average) over the
ranking methods proposed in [3].

Table 1: Example of queries and associated types
Query             Query Type    Type 1    Type 2  Type 3
Australia         Country       Location  Date    Organization
Hanseatic League  Organization  Location  Date    Organization
Paris Dakar       Event         Location  Person  Vehicle
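As an illustration only, the KLD-based entity score defined above can be sketched in a few lines of Python. The input format (each sentence given as the set of entities it mentions), the function names, and the decision to skip entities that never occur in the corpus sample are assumptions made for this sketch; the passage retrieval step is not modelled.

```python
import math
from collections import Counter


def sentence_probabilities(sentences):
    """Maximum likelihood estimate of P(e): fraction of sentences containing e."""
    counts = Counter(entity for sentence in sentences for entity in set(sentence))
    total = len(sentences)
    return {entity: count / total for entity, count in counts.items()}


def kld_entity_scores(query_sentences, corpus_sentences):
    """Rank entities by P_q(e) * log(P_q(e) / P_c(e)), highest score first."""
    p_q = sentence_probabilities(query_sentences)
    p_c = sentence_probabilities(corpus_sentences)
    scores = {}
    for entity, pq in p_q.items():
        pc = p_c.get(entity, 0.0)
        if pc > 0.0:  # assumption: entities unseen in the corpus are skipped
            scores[entity] = pq * math.log(pq / pc)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): each sentence is the set of entities it mentions.
corpus = [
    {"CITY:Madrid"},
    {"CITY:Madrid", "DATE:1992"},
    {"CITY:Paris"},
    {"CITY:Paris", "DATE:1992"},
]
retrieved = corpus[:2]  # stand-in for the top retrieved sentences of a query
ranking = kld_entity_scores(retrieved, corpus)
```

In this toy example the city entity is over-represented in the retrieved sentences relative to the corpus and is ranked first, while the date entity is equally frequent in both and receives a score of zero.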
The entity ranker system returns for a query a ranked
list of entities $E(q) = e_1, e_2, \ldots, e_{n_q}$, ordered by
decreasing score. The type ranking system takes this list as
input and needs to produce a ranked list of entity types
$T(q) = t_1, t_2, \ldots, t_{n'_q}$, from the most to the least
important type related to the query. The family of type
ranking functions that we have experimented with can be
expressed as:
$$\mathrm{score}(t, q) = \sum_{i=1}^{n_q} \begin{cases} w_q(t, i) & \text{if } \mathrm{type}(i) = t \\ 0 & \text{otherwise} \end{cases}$$
where $w_q(t, i)$ is a weighting function for that type and that
query at the given position i of the ranked entity list E(q). We
tried a number of weighting functions; we report here the
four that seem most interesting to us²:
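$$\begin{aligned}
w_q(t, i) &= \mathrm{count}_q(t, i) := 1 \\
w_q(t, i) &= \mathrm{score}_q(t, i) := \mathrm{score}_q(e_i) \\
w_q(t, i) &= \mathrm{pos}_q(t, i) := (n_q - i) \\
w_q(t, i) &= \mathrm{pos}^2_q(t, i) := (n_q - i)^2
\end{aligned}$$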
We also tried using these weighting functions only on the
top k entities in E(q).
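The type ranking family described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(phrase, type, score)` tuple layout and the function name are hypothetical, the position weights are read as n_q minus i, and n_q is assumed to be the length of the list after the optional top-k cut. The key `"pos2"` stands for the squared position weighting.

```python
from collections import defaultdict


def rank_types(ranked_entities, weighting="pos2", k=None):
    """Aggregate per-entity weights into a ranked list of (type, score).

    ranked_entities: list of (phrase, type, score), ordered by decreasing score.
    k: if given, only the top k entities are considered (assumed to redefine n_q).
    """
    entities = ranked_entities if k is None else ranked_entities[:k]
    n_q = len(entities)
    weight_functions = {
        "count": lambda i, score: 1.0,                     # same weight everywhere
        "score": lambda i, score: score,                   # entity ranker score
        "pos": lambda i, score: float(n_q - i),            # linear position weight
        "pos2": lambda i, score: float((n_q - i) ** 2),    # squared position weight
    }
    weight = weight_functions[weighting]
    totals = defaultdict(float)
    for i, (_phrase, entity_type, score) in enumerate(entities, start=1):
        totals[entity_type] += weight(i, score)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy ranked entity list (hypothetical values).
entities = [
    ("Manhattan", "LOCATION", 0.9),
    ("1898", "DATE", 0.5),
    ("1664", "DATE", 0.3),
]
by_count = rank_types(entities, "count")
by_pos = rank_types(entities, "pos")
by_pos2 = rank_types(entities, "pos2")
top1_only = rank_types(entities, "count", k=1)
```

On this toy list, counting favours DATE because it occurs twice, whereas the position-based weightings favour LOCATION because it sits at the top of the entity ranking.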
3. EXPERIMENTS
The weighting functions were evaluated with 50 queries.
The type relevance assessments for these queries were
created by 1) launching the queries with the entity retrieval
system, 2) making a relevance assessment of the returned
entities and 3) ordering the list of types by the percentage
of entities judged as relevant, highest first. Table 1 shows
example queries, their type (or category), and the three
most important types, inferred from the assessments as
described above. The query type does not always coincide
with the most important types. For instance, while the
Hanseatic League is an organization, the most important
types are the locations (countries, cities) that were part of
this trading alliance, the significant dates of the alliance’s
history and finally other related organizations. As evaluation
metrics we used NDCG values (with gain values of 10, 5, 2)
and P@N values. The precision values evaluate how effective
the system would be at selecting a set of relevant types
given a query. P@N is thus defined as the percentage of
result types up to position N that belong to the top N
relevant types from the assessments.
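The two metrics can be sketched as below, under stated assumptions. P@N follows the definition given in the text. For NDCG the text only gives the gain values (10, 5, 2); the logarithmic discount, the zero gain for types outside the three assessed ones, and the function names are assumptions of this sketch rather than details confirmed by the paper.

```python
import math

GAINS = (10, 5, 2)  # gains for the first, second and third assessed type


def precision_at_n(predicted, assessed, n):
    """Fraction of the top n predicted types found among the top n assessed types."""
    top_assessed = set(assessed[:n])
    return sum(1 for t in predicted[:n] if t in top_assessed) / n


def ndcg(predicted, assessed, gains=GAINS):
    """NDCG with a log2 rank discount (assumed) and the assessed order as the ideal."""
    gain_of = dict(zip(assessed, gains))

    def dcg(types):
        return sum(
            gain_of.get(t, 0) / math.log2(rank + 1)
            for rank, t in enumerate(types, start=1)
        )

    ideal = dcg(assessed[: len(gains)])
    return dcg(predicted) / ideal if ideal > 0 else 0.0


# Assessed types for "Hanseatic League" as listed in Table 1; the predicted
# ranking is a made-up system output for illustration.
assessed = ["Location", "Date", "Organization"]
predicted = ["Location", "Organization", "Date", "Person"]

p1 = precision_at_n(predicted, assessed, 1)
p2 = precision_at_n(predicted, assessed, 2)
p3 = precision_at_n(predicted, assessed, 3)
score = ndcg(predicted, assessed)
```

Here the predicted list swaps the second and third types, so P@2 drops while P@1 and P@3 stay perfect, and NDCG falls slightly below 1.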
Table 2 shows the results of the evaluation. Values in
parentheses are the results using only the top k = 70
entity results from E(q). This value of k led to the maximum
value over a subset of 20 queries for all type ranking
weighting functions.
² We also tried other polynomial and exponential discounting
functions, without improvements.
Table 2: Average NDCG and P@N (values for top k = 70 in parentheses)
        NDCG           P@1            P@2            P@3
count   0.651 (0.766)  0.480 (0.660)  0.540 (0.630)  0.607 (0.693)
score   0.671 (0.628)  0.520 (0.460)  0.560 (0.490)  0.593 (0.573)
pos     0.678 (0.769)  0.480 (0.660)  0.590 (0.640)  0.640 (0.693)
pos²    0.733 (0.774)  0.640 (0.680)  0.610 (0.660)  0.627 (0.700)
The main difference between the four proposed approaches
is the importance that the weighting function gives to the
top result types. count gives the same weight regardless of
position and, despite being the most naive approach, it
still achieves considerable performance. The score function
gives slightly more importance to the top results, as their
score values are higher, but its improvement is marginal.
The weighting functions pos and pos² (to a higher degree)
give more importance to the first results. pos slightly
improves on the baseline approach, whereas pos² yields a
greater improvement: 13% on NDCG and 33% on P@1. This
suggests that entities with the relevant types are more
frequent in the upper positions of the result sets of our
entity retrieval system. The latter function seems to adapt
better to this distribution. This hypothesis is further
supported by the results of the top k = 70 modification,
which show NDCG improvements ranging from 18% on the
baseline approach to 13% and 6% on the pos and pos²
approaches, which already give more importance to the top
positions.
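The percentages quoted above can be checked against Table 2 with a short calculation. The sketch assumes that the "baseline approach" is the count weighting and that improvements are relative gains; the key `"pos2"` stands for the squared position weighting. Both readings appear consistent with the reported figures.

```python
# NDCG and P@1 values copied from Table 2 (full list, then top k = 70).
ndcg_full = {"count": 0.651, "score": 0.671, "pos": 0.678, "pos2": 0.733}
ndcg_top70 = {"count": 0.766, "score": 0.628, "pos": 0.769, "pos2": 0.774}
p1_full = {"count": 0.480, "pos2": 0.640}


def relative_gain(new, base):
    """Relative improvement of new over base, in percent."""
    return 100.0 * (new - base) / base


# pos2 versus the count baseline on the full entity list.
pos2_ndcg_gain = relative_gain(ndcg_full["pos2"], ndcg_full["count"])
pos2_p1_gain = relative_gain(p1_full["pos2"], p1_full["count"])

# Effect of restricting each weighting function to the top 70 entities.
top70_gains = {
    name: relative_gain(ndcg_top70[name], ndcg_full[name])
    for name in ("count", "pos", "pos2")
}
```

Rounded to whole percentages, these relative gains match the figures given in the text for pos² over the baseline and for the top k = 70 modification.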
4. CONCLUSION
In this work, we propose the task of entity type ranking,
and present a method to predict the most important types
relevant to a query in an informational search task. We do
this by making use of entity extraction and entity ranking
systems. The proposed methods can achieve up to 70%
precision on the three top inferred types. This can have
direct application to faceted search systems, especially in
informational search and with corpora where metadata needs
to be extracted from the documents.
5. ACKNOWLEDGEMENTS
This research was partially supported by the European
Commission under contract FP6-027685 MESH. The expressed
content is the view of the authors but not necessarily
the view of the MESH project as a whole.
6. REFERENCES
[1] A. P. de Vries, J. A. Thom, A. M. Vercoustre,
N. Craswell, and M. Lalmas. INEX 2007 entity ranking
track guidelines. In INEX 2007 Workshop
Preproceedings, pages 481–486, 2007.
[2] M. Tvarozek and M. Bielikova. Adaptive faceted
browser for navigation in open information spaces. In
WWW ’07, pages 1311–1312, New York, NY, USA,
2007. ACM.
[3] H. Zaragoza, H. Rode, P. Mika, J. Atserias,
M. Ciaramita, and G. Attardi. Ranking very many
typed entities on Wikipedia. In CIKM ’07, pages
1015–1018, New York, NY, USA, 2007. ACM.