Ranking Very Many Typed Entities on Wikipedia
Hugo Zaragoza
Yahoo! Research
Barcelona, Spain
hugoz@yahoo-inc.com
Henning Rode
University of Twente
The Netherlands
h.rode@cs.utwente.nl
Peter Mika
Yahoo! Research
Barcelona, Spain
pmika@yahoo-inc.com
Jordi Atserias
Yahoo! Research
Barcelona, Spain
jordi@yahoo-inc.com
Massimiliano Ciaramita
Yahoo! Research
Barcelona, Spain
massi@yahoo-inc.com
Giuseppe Attardi
Università di Pisa
Italy
attardi@di.unipi.it
(on sabbatical at Yahoo! Research Barcelona)
ABSTRACT
We discuss the problem of ranking very many entities of dif-
ferent types. In particular we deal with a heterogeneous set
of types, some being very generic and some very specific.
We discuss two approaches for this problem: i) exploiting
the entity containment graph and ii) using a Web search
engine to compute entity relevance. We evaluate these ap-
proaches on the real task of ranking Wikipedia entities typed
with a state-of-the-art named-entity tagger. Results show
that both approaches can greatly increase the performance
of methods based only on passage retrieval.
1. MOTIVATION
We are interested in the problem of ranking entities of dif-
ferent types as a response to an open (ad-hoc) query. In
particular, we are interested in collections with many enti-
ties and many types. This is the typical case when we deal
with collections which have been analyzed using NLP techniques such as named-entity recognition or semantic tagging.
Let us give an example of the task we are interested in.
Imagine that a user types an informational query such as
“Life of Pablo Picasso” or “Egyptian Pyramids” into a search engine. Besides relevant documents, we wish to rank relevant entities such as people, countries, dates, etc. so that
they can be presented to the user for browsing.
We believe this task is novel and interesting in its own right.
In some sense the task is similar to the expert finding task
in TREC [2]. However, this task will lead to very different
models, for two reasons. First we must deal with a hetero-
geneous set of entities; some of them are very general (like
“school”, “mother”) whereas others are very specific (“Pablo
Picasso”). Second, the entities are simply too many to build
entity-specific models as is done for experts in TREC [2].
Instead, we must develop models on the fly, at query time.
In this sense the models developed are more similar to those
described in [1]. However, here we wish to rank the entities
themselves, and not sentences.
In this paper we explore two types of algorithms for en-
tity ranking: i) algorithms that use the entity containment
graph to compute the importance of entities based on the
top ranked passages, and ii) algorithms that use correlation
on web search results.
2. PROBLEM SETTING
To study this task we followed these steps:
- we used a statistical entity recognition algorithm to identify many entities (and their corresponding types) on a copy of the English Wikipedia;
- we asked users to issue queries to our baseline entity ranking systems and to evaluate the results;
- we compared the performance of several algorithms on these queries.
In order to extract entities from Wikipedia, we first trained
a statistical entity extractor on the BBN Pronoun Corefer-
ence and Entity Type Corpus which includes annotation of
named entity types (Person, Facility, Organization, GPE,
Location, Nationality, Product, Event, Work of Art, Law,
Language, and Contact-Info), nominal entity types (Person,
Facility, Organization, GPE, Product, Plant, Animal, Sub-
stance, Disease and Game), and numeric types (Date, Time,
Percent, Money, Quantity, Ordinal and Cardinal). We note
that some types are dedicated to identifying common nouns that refer to or describe named entities; for example, father and
artist could be tagged with the Person-Description type.
We applied this entity extractor on an XMLised Wikipedia
collection constructed by the 2006 INEX XML retrieval eval-
uation initiative [3] (625,405 Wikipedia entries). This iden-
tified 28 million occurrences of 5.5 million unique entities. A special retrieval index was then created containing both the
text and the identified entities. The overall processing time
was approximately one week on a single PC. This tagged
collection has been made available [7]; more detailed infor-
mation about its construction and content can be found at
this reference.
The evaluation framework for the task was set up as fol-
lows. First, the user chose a query on a topic that the user
Table 1: Example queries and entity judgments (see text for discussion).

Query “Yahoo! Search Engine”
Most Important Entities: Yahoo, Google, MSN, Inktomi, Yahoo.com.
Important Entities: Web, crawler, 2004, AltaVista, 2002, Amazon.com, Jeeves, TrustRank, WebCrawler, Search Engine Placement, more than 20 billion Web, eBay, World Wide Web, BT OpenWorld, between 1997 and 1999, Stanford University and Yahoo, AOL, Kelkoo, Konfabulator, AlltheWeb, Excite.
Related Entities: users, Firefox, Teoma, LookSmart, Widget, companies, company, Dogpile, user, Searchen Networks, MetaCrawler, Fitzmas, Hotbot, ...

Query “Budapest”
Most Important Entities: Budapest, Hungary, Hungarian, city, Greater Budapest, capital, Danube, Budapesti Közgazdaságtudományi és Államigazgatási Egyetem, M3 Line, Pest county.
Important Entities: University of Budapest, Austria, town, Budapest Metro, Soviet, 1956, Ferenc Joachim, Karl Marx University of Economic Sciences, Budapest University of Economic Sciences, Eötvös Loránd University of Budapest, Technical University of Budapest, 1895, February 13, Budapest Stock Exchange, Kispest, ...
Related Entities: Paris, Vienna, German, Prague, London, Munich, Collegium Budapest, government, Jewish, Nazi, 1950, Debrecen, 1977, M3, center, Tokyo, World War II, New York, Zagreb, Leipzig, population, residences, state, cemetery, Serbian, Novi Sad, 1949, Szeged, Turin, Graz, 6-3, Medgyessy, ...

Query “Tutankhamun curse”
Most Important Entities: Tutankhamun, Carnarvon, mummies, Boy Pharaoh, The Curse, archaeologist, Howard Carter, 1922.
Important Entities: Pharaohs, King Tutankhamun.
Related Entities: Valley, KV62, Curse of Tutankhamun, Curse, King, Mummy’s Curse, ...
knew well and that was covered in Wikipedia. Then the
system ran this query against a standard passage retrieval
algorithm, which retrieved the 500 most relevant passages and
collected all the entities which appeared in them. This is
the candidate entity set which needs to be ranked by the
different algorithms. Finally, the entities were ranked using
our baseline entity ranking algorithm (discussed later) and
given to the user to evaluate. The possible judge assess-
ments were: Most Important, Important, Related, Unrelated, or Don’t know. The user was asked to judge all entries if possible, and at least the first fifty. Besides the judgment labels, users were not given any specific training, nor were they given example queries or judgments. Ten judges were recruited, each judging from 3 to 10 queries, for a total of 50 judged queries.
Some resulting queries and judgments are given in Table 1;
with these examples we want to stress the difficulty and sub-
jectivity of the evaluation task. Indeed we realize that our
task evaluation is quite naïve and may suffer from a number
of problems which we wish to address in future evaluations.
However, this initial evaluation allowed us to start studying
some of the properties of this task, and to compare (however
roughly) several ideas and approaches.
3. ENTITY RANKING METHODS
First, we will introduce some notation. Let a retrieved passage be the tuple (pID, s), where pID is the unique ID of the passage and s is the retrieval score of the passage for the given query. Call P_q the set formed by the K highest-scored retrieved passages with respect to the query q (in our case K = 500). Let an entity be the tuple (v, t), where v is its string value (e.g. ’Einstein’) and t is its type (e.g. Person). Call C the set of all entities in the collection and C_q the set of all entities occurring in P_q.
The baseline model we consider is to use a passage retrieval algorithm and score an entity by the maximum score s of the passages in P_q in which the entity appears. This is referred to as MaxScore in Table 2. We report a num-
ber of evaluation measures. P@K, MAP and MRR denote
precision at K, mean average precision and mean reciprocal
rank respectively; these measures were computed by bina-
rising the judgments into relevant (for Most Important and
Important labels) and irrelevant (for the rest). DCG is the
discounted cumulative gain function; we used gains 10, 3, 1
and 0 respectively for the Most Important to Unrelated la-
bels, and the discount function used was log(r+1). NDCG
is the normalized DCG.
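To make these measures concrete, here is a minimal sketch of the DCG and NDCG computation with the gains and discount described above (the logarithm base is not stated in the paper, so base 2 is assumed; names are illustrative, not the authors' code):

    import math

    # Gains from the paper: Most Important=10, Important=3, Related=1, Unrelated=0.
    GAINS = {"Most Important": 10, "Important": 3, "Related": 1, "Unrelated": 0}

    def dcg(labels):
        # labels: judgment labels of a ranked entity list, best-first; discount is log(r+1).
        return sum(GAINS.get(label, 0) / math.log2(rank + 1)
                   for rank, label in enumerate(labels, start=1))

    def ndcg(labels):
        # Normalize by the DCG of the ideal (gain-sorted) ordering of the same judgments.
        ideal = dcg(sorted(labels, key=lambda label: GAINS.get(label, 0), reverse=True))
        return dcg(labels) / ideal if ideal > 0 else 0.0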
3.1 Entity Containment Graph Methods
The first set of algorithms is based on the “entity containment” graph. This graph is constructed by connecting every passage in P_q to every entity in C_q which is contained in the passage. This forms a bipartite graph in which the degree of an entity equals its passage frequency in P_q. Figure 1 shows two views of the entity containment graph obtained for the query ’Life of Pablo Picasso’.
Once this graph is constructed, we can use different graph centrality measures to rank entities in the graph. The most basic one is the degree. This method (noted Degree in Table 2) alone yields a 47% relative increase in MAP and 22% in NDCG, a clear indication that the entity containment graph can be a useful representation of entity relevance. We experimented with higher-order centrality measures such as finite stochastic walks or PageRank, but the performance was similar to or worse than that of degree.

Figure 1: Entity containment graphs for the query “Life of Pablo Picasso”. (a) Small graph detail (3 relevant sentences only); (b) full entity containment graph.
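As an illustration of the degree-based ranking, the following sketch builds the bipartite entity containment graph from the retrieved passages and ranks entities by degree (the tuple layout and names are assumptions for exposition, not the authors' implementation):

    from collections import defaultdict

    def degree_ranking(passages):
        # passages: the top-K retrieved set P_q, assumed to be a list of
        # (pID, score, entities) tuples, where entities holds the (value, type)
        # pairs tagged in that passage.
        degree = defaultdict(int)
        for _pid, _score, entities in passages:
            for entity in set(entities):   # an entity counts once per passage
                degree[entity] += 1
        # Degree in the bipartite passage-entity graph = passage frequency in P_q.
        return sorted(degree.items(), key=lambda item: item[1], reverse=True)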
We observed that degree-dependent methods are biased towards very general entities (such as descriptions, country names, etc.) which are not interesting but have high frequency. To improve on this, we experimented with two different methods. An ad-hoc method consists of removing the description types, which are the most generic and would seem a priori to be the least informative. However, doing this did not lead to improved results (models noted F- in Table 2). Furthermore, this solution would not be applicable in practice, since we may not always know which are the least informative types of a corpus. An alternative method considered was to weight the degree of an entity by its inverse entity frequency:

    ief := log(N / n_e),

where N is the total number of sentences and n_e the number containing the entity e. This improved the results further, leading to a 76% relative increase over the baseline in MAP and 31% in NDCG.
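A small extension of the previous sketch shows how the degree can be weighted by ief; here n_e is assumed to be a precomputed mapping from entities to the number of sentences containing them (an illustration, not the authors' code):

    import math

    def degree_ief_ranking(passages, n_e, N):
        # N: total number of sentences in the collection;
        # n_e[entity]: number of sentences containing that entity.
        scores = {entity: deg * math.log(N / n_e[entity])
                  for entity, deg in degree_ranking(passages)}
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)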
We also tried to improve results by weighting the entity de-
gree computation with the sentence relevance scores. This
approach (noted W- in Table 2) did not improve the results,
despite trying several forms of score normalization.
3.2 Web Search based Methods
For computing the relevance of entities to a given query, we do not need to be constrained to the text of Wikipedia: we can instead rely on the Web as a noisier, but much larger, corpus. Based on this ob-
servation, we have experimented with ranking entities by
computing their correlation to the query phrase on the Web
using correlation measures well-known in text mining [6].
This technique has been successfully applied in the past, for
example to the problem of social network mining from the
Web [4, 5].
The difference here is that we are only interested in the
correlations between the query and the set of related entities,
while in co-occurrence analysis one typically computes the
correlations between all possible pairs of entities to create a
co-occurrence graph of entities. Query-to-entity correlation
measures can be easily computed using search engines by
observing the page counts returned for the entity, query and
their conjunction.
We found that of the common measures we tested (Jaccard-
coefficient, Simpson-coefficient and Google distance), the
Jaccard-coefficient clearly produced the best results (see Web
Jaccard in Table 2). In practice, we obtained the best results from the search engine when quoting the query string, but not the entity label. This can be explained
by the fact that queries are typically correct expressions,
while the entity tagger often makes mistakes in determining
the boundaries of entities. Enforcing these incorrect bound-
aries results in a decrease in performance.
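A minimal sketch of the Web Jaccard score computed from page counts follows; page_count stands for any function that returns the number of hits a Web search engine reports for a query string (an assumption for illustration):

    def web_jaccard(query, entity, page_count):
        # As noted above, the query is quoted as a phrase; the entity label is not.
        n_q = page_count('"%s"' % query)
        n_e = page_count(entity)
        n_qe = page_count('"%s" %s' % (query, entity))
        union = n_q + n_e - n_qe
        return n_qe / union if union > 0 else 0.0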
The improvement obtained over the baseline (32% relative
in MAP and 6% in NDCG) is however not as good as that
obtained from the entity containment graph. One of the rea-
sons may be that, for some queries, the quality of the results
obtained from searching the Web may be inferior to that
obtained by retrieving Wikipedia passages. For such queries,
results obtained after a certain rank are not relevant and
therefore bias the correlation measures. To alleviate this,
we experimented with a novel measure based on the idea of
discounting the importance of documents as their rank in-
creases. Simple versions of this did not lead to an increase
in performance. One of the main problems is that different queries and entities yield result sets of varying quality and size. This led us to try slightly more sophisticated methods.
In order not to penalize documents with many relevant re-
sults, instead of using the ranks directly we used a notion
of average precision where the relevant documents are those
returned both by the query and the entity. The method is
illustrated in Figure 2. We compare the set of top K documents returned by the query (thought of as relevant) with the ranked list of results returned for a particular entity. Next, we determine which of the documents returned for the entity are in the relevant set and compute their average
precision. Computing such an average precision has the ad-
vantage of almost eliminating the effect of K, which should
depend on the query. Indeed, this method greatly improves
the result over the Jaccard and baseline methods (see Web
RankDiscounted in Table 2). This method has achieved a
performance that is on par with degree-based methods that
take ief into account. Nevertheless, we still require the en-
tity extraction and passage retrieval steps to produce the set of candidate entities.

Figure 2: Computing the average precision of the results returned for the entity “Gertrude Stein” with respect to the set of pages relevant to the query “Life of Pablo Picasso”.
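The rank-discounted measure can be sketched as follows, treating the top-k query results as the relevant set and scoring the entity's result list by average precision against it (the exact normalization is not spelled out in the paper; dividing by the size of the relevant set is one plausible choice):

    def rank_discounted(query_results, entity_results, k):
        # query_results, entity_results: ranked result lists (e.g. URLs) returned
        # by the Web search engine for the query and for the entity, respectively.
        relevant = set(query_results[:k])
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(entity_results, start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant) if relevant else 0.0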
4. DISCUSSION
We have taken the first steps towards studying the problem
Table 2: Performance of the different models (best two in bold).

MODEL                MAP   P@10  P@30   MRR   DCG    NDCG  Relative ∆NDCG
MaxScore             0.34  0.37  0.28   0.64  67.91  0.64
MaxScore(1 + ief)    0.40  0.39  0.29   0.66  69.92  0.68   6%
Degree               0.50  0.54  0.37   0.96  79.69  0.78  22%
F-Degree             0.50  0.52  0.39   0.95  79.23  0.79  22%
Degree · ief         0.60  0.63  0.451  0.98  83.89  0.84  31%
F-Degree · ief       0.57  0.60  0.44   0.98  82.59  0.82  28%
W-Degree             0.48  0.51  0.38   0.92  79.11  0.76  21%
W-Degree · ief       0.54  0.63  0.42   0.94  82.68  0.81  28%
Web RankDiscounted   0.62  0.65  0.50   0.95  86.34  0.83  30%
Web Jaccard          0.45  0.50  0.340  0.75  78.27  0.71  10%
of ad-hoc entity ranking in the presence of a large set of
heterogeneous entities. We have constructed a realistic test-
bed to carry out evaluation of entity ranking models, and
we have provided some initial directions of research. With
respect to entity containment graphs, our results show that it is important to take into account the notion of inverse entity frequency to discount general types. With respect to
Web methods we showed that taking into account the rank of
the documents in the computation of correlations can yield
significant improvements in performance.
Web methods are complementary to graph methods and
could be combined in a number of ways. For example, corre-
lation measures can be used to compute correlation-graphs
among the entities; these graphs could replace the entity
containment graphs discussed above. Furthermore, ief could be combined with Web measures, or we could define an ief that depends on Web information. It may also be
possible to select the candidate set of entities directly from
the search results (or even just the snippets) obtained from a
Web search engine. This would eliminate the need for offline pre-processing of collections. We plan to explore these issues
in the future.
Nevertheless, it is necessary to increase the quality of the
evaluation in order to further quantify the benefits of the
different methods. To this end, we have released to the
public the corpus used in this study, and we plan to design and carry out more thorough evaluations.
5. ACKNOWLEDGEMENTS
For entity extraction, we used the open source SuperSense
Tagger (http://sourceforge.net/projects/supersensetag/).
For indexing and retrieval, we used the IXE retrieval library
(http://www.ideare.com/products.shtml), kindly made avail-
able to us by Tiscali.
6. REFERENCES
[1] S. Chakrabarti, K. Puniyani, and S. Das. Optimizing
scoring functions and indexes for proximity search in
type-annotated corpora. In WWW ’06, pages 717–726,
New York, NY, USA, 2006. ACM Press.
[2] H. Chen, H. Shen, J. Xiong, S. Tan, and X. Cheng.
Social network structure behind the mailing lists:
ICT-IIIS at TREC 2006 expert finding track. In Text
REtrieval Conference (TREC), 2006.
[3] L. Denoyer and P. Gallinari. The Wikipedia XML corpus. SIGIR Forum, 40(1):64–69, June 2006.
[4] Y. Matsuo, M. Hamasaki, H. Takeda, J. Mori, D. Bollegara, Y. Nakamura, T. Nishimura, K. Hasida, and M. Ishizuka. Spinning Multiple Social Networks for Semantic Web. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI 2006), 2006.
[5] P. Mika. Flink: Semantic Web Technology for the Extraction and Analysis of Social Networks. Journal of Web Semantics, 3(2), 2005.
[6] G. Salton. Automatic text processing. Addison-Wesley,
Reading, MA, 1989.
[7] H. Zaragoza, J. Atserias, M. Ciaramita, and G. Attardi.
Semantically annotated snapshot of the English
Wikipedia v.0 (SW0).
http://www.yr-bcn.es/semanticWikipedia, 2007.