Faceted Wikipedia Search

Rasmus Hahn¹, Christian Bizer², Christopher Sahnwaldt¹, Christian Herta¹,
Scott Robinson¹, Michaela Bürgle¹, Holger Düwiger¹, Ulrich Scheel¹
¹ neofonie GmbH
firstname.lastname@neofonie.de
http://www.neofonie.de
Robert-Koch-Platz 4, 10115 Berlin, Germany
² Freie Universität Berlin
chris@bizer.de
http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/
Garystraße 21, 14195 Berlin, Germany
Abstract. Wikipedia articles contain, besides free text, various types
of structured information in the form of wiki markup. The type of wiki
content that is most valuable for search is the Wikipedia infobox, which
displays an article’s most relevant facts as a table of attribute-value pairs
on the top right-hand side of the Wikipedia page. Infobox data is not
used by Wikipedia’s own search engine. Standard Web search engines like
Google or Yahoo also do not take advantage of the data. In this paper,
we present Faceted Wikipedia Search, an alternative search interface for
Wikipedia, which uses infobox data in order to enable users to ask
complex questions against Wikipedia knowledge. By allowing users to
query Wikipedia like a structured database, Faceted Wikipedia Search
helps them to truly exploit Wikipedia’s collective intelligence.
Keywords: faceted search, faceted classification, Wikipedia, DBpedia,
knowledge representation
1 Introduction
This paper presents Faceted Wikipedia Search, an alternative search interface for
the English edition of Wikipedia. Faceted Wikipedia Search allows users to ask
complex questions, like “Which rivers flow into the Rhine and are longer than 50
kilometers?” or “Which skyscrapers in China have more than 50 floors and were
constructed before the year 2000?” against Wikipedia knowledge. Such questions
cannot be answered using keyword-based search as provided by Google, Yahoo
or Wikipedia’s own search engine.
In order to answer such questions, a search engine must make use of structured
knowledge, which needs to be extracted from the underlying articles. On the
user interface side, a search engine requires an interaction paradigm that
enables inexperienced users to express complex questions against a heterogeneous
information space in an exploratory fashion.
For formulating queries, Faceted Wikipedia Search relies on the faceted search
paradigm. Faceted search enables users to navigate a heterogeneous information
space by combining text search with a progressive narrowing of choices along
multiple dimensions [6, 7, 5]. The user subdivides an entity set into multiple
subsets. Each subset is defined by an additional restriction on a property. These
properties are called the facets. For example, facets of an entity type “person”
could be “nationality” and “year-of-birth”. By selecting multiple facets, the user
progressively expresses the different aspects that make up the overall question.
Realizing a faceted search interface for Wikipedia poses three challenges:
1. Structured knowledge needs to be extracted from Wikipedia with precision
and recall that are high enough to meaningfully answer complex queries.
2. As Wikipedia describes a wide range of different types of entities, a search
engine must be able to deal with a large number of different facets. As the
number of facets per entity type may also be high, the search engine must
apply smart heuristics to display only the facets that are likely to be relevant
to the user.
3. Wikipedia describes millions of entities. In order to keep response times low,
a search engine must be able to efficiently deal with large amounts of entity
data.
Faceted Wikipedia Search addresses these challenges by relying on two soft-
ware components: The DBpedia Information Extraction Framework is used to
extract structured knowledge from Wikipedia [4]. neofonie search, a commercial
search engine, is used as an efficient faceted search implementation.
This paper is structured as follows: Section 2 describes the Faceted Wikipedia
Search user interface and explains how facets are used for navigating and filtering
Wikipedia knowledge. Section 3 gives an overview of the DBpedia Information
Extraction Framework and the resulting DBpedia knowledge base. Section 4
describes how the efficient handling of facets is realized inside neofonie search.
Section 5 compares Faceted Wikipedia Search with related work.
2 User Interface
This section describes how queries are formulated as a series of refinements within
the Faceted Wikipedia Search user interface. Faceted Wikipedia Search is publicly
accessible at http://dbpedia.neofonie.de. Several example queries are found
at http://wiki.dbpedia.org/FacetedSearch. Figure 1 shows a screen shot of
the interface. The main elements of the interface are:
1. Text Search: Free-text search terms can be entered into this search field.
2. Faceted Navigation: The most frequent values of the relevant facets are dis-
played in the faceted navigation. The user can define filters by selecting or
entering values.
3. Your Filters: A breadcrumb navigation displays the selected facet values and
search terms. Facets and search terms can be disabled independently of each
other by clicking on the corresponding delete button.
4. Search Results: The search results contain the titles of the matching Wikipedia
articles, a teaser of each article’s text, and an image from each article (if one
exists).
Fig. 1. Screen shot of the Faceted Wikipedia Search user interface. The facets are shown
in the leftmost area of the screen. The numbers in brackets are the number of results
corresponding to each facet value. Your Filters: (Area 3) displays the breadcrumb
of the selected facet values: item type River with properties has mouth at Rhine and
length more than 50000.
To formulate the question “Which rivers flow into the Rhine and are longer
than 50 kilometers?”, a user would go through the following steps:
1. On the start page of Faceted Wikipedia Search, the user types the
value “River” into the facet item type. As a result, 12,432 “River” entities
are shown.
2. After the user selects “More Facets”, the has mouth at facet is displayed.
The user types “Rhine” into the has mouth at entry field, which restricts
the results to the 32 rivers that flow into the Rhine.
3. To define the numeric-range constraint, the user types 50000 into the “from”
field of the facet length (m). As a result, the 26 entities that match the
complete query are returned.
In addition to the exploration of the entity space using facets, users can also
mix full-text search with facet selection.
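The refinement steps above can be sketched in a few lines of Python. This is only an illustration of the progressive narrowing idea, not the neofonie search implementation: the `Entity` class, the `refine` function and the toy records (including the rough river lengths and the invented "Short Brook") are assumptions made for the example and are not DBpedia data.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    title: str
    item_type: str
    facets: dict = field(default_factory=dict)


def refine(entities, item_type=None, equals=None, at_least=None):
    """Return the entities that satisfy every given restriction."""
    equals = equals or {}
    at_least = at_least or {}
    result = []
    for e in entities:
        if item_type is not None and e.item_type != item_type:
            continue
        # exact-match restrictions, e.g. has mouth at = Rhine
        if any(e.facets.get(k) != v for k, v in equals.items()):
            continue
        # numeric "from" restrictions, e.g. length (m) >= 50000
        if any(k not in e.facets or e.facets[k] < lo for k, lo in at_least.items()):
            continue
        result.append(e)
    return result


# Hypothetical toy data; lengths are rough illustrative values in metres.
entities = [
    Entity("Moselle", "River", {"has mouth at": "Rhine", "length (m)": 544000}),
    Entity("Neckar", "River", {"has mouth at": "Rhine", "length (m)": 362000}),
    Entity("Short Brook", "River", {"has mouth at": "Rhine", "length (m)": 12000}),
    Entity("Inn", "River", {"has mouth at": "Danube", "length (m)": 518000}),
    Entity("Cologne", "City", {}),
]

rivers = refine(entities, item_type="River")                      # step 1
into_rhine = refine(rivers, equals={"has mouth at": "Rhine"})     # step 2
long_ones = refine(into_rhine, at_least={"length (m)": 50000})    # step 3
print([e.title for e in long_ones])  # ['Moselle', 'Neckar']
```

Each step operates on the result of the previous one, which mirrors how the breadcrumb in Your Filters accumulates restrictions.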
3 DBpedia
Faceted Wikipedia Search relies on the DBpedia knowledge base to answer queries.
The knowledge base is provided by the DBpedia project [4], a community effort
to extract structured information from Wikipedia and to make this information
available on the Web under an open license. This section describes the informa-
tion extraction framework that is used to generate the DBpedia knowledge base
as well as the knowledge base itself.
3.1 The DBpedia Extraction Framework
Wikipedia articles consist mostly of free text, but also contain various types of
structured information in the form of wiki markup. Such information includes
infobox templates, categorization information, images, geo-coordinates, links to
external Web pages, disambiguation pages, redirects between pages, and links
across different language editions of Wikipedia. The DBpedia project extracts
this structured information from Wikipedia and turns it into an RDF knowledge
base [9].
The type of Wikipedia content that is most valuable for the DBpedia extraction
is the infobox. Infoboxes display an article’s most relevant facts as a
table of attribute-value pairs on the top right-hand side of the Wikipedia page.
The Wikipedia infobox template system has evolved over time without central
coordination. Therefore, different communities of Wikipedia editors use differ-
ent templates to describe the same types of things (e.g. infobox_city_japan,
infobox_swiss_town and infobox_town_de). Different templates use different
names for the same property (e.g. birthplace and place-of-birth). As many
Wikipedia editors do not strictly follow the recommendations given on the page
that describes a template, property values are expressed using a wide range of
different formats and units of measurement.
In order to deal with the problems of synonymous attribute names and mul-
tiple templates being used for the same type of things, the DBpedia project
maps Wikipedia templates onto an ontology using a custom mapping language.
This ontology was created by manually arranging the 550 most commonly used
infobox templates within the English edition of Wikipedia into a subsumption
hierarchy consisting of 205 classes and by mapping 3200 infobox attributes
to 1843 properties of these classes. The property mappings define fine-grained
rules on how to parse infobox values and define target datatypes, which
help the parsers to process property values. For instance, if a mapping defines
the target datatype to be a list of links, the parser will ignore additional text
which may be present in the property value. The ontology currently uses 55
different datatypes. Values given in differing units of measurement are
normalized to one of these datatypes.
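The effect of such mappings can be illustrated with a minimal Python sketch. It does not use the DBpedia mapping language; the dictionaries, the class name "Town", the property names and the functions `parse_length` and `map_infobox` are hypothetical and only show how synonymous templates and attribute names, as well as differing units, can be reduced to one class, one property and one target unit.

```python
import re

# Hypothetical mapping tables; the template and attribute names are taken
# from the examples in the text, the targets are invented for illustration.
TEMPLATE_TO_CLASS = {
    "infobox_city_japan": "Town",
    "infobox_swiss_town": "Town",
    "infobox_town_de": "Town",
}
ATTRIBUTE_TO_PROPERTY = {
    "birthplace": "birthPlace",
    "place-of-birth": "birthPlace",
    "length": "length",
}
METRES_PER_UNIT = {"m": 1.0, "km": 1000.0, "mi": 1609.344}


def parse_length(raw):
    """Parse a value such as '1,233 km'; return metres, or None if unparseable."""
    match = re.fullmatch(r"\s*(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)\s*", raw)
    if not match:
        return None
    number, unit = match.groups()
    factor = METRES_PER_UNIT.get(unit.lower())
    if factor is None:
        return None
    return float(number.replace(",", "")) * factor


def map_infobox(template, attributes):
    """Map raw infobox attributes onto (class, {property: value})."""
    cls = TEMPLATE_TO_CLASS.get(template)
    mapped = {}
    for name, raw in attributes.items():
        prop = ATTRIBUTE_TO_PROPERTY.get(name)
        if prop is None:
            continue  # attributes without a mapping are skipped
        mapped[prop] = parse_length(raw) if prop == "length" else raw.strip()
    return cls, mapped


print(parse_length("50 km"))  # 50000.0
print(map_infobox("infobox_town_de", {"birthplace": " Berlin ", "mayor": "x"}))
```

Declaring the target datatype per property, as the sketch does for length, is what lets a parser decide which parts of a raw value to keep and which unit to convert to.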
3.2 The DBpedia Knowledge Base
The DBpedia knowledge base currently consists of around 479 million RDF
triples, which have been extracted from the English, German, French, Span-
ish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese, Chinese, Russian,
Finnish, Norwegian, Catalan, Ukrainian, Turkish, Czech, Hungarian, Romanian,
Volapük, Esperanto, Danish, Slovak, Indonesian, Arabic, Korean, Hebrew,
Lithuanian, Vietnamese, Slovenian, Serbian, Bulgarian, Estonian, and Welsh
versions of Wikipedia. The knowledge base describes more than 2.9 million en-
tities. For 1.1 million out of these entities, the knowledge base contains clean
infobox data which has been extracted using the mapping-based approach de-
scribed above. The knowledge base features labels and short abstracts in 30 dif-
ferent languages; 609,000 links to images; 3,150,000 links to external web pages;
415,000 Wikipedia categories, and 286,000 YAGO categories [12]. Table 1 gives
an overview of common DBpedia classes, and shows the number of instances and
some example properties for each class.
Besides being provided for download in the form of RDF dumps, the DBpedia
knowledge base is also accessible on the Web via a public SPARQL endpoint
and is served as Linked Data [2]. In order to enable DBpedia users to discover
further information, the DBpedia knowledge base is interlinked with various
other data sources on the Web according to the Linked Data principles [2]. The
knowledge base currently contains 4.9 million outgoing data links that point
at complementary data about DBpedia entities, as well as meta-information
about media items depicting an entity. Altogether, the Web of interlinked data
around DBpedia provides approximately 13.1 billion pieces of information (RDF
triples) and covers domains such as geographic information, people, companies,
films, music, genes, drugs, books, and scientific publications [1].
In the future, the data links between DBpedia and the external databases
will allow applications like Faceted Wikipedia Search to answer queries based not
only on Wikipedia knowledge but based on a world wide web of databases.
Ontology Class      Instances   Example Properties
------------------  ----------  ----------------------------------------------
Person                 282,000  name, birthdate, birthplace, employer, spouse
Artist                  54,262  activeyears, awards, occupation, genre
Actor                   26,009  academyaward, goldenglobeaward, activeyears
MusicalArtist           19,535  genre, instrument, label, voiceType
Athlete                 74,832  currentTeam, currentPosition, currentNumber
Politician              12,874  predecessor, successor, party
Place                  339,000  lat, long
Building                23,304  architect, location, openingdate, style
Airport                  7,971  location, owner, IATA, lat, long
Bridge                   1,420  crosses, mainspan, openingdate, length
Skyscraper               2,028  developer, engineer, height, architect, cost
PopulatedPlace         241,847  foundingdate, language, area, population
River                   12,432  sourceMountain, length, mouth, maxDepth
Organisation           119,000  location, foundationdate, keyperson
Band                    14,952  currentMembers, foundation, homeTown, label
Company                 20,173  industry, products, netincome, revenue
Educ.Institution        29,154  dean, director, graduates, staff, students
Work                   189,620  author, genre, language
Book                    15,677  isbn, publisher, pages, author, mediatype
Film                    44,680  director, producer, starring, budget, released
MusicalWork            101,985  runtime, artist, label, producer
Album                   74,055  artist, label, genre, runtime, producer, cover
Single                  24,597  album, format, releaseDate, band, runtime
Software                 5,652  developer, language, platform, license
TelevisionShow          10,169  network, producer, episodenumber, theme
Table 1. Common DBpedia classes with the number of their instances and example
properties.
4 Faceted Search Implementation
This section gives an overview of the requirements that had to be met by the
Faceted Wikipedia Search implementation as well as the approach that is used
to select the potentially relevant subset of facets that is displayed to the user
and the approach that is used to represent facet values in memory.
4.1 Requirements
In Faceted Wikipedia Search, each document is assigned to an item type, to which
the facets are in turn assigned. For example, a document about a person may
have a property nationality, but this property makes little sense for
a document about a celestial body. A facet age, however, would make sense for both
documents. Therefore, a collection of documents covering a large variety of
themes, like Wikipedia, needs a large total number of facets, but only a small
number of facets per document. The statistical characteristics of
the documents are shown in Table 2. In other scenarios, for example an online
shop, a much smaller total number of facets would be required, but documents
would have more facets in common, e.g. price.
Property                                             Value
-------------------------------------------------  ----------
number of documents                                  1,134,853
number of types                                            205
number of different facets                               1,843
average number of unique facets per document              14.8
number of values                                    24,368,625
number of unique values                              5,569,464
average number of values per document                     21.5
average number of values per facet                      13,222
Table 2. Statistical characteristics of the documents of Faceted Wikipedia Search.
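The two averages over values in Table 2 follow directly from the totals in the same table. The short check below recomputes them; the variable names are chosen for this example, and the last figure (occurrences per unique value) is derived here and does not appear in the table.

```python
# Totals as reported in Table 2.
documents = 1_134_853
facets = 1_843
values = 24_368_625
unique_values = 5_569_464

values_per_document = values / documents
values_per_facet = values / facets
# Derived here for illustration only; not a figure reported in Table 2.
occurrences_per_unique_value = values / unique_values

print(round(values_per_document, 1))           # 21.5
print(round(values_per_facet))                 # 13222
print(round(occurrences_per_unique_value, 1))  # 4.4
```

The contrast between roughly 21.5 values per document and more than 13,000 values per facet reflects the point made above: many facets overall, but few per document.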
For the user, only two aspects of a faceted-search system are readily apparent:
Facet Display: For any set of documents (e.g. a search result set), the facets and
facet values are displayed. For example, a user is returned a set of documents,
some of which correspond to the item type person. The user would be presented
the first-name facet of the document set. The user would then see that there
are 53 persons named John, 63 persons named James, etc.
Faceted Search: The user can narrow the set of chosen documents based on the
value of one or more facets by selecting a facet value. Technically, this is the
intersection of one set of documents with another set which has the specific,
selected values for the corresponding facets. Generally, for an implementation
of a faceted-search system, this means that when the user selects a facet value,
the search results reflect this selection.
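Both aspects can be made concrete with a small Python sketch. It assumes a toy in-memory collection `docs` and three helper functions whose names are invented for this example; it illustrates counting and set intersection and says nothing about how neofonie search performs them.

```python
from collections import Counter

# Hypothetical toy collection: document id -> {facet: [values]}.
docs = {
    1: {"item type": ["Person"], "first-name": ["John"]},
    2: {"item type": ["Person"], "first-name": ["James"]},
    3: {"item type": ["Person"], "first-name": ["John"]},
    4: {"item type": ["River"], "has mouth at": ["Rhine"]},
}


def facet_counts(doc_ids, facet):
    """Facet display: number of documents per value of one facet."""
    counts = Counter()
    for d in doc_ids:
        for v in set(docs[d].get(facet, [])):
            counts[v] += 1
    return counts


def docs_with(facet, value):
    """All documents that carry the given value in the given facet."""
    return {d for d, f in docs.items() if value in f.get(facet, [])}


def narrow(doc_ids, facet, value):
    """Faceted search: intersect the current set with the matching documents."""
    return set(doc_ids) & docs_with(facet, value)


result = set(docs)
print(facet_counts(result, "first-name").most_common())  # [('John', 2), ('James', 1)]
result = narrow(result, "first-name", "John")
print(sorted(result))  # [1, 3]
```

Because narrowing is a plain set intersection, it composes with any other way of producing a document set, including a full-text query.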
4.2 Actual Implementation
In our implementation of faceted search, these two aspects were implemented
largely independently of one another. This is mainly due to the fact
that both aspects were implemented in an existing information retrieval engine¹,
which already had various methods for document selection. Most notably, the
possibility of performing boolean queries [10] is a previously existing feature of
the search engine. Some adaptations were necessary to index facet values within
documents (no normalization, special tokenization). In the case of faceted search,
the user only selects from the values that are presented, without
having to enter keywords. Therefore, there is no need for normalization
and tokenization of facet values.
¹ We used our proprietary full-text retrieval system for the implementation.
Facet Display
The selection of the facet values to be displayed depends on the
number of corresponding documents in the currently displayed document set.
This set is determined by a previous search. Faceted Wikipedia Search offers the
user two ways of searching: first, through the selection of facet values and
second, through a traditional full-text search.
The facet values are presented to the user as a list. The order of the values
depends on the number of documents that have a particular facet
value. That is, for the selected document set and for each facet, the number of
documents with the same facet value is counted. The values are then ordered by
the absolute number of documents corresponding to each facet value.
This ordering by number of occurrences is not necessarily the only or the most
comprehensible one for the user; many facets have a natural order (mainly
numeric facets such as year), but Faceted Wikipedia Search does not make use
of it.
Due to limitations of the user interface and diversity of documents, not all
facets and their values can be presented in a user friendly way. Therefore, the
facet values which are displayed are limited to a set that can clearly be rep-
resented. The set of facets which is retrieved from the system is preselected
at the time of the query. These we define as the target facets. This selection
is primarily done to keep the number of round trips, and the amount of data
transferred, small. This issue is readily apparent in the DBpedia system, as the
documents are heterogeneous, i.e. many facets are only defined for a small subset
of documents and only a few facets are shared between all documents.
To determine the target facets which are queried, we distinguish between
three cases:
1. At the start of the user-session (without any search or selection) only the
item type facet is displayed.
2. If there is no item type selected, the most generic facets, such as item-type,
location and year-of-appearance, are target facets, since these are facets of the
majority of documents.
3. If the user has selected an item type, the most frequent facets of the item
type are target facets.
The resulting target facets (of cases 2 and 3) are ranked according to the
frequency of their most frequent facet value. Only the target facets with the
highest value frequencies are displayed.
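The three cases and the subsequent ranking can be sketched as follows. The function names, the `per_type` and `limit` parameters and the toy documents are assumptions made for this example; in particular, skipping the already selected item type facet in case 3 is a simplification of this sketch and is not stated in the text.

```python
from collections import Counter

GENERIC_FACETS = ["item type", "location", "year-of-appearance"]


def target_facets(docs, selected_type=None, session_start=False, per_type=3):
    """Choose the facets to query, following the three cases above."""
    if session_start:
        return ["item type"]                 # case 1
    if selected_type is None:
        return list(GENERIC_FACETS)          # case 2
    usage = Counter()                        # case 3: facet usage within the type
    for d in docs.values():
        if selected_type in d.get("item type", []):
            # assumption of this sketch: the selected item type facet is skipped
            usage.update(f for f in d if f != "item type")
    return [f for f, _ in usage.most_common(per_type)]


def rank_by_top_value(docs, result_ids, facets, limit=2):
    """Rank facets by the document count of their most frequent value."""
    scored = []
    for f in facets:
        counts = Counter(v for d in result_ids for v in set(docs[d].get(f, [])))
        if counts:
            scored.append((counts.most_common(1)[0][1], f))
    scored.sort(key=lambda s: -s[0])  # stable sort keeps the input order on ties
    return [f for _, f in scored[:limit]]


# Hypothetical toy documents.
docs = {
    1: {"item type": ["River"], "has mouth at": ["Rhine"], "country": ["Germany"]},
    2: {"item type": ["River"], "has mouth at": ["Rhine"], "country": ["France"]},
    3: {"item type": ["River"], "has mouth at": ["Danube"], "source": ["Alps"]},
    4: {"item type": ["Person"], "nationality": ["German"]},
}

targets = target_facets(docs, selected_type="River", per_type=2)
print(targets)                                               # ['has mouth at', 'country']
print(rank_by_top_value(docs, {1, 2, 3}, targets, limit=1))  # ['has mouth at']
```

Selecting target facets before the query is evaluated keeps the number of facets that must be counted, and hence the amount of data transferred, small.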
Faceted Search
Conceptually, the facet information in the data is a set of tuples of facet,
document and value, where a tuple $(f, d, v)$ represents that a document $d$ has a
value $v$ in the facet $f$. After the selection of a subset of documents $D_q$ as a
result of a query $q$, and a choice of the facets $F_q$, the set of resulting facet
values must be calculated. For each facet and value, the number of documents for
this combination is returned as a response to the query. That is, given a subset of
documents $D_q$ and a subset of facets $F_q$, we must calculate
$|\{(f, d, v) \mid d \in D_q\}|$ for each $v$ and $f \in F_q$ efficiently. We do
not know the set of documents $D_q$ in advance, which leads to the difficulty in
calculating the number of these tuples. We also do not know the set of facets $F_q$
in advance, but as there are not many facets in total, this does not pose much of a
problem. However, as we have a sizable amount of documents, this forces us to use a
data representation which allows us to represent the facet values efficiently. To
accomplish this, we use a (sparse) tree.
In the tree, the facet values are stored in a hierarchical data structure with
three levels, the first level being the facets, the second level being the documents
and the third level the values. This particular ordering is not strictly mandatory,
but since the query output is ordered by facet first, it is more memory-efficient
to reflect this in the data-structure, since each facet can then be calculated
independently. This also allows for more efficient memory usage in the case that
not all facets are queried, as we only need facets to be in the memory when they
are being used.
As it turns out, this design is especially useful for the DBpedia use case
where there is a large number of facets in relation to the number of documents.
By having the facets in the first level of the tree structure, the amount of data
to be examined is efficiently reduced.
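A facet-first layout of this kind can be approximated with nested dictionaries. The following Python sketch is not the data structure used inside neofonie search, whose details are not given here; the class `FacetIndex` and its methods are hypothetical and only show why putting facets on the first level lets each requested facet be counted independently, touching no data of facets that were not requested.

```python
from collections import Counter, defaultdict


class FacetIndex:
    """Three levels: facet -> document id -> list of values (sparse)."""

    def __init__(self):
        self._tree = defaultdict(dict)

    def add(self, facet, doc_id, value):
        self._tree[facet].setdefault(doc_id, []).append(value)

    def counts(self, doc_ids, facets):
        """For each requested facet, count the tuples (f, d, v) with d in doc_ids, per value v."""
        result = {}
        for f in facets:
            by_doc = self._tree.get(f, {})  # only this facet's subtree is touched
            counter = Counter()
            if len(doc_ids) <= len(by_doc):
                # few result documents: look each one up in the facet's subtree
                for d in doc_ids:
                    counter.update(by_doc.get(d, ()))
            else:
                # sparse facet: scan its documents and test membership instead
                for d, values in by_doc.items():
                    if d in doc_ids:
                        counter.update(values)
            result[f] = counter
        return result


# Hypothetical toy tuples (f, d, v).
index = FacetIndex()
for f, d, v in [
    ("nationality", 1, "German"),
    ("nationality", 2, "French"),
    ("nationality", 3, "German"),
    ("year-of-birth", 1, "1950"),
    ("has mouth at", 4, "Rhine"),
]:
    index.add(f, d, v)

facet_result = index.counts({1, 3}, ["nationality", "year-of-birth"])
print(facet_result["nationality"])    # Counter({'German': 2})
print(facet_result["year-of-birth"])  # Counter({'1950': 1})
```

With documents on the first level instead, every document in the result set would have to be visited even for facets that almost no document carries.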
5 Related Work
The following section is dedicated to discussing a sample of the related work on
faceted search and Wikipedia information extraction.
Faceted Search. An early faceted search prototype was the “flamenco” [5]
system developed at the University of California. “flamenco” is implemented on
top of a SQL-database and uses the group by-command and specific optimiza-
tions [3]. This setup was developed without a full-text search engine.
In Yitzhak et al. [14], a hierarchically faceted search implementation in the
Lucene search library is described. The facet value information for the
documents is stored in the payload of a dedicated posting list (FacetInfo). The
values are counted with a customized HitCollector for the target facets. The main
difference to our approach is that their method aggregates by document first,
while our approach aggregates by facet. In our opinion, our implementation is
better suited for the Wikipedia document collection (see Section 4.2).
Today, many commercial websites use faceted search. Examples include eBay
and Amazon. A faceted search system that works on similar content as Faceted
Wikipedia Search is Freebase Parallax². Parallax focuses on extending faceted
search to a chained-sets navigation paradigm, while Faceted Wikipedia Search
aims at providing a simple, self-explanatory search interface for Wikipedia.
² http://www.freebase.com/labs/parallax/
Extraction of structured Wikipedia content. A second Wikipedia
knowledge extraction effort is the Freebase Wikipedia Extraction (WEX) [11].
Freebase³ is a commercial company that builds a huge online database which
users can edit in a similar way as editing Wikipedia articles. Freebase employs
Wikipedia knowledge as initial content for their database that will afterwards
be edited by Freebase users. By synchronizing the DBpedia knowledge base with
Wikipedia, DBpedia in contrast relies on the existing Wikipedia community to
update content. Since November 2008, Freebase has been published as Linked Data,
and DBpedia as well as Freebase include data links pointing to corresponding
entities in the respective other data source. These links allow applications to
fuse DBpedia and Freebase knowledge.
A third project that extracts structured knowledge from Wikipedia is the
YAGO project [12]. YAGO extracts 14 relationship types, such as subClassOf,
type, diedInYear, bornInYear, locatedIn etc., from the Wikipedia category system
and from Wikipedia redirects. YAGO does not perform an infobox extraction
like DBpedia. The YAGO and DBpedia projects cooperate and we serve the
resulting YAGO classification together with the DBpedia knowledge base.
In [13] the KOG system is presented, which refines existing Wikipedia in-
foboxes based on machine learning techniques using both SVMs and a more
powerful joint-inference approach expressed in Markov Logic Networks. In con-
junction with DBpedia, KOG gives Wikipedia authors valuable insights about
inconsistencies and possible improvements of infobox data.
NLP-based knowledge extraction. There is a vast number of approaches
employing natural language processing techniques to obtain semantics from
Wikipedia. Yahoo! Research Barcelona, for example, published a semantically
annotated snapshot of Wikipedia⁴, which is used by Yahoo for entity
ranking [15]. A commercial venture in this context is the Powerset search engine⁵,
which uses NLP both for understanding queries in natural language and for
retrieving relevant information from Wikipedia. Further potential for the DBpedia
extraction as well as for the NLP field in general lies in the idea of using huge
bodies of background knowledge, like DBpedia, to improve the results of
NLP algorithms [8].
6 Conclusion
We have presented Faceted Wikipedia Search, an alternative search interface for
Wikipedia, which uses infobox data in order to enable users to ask complex
queries against Wikipedia. The answers to these queries are not generated using
keyword matching as in the search engines Google or Yahoo, but are generated
based on structured knowledge that has been extracted and combined from many
different Wikipedia articles.
³ http://www.freebase.com
⁴ http://www.yr-bcn.es/dokuwiki/doku.php?id=semantically_annotated_snapshot_of_wikipedia
⁵ http://www.powerset.com
In future projects, we plan to extend the user interface of Faceted Wikipedia
Search with more sophisticated facet value selection components like maps,
timeline widgets and the automatic binning of numerical and date values. We
also plan to complement and extend the application’s knowledge base by fusing
Wikipedia infobox data with additional data from external Linked Data sources.
References
1. Christian Bizer. The emerging web of linked data. IEEE Intelligent Systems,
24:87–92, 2009.
2. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far.
Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
3. Kevin Chen. Computing query previews in the flamenco system. Technical report,
University of Berkeley, 2004.
4. Christian Bizer et al. DBpedia - a crystallization point for the web of data. Journal
of Web Semantics, 7(3):154–165, 2009.
5. Jennifer English, Marti Hearst, Rashmi Sinha, Kirsten Swearingen, and Ka-Ping
Yee. Flexible search and navigation using faceted metadata. Technical report,
University of Berkeley, 2002.
6. Marti Hearst, Ame Elliott, Jennifer English, Rashmi Sinha, Kirsten Swearingen,
and Ka-Ping Yee. Finding the flow in web site search. Commun. ACM, 45(9):42–49,
2002.
7. Marti A. Hearst. UIs for faceted navigation: Recent advances and remaining open
problems. In HCIR08 Second Workshop on Human-Computer Interaction and
Information Retrieval. Microsoft, October 2008.
8. Junichi Kazama and Kentaro Torisawa. Exploiting wikipedia as external knowledge
for named entity recognition. In Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, 2007.
9. Graham Klyne and Jeremy Carroll. Resource description framework (RDF):
Concepts and abstract syntax - W3C recommendation.
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/, 2004.
10. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction
to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
11. Metaweb Technologies. Freebase wikipedia extraction (wex).
http://download.freebase.com/wex/, 2009.
12. Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A large ontol-
ogy from wikipedia and wordnet. Journal of Web Semantics, 6(3):203–217, 2008.
13. Fei Wu and Daniel Weld. Automatically Refining the Wikipedia Infobox Ontology.
In Proceedings of the 17th World Wide Web Conference, 2008.
14. Ori B. Yitzhak, Nadav Golbandi, Nadav Har’el, Ronny Lempel, Andreas Neu-
mann, Shila O. Koifman, Dafna Sheinwald, Eugene Shekita, Benjamin Sznajder,
and Sivan Yogev. Beyond basic faceted search. In WSDM ’08: Proceedings of the
international conference on Web search and web data mining, pages 33–44, New
York, NY, USA, 2008. ACM.
15. Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Cia-
ramita, and Giuseppe Attardi. Ranking very many typed entities on wikipedia. In
CIKM ’07: Proceedings of the sixteenth ACM conference on Conference on infor-
mation and knowledge management, pages 1015–1018, New York, NY, USA, 2007.
ACM.
... Malheureusement, aucune mesure d'intérêt ne filtre les différences inintéressantes et de ce fait, les attributs différents pour chaque entité mais inintéressants (comme les identifiants) sont aussi extraits. Les tâches les plus proches de la nôtre sont la génération de modèles d'infoboîte (Wu et Weld, 2008) et l'extraction de facettes (Hahn et al., 2010;Oren et al., 2006;Feddoul et al., 2019). Premièrement, une infoboîte est un ensemble de paires attribut-valeur décrivant une entité. ...
... Il existe quelques méthodes d'extraction automatique de facettes. Pour une classe donnée, Hahn et al. (2010) extraient des modèles d'infoboîtes les attributs dont les valeurs sont fréquemment observées. De même, (Oren et al., 2006) mesure la qualité d'un attribut en privilégiant les attributs fréquemment utilisés dont les valeurs sont peu nombreuses et uniformément réparties. ...
Conference Paper
Full-text available
Les tableaux comparatifs sont utiles pour comparer des entités en dégageant leurs similarités et leurs différences non triviales. Le choix manuel des caractéristiques de comparaison reste une tâche complexe et fastidieuse. Cet article présente VERSUS qui est la première méthode automatique de génération de tableaux comparatifs à partir du Web sémantique. Pour cela, nous introduisons une mesure, nommée niveau de référence contextuel, pour évaluer si une propriété peut être une caractéristique intéressante pour comparer des entités. Cette mesure repose sur des contextes qui sont des ensembles d'entités similaires aux entités comparées. Nous montrons comment VERSUS sélectionne ces contextes et comment il évalue efficacement le niveau de référence contextuel à partir d'un point d'accès SPARQL public. Nous avons construit un benchmark à partir de Wikidata pour évaluer l'efficacité de VERSUS, avec une campagne d'évaluation manuelle des caractéristiques : la précision et le rappel sont élevés.
... Unfortunately, no interestingness measure filters out irrelevant paths leading to too many attributes (including irrelevant ones like identifiers). The tasks closest to ours are the infobox template generation [11,17] and the facet extraction [5,6,8,15]. First, an infobox is a set of attribute-value pairs describing an entity. ...
... There are a few automatic facet extraction methods. For a given class, [6] extracts from the infobox templates the attributes whose values are frequently observed. Similarly, [8] measures the quality of an attribute by favoring frequently used attributes whose values are few and uniformly distributed. ...
Chapter
Full-text available
Comparison table is an efficient tool for comparing a small number of entities for decision making to analyze the main similarities and differences. The manual choice of their comparison features remains a complex and tedious task. This paper presents \(\textsc { Versus}\), which is the first automatic method for generating comparison tables from knowledge bases of the Semantic Web. For this purpose, we introduce the contextual reference level to evaluate whether a feature is relevant to compare a set of entities. This measure relies on contexts that are sets of entities similar to the compared entities. Its principle is to favor the features whose values for the compared entities are reference (or frequent) in these contexts. We show how to select these contexts and how to efficiently evaluate the contextual reference level from a public SPARQL endpoint limited by a fair-use policy. Using our publicly available benchmark based on Wikidata, the experiments show the interest of the contextual reference level for identifying the features deemed relevant by users with high precision and recall. In addition, the proposed optimizations significantly reduce the execution time and the number of required queries.
... To help users find information from KBs, many researchers have developed faceted search methods for KBs (Arenas et al., 2004; Arenas et al., 2014; Bast et al., 2014; Brunk and Heim, 2011; Ferré, 2014; Hahn, 2010; Moreno-Vega and Hogan, 2018; Papadakos and Tzitzikas, 2014; Sherkhonov et al., 2017) that allow users to interactively search over a KB by specifying values of interest in predefined facets, thereby browsing entities stored in the KB. These methods are useful in particular for non-expert users who are not familiar with the structure of the KB. ...
Article
Purpose: The purpose of this paper is to propose a scheme that allows users to interactively explore relations between entities in knowledge bases (KBs). KBs store a wide range of knowledge about real-world entities in a structured form as (subject, predicate, object) triples. Although it is possible to query entities and the relations among them with SPARQL query expressions or keyword queries, the structure and the vocabulary are complicated, and it is hard for non-expert users to get the desired information. For this reason, many researchers have proposed faceted search interfaces for KBs. Nevertheless, existing ones are designed for finding entities and are insufficient for finding relations. Design/methodology/approach: To address this problem, the authors propose a novel “relation facet” to find relations between entities. To generate it, they applied clustering to predicates, grouping those predicates that are connected to common objects. Having generated clusters of predicates, the authors generated a facet according to the result. Specifically, they proposed to use two clustering algorithms, namely agglomerative hierarchical clustering (AHC) and CANDECOMP/PARAFAC (CP) tensor decomposition, which is one of the tensor decomposition methods. Findings: The authors experimentally tested the performance of the clustering methods and found that AHC performs better than tensor decomposition. In addition, the authors conducted a user study and showed that their proposed scheme performs better than existing ones in the task of searching for relations. Originality/value: The authors propose a relation-oriented faceted search method for KBs that allows users to explore relations between entities. As far as the authors know, this is the first method to focus on the exploration of relations between entities.
... Both systems allow for very fast response times even when large portions of the data match the facet selection. Some further prototypes facilitate indexing and faceted browsing of RDF data [10,11,12,13,14,15,16,17] as well. But since these systems require an index at runtime, they offer limited flexibility regarding the kinds of queries that can be posed. ...
Article
In this work, we present a schema-agnostic faceted browsing benchmark generation framework for RDF data and SPARQL engines. Faceted search is a technique that allows narrowing down sets of information items by applying constraints over their properties, where facets correspond to properties of these items. While our work can be used to realise real-world faceted search user interfaces, our focus lies on the construction and benchmarking of faceted search queries over knowledge graphs. The RDF model exhibits several traits that seemingly make it a natural foundation for faceted search: all information items are represented as RDF resources, property values typically already correspond to meaningful semantic classifications, and with SPARQL there is a standard language for uniformly querying instance and schema information. However, although faceted search is ubiquitous today, it is typically not performed on the RDF model directly. Two major sources of concern are the complexity of query generation and the query performance. To overcome the former, our framework comes with an intermediate domain-specific language. Our approach is thus SPARQL-driven, which means that every faceted search information need is intensionally expressed as a single SPARQL query. In regard to the latter, we investigate the possibilities and limits of real-time SPARQL-driven faceted search on contemporary triple stores. We report on our findings by evaluating the performance and correctness characteristics of systems when executing a benchmark generated using our generation framework. All components, namely the benchmark generator, the benchmark runners and the underlying faceted search framework, are freely available as open source.
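To make the idea of expressing a faceted search information need as a single SPARQL query more concrete, here is a minimal sketch that builds one facet-count query from a list of already applied constraints. It is not the domain-specific language or query generator of the framework described above; the function name `facet_count_query`, the variable names `?item`, `?value`, `?count`, and the `http://example.org/` IRIs are assumptions chosen for illustration. The sketch also assumes that all constraint values are IRIs rather than literals.

```python
def facet_count_query(constraints, facet):
    """Build one SPARQL query that counts items per value of `facet`,
    restricted to items satisfying every (property, value) constraint.

    Illustrative only: property and value arguments are assumed to be
    full IRIs; literals and property paths are not handled.
    """
    patterns = [f"  ?item <{prop}> <{value}> ." for prop, value in constraints]
    patterns.append(f"  ?item <{facet}> ?value .")
    body = "\n".join(patterns)
    return (
        "SELECT ?value (COUNT(DISTINCT ?item) AS ?count)\n"
        "WHERE {\n"
        f"{body}\n"
        "}\n"
        "GROUP BY ?value\n"
        "ORDER BY DESC(?count)"
    )


# Hypothetical IRIs: items already narrowed to one type, counted by country.
query = facet_count_query(
    [("http://example.org/type", "http://example.org/City")],
    "http://example.org/country",
)
print(query)
```

Each applied constraint becomes one triple pattern on `?item`, and the facet whose values should be offered next is grouped and counted, so narrowing the result set only adds patterns to the same query.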
Article
This paper presents Versus, which is the first automatic method for generating comparison tables from knowledge bases of the Semantic Web. For this purpose, it introduces the contextual reference level to evaluate whether a feature is relevant for comparing a set of entities. This measure relies on contexts, which are sets of entities similar to the compared entities. Its principle is to favor the features whose values for the compared entities are reference (or frequent) values in these contexts. The proposal efficiently evaluates the contextual reference level from a public SPARQL endpoint limited by a fair-use policy. Using a new benchmark based on Wikidata, the experiments show the value of the contextual reference level for identifying, with high precision and recall, the features deemed relevant by users. In addition, the proposed optimizations significantly reduce the number of required queries for properties as well as for inverse relations. Interestingly, this experimental study also shows that the inverse relations bring out a large number of numerical comparison features.
Article
Faceted Search is a widely used interaction scheme in digital libraries, e-commerce, and recently also in Linked Data. Surprisingly, object ranking in the context of Faceted Search is not well studied in the literature. In this article, we propose an extension of the model with two parameters that enable specifying the desired answer size and the granularity of the sought object ranking. These parameters allow tackling the problem of too big or too small answers and can specify how refined the sought ranking should be. Then, we provide an algorithm that takes as input these parameters and by considering the hard-constraints (filters), the soft-constraints (preferences), as well as the statistical properties of the dataset (through various frequency-based ranking schemes), produces an object ranking that satisfies these parameters, in a transparent way for the user. Then, we present extensive simulation-based evaluation results that provide evidence that the proposed model also improves the answers and reduces the user’s cost. Finally, we propose GUI extensions that are required and present an implementation of the model.
Article
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions: the Web of Data. In this article we present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. We describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
Article
We have developed an innovative search interface that allows non-expert users to move through large information spaces in a flexible manner without feeling lost. The design goal was to offer users a "browsing the shelves" experience seamlessly integrated with focused search. Key to achieving our goal is the explicit exposure of hierarchical faceted metadata in a manner that is intuitive and inviting to users. After several iterations of design and testing, the usability results are strikingly positive. We believe our approach marks a major step forward in search user interfaces and can serve as a model for web-based collections of up to 100,000 items.
Article
The term "Linked Data" refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions: the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
Article
The Flamenco system (7, 6) is a web search interface that allows users to browse through large data sets using predefined hierarchical faceted metadata. It is built on top of a conventional relational database and currently scales to collections of several tens of thousands of items. In the current implementation, the system translates each user query into multiple SQL group-by commands in order to obtain query preview information for possible future queries. These group-by's take up a significant fraction of the query processing time. In this note, we describe an optimization that allows us to speed up the group-by computations dramatically. Our ideas have some similarity to the work of Beyer and Ramakrishnan on computing iceberg data cubes (4).
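The baseline behavior described in this abstract, one SQL group-by per facet to obtain query previews, can be sketched with Python's built-in `sqlite3` module. This shows only the unoptimized pattern, not the optimization proposed in the note, and it is not Flamenco's actual schema or code: the `items` table, the facet columns `medium` and `period`, and the function `query_previews` are hypothetical.

```python
import sqlite3

# Hypothetical flat schema: one row per item, one column per facet.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, medium TEXT, period TEXT)"
)
conn.executemany(
    "INSERT INTO items (medium, period) VALUES (?, ?)",
    [
        ("painting", "1800s"),
        ("painting", "1900s"),
        ("sculpture", "1900s"),
        ("print", "1900s"),
    ],
)

FACETS = ("medium", "period")


def query_previews(conn, selected):
    """Return, for every facet not yet selected, the number of matching
    items per facet value, using one GROUP BY query per facet."""
    for facet in selected:
        if facet not in FACETS:  # column names are interpolated, so validate
            raise ValueError(f"unknown facet: {facet}")
    where = " AND ".join(f"{facet} = ?" for facet in selected) or "1 = 1"
    params = list(selected.values())
    previews = {}
    for facet in FACETS:
        if facet in selected:
            continue
        rows = conn.execute(
            f"SELECT {facet}, COUNT(*) FROM items WHERE {where} GROUP BY {facet}",
            params,
        ).fetchall()
        previews[facet] = dict(rows)
    return previews


previews = query_previews(conn, {"period": "1900s"})
```

Because every user query triggers one aggregation per remaining facet, the cost grows with the number of facets, which is consistent with the abstract's remark that the group-by commands take up a significant fraction of query processing time.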
Article
Faceted navigation is a proven technique for supporting exploration and discovery within an information collection. The underlying data model is simple enough to make navigation understandable while at the same time rich enough to make navigation flexible in a wide range of domains. Nonetheless, there remain issues in both the presentation of navigation options in the interface and in how to extend the model to allow more flexible discovery while still retaining understandability. This paper explores both of these issues.
Article
Designing a search system and interface may best be served (and executed) by scrutinizing usability studies.
Article
The DBpedia project is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web. The resulting DBpedia knowledge base currently describes over 2.6 million entities. For each of these entities, DBpedia defines a globally unique identifier that can be dereferenced over the Web into a rich RDF description of the entity, including human-readable definitions in 30 languages, relationships to other resources, classifications in four concept hierarchies, various facts as well as data-level links to other Web data sources describing the entity. Over the last year, an increasing number of data publishers have begun to set data-level links to DBpedia resources, making DBpedia a central interlinking hub for the emerging Web of Data. Currently, the Web of interlinked data sources around DBpedia provides approximately 4.7 billion pieces of information and covers domains such as geographic information, people, companies, films, music, genes, drugs, books, and scientific publications. This article describes the extraction of the DBpedia knowledge base, the current status of interlinking DBpedia with other data sources on the Web, and gives an overview of applications that facilitate the Web of Data around DBpedia.