Towards an Evaluation of Semantic Searching in Digital
Repositories: A DSpace Case-Study
Georgia D. Solomou and Dimitrios A. Koutsomitropoulos
High Performance Information Systems Laboratory (HPCLab),
Computer Engineering and Informatics Dept., School of Engineering,
University of Patras, Building B, 26500 Patras-Rio, Greece
Abstract. Successful learning infrastructures and repositories often depend on
well-organized content collections for effective dissemination, maintenance and
preservation of resources. By combining semantic descriptions already present or
implicit within their descriptive metadata, reasoning-based or semantic searching
of these collections can be enabled, producing novel possibilities for content
browsing and retrieval. The specifics and necessities of such an approach
however make it hard to assess and measure its effectiveness. Therefore, in this
paper we introduce a concrete methodology towards a pragmatic evaluation of
semantic searching in such scenarios, which is exemplified through the Seman-
tic Search plugin we have developed for the popular DSpace repository system.
Our results reveal that this approach can be appealing to expert and novice users
alike, improve the effectiveness of content discovery and enable new retrieval
possibilities in comparison to traditional, keyword-based search.
Keywords: Digital libraries, learning objects, information search and retrieval,
knowledge acquisition, ontology design.
1 Introduction

Digital repository systems that exist today offer an out-of-the-box and easily
customizable solution to educational institutions and organizations for managing
their intellectual outcome. Some of the most popular platforms include DSpace,
with over a thousand instances worldwide, EPrints, Digital Commons and
CONTENTdm, whereas ETD-db and Fedora have a notable presence as well.
In such a setting, semantic technologies can enable new dimensions in content dis-
coverability and retrieval, by allowing the intelligent combination of existing metada-
ta in order to enable semantic query answering amongst repository resources.
For such an approach to be appealing and thus useful to end-users, it has to
conceal its inherent complexity, be as intuitive as possible, offer added value and
match user needs. To assess the extent to which these requirements are met, a
careful evaluation naturally has to be conducted.
However, not only is the evaluation of semantic search systems inherently
difficult, but the particular scenario of semantic query answering in digital
repositories also creates distinct requirements that have to be taken into account,
thus deviating from standard approaches. This, combined with the lack of existing
relevant paradigms, can potentially obscure the added value semantic search has to
offer for digital repositories.
To overcome these issues, in this paper we present a methodology and procedure
for evaluating semantic query answering in digital repositories. To this end, we use
the Semantic Search plugin that we have developed for the DSpace repository system
and which is openly available to the community. After briefly introducing Semantic
Search for DSpace, we review relevant literature and discuss the evaluation method-
ology we actually followed. In doing so, the specific evaluation criteria comprising
our methodology as well as the distinct evaluation phases are identified and their as-
sociation to the entailment-centric character of semantic searching in repositories is
pointed out. Finally, we present the concrete evaluation procedure, discuss its results
and outline our observations w.r.t. system behavior, user appeal and future
improvements.
The rest of this paper is organized as follows: Section 2 gives an overview of
Semantic Search for DSpace, a self-contained plugin for semantic query answering in
the DSpace digital repository system, as well as in any web accessible ontology doc-
ument. Section 3 describes our evaluation methodology and procedure, starting with
the specific necessities for digital repositories, reviewing related work and then doc-
umenting the evaluation criteria it is comprised of. Sections 4 and 5 present and dis-
cuss the process and results of applying this methodology for evaluating semantic
search in DSpace in two phases, respectively. Finally, Section 6 summarizes our
conclusions and outlook.
2 Semantic Search for DSpace
Semantic Search v2 for DSpace is the outcome of an initiative to enable semantic
searching capabilities for web-based digital repository systems.
The main idea in v2 is to break down the intended user query into a series of atoms
and to assign them to different parts of a dynamic UI, enriched with auto-complete
features, intuitive syntax-checking warnings and drop-down menus (Fig. 1). In fact,
these atoms correspond to the building blocks of a Manchester Syntax expression.
Each such expression is an atomic or anonymous class in the Web Ontology
Language (OWL) and its (both inferred and asserted) members are the answers to
the query.
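To illustrate the idea of query atoms, the following sketch shows how a step-by-step interface might assemble atoms into a single Manchester Syntax class expression. The class and property names (Item, creator, dateIssued) are hypothetical, not the plugin's actual vocabulary.

```python
# Sketch: assembling a Manchester Syntax class expression from UI "atoms",
# mirroring how a step-by-step interface might build a query.
# Class and property names below are hypothetical.

def build_expression(atoms):
    """Join query atoms into a single Manchester Syntax conjunction."""
    return " and ".join(atoms)

atoms = [
    "Item",                        # an atomic class
    "creator some Person",         # existential restriction
    'dateIssued value "2010"',     # hasValue restriction
]

expression = build_expression(atoms)
print(expression)
# Item and creator some Person and dateIssued value "2010"
```

The resulting expression denotes an anonymous class whose asserted and inferred members are returned as search results.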
Search is conducted against the DSpace ontology, which is automatically populat-
ed through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-
PMH) interface, and results are presented as a browsable list. Effective querying
of the knowledge base is accomplished by interconnecting to an appropriate inference
engine, capable of reasoning with OWL 2 ontologies.
Fig. 1. The auto-complete and query construction mechanisms of the semantic search interface.
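The ontology-population step described above can be sketched as a mapping from a harvested OAI-PMH Dublin Core record to simple property assertions. The XML snippet and the flat property mapping below are illustrative only, not the plugin's actual DSpace-ontology transformation.

```python
# Sketch: turning a harvested oai_dc record into (subject, property, value)
# triples. The record and mapping are illustrative, not the actual plugin logic.
import xml.etree.ElementTree as ET

SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Semantic Search for DSpace</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
</oai_dc:dc>"""

def record_to_triples(xml_text, subject):
    """Map each Dublin Core element of a record to a flat triple."""
    root = ET.fromstring(xml_text)
    triples = []
    for child in root:
        # the local name of the DC element becomes the property, e.g. "title"
        prop = child.tag.rsplit("}", 1)[-1]
        triples.append((subject, prop, child.text))
    return triples

print(record_to_triples(SAMPLE, "item:123"))
```

In the real service these assertions would be loaded into the DSpace ontology and handed to the reasoner, rather than kept as plain tuples.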
Ontology handling, query processing and reasoner communication are managed by
the DSpace Semantic API, which exhibits a pluggable reasoner design as well as
proper handling of OWL 2 and is based on OWL API v3. As a result, the new
Semantic API features the ability to “hot-swap” between reasoners dynamically (see
Fig. 2). For the time being, any OWL API compliant reasoner can be supported, in-
cluding out-of-the-box support for Pellet, FaCT++ and HermiT.
Fig. 2. Changing reasoner through the Options tab.
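The "hot-swap" design can be sketched as a small registry that holds reasoner factories behind a common interface and replaces the active instance at runtime. The classes below are stand-ins, not real OWL API calls.

```python
# Minimal sketch of a hot-swappable reasoner design: the Semantic API holds
# the current reasoner behind a uniform interface and can replace it at
# runtime. Reasoner "instances" here are plain strings standing in for
# actual OWL API reasoner objects.

class ReasonerManager:
    def __init__(self):
        self._factories = {}
        self._current = None

    def register(self, name, factory):
        """Make a reasoner available under the given name."""
        self._factories[name] = factory

    def swap(self, name):
        """Replace the current reasoner with a fresh instance of the named one."""
        self._current = self._factories[name]()
        return self._current

manager = ReasonerManager()
manager.register("Pellet", lambda: "Pellet instance")
manager.register("HermiT", lambda: "HermiT instance")

print(manager.swap("HermiT"))  # HermiT instance
```

A design like this keeps the rest of the Semantic API agnostic to the concrete reasoner, which is what allows out-of-the-box support for Pellet, FaCT++ and HermiT.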
Most importantly, this Semantic API is designed along the same principles as its
predecessor, i.e. to remain independent of the DSpace business logic and to be agnos-
tic to the rest of the user interface or even to the underlying ontology. In addition, it
has undergone several improvements aiming at greater extensibility and performance,
like for example ontology caching, inference precomputation and reasoner sharing
and reuse. Finally, Semantic Search v2 includes several enhancements and/or fixes
throughout the DSpace ontology creation process as well as in the involved ontology
documents, aiming for better resource discovery, reasoner compatibility and
extended functionality.
Semantic Search is hosted and maintained as a Google Code project and is listed
as an official DSpace add-on. A public demo is also available. A more thorough
account of the architecture, methodology and design principles behind Semantic
Search v2 can be found in .
3 Evaluation Methodology
In this section we present the methodology for evaluating semantic query answering
in digital repositories. We first note the added value of semantic searching especially
within the context of a digital repository. We specify the requirements of evaluation
for this specific scenario by identifying the determinant characteristics for this kind of
applications. We then survey related work in the field and point the major differences
and incompatibilities with other related approaches in the literature. Two evaluation
phases emerge, which both serve the purpose of assessing system qualities along
different and often overlapping dimensions, based on a number of specific
evaluation criteria.
3.1 Requirements for Evaluating Semantic Search in Digital Repositories
Semantic searching in digital repositories can be considered as a complementary task
that comes to embellish the suite of content discovery services for digital collections.
However, it deviates from standard practice at least in two aspects: First, semantic
queries require different mechanisms that cannot be put side by side with traditional
search. Second, the evaluation of these mechanisms has to consider a set of distinct
requirements, which are not typically met when implementing semantic search in
general.
By traditional search we mean keyword-based search, i.e. search for the
occurrences of particular text strings within available metadata. On the other hand, semantic
search operates on additional knowledge that is implied by the combination of these
metadata and their semantic interpretations in an ontology. This combination is made
possible by inference engines that implement automated reasoning techniques based
on formal logics , which are the essential underpinnings of the Semantic Web.
Keyword-based queries can often be ambiguous, due to the polysemy of terms
used in the query, thus leading to low precision; they can also be affected by syno-
nymity between query terms and the contents of the search corpus, thus leading to low
recall. Semantic search, as metadata-based search defined according to an ontology,
overcomes both issues because ontology assertions are unambiguous and do
not suffer from synonymity. In addition, semantic search enables a new set of
queries that are based on the power of inference engines and are not possible with
traditional keyword-based search.
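The precision and recall effects described above can be made concrete with a small worked example; the document identifiers and result sets below are invented for illustration.

```python
# Worked example of the precision/recall argument: a keyword query on an
# ambiguous term retrieves irrelevant items (lowering precision) and misses
# synonym matches (lowering recall). All document sets are invented.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}            # what the user actually wants
keyword_retrieved = {"d1", "d2", "d5", "d6"}   # d5, d6: polysemous false hits;
                                               # d3, d4 missed (synonyms)

p, r = precision_recall(keyword_retrieved, relevant)
print(p, r)  # 0.5 0.5
```

An ontology-based query with sound and complete reasoning would, in this framing, retrieve exactly the relevant set, giving precision and recall of 1.0.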
The difficulty in implementing and evaluating semantic-enabled search has its
roots in the requirement for expert syntaxes, the complexity of ontological queries and
the need to have at least a rough knowledge of the ontologies involved
beforehand. Beyond these general preconditions, a number of additional requirements
must also be fulfilled when searching a digital repository. In essence, querying the
Semantic Web is one thing; however, applying semantic search in the setting of digi-
tal repositories is quite another, since the latter have a special scope, a more specific
user-base and the nature and status of information maintained in them is different.
This set of requirements is described below. Each requirement triggers specific evalu-
ation criteria, which are explained in more detail in section 3.3.
R1. User experience/expertise/background. Searching the contents of a reposi-
tory is more than just browsing among hundreds or thousands of digital re-
sources. The effectiveness of a semantic search service has to strike a bal-
ance between assisting end-users with the composition of complex queries
and giving them freedom to express semantically rich concepts. In the end,
digital repository users need not always be Semantic Web experts and they
do not necessarily have specific expectations out of a semantic query mecha-
nism, which could justify the additional effort to put up with it. Investigating
user experiences and background (criterion C1-Demographics) is therefore
necessary in order to be able to comprehend their reaction to system behavior
(C3-Friendliness/Intuitiveness) and to assess the impact of the new service
modalities (C4-Perception of semantic search and reasoning).
R2. Structure and Quality of metadata. In a digital repository, the responsibility
of documenting resources lies with domain experts, who are typically
assigned this task, rather than with arbitrary users. Consequently, well-
defined metadata records are produced, which contain rich and high quality
descriptions. Because of this, it may be difficult to convey the full semantic
interpretation and complexity of stored metadata during the querying
process, as this depends on the quality of the produced transformation into an
ontology. In addition, there is a semantic gap between a flat metadata record
and its corresponding translation into a semantic description. On the other
hand, general semantic search systems already start with a proper ontology.
The most frequently encountered causes of failure (C2-Causes of failure), as
well as a sufficiently refined result set (C7-Number of results) are indicative
of how well the system can perform in terms of knowledge interpretation.
Additionally, the types of semantic queries (C9-Query types), compared
against their complexity and result productivity, may reveal the system's
potential to handle implicit and interconnected knowledge.
R3. Need for reasoning. Implied relations that lie among a repository's well-
documented metadata descriptions can come to the surface only when they are
appropriately processed and reasoned about. To evaluate the actual merits of
reasoning, we first need to figure out how users perceive entailment based
query answering (C4-Perception of semantic search and reasoning) and how
they think such a semantic search facility could be further improved (C5-
Improvements/Suggestions). Moreover, the time needed for a request to be
processed (C6-Response times) and the types of queries users tend to submit
(C9-Query types), could give us a hint about the best possible trade-off
between knowledge discovery and performance, on behalf of the end-users.
R4. Added value. By default, a common metadata-based search facility is pro-
vided to digital repository users, with the latter being accustomed to its func-
tionality and performance. On the other hand, the paradigm-shift to a new,
semantic-enabled search interface, with additional complexity in use, may
raise some hesitation. Therefore, users need to be able to find out the actual
gains of intelligent searching through their own experience. To this end, we
investigate both users' perception of semantic search and reasoning (C4-
Perception of semantic search and reasoning) and their views on
how these constructs can be improved (C5-Improvements/Suggestions).
Nonetheless, added value is also captured in an implicit manner, through the
query analysis process. The average query length (C8-Query length) reflects
how much the underlying interface may facilitate composite and probably
more expressive requests, something which is also indicated by measuring
query success rates and causes of failure (C2-Causes of failure).
3.2 Related Work

Semantic search tools, although rapidly evolving in the last decade, lack a common
evaluation practice which would contribute to their performance improvement and
their broader acceptance by the World Wide Web community. As identified in ,
even though some well-established techniques have been adopted for the evaluation
of IR search systems – like those used in the annual TREC (Text REtrieval
Conference) – semantic technologies are still a long way from defining standard
evaluation benchmarks.
On the other hand, the use of standard IR metrics, like recall and precision, is not
always sufficient for rating semantic search systems' performance and efficiency. As
discussed in section 3.1, measuring precision and recall makes sense only when syno-
nymity and ambiguity are determining factors in the quality of search results. On the
contrary, an ontology-based semantic search system that employs sound and complete
algorithms will always exhibit perfect precision and recall, as is typically the case in
. What seem to be the most widely accepted techniques for these systems' evaluation
are those based on user-centered studies (, ).
The state of the art in the domain of semantic search evaluation appears very
limited, either missing comprehensive judgments or producing very system-specific
results that are inappropriate for making comparisons. Reasons for this include the
inherently complex and diverse structure of semantic technologies, the lack of com-
mon evaluation algorithms (, ), or other deficiencies like the inability to iden-
tify suitable metrics and representative test corpora for the Semantic Web .
Among the evaluation frameworks found in the literature, those presented below can
be considered as closest to our work, although they come with significant differences,
mainly in that: i) they often ignore the capabilities for inference-based knowledge
acquisition and reasoning and ii) they do not consider the specific requirements apply-
ing for digital repositories (section 3.1).
Within the SEALS (Semantic Evaluation At Large Scale) project (, ), a
systematic approach for benchmarking semantic search tools is proposed, divided into
two phases. The adopted methodology, unlike ours, disregards the reasoning
perspective, which is a key aspect for extracting implicit knowledge (requirement R3).
What is more, it is not well-suited for knowledge-based retrieval in the setting of digi-
tal repositories, where more specific requirements apply. For example, although query
expressiveness is taken into account, no particular provision is made for cases where
the required information already lies among rich and well-documented metadata rec-
ords (R2). Also, the generic scope of the SEALS campaign does not take into account
the specific background and/or expectations that a digital repository user may
have.
Similarly, work presented in  proposes the evaluation of keyword-based
searching over RDF using IR techniques. Therefore, this methodology focuses on assessing
what is actually traditional search over structured text corpora (RDF documents) and
the inference mechanisms and capabilities related to semantic search do not find their
way in the evaluation.
The QALD (Question Answering over Linked Data)  series of evaluation cam-
paigns form an attempt to evaluate question answering over linked data. Its actual aim
is to establish a standard evaluation benchmark and not a methodology in itself; there-
fore, it examines hardly any of the requirements posed in the previous section, like for
example users’ background and expectations (R1). Most importantly, QALD has a
focus on systems which are about semantically assisted natural language question
interpretation into graphs (the SPARQL query language). What gets mostly
evaluated by the QALD challenge is the translation of natural language questions into a
formal query language, rather than the process and impact of constructing and submitting
queries directly to a knowledge base, be it structured, semantically replicated or
otherwise. What is more, the effect of inference-based query answering (R3) is not
accounted for.
Finally, the framework introduced in  evaluates semantic querying systems from
a more user-centric perspective. Its main concern is to guide users among the different
types of ontology querying tools, rather than evaluating the actual capabilities and
performance of the underlying retrieval mechanisms.
Our evaluation methodology consists of an end-user survey aiming to capture user
satisfaction as well as user reaction to system-specific behavior. The latter is also
measured by analyzing actual queries posed through a demo installation. Besides, as
concluded in , when evaluating semantic search systems the ultimate success
factor is end-users' satisfaction. While reasoning effects play a considerable role in
the evaluation process, our evaluation approach still has much in common with
similar efforts in the literature: for one, survey questions match and refine the
usability-oriented philosophy of the standard SUS questionnaire, while the two-stage
approach of survey and query analysis is also shared by the SEALS evaluation
campaign ("user-in-the-loop" and "automated" phases ). The logging analysis method
has also been proposed in .
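For reference, the SUS questionnaire mentioned above is scored with a standard scheme: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is scaled by 2.5 to yield a 0-100 score. A minimal implementation:

```python
# Standard SUS scoring: ten 1-5 Likert responses in questionnaire order.
# Odd items contribute (r - 1), even items (5 - r); the sum is scaled by 2.5.

def sus_score(responses):
    """Return the 0-100 System Usability Scale score for ten responses."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```

The alternation reflects the questionnaire's deliberate mix of positively and negatively worded statements.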
There are different interpretations of semantic searching in the literature (for ex-
ample, see ). The methodology proposed in this paper is clearly applied to
entailment-based query answering through structured queries in digital repositories.
Different querying approaches, like NITELIGHT and PowerAqua (see ) have yet
to appear in repositories. However, the core methodology criteria would remain the
same as they: i) take into account the specific problem space (repositories), ii) empha-
size a pragmatic end-user perception stance, iii) focus on reasoning as the primary
source of added-value of such queries. To our knowledge, this is the first effort to
propose a framework and usage-case for the evaluation of semantic query answering
in digital repositories.
3.3 Evaluation Phases and Criteria
In terms of the above, two distinct evaluation phases have been identified and imple-
mented in this work:
1. A user survey based on anonymous multi-part questionnaires, aiming to capture
users' impression of and satisfaction with the system. Beyond the standard SUS
questionnaire, this survey includes further refinements that allow us to correlate
user demographics and/or particular interface characteristics with observed
behavior (Section 4).
2. The calculation of metrics and statistics about actual queries drawn from various
logged data. In this way, not only are the pragmatic evaluation requirements
accommodated (which would be impossible to capture otherwise), but this phase also
helps validate and double-check the outcomes of the former (Section 5).
Both these phases are designed in order to evaluate semantic search across differ-
ent criteria, taking specifically into account the capabilities for knowledge-based ac-
quisition and entailment. These criteria are often found to intersect both phases, thus
helping verify the findings of both and produce correlations. The evaluation criteria
and their rationale are analyzed below:
C1. Demographics: This information is used as the basis for investigating potential
demographic correlations with user behavior and performance with the system. In
particular, users are classified into groups according to their expertise in Seman-
tic Web and Manchester Syntax. These groups are then taken into account when
considering the actual usage of the provided interface and the successful out-
comes, like the percentage of successful queries.
C2. Causes of failure: A system’s failure to process and answer a query can degrade
its final acceptance by users. Knowing what actually caused this failure, be it a
systemic flaw or user misconception, can help in improving the system. Causes
of failure are captured both by analyzing logged system’s exceptions, as well as
by explicitly asking user opinion from a list of potential reasons.
C3. Friendliness/Intuitiveness: Expressing a semantic query is quite a tricky task
and requires both familiarity with the query language as well as a minimum abil-
ity in comprehending and handling the underlying ontology’s concepts. A user-
friendly and intuitive interface should be capable of smoothly guiding users
through the query construction process. The extent to which this is achieved can
be captured by evaluating users’ perception about provided structures and facili-
ties (auto-complete, help options, overall intuitiveness). It is also important to as-
sess the overall effort it takes for a user to construct a query, to identify the po-
tential difficulties met during this effort and to survey the time it took to assemble
the query using the interface.
C4. Perception of semantic search and reasoning: Entailment based queries man-
age to retrieve non-explicit knowledge and this is a major asset of any semantic
search system. To get an idea of how users comprehend and feel about such a
feature, answers to questions that compare semantic with traditional search are
examined.
C5. Improvements/Suggestions: The success of a semantic search service mostly
lies in the potential of its underlying semantic technologies and the way these are
perceived by users. To further enhance such a service, users are asked to identify
possible improvements, either from the semantic or usability perspective.
C6. Response times: Given that accuracy is guaranteed by sound and complete algo-
rithms, the total time needed to answer a query constitutes a core criterion for
evaluating semantic search. Improved response times imply increased satisfaction
for users. However, the extremely fast response times that traditional search en-
gines come with, raise the standards for semantic search also. The system’s actual
response time is derived from logged data and its perception by the users can be
reflected in survey responses concerning interface usability.
C7. Number of results: Adapting the notion of traditional recall, the productivity of
ontological queries can be captured by the number of individuals obtained
relative to the total number of individuals. This is useful in comparing groups of
queries depending on their types, the ontology they were performed on or the
reasoner involved, and gives some indication of how refined results actually are.
C8. Query length: Queries consisting of only a few characters are generally expected
to produce fast responses and simple results. On the other hand, longer queries
are more likely to be complex enough so as to produce refined results. In addi-
tion, how well autocomplete features and suggestion mechanisms assisted in que-
ry formulation can be effectively measured by average query length.
C9. Query types: Queries intended to be evaluated upon semantic concepts and con-
structs can differ significantly in terms of complexity, both from a computational,
as well as a logical and structural point of view. By partitioning queries into
groups according to their structural and semantic characteristics and by examin-
ing times and other metrics related to these groups, we can have a useful indica-
tion of the system's performance and user experience. For example, the frequency of
occurrence for each query type can imply how easy it was for users to construct
fine-grained queries through the provided interface. The partitioning of queries
into types depends on the underlying querying approach. In any case, a valid
grouping may consider the complexity of the queries, i.e. how "hard" it is for them
to run and how productive they may be. In the case of entailment-based
structured queries, this is interpreted as the type of the language constructs utilized,
such as classes, expressions and restrictions, as well as of the axioms associated
with these (see also section 5.4).
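The grouping by language constructs described under C9 can be sketched as a simple classifier; the categories and matching rules below are a simplification for demonstration, not the exact partitioning used in section 5.4.

```python
# Illustrative partitioning of entailment-based queries (criterion C9) by the
# Manchester Syntax constructs they use. Categories and rules are simplified.
import re

CONSTRUCTS = [
    ("existential", r"\bsome\b"),
    ("universal",   r"\bonly\b"),
    ("hasValue",    r"\bvalue\b"),
    ("boolean",     r"\b(and|or|not)\b"),
]

def query_types(query):
    """Return the construct categories present in a query, or 'atomic'."""
    found = [name for name, pat in CONSTRUCTS if re.search(pat, query)]
    return found or ["atomic"]

print(query_types("Item and creator some Person"))  # ['existential', 'boolean']
print(query_types("Collection"))                    # ['atomic']
```

Grouping logged queries this way allows response times and result counts to be compared per construct category.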
The multi-dimensional character of these criteria is better explained in Fig. 3,
where their relative placement within the evaluation phases as well as their relation-
ship to the assessment of reasoning capabilities is depicted.
Fig. 3. Evaluation criteria within evaluation phases and the reasoning aspect.
4 User Survey
In order to investigate users' behavior and experience in using Semantic Search, we
constructed an online questionnaire that mostly includes closed-format questions.
Questions rely on multiple choice as well as on 3- and 5-point Likert scale assessments.
All questions are optional and organized in four groups: the first group corresponds to
the criterion C1 (“Demographics”). The second group measures the C2 criterion
(“Causes of failure”) and the C3 criterion (“Friendliness/Intuitiveness”). The third
group of questions is related solely to the C3 criterion (“Friendliness/Intuitiveness”).
Finally, the fourth group evaluates the criteria of C4 ("Perception of semantic search
and reasoning") and C5 ("Improvements/Suggestions"). The analysis below follows
the criteria addressed.
4.1 Demographics and Causes of failure
The majority of people that took part in this survey appear to be moderately familiar
with Semantic Web concepts (a mean of 3.07 on a 5-point scale, with 5
corresponding to "expert"). As expected, their Manchester Syntax level of knowledge (Fig.
4, left) is a bit lower, but still most of them can at least recognize queries written in
this language and possibly construct some (mean 2.70/5). To perform
correlations later on, we categorize users into three major groups by collapsing the
two upper and two lower levels into one category each.
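The collapsing of the 5-point familiarity scale into three groups can be sketched as follows; the group labels (novice, intermediate, expert) are our own shorthand, not terms from the questionnaire.

```python
# Sketch of the grouping used for correlations: the two lower and two upper
# levels of the 5-point scale are each collapsed into one category.
# The group labels are illustrative shorthand.

def collapse(level):
    """Map a 1-5 familiarity level to one of three expertise groups."""
    if level <= 2:
        return "novice"
    if level == 3:
        return "intermediate"
    return "expert"

print([collapse(l) for l in [1, 2, 3, 4, 5]])
# ['novice', 'novice', 'intermediate', 'expert', 'expert']
```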
The average percentage of successful queries reached 53%. This increases as users
become more Manchester Syntax literate (see Table 1). The percentages have been
computed by taking into account the answers given by participants who had actually
tried to build some semantic queries using our demo installation (81% of all
participants), as these questions were only visible to them. The remaining queries failed
mostly due to syntax-specific errors that resulted in problematic query parsing (56%;
Fig. 4, right). Another important reason for unsuccessful queries was that, by the
users' own account, the results they got didn't match what they expected (33%).
Fig. 4. Familiarity with Manchester Syntax and reasons for query failure (reported).
Table 1. Manchester Syntax expertise level correlated with query success rates, frequency of
use of the "generated query" field and overall experience with the Semantic Search interface.
[Of the table's contents, only the replies to "Did you have to edit the 'generated query' box
manually?" are recoverable here: in every expertise group the two most frequent answers were
"Only a few times" and "Never. I was more comfortable creating the query step-by-step", with
one of the upper groups also answering "More often than not, it is easier for me this way".
The associated success rates and ease-of-use scores did not survive extraction.]
Table 1 summarizes querying experience and interface intuitiveness per expertise
group and identifies users' attitude toward manual query construction. It is clear that
the less experienced users are with the querying syntax, the more they use the step-
by-step query construction mechanism and refrain from composing queries manually.
This observation reveals that our guided query mechanism can appeal to inexperienced
users and help alleviate the complexity often found in formal-language querying
systems.

4.2 Friendliness and Intuitiveness

Overall, the interface was considered moderately easy to understand
and use, gaining a mean score of 2.96/5 and peaking for the middle expertise
group (1: "Very hard to understand and use" – 5: "Very easy and straightforward").
In addition, the fact that the vast majority of replies (80%) indicate that users
did not favor the "generated query" box (they used it "only a few times" or "never"),
and that 20% had exclusively used the step-by-step procedure, shows
that the semantic search interface came in handy for a significant part of them.
Besides, everyone said they were able to assemble and submit a query in less than a
minute, and more than one third of them (35%) answered that a time interval of 10
seconds was enough (Fig. 5).
Fig. 5. Distribution of query construction times among users.
The most important aspect of the Semantic Search interface proves to be its auto-
complete mechanism. On average, survey participants rated this particular feature
with a mean score of 3.80/5 in reply to the question "Did you find the auto-complete
facility helpful?" (1: "No, not at all" – 5: "Very much indeed"). As far as the
intuitiveness of labels is concerned, the average ratings appear a bit worse compared to
the ones regarding the auto-complete mechanism. On average, the names selected for
labels were considered a moderate choice, judged with a mean score of 3.04
out of 5. The reason is that we have to cope with Manchester Syntax's specific
terminology, which can be overwhelming for non-experts.
4.3 Perception of Semantic Search and Improvements
To evaluate the participants' perception of the service's retrieval capability and reasoning-centric character, we included some questions in the last part of the survey, which are analyzed below. Compared to traditional keyword-based search, the Semantic Search service is considered better, as shown by the number of positive votes (68% of the respondents indicated at least one positive reason, see Fig. 6).
The observation that Semantic Search outclasses traditional search in terms of improved precision and recall follows directly from the service's capability to perform reasoning. Another reasoning-related question is the one about possible improvements, with 57% of users requesting support for more advanced constructs of Manchester Syntax (like facets, automatic use of parentheses and inverse properties)
as well as the addition of more reasoners (43%).

[Fig. 5 data – "How long, on average, did it take to assemble and submit a query?": More than 5 minutes: 0%; 5 minutes: 0%; A minute or so: 65%; Less than 10 seconds: 35%.]

Nevertheless, additional and better tooltips for the various fields and boxes seem to be a more important improvement,
indicated by 62% of respondents. This comes as a consequence of the moderate 3.04
mean score previously assigned to label intuitiveness. It is worth mentioning that a plurality (38%) of the users pointing out this particular improvement gave as their main reason for query failure that the results they obtained didn't match what they had in mind. Although not necessarily user-friendlier, the extension of Semantic Search with a SPARQL endpoint is the most popular suggestion among users (76%), possibly due to standardization and added interoperability.
Fig. 6. Comparison between traditional keyword-based search and the Semantic Search service.
The vast majority of users (78%) found that our semantic search interface makes semantic searching "very much" or at least "somehow" easier, and only 17% disagree with this viewpoint (answering "not at all"). On a 3-point Likert scale this question scores an average of 2.13. It is important to mention that, by collapsing overall interface intuitiveness (Table 1) into a 3-point scale, the average scores of these two questions come quite close (1.96 and 2.13 out of 3, respectively). This indicates a direct relationship between semantic searching and the user friendliness of the underlying interface, and suggests that semantic search can benefit from interface intuitiveness.
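The collapse of a 5-point scale onto a 3-point one can be reproduced with a simple mapping; the exact grouping used for Table 1 is not stated, so the one below ({1,2}→1, 3→2, {4,5}→3) and the sample replies are assumptions:

```python
def collapse_to_3(score5):
    # Assumed grouping (not stated in the paper): {1,2} -> 1, 3 -> 2, {4,5} -> 3
    return {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}[score5]

# Hypothetical 5-point intuitiveness replies, collapsed and averaged
replies = [2, 3, 3, 4, 1, 5, 3, 4]
mean3 = sum(collapse_to_3(r) for r in replies) / len(replies)
```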
Participants' appreciation of semantic searching in digital repositories influences their perception of our own system (Table 2). Those who find semantic searching a promising idea rank our service higher and are more enthusiastic about adopting it, while those not committed tend to exhibit a negative bias, thus lowering its average score.
Table 2. Users' opinion about usefulness of general semantic searching in repositories, correlated with their attitude towards the Semantic Search service.
[Table 2 correlates "Do you think semantic searching is useful in repositories?" (1: "Not at all" – 3: "Very useful") with "Compared to traditional keyword-based search, do you think this is better?", "Would you like to have this service in your repository?" and "Overall, how easy was it for you to understand and/or use the interface?". Fig. 6 answer options (multiple replies possible): "No, I don't believe it is better"; "I can't compare them, they are different things"; "Yes, results are more …"; "Yes, there are more …"; "Yes, it is easier to create …"; "Yes, for some other …".]
5 Query Analysis
In this section we analyze real queries posed through a demo installation of the Semantic Search mechanism. The demo runs on a 64-bit Intel Xeon at 2.8GHz and Java is allowed a maximum heap memory of 1024MB. Data were collected over a period of 12 months, from March 2012 to March 2013, and correspond to the official deployment of the Semantic Search v2 minor upgrade (v2.1).
5.1 Ontologies used
During this evaluation phase most user queries were posed against the DSpace ontology, since this is the one selected by default. However, the interface permits loading and using any ontology document, and we found that users also experimented with other ontologies. A summary of collective data and metrics regarding user experience and system response with the four most popular ontologies (Table 3) is shown below (Fig. 7, Fig. 8). To give a better understanding of overall query efficiency we also report the total and/or average values of these metrics across all ontologies.
Table 3. Most frequently used ontologies and their expressiveness.
─ The DSpace ontology: SROIF (OWL 2)
─ Location keywords of the Global Change Master Directory (GCMD) in SKOS
─ Part of the Space Physics Archive Search and Extract (SPASE) data model in SKOS
─ A test version of SPASE, including a nom-…
Fig. 7. Query success rates and results per ontology.
5.2 Success rates
Query percentage (%queries) is the ratio of successful queries to the total number of queries logged by the system. Results percentage (%results) is the average number of results returned by a group of queries relative to the maximum number of results that may be returned by any query, i.e. the total number of individuals in the ontology. In the case of all ontologies, this represents the weighted mean of the results percentage for each ontology. Total quantities are also shown in absolute numbers in Fig. 7. Average query length is given in characters.
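Expressed in code, these ratios amount to the following sketch (function and variable names are illustrative; the counts would come from the query log):

```python
def query_pct(successful, total):
    # %queries: share of successful queries among all logged queries
    return 100.0 * successful / total

def results_pct(avg_results, individuals):
    # %results: average result count relative to the ontology's individual count
    return 100.0 * avg_results / individuals

def overall_results_pct(per_ontology):
    # Weighted mean of %results over all ontologies, weighted by query counts;
    # per_ontology is a list of (query_count, results_pct) pairs
    total_q = sum(q for q, _ in per_ontology)
    return sum(q * r for q, r in per_ontology) / total_q
```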
From a technical point of view, "successful" queries are those that achieve a reasoner response without errors, i.e., succeed in fetching results without interruption, even zero ones. Query analysis indicates 55.3% of such queries over all ontologies, increasing to 66.5% for the default DSpace ontology. This is in accordance with the findings of the user survey (53%, Table 1), though leaning a little to the positive side by also including queries that didn't match what users had in mind but worked nevertheless (see Fig. 4, right part).
Most technically unsuccessful queries (i.e., queries that caused one exception or another) failed, unsurprisingly, due to syntax errors (83.9% of logged exceptions), just as users report in the survey. This is followed at a distance by errors related to reasoner limitations (6.4%), like unsupported datatypes and facets, thus explaining to some extent the explicit exceptions reported by users (Fig. 4). An important part of logged exceptions consists of ontology-specific errors (9.7%), like unresolvable URLs and malformed or inconsistent ontologies, which are responsible for "internal system errors". This is to be expected though, since the system currently allows users to operate on any web-accessible ontology.
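The breakdown above can be tallied from exception logs with a simple keyword classifier; the marker strings below are hypothetical stand-ins for those actually emitted by the parser and reasoner stack:

```python
# Hypothetical marker substrings per error category; the real log messages
# produced by the parser/reasoner stack will differ.
CATEGORIES = {
    "syntax": ("ParserException", "cannot be parsed"),
    "reasoner": ("UnsupportedDatatype", "UnsupportedFacet"),
    "ontology": ("UnresolvableURL", "InconsistentOntology", "Malformed"),
}

def classify(log_line):
    # Assign a logged exception to the first category whose marker it contains
    for category, markers in CATEGORIES.items():
        if any(marker in log_line for marker in markers):
            return category
    return "other"

def failure_breakdown(log_lines):
    # Percentage of logged exceptions per category
    counts = {}
    for line in log_lines:
        cat = classify(line)
        counts[cat] = counts.get(cat, 0) + 1
    return {c: 100.0 * n / len(log_lines) for c, n in counts.items()}
```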
5.3 Time Measurements
Fig. 8 summarizes average time measurements for the set of successful queries, grouped per reasoner and ontology used. Reasoning time corresponds to inference precomputation by the reasoner. This is an initialization phase where the reasoner classifies the ontology and precomputes certain reasoning operations, such as realization, satisfiability and instance checking. Reasoning is a computationally hard task and its cost increases as an ontology grows in complexity (Table 3). However, precomputation happens only once and its results remain available in memory for future queries to operate on.
Fig. 8. Reasoning and query time per reasoner per ontology.
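The precompute-once pattern can be sketched as follows; the class and member names are illustrative, not the actual plugin API:

```python
import time

class ReasonerSession:
    """Sketch of the precompute-once pattern: the expensive classification
    step runs a single time when the ontology is loaded, while each query
    is a cheap lookup against the cached inferences."""

    def __init__(self, ontology):
        t0 = time.perf_counter()
        # One-off, expensive step standing in for classification/realization
        self._inferences = {cls: set(members) for cls, members in ontology.items()}
        self.reasoning_time = time.perf_counter() - t0

    def query(self, class_expression):
        # Per-query evaluation against precomputed results
        return self._inferences.get(class_expression, set())

session = ReasonerSession({"dspace-ont:Item": ["item/123", "item/456"]})
hits = session.query("dspace-ont:Item")
```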
Query time corresponds to query evaluation, i.e., the time it takes for the reasoner to evaluate the input class expression and decide its instance membership (which individuals belong to the evaluated class). Due to precomputation, query times are generally a fraction of reasoning times (but see also the next section).
As is evident in Fig. 8, measured reasoner times exhibit great discrepancies, since they spread over totally different orders of magnitude. This is primarily due to inherent differences in the reasoners themselves, their architecture and optimization strategies. In such situations it is preferable to use the geometric mean, so as to mitigate the asymmetry of different reasoner scaling, which does not really account for actual system behavior, and to better convey the overall tendency. In the following, when we refer to average times we use the geometric mean.
These times are essential, since users may be accustomed to rapid response times
from their everyday experience with traditional search engines. On the other hand,
systems that employ semantics to offer enhanced search services can be somewhat
more burdened. A recent study on the evaluation of such systems reports average
response times between a few ms and several seconds. In this setting, the fact that
Semantic Search for DSpace exhibits average query (and reasoning) times of less than
half a second can be considered satisfactory.
Finally, Fig. 9 shows the distribution of queries along a logarithmic time scale. Most frequent are queries that take between 1 and 10 ms to evaluate, meaning that users get, for the most part, a quite responsive experience from the system.
Fig. 9. Percentage of queries per time scale.
5.4 Types of Queries
Fig. 10 partitions logged queries into groups based on their types and characteristics.
As will become evident, the number of results returned as well as query evaluation
times are directly dependent on these query types. Both an ANOVA (ANalysis Of VAriance) and a Kruskal-Wallis test yield p << 0.0001, attesting that this partitioning and the observations that follow are statistically significant.
Fig. 10. Percentage of queries, average times and average percentage of results per query type.
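Such tests are available in off-the-shelf packages (e.g. `scipy.stats.f_oneway` and `scipy.stats.kruskal`); as an illustration of what the F statistic measures over per-type groups of query times, a minimal pure-Python one-way ANOVA might look like:

```python
def anova_f(groups):
    """One-way ANOVA F statistic over groups of query times (pure Python)."""
    all_vals = [x for g in groups for x in g]
    n, k = len(all_vals), len(groups)
    grand_mean = sum(all_vals) / n
    # Between-group variability: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: spread of values around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F (relative to the F distribution with k−1 and n−k degrees of freedom) means the query-type grouping explains much more variance than chance.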
Query evaluation times: Clearly, since precomputation has already taken place, query evaluation may range from very low (practically insignificant) times to several milliseconds, depending on whether query results are already in memory or a re-computation is required. For example, a named class expression like 'owl:Thing' or 'dspace-ont:Item' should be fast in principle, because realization and instance-checking results are already available. On the other hand, queries involving properties, e.g. existential, value and number restrictions, may require reclassification of the ontology so as to decide the placement of the new expression in the existing class hierarchy, and can be much more costly.
This is especially true in cases where properties subsume long hierarchies of other
properties or role chains (with transitivity being a special case of chain), since the
corresponding Description Logics algorithms have to investigate exhaustively all
possible paths connecting instances, based on the role hierarchy.
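A rough intuition of this cost: deciding membership in a restriction over a transitive (or chained) property amounts to exhaustively traversing all paths of role assertions, as in the following simplified sketch (the actual DL algorithms are considerably more involved):

```python
from collections import deque

def reachable(instance, edges):
    """Instances connected to `instance` through a transitive property,
    found by exhaustively following all paths (BFS over role assertions).
    `edges` maps each instance to its direct successors."""
    seen, queue = set(), deque([instance])
    while queue:
        cur = queue.popleft()
        for nxt in edges.get(cur, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

The traversal touches every edge on every reachable path, so cost grows with the depth of the role hierarchy and the density of assertions.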
Number of results: As shown in Fig. 10, more than half of user queries (60.3%) consist of property restrictions. However, these queries returned on average a small percentage (12.9%) of the total individuals. On the other hand, the simpler single-named class expressions are somewhat less frequent, but account for a far greater share of results (30.5%). This confirms that users were able to take advantage of the new features of the interface and construct complex, 'educated' queries, effectively filtering the result space. Besides, according to the user survey, the vast majority of users reported that they only rarely opted for manual query construction and preferred the guided query mechanism.
It should be noted that the syntactic characteristics of user queries, as they are generally identified in Fig. 10, do not have any direct effect on their knowledge retrieval capabilities. Instead, it is the semantic qualities of the vocabulary and expressions used within the queries that determine their ability to perform knowledge acquisition.
A notable exception is restrictions on subsuming properties, transitive properties and chains, which actually concern semantically 'intense' properties, due to their manipulation in the ontology. Queries of this type are expected to exhibit improved retrieval capabilities by design; however, the opposite does not necessarily hold.
As a final observation, users appear reluctant to build Boolean combinations, possibly because this requires an extra step in the query construction process ("Add term" button in Fig. 1).
5.5 Length of Queries
Next, Fig. 11 presents queries organized into groups based on their length and correlates them with their average query time and returned results percentage. It can be observed that:
Fig. 11. Percentage of queries, results and average query times per query length.
─ Most user queries are between 10 and 20 characters long. In addition, the average query length is 36 characters (Fig. 7), which is larger than usual web search queries. For example, the average query length in Google is around 20 characters and, similarly, new queries in Bing appear to be almost 19 characters long. This is due to the fact that Semantic Search offers several enhancements that reduce users' workload, such as prefix auto-completion, pull-down menus and auto-complete suggestions for entity names, all making it easier for users to construct longer and more complex queries.
─ Query evaluation time is not directly dependent on query length. This is a reasonably expected result, since query evaluation does not depend on the query size but on the type and characteristics of the query, as explained above (restrictions, properties, etc.). Still, average time may increase as queries lengthen, because there is a greater chance for 'hard' queries to occur as query length increases.
─ Finally, there is a general tendency for the result-set size to decline as query length increases. In conjunctive term-based search this is a well-expected observation, since multiple conjunctions tend to progressively prune the result set (they have higher precision). Here, longer queries are not necessarily conjunctive; however, short queries cannot contain the restrictions or fine-grained class expressions that would limit the number of results and that are only possible within larger queries (for example, the Manchester Syntax keyword 'some' alone contains 4 characters and accounts for 11% of the average query length).
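The grouping behind Fig. 11 can be approximated as follows; the record fields and bucket bounds are illustrative:

```python
def bucket_stats(queries):
    """Group (query_text, time_ms, n_results) records into ~10-character
    length buckets and report each bucket's share of queries and its
    average evaluation time. Field names are illustrative."""
    buckets = {}
    for text, time_ms, n_results in queries:
        key = min(len(text) // 10, 6)  # buckets ~0-10, 11-20, ..., 61+
        buckets.setdefault(key, []).append((time_ms, n_results))
    total = len(queries)
    return {
        key: {
            "%queries": 100.0 * len(rows) / total,
            "avg_time_ms": sum(t for t, _ in rows) / len(rows),
        }
        for key, rows in buckets.items()
    }
```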
6 Conclusions
The added value of semantic query answering is sometimes overlooked, possibly because of the dominance of traditional, text-based searching and its satisfactory results. This is not to say that semantic search can replace traditional methods; it can, however, evidently provide additional querying dimensions, made possible by unfolding content semantics and automated reasoning.
In this paper we have introduced a systematic approach towards the evaluation of semantic searching in knowledge-intensive settings, like digital repositories. It is easily seen that traditional IR metrics, such as recall and precision, are often incompatible or inappropriate in these scenarios. The user-centric character as well as the reasoning-specific characteristics of such systems necessitate a pragmatic stance, and we have shown specific requirements, criteria and procedures on how this can be achieved.
This methodology has been applied to the DSpace Semantic Search plug-in and has helped to identify the system's strengths and weaknesses, as well as to confirm the worthiness of improvements introduced with v2. For example, Manchester Syntax may not qualify as an alternative to natural language, but there are options that can alleviate this burden. In any case, no matter what syntax or querying approach is actually followed, we have indicated specific metrics, like the results ratio or query success rates, that can help estimate query quality and user engagement.
In fact, this idea can be carried over to all the other aspects a semantic search mechanism may have, from system architecture to interface design to reasoning strategy; no matter how they are actually implemented, our procedure can be equally applied to confirm (or refute) the validity of design decisions. As a result, this methodology
may serve as a useful paradigm for assessing the effects, merits and challenges of deploying semantic search capabilities over existing repository systems.

References
1. Bhagdev, R., Chapman, S., Ciravegna, F., Lanfranchi, V., Petrelli, D.: Hybrid Search: Effectively Combining Keywords and Semantic Searches. In: 5th European Semantic Web Conference. Tenerife, Spain (2008)
2. Bjork, K., Isaak, D., Vyhnanek, K.: The Changing Roles of Repositories: Where We Are and Where We Are Headed. Library Faculty Publications and Presentations. (2013)
3. Blanco, R., Halpin, H., Herzig, D., Mika, P., Pound, J., Thompson, H., Duc, T.: Entity Search Evaluation over Structured Web Data. In: 1st International Workshop on Entity-Oriented Search at 34th Annual ACM SIGIR Conference. Beijing, China (2011)
4. Bock, J., Haase, P., Ji, Q., Volz, R.: Benchmarking OWL Reasoners. In: Advancing Reasoning on the Web: Scalability and Commonsense Workshop at the 5th European Semantic Web Conference. Tenerife, Spain (2008)
5. Brooke, J.: SUS: a "quick and dirty" usability scale. In: P. W. Jordan, B. Thomas, B. A. Weerdmeester, A. L. McClelland (eds) Usability Evaluation in Industry. Taylor and Francis. London (1996)
6. Cimiano, P., Lopez, V., Unger, C., Cabrio, E., Ngonga Ngomo, A-C., Walter, S.: Multilingual Question Answering over Linked Data (QALD-3): Lab Overview. In: 4th Conference and Labs of the Evaluation Forum. LNCS 8138, pp. 321--332. Springer (2013)
7. Ekaputra, F. J., Serral, E., Winkler, D., Biffl, S.: An analysis framework for ontology querying tools. In: 9th International Conference on Semantic Systems. Graz, Austria (2013)
8. Elbedweihy, K., Wrigley, S. N., Ciravegna, F., Reinhard, D., Bernstein, A.: Evaluating semantic search systems to identify future directions of research. In: 2nd International Workshop on Evaluation of Semantic Technologies at 9th Extended Semantic Web Conference. Heraklion, Greece (2012)
9. Fernandez, M., Lopez, V., Motta, E., Sabou, M., Uren, V., Vallet, D., Castells, P.: Using TREC for cross-comparison between classic IR and ontology-based search models at a Web scale. In: Semantic Search Workshop at 18th International World Wide Web Conference. Madrid, Spain (2009)
10. Fernandez, M., Lopez, V., Sabou, M., Uren, V., Vallet, D., Motta, E., Castells, P.: Semantic Search meets the Web. In: IEEE International Conference on Semantic Computing, pp. 253--260. Santa Clara, CA (2008)
11. Harris, S., Seaborne, A.: SPARQL 1.1 Query Language. W3C Recommendation (2012)
12. Hildebrand, M., van Ossenbruggen, J. R., Hardman, L.: An Analysis of Search-Based User Interaction on the Semantic Web. Technical Report INS-E0706. (2007)
13. Horridge, M., Bechhofer, S.: The OWL API: A Java API for Working with OWL 2 Ontologies. In: 6th OWL Experiences and Directions Workshop. Chantilly, Virginia (2009)
14. Horrocks, I.: Ontologies and the Semantic Web. Communications of the ACM, 51(12), 58--67 (2008)
15. Horridge, M., Patel-Schneider, P.S.: Manchester Syntax for OWL 1.1. In: 4th OWL Experiences and Directions Workshop. Gaithersburg, Maryland (2008)
16. Koutsomitropoulos, D., Borillo, R., Solomou, G.: A Structured Semantic Query Interface for Reasoning-based Search and Retrieval. In: 8th Extended Semantic Web Conference. Heraklion, Greece (2011)
17. Kumar, B.T.S., Prakash, J.N.: Precision and Relative Recall of Search Engines: A Comparative Study of Google and Yahoo. Singapore Journal of Library & Information Management, 38 (2009)
18. Lagoze, C., Van de Sompel, H., Nelson, M., Warner, S.: The Open Archives Initiative Protocol for Metadata Harvesting. (2002)
19. McCool, R., Cowell, A.J., Thurman, D.A.: End-User Evaluations of Semantic Web Technologies. In: Workshop on End User Semantic Web Interaction at 4th International Semantic Web Conference. Galway, Ireland (2005)
20. Motik, B., Parsia, B., Patel-Schneider, P.F. (eds.): OWL 2 Web Ontology Language XML Serialization (Second Edition). W3C Recommendation (2012)
21. Nardi, D., Brachman, R.J.: An introduction to description logics. In: F. Baader et al. (eds) Description Logic Handbook. Cambridge University Press. Cambridge (2002)
22. Nixon, L., Garcia-Castro, R., Wrigley, S. N., Yatskevich, M., Trojahn Dos Santos, C., Cabral, L.: The state of semantic technology today – overview of the first SEALS evaluation campaigns. In: 7th International Conference on Semantic Systems. Graz, Austria (2011)
23. Pan, J.Z., Thomas, E., Sleeman, D.: ONTOSEARCH2: Searching and Querying Web Ontologies. In: IADIS International Conference WWW/Internet. Murcia, Spain (2006)
24. Schandl, B., Todorov, D.: Small-Scale Evaluation of Semantic Web-based Applications. Technical Report, University of Vienna. (2008)
25. Strasunskas, D., Tomassen, S.L.: On Variety of Semantic Search Systems and their Evaluation Methods. In: International Conference on Information Management and Evaluation, Academic Conferences International, pp. 380--387. (2010)
26. Tsotsis, A.: Google Instant Will Save 350 Million Hours of User Time Per Year. Online article at http://techcrunch.com/2010/09/08/instant-time/ (2010)
27. Tyler, S.K., Teevan, J.: Large Scale Query Log Analysis of Re-Finding. In: 3rd ACM International Conference on Web Search and Data Mining. New York, NY (2010)
28. Uren, V., Sabou, M., Motta, E., Fernandez, M., Lopez, V., Lei, Y.: Reflections on five years of evaluating semantic search systems. International Journal of Metadata, Semantics and Ontologies, 5(2), 87--98 (2011)
29. Wrigley, S.N., Reinhard, D., Elbedweihy, K., Bernstein, A., Ciravegna, F.: Methodology and campaign design for the evaluation of semantic search tools. In: 3rd International Semantic Search Workshop at 19th International World Wide Web Conference. Raleigh, NC (2010)