JOURNAL OF SPATIAL INFORMATION SCIENCE
Number 2 (2011), pp. 29–57 doi:10.5311/JOSIS.2011.2.3
RESEARCH ARTICLE
The semantics of similarity in
geographic information retrieval
Krzysztof Janowicz¹, Martin Raubal², and Werner Kuhn³
¹Department of Geography, The Pennsylvania State University, University Park, PA 16802, USA
²Department of Geography, University of California, Santa Barbara, CA 93106, USA
³Institute for Geoinformatics, University of Münster, D-48151 Münster, Germany
Received: May 24, 2010; returned: July 7, 2010; revised: November 8, 2010; accepted: December 21, 2010.
Abstract: Similarity measures have a long tradition in fields such as information retrieval, artificial intelligence, and cognitive science. Within the last years, these measures have been extended and reused to measure semantic similarity; i.e., for comparing meanings rather than syntactic differences. Various measures for spatial applications have been developed, but a solid foundation for answering what they measure; how they are best applied in information retrieval; which role contextual information plays; and how similarity values or rankings should be interpreted is still missing. It is therefore difficult to decide which measure should be used for a particular application or to compare results from different similarity theories. Based on a review of existing similarity measures, we introduce a framework to specify the semantics of similarity. We discuss similarity-based information retrieval paradigms as well as their implementation in web-based user interfaces for geographic information retrieval to demonstrate the applicability of the framework. Finally, we formulate open challenges for similarity research.
Keywords: semantic similarity, geographic information retrieval, ontology, similarity measure, context, relevance, description logic, user interface
1 Introduction and motivation
Similarity measures belong to the classical approaches to information retrieval and have been successfully applied for many years, increasingly also in the domain of spatial information [82]. While they previously worked in the background of search engines, similarity measures are nowadays becoming more visible and are integrated into the user interfaces of modern search engines. A majority of these measures are purely syntactical, rely
on statistical measures or linguistic models, and are restricted to unstructured data such as text documents. Lately, the role of similarity measures in searching and browsing multimedia content, such as images or videos, has been growing [59]. Similarity measures have also been studied intensively in cognitive science and artificial intelligence [80] for more than 40 years. In contrast to information retrieval, these domains investigate similarity to learn about human cognition, reasoning, and categorization [31] from studying differences and commonalities in human conceptualizations. Similarity measures have also become popular in the Semantic (geospatial) Web [20]. They are being applied to compare concepts, to improve searching and browsing through ontologies, as well as for matching and aligning ontologies [84]. In GIScience, similarity measures play a core role in understanding and handling semantic heterogeneity and, hence, in enabling interoperability between services and data repositories on the Web. In his classic book Gödel, Escher, Bach: An Eternal Golden Braid, Hofstadter named, among other facts, the abilities “to find similarities between situations despite differences which may separate them [and] to draw distinctions between situations despite similarities which may link them” as major characteristics of (human) intelligence [38, p.26].
Modern similarity measures are neither restricted to purely structural approaches nor to simple network measures within a subsumption hierarchy. They compute the conceptual overlap between arbitrary concepts and relations, and, hence, narrow the gap between similarity and analogy. To emphasize this difference, they are often referred to as semantic similarity measures. Similar to syntactic measures, they are increasingly integrated into front-ends such as semantically enabled gazetteer interfaces [44]. In contrast to subsumption-based approaches, similarity reasoning is more flexible in supporting users during information retrieval. Most applications that handle fuzzy or ambiguous input (either from human beings or from software agents) potentially benefit from similarity reasoning.
However, the interpretation of similarity values is not trivial. While the number of measures and applications is increasing, there is no appropriate theoretical underpinning to explain what they measure, how they can be compared, and which of them should be chosen to solve a particular task. In a nutshell, the challenge is to make the semantics of similarity explicit. Abstracting from various existing theories, we propose a generic framework for similarity measures, supporting the study of these and related questions. In our work and review we focus on inter-concept similarity and particularly on comparing classes in ontologies. While the methods to measure inter-concept and inter-instance similarity overlap, the former is more challenging. This is mainly for two reasons. First, in contrast to data on individuals, ontologies describe multiple potential interpretations. For instance, there is no single graph describing a concept in an OWL-based ontology [37]. Secondly, an interpretation may have an infinite number of elements and, hence, may describe an infinite graph.
The remainder of this article is structured as follows. First we introduce related work on geographic information retrieval and semantic similarity measurement. Next, we propose a generic framework and elucidate the introduced steps by examples from popular similarity theories. While we focus on inter-concept similarity, the framework has also been successfully adapted to inter-instance measures [90], and, moreover, can be generalized to the comparison of spatial scenes [60, 72]. We then discuss the role of similarity in semantics-based information retrieval and show its integration into user interfaces. We conclude by pointing to open research questions.
2 Related work
This section introduces geographic information retrieval and similarity measurement, and points to related work.
2.1 Geographic information retrieval
Information retrieval (IR) is a broad and interdisciplinary research field including information indexing, relevance rankings, search engines, evaluation measures such as recall and precision, as well as robust information carriers and efficient storage. In its broadest definition, information retrieval is concerned with finding relevant information based on a user's query [18]. Here, we focus on the relevance relationship and leave other aspects such as indexing aside. Following Dominich [18], information retrieval can be formalized as:
IR =m[R(O, (Q, I,→))] (1)
where
Ris the relevance relationship,
Ois a set of objects,
Qis the user’s query,
Iis implicit information,
→ is inferred information, and
mis the degree (or certainty) of relevance.
Accordingly, information retrieval is about computing the degree of relevance between a set of objects, such as web pages, and the search parameters, e.g., keywords, specified by the user. Besides defining suitable relevance measures, the main challenge for information retrieval is that “we are asking the computer to supply the information we want, instead of the information we asked for. In short, users are asking the computer to reason intuitively” [10, p.1]. Not all information relevant to a search can be entered into the retrieval system. For instance, classical search engines offer a single text field to enter keywords or phrases. Implicit information, such as the user's age, cultural background, or the task motivating the search are not part of the query. Some of this implicit information can be inferred and used for the relevance rankings. In case of search engines for the web, the language settings of the browser or the IP-address reveal additional information about the user.
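To make Dominich's formalization concrete, the following Python sketch treats retrieval as ranking a set of objects O by a relevance degree m computed from the explicit query Q and inferred implicit information I. The relevance function, the inferred-context dictionary, and the toy documents are hypothetical illustrations, not part of the formalization in [18].

```python
from typing import Callable, Dict, List, Tuple

def retrieve(objects: List[str],
             query: str,
             inferred_context: Dict[str, str],
             relevance: Callable[[str, str, Dict[str, str]], float]
             ) -> List[Tuple[str, float]]:
    """Rank objects O by the degree of relevance m = R(o, (Q, I))."""
    scored = [(o, relevance(o, query, inferred_context)) for o in objects]
    # Higher relevance degrees come first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy relevance: fraction of query terms contained in the object,
# slightly boosted when the object mentions the inferred language.
def keyword_relevance(obj: str, query: str, ctx: Dict[str, str]) -> float:
    terms = query.lower().split()
    hits = sum(1 for t in terms if t in obj.lower())
    boost = 0.1 if ctx.get("language", "") in obj.lower() else 0.0
    return hits / max(len(terms), 1) + boost

if __name__ == "__main__":
    docs = ["Pubs in Muenster (german)", "Hotels in Berlin (german)", "Pubs in Dublin"]
    print(retrieve(docs, "pubs muenster", {"language": "german"}, keyword_relevance))
```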
Geographic information retrieval (GIR) adds space and sometimes time as dimensions to the classical retrieval problem. For instance, a query for “pubs in the historic center of Münster” requires a thematic and a spatial matching between the data and the user's query. According to Jones and Purves [50], GIR considers the following steps. First, the geographic references have to be recognized and extracted from the user's query or a document using methods such as named entity recognition and geo-parsing. Second, place names are not unique and the GIR system has to decide which interpretation is intended by the user. Third, geographic references are often vague; typical examples are vernacular names (“historic center”) and fuzzy geographic footprints. In case of the pub query, the GIR system has to select the correct boundaries of the historic center [71]. Fourth, and in contrast to classical IR, documents also have to be indexed according to particular geographic regions. Finally, geographic relevance rankings extend existing relevance measures with
a spatial component. The ranking of instances depends not only on thematic aspects, e.g., the pubs, but also on their location, e.g., their distance to the historic center of Münster.
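As an illustration of such a geographic relevance ranking, the sketch below combines a thematic score with a distance-based spatial score; the multiplicative combination, the exponential decay, and all numbers are assumptions chosen for this example, not a ranking prescribed by the GIR literature cited above.

```python
import math
from typing import List, Tuple

def spatial_decay(distance_m: float, half_life_m: float = 500.0) -> float:
    """Exponential decay: 1.0 at the reference point, 0.5 at half_life_m."""
    return math.exp(-math.log(2) * distance_m / half_life_m)

def geographic_relevance(thematic_score: float, distance_m: float) -> float:
    """Combine thematic and spatial relevance into a single degree."""
    return thematic_score * spatial_decay(distance_m)

def rank_pubs(pubs: List[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """pubs: (name, thematic score, distance to the historic center in meters); toy values."""
    ranked = [(name, geographic_relevance(s, d)) for name, s, d in pubs]
    return sorted(ranked, key=lambda p: p[1], reverse=True)

if __name__ == "__main__":
    print(rank_pubs([("Pinkus", 0.9, 300.0), ("Cavete", 0.8, 80.0), ("Brauhaus", 0.6, 1500.0)]))
```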
2.2 Semantic similarity measurement
Research on similarity investigates commonalities and differences between individuals or
classes. Most similarity measures originated in psychology and were established to deter-
mine why and how individuals are grouped into categories, and why some categories are
comparable to each other while others are not [31, 69]. The following approaches to seman-
tic similarity measurement can be distinguished: feature-based, alignment-based, network-
based, transformational, geometric, and information theoretic (see [31] for details).
These similarity measures are either syntax- or semantics-based. Classical examples for syntactic similarity measures are those which compare literals, such as edit-distance; but there are also more complex theories. The main challenge for semantic similarity measures is the comparison of meaning as opposed to structure. Lacking direct access to individuals or categories in the world, any computation of similarity rests on terms expressing concepts. Semantic similarity measures use specifications of these concepts taken from ontologies [34]. These may involve (unstructured) bags of features, regions in a multidimensional space, algebras, or logical predicates (e.g., in description logics, which are popular among Semantic Web ontologies). Consequently, similarity measures do not only differ in their expressivity but also in the degree and kind of formality applied to represent concepts, which makes them difficult to compare. Besides the question of representation, context and its integration is another major challenge for similarity measures [40, 52]. Meaningful notions of similarity cannot be determined without defining (or at least controlling) the context in which similarity is measured [23, 32, 69]. While research from many domains including psychology, neurobiology, and GIScience argues for a situated nature of conceptualization and reasoning [8, 9, 12, 58, 67, 91], the concept representations used by most similarity theories from information science are static and de-contextualized. An alternative approach was recently presented by Raubal [79] arguing for a time-indexed representation of concepts.
Similarity has been widely applied within GIScience. Based on Tversky's feature model [88], Rodríguez and Egenhofer [81] developed the matching distance similarity measure (MDSM) which supports a basic context theory, automatically determined weights, and a symmetric as well as a non-symmetric mode. Ahlqvist, Raubal, and Schwering [2, 77, 83] used conceptual spaces [26] for models based on geometric distance. Sunna and Cruz [14, 86] applied network-based similarity measures for ontology alignment. Several measures [4, 5, 11, 15, 16, 39, 48] have been developed to close the gap between ontologies specified in description logics and classical similarity theories which had not been able to handle the expressivity of these logics so far. Other theories [60, 73] have been established to determine the similarity between spatial scenes, handle uncertainty in the definition of geographic categories [25], or to compute inter-user similarity for geographic recommender systems [68]. Similarity has also been applied as a quality indicator in geographic ontology engineering [45]. The ConceptVISTA [24] ontology management and visualization toolkit uses similarity for knowledge retrieval and organization. Klippel [54, 55] provided first insights into measuring similarity between geographic events and the dynamic conceptualization of topological relations.
3 Semantics of similarity
Similarity has been applied to various tasks in many domains. One consequence is that
there is no precise and application-independent description of how and what a similarity
theory measures [32,69]. Even for semantics-based information retrieval, several similarity
measures have been proposed. This makes the selection of an appropriate measure for
a particular application a challenging task. It also raises the question of how to compare
existing theories. By examining several of these measures from different domains we found
generic patterns which jointly form a framework for describing how similarity is computed
[44, 48]. The framework consists of the following seven steps:
1. definition of application area and intended audience;
2. selection of context and search (query) and target concepts;
3. transformation of concepts to canonical form;
4. definition of an alignment matrix for concept descriptors;
5. application of constructor-specific similarity functions;
6. determination of standardized overall similarity; and
7. interpretation of the resulting similarity values.
The implementation of these steps depends on the similarity measure as well as the used representation language. Steps that may be of major importance for a particular theory may play only a marginal role for others. The key motivation underlying the framework is to establish a systematic approach to describe how a similarity theory works by defining in which ways it implements the seven steps. By doing so, the theory fixes the semantics of the computed similarity values as well as important characteristics, such as whether the measure is symmetric, transitive, reflexive, strict, or minimal [6, 13, 31]. Moreover, the framework also supports a separation between the process of computing similarity (i.e., what is measured) and the applied similarity functions (i.e., how it is measured). Note that we distinguish between similarity functions and similarity measures (or theories). A similarity measure is an application of the proposed framework, while similarity functions are specific algorithms used in step 5. For instance, a particular similarity theory may foresee the use of different similarity functions depending on the tasks or users. This difference is discussed in more detail below. While the framework has been developed for inter-concept similarity measures, it can be reused and modified to understand inter-instance similarity as well. The reason for focusing on inter-concept similarity lies in its complex nature, which makes understanding particular steps and design decisions necessary.
In the following, a description of each step is given; examples from geometric, feature-
based, alignment, network, and transformational similarity measures demonstrate the gen-
eralizability of the framework.
3.1 Application area and intended audience
Which functions should be selected to measure similarity depends on the application area. Theories established for (geographical) information retrieval and in the cognitive sciences tend to use non-symmetric similarity functions to mimic human similarity reasoning [31], which is also influenced by language, age, and cultural background [40, 63, 69]. The ability to adjust similarity measures also plays a crucial role in human-computer interaction. In contrast, similarity theories for ontology matching and alignment tend to utilize symmetric functions as none of the compared ontologies plays a preferred role. In some cases, the choice of a representation language influences which parameters have to be taken into account before measuring similarity. For instance, for logical disjunctions among predicates one needs to choose between computing the maximum, minimum [16], or average similarity [44]. With respect to the introduced information retrieval definition, this step is responsible for adjusting the similarity theory using inferable implicit information.
3.2 Context, search, and target concepts
Before similarity is measured, concepts have to be selected for comparison. Depending on the application scenario and theory, the search concept C_s can be part of the ontology or built from a shared vocabulary; in the latter case the term query concept C_q may be more appropriate [39, 44, 62]. The target concepts C_t1, ..., C_ti form the so-called context of discourse C_d [40] (called domain of application in case of the MDSM [81]) and are selected by hand or automatically determined by specifying a context concept C_c. In the latter case, the target concepts are those concepts subsumed by C_c. Equation 2 shows how to derive the context of discourse for similarity theories using description logics as representation language.
$$C_d = \{C_t \mid C_t \sqsubseteq C_c\} \qquad (2)$$
In case of the matching distance similarity measure, the context (C) is defined as a set of tuples over operations (op_i) associated with their respective nouns (e_j, equation 3). These nouns express types, while the operations correspond to verbs associated with the functions defined for these types (see [81] for details). For instance, a context such as C = (play, {}) restricts the domain of application to those types which share the functional feature play.

$$C = (op_i, \{e_1, \ldots, e_m\}), \ldots, (op_n, \{e_1, \ldots, e_l\}) \qquad (3)$$

Other knowledge representation paradigms such as conceptual spaces require their own definitions, e.g., by computing relations between regions in a multi-dimensional space.
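The derivation of the context of discourse from a context concept (equation 2) can be sketched as follows; the hydrology hierarchy and the transitive-closure subsumption test are toy assumptions standing in for a DL reasoner.

```python
from typing import Dict, List, Set

# Toy subsumption hierarchy: concept -> direct super-concepts (hypothetical hydrology fragment).
SUPERS: Dict[str, List[str]] = {
    "Waterbody": [], "River": ["Waterbody"], "Lake": ["Waterbody"],
    "Canal": ["Waterbody"], "Reservoir": ["Lake"], "Road": [],
}

def subsumers(concept: str) -> Set[str]:
    """All (transitive) super-concepts of a concept, including the concept itself."""
    result = {concept}
    for parent in SUPERS.get(concept, []):
        result |= subsumers(parent)
    return result

def context_of_discourse(context_concept: str) -> Set[str]:
    """C_d = {C_t | C_t is subsumed by C_c}  (equation 2)."""
    return {c for c in SUPERS if context_concept in subsumers(c)}

if __name__ == "__main__":
    print(context_of_discourse("Waterbody"))   # target concepts for queries about water bodies
```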
The distinction between search and target concept is especially important for non-symmetric similarity. As will be discussed in the similarity functions step, the selection of a particular context concept does not only define which concepts are compared but also directly affects the measured similarity. The following list shows some exemplary similarity queries from the domain of hydrology, defined using search, target, and context concept:

How similar is Canal (C_s) to River (C_t)?
Which kind of Waterbody (C_c) is most similar to Canal (C_s)?
What is most similar to Waterbody ⊓ Artificial (C_q)?
What is more similar to Canal (C_s), River (C_t) or Lake (C_t)?
What are the two most similar Waterbodies (C_c) in the examined ontology?
In the first case, Canal is compared to River, and in the second case to all subconcepts of Waterbody (e.g., River, Lake, Reservoir). In contrast, the third case shows a query over the whole ontology. All concepts are compared for similarity to the query concept formed
by the conjunction of Waterbody and Artificial. Note that the query and context concepts are not necessarily part of the ontology, but can be defined by the user. The fourth query is an extended version of the first, with two target concepts selected by hand. Symmetric similarity measures can be defined without an explicit search and target concept, though this is difficult to argue from a cognitive point of view as direction is implicitly contained in many retrieval tasks.
3.3 Canonical normal form
Semantic similarity measures should only be influenced by what is said about concepts, not by how it is said (syntactic differences). If two concept descriptions denote the same referents using different language elements, they need to be rewritten in a common form to eliminate unintended syntactic influences. This step mainly depends on the underlying representation language and is most important for structural similarity measures. Two simple examples for description logics are:
1. Condition: (≥ n R.C) and n = 0. Rewrite: (≥ n R.C) to ⊤.
2. Condition: ∃R.C_1 ⊓ ∀R.C_2. Rewrite: ∃R.C_1 ⊓ ∀R.C_2 to ∃R.(C_1 ⊓ C_2).
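A minimal sketch of such a canonization step is given below; it applies only the two rewrite rules listed above to concept expressions encoded as nested tuples, which is a simplification for illustration and not SIM-DL's actual normal-form procedure.

```python
from typing import Tuple, Union

# Concept expressions as nested tuples, e.g. ("and", ("exists", "R", "C1"), ("forall", "R", "C2")).
Expr = Union[str, int, Tuple]

def normalize(expr: Expr) -> Expr:
    """Apply the two rewrite rules from above to a (simplified) concept expression."""
    if not isinstance(expr, tuple):
        return expr                                     # primitives and numbers stay unchanged
    head = expr[0]
    if head == "atleast" and expr[1] == 0:              # rule 1: (>= 0 R.C)  ->  Top
        return "Top"
    if head == "and" and len(expr) == 3:
        left, right = normalize(expr[1]), normalize(expr[2])
        # rule 2: (exists R.C1) and (forall R.C2)  ->  exists R.(C1 and C2)
        if (isinstance(left, tuple) and isinstance(right, tuple)
                and left[0] == "exists" and right[0] == "forall" and left[1] == right[1]):
            return ("exists", left[1], ("and", left[2], right[2]))
        return ("and", left, right)
    return tuple([head] + [normalize(e) for e in expr[1:]])

if __name__ == "__main__":
    print(normalize(("and", ("exists", "flowsInto", "Lake"), ("forall", "flowsInto", "Waterbody"))))
    print(normalize(("atleast", 0, "hasPart", "Lock")))
```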
One may also think of canonizations for conceptual spaces. For instance, if the dimensions density, mass, and volume are part of a knowledge base: the category of all entities with a density value 1ρ can be either expressed as a point on the density axis or as a curve in the space with dimensions mass and volume. Per definition, the denoted category contains the same entities, but the similarity value would be 0 using classical geometry-based similarity measures (see Figure 1). In such a case, a rewriting rule has to map one representation to the other. Of course, this example requires that the semantics of the involved dimensions is known. A first approach to handle these difficulties was presented by Raubal, introducing projection and transformation rules for conceptual spaces [78]. However, from a perspective of human cognition canonization may not always be possible.
Similar examples can be constructed for so-called transformational measures [35]. They define semantic similarity as a function over a set of transformation rules needed to derive one representation from another. Among others, transformation rules include deletion, mirroring, or shifting. Canonization may be required on two levels. First, it has to be ensured that the same set of transformations is used and that no transformation can be constructed out of others (as this would increase the transformation distance and, hence, decrease similarity). Second, the same representation has to be used. For instance, X²OX³OX³OX may be a condensed representation of the stimulus XXOXXXOXXXOX [31] and, hence, has to be unfolded before comparison to ensure that a shift of the first O towards the second counts as 3 instead of 2 steps.
In general, canonization is a complex and expensive task and should be reduced to a minimum. For instance, SIM-DLA uses the same similarity functions as our previous SIM-DL theory [44] but reduces the need for canonization and syntactic influence by breaking down the problem of inter-concept similarity to the less complex problem of inter-instance similarity [48]. This is achieved by comparing potential interpretations for overlap instead of a structural comparison of the formal specifications. In doing so, SIM-DLA addresses some of the challenges discussed in the introduction, namely how to deal with the multitude of potential graph representations. This is especially important for concepts specified using expressive description logics.
Figure 1: The category of all entities with the density of 1ρ specified using one dimension (a: 1ρ on the density dimension) or two dimensions (b: 1ρ, with ρ = m/v, on the mass (m) and volume (v) dimensions)
3.4 Alignment matrix
While the second step of the framework selects concepts for comparison, the alignment matrix specifies which concept descriptors (e.g., dimensions, features) are compared and how. We use the term “alignment” in a slightly different sense, but based on research in psychology that investigates how structure and correspondence influence similarity judgments [22, 27, 64, 66, 69]. The term “matrix” points to the fact that the selection of comparable tuples of descriptors requires a matrix C^D_s × C^D_t (where C^D_s and C^D_t are the sets of descriptors forming C_s and C_t, respectively).
Alignment-based approaches were developed as a reaction to classical feature-based and geometric models, which do not establish relations between features and dimensions. This also affects relations to other concepts or to instances. For example, in feature-based and geometric models it is not possible to state that two concepts are similar because their instances stand in a certain relation to instances of another concept. As depicted in Figure 2, the topological relation above(circle, triangle) [31] does not describe the same fact as above(triangle, circle). During a similarity assessment participants may judge above(circle, triangle) more similar to above(circle, rectangle) than to above(triangle, circle) because of the same role (namely being above something else) that the circle plays within the first examples (see also [65]).

The motivation behind alignment-based models is that relations between concepts and their instances are of fundamental importance to determine similarity [28, 29, 66]. If instances of two compared concepts share the same color, but the colored parts are not related to each other, then the common feature of having the same color may not influence the similarity assessments. This means that subjects tend to focus more on structures and relations than on disconnected features. Hence, alignment-based models claim that similarity cannot be reduced to matching features, but one must determine how these features align with others [31].
Figure 2: Being above something else as common feature used for similarity reasoning
(see [31] for details)
From a set of available concept descriptors, humans tend to select those for comparison which correspond in a meaningful way [22, 27, 64, 66, 69]. The literature distinguishes between alignable commonalities, alignable differences, and non-alignable differences. In the first case, entities and relations match. For instance, in above(circle, triangle), above(circle, triangle), above(circle, rectangle), and smaller(circle, triangle), the first two assertions are alignable because both specify an above relation, and common because of the related entities. In contrast, the second and third assertion form an alignable difference. While the assertions can be compared for similarity, the related entities do not match (but could still be similar). Non-alignable differences cannot be compared for similarity in a meaningful way. For instance, no meaningful notion of similarity can be established between above and smaller. While this example relates individuals within spatial scenes, the same argumentation holds for the concept level. The fact, for instance, that rivers are connected to other water bodies can be compared to the connectedness of roads. For this reason, both can be abstracted as being parts of transportation infrastructures. (At the same time, this example also demonstrates the vague boundaries between similarity and analogy-based reasoning.) In contrast, this connectedness cannot be compared to a has-depth relation of another water body as they form a non-alignable difference.
In the proposed similarity framework the alignment matrix tackles the following questions: in most similarity theories each concept descriptor from (C_s) is compared to exactly one descriptor from (C_t); how are these tuples selected? If the compared concepts are specified by a different number of descriptors, how are surplus descriptors to be treated [78]? Does it make a difference whether the remaining descriptors belong to the search or target concept? Are there specific weights for certain tuples or are all tuples of equal importance? How similar are concepts to their super-concepts and vice versa? Does the similarity measure depend on the search direction?

While the distinction between search and target concept was introduced in step 1, the question of how the search direction influences similarity also depends on the alignment. In theory, the following four settings can be distinguished:

A user is searching for a concept that exactly matches the search concept (C_s)...
and every divergence reduces similarity.
or is more specific.
or is more general.
or at least overlaps with C_s.
In the first case, similarity is 1 if C_s ≡ C_t and decreases with every descriptor from C_s or C_t that is not part of both specifications. Similarity reaches 0 if the compared concepts have no common descriptor. Asymmetry is not mandatory in this setting, but can be introduced by weighting distinct features differently depending on whether they are descriptors of C_s or C_t. In the second scenario, similarity is 1 if C_s ≡ C_t or if C_t is a sub-type of C_s; else, similarity is 0. Such a notion of similarity is not symmetric. If C_t is a sub-concept of C_s, the similarity sim(C_s, C_t) is 1, while sim(C_t, C_s) = 0. The third case works the other way around: similarity is 1 if C_s ≡ C_t or if C_s is a sub-type of C_t. In the last scenario, similarity is always 1, except for the case when C_s and C_t do not share a single descriptor.

In contrast to the first setting, the remaining cases can be reduced to subsumption-based information retrieval, as described by Lutz and Klien [62]. These settings only distinguish values between 1 and 0. In the second and third case, the search (query) concept is injected into the examined ontology. After reclassification, all sub- or super-concepts of C_s are part of the result set [49, 62]. The last scenario can be solved accordingly by searching for a common super-concept of C_s and C_t.
Consequently, a similarity theory should be based on the first case or a combination of the first and second, or first and third case. Such combinations necessarily lead to non-symmetric similarity measures. For instance, SIM-DL is a combination of settings one and two. (To be more precise, SIM-DL allows choosing between a symmetric and non-symmetric mode.) The similarity between two concepts decreases with a decreasing overlap of descriptors, while the similarity between a type and its sub-types is always 1. The geometric similarity measure defined by Schwering and Raubal [83] applies the following rules to handle (non-)symmetry: 1. The greater the overlap and the less the non-overlapping parts, the higher the similarity between compared concepts; 2. Distance values from subconcepts to their superconcept are zero; 3. Distance values from superconcept to subconcepts are always greater than zero, but not necessarily 1.
It is important to keep in mind that these design decisions are driven by the application
and not by a generic law of similarity [32,33,75, 85].
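The four settings can be illustrated on a simple feature-set reading of concept descriptors, as in the following sketch; rendering descriptors as plain feature sets (and setting one as a symmetric Jaccard overlap) is an assumption made for the example only.

```python
from typing import Set

def graded_overlap(cs: Set[str], ct: Set[str]) -> float:
    """Setting 1: every divergence reduces similarity (here a symmetric Jaccard overlap)."""
    return len(cs & ct) / len(cs | ct) if cs | ct else 1.0

def more_specific(cs: Set[str], ct: Set[str]) -> float:
    """Setting 2: 1 if the target carries all descriptors of the search concept, else 0."""
    return 1.0 if cs <= ct else 0.0

def more_general(cs: Set[str], ct: Set[str]) -> float:
    """Setting 3: 1 if the target is more general (its descriptors are a subset), else 0."""
    return 1.0 if ct <= cs else 0.0

def overlaps(cs: Set[str], ct: Set[str]) -> float:
    """Setting 4: 1 as soon as a single descriptor is shared."""
    return 1.0 if cs & ct else 0.0

if __name__ == "__main__":
    canal = {"waterbody", "navigable", "artificial"}
    river = {"waterbody", "navigable", "natural", "flowing"}
    for f in (graded_overlap, more_specific, more_general, overlaps):
        print(f.__name__, f(canal, river))
```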
3.5 Similarity functions
After selecting the compared concepts and aligning their descriptors, the similarity for each
selected tuple is measured. Depending on the representation language and application,
different similarity functions have to be applied. In most cases, each similarity function
itself takes care of standardization (to values between 0 and 1).
In case of the matching distance similarity measure (MDSM) [81], the features are distin-
guished into different types during the alignment process: parts, attributes, and functions.
Although a contextual weighting is computed for each of these types, the same similarity
function is applied to all of them.
$$S_t(c_1, c_2) = \frac{|C_1 \cap C_2|}{|C_1 \cap C_2| + \alpha(c_1, c_2) \cdot |C_1 \setminus C_2| + (1 - \alpha(c_1, c_2)) \cdot |C_2 \setminus C_1|} \qquad (4)$$
Equation 4 describes the non-symmetric similarity function for each of the feature types. S_t(c_1, c_2) is defined as the similarity for the feature type t between the entity classes c_1 and c_2. C_1 and C_2 are the sets of features of type t for c_1 and c_2, while |C_1 ∩ C_2| is the cardinality of the set intersection and |C_1 \ C_2| is the cardinality of the set difference. The relative importance α (equation 5) of the different features of type t is defined in terms of the distance d between c_1 and c_2 within a hierarchy that takes taxonomic and partonomic relations into account. Lub denotes the least upper bound, i.e., the immediate common superclass of c_1 and c_2 [81]. The distance is defined as d(c_1, c_2) = d(c_1, lub) + d(c_2, lub).

$$\alpha(c_1, c_2) = \begin{cases} \dfrac{d(c_1, lub)}{d(c_1, c_2)}, & d(c_1, lub) \leq d(c_2, lub) \\[1.5ex] 1 - \dfrac{d(c_1, lub)}{d(c_1, c_2)}, & d(c_1, lub) > d(c_2, lub) \end{cases} \qquad (5)$$
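A minimal sketch of the MDSM feature-type similarity (equations 4 and 5) is given below; the feature sets and the distances to the least upper bound are hypothetical inputs, and a real implementation would derive the latter from the taxonomic/partonomic hierarchy.

```python
from typing import Set

def alpha(d_c1_lub: int, d_c2_lub: int) -> float:
    """Relative importance of non-common features, from hierarchy distances (eq. 5)."""
    total = d_c1_lub + d_c2_lub            # d(c1, c2) = d(c1, lub) + d(c2, lub)
    if total == 0:
        return 0.5                         # both classes coincide with their lub
    ratio = d_c1_lub / total
    return ratio if d_c1_lub <= d_c2_lub else 1.0 - ratio

def mdsm_feature_similarity(f1: Set[str], f2: Set[str],
                            d_c1_lub: int, d_c2_lub: int) -> float:
    """Non-symmetric similarity for one feature type t (eq. 4)."""
    a = alpha(d_c1_lub, d_c2_lub)
    common = len(f1 & f2)
    denom = common + a * len(f1 - f2) + (1 - a) * len(f2 - f1)
    return common / denom if denom else 1.0

if __name__ == "__main__":
    canal = {"flow", "navigation", "irrigation"}
    river = {"flow", "navigation", "fishing", "flooding"}
    # distances to the least upper bound (e.g., Waterbody) are assumed to be 1 and 1
    print(mdsm_feature_similarity(canal, river, 1, 1))
    print(mdsm_feature_similarity(river, canal, 1, 1))
```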
MDSM accounts for context by introducing weights for the different types of features. While the integration of these weights (ω_t in equation 13) plays a role for the overall similarity, the two weighting functions are introduced here. The relevance of each feature type is defined either by the variability P^v_t (equation 6) or commonality P^c_t function (equation 7) and then normalized with respect to the remaining feature types so that the sum of ω_p + ω_f + ω_a is always 1.

$$P^v_t = 1 - \sum_{i=1}^{l} \frac{o_i}{n \cdot l} \qquad (6)$$
The variability describes how diagnostic [30, 88] or characteristic a feature t is within a certain application. A certain feature of type t has low relevance if it appears in many classes and high relevance if it is not common to the classes within the domain. P^v_t is the sum of the diagnosticity of all features of the type t in the domain and therefore 0 when all features are shared by all entity classes (P^v_t = 1 - 1 = 0), and close to 1 if each feature is unique (o_i is the number of occurrences of the feature within the domain) and the number of features l and classes n in the domain is high.

$$P^c_t = \sum_{i=1}^{l} \frac{o_i}{n \cdot l} = 1 - P^v_t \qquad (7)$$
Commonality is defined as the opposite of variability (P^c_t = 1 - P^v_t) and assumes that by defining a domain of application the user implicitly states what features are relevant [81].
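The two weighting functions can be sketched as follows for a toy domain of application; the per-type normalization simply rescales the relevance values so that the weights sum to 1, as required for equation 13.

```python
from typing import Dict, List, Set

def variability(classes: List[Set[str]]) -> float:
    """P_v = 1 - sum_i o_i / (n * l): o_i occurrences of feature i, n classes, l features."""
    features = set().union(*classes) if classes else set()
    n, l = len(classes), len(features)
    if n == 0 or l == 0:
        return 0.0
    occurrences = sum(sum(1 for c in classes if f in c) for f in features)
    return 1.0 - occurrences / (n * l)

def commonality(classes: List[Set[str]]) -> float:
    """P_c = 1 - P_v."""
    return 1.0 - variability(classes)

def normalized_weights(per_type: Dict[str, float]) -> Dict[str, float]:
    """Normalize the per-type relevance values so that the weights sum to 1."""
    total = sum(per_type.values())
    return {t: (v / total if total else 1.0 / len(per_type)) for t, v in per_type.items()}

if __name__ == "__main__":
    # feature sets of type "function" and "part" for three entity classes in the domain
    functions = [{"flow", "navigation"}, {"flow", "fishing"}, {"flow"}]
    parts = [{"bank"}, {"bank", "delta"}, {"bank"}]
    weights = normalized_weights({"function": variability(functions), "part": variability(parts)})
    print(variability(functions), commonality(functions), weights)
```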
In contrast to MDSM, SIM-DL and SIM-DLA distinguish between several similarity functions for roles and their fillers, e.g., functions for conceptual neighborhoods, role hierarchies, or co-occurrence of primitives. Primitives (also called base symbols) occur only on the right-hand side of definitions. To measure their similarity (sim_p, see equation 8), an adapted version of the Jaccard similarity coefficient is used. It measures the degree of overlap between two sets S_1 and S_2 as the ratio of the cardinality of shared members (e.g., features) from S_1 ∩ S_2 to the cardinality retrieved from S_1 ∪ S_2. In SIM-DL, the coefficient is applied to compute the context-aware co-occurrence of primitives within the definitions of other (non-primitive) concepts [44]. Two primitives are the more similar, the more complex concepts are defined by both (and not only one) of them. If sim_p(A, B) = 1, both primitives always co-occur in complex concepts and cannot be distinguished. As similarity depends on the context of discourse [40], only those concepts C_i are considered which are subconcepts of C_c (see step two of the similarity framework).
$$sim_p(A, B) = \frac{|\{C \mid (C \sqsubseteq C_c) \wedge (C \sqsubseteq A) \wedge (C \sqsubseteq B)\}|}{|\{C \mid (C \sqsubseteq C_c) \wedge ((C \sqsubseteq A) \vee (C \sqsubseteq B))\}|} \qquad (8)$$
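The following sketch approximates equation 8 by treating "C ⊑ A" as "the primitive A appears in the definition of C", which is a simplification; the concept definitions and the context are hypothetical.

```python
from typing import Dict, Set

# Hypothetical definitions: complex concept -> primitives appearing in its definition.
DEFINITIONS: Dict[str, Set[str]] = {
    "River":  {"Waterbody", "Flowing", "Natural"},
    "Canal":  {"Waterbody", "Flowing", "Artificial"},
    "Lake":   {"Waterbody", "Standing", "Natural"},
    "Road":   {"Artificial", "TransportInfrastructure"},
}

def sim_primitive(a: str, b: str, context: Set[str]) -> float:
    """Jaccard-style co-occurrence of primitives a and b within the context of discourse (eq. 8)."""
    both = {c for c in context if a in DEFINITIONS[c] and b in DEFINITIONS[c]}
    either = {c for c in context if a in DEFINITIONS[c] or b in DEFINITIONS[c]}
    return len(both) / len(either) if either else 1.0

if __name__ == "__main__":
    water_context = {"River", "Canal", "Lake"}      # subconcepts of a context concept Waterbody
    print(sim_primitive("Flowing", "Natural", water_context))   # co-occur in River only
    print(sim_primitive("Waterbody", "Waterbody", water_context))
```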
SIM-DL uses a modified network-based approach [76] to compute the similarity between roles (R and S) within a hierarchy. Similarity (sim_r, see equation 9) is defined as the ratio between the shortest path from R to S and the maximum path within the graph representation of the role hierarchy; the universal role U (U ≡ Δ^I × Δ^I) forms the graph's root. Compared to sim_p, similarity between roles is defined without reference to the context. This would require taking only such roles into account which are used within quantifications or restrictions of concepts within the context. The standardization in equation 9 is depth-dependent to indicate that the distance from node to node decreases with increasing depth level of R and S within the hierarchy. In other words, the weights of the edges used to determine the path between R and S decrease with increasing depth of the graph. If a path between two roles crosses U, similarity is 0. The lcs(R, S) is the least common subsumer, in this case the first common super role of R and S.

$$sim_r(R, S) = \frac{depth(lcs(R, S))}{depth(lcs(R, S)) + edge\_distance(R, S)} \qquad (9)$$
Similarity between topological or temporal relations (sim_n, see equation 10) equals their normalized distance within the graph representation of their conceptual neighborhood. In contrast to sim_r, the normalization is not depth-dependent but based on the longest path within the neighborhood graph.

$$sim_n(R, S) = \frac{max\_distance_n - edge\_distance(R, S)}{max\_distance_n} \qquad (10)$$
The similarity between role-filler pairs (sim_rf, see equation 11) is defined by the similarity of the involved roles R and S times the overall similarity of the fillers C and D, which can again be complex concepts.

$$sim_{rf}(R(C), S(D)) = sim_r(R, S) \cdot sim_o(C, D) \qquad (11)$$
Some similarity measures define role-filler similarity as the weighted average of the role and filler similarities, but the multiplicative approach has proven to be cognitively plausible [43] and allows for simple approximation and optimization techniques not discussed here in detail.
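A compact sketch of the role-related functions is given below: sim_role follows the depth/edge-distance ratio of equation 9 on a toy role hierarchy rooted in a universal role U, and sim_role_filler applies the multiplicative combination of equation 11 to a filler similarity that is passed in as a placeholder value.

```python
from typing import Dict, Optional

# Toy role hierarchy: role -> direct super-role, rooted in the universal role "U" (hypothetical).
PARENT: Dict[str, Optional[str]] = {
    "U": None, "connectedTo": "U", "flowsInto": "connectedTo",
    "drainsInto": "connectedTo", "hasPart": "U",
}

def ancestors(role: str):
    path = [role]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path                                          # role, ..., "U"

def sim_role(r: str, s: str) -> float:
    """Depth of the least common subsumer vs. edge distance (eq. 9); 0 if only U is shared."""
    anc_r, anc_s = ancestors(r), ancestors(s)
    lcs = next(a for a in anc_r if a in anc_s)
    if lcs == "U":
        return 0.0
    depth = len(ancestors(lcs)) - 1                      # depth of U is 0
    edge_distance = anc_r.index(lcs) + anc_s.index(lcs)
    return depth / (depth + edge_distance)

def sim_role_filler(r: str, s: str, filler_similarity: float) -> float:
    """Multiplicative role-filler combination (eq. 11)."""
    return sim_role(r, s) * filler_similarity

if __name__ == "__main__":
    print(sim_role("flowsInto", "drainsInto"))           # share connectedTo -> 1 / (1 + 2)
    print(sim_role_filler("flowsInto", "drainsInto", 0.8))
```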
In the case of geometric approaches to similarity, the spatial distance in the conceptual (vector) space is interpreted as the semantic distance d. Consequently, similarity increases with decreasing spatial distance. A classical function for geometry-based similarity measures is given by the Minkowski metric (see equation 12). The parameter r is used to switch between different distances, such as the Manhattan distance (r = 1) and the Euclidean distance (r = 2) [31]. A more detailed discussion with regard to a metric conceptual space algebra including weights is given by Adams and Raubal [1].
$$d(c, d) = \left( \sum_{i=1}^{n} |c_i - d_i|^r \right)^{\frac{1}{r}} \qquad (12)$$
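The Minkowski metric itself is straightforward to sketch; note that the mapping from distance to similarity used below (1/(1+d)) is just one common monotone choice and is not fixed by equation 12.

```python
from typing import Sequence

def minkowski(c: Sequence[float], d: Sequence[float], r: float = 2.0) -> float:
    """Minkowski distance (eq. 12): r = 1 Manhattan, r = 2 Euclidean."""
    return sum(abs(ci - di) ** r for ci, di in zip(c, d)) ** (1.0 / r)

def distance_to_similarity(distance: float) -> float:
    """One common monotone mapping from semantic distance to similarity (not prescribed here)."""
    return 1.0 / (1.0 + distance)

if __name__ == "__main__":
    canal = [0.9, 0.2, 0.8]       # toy coordinates on three quality dimensions
    river = [0.8, 0.7, 0.9]
    for r in (1.0, 2.0):
        dist = minkowski(canal, river, r)
        print(r, dist, distance_to_similarity(dist))
```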
Note that, while we focus on inter-concept similarity here, certain similarity functions
can also take knowledge about instances into account to derive information about concept
similarity [15–17].
3.6 Overall similarity
In the sixth step of the framework, the single similarity values derived from applying the similarity functions to all selected tuples of compared concepts are combined into an overall similarity value. In most theories this step is a standardized (to values between 0 and 1) weighted sum.
For MDSM, the overall similarity is the weighted sum of the similarities determined between functions, parts, and attributes of the compared entity classes c_1 and c_2. The weights indicate the relative importance of each feature type using either the commonality or variability model introduced before (equation 13). At the same time, the weights act as standardization factors (Σω = 1) [81].

$$S(c_1, c_2) = \omega_p \cdot S_p(c_1, c_2) + \omega_f \cdot S_f(c_1, c_2) + \omega_a \cdot S_a(c_1, c_2) \qquad (13)$$
In case of SIM-DL, each similarity function takes care of its standardization using the
number of compared tuples or the graph depth. Each similarity function returns a stan-
dardized value to the higher-level function by which it was called. Hence, overall similarity
is simply the (standardized) sum of the single similarity values.
For geometric approaches, the overall similarity is given by the z-transformed sum of compared values [77], in order to account for different dimensional units. Each z_i score is computed according to equation 14, where x_i is the i-th value of the quality dimension X, x̄ is the mean of all x_i of X, and s_x is the standard deviation of these x_i.

$$z_i = \frac{x_i - \bar{x}}{s_x} \qquad (14)$$

The overall similarity is then defined using the Minkowski metric (see equation 12) where n is the number of quality dimensions and c and d are the z-transformed values for the compared concepts (per dimension).
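Putting equations 12 and 14 together, the sketch below z-transforms each quality dimension across all concepts of a toy domain before computing the Minkowski distance between two of them; the distance-to-similarity mapping is again only illustrative.

```python
import statistics
from typing import Dict, List

def z_transform(values_per_dimension: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Standardize every quality dimension to zero mean and unit standard deviation (eq. 14)."""
    transformed = {}
    for dim, values in values_per_dimension.items():
        mean, sd = statistics.mean(values), statistics.pstdev(values)
        transformed[dim] = [(v - mean) / sd if sd else 0.0 for v in values]
    return transformed

def geometric_similarity(values_per_dimension: Dict[str, List[float]],
                         i: int, j: int, r: float = 2.0) -> float:
    """z-transform all concepts per dimension, then compare concepts i and j via eq. 12."""
    z = z_transform(values_per_dimension)
    distance = sum(abs(z[dim][i] - z[dim][j]) ** r for dim in z) ** (1.0 / r)
    return 1.0 / (1.0 + distance)          # illustrative mapping from distance to similarity

if __name__ == "__main__":
    # three concepts (e.g., Creek, River, Canal) described on dimensions with different units
    dims = {"depth_m": [1.0, 8.0, 4.0], "flow_m3s": [5.0, 900.0, 60.0]}
    print(geometric_similarity(dims, 0, 2))   # Creek vs. Canal
```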
3.7 Interpretation of similarity values
All of the introduced measures map two compared concepts to a real number. They do not explain their results or point to descriptors for which the concepts differ. Such a single value (e.g., 0.7) is difficult to interpret. For instance, it does not answer the question whether there are more or less similar target concepts in the examined ontology. It is not sufficient to know that possible similarity values range from 0 to 1 as long as their distribution remains unclear. If the least similar target concept in an ontology has a similarity value of 0.65 to the source concept and the most similar concept yields 0.9, a similarity value of 0.7 is not necessarily a good match. It is difficult to argue why a single similarity value is cognitively plausible without reference to other results [51]. Moreover, the threshold value above which compared concepts are considered similar depends on the specific application and context.
Therefore, measures such as MDSM or SIM-DL rely on similarity rankings. They com-
pare a search concept to all target concepts from the domain of discourse and return the
results as an ordered list of descending similarity values. Consequently, one would not argue that a particular similarity value is cognitively plausible, but that a ranking correlates with human estimations [43]. Such a ranking puts a single similarity value in context by delivering additional information about the distribution of similarity values and their range. We call this context the interpretation context (C_i, see [40] for more details on different kinds of contexts and their impact on similarity measures).
$$C_i : (C_s, C_t, simV) \in \Delta_{sim} \times C_a \rightarrow \Psi(C_s, C_t) \in \Delta_{\Psi} \qquad (15)$$

The interpretation context (see equation 15) maps the triple of search concept (C_s), target concept (C_t), and similarity value (simV) from the set of measured similarities between the search concept and each target concept in C_d (Δ_sim), together with the restrictions specified by the application context (C_a), to an interpretation value (Ψ(C_s, C_t)) from the domain of interpretations (Δ_Ψ). The application context [40] describes the settings by which a similarity measure can be adapted to the user's needs, e.g., whether the commonality or variability weightings in MDSM should be selected.
The simplest domain of interpretation can be formed by Δ_Ψ = {t, f}. Depending on the remaining pairs of compared concepts from Δ_sim as well as the application area, each triple is either mapped to true or false. Therefore, the question of whether concepts are similar is answered by yes or no. For graphical user interfaces, similarity values can also be mapped to font sizes using a logarithmic tag cloud algorithm (see Figure 3). Note that as C_i depends on Δ_sim, it does not simply map an isolated similarity value to yet another domain. For example, the maximum font size will always be assigned to the target concept with the highest similarity to the search concept, independent of the specific value.
Figure 3: Font size scaling for similarity values, based on [47]
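A possible logarithmic tag-cloud scaling is sketched below; it assigns the largest font to the top-ranked target concept regardless of its absolute similarity value, as described above. The concrete formula and the point sizes are assumptions, not the algorithm used in [47].

```python
import math
from typing import Dict

def font_sizes(similarities: Dict[str, float],
               min_pt: int = 10, max_pt: int = 24) -> Dict[str, int]:
    """Map a ranking of similarity values to font sizes with a logarithmic tag-cloud scaling."""
    lo, hi = min(similarities.values()), max(similarities.values())
    span = math.log1p(hi - lo) or 1.0            # guard against identical values
    sizes = {}
    for concept, value in similarities.items():
        weight = math.log1p(value - lo) / span   # 0 for the least, 1 for the most similar target
        sizes[concept] = round(min_pt + weight * (max_pt - min_pt))
    return sizes

if __name__ == "__main__":
    ranking = {"River": 0.90, "Stream": 0.82, "Canal": 0.74, "Lake": 0.65}
    print(font_sizes(ranking))   # the top-ranked concept always receives max_pt
```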
3.8 Properties of similarity measures
The proposed framework helps to understand how similarity theories work and what they measure. This is essential for choosing the optimal measure for a specific application, to compare similarity measures, and to interpret similarity values and rankings. The framework also unveils basic properties of a particular measure, e.g., whether it is reflexive, symmetric, transitive, strict, minimal, etc. (see [6, 13, 31, 75] for a detailed discussion from the perspectives of computer science and psychology). As an example, the following paragraphs discuss strictness and symmetry for the SIM-DL/SIM-DLA theory, as well as the
relation between similarity and dissimilarity. The triangle inequality is discussed as an
important property of geometric approaches.
Strictness is often referred to as an important property of similarity [87]. Formally, strictness states that the maximum similarity value is only assigned to equal stimuli (e.g., concepts): sim(C, D) = 1 if and only if C ≡ D. This is related to the minimality property, which claims that two different stimuli are at most as similar to each other as a stimulus is to itself: sim(C, D) ≤ sim(C, C) [6, 31]. In the literature, minimality is defined for dissimilarity: dis(C, D) ≥ dis(C, C). In SIM-DL, the similarity value 1 is interpreted as equal or not distinguishable (within a given context). This is for two reasons: co-occurrence between primitives and non-symmetry. The comparison of two primitives yields 1 if they cannot be differentiated, i.e., if they always appear jointly within concept definitions (see equation 8). As SIM-DL focuses on information retrieval, a target concept satisfies the user's needs (sim(C_s, C_t) = 1) if it is a sub-concept of the search concept (step 4 of the framework). Consequently, similarity in SIM-DL is not strict.
Symmetry is one of the most controversial properties of similarity. While several theories from computer science argue that similarity is essentially a symmetric relation [61], research from cognitive science favors non-symmetric similarity measures [56, 69, 74, 88]. As argued in the previous sections, SIM-DL allows the user to switch between a symmetric and a non-symmetric mode. From Tversky's [88] point of view, one may argue that this is nothing more than indecision. However, the understanding of symmetry underlying SIM-DL is driven by Nosofsky's notion of a biased measure [74]. Symmetry is not a characteristic of similarity as such, but of the process of measuring similarity. This process is driven (biased) by a certain task, namely information retrieval. Whether the comparison of two concepts is symmetric or not depends on the application area and task (and therefore on the alignment process), but not on the measure as such. This again reflects the need for a separation between the alignment and the application of concrete similarity functions.
Dissimilarity and similarity are often used interchangeably, assuming that dissimilarity is simply the counterpart of similarity: dis(C, D) = 1 - sim(C, D). While this may be true for certain cases, it is not a valid assumption in general [31]. As argued by Tversky [88], Nosofsky [74], and Dubois and Prade [19], similarity and dissimilarity are different views on stimuli comparison. SIM-DL, for instance, stresses the alignment of descriptors. If the task is to find dissimilarities between compared concepts, other tuples might be selected for comparison and alignment. One can demonstrate that the assumption dis(C, D) = 1 - sim(C, D) is oversimplified and counter-intuitive using SIM-DL's maximum similarity function for concepts formed by logical disjunction. For simplification, consider the concepts C ≡ A ⊔ B and D ≡ C ⊔ E, where A, B, and E are primitives. To measure the similarity sim(C, D), SIM-DL unfolds their definitions and creates the following alignment tuples: (A, A), (A, B), (A, E), (B, A), (B, B), and (B, E). Out of this set, the tuples (A, A) and (B, B) are chosen for further computation and finally, sim(C, D) returns 1. Consequently, the resulting dissimilarity dis(C, D) should be 0. This is true if one still applies the maximum similarity function. Instead, when searching for dissimilarities between compared concepts, one would rather use a minimum similarity function and thus take E into account for comparison to A or B. In both cases, dis(C, D) can be greater than 0.
Triangle Inequality describes the metric property according to which the distance be-
tween two points cannot be greater than the distance between these points reached via an
additional third point. Surprisingly, it turns out that even such fundamental properties of
geometry cannot be taken for granted. Instead, Tversky and Gati demonstrated that the
triangle inequality does not necessarily hold for cognitive measures of similarity [89].
4 Similarity in semantics-based information retrieval
While the proposed framework defines how similarity is measured, this section demonstrates its role in semantics-based geographic information retrieval and its integration into user interfaces.
4.1 Retrieval paradigms
Previously, we defined information retrieval by the degree of relevance m[R(O, (Q, I, →))] without stating how to measure this relevance. Based on this definition and without going into any details about query rewriting and expansion, we explain the role of similarity by restricting the definition such that:

O is a set of target concepts (C_t) in an ontology,
Q is a particular concept phrased or selected for the search (C_s),
I and → are additional contextual information at execution time (C_c),
R is the similarity relationship between pairs of concepts, and
m is the degree of similarity between pairs of concepts.
In contrast to purely syntactic approaches, semantics-based information retrieval takes the underlying conceptualizations into account to compute relevance and hence improves searching and browsing through structured data. In general, one can distinguish between two approaches for concept retrieval: those based on classical subsumption reasoning and those that rely on semantic similarity measures [49]. Simplifying, subsumption reasoning can be applied to vertical search, while similarity works best for horizontal search, i.e., similarity values are difficult to interpret when comparing sub- and super-types.

Formally, the result set for a subsumption-based query is defined as RS = {C | C ∈ O ∧ C ⊑ Q}. As each concept in RS is a subsumee of the search/query concept, it meets the user's search criteria (see Figure 4a). Consequently, there is no degree of relevance m; or, to put it in other words, it is always 1. The missing relevance information and rigidity of subsumption make selecting an appropriate search concept the major challenge for subsumption-based retrieval. In many cases, the search concept will be an artificial construct and not necessarily the searched concept (see [49] for details). If it is too generic (i.e., too close to the top of the hierarchy) the user will get a large part of the queried ontology back as an unsorted result set; if the search concept is too narrow, the result set will only contain a few or even no concepts.
For similarity-based retrieval as depicted in Figure 4b, the result set is defined as RS = {C | C ∈ O ∧ sim(Q, C) > t}, where t is a threshold defined by the user or application [44, 49]. In contrast to subsumption-based retrieval, the search concept is the concept the user is really searching for, no matter whether it is part of the queried ontology or not. As similarity computes the overlap between concept definitions (or their extensions [16, 48])
it is more flexible than a purely subsumption-based approach. Moreover, the results are ranked, i.e., returned as an ordered list with descending similarity values representing the relevance m. This makes it easier for the user to select an appropriate concept from the results. However, it is not guaranteed that the returned concepts match all of the user's search criteria. Consequently, the benefits similarity offers during the retrieval phase, namely to deliver a flexible degree of (conceptual) overlap with a searched concept, stand against shortcomings during the selection phase, because the results do not necessarily match all of the user's requirements.
To overcome these shortcomings, similarity theories such as SIM-DL and MDSM combine subsumption and similarity reasoning by introducing contexts to reduce the set of potential target concepts (see equations 2 and 3). As depicted in Figure 4c, only those concepts are compared for similarity that are subconcepts of the context concept C_c. This way, the user can specify some minimal characteristics all target concepts need to share. Typically, user interfaces and search engines will be designed in a way to infer or at least approximate C_c from additional, implicit contextual information (I, →). Consequently, for the combined retrieval paradigm the result set is defined as RS = {C | C ∈ O ∧ C ⊑ C_c ∧ sim(Q, C) > t}.
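The combined paradigm can be summarized in a few lines of Python; the subsumption test and the similarity function are passed in as placeholders (they could be any of the measures discussed above), and the geometric-figure ontology and similarity values are made up for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

def combined_retrieval(ontology: Set[str],
                       query_concept: str,
                       context_concept: str,
                       subsumed_by: Callable[[str, str], bool],
                       similarity: Callable[[str, str], float],
                       threshold: float = 0.0) -> List[Tuple[str, float]]:
    """RS = {C | C in O, C subsumed by C_c, sim(Q, C) > t}, ranked by descending similarity."""
    candidates = [c for c in ontology if subsumed_by(c, context_concept)]
    scored = [(c, similarity(query_concept, c)) for c in candidates]
    return sorted([s for s in scored if s[1] > threshold], key=lambda p: p[1], reverse=True)

if __name__ == "__main__":
    # toy ontology and hand-made similarity values for illustration only
    supers: Dict[str, Set[str]] = {"Rectangle": {"Quadrilateral"}, "Rhombus": {"Quadrilateral"},
                                   "Square": {"Quadrilateral"}, "Circle": set()}
    sims = {"Rectangle": 1.0, "Square": 0.9, "Rhombus": 0.6, "Circle": 0.2}
    result = combined_retrieval(set(supers), "Rectangle", "Quadrilateral",
                                lambda c, cc: cc in supers[c] or c == cc,
                                lambda q, c: sims[c], threshold=0.3)
    print(result)   # Circle is filtered out by the context concept, the rest is ranked
```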
Figure 4 shows an ontology of geometric figures as a simplified example to illustrate the differences between the introduced paradigms. Note that some quadrilaterals and relations between them have been left out to increase readability. We assume that a user is searching for quadrilaterals with specific characteristics. In the subsumption-only case, the result set contains types such as Rectangle, Rhombus, Square, and so forth, without additional information about their degree of relevance. In the similarity-only case, the result set contains additional relevance information for these types but also geometric figures such as Circle which do not satisfy all the requirements specified by the user. Note, however, that they would appear at the end of the relevance list due to their low similarity (indicated by the shift from green over yellow to red in Figure 4b). In case of the combined paradigm, a user could prefer quadrilaterals with right angles by specifying Rectangle as search concept and Quadrilateral as context concept. In contrast to the similarity-only case, the result set does not contain Circle but still delivers information about the degree of relevance.
Before going into details about the integration of the combined approach into user interfaces, we briefly need to discuss two questions which have remained unanswered so far. First, one could argue that combining subsumption and similarity reasoning by introducing the context concept as a least upper bound only shifts the query formulation problem from the search concept to the context concept. If the user chooses a context concept that is too narrow, then this has the same effects as in the subsumption-only case. While this is true in general, we will demonstrate in the next section that the context concept can be derived as inferred information from the query, which is not the case for the search concept. Moreover, the combined approach still delivers ranked results instead of an unstructured set. Second, so far we have restricted our concept retrieval cases to queries based on the notion of a search or query concept and therefore to intensional retrieval paradigms. Nevertheless, there are also extensional paradigms for retrieval, e.g., based on non-standard inference techniques such as computing the least common subsumer (lcs) or most specific concept (msc) [49, 57, 70]. We will discuss these approaches using a query-by-example interface in which reference individuals are selected for searching.
Figure 4: Semantics-based retrieval in a simplified ontology of geometric figures. (a) Subsumption-based retrieval; (b) similarity-based retrieval; (c) subsumption and similarity-based retrieval.
4.2 Application
This section introduces two web-based user interfaces implementing similarity and subsumption-based retrieval. The interfaces have been implemented, evaluated [43, 47], and are available as free and open source software (http://sim-dl.sourceforge.net/applications/). Their integration into spatial data infrastructures was recently discussed by Janowicz et al. [46] and is left aside here.
Figure 5: A subsumption and similarity-based user interface for Web gazetteers [47]
4.2.1 Selecting a search concept
Figure 5 shows a semantics-based user interface for the Alexandria Digital Library Gazetteer. The interface implements the intensional retrieval paradigm based on a combination of similarity and subsumption reasoning. A user can enter a search concept using a search-while-you-type, AJAX-based text field. To improve navigation between geographic feature types, the interface displays the immediate super-type as well as a list of similar types [42, 47]. Following the discussion of interpretation in Section 3.7, a decreasing font size indicates decreasing similarity between the search concept and the proposed target concepts. In the example query, the type Stream is selected for comparison and the interface displays Watercourse as super-type to broaden the search. River is the most similar concept, followed by other hydrographic feature types. By clicking on a super-type or a similar type, it is selected as the search concept for a new query. The map is used to restrict the search to a specific area; the interface displays matching features on the right side and on the map. The interface does not support the selection of a context concept by the user, as this would overload the interface, and the underlying idea of a context concept may be difficult to explain to ordinary users. Nevertheless, the context concept can be inferred from implicit information, e.g., from the map component: it can be derived by computing the least common subsumer of all feature types that have features in the current map extent. Yet, this approach only works well for particular zoom levels and becomes meaningless if the user searches a larger area.
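A minimal sketch of this derivation is given below; the gazetteer and reasoner calls (feature_type, lcs) are assumed placeholders rather than the actual Alexandria Digital Library or SIM-DL API.

```python
# Derive the context concept as the least common subsumer (lcs) of all
# feature types instantiated within the current map extent (sketch).
def infer_context_concept(features_in_extent, gazetteer, reasoner):
    types = {gazetteer.feature_type(f) for f in features_in_extent}
    context = None
    for t in types:
        # Fold the binary lcs over all types; a single type is returned
        # unchanged as the context concept.
        context = t if context is None else reasoner.lcs(context, t)
    return context
```

As noted above, the larger the searched extent, the more the computed lcs drifts towards a top-level feature type and the less it constrains the search.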
4.2.2 Query-by-example
Figure 6: A conceptual design of a query-by-example-based Web interface for recommender services (see [90] for an implementation of such an interface for climbing routes using the SIM-DL server)

Figure 6 shows a user interface implementing an extensional (example-based) paradigm using similarity and non-standard inference. It overcomes two shortcomings of the previous interface. First, some users may be unfamiliar with using feature types for search
and navigation; second, the previous interface does not offer a convincing way to infer the
context concept with a minimum of user interaction. The query-by-example interface al-
lows the user to select particular reference features instead of types. The most specific concept [57] is computed for each of these features. Based on these concepts, the least common subsumer [57] can be determined and used as context concept to deliver an inter-concept similarity ranking [90]. In the example query, three different water bodies are selected as reference features, and Canal is computed to be the most similar concept to the least common subsumer of those concepts instantiated by the selected features. While the first interface is typical for web gazetteers, the second interface focuses on decision support and recommender services. For instance, if the user is searching for interesting canoeing spots for her next vacation, the selected water bodies may be picked from previous canoeing trips at different locations [49].
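The core of this extensional paradigm can be sketched with the same hypothetical reasoner interface, where msc computes the most specific concept of an individual and lcs the least common subsumer of two concepts.

```python
# Query-by-example (sketch): generalize the selected reference features to
# a single concept and rank candidate concepts by similarity to it.
def query_by_example(reference_features, candidate_concepts, reasoner, sim):
    # Most specific concept for each selected reference feature,
    # e.g., the three water bodies picked from previous trips.
    mscs = [reasoner.msc(f) for f in reference_features]
    # Generalize the mscs to a single query concept via the lcs,
    # which also serves as the context concept for the ranking.
    query_concept = mscs[0]
    for concept in mscs[1:]:
        query_concept = reasoner.lcs(query_concept, concept)
    # Rank candidate concepts (e.g., Canal, River, Lake) by decreasing
    # similarity to the derived concept.
    return sorted(candidate_concepts,
                  key=lambda c: sim(query_concept, c),
                  reverse=True)
```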
5 Conclusions and further work
In this article we introduced a generic framework for semantic similarity measurement. The framework consists of seven sequential steps used to explain what and how a particular theory measures. It clearly separates the process of measuring similarity and finding alignable descriptors from the concrete functions used to compute similarity values for selected tuples of these descriptors. It also discusses the role of context and of additional application-specific parameters, as well as the interpretation of similarity values. We do not try to squeeze all existing similarity measures into our framework, but argue that, by describing how the proposed steps are realized within this framework, a measure defines the semantics of similarity. Such a specification, however, is a prerequisite for comparing existing measures and for selecting them for specific applications. A similar argument was made before by Hayes for the notion of context [36]. Besides offering new insights into similarity theories used in GIScience and beyond, the article also discusses the role of these measures in semantics-based geographic information retrieval, introduces retrieval paradigms, and shows their implementations and limitations for real user interfaces.
Further work should focus on the following issues. First, while progress has been made on developing similarity theories for more expressive description logics [4, 17, 48], the approximation and explanation of similarity values is still at an early stage. Both topics are crucial for the adoption of similarity-based information retrieval paradigms in more complex applications. Approximation techniques aim at reducing the computational costs of similarity measurement. While the theories reviewed here can compare dozens of concepts within a reasonable time frame, they do not scale well. In general, two directions for future work seem reasonable. On the one hand, one could try to improve the selection and alignment process to reduce the number of comparable concepts and tuples in the first place. On the other hand, one could approximate the similarity values and only compute exact values for candidates that are above a certain threshold. In SIM-DL, for instance, the role-filler similarity is defined by multiplying role and filler similarities. The computation of role similarities is realized by a simple network-based distance. Hence, if the resulting value is below the defined threshold, the more complex filler similarity does not need to be computed.
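Assuming that role and filler similarities are normalized to [0, 1], the product can never exceed the role similarity alone, which is what makes this pruning safe. A sketch of the idea (not the actual SIM-DL code) is shown below.

```python
# Threshold-based pruning of role-filler similarity (sketch): the cheap,
# network-based role similarity is computed first; the expensive filler
# similarity is skipped whenever the role similarity alone already falls
# below the threshold, since the product of two values in [0, 1] cannot
# reach the threshold either.
def role_filler_similarity(role_pair, filler_pair, role_sim, filler_sim, threshold):
    s_role = role_sim(*role_pair)              # cheap: network-based distance
    if s_role < threshold:
        return 0.0                             # pruned candidate
    return s_role * filler_sim(*filler_pair)   # exact value only when needed
```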
The downside of using more expressive description logics and approximation techniques is that similarity values become even harder to interpret. In the long term, it will be necessary to assist the user by providing explanations in addition to plain numerical values or rankings. Future reasoners could list which descriptors were taken into account and visualize their impact on overall similarity. While this is important for information retrieval, it would be even more relevant for ontology engineering and negotiation [45]. This way, similarity reasoning could be used to establish bridges between communities across cultures and ages. So far, there has been no work on explaining similarity values, but an adaptation of recent work on axiom pinpointing [7] may be a promising starting point.
Next, evaluation methods for comparing computational similarity measures to human similarity rankings are still limited. An interesting research direction towards semantic precision and recall was recently proposed by Euzenat [21], while Keßler [52] investigates whether and how one can go beyond simple correlation measures to evaluate the cognitive plausibility of similarity theories. Another approach to adjusting similarity values to the user's needs would be to compute weights from partial knowledge gained from user feedback [41].
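As an illustration of the baseline that this line of work tries to move beyond, a simple evaluation correlates the computed similarity scores with averaged human judgments, e.g., using Spearman's rank correlation (a sketch; the concept ordering and score scales are assumptions):

```python
from scipy.stats import spearmanr

# Correlate computed similarity scores with averaged human judgments for
# the same target concepts, listed in the same order.
def cognitive_plausibility(computed_scores, human_scores):
    rho, p_value = spearmanr(computed_scores, human_scores)
    return rho, p_value
```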
Additionally, similarity depends on context in many ways. Most existing measures, however, reduce the role of context to the selection or similarity-function steps of the framework. Advanced theories should take contextual information into account to alter these functions, the alignment of descriptors, and the computational representations of the compared entities and concepts [40, 53]. One promising direction for future research is to investigate whether and to what degree context can be modeled by changing the alignment process; this would also lead to interesting insights about the graded structure of ad-hoc categories [8, 30].
Moreover, the application of similarity measures is not restricted to information re-
trieval. Using them for complex data mining, clustering, handling of uncertainty in on-
tology engineering, and so forth requires more work on visualization methods as well as
integration with spatial analysis tools. Semantic variograms [3], parallel coordinate plots,
or radar charts may be interesting starting points in this respect.
Finally, while we provided a framework for understanding the semantics of similarity and for articulating the differences between existing measures, a formal apparatus for quantifying these differences and for translating between similarity values obtained by existing theories is still missing. While work on category theory may be a promising direction for further research, the key problem that remains is the heterogeneity of the approaches used and their application areas, together with the difference between idealized measures and human cognition (the triangle inequality discussed in Section 3.8 is just one example). For the same reason, we cannot argue that our framework is necessary and sufficient for all potential similarity measures.
Acknowledgments
We are thankful to our colleagues from the Münster Semantic Interoperability Lab (MUSIL), Benjamin Adams, and the three anonymous reviewers for their input to improve the quality and clarity of this article.
References
[1] ADAMS,B.,AND RAUBAL, M. A metric conceptual space algebra. In Conference on
Spatial Information Theory (COSIT) (2009), K. S. Hornsby, C. Claramunt, M. Denis, and
G. Ligozat, Eds., vol. 5756 of Lecture Notes in Computer Science, Springer, pp. 51–68.
doi:10.1007/978-3-642-03832-7 4.
[2] AHLQVIST, O. Using uncertain conceptual spaces to translate between land cover
categories. International Journal of Geographical Information Science 19, 7 (2005), 831–857.
doi:10.1080/13658810500106729.
[3] AHLQVIST,O.,AND SHORTRIDGE, A. Characterizing land cover structure with se-
mantic variograms. In Progress in Spatial Data Handling, 12th International Symposium on
Spatial Data Handling (2006), A. Riedl, W. Kainz, and G. Elmes, Eds., Springer, pp. 401–
415. doi:10.1007/3-540-35589-8 26.
[4] ARAÚJO, R., AND PINTO, H. S. Semilarity: Towards a model-driven approach to similarity. In International Workshop on Description Logics (DL) (2007), vol. 20, Bolzano University Press, pp. 155–162. doi:10.1.1.142.7321.
[5] ARAÚJO, R., AND PINTO, H. S. Towards semantics-based ontology similarity. In Proc. Workshop on Ontology Matching (OM), International Semantic Web Conference (ISWC) (2007), P. Shvaiko, J. Euzenat, F. Giunchiglia, and B. He, Eds. doi:10.1.1.143.1541.
[6] ASHBY, F. G., AND PERRIN, N. A. Toward a unified theory of similarity and recognition. Psychological Review 95 (1988), 124–150. doi:10.1037/0033-295X.95.1.124.
[7] BAADER,F.,AND PENALOZA, R. Axiom pinpointing in general tableaux. In Proc.
16th International Conference on Automated Reasoning with Analytic Tableaux and Related
Methods TABLEAUX (2007), N. Olivetti, Ed., vol. 4548 of Lecture Notes in Computer
Science, Springer-Verlag, pp. 11–27. doi:10.1007/978-3-540-73099-6 4.
[8] BARSALOU, L. Ad hoc categories. Memory and Cognition 11 (1983), 211–227.
[9] BARSALOU, L. Situated simulation in the human conceptual system. Language and
Cognitive Processes 5, 6 (2003), 513–562. doi:10.1080/01690960344000026.
[10] BERRY,M.,AND BROWNE,M. Understanding Search Engines: Mathematical Modeling
and Text Retrieval, 2nd ed. SIAM, 2005.
[11] BORGIDA,A.,WALSH,T.,AND HIRSH, H. Towards measuring similarity in descrip-
tion logics. In International Workshop on Description Logics (DL2005), vol. 147 of CEUR
Workshop Proceedings. CEUR, 2005.
[12] BRODARIC, B., AND GAHEGAN, M. Experiments to examine the situated nature of geoscientific concepts. Spatial Cognition and Computation 7, 1 (2007), 61–95. doi:10.1080/13875860701337934.
[13] CROSS, V., AND SUDKAMP, T. Similarity and Compatibility in Fuzzy Set Theory: Assessments and Applications, vol. 93 of Studies in Fuzziness and Soft Computing. Physica-Verlag, 2002.
[14] CRUZ,I.,AND SUNNA, W. Structural alignment methods with applica-
tions to geospatial ontologies. Transactions in GIS 12, 6 (2008), 683–711.
doi:10.1111/j.1467-9671.2008.01126.x.
[15] D’AMATO,C.,FANIZZI,N.,AND ESPOSITO, F. A semantic similarity measure for ex-
pressive description logics. In Convegno Italiano di Logica Computazionale (CILC) (2005).
[16] D’AMATO,C.,FANIZZI,N.,AND ESPOSITO, F. A dissimilarity measure for ALC con-
cept descriptions. In Proc. ACM Symposium on Applied Computing (SAC) (2006), ACM,
pp. 1695–1699. doi:10.1145/1141277.1141677.
[17] D’AMATO,C.,FANIZZI,N.,AND ESPOSITO, F. Query answering and ontology
population: An inductive approach. In Proc. 5th European Semantic Web Confer-
ence (ESWC) (2008), S. Bechhofer, M. Hauswirth, J. Hoffmann, and M. Koubarakis,
Eds., vol. 5021 of Lecture Notes in Computer Science, Springer, pp. 288–302.
doi:10.1007/978-3-540-68234-9 23.
[18] DOMINICH,S. The Modern Algebra of Information Retrieval, 1st ed. Springer, 2008.
doi:10.1007/978-3-540-77659-8.
[19] DUBOIS,D.,AND PRADE, H. A unifying view of comparison indices in a fuzzy set-
theoretic framework. In Recent Development in Fuzzy Set and Possibility Theory,R.Yager,
Ed. Pergamon Press, 1982, pp. 3–13.
[20] EGENHOFER, M. Toward the semantic geospatial web. In Proc. 10th ACM Interna-
tional Symposium on Advances in Geographic Information Systems (2002), ACM, pp. 1–4.
doi:10.1145/585147.585148.
[21] EUZENAT, J. Semantic precision and recall for ontology alignment evaluation. In Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI) (2007), pp. 348–353.
[22] FALKENHAINER, B., FORBUS, K., AND GENTNER, D. The structure-mapping engine: Algorithm and examples. Artificial Intelligence 41 (1989), 1–63. doi:10.1016/0004-3702(89)90077-5.
[23] FRANK, A. U. Similarity measures for semantics: What is observed? In COSIT’07
Workshop on Semantic Similarity Measurement and Geospatial Applications (2007).
[24] GAHEGAN,M.,AGRAWAL,R.,JAISWAL,A.R.,LUO,J.,AND SOON,K.-H. A
platform for visualizing and experimenting with measures of semantic similar-
ity in ontologies and concept maps. Transactions in GIS 12, 6 (2008), 713–732.
doi:10.1111/j.1467-9671.2008.01124.x.
[25] GAHEGAN, M., AND BRODARIC, B. Examining uncertainty in the definition and meaning of geographical categories. In Proc. 5th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences (2002), G. J. Hunter and K. Lowell, Eds. doi:10.1.1.61.9168.
[26] GÄRDENFORS, P. Conceptual Spaces: The Geometry of Thought. Bradford Books, MIT Press, 2000.
[27] GENTNER,D.,AND FORBUS, K. D. MAC/FAC: A model of similarity-based retrieval.
In Proc. 13th Annual Conference of the Cognitive Science Society (1991), Erlbaum, pp. 504–
509. doi:10.1207/s15516709cog1902 1.
[28] GOLDSTONE, R. L. Similarity, interactive activation, and mapping. Jour-
nal of Experimental Psychology: Learning, Memory, and Cognition 20 (1994), 3–28.
doi:10.1037/0278-7393.20.1.3.
[29] GOLDSTONE,R.L.,AND MEDIN, D. Similarity, interactive activation, and mapping:
An overview. In Analogical Connections: Advances in Connectionist and Neural Computa-
tion Theory, K. Holyoak and J. Barnden, Eds., vol. 2. Ablex, 1994, pp. 321–362.
[30] GOLDSTONE,R.L.,MEDIN,D.L.,AND HALBERSTADT, J. Similarity in context. Mem-
ory and Cognition 25 (1997), 237–255.
[31] GOLDSTONE,R.L.,AND SON, J. Similarity. In Cambridge Handbook of Thinking and Rea-
soning, K. Holyoak and R. Morrison, Eds. Cambridge University Press, 2005, pp. 13–36.
doi:10.2277/0521531012.
[32] GOODMAN, N. Seven strictures on similarity. In Problems and projects. Bobbs-Merrill,
1972, pp. 437–447.
[33] GREGSON,R.Psychometrics of similarity. Academic Press, 1975.
[34] GRUBER, T. A translation approach to portable ontology specifications. Knowledge Acquisition 5, 2 (1993), 199–220. doi:10.1006/knac.1993.1008.
[35] HAHN,U.,CHATER,N.,AND RICHARDSON, L. B. Similarity as transformation. Cog-
nition 87 (2003), 1–32. doi:10.1016/S0010-0277(02)00184-1.
[36] HAYES, P. Contexts in context. In Context in Knowledge Representation and Natural
Language, AAAI Fall Symposium (1997), AAAI Press.
[37] HITZLER, P., KRÖTZSCH, M., AND RUDOLPH, S. Foundations of Semantic Web Technologies. Textbooks in Computing, Chapman and Hall/CRC Press, 2010.
[38] HOFSTADTER, D. Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books, 1999.
[39] JANOWICZ, K. Sim-DL: Towards a semantic similarity measurement theory for the description logic ALCNR in geographic information retrieval. In On the Move to Meaningful Internet Systems, Proc. OTM, Part II, R. Meersman, Z. Tari, and P. Herrero, Eds., vol. 4278 of Lecture Notes in Computer Science. Springer, 2006, pp. 1681–1692. doi:10.1007/11915072_74.
[40] JANOWICZ, K. Kinds of contexts and their impact on semantic similarity measure-
ment. In Proc. 5th IEEE Workshop on Context Modeling and Reasoning (CoMoRea),
6th IEEE International Conference on Pervasive Computing and Communication (PerCom)
(2008), IEEE Computer Society. doi:10.1109/PERCOM.2008.35.
[41] JANOWICZ, K., ADAMS, B., AND RAUBAL, M. Semantic referencing: Determining context weights for similarity measurement. In Proc. 6th International Conference on Geographic Information Science (GIScience) (2010), S. I. Fabrikant, T. Reichenbacher, M. J. van Kreveld, and C. Schlieder, Eds., vol. 6292 of Lecture Notes in Computer Science, Springer, pp. 70–84. doi:10.1007/978-3-642-15300-6_6.
[42] JANOWICZ, K., AND KESSLER, C. The role of ontology in improving gazetteer interaction. International Journal of Geographical Information Science 22, 10 (2008), 1129–1157. doi:10.1080/13658810701851461.
[43] JANOWICZ,K.,KESSLER,C.,PANOV,I.,WILKES,M.,ESPETER,M.,AND SCHWARZ,
M. A study on the cognitive plausibility of SIM-DL similarity rankings for ge-
ographic feature types. In Proc. 11th AGILE International Conference on Geographic
Information Science (AGILE) (2008), L. Bernard, A. Friis-Christensen, and H. Pundt,
Eds., Lecture Notes in Geoinformation and Cartography, Springer, pp. 115–133.
doi:10.1007/978-3-540-78946-8 7.
[44] JANOWICZ,K.,KESSLER,C.,SCHWARZ,M.,WILKES,M.,PANOV,I.,ESPETER,M.,
AND BAEUMER, B. Algorithm, implementation and application of the SIM-DL sim-
ilarity server. In Proc. Second International Conference on GeoSpatial Semantics (GeoS)
(2007), F. T. Fonseca, A. Rodriguez, and S. Levashkin, Eds., no. 4853 in Lecture Notes
in Computer Science, Springer, pp. 128–145. doi:10.1007/978-3-540-76876-0 9.
[45] JANOWICZ, K., MAUÉ, P., WILKES, M., BRAUN, M., SCHADE, S., DUPKE, S., AND KUHN, W. Similarity as a quality indicator in ontology engineering. In Proc. 5th International Conference on Formal Ontology in Information Systems (FOIS) (2008), C. Eschenbach and M. Grüninger, Eds., vol. 183, IOS Press, pp. 92–105.
[46] JANOWICZ, K., SCHADE, S., BRÖRING, A., KESSLER, C., MAUÉ, P., AND STASCH, C. Semantic enablement for spatial data infrastructures. Transactions in GIS 14, 2 (2010), 111–129. doi:10.1111/j.1467-9671.2010.01186.x.
[47] JANOWICZ,K.,SCHWARZ,M.,AND WILKES, M. Implementation and evaluation of
a semantics-based user interface for web gazetteers. In Workshop on Visual Interfaces to
the Social and the Semantic Web (VISSW) (2009).
[48] JANOWICZ, K., AND WILKES, M. SIM-DLA: A novel semantic similarity measure for description logics reducing inter-concept to inter-instance similarity. In Proc. 6th Annual European Semantic Web Conference (ESWC) (2009), L. Aroyo, P. Traverso, F. Ciravegna, P. Cimiano, T. Heath, E. Hyvoenen, R. Mizoguchi, E. Oren, M. Sabou, and E. P. B. Simperl, Eds., vol. 5554 of Lecture Notes in Computer Science, Springer, pp. 353–367. doi:10.1007/978-3-642-02121-3_28.
[49] JANOWICZ,K.,WILKES,M.,AND LUTZ, M. Similarity-based information re-
trieval and its role within spatial data infrastructures. In Proc. 5th International
Conference on Geographic Information Science (GIScience) (2008), Springer, pp. 151–167.
doi:10.1007/978-3-540-87473-7 10.
[50] JONES,C.B.,AND PURVES, R. S. Geographical information retrieval. In-
ternational Journal of Geographical Information Science 22, 3 (2008), 219–228.
doi:10.1080/13658810701626343.
[51] JURISICA, I. Dkbs-tr-94-5: Context-based similarity applied to retrieval of relevant
cases. Tech. rep., University of Toronto, Department of Computer Science, Toronto,
1994.
[52] KESSLER, C. What's the difference? A cognitive dissimilarity measure for information retrieval result sets. Knowledge and Information Systems (2011; accepted for publication).
[53] KESSLER,C.,RAUBAL,M.,AND JANOWICZ, K. The effect of context on semantic
similarity measurement. In On the Move to Meaningful Internet Systems, Proc. OTM
Part II (2007), R. Meersman, Z. Tari, and P. Herrero, Eds., no. 4806 in Lecture Notes in
Computer Science, Springer, pp. 1274–1284. doi:10.1007/978-3-540-76890-6 55.
[54] KLIPPEL, A., LI, R., HARDISTY, F., AND WEAVER, C. Cognitive invariants of geographic event conceptualization: What matters and what refines. In Proc. 6th International Conference on Geographic Information Science (GIScience) (2010), S. I. Fabrikant, T. Reichenbacher, M. van Kreveld, and C. Schlieder, Eds., LNCS, Springer, pp. 130–144. doi:10.1007/978-3-642-15300-6_10.
[55] KLIPPEL,A.,WORBOYS,M.,AND DUCKHAM, M. Identifying factors of geographic
event conceptualisation. International Journal of Geographical Information Science, 22(2)
(2008), 183–204. doi:10.1080/13658810701405607.
[56] KRUMHANSL, C. L. Concerning the applicability of geometric models to similarity
data: the interrelationship between similarity and spatial density. Psychological Review
85 (1978), 445–463. doi:10.1037/0033-295X.85.5.445.
[57] KÜSTERS, R. Non-Standard Inferences in Description Logics, vol. 2100 of Lecture Notes in Artificial Intelligence. Springer, 2001. doi:10.1007/3-540-44613-3.
[58] LARKEY,L.,AND MARKMAN, A. Processes of similarity judgment. Cognitive Science
29, 6 (2005), 1061–1076. doi:10.1207/s15516709cog0000 30.
[59] LEW,M.,SEBE,N.,DJERABA,C.,AND JAIN, R. Content-based multimedia informa-
tion retrieval: State of the art and challenges. ACM Transactions on Multimedia Comput-
ing, Communications and Applications 2, 1 (2006), 1–19. doi:10.1145/1126004.1126005.
[60] LI, B., AND FONSECA, F. TDD: A comprehensive model for qualitative spatial similarity assessment. Spatial Cognition and Computation 6, 1 (2006), 31–62. doi:10.1207/s15427633scc0601_2.
[61] LIN, D. An information-theoretic definition of similarity. In Proc. 15th International Conference on Machine Learning (1998), Morgan Kaufmann, pp. 296–304.
[62] LUTZ,M.,AND KLIEN, E. Ontology-based retrieval of geographic informa-
tion. International Journal of Geographical Information Science 20, 3 (2006), 233–260.
doi:10.1080/13658810500287107.
[63] MARK,D.,TURK,A.,AND STEA, D. Does the semantic similarity of geospatial entity
types vary across languages and cultures? In Workshop on Semantic Similarity Measure-
ment and Geospatial Applications, COSIT 2007 (2007).
[64] MARKMAN, A. B. Similarity and Categorization. Oxford University Press, 2001, ch. Structural alignment, similarity, and the internal structure of category representations, pp. 109–130.
[65] MARKMAN,A.B.,AND GENTNER, D. Structural alignment during similarity com-
parisons. Cognitive Psychology 25, 4 (1993), 431–467. doi:10.1006/cogp.1993.1011.
[66] MARKMAN,A.B.,AND GENTNER, D. Structure mapping in the comparison process.
American Journal of Psychology 113 (2000), 501–538. doi:10.2307/1423470.
[67] MARKMAN, A. B., AND STILWELL, C. Role-governed categories. Journal of Experimental and Theoretical Artificial Intelligence 13, 4 (2001), 329–358. doi:10.1080/09528130110100252.
[68] MATYAS,C.,AND SCHLIEDER, C. A spatial user similarity measure for geographic
recommender systems. In Proc. Third International Conference on GeoSpatial Semantics
(GeoS) (2009; forthcoming), K. Janowicz, M. Raubal, and S. Levashkin, Eds., vol. 5892
of Lecture Notes in Computer Science, Springer. doi:10.1007/978-3-642-10436-7 8.
[69] MEDIN,D.,GOLDSTONE,R.,AND GENTNER, D. Respects for similarity. Psychological
Review 100, 2 (1993), 254–278. doi:10.1037/0033-295X.100.2.254.
[70] MÖLLER, R., HAARSLEV, V., AND NEUMANN, B. Semantics-based information retrieval. In Proc. International Conference on Information Technology and Knowledge Systems (IT&KNOWS-98) (1998), pp. 49–56.
[71] MONTELLO, D., GOODCHILD, M., GOTTSEGEN, J., AND FOHL, P. Where's downtown? Behavioral methods for determining referents of vague spatial queries. Spatial Cognition and Computation 3, 2 (2003), 185–204. doi:10.1207/S15427633SCC032&3_06.
[72] NEDAS,K.,AND EGENHOFER, M. Spatial similarity queries with logical operators.
In Proc. Eighth International Symposium on Spatial and Temporal Databases, T. Hadzilacos,
Y. Manolopoulos, J. Roddick, and Y. Theodoridis, Eds., vol. 2750 of Lecture Notes in
Computer Science. 2003, pp. 430–448. doi:10.1007/978-3-540-45072-6 25.
[73] NEDAS,K.,AND EGENHOFER, M. Spatial-scene similarity queries. Transactions in GIS
12, 6 (2008), 661–681. doi:10.1111/j.1467-9671.2008.01127.x.
[74] NOSOFSKY, R. M. Stimulus bias, asymmetric similarity, and classification. Cognitive Psychology 23, 1 (1991), 94–140. doi:10.1016/0010-0285(91)90004-8.
[75] OSGOOD, C. E., SUCI, G. J., AND TANNENBAUM, P. H. The Measurement of Meaning. University of Illinois Press, 1967.
[76] RADA,R.,MILI,H.,BICKNELL,E.,AND BLETTNER, M. Development and application
of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics 19
(1989), 17–30. doi:10.1109/21.24528.
[77] RAUBAL, M. Formalizing conceptual spaces. In Proc. Third International Conference on Formal Ontology in Information Systems (FOIS), A. Varzi and L. Vieu, Eds., vol. 114 of Frontiers in Artificial Intelligence and Applications. IOS Press, 2004, pp. 153–164.
[78] RAUBAL, M. Mappings for cognitive semantic interoperability. In Proc. 8th AGILE
Conference on Geographic Information Science (AGILE) (2005), F. Toppen and M. Painho,
Eds., pp. 291–296.
[79] RAUBAL, M. Representing concepts in time. In Spatial Cognition (2008), C. Freksa, N. S. Newcombe, P. Gärdenfors, and S. Wölfl, Eds., vol. 5248 of Lecture Notes in Computer Science, Springer, pp. 328–343. doi:10.1007/978-3-540-87601-4_24.
[80] RISSLAND, E. L. AI and similarity. IEEE Intelligent Systems 21, 3 (2006), 39–49. doi:10.1109/MIS.2006.38.
[81] RODRÍGUEZ, A., AND EGENHOFER, M. Comparing geospatial entity classes: An asymmetric and context-dependent similarity measure. International Journal of Geographical Information Science 18, 3 (2004), 229–256. doi:10.1080/13658810310001629592.
[82] SCHWERING, A. Approaches to semantic similarity measurement for
geo-spatial data—a survey. Transactions in GIS 12, 1 (2008), 5–29.
doi:10.1111/j.1467-9671.2008.01084.x.
[83] SCHWERING, A., AND RAUBAL, M. Spatial relations for semantic similarity measurement. In Perspectives in Conceptual Modeling: ER 2005 Workshops CAOIS, BP-UML, CoMoGIS, eCOMO, and QoIS, J. Akoka, S. Liddle, I.-Y. Song, M. Bertolotto, I. Comyn-Wattiau, W.-J. van den Heuvel, M. Kolp, J. Trujillo, C. Kop, and H. Mayr, Eds., vol. 3770 of Lecture Notes in Computer Science. Springer, 2005, pp. 259–269. doi:10.1007/11568346_28.
[84] SHVAIKO,P.,AND EUZENAT, J. Ten challenges for ontology matching. In
Proc. On the Move to Meaningful Internet Systems (OTM) (2008), R. Meersman and
Z. Tari, Eds., vol. 5332 of Lecture Notes in Computer Science, Springer, pp. 1164–1182.
doi:10.1007/978-3-540-88873-4 18.
[85] SMITH, L. B. Similarity and analogy. Cambridge University Press, 1989, ch. From global similarities to kinds of similarities: The construction of dimensions in development, pp. 146–178.
[86] SUNNA, W., AND CRUZ, I. Using the AgreementMaker to align ontologies for the OAEI campaign 2007. In Proc. Second International Workshop on Ontology Matching, 6th International Semantic Web Conference (ISWC) (2007).
[87] TAN, P.-N., STEINBACH, M., AND KUMAR, V. Introduction to Data Mining. Addison Wesley, 2005.
[88] TVERSKY, A. Features of similarity. Psychological Review 84, 4 (1977), 327–352.
doi:10.1037/0033-295X.84.4.327.
[89] TVERSKY,A.,AND GATI, I. Similarity, separability, and the triangle inequality. Psy-
chological Review 89(2) (1982), 123–154. doi:10.1037/0033-295X.89.2.123.
[90] WILKES,M.,AND JANOWICZ, K. A graph-based alignment approach to similarity
between climbing routes. In Proc. First International Workshop on Information Semantics
and its Implications for Geographic Analysis (ISGA) (2008).
[91] YEH,W.,AND BARSALOU, L. The situated nature of concepts. American Journal of
Psychology 119 (2006), 349–384. doi:10.2307/20445349.