HopRank: How Semantic Structure Influences Teleportation in PageRank (A Case Study on BioPortal)

Authors:
  • Complexity Science Hub Vienna (CSH)

Abstract

This paper introduces HopRank, an algorithm for modeling human navigation on semantic networks. HopRank leverages the assumption that users know or can see the whole structure of the network. Therefore, besides following links, they also follow nodes at certain distances (i.e., k-hop neighborhoods), and not at random as suggested by PageRank, which assumes only links are known or visible. We observe such preference towards k-hop neighborhoods on BioPortal, one of the leading repositories of biomedical ontologies on the Web. In general, users navigate within the vicinity of a concept. But they also "jump" to distant concepts less frequently. We fit our model on 11 ontologies using the transition matrix of clickstreams, and show that semantic structure can influence teleportation in PageRank. This suggests that users--to some extent--utilize knowledge about the underlying structure of ontologies, and leverage it to reach certain pieces of information. Our results help the development and improvement of user interfaces for ontology exploration.
Lisette Espín-Noboa
GESIS & Uni. Koblenz-Landau
lisette.espin@gesis.org
Florian Lemmerich
RWTH Aachen University
florian.lemmerich@cssh.rwth-aachen.de
Simon Walk
Detego GmbH
s.walk@detego.com
Markus Strohmaier
RWTH Aachen University & GESIS
markus.strohmaier@cssh.rwth-aachen.de
Mark Musen
BMIR-Stanford
musen@stanford.edu
CCS CONCEPTS
• Information systems → Content ranking; Browsers; • Mathematics of computing → Exploratory data analysis.
KEYWORDS
Biased random walker; PageRank; k-hop neighborhood; BioPortal
ACM Reference Format:
Lisette Espín-Noboa, Florian Lemmerich, Simon Walk, Markus Strohmaier, and Mark Musen. 2019. HopRank: How Semantic Structure Influences Teleportation in PageRank (A Case Study on BioPortal). In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3308558.3313487
1 INTRODUCTION
Ontology Engineering and Ontology Learning are two branches of the Semantic Web whose aim is to accurately build and curate ontologies. The former studies new techniques to improve collaboration among humans while editing ontologies [26, 29], and the latter introduces new methodologies and algorithms to automatically create ontologies by crawling the Web [5, 25]. These efforts represent significant advances in the development of knowledge bases, which represent facts about the real world (e.g., people, diseases). However, there is little knowledge about how users consume such ontologies on the Web. To this end, Walk et al. studied how users browse BioPortal [28]. Their findings suggest that some ontologies influence the way users interact with the website. However, how users navigate through the ontology structure (i.e., from one concept to another) remains unclear.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’19, May 13–17, 2019, San Francisco, CA, USA
© 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-6674-8/19/05.
https://doi.org/10.1145/3308558.3313487
Problem Statement: In this paper, we study the influence of semantic structure on teleportation (i.e., jumping to any node chosen at random) in PageRank. For example, consider the ontology shown in Figure 1(a), where nodes represent classes (a.k.a. concepts) and edges represent isASubClassOf relationships. On BioPortal, ontologies are shown vertically as hierarchical trees, and concepts can be explored using the expand-on-demand principle. This means that only top-level concepts are shown first, and then users are able to expand and collapse as many concepts as they need at any level of the ontology. In other words, users can use, and therefore are potentially aware of, a virtually fully connected network in all stages of navigation. Previous studies [21, 31] have modeled user navigation using PageRank. However, these assume that navigation paths are constrained by links and random teleportation. In our scenario, where the whole structure of an ontology can be visualized at any time, we believe that teleportation is not fully random, but rather biased towards k-hop neighborhoods.
Approach: Motivated by previous studies on information foraging [8–10, 22], decentralized search [16, 18], and PageRank [4, 14, 15, 21, 31], we propose HopRank, a method for modeling transitions across k-hop neighborhoods on semantic networks. The key idea of this work relies on the HopPortation vector $\vec{\beta}$, which defines the probabilities of transitioning to each k-hop neighborhood. From the PageRank point of view, we can say that teleportation is not fully random, and the probability of following the structure of a page is not based only on one parameter (i.e., probability of following links), but on $k$ parameters, representing all k-hop neighborhoods reached from the current page. Technically, we pass the HopPortation vector to a random walker to make biased decisions on which neighborhood to go to next. Once this decision is made, the random walker uniformly chooses a concept within that neighborhood.
arXiv:1903.05704v2 [cs.SI] 15 Mar 2019
[Figure 1(a): network of concepts a-g; transition labels: 1 [1-hop], 100 [2-hop], 15 [4-hop]]
(a) Transitions

              β0     β1     β2     β3     β4
Transitions   0      1      100    0      15
Smoothing     1      2      101    1      16
Normalized    0.008  0.017  0.835  0.008  0.132
(b) HopPortation Vector

     a      b      c      d      e      f      g
a    0.001  0.011  0.011  0.244  0.244  0.244  0.244
b    0.008  0.001  0.963  0.008  0.008  0.006  0.006
c    0.008  0.963  0.001  0.006  0.006  0.008  0.008
d    0.419  0.018  0.009  0.001  0.419  0.067  0.067
e    0.419  0.018  0.009  0.419  0.001  0.067  0.067
f    0.419  0.009  0.018  0.067  0.067  0.001  0.419
g    0.419  0.009  0.018  0.067  0.067  0.419  0.001
(c) Transition Probabilities
Figure 1: HopRank on semantic networks. This example illustrates an instance of navigation on an ontology. (a) Shows the underlying network composed of seven concepts (a-g) and six isASubClassOf relationships (straight, thin grey arrows). Transitions (curved, thick black arrows) are labeled by the actual number of transitions between concepts, as well as the [k-hop] distance (i.e., shortest path) between them. (b) Illustrates how the HopPortation vector $\vec{\beta}$ is built using transition counts per k-hop. (c) Shows the transition probabilities inferred by HopRank, see Equation (1).
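The construction in Figure 1(b) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `hopportation_vector` is hypothetical, and the only steps assumed are the add-one smoothing and the normalization shown in the figure.

```python
def hopportation_vector(hop_counts):
    """Add-one smoothing followed by normalization, as in Figure 1(b).

    hop_counts[k] is the number of observed transitions at hop distance k;
    index 0 is the noise term beta_0.
    """
    smoothed = [count + 1 for count in hop_counts]
    total = sum(smoothed)
    return [value / total for value in smoothed]


# Transition counts for beta_0 ... beta_4, taken from Figure 1(b).
beta = hopportation_vector([0, 1, 100, 0, 15])
print([round(b, 3) for b in beta])  # [0.008, 0.017, 0.835, 0.008, 0.132]
```

Smoothing keeps every hop distance reachable with a small probability, even when no transition at that distance was observed.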
Contributions: The contributions of this paper are:
(1) We empirically show how users leverage the structure of the ontologies on BioPortal by quantifying the proportion of transitions per k-hop neighborhood.
(2) We propose HopRank, an algorithm for modeling human navigation on semantic networks.
(3) We demonstrate that HopRank outperforms traditional navigation and popularity-based models on BioPortal, especially when users browse ontologies directly without search.
(4) We make an implementation of this approach openly available on the Web [11].
2 RELATED WORK
BioPortal provides users with a tree-like explorer and a local search engine to navigate ontologies. In addition, concepts can be expanded on demand to see their children nodes. Although these functionalities are exploited differently across ontologies [28], it is unclear how users navigate through the ontology structure. Thus, this section covers previous work on search and navigation on networks.
Search. Information Foraging [22] assumes that people, when possible, modify their strategies or the structure of the environment to maximize their gain of valuable information. These patterns are also found in the way humans recall information from memory [17]. Similarly, berrypicking [6], a model of online searching, states that queries are not static, but rather evolve, and users commonly gather information in pieces instead of in one large set.
Navigation. PageRank [21] is the most popular method to measure the importance of web pages based on their incoming and outgoing links. It relies on an imaginary surfer who randomly clicks on links, and occasionally jumps to any node in the network. The probability of following links is given by a damping factor. Multiple variations have been proposed for improving information retrieval systems, e.g., a biased PageRank [15] to capture the importance of a page more accurately by taking topics into account, or a weighted PageRank [31] to assign larger rank values to more popular pages (i.e., preferential attachment) instead of distributing the rank value of a page uniformly to all outgoing pages. Geigl et al. suggest that the behavior of a random surfer closely resembles that of real users, as long as they do not use search engines [13]. They also find that classical navigation structures, such as navigation hierarchies or breadcrumbs, only exercise limited influence on navigation. Experiments in [24] reveal that memory-less Markov chains represent a quite practical model for human navigation on a page level. However, this assumption is violated when the analysis is expanded to a topical level. Helic et al. identify certain configurations of decentralized search that are capable of modeling human navigation in information networks [16]. Their findings suggest that navigation on such networks is a two-phase process that combines the exploitation of the known (i.e., goal-seeking) and the exploration of the unknown (i.e., orientation).
User Interfaces. Human navigation has also been studied for enhancing interfaces. For instance, [12] explores fisheye views to display large information structures such as programs and databases. The intuition behind this paradigm is that users often explore their neighborhood, and distant major landmarks, in more detail. Similarly, Van Ham and Perer studied the search, show context, expand on demand browsing model in [27], and proposed techniques to design better graph visualization tools.
We propose HopRank, a biased random walker, to model navigation on semantic networks. HopRank builds upon insights from information foraging [17, 22], decentralized search [16, 18] and PageRank [21]. More precisely, we replace the damping factor by a HopPortation vector to encode the probabilities of visiting each k-hop neighborhood. The intuition here is that users browse semantically close terms more often than semantically distant ones.
3 BIOPORTAL
There exist a large number of ontologies in the biomedical domain. They are highly specialized and therefore expensive to develop. To enable ontology adoption and reuse, effective support for browsing and exploring existing ontologies is required. Towards that goal, the National Center for Biomedical Ontology (NCBO) [3, 19] features BioPortal [1, 20, 30], one of the leading repositories of biomedical ontologies on the Web, currently containing more than 700 ontologies with more than 9 million ontology classes. On BioPortal, practitioners and experts can access ontologies via Web services and Web browsers. The latter allow users to navigate ontologies by searching specific classes, or by directly browsing their concept hierarchies within a tree-like explorer [28].
Ontologies. We propose to model human navigation on semantic networks using the structure of the underlying ontology. On BioPortal, ontologies are defined as directed networks, where nodes represent concepts and edges represent isASubClassOf relationships. Since such edges are usually non-cyclic and have a common root, these ontologies often form trees. Table 1 shows 11 of the most visited ontologies in 2015¹. For instance, LOINC is the largest ontology, with 175K nodes, 153K edges, and 74K connected components.
Transitions. We analyzed all HTTP requests made in 2015 and extracted 336K valid sessions (i.e., after filtering out sessions with fewer than 2 requests, and requests to ontologies or concepts which do not exist). Each session contains transitions (i.e., a sequence of visited concept pages) triggered by a single user (i.e., IP address) without breaks (i.e., pauses of at least 60 minutes). For simplicity, we only consider transitions within the largest connected component (LCC) of each ontology, and discard ontologies with fewer than 1000 transitions². Overall, we found 11 ontologies and 133K transitions between their concepts³, see Table 1 for some key properties.
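The session rules described above (one IP address per session, a new session after a pause of at least 60 minutes, at least 2 requests per session) could be sketched as follows. This is a simplified illustration under assumed inputs: the tuple layout `(ip, timestamp_seconds, concept)`, the function names, and the sample IP addresses are hypothetical and do not reflect the actual BioPortal log format.

```python
from collections import defaultdict

BREAK_SECONDS = 60 * 60  # a pause of at least 60 minutes starts a new session


def sessionize(requests):
    """Group (ip, timestamp_seconds, concept) requests into per-user sessions."""
    by_ip = defaultdict(list)
    for ip, timestamp, concept in requests:
        by_ip[ip].append((timestamp, concept))

    sessions = []
    for visits in by_ip.values():
        visits.sort()  # chronological order per IP address
        current = [visits[0][1]]
        for (prev_ts, _), (ts, concept) in zip(visits, visits[1:]):
            if ts - prev_ts >= BREAK_SECONDS:
                sessions.append(current)
                current = []
            current.append(concept)
        sessions.append(current)
    # Keep only sessions with at least 2 requests.
    return [session for session in sessions if len(session) >= 2]


def transitions(session):
    """Consecutive (source, target) pairs of concepts within one session."""
    return list(zip(session, session[1:]))


# Hypothetical log entries: the third request comes after a 60-minute pause.
log = [
    ("10.0.0.1", 0, "a"),
    ("10.0.0.1", 120, "b"),
    ("10.0.0.1", 120 + 3600, "c"),
    ("10.0.0.2", 50, "d"),
]
valid_sessions = sessionize(log)
```

Here only the session `["a", "b"]` survives: the request for `c` starts a new single-request session, and the second IP address issued only one request.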
Navigation Types. Based on the HTTP request headers, we inferred 7 navigation types: Details (DE), Direct Click (DC), Direct URL (DU), Expand (EX), External Link (EL), External Search (ES), and Local Search (LS). DE: all clicks made within the Details tab of a selected concept. DC: all clicks made on concepts within the tree-like explorer. DU: all concept requests without an HTTP referrer (e.g., direct URL in the browser). EX: all clicks on the (+) symbol of a concept, which triggers the expansion of the concept to show all its children nodes. Notice that this request is called only once, even if the symbol is clicked multiple times. The opposite behavior (collapse) is not considered⁴. EL: all requests coming from external websites that are not search engines. ES: all requests coming from the top 10 most popular external search engines such as Google and Yahoo. LS: all requests made via the local search functionality of each ontology. Notice that this search is a 3-step process. First, users type a keyword, then the system shows auto-suggestions, and finally users click on one of the concepts shown in the auto-suggestion list. We only consider the final step a local search transition. ALL: includes all the above-mentioned types. Figure 2 shows the distribution of transitions across navigation types for each ontology. In general, most traffic comes from expanding a concept (EX, 44%), followed by local search (LS, 17%), direct URL (DU, 16%) and details (DE, 14%). Surprisingly, direct clicks on concepts (DC) only represent 7% of all transitions. This suggests that users spend substantial time expanding concepts before they find a concept of interest.

¹ As ontologies can be edited over time, we work with their latest snapshots from 2015.
² Transitions within the LCCs of these ontologies represent 80% of all transitions.
³ We left out the popular SNOMEDCT ontology due to computational limitations.
⁴ Collapse is a client-side functionality, and thus, it is not recorded in the log files.

Table 1: Datasets. This table illustrates network properties of 11 of the most popular ontologies on BioPortal in 2015. Ontologies represent networks whose nodes refer to concepts and edges to isASubClassOf relationships. Original numbers of nodes, edges, and connected components of ontologies are shown under N, E and cc, respectively. Properties of the largest connected component (LCC) of each ontology are shown under N’, E’, d’ and T’, where d’ refers to the diameter and T’ to the number of transitions.

#    Ontology   N        E        cc       N’       E’       d’   T’
1    CPT        13219    13235    3        13092    13110    15   44651
2    MEDDRA     66506    31863    43493    22889    31738    8    42746
3    NDFRT      35019    34504    522      32074    32080    24   22452
4    LOINC      174513   152683   73518    100871   152558   13   6349
5    ICD9CM     22534    22531    3        22407    22406    12   4434
6    WHO-ART    1852     2997     3        1725     2872     4    2811
7    MESH       165166   24182    145652   16947    21596    31   2623
8    ICD10      12446    11256    1190     11132    11131    10   2288
9    CHMO       2966     3071     3        2964     3071     22   1423
10   HL7        10319    10600    1049     9146     10475    19   1374
11   OMIM       81821    39359    44110    37587    39234    6    1291

Figure 2: Navigation Types. Each bar shows the fraction of transitions within the LCC of each ontology. Stacked bars differentiate types of navigation: details (DE, blue), direct click (DC, orange), direct URL (DU, green), expand (EX, red), external link (EL, purple), external search (ES, brown) and local search (LS, pink). Most ontologies are mainly navigated by expanding nodes within the tree-like explorer.
4 HOPRANK: A BIASED RANDOM WALKER
HopRank models human navigation on semantic networks. Imagine a random walker whose decisions on where to go next are biased towards specific k-hop neighborhoods. This bias is what we call HopPortation, which encodes the probabilities of transitioning to each k-hop neighborhood. In our model, navigation on networks can be explained as a 2-step process. First, a $k$-hop neighborhood of the current node $i$ is drawn from a categorical distribution. Second, a node $j$ is randomly chosen within that $k$-hop neighborhood. Note that this process holds only if the walker is fully or partially aware of the structure of the network (i.e., knows or can see it). Without this prerequisite, and if links are not preferred, then random jumps to random pages will be more plausible. In comparison to the classic random walker with teleportation (e.g., PageRank [21]), whose movements are constrained by the damping factor $\alpha$ (i.e., probability of following links), HopRank is constrained by a vector $\vec{\beta}$ containing $k$ different factors, which define the probabilities of going to each k-hop neighborhood from the current location.
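The 2-step process can be illustrated with a short sampling sketch. It is only an assumption-laden illustration: `hoprank_step` is a hypothetical helper, the neighborhoods of node `d` assume a 7-node binary tree with root `a`, inner nodes `b`, `c` and leaves `d`-`g`, and redrawing when a k-hop neighborhood is empty is a choice of this sketch rather than something stated in the text.

```python
import random


def hoprank_step(neighborhoods, all_nodes, beta, rng):
    """One walker step: draw a hop distance k, then a node at distance k.

    neighborhoods maps k -> nodes at shortest distance k from the current
    node; beta[0] is the noise term (random jump or self-loop).
    """
    while True:
        # Step 1: categorical draw of the hop distance.
        k = rng.choices(range(len(beta)), weights=beta)[0]
        if k == 0:
            return rng.choice(all_nodes)
        # Step 2: uniform choice within the selected k-hop neighborhood.
        candidates = neighborhoods.get(k, [])
        if candidates:
            return rng.choice(candidates)
        # Empty neighborhood: redraw (assumes some non-empty hop has weight > 0).


NODES = ["a", "b", "c", "d", "e", "f", "g"]
# Assumed k-hop neighborhoods of node "d" in the toy tree.
NEIGHBORHOODS_OF_D = {1: ["b"], 2: ["a", "e"], 3: ["c"], 4: ["f", "g"]}

rng = random.Random(42)
only_two_hops = [0.0, 0.0, 1.0, 0.0, 0.0]  # all probability mass on 2-hop
walk = [hoprank_step(NEIGHBORHOODS_OF_D, NODES, only_two_hops, rng) for _ in range(5)]
```

With all mass on the 2-hop neighborhood, every sampled target is either `a` or `e`.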
Visited k-hop Neighborhoods on BioPortal. We aggregate ALL transitions by the shortest distance between two sequentially visited nodes. This distance is referred to as k-hop neighborhood. In Figure 3(a) we see that target nodes at large distances are less likely to be visited next. This is expected since, to some extent, larger distances enclose more branches and therefore more target candidates. Note that ontologies are sorted by diameter in descending order from MESH to WHO-ART. Interestingly, users tend to hop as far as the ontology’s diameter, for $d \leq 12$. For instance, OMIM’s diameter is 6 (see Table 1), and 6 is the maximum hop done by users. Otherwise, users (roughly) hop up to two-thirds of the ontology’s diameter, for $d > 12$. For example, MESH’s diameter is 31, and the largest hop reached is 19.
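Aggregating transitions by hop distance amounts to a shortest-path lookup per transition. The sketch below uses breadth-first search on an undirected toy tree; the tree layout, the function names, and the placement of the three transitions are assumptions made for illustration only (the counts 1, 100 and 15 mirror the labels in Figure 1(a)).

```python
from collections import Counter, deque

# Assumed undirected toy tree: root a, inner nodes b and c, leaves d-g.
TREE = {
    "a": ["b", "c"], "b": ["a", "d", "e"], "c": ["a", "f", "g"],
    "d": ["b"], "e": ["b"], "f": ["c"], "g": ["c"],
}


def hop_distances(adjacency, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def transitions_per_hop(adjacency, transition_counts):
    """Sum transition counts {(source, target): n} by hop distance."""
    per_hop = Counter()
    for (source, target), count in transition_counts.items():
        per_hop[hop_distances(adjacency, source)[target]] += count
    return per_hop


# Hypothetical placement of the transitions on the tree.
counts = {("a", "b"): 1, ("b", "c"): 100, ("d", "f"): 15}
per_hop = transitions_per_hop(TREE, counts)
```

For large ontologies one would cache the distances per source node instead of rerunning the search for every transition.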
Transitions per k-hop Neighborhood on BioPortal. Figure 3(b) shows the average percentage of transitions across k-hop neighborhoods per navigation type. We see that users on average (ALL, grey) prefer to navigate through 2-hop (41%) and 1-hop (23%) neighbors, in particular when navigation is triggered by direct clicks (DC, orange) and expand (EX, red). Notice their fast decay when $k_{hop} > 8$. Other types of navigation such as external link (EL, purple) and direct URL (DU, green), which do not leverage the tree-like explorer, tend to reach concepts at larger distances more frequently. Notice their peaks at $k_{hop} = \{5, 11\}$, respectively. Interestingly, when users opt for external search (ES, brown), they often click on 2-hop concepts, but also on 12-hop and 15-hop neighbors. Intuitively, the details tab (DE, blue) helps users to click on nearby concepts at $k_{hop} \leq 2$ more often than local search (LS, pink), which is more likely to reach concepts at $k_{hop} \geq 2$.
5 MODELS OF HUMAN TRANSITIONS
In this section, we formally introduce our HopRank model, and recap popular navigation models for comparison. We denote the transition probabilities and the number of parameters according to HopRank and 7 other models that we will use later on for model selection.
We formally represent an ontology⁵ as a graph $G = (V, E)$, with $V = (v_1, \ldots, v_N)$ being a set of $N$ nodes, and $E = \{(v_i, v_j)\} \subseteq V \times V$ a set of undirected edges⁶. The ontology structure is captured by the adjacency matrix $A_{N \times N} = a_{ij}$, where $a_{ij}$ is 1 if the link exists, 0 otherwise. Transitions are represented by the transition matrix $T_{N \times N} = t_{ij}$, where $t_{ij}$ represents the number of transitions between source node $i$ and target node $j$.
HopRank. Given the HopPortation vector $\vec{\beta}$, the probability of reaching a k-hop neighborhood is denoted by factor $\beta_k \in \vec{\beta}$. $\mathbf{M}_k$, the stochastic $k$-hop matrix, describes all nodes $j$ with a shortest distance $k$ from $i$. HopRank uniformly distributes $\beta_k$ across all nodes $j$ at distance $k$. The limits of k-hop neighborhoods go from 1 (direct edges) to $d$, the diameter of the ontology $G$. Noise $\beta_0 = 1 - \sum_{k=1}^{d} \beta_k$ is added to allow for random jumps and self-loops. Figure 1(b) illustrates how the HopPortation vector is computed from the transition counts. Number of model parameters: $d + 1$.

$$P_{HR} = \beta_1 \mathbf{M}_1 + \beta_2 \mathbf{M}_2 + \cdots + \beta_d \mathbf{M}_d + \frac{\beta_0}{N} \tag{1}$$
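As a worked example of Equation (1), the sketch below builds the HopRank transition matrix for a 7-node binary tree and the smoothed counts of Figure 1(b). Two points are assumptions of this sketch rather than statements from the text: the exact parent/child layout of the tree, and the final row renormalization, which is applied because nodes without, say, 4-hop neighbors would otherwise have rows summing to less than 1. Under these assumptions the rounded rows agree with Figure 1(c).

```python
from collections import deque

# Assumed layout of the binary tree: a is the root, b and c are its
# children, d and e hang under b, f and g under c (undirected edges).
TREE = {"a": "bc", "b": "ade", "c": "afg", "d": "b", "e": "b", "f": "c", "g": "c"}
NODES = sorted(TREE)


def hop_distances(adjacency, source):
    """Breadth-first search: hop distance from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def hoprank_matrix(adjacency, beta):
    """Row-stochastic HopRank matrix; beta[0] is noise, beta[k] the k-hop factor."""
    nodes = sorted(adjacency)
    noise = beta[0] / len(nodes)  # beta_0 / N, added to every entry
    matrix = []
    for i in nodes:
        dist = hop_distances(adjacency, i)
        size = {}  # number of nodes in each k-hop neighborhood of i
        for j in nodes:
            if j != i:
                size[dist[j]] = size.get(dist[j], 0) + 1
        row = [noise + (beta[dist[j]] / size[dist[j]] if j != i else 0.0) for j in nodes]
        total = sum(row)  # below 1 if some k-hop neighborhoods are empty
        matrix.append([p / total for p in row])
    return matrix


smoothed = [1, 2, 101, 1, 16]  # smoothed counts from Figure 1(b)
beta = [count / sum(smoothed) for count in smoothed]
P = hoprank_matrix(TREE, beta)
for name, row in zip(NODES, P):
    print(name, [round(p, 3) for p in row])
```

The diagonal only receives the noise share, which is why self-loops end up with a probability of about 0.001 in every row.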
Preferential Attachment (PA). Let the degree matrix be $D_{N \times N} = d_{ij} = d_j$, where $d_j$ represents the degree of the target node $j$. The probability of moving from $i$ to $j$ is proportional to the degree of $j$. Number of model parameters: 0.

$$P_{PA} = \mathbf{D} \tag{2}$$

⁵ We focus on its largest connected component (LCC).
⁶ Directionality of edges is omitted to calculate shortest paths between all pairs of nodes.

(a) % of dyads traversed per ontology
(b) Mean % of transitions per navigation type
Figure 3: Popularity of k-hops. (a) Shows the percentage of dyads that are traversed per k-hop neighborhood. Lines represent ontologies and are sorted by their LCC diameter, in descending order from MESH (dark blue) to WHO-ART (dark red). (b) Shows the distribution of transitions across k-hop neighborhoods per navigation type. Percentages are averages across ontologies, and error bars show the respective standard deviation. While several k-hop distances are being traversed non-uniformly, most transitions happen across nearby nodes, especially when browsing (DE, DC, EX, ES) 2-hop neighbors. In contrast, non-browsing types (EL, LS, DU) tend to reach more distant nodes more frequently.
Gravitational (Gr). Let $S_{N \times N} = (sp(i,j) + \epsilon)^2$, where $sp(i,j)$ denotes the shortest path between nodes $i$ and $j$. The probability of navigating from $i$ to $j$ is proportional to the degree of node $j$ and inversely proportional to the squared distance between $i$ and $j$. We add a smoothing factor $\epsilon$ to avoid overflows when dyads are disconnected. In such cases, we set $\epsilon$ to the diameter $d$ of $G$ plus 1, to consider these jumps with a very low probability. Similarly, we set the diagonal (i.e., self-loops) to $\epsilon = d + 2$. Number of model parameters: 0.

$$P_{Gr} = \frac{\mathbf{D}}{S} \tag{3}$$
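The two popularity-based baselines, Equations (2) and (3), can be sketched on a small tree as follows. The tree layout and function names are hypothetical, and the handling of $\epsilon$ is an interpretation: this sketch assumes $\epsilon = 0$ for connected dyads, treats the distance as 0 for self-loops and disconnected dyads, and row-normalizes both matrices so that each row sums to 1.

```python
from collections import deque

# Assumed undirected toy tree: root a, inner nodes b and c, leaves d-g.
TREE = {"a": "bc", "b": "ade", "c": "afg", "d": "b", "e": "b", "f": "c", "g": "c"}


def hop_distances(adjacency, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def row_normalize(matrix):
    """Scale every row so that it sums to 1 (right stochastic matrix)."""
    return [[value / sum(row) for value in row] for row in matrix]


def preferential_attachment(adjacency):
    """P_PA: every row is proportional to the degrees of the target nodes."""
    nodes = sorted(adjacency)
    degrees = [len(adjacency[j]) for j in nodes]
    return row_normalize([list(degrees) for _ in nodes])


def gravitational(adjacency, diameter):
    """P_Gr: target degree divided by the squared (smoothed) distance."""
    nodes = sorted(adjacency)
    matrix = []
    for i in nodes:
        dist = hop_distances(adjacency, i)
        row = []
        for j in nodes:
            if i == j:
                s = (diameter + 2) ** 2  # self-loop: epsilon = d + 2
            elif j in dist:
                s = dist[j] ** 2         # connected dyad: epsilon assumed 0
            else:
                s = (diameter + 1) ** 2  # disconnected dyad: epsilon = d + 1
            row.append(len(adjacency[j]) / s)
        matrix.append(row)
    return row_normalize(matrix)


P_pa = preferential_attachment(TREE)
P_gr = gravitational(TREE, diameter=4)
```

From leaf `d`, the gravitational model favors its parent `b`, which is both the closest node and one of the two nodes with the highest degree.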
Random Walker (RW). Given the damping factor $\alpha$ (i.e., probability of following links), the probability of visiting a node $j$ is proportional to $\alpha$ divided by the degree of the source node $i$, plus a random choice equally distributed among all nodes. Depending on the $\alpha$ value, a random walker can model four different behaviors: (i) $\alpha = 0.0$: random jumps only, (ii) $\alpha \approx 1.0$: navigation over links only, (iii) $\alpha = 0.85$: PageRank using the commonly used damping factor for navigating the Web [7], and (iv) the empirical PageRank, which learns the parameter $\alpha$ from the transition data. Number of model parameters: 1 if empirical, 0 otherwise.

$$P_{PR} = \alpha \mathbf{A} + \frac{(1 - \alpha)}{N} \tag{4}$$
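Equation (4) can be written out directly for a small graph. The sketch below is illustrative only: the toy tree and the function name `random_walker` are assumptions, and every node is assumed to have at least one neighbor, which holds inside a connected component with more than one node.

```python
# Assumed undirected toy tree: root a, inner nodes b and c, leaves d-g.
TREE = {"a": "bc", "b": "ade", "c": "afg", "d": "b", "e": "b", "f": "c", "g": "c"}


def random_walker(adjacency, alpha):
    """P_PR = alpha * A (row-normalized adjacency) + (1 - alpha) / N."""
    nodes = sorted(adjacency)
    n = len(nodes)
    matrix = []
    for i in nodes:
        neighbors = set(adjacency[i])
        row = []
        for j in nodes:
            follow_link = 1.0 / len(neighbors) if j in neighbors else 0.0
            row.append(alpha * follow_link + (1.0 - alpha) / n)
        matrix.append(row)
    return matrix


P_jumps = random_walker(TREE, alpha=0.0)      # random jumps only
P_pagerank = random_walker(TREE, alpha=0.85)  # commonly used damping factor
```

With `alpha=0.0` every entry equals 1/N, while with `alpha=0.85` most of the probability mass of a row sits on the direct neighbors of the source node.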
Markov Chain (MC). We assume that moving to the next node follows a Markov process. Therefore, the probability of moving to a node $j$ only depends on the current node $i$. These probabilities represent the maximum likelihood, learned from the transition matrix $T$. Thus, the probability of visiting node $j$ from node $i$ is proportional to the number of transitions $t_{ij}$. Number of model parameters: $N \times (N - 2)$.

$$P_{MC} = \mathbf{T} \tag{5}$$

Note that $\mathbf{M}$, $\mathbf{A}$, and all $P$ from Equations (1) to (5) are right stochastic matrices (i.e., each row must sum to 1).
6 EXPERIMENTS
In this section, we compare the performance of HopRank to the
baselines on synthetic and real-world networks.
6.1 Model Selection
For comparing the models, we employ the Bayesian Information Criterion (BIC) [23] to select the best model, i.e., the one with the lowest BIC score. BIC evaluates log-likelihoods $LL$ (i.e., how likely our transitions are for a given model) and takes into account the number of model parameters and observations (i.e., number of transitions) to avoid overfitting.

$$BIC = -2 \cdot LL + n_{params} \cdot \log(n_{observations}), \tag{6}$$

$$LL = \sum_{i=1}^{N} \sum_{j=1}^{N} t_{ij} \cdot \log(p_{ij}), \tag{7}$$

where $t_{ij}$ represents the actual number of transitions from node $i$ to node $j$, and $p_{ij}$ the probability of transitioning from node $i$ to node $j$ for a given model.
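Equations (6) and (7) translate into a few lines of code. The sketch below compares a uniform random-jump model (0 parameters) with the Markov chain maximum-likelihood model on a made-up 3×3 transition matrix; the counts and helper names are hypothetical, and dyads with zero transitions are skipped because they contribute nothing to the log-likelihood.

```python
import math


def log_likelihood(T, P):
    """LL = sum_ij t_ij * log(p_ij); dyads with t_ij = 0 contribute nothing."""
    return sum(
        t * math.log(p)
        for t_row, p_row in zip(T, P)
        for t, p in zip(t_row, p_row)
        if t > 0
    )


def bic(T, P, n_params):
    """BIC = -2 * LL + n_params * log(n_observations); lower is better."""
    n_observations = sum(sum(row) for row in T)
    return -2 * log_likelihood(T, P) + n_params * math.log(n_observations)


# Hypothetical transition counts between 3 nodes (30 transitions in total).
T = [[0, 8, 2], [5, 0, 5], [1, 9, 0]]
N = len(T)

P_uniform = [[1 / N] * N for _ in range(N)]             # random jumps only
P_markov = [[t / sum(row) for t in row] for row in T]   # maximum likelihood

bic_uniform = bic(T, P_uniform, n_params=0)
bic_markov = bic(T, P_markov, n_params=N * (N - 2))
```

In this toy case the Markov chain wins despite its parameter penalty, because its likelihood gain over uniform jumps is large relative to the penalty.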
6.2 Synthetic Network
Setup. The underlying network (structure) is a binary tree composed of $N = 7$ nodes and $|E| = 6$ edges, as shown in Figure 1(a). Transitions (curved, thick edges) are biased towards 2-hop and 4-hop neighborhoods. These biases are reflected in the HopPortation vector shown in Figure 1(b).
Results. Probabilities inferred using Equation (1) are depicted in Figure 1(c). Figure 4 (left) shows the number of parameters inferred by each model. While the Markov chain model (MC) requires 35 parameters, HopRank only needs 5. The empirical PageRank (RW E.) learned a damping factor of $\alpha = 0.01$. This means that users follow links with a probability of only 1%. In Figure 4 (right) we see the comparison of models using BIC scores. In this synthetic network, transitions are best described by the Markov chain model because model parameters (i.e., maximum likelihood) are proportional to the actual transition counts per dyad, and the data structure is very small⁷. In spite of that, HopRank is the second best model and describes navigation better than random (RW 0.0).
⁷ Therefore, the number of parameters does not play a very important role in BIC.
Figure 4: Results on Synthetic Network from Figure 1. The x-axis shows the models of interest. (a) Number of parameters inferred by each model. (b) BIC: the lower the score, the better the model explains the data. In this example, navigation is best described by Markov chain followed by HopRank.
6.3 Medical Dictionary for Regulatory Activities Terminology (MEDDRA)
Setup. MEDDRA [2] is one of the largest ontologies in our dataset (see Table 1). After pre-processing, its largest connected component (LCC) consists of 23K nodes and 43K transitions.
Results. Figure 5(a) shows the HopPortation vectors learned for each type of navigation in MEDDRA. We see that users mainly navigate through 1, 2, 6, and 8-hop neighbors. For instance, transitions through direct clicks on a concept (DC), its details (DE) or expand (EX) mainly follow 1-hop and 2-hop neighbors. However, when transitions are triggered by direct URLs (DU), local search (LS) or external links (EL), users tend to reach distant target nodes (i.e., 6-hop and 8-hop neighbors). Figure 5(b) shows the ranking of models according to BIC scores (lower is better). We see that in MEDDRA all types of navigation are best explained by HopRank.
6.4 Top 11 Ontologies in BioPortal
Setup. We fit HopRank and the baseline models to all transitions by ontology and navigation type. These represent 133K transitions coming from the 11 ontologies described in Table 1.
Results. In Figure 6 we highlight the model that best explains the number of transitions per ontology and navigation type (i.e., the model with the lowest BIC score). Ontologies are sorted by their number of transitions from CPT (largest) to OMIM (smallest). HopRank outperforms the other models 89% of the time, especially when users, regardless of the ontology, browse the tree-like explorer directly via clicks (DC), details (DE) and expand (EX). When there are not enough observations (i.e., the number of transitions is small), the other models tend to outperform HopRank because they require fewer parameters and/or it is less likely to find transitions across different k-hop neighborhoods. This is the case for 6 ontologies in certain navigation types. For instance, we found 5 external search (ES) transitions in MESH which are best described by the Gravitational model (Gr). Even though HopRank was a better candidate (i.e., higher log-likelihood), BIC penalized it for having more parameters ($n_{params}^{HopRank} = 32 > n_{params}^{Gr} = 0$). Notice that we model navigation in ontologies with at least 2 transitions. Ontologies that do not fulfill this condition per navigation type are marked as green cells “-”.
(a) HopPortation vectors
(b) Model selection
Figure 5: Results on MEDDRA. (a) This heatmap shows the HopPortation vectors learned from the transitions in MEDDRA. Cells represent the probabilities of visiting a certain k-hop neighborhood (column) by a given navigation type (row). In general, 2-hop and 1-hop neighborhoods are more likely to be visited next, regardless of navigation type (ALL). However, distant hops are preferred through direct URLs (DU), external links (EL), and local search (LS). (b) This figure shows the comparison of models across navigation types using BIC scores. We see that HopRank outperforms all baseline models.
Figure 6: Model Selection on BioPortal. This heatmap highlights the model with the lowest BIC score, i.e., the one that best describes the number of transitions per ontology and navigation type. HopRank outperforms the other models 89% of the time, especially when browsing concepts via details (DE), direct click (DC) and expand (EX). When transitions are scarce (i.e., the other 11%), BIC penalizes HopRank since it has more parameters than the other models (except Markov chain).
7 DISCUSSION AND FUTURE WORK
In this section, we discuss decisions made for data processing, and future directions that can be pursued to improve our results.
Largest Connected Component (LCC). Surprisingly, ontologies on BioPortal may have multiple connected components. In those cases, only the branch connected to the root owl:Thing is shown at first in the tree-like explorer. Disconnected (and hidden) nodes or branches need to be accessed from external pages or local search. For simplicity, we opted to work with the LCC of each ontology at the cost of removing 20% of all transitions. Future work should consider the whole network to study the tradeoffs between number of transitions and random teleportation.
HopRank Extensions. More extensions based on network properties or similarity measures between nodes could improve our results. For instance, one could consider ontologies as directed graphs and assume that navigation is constrained not only by distance but also by directionality: top-down or bottom-up.
Other Types of Networks. Even though this paper targets semantic networks, we believe that HopRank can be utilized to model human navigation in other networks, such as the Web or cities. The only assumption required is that users must have background knowledge of the underlying network they are surfing/traveling in.
8 CONCLUSIONS
In this paper, we introduced the concept of HopPortation, which
states that users navigating a known or visible network are
biased towards certain k-hop neighborhoods. This is a variation of
PageRank, where we assume that teleportation is not fully random
but rather distributed non-uniformly across different neighborhoods.
We proposed HopRank, a biased random walker, to model
navigation on semantic networks. Our findings on BioPortal suggest
that semantic structure (i.e., shortest path) influences navigation
on networks. In particular, users tend to be biased towards certain
k-hop neighborhoods depending on the type of navigation. For
instance, when manually browsing the tree-like explorer, users tend
to hop to nearby concepts, whereas far-away concepts are more
likely to be reached by non-browsing types such as external links.
These results advance our understanding of how ontologies are
actually navigated and consumed, and help to develop and improve
user interfaces for ontology exploration.
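The idea of teleportation distributed non-uniformly across k-hop neighborhoods can be sketched as below. This is a simplified illustration and not the exact parameterization of HopRank: it assumes an undirected graph given as an adjacency dictionary, a hypothetical weight per hop distance (hop_weights), and a uniform choice among the nodes within the selected neighborhood.

```python
from collections import deque


def hop_distances(adj, source):
    """Shortest-path distance (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def hop_transition_row(adj, source, hop_weights):
    """Transition probabilities from source under a hop-biased walker.

    A hop distance k is chosen with probability proportional to
    hop_weights[k] (hypothetical values), then a node is chosen uniformly
    within the k-hop neighborhood of source.
    """
    by_hop = {}
    for node, k in hop_distances(adj, source).items():
        if k > 0:
            by_hop.setdefault(k, []).append(node)

    # Use only hops that exist around this source, and renormalize.
    usable = {k: w for k, w in hop_weights.items() if k in by_hop and w > 0}
    total = sum(usable.values())
    row = {}
    for k, weight in usable.items():
        for node in by_hop[k]:
            row[node] = (weight / total) / len(by_hop[k])
    return row


# Toy tree: root has children a and b; a has children c and d.
adj = {
    "root": ["a", "b"],
    "a": ["root", "c", "d"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a"],
}
row = hop_transition_row(adj, "c", {1: 0.7, 2: 0.2, 3: 0.1})
print(row)
```

Setting all weights except the 1-hop weight to zero recovers a plain random walk over links, while equal probability mass per node regardless of distance would correspond to uniform teleportation.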
Acknowledgements. We would like to thank Tania Tudorache,
John Graybeal, Matthew Horridge, Clement Jonquet, Maulik Kamdar,
Alex Skrenchuk, Marcos Oliveira, Fabian Flöck, Reinhard Munz,
Dimitar Dimitrov, Indira Sen, Mattia Samory and the three anonymous
reviewers for their time and suggestions to improve the quality
of the paper. This work was funded by DFG German Science
Fund research projects “KonSKOE” and “PoSTs II”.
REFERENCES
[1] 2011. BioPortal. https://bioportal.bioontology.org/ Accessed: 2019-02-21.
[2]
2011. MEDDRA. https://bioportal.bioontology.org/ontologies/MEDDRA Ac-
cessed: 2019-02-21.
[3]
2011. National Center for Biomedical Ontology (NCBO). https://www.
bioontology.org/ Accessed: 2019-02-21.
[4]
Joshua T Abbott, Joseph L Austerweil, and Thomas L Griths. 2015. Random
walks on semantic networks can resemble optimal foraging. (2015).
[5]
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak,
and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. The semantic
web (2007), 722–735.
[6]
Marcia J Bates. 1989. The design of browsing and berrypicking techniques for
the online search interface. Online review 13, 5 (1989), 407–424.
[7]
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual
web search engine. Computer networks and ISDN systems 30, 1-7 (1998), 107–117.
[8]
Stuart K Card, Peter Pirolli, Mija Van Der Wege, Julie B Morrison, Robert W
Reeder, Pamela K Schraedley, and Jenea Boshart. 2001. Information scent as a
driver of Web behavior graphs: results of a protocol analysis method for Web
usability. In Proceedings of the SIGCHI conference on Human factors in computing
systems. ACM, 498–505.
[9]
Ed H Chi, Peter Pirolli, Kim Chen, and James Pitkow. 2001. Using information
scent to model user information needs and actions and the Web. In Proceedings
of the SIGCHI conference on Human factors in computing systems. ACM, 490–497.
[10]
Ed H Chi, Peter Pirolli, and James Pitkow. 2000. The scent of a site: A system for
analyzing and predicting information scent, usage, and usability of a web site.
In Proceedings of the SIGCHI conference on Human Factors in Computing Systems.
ACM, 161–168.
[11]
Lisette Espín-Noboa. 2018. HopRank. https://github.com/lisette-espin/HopRank
Accessed: 2019-01-23.
[12] George W Furnas. 1986. Generalized sheye views. Vol. 17. ACM.
[13]
Florian Geigl, Daniel Lamprecht, Rainer Hofmann-Wellenhof, Simon Walk,
Markus Strohmaier, and Denis Helic. 2015. Random surfers on a web encyclope-
dia. In Proceedings of the 15th International Conference on Knowledge Technologies
and Data-driven Business. ACM, 5.
[14] Michael Gorman. 2004. Google and God’s mind. Los Angeles Times 17 (2004).
[15]
Taher H Haveliwala. 2003. Topic-sensitive pagerank: A context-sensitive ranking
algorithm for web search. IEEE transactions on knowledge and data engineering
15, 4 (2003), 784–796.
[16]
Denis Helic, Markus Strohmaier, Michael Granitzer, and Reinhold Scherer. 2013.
Models of human navigation in information networks based on decentralized
search. In Proceedings of the 24th ACM Conference on Hypertext and Social Media.
ACM, 89–98.
[17]
Thomas T Hills, Michael N Jones, and Peter M Todd. 2012. Optimal foraging in
semantic memory. Psychological review 119, 2 (2012), 431.
[18]
Jon M Kleinberg. 2002. Small-world phenomena and the dynamics of information.
In Advances in neural information processing systems. 431–438.
[19]
Mark A Musen, Natalya F Noy, Nigam H Shah, Patricia L Whetzel, Christopher G
Chute, Margaret-Anne Story, Barry Smith, and NCBO team. 2011. The national
center for biomedical ontology. Journal of the American Medical Informatics
Association 19, 2 (2011), 190–195.
[20]
Natalya F Noy, Nigam H Shah, Patricia L Whetzel, Benjamin Dai, Michael Dorf,
Nicholas Grith, Clement Jonquet, Daniel L Rubin, Margaret-Anne Storey,
Christopher G Chute, et al
.
2009. BioPortal: ontologies and integrated data
resources at the click of a mouse. Nucleic acids research (2009), gkp440.
[21]
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The
PageRank citation ranking: Bringing order to the web. Technical Report. Stanford
InfoLab.
[22]
Peter Pirolli and Stuart Card. 1999. Information foraging. Psychological review
106, 4 (1999), 643.
[23]
Gideon Schwarz et al
.
1978. Estimating the dimension of a model. The annals of
statistics 6, 2 (1978), 461–464.
[24]
Philipp Singer, Denis Helic, Behnam Taraghi, and Markus Strohmaier. 2014.
Detecting memory and structure in human navigation patterns using markov
chain models of varying order. PloS one 9, 7 (2014), e102070.
[25]
Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of
semantic knowledge. In Proceedings of the 16th international conference on World
Wide Web. ACM, 697–706.
[26]
Tania Tudorache,Jennifer Vendetti, and Natalya Fridman Noy. 2008. Web-Protege:
A Lightweight OWL Ontology Editor for the Web.. In OWLED, Vol. 432.
[27]
Frank Van Ham and AdamPerer. 2009. “Search, show context, expand on demand”:
supporting large graph exploration with degree-of-interest. IEEE Transactions on
Visualization and Computer Graphics 15, 6 (2009).
[28]
Simon Walk, Lisette Espín-Noboa, Denis Helic, Markus Strohmaier, and Mark A
Musen. 2017. How Users Explore Ontologies on the Web: A Study of NCBO’s
BioPortal Usage Logs. In Proceedings of the 26th International Conference on World
Wide Web. International World Wide Web Conferences Steering Committee,
775–784.
[29]
Simon Walk, Philipp Singer, Lisette Espín-Noboa, Tania Tudorache, Mark A
Musen, and Markus Strohmaier. 2015. Understanding how users edit ontologies:
Comparing hypotheses about four real-world projects. In International Semantic
Web Conference. Springer, 551–568.
[30]
Patricia L Whetzel, Natalya F Noy, Nigam H Shah, Paul R Alexander, Csongor
Nyulas, Tania Tudorache, and Mark A Musen. 2011. BioPortal: enhanced func-
tionality via new Web services from the National Center for Biomedical Ontology
to access and use ontologies in software applications. Nucleic acids research 39,
suppl_2 (2011), W541–W545.
[31]
Wenpu Xing and Ali Ghorbani. 2004. Weighted pagerank algorithm. In Proceed-
ings of the Second Annual Conference on Communication Networks and Services
Research, 2004. IEEE, 305–314.