Are tag clouds useful for navigation? A network-theoretic analysis

Authors:

Abstract

It is a widely held belief among designers of social tagging systems that tag clouds represent a useful tool for navigation. This is evident in, for example, the increasing number of tagging systems offering tag clouds for navigational purposes, which hints towards an implicit assumption that tag clouds support efficient navigation. In this paper, we examine and test this assumption from a network-theoretic perspective, and show that in many cases it does not hold. We first model navigation in tagging systems as a bipartite graph of tags and resources and then simulate the navigation process in such a graph. We use network-theoretic properties to analyse the navigability of three tagging datasets with regard to different user interface restrictions imposed by tag clouds. Our results confirm that tag-resource networks have efficient navigation properties in theory, but they also show that popular user interface decisions (such as "pagination" combined with reverse-chronological listing of resources) significantly impair the potential of tag clouds as a useful tool for navigation. Based on our findings, we identify a number of avenues for further research and the design of novel tag cloud construction algorithms. We also argue that any future algorithm needs to take into account the trade-off between navigational and semantic properties of the generated tag-resource networks. In particular, we introduce a simple method for estimating a so-called semantic penalty induced by a given tag-cloud construction algorithm. Our work is relevant for researchers interested in navigability of emergent hypertext structures, and for engineers seeking to improve the navigability of social tagging systems.
Denis Helic, Christoph Trattner, Markus Strohmaier, Keith Andrews
Knowledge Management Institute
Graz University of Technology
Graz, Austria
Email: {dhelic,markus.strohmaier}@tugraz.at
Institute for Information Systems and Computer Media
Graz University of Technology
Graz, Austria
Email: {ctrattner,kandrews}@iicm.edu
Know-Center, Graz University of Technology, Graz, Austria
I. INTRODUCTION
In social tagging systems such as Flickr and Delicious, tag
clouds have emerged as an interesting alternative to traditional
forms of navigation and hypertext browsing. The basic idea
is that tag clouds provide navigational clues by aggregating
tags and corresponding resources from multiple sources, and
by displaying them in a visually appealing fashion. Users are
presented with these tag clouds as a means for exploring and
navigating the resource space in social tagging systems.
While tag clouds can potentially serve different purposes,
there seems to be an implicit assumption among engineers of
social tagging systems that tag clouds are specifically useful to
support navigation. This is evident in the large-scale adoption
of tag clouds for interlinking resources in numerous systems
such as Flickr, Delicious, and BibSonomy. However, this
Navigability Assumption has rarely been critically examined
(with some notable exceptions, for example [1]), and has
largely remained untested. In this paper, we will
demonstrate that the prevalent approach to tag cloud-based
navigation in social tagging systems is highly problematic
with regard to network-theoretic measures of navigability. In
a series of experiments, we will show that the Navigability
Assumption only holds in very specific settings, and for the
most common scenarios, we can assert that it is wrong.
While recent research has studied navigation in social
tagging systems from user interface [2], [3], [4] and network-
theoretic [5] perspectives, the unique focus of this paper is
the intersection of these issues. With that focus, we want to
answer questions such as: How do user interface constraints of
tag clouds affect the navigability of tagging systems? And how
efficient is navigation via tag clouds from a network-theoretic
perspective?
Particularly, we will first 1) investigate the intrinsic navi-
gability of tagging datasets without considering user interface
effects, and then 2) take pragmatic user interface constraints
into account. Next, 3) we will demonstrate that for many social
tagging systems, the Navigability Assumption does not hold
and then we will 4) use our findings to illuminate a path
towards improving the navigability of tag clouds. Thereafter,
we will 5) argue that any new tag-cloud construction algorithm
will need to balance the trade-off between navigational and
semantic penalties induced by the network generation process,
and finally, we will 6) present a simple method for estimating
the semantic penalty.
To the best of our knowledge, this paper is among the first to
study what we have called the Navigability Assumption of Tag
Clouds, i.e. the widely held belief that tag clouds are useful
for navigating social tagging systems. One of the main results
of this paper is a more critical stance towards the usefulness of
tag clouds as a navigational aid in tagging systems. We argue
that in order to make use of the full potential of tag clouds,
new ways of thinking about tag cloud algorithms are needed.
The paper is structured as follows: In Section II we present
our network-theoretic approach to assessing navigability of
tagging systems. Section III describes the analyzed datasets.
Section IV presents and discusses the analysis results. Based
on our findings, we call for and discuss new ideas for tag
cloud algorithms in Section V. In Section VI, we sketch a new
algorithm for constructing tag clouds and present a method for
estimating the semantic properties of the network generated by
that algorithm. Section VII provides an overview of related work.
Finally, Section VIII concludes the paper and presents directions
for future work.
II. NETWORK-THEORETIC MODEL OF NAVIGATION IN
TAGGING SYSTEMS
A tagging dataset is typically modeled as a tripartite hypergraph
with $V = R \cup U \cup T$, where $R$ is the resource set, $U$ is
the user set, and $T$ is the tag set [6], [7], [8]. An annotation
of a particular resource with a particular tag produced by a
particular user is a hyperedge $(r, t, u)$, connecting three nodes
from these three disjoint sets.
Such a tripartite hypergraph can be mapped onto three
bipartite graphs connecting users and resources, users and tags,
and tags and resources. For different purposes it is often more
practical to analyse one or more of these bipartite graphs.
For example, in the context of ontology learning, the bipartite
graph of users and tags has been shown to be an effective
projection [9].
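The projection onto the tag-resource bipartite graph can be illustrated with a minimal Python sketch. The function name `project_tag_resource` and the toy annotations are hypothetical and only serve as an illustration; they are not taken from the datasets analysed in this paper.

```python
from collections import defaultdict


def project_tag_resource(annotations):
    """Map (resource, tag, user) hyperedges onto the tag-resource bipartite graph.

    Returns two adjacency maps: tag -> set of resources and
    resource -> set of tags. The user dimension is dropped.
    """
    tag_to_resources = defaultdict(set)
    resource_to_tags = defaultdict(set)
    for resource, tag, _user in annotations:
        tag_to_resources[tag].add(resource)
        resource_to_tags[resource].add(tag)
    return dict(tag_to_resources), dict(resource_to_tags)


# Hypothetical toy annotations (resource, tag, user).
annotations = [
    ("r1", "graz", "u1"),
    ("r1", "austria", "u2"),
    ("r2", "austria", "u1"),
    ("r2", "austria", "u3"),
]
tag_to_resources, resource_to_tags = project_tag_resource(annotations)
```

Because the user dimension is dropped, repeated assignments of the same tag to the same resource by different users collapse into a single edge.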
In this paper, we focus on tag-resource bipartite graphs.
These graphs naturally reflect the way users are supposed to
adopt tag clouds for navigating social tagging systems. For
example, in many tagging systems, tag clouds are intended to
be used in the following way:
1) The system presents a tag cloud to the user.
2) The user selects a tag from the tag cloud.
3) The system presents a list of resources tagged with the
selected tag.
4) The user selects a resource from the list of resources.
5) The system transfers the user to the selected resource,
and the process potentially starts anew.
We will study this general interaction schema and model
it with a simulated user moving along the edges of the
tag-resource bipartite graph and alternately visiting tag and
resource nodes.
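The interaction schema above can be sketched as a walk that alternates between resource and tag nodes. This is only an illustrative sketch: the uniform random choice of tags and resources is an assumption made here for simplicity, and the function name `simulate_navigation` and the toy adjacency maps are hypothetical.

```python
import random


def simulate_navigation(tag_to_resources, resource_to_tags, start_resource, steps, rng):
    """Alternate between tag and resource nodes, starting at a resource."""
    path = [start_resource]
    current = start_resource
    for _ in range(steps):
        tags = sorted(resource_to_tags.get(current, ()))  # step 1: tag cloud
        if not tags:
            break
        tag = rng.choice(tags)  # step 2: the user selects a tag
        resources = sorted(tag_to_resources.get(tag, ()))  # step 3: resource list
        if not resources:
            break
        current = rng.choice(resources)  # step 4: the user selects a resource
        path.extend([tag, current])  # step 5: move on and start anew
    return path


# Hypothetical toy graph.
tag_to_resources = {"graz": ["r1"], "austria": ["r1", "r2"]}
resource_to_tags = {"r1": ["graz", "austria"], "r2": ["austria"]}
path = simulate_navigation(tag_to_resources, resource_to_tags, "r1", 3, random.Random(0))
```

The returned path has resources at even positions and tags at odd positions, mirroring the alternating visits in the bipartite graph.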
To that end, we introduce a network-theoretic approach for
assessing the navigability and the efficiency of navigability in
such a bipartite graph. Ever since Milgram’s small world
experiment [10], researchers have aimed to understand “navigability”
and in particular “efficient” navigation of networks (for details
see Section VII). Among others, two important results stem
from this line of research: (1) there exist short paths between
people (nodes) in a social network and (2) people are able to
navigate “efficiently” through the network having only local
knowledge of the network, i.e. knowing only their personal
contacts.
Kleinberg [11], [12], [13] and also independently Watts
[14] formalised these properties concluding that a navigable
network has a short path between all or almost all nodes in
the network [13]. Formally, such a network has a low diameter
bounded polylogarithmically, i.e. by a polynomial in $\log N$,
where $N$ is the number of nodes in the network, and there
exists a giant component, i.e. a strongly connected component
containing almost all nodes [13]. Additionally, an “efficiently”
navigable network possesses certain structural properties so
that it is possible to design efficient decentralised search
algorithms (algorithms that only have local knowledge of the
network) [11], [12], [13]. The delivery time (the expected
number of steps to reach an arbitrary target node) of such
algorithms is polylogarithmic or at most sub-linear in N.
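The two structural criteria (a giant component and a low diameter) can be checked on small graphs with a short sketch. Several simplifications are assumed here: the graph is treated as undirected, the effective diameter is taken to be the smallest number of hops within which at least 90% of all connected node pairs lie, and no interpolation between hop counts is applied. The function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def giant_component_fraction(adj):
    """Fraction of nodes contained in the largest connected component."""
    seen, largest = set(), 0
    for node in adj:
        if node not in seen:
            component = bfs_distances(adj, node).keys()
            seen.update(component)
            largest = max(largest, len(component))
    return largest / len(adj)


def effective_diameter(adj, quantile=0.9):
    """Smallest hop count covering the given share of connected pairs."""
    hop_counts = {}
    for node in adj:
        for d in bfs_distances(adj, node).values():
            if d > 0:
                hop_counts[d] = hop_counts.get(d, 0) + 1
    total = sum(hop_counts.values())
    covered = 0
    for hops in sorted(hop_counts):
        covered += hop_counts[hops]
        if covered >= quantile * total:
            return hops
    return 0


# Hypothetical toy graph: a path a-b-c-d and a separate pair e-f.
adj = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"},
}
gc = giant_component_fraction(adj)
ed = effective_diameter(adj)
```

Running a breadth-first search from every node is only feasible for small graphs; for networks with millions of nodes, sampling of source nodes or an approximation scheme would be needed.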
User navigation in hypertext systems is naturally modeled
as a decentralised search, i.e. at each particular node in the
network, users select a new node having only local knowledge
of the network and following the idea that the selected node
would bring them closest to their destination node. We use
this model to investigate the navigability of tag clouds next.
III. EXPERIMENTAL SETUP
In the following, we conduct experiments aiming to shed
light on the navigability of tag clouds in social tagging
systems. We are particularly interested in studying how design
decisions, such as what tags to include in a tag cloud or how
many tags to display, affect the navigability of tag clouds.
While, today, designers often base such decisions on intuition
or heuristics, it is our goal to study the consequences of these
decisions experimentally, i.e. by exploring their empirical
effects on the network.
In our experiments, we used three datasets covering a range
of different settings.
Dataset Austria-Forum: This dataset consists of annotations
from an Austrian encyclopedia called Austria-Forum
(http://www.austria-lexikon.at). The dataset contains 32,245
annotations and 12,837 unique resources. The system is at an
early phase of adoption, i.e. not many users currently
contribute new tags.

Dataset BibSonomy: This dataset
(http://www.kde.cs.uni-kassel.de/ws/dc09/) contains nearly all
916,495 annotations and 235,339 unique resources from a dump of
BibSonomy [15] until 2009-01-01. Annotations from known spammers
have been excluded from the dataset. This dataset is obtained
from a more mature tagging system.

Dataset CiteULike: This dataset contains 6,328,021 annotations
and 1,697,365 unique resources and is available online
(http://www.citeulike.org/faq/data.adp). Again, this is a dataset
acquired from a more mature tagging system.

Dataset Austria-Forum represents a tagging system at an
early stage of adoption. Datasets BibSonomy and CiteULike
come from tagging systems which have reached a certain level of
maturity (i.e. attracted a larger set of active users). While
all three systems adopt tag clouds for navigational purposes,
their specific approaches vary. However, because the datasets
contain complete information about the tripartite graph, we can
experimentally manipulate the data in a way that simulates
different approaches to tag cloud construction consistently
across all datasets. We will describe how we manipulate the
data to simulate different user interface constraints next.
A. User Interface Issues
The first user interface restriction which we model is the
size of a tag cloud, i.e. the maximal number of tags displayed
in a tag cloud. While different tagging systems implement
different design choices, we can simulate alternative choices
across all datasets. For example, in some tagging systems the
maximum number of tags in a tag cloud might be 20, while
in others it might be much larger.
Another important issue of tag clouds is the algorithm used
to select the tags to display in a tag cloud. While, in theory,
there are many ways to compute and visualise tag clouds
[16], [17], [3], in practice many tagging systems follow a
simple resource-specific, TopN algorithm. In resource-specific
approaches to tag cloud construction, only tags assigned to the
corresponding resources are considered. In TopN approaches,
the top n tags with the highest resource-specific frequency are
chosen for display in the corresponding tag cloud. In cases
where fewer than n tags per resource are available, the remaining
slots are left empty.
For the experiments aiming to study the Navigability
Assumption, we used the TopN algorithm (because it is the
most common) to reconstruct simulated networks of
resource-specific tag clouds for our three datasets.
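A minimal sketch of a resource-specific TopN selection is given below. It assumes that the resource-specific frequency of a tag is the number of annotations assigning that tag to the resource, and it breaks ties alphabetically; both the tie-breaking rule and the function name `top_n_tag_clouds` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_n_tag_clouds(annotations, n):
    """Return, for each resource, its n most frequently assigned tags."""
    counts = defaultdict(Counter)
    for resource, tag, _user in annotations:
        counts[resource][tag] += 1
    clouds = {}
    for resource, tag_counts in counts.items():
        # Descending frequency first, then alphabetical order for ties.
        ranked = sorted(tag_counts.items(), key=lambda item: (-item[1], item[0]))
        clouds[resource] = [tag for tag, _ in ranked[:n]]
    return clouds


# Hypothetical toy annotations (resource, tag, user).
annotations = [
    ("r1", "web", "u1"), ("r1", "web", "u2"), ("r1", "web", "u3"),
    ("r1", "austria", "u1"), ("r1", "austria", "u2"),
    ("r1", "graz", "u1"),
    ("r2", "graz", "u1"),
]
clouds = top_n_tag_clouds(annotations, 2)
```

Resources with fewer than n tags simply receive a shorter tag cloud, which corresponds to leaving the remaining slots empty.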
Popular tags in a mature tagging system can cover hundreds
or even thousands of resources, which exceeds the pragmatic
limits of a system’s user interface. In this situation, tagging
systems usually resort to limiting the set of resources being
displayed for a given tag (for example, by sorting and “pag-
inating” the list of corresponding resources). To model such
limits, we introduce a pragmatic parameter, the length of the
resource list being presented, and denote it henceforth with k.
In the majority of tagging systems, the resource lists
presented after selecting a tag are usually sorted reverse-
chronologically (the resources most recently tagged are listed
first). For simplicity, in our experiments, we select the k
resources for k-limited resource lists randomly.
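Both ways of building a k-limited resource list mentioned above can be sketched in a few lines. The input format (a list of resource and timestamp pairs per tag) and the function names are hypothetical.

```python
import random


def paginate_reverse_chronological(tagged, k):
    """Keep the k most recently tagged resources of one tag."""
    newest_first = sorted(tagged, key=lambda item: item[1], reverse=True)
    return [resource for resource, _ in newest_first[:k]]


def paginate_random(tagged, k, rng):
    """Keep up to k resources of one tag, chosen uniformly at random."""
    resources = [resource for resource, _ in tagged]
    return rng.sample(resources, min(k, len(resources)))


# Hypothetical (resource, timestamp) pairs for a single tag.
tagged = [("r1", 1), ("r2", 5), ("r3", 3)]
latest = paginate_reverse_chronological(tagged, 2)
sampled = paginate_random(tagged, 2, random.Random(7))
```

In both variants the out-degree of the tag node in the tag-resource graph is capped at k.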
IV. RESULTS
A. Intrinsic navigability of tagging systems
We start our study by analysing the navigability of tagging
systems in a synthetic network-theoretic case, i.e. without
taking any user interface restrictions into account. The first
row in each of Tables I(a), I(b), and I(c) presents the obtained
results. The results show the existence of a giant component
connecting almost all of the nodes (98%), as well as the
existence of a low effective diameter (less than 7, i.e.
bounded by a polynomial in $\log N$; see Figure 1).
[Figure 1: hop plot. x-axis: number of hops; y-axis: percentage of pairs of nodes. Austria-Forum: EffDiam 10.7262, G(24171, 64366). BibSonomy: EffDiam 6.96109, G(291763, 1727992). CiteULike: EffDiam 6.84779, G(2045200, 12298510).]
Fig. 1. Hop plots for three different tagging datasets. We can observe the
shrinking diameter phenomenon [18]: The two mature datasets (BibSonomy
and CiteULike, the two lines on the left) exhibit a small diameter, while the
Austria-Forum (a tagging system in an early adoption phase, the line on the
right) exhibits a larger diameter, and a larger ratio of long distances between
nodes.
The only exception here is the Austria-Forum dataset. We
speculate that this is because the system is in an early
adoption stage. While the effective diameter of
the Austria-Forum dataset is larger than that of the two
other datasets (see Figure 1), it is still bounded
polylogarithmically; the giant component, however, contains only
77% of the nodes. This result suggests that the Navigability
Assumption depends on the adoption stage of the tagging system
under investigation, i.e. the assumption may only hold for more
mature tagging systems such as BibSonomy or CiteULike. We leave
the issue of identifying the point in time where immature
tagging systems transition to tagging systems exhibiting more
useful navigational properties to future research. At this point,
we simply observe that the Navigability Assumption is sensitive
to the stage of adoption of a tagging system.
Result 1: The usefulness of tag clouds for navigation is
sensitive to the phase of adoption of the social tagging system.
Figures 2(a), 2(b), and 2(c) show tag (blue), resource
(green), and degree (red) distributions for the analysed
datasets. The tag and resource distributions were obtained
by analysing a unidirectional bipartite graph, i.e. a graph
with only directed links from tags to resources. The out-
degree distribution and the in-degree distribution in this graph
correspond to tag distribution and to resource distribution
respectively. For certain ranges of degrees, both distributions
are power law distributions. There are deviations in the tail of
the tag distribution – these stem from the system tags assigned
to imported resources (see Figures 2(b) and 2(c)). The vertical
line in the tail of Figure 2(c) comes from the existence of
synonym tags in the dataset. The resource distributions exhibit
an exponential cut-off in the tail (see Figure 2(b)), a deviation
in the tail stemming from a test resource (see Figure 2(a)),
and a power law distribution as in Figure 2(c).

[Figure 2: log-log plots of Count (CCDF) against Degree, each showing the combined degree distribution, the resource distribution, and the tag distribution. Panels: (a) Austria-Forum, G(24171, 64366); (b) BibSonomy, G(291763, 1727992); (c) CiteULike, G(2045200, 12298510).]
Fig. 2. Tag, resource, and degree distributions for the three datasets. We can observe that the tag degrees are two or more orders of magnitude greater than
the resource degrees, i.e. the tag distribution strongly dominates the resource distribution for higher degrees. Therefore, the network hubs (high-degree nodes)
are the “head” tags, i.e. the top tags for TopN tag cloud construction algorithms. It is therefore to be expected that limiting the tag cloud size will not influence
the navigability of the tag-resource network, as the hub nodes are still present in the network.

TABLE I
NAVIGATIONAL PROPERTIES OF THE AUSTRIA-FORUM, BIBSONOMY, AND CITEULIKE TAGGING SYSTEMS.

(a) Austria-Forum
UIR      GC     ED      UIA      NADT
none     0.77   10.73   none     sub-lin.
n = 5    0.75   10.99   TopN     sub-lin.
n = 10   0.76   11.3    TopN     sub-lin.
n = 20   0.76   11.97   TopN     sub-lin.
n = 30   0.76   11.05   TopN     sub-lin.
k = 5    0.36   12.04   Chron.   unnav.
k = 10   0.47   11.16   Chron.   unnav.
k = 20   0.56   10.31   Chron.   unnav.
k = 30   0.6    10.68   Chron.   unnav.

(b) BibSonomy
UIR      GC     ED      UIA      NADT
none     0.98   6.96    none     sub-lin.
n = 5    0.94   6.8     TopN     sub-lin.
n = 10   0.97   6.87    TopN     sub-lin.
n = 20   0.98   6.84    TopN     sub-lin.
n = 30   0.98   6.91    TopN     sub-lin.
k = 5    0.31   6.82    Chron.   unnav.
k = 10   0.4    6.62    Chron.   unnav.
k = 20   0.5    6.61    Chron.   unnav.
k = 30   0.54   6.65    Chron.   unnav.

(c) CiteULike
UIR      GC     ED      UIA      NADT
none     0.98   6.85    none     sub-lin.
n = 5    0.93   6.97    TopN     sub-lin.
n = 10   0.95   7.07    TopN     sub-lin.
n = 20   0.97   7.17    TopN     sub-lin.
n = 30   0.97   6.98    TopN     sub-lin.
k = 5    0.27   6.89    Chron.   unnav.
k = 10   0.36   6.95    Chron.   unnav.
k = 20   0.44   6.91    Chron.   unnav.
k = 30   0.48   7.05    Chron.   unnav.

UIR = UI Restriction, GC = Giant Component, ED = Effective Diameter, UIA = UI Algorithm, NADT = Navigation Algorithm Delivery Time
Chron. = Chronological algorithm, sub-lin. = sub-linear, unnav. = unnavigable network
The degree distribution of the undirected bipartite graph (the
red line in Figures 2(a), 2(b), and 2(c)) combines both tag and
resource distributions. For lower degrees, the combined degree
distribution takes the form of the resource distribution, i.e.
the number of resources with low frequencies dominates the
number of tags with low frequencies. For higher degrees, the
combined distribution takes the form of the tag distribution, i.e.
there are more tags with high frequencies than resources with
high frequencies. The tag distribution is two or more orders
of magnitude larger than the resource distribution, i.e. the tag
distribution strongly dominates the resource distribution for
higher degrees. That means that the network hubs (high-degree
nodes) are the “head” tags, i.e. the top tags for TopN tag cloud
construction algorithms.
Due to the existence of a giant component and a low
diameter, tagging systems are intrinsically navigable. In [19],
Adamic shows the existence of efficient decentralised nav-
igation and search algorithms for power law networks. In
principle, a user could first navigate to a hub (which is
typically achieved in a few hops in a power law network)
and since hubs have a large out-degree, one can reach the
destination node easily. The delivery time of the algorithm
is sub-linear, although the number of inspected nodes in the
worst-case is O(N ), since sometimes the user needs to inspect
all outgoing links from a hub.
Result 2: Tagging networks are navigable power-law net-
works. For power law networks, efficient sub-linear decen-
tralised navigation algorithms exist.
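The hub-seeking strategy described above can be sketched as follows. This is a simplified illustration in the spirit of that description and not the algorithm from [19] itself: the walker only knows the neighbours of the current node and their degrees, always moves to the unvisited neighbour with the highest degree, and does not backtrack. The function name `high_degree_search` and the toy graph are hypothetical.

```python
def high_degree_search(adj, start, target, max_steps=1000):
    """Walk towards hubs until the target shows up among the neighbours."""
    visited = {start}
    current = start
    path = [start]
    for _ in range(max_steps):
        if current == target:
            return path
        if target in adj[current]:
            # Inspecting the out-links of the current node is local knowledge.
            path.append(target)
            return path
        candidates = [n for n in adj[current] if n not in visited]
        if not candidates:
            return None  # dead end; this sketch does not backtrack
        # Move to the unvisited neighbour with the highest degree.
        current = max(candidates, key=lambda n: (len(adj[n]), str(n)))
        visited.add(current)
        path.append(current)
    return None


# Hypothetical toy graph with one hub tag and one low-frequency tag.
adj = {
    "t_small": {"r1"},
    "t_hub": {"r1", "r2", "r3", "r4"},
    "r1": {"t_small", "t_hub"},
    "r2": {"t_hub"},
    "r3": {"t_hub"},
    "r4": {"t_hub"},
}
path = high_degree_search(adj, "t_small", "r4")
```

The membership test on the hub's neighbours hides the cost discussed above: a user would have to scan the hub's complete resource list to find the target.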
B. Tag cloud size
Rows two to five of Tables I(a), I(b), and I(c) show the
results of applying the TopN algorithm to limit the tag cloud
size on the analysed datasets. From a network-theoretic point
of view, limiting the tag cloud size means limiting the
out-degree of the resource nodes in the bipartite graph. The
out-degree of the resource nodes is two orders of magnitude
smaller than the out-degree of the tag nodes, indicating there
are no resource “hubs” in the network. Therefore, limiting the
tag cloud size does not influence the network to a large extent.
In other words, the structure of the network is still maintained,
i.e. the network remains a navigable network with navigation
efficiency inherent to power law networks.
Result 3: Limiting the tag cloud size to practically feasible
sizes (e.g. 5, 10, or more) does not influence navigability.
C. Pagination
Rows six to nine of Tables I(a), I(b), and I(c) contain
the results of simulating pagination with resource lists sorted
reverse-chronologically. Even without experiments, it is ev-
ident that limiting the number of links going out from a
tag node has destructive effects on the resulting network.
In other words, limiting the out-degree of hub nodes in a
power-law network destroys the connectivity of the network
as a whole. Our experiments show exactly that: the giant
component collapses, and the largest strongly connected
component now contains only around 50% of the nodes or fewer.
As such, pagination destroys network navigability, and the
Navigability Assumption only holds when we assume that users
would be able and willing to inspect long lists (>10,000) of
resources per tag, which is not reasonable. For example, we know from
search query log research that users rarely click on links
beyond the first result page [20]. This yields our final result:
Result 4: Limiting the out-degree of high frequency tags
(e.g. through pagination with resource lists sorted reverse-
chronologically) leaves the network vulnerable to fragmenta-
tion. This destroys navigability of prevalent approaches to tag
clouds.
V. IMPLICATIONS
The previous analysis illustrated the vulnerability of tagging
networks to the pagination effect, where a limit is placed on
the number of links going out from paginated tags, i.e. tags
with frequency higher than the pagination parameter k. This
vulnerability is mainly due to the simplicity of the common
pagination algorithm, i.e. the resource list is simply sorted
reverse-chronologically and only the k most recently tagged
resources are presented to the user. The algorithm does not
take into account the current user context, i.e. the resource
where the user clicks on a paginated tag. Rather, the same
reverse-chronological resource list is presented for a given
paginated tag throughout the system.
Let us now investigate possibilities to recover the nav-
igability of tagging networks by means of alternative tag
construction algorithms. To this end, we introduce an adapted
pagination algorithm. A simple generalisation of the pagina-
tion algorithm is to select k different resources out of all
resources tagged with a given paginated tag, depending on the
current user context, i.e. depending on the resource where the
user activates a paginated tag. Let us denote the resource list
of a given paginated tag $t$ with $R_t$. In this case, a particular
selection of resources for $t$ becomes a function of a given
resource and parameter $k$, i.e. $L_t = f(r, k)$. In other words,
each paginated tag is replaced by as many resource-specific
tags ($t_r$) as there are resources in its resource list. Each
resource-specific tag is then connected to resources computed
by $f(r, k)$. The pseudo-code of the generalised algorithm is
given in Figure 3.
We now discuss some potential functions f (r, k) for select-
ing resources from the available resource pool and analyse
their influence on network navigability.
Input: G = <V, E>, r, t, k
for all r ∈ R_t do
    add t_r to V
    add (r, t_r) to E
    L_t ← f(r, k)
    for all rr ∈ L_t do
        add (t_r, rr) to E
    end for
end for
remove t from V
Fig. 3. Generalized pagination algorithm
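The pseudo-code of Figure 3 could be translated into Python roughly as follows. This is a sketch under stated assumptions: the graph is reduced to a set of directed edges, a resource-specific tag node $t_r$ is represented by the pair (t, r), and the selection function receives the resource list as a third argument, which goes beyond the two-argument form $f(r, k)$ used in the text. All names are hypothetical.

```python
def generalized_pagination(tag_to_resources, t, k, f):
    """Replace paginated tag t by one resource-specific tag node per resource.

    Returns the directed edges (source, target) that replace tag t.
    """
    edges = set()
    resource_list = sorted(tag_to_resources[t])  # R_t
    for r in resource_list:
        t_r = (t, r)  # resource-specific tag node t_r
        edges.add((r, t_r))
        for rr in f(r, k, resource_list):  # L_t = f(r, k)
            edges.add((t_r, rr))
    # The original node t never appears in the result, which corresponds
    # to removing t from V.
    return edges


def first_k_others(r, k, resource_list):
    """Illustrative selection function: the first k resources other than r."""
    return [x for x in resource_list if x != r][:k]


# Hypothetical toy data.
tag_to_resources = {"web": {"r1", "r2", "r3"}}
edges = generalized_pagination(tag_to_resources, "web", 2, first_k_others)
```

Any selection strategy can be plugged in through `f`, as long as it returns at most k resources for a given resource.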
A. Random link selection
A first obvious choice for $f(r, k)$ is to select $k$ resources
uniformly at random. This approach generates a random graph
as introduced by [21] for each given paginated tag. As [22]
and [23] showed, graphs generated uniformly at random are
typically connected and have, with high probability, a
diameter bounded by $\log N$ (already for out-degrees $k \ge 3$).
However, since there are no structural clues in a randomly
generated network, a decentralized search algorithm will need
to inspect, in the worst case, all nodes of the network in order
to reach a destination node from the given starting node.
Table II shows the results of a random pagination algorithm
on the three test datasets. All three networks become strongly
connected with a giant component even for low values of k.
As expected, all three networks also possess a low diameter.
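A random selection function of this kind might look as follows. Excluding the current resource from its own selection is an assumption made here, since the text does not say whether a resource may link back to itself; the function name and the explicit random generator argument are hypothetical.

```python
import random


def random_selection(r, k, resource_list, rng):
    """Pick up to k resources uniformly at random, excluding r itself."""
    pool = [x for x in resource_list if x != r]
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical resource list of a single paginated tag.
resource_list = ["r1", "r2", "r3", "r4", "r5"]
picked = random_selection("r1", 3, resource_list, random.Random(42))
```

Each resource-specific tag node thus receives its own independent sample, so different entry points into the same tag lead to different parts of its resource list.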
B. Hierarchical network model
In [13], Kleinberg introduced the hierarchical network
model and elegantly proved that it is possible to design
efficient decentralised search algorithms for such networks
with a delivery time polynomial in $\log N$ (for details see
Section VII). Put simply, Kleinberg showed that, if the nodes
of a network can be organised into a hierarchy, then such
a hierarchy provides a probability distribution for connecting
the nodes in the network. The resulting network is efficiently
navigable. A special case of the hierarchical network model is
given when there is a constant number of links leaving a node,
i.e. when the out-degree of a node is limited by a parameter
k, as is the case with pagination. In this case, the tree leaves
contain so-called clusters of nodes, i.e. a collection of a certain
constant number of nodes.
Thus, we developed a hierarchical network generator that 1)
sorts the resource list of a given paginated tag by frequency,
2) creates resource clusters of size 10 by traversing the sorted
resource list sequentially, 3) creates a balanced b-ary (b = 5)
tree where the number of leaves is equal to the number of
the resource clusters, 4) traverses the tree in postorder from
left to right and attaches resource clusters to the tree leaves,
and 5) uses this tree structure to obtain the link probability
distribution for connecting a resource-specific tag node with
resources of a given paginated tag.
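The five steps above can be sketched as follows. Several details are assumptions, because the text does not fix them: resources are sorted by descending frequency, the balanced b-ary tree is represented implicitly by leaf indices (leaf i has parent i // b), and the link weight between two resources is taken to be $b^{-h}$, where $h$ is the height of the lowest common ancestor of their leaves. All names are hypothetical.

```python
import random


def lca_height(i, j, b):
    """Height of the lowest common ancestor of leaves i and j in a b-ary tree."""
    h = 0
    while i != j:
        i //= b
        j //= b
        h += 1
    return h


def make_clusters(ranked_resources, cluster_size):
    """Cut the sorted resource list sequentially into clusters."""
    return [ranked_resources[s:s + cluster_size]
            for s in range(0, len(ranked_resources), cluster_size)]


def hierarchical_selection(r, k, resource_freq, rng, b=5, cluster_size=10):
    """Select up to k resources for r, favouring nearby leaves of the tree."""
    # Step 1: sort by frequency (descending order is an assumption).
    ranked = sorted(resource_freq, key=lambda x: (-resource_freq[x], x))
    # Steps 2 to 4: clusters are attached to the leaves from left to right.
    clusters = make_clusters(ranked, cluster_size)
    leaf_of = {res: leaf for leaf, cluster in enumerate(clusters) for res in cluster}
    # Step 5: link weights decay with the height of the common ancestor.
    candidates = [x for x in ranked if x != r]
    weights = [b ** -lca_height(leaf_of[r], leaf_of[x], b) for x in candidates]
    chosen = []
    while candidates and len(chosen) < k:
        idx = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.append(candidates.pop(idx))
        weights.pop(idx)
    return chosen


# Hypothetical frequencies for 60 resources of one paginated tag.
resource_freq = {f"r{i}": 100 - i for i in range(60)}
chosen = hierarchical_selection("r0", 5, resource_freq, random.Random(1))
```

Resources in the same cluster as r get weight 1, resources under the same parent get weight 1/b, and so on, so most links stay local while a few reach distant parts of the resource list.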
TABLE II
NAVIGATIONAL PROPERTIES OF THE AUSTRIA-FORUM, BIBSONOMY, AND CITEULIKE TAGGING SYSTEMS WITH A RANDOM PAGINATION ALGORITHM.

(a) Austria-Forum
UIR      GC     ED      UIA      NADT
k = 5    0.86   11.7    Random   linear
k = 10   0.86   11.02   Random   linear
k = 20   0.85   10      Random   linear
k = 30   0.84   10.42   Random   linear

(b) BibSonomy
UIR      GC     ED      UIA      NADT
k = 5    0.99   8.75    Random   linear
k = 10   0.99   6.97    Random   linear
k = 20   0.99   6.75    Random   linear
k = 30   0.99   6.46    Random   linear

(c) CiteULike
UIR      GC     ED      UIA      NADT
k = 5    0.99   7.98    Random   linear
k = 10   0.99   7.88    Random   linear
k = 20   0.99   7.13    Random   linear
k = 30   0.99   6.86    Random   linear

UIR = UI Restriction, GC = Giant Component, ED = Effective Diameter, UIA = UI Algorithm, NADT = Navigation Algorithm Delivery Time

It is important to note that the tree creation process follows
the statistical properties of the tagging dataset only; it has no
inherent semantic rationale. As such, it serves primarily as a
statistical tool to improve the efficiency of navigability from a
network-theoretic perspective. Table III provides an overview
of the results of the structural network analysis performed with
the three real-life datasets.
Another important observation is that in our model each
paginated tag is a source of a network generated by a hi-
erarchy. These networks are themselves connected through
tag co-occurrence in the dataset, i.e. since tags overlap and
share resources such shared resources link different generated
networks. This makes it more difficult to estimate the delivery
time of a decentralised search algorithm possessing only
the local knowledge. If the algorithm is extended to have
knowledge of all the hierarchies used in the generation of the
networks, then this additional information might be useful in
finding a destination node faster.
However, more theoretical work is needed to offer a proof
of this intuitive assumption. In addition, it would be interesting
to test these ideas empirically, for example, by implementing
the algorithm and applying it to the real-life datasets. An-
other interesting problem is the fitting of parameters for the
hierarchical network model, for example what is the optimal
combination of the cluster size and the maximum number of
children, with respect to the size of the resource list and the
pagination parameter k.
C. Calculation of resource hierarchies
The hierarchy used in our experiments so far does not pos-
sess any semantic grounding. It is a synthetic hierarchy trying
to optimize navigational aspects of the generated network.
However, improvements of our algorithm will need to take
the semantics of the dataset into account by identifying a
set of resource (metadata) attributes. For example, resource
attributes might be the date of creation, authors, other tags,
or even attributes external to the system such as URLs,
full-text, or title. Similar to tag-resource bipartite graphs, a
collection of metadata attributes and resources can be always
represented as yet another bipartite graph. Thus, the discussion
that follows applies for arbitrary resource metadata. However,
for simplicity reasons we refer henceforth only to tag-resource
bipartite graphs.
Let us briefly discuss possible approaches to obtaining
semantically useful resource hierarchies. We can calculate
resource hierarchies by applying, e.g., modern hierarchical
clustering algorithms such as K-Means [24] or Affinity
Propagation [25] to the tag vectors (see e.g. [26]). Alternatively, if
we deal with text resources, it is possible to apply K-Means or
Affinity Propagation to the term vectors. However, in the general
case, e.g. when we deal with non-textual resources
such as images or videos, we have only tag vectors.
In [27] the authors argue that the similarities between tags (the
tag vectors are sparse) are not sufficiently large for purely
similarity-based hierarchical clustering methods. Therefore,
the authors designed a new algorithm tailored to the specifics
of social tagging data. This new algorithm produces
so-called folksonomies⁴, folk-generated taxonomies which are
tag hierarchies. In [28] the authors extend this idea and design
yet another folksonomy creation algorithm.
The input for those folksonomy creation algorithms is the
so-called tag similarity graph, an unweighted graph with tags
as nodes. Two nodes are linked to each other if their similarity
is above a predefined similarity threshold. In the simplest case,
the threshold is defined through tag overlap: if the tags do
not overlap in at least one resource, then they are not linked
to each other in the tag similarity graph. As the first step, the
algorithm calculates node centralities, producing a generality
ranking where the most general tags come in the top positions.
Then, the algorithm starts with a single-node tree with the most
general tag as the root node and proceeds by iterating through
the generality ranking and adding each tag to the tree: the
algorithm calculates the similarities between the current tag
and each tag currently present in the tree and adds the current
tag as a child to its most similar tag.
The algorithm is extensible as it is possible to apply
different similarity and centrality measures, e.g. the algorithm
described in [27] works with the cosine similarity and close-
ness centrality, whereas the algorithm described in [28] works
with the co-occurrence and degree centrality.
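The greedy construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [27] or [28]: the names `cooccurrence_counts` and `build_tag_hierarchy` and the toy data are hypothetical, the similarity threshold is "at least one shared resource", and ties are broken alphabetically in the ranking and in favour of the earlier-placed tag when attaching. Similarity and centrality are passed in as functions, so other measures can be swapped in; the usage below plugs in co-occurrence and degree centrality, the combination attributed to [28].

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(tags_of):
    """Map each tag to {other tag: number of resources carrying both}."""
    cooc = defaultdict(dict)
    for tags in tags_of.values():
        for a, b in combinations(sorted(set(tags)), 2):
            cooc[a][b] = cooc[a].get(b, 0) + 1
            cooc[b][a] = cooc[b].get(a, 0) + 1
    return cooc


def build_tag_hierarchy(tags, centrality, similarity):
    """Return {tag: parent tag}; the root (most general tag) maps to None."""
    # Generality ranking: most central tag first, alphabetical tie-break (assumption).
    ranking = sorted(tags, key=lambda t: (-centrality(t), t))
    parent = {ranking[0]: None}
    for tag in ranking[1:]:
        # Attach to the most similar tag already in the tree; max() keeps the
        # first maximum, so the earliest-placed tag wins ties (assumption).
        parent[tag] = max(parent, key=lambda placed: similarity(tag, placed))
    return parent


# Hypothetical toy data: resource -> tags.
tags_of = {
    "r1": ["austria", "vienna"],
    "r2": ["austria", "europe"],
    "r3": ["austria", "alps"],
    "r4": ["vienna", "museum"],
}
cooc = cooccurrence_counts(tags_of)
all_tags = {t for tags in tags_of.values() for t in tags}
hierarchy = build_tag_hierarchy(
    all_tags,
    centrality=lambda t: len(cooc[t]),          # degree in the tag similarity graph
    similarity=lambda a, b: cooc[a].get(b, 0),  # number of shared resources
)
```

In this toy example the tag "austria" becomes the root, and "museum" is attached below "vienna" because that is the only tag it shares a resource with.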
⁴ http://www.vanderwal.net/folksonomy.html

TABLE III
NAVIGATIONAL PROPERTIES OF THE AUSTRIA-FORUM, BIBSONOMY, AND CITEULIKE TAGGING SYSTEMS WITH A HIERARCHICAL PAGINATION ALGORITHM.

(a) Austria-Forum
UIR     GC     ED      UIA     NADT
k=5     0.85   12.03   Hier.   polylog.
k=10    0.86   10.62   Hier.   polylog.
k=20    0.85   9.29    Hier.   polylog.
k=30    0.84   9.71    Hier.   polylog.

(b) BibSonomy
UIR     GC     ED      UIA     NADT
k=5     0.99   8.82    Hier.   polylog.
k=10    0.99   7.62    Hier.   polylog.
k=20    0.99   6.94    Hier.   polylog.
k=30    0.99   6.75    Hier.   polylog.

(c) CiteULike
UIR     GC     ED      UIA     NADT
k=5     0.99   8.76    Hier.   polylog.
k=10    0.99   7.6     Hier.   polylog.
k=20    0.99   6.36    Hier.   polylog.
k=30    0.99   5.89    Hier.   polylog.

UIR = UI Restriction, GC = Giant Component, ED = Effective Diameter, UIA = UI Algorithm, NADT = Navigation Algorithm Delivery Time
Hier. = Hierarchical Algorithm, polylog. = polylogarithmic

The folksonomy algorithms produce tag hierarchies; however,
we are interested in producing resource hierarchies. A
possible approach is to adapt the folksonomy algorithms to
produce global resource hierarchies or resource hierarchies of
a given paginated tag instead of global tag hierarchies. Thus,
the adapted algorithm 1) maps the bipartite tag-resource graph
onto a resource-resource co-occurrence graph, 2) compiles a
generality ranking by calculating a centrality of nodes in the
resource-resource graph, 3) builds the co-occurrence matrix
between resources for similarity calculation, 4) starts with the
most general resource as the root node, 5) iterates through the
generality ranking and attaches the next resource from the
ranking to its most similar resource from the tree. In addition
(to obtain better navigational properties), we can introduce a
hierarchy branching factor b and add new resources to the
tree only to those resources that still have available spots for
child resources.
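The five steps of the adapted algorithm, including the branching factor b, can be sketched as follows. This is a hedged illustration only: the function name `build_resource_hierarchy` and the toy data are hypothetical, degree centrality and the number of shared tags are assumed as the centrality and similarity measures, and ties are resolved alphabetically or in favour of the earlier-placed resource. The sketch materialises the full resource-resource co-occurrence structure in memory, so it does not address scalability for large datasets.

```python
from collections import defaultdict
from itertools import combinations


def build_resource_hierarchy(resources_of, b):
    """Return {resource: parent resource}; the root maps to None.

    resources_of maps each tag to the resources it annotates;
    b is the maximum number of children per resource (b >= 1).
    """
    if b < 1:
        raise ValueError("branching factor b must be at least 1")

    # Steps 1 and 3: project the bipartite graph onto resources and count
    # the tags shared by every pair of resources.
    shared = defaultdict(lambda: defaultdict(int))
    for resources in resources_of.values():
        for r1, r2 in combinations(sorted(set(resources)), 2):
            shared[r1][r2] += 1
            shared[r2][r1] += 1

    # Step 2: generality ranking by degree centrality (assumed measure).
    all_resources = sorted({r for rs in resources_of.values() for r in rs})
    ranking = sorted(all_resources, key=lambda r: (-len(shared[r]), r))

    # Step 4: the most general resource becomes the root.
    parent = {ranking[0]: None}
    children = defaultdict(int)

    # Step 5: attach each resource to its most similar resource in the
    # tree that still has a free child slot.
    for res in ranking[1:]:
        open_slots = [p for p in parent if children[p] < b]
        best = max(open_slots, key=lambda p: shared[res].get(p, 0))
        parent[res] = best
        children[best] += 1
    return parent


# Hypothetical toy data: tag -> resources.
resources_of = {
    "austria": ["r1", "r2", "r3", "r4"],
    "vienna": ["r1", "r2"],
    "alps": ["r3"],
}
tree_b2 = build_resource_hierarchy(resources_of, b=2)
tree_b1 = build_resource_hierarchy(resources_of, b=1)
```

With b = 2 the root r1 takes two children and the remaining resource moves one level down; with b = 1 the tree degenerates into a chain, which shows how the branching factor shapes the depth of the hierarchy.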
Future work can concentrate on the implementation and
evaluation of such algorithms. One problem that future
work needs to address is scalability: resource-resource
mappings of tagging datasets tend to produce huge networks with
billions of links.
VI. NAVIGATIONAL AND SEMANTIC PENALTY
The previous section shows that one way of designing
an efficiently navigable network in a tagging system is to
classify the resources of a given paginated tag into a hierarchy.
Thus, to design a navigable network, the pagination algorithm
needs to organise these resource attributes into a hierarchy.
At the same time, it is difficult to expect that an algorithm
taking into account the semantics of resources can produce an
optimal hierarchy that optimizes navigability of the tagging
system as a whole. Rather, the semantic algorithm will tend
to produce an unbalanced tree with a variable cluster size.
As a consequence, the navigational structure generated by
such an algorithm will be sub-optimal, i.e. a decentralised
search algorithm will need to take more steps (investigate
more nodes) to find a destination node. We will call this effect
the navigational penalty. Of course, the pagination algorithm
might be altered to produce a tree closer to the optimal tree
from the navigational point of view. This, however, seems
possible only by breaking semantics to a certain extent. We
will call this contrasting effect the semantic penalty. This
reveals an essential trade-off which tag cloud construction
algorithms will need to address: balancing the navigational
and semantic penalties.
Let us illustrate the navigational and semantic penalties
with an example. Suppose we have 1000 resources about
Austrian cities tagged with “Austria”. A particular tagging
system might decide to paginate that tag with a pagination
parameter of k = 20 (listing 20 resources per page). Firstly, the
system would need to semantically classify the resources into
a clustered hierarchy. For example, it could take geography as
the criterion for creating clusters, with each cluster corresponding to
an Austrian province. However, the size of the clusters varies
and the province of Vienna (the capital of Austria) might
dominate, since it contains, say, 500 resources. Generating
the network from such an unbalanced hierarchy will result
in a navigational penalty, whereas a new classification of the
resources taking into account the Vienna districts as a further
geographical refinement to balance the cluster size may cause
a semantic penalty, if the Vienna province is represented at a
finer level of detail than other provinces.
A. Measuring Semantic Penalty
In the following we present a simple method for estimating
the semantic penalty of different pagination algorithms.
If we ignore pagination and show the complete resource list
$R_t$ whenever a tag t is selected, t is connected through this
resource list to the set of its co-occurring tags. We represent the
tag co-occurrence of t by means of the co-occurrence vector
$c_t$. The dimensions in this vector correspond to tags, and the
value of a particular dimension is the number of resources that
share both t and the dimension tag.
Taking into account pagination, a particular selection of
resources for t is the set $L_t$, which is a function of r, k,
and the resource hierarchy in question. We can now introduce
a resource-specific co-occurrence vector of a given tag t
and denote it as $c_t^r$. Again, the vector dimensions are tags,
and the vector values correspond to the number of shared
resources between t and a particular dimension tag. However,
the resources have to belong to $L_t$ now.
We take the complete co-occurrence vector $c_t$ of a given
tag t as the ground truth. The resource-specific co-occurrence
vector $c_t^r$ of t is then compared against $c_t$ using cosine
similarity cs (the cosine of the angle between vectors $c_t$ and $c_t^r$)
to estimate its alignment with the ground truth:
$$cs(c_t, c_t^r) = \frac{c_t \cdot c_t^r}{\|c_t\| \, \|c_t^r\|} \qquad (1)$$
In the next step, we calculate the arithmetic mean of cosine
similarities over all resources of a given paginated tag t:
$$cs_t = \frac{1}{|R_t|} \sum_{r \in R_t} cs(c_t, c_t^r) \qquad (2)$$
Then we calculate the arithmetic mean of $cs_t$ over all tags:
$$cs = \frac{1}{|T|} \sum_{t \in T} cs_t \qquad (3)$$
Finally, we obtain a single numerical value, the semantic
penalty of a given pagination algorithm, as:
$$sp = 1 - cs \qquad (4)$$
We subtract from 1 to express the fact that maximum
similarity would be equivalent to the absence of any semantic
penalty. In addition, we can vary the parameter k to see
how the semantic penalty is distributed with the size of the
paginated page presented to users.
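The estimation procedure of equations (1) to (4) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the code used for the experiments: all function names and the toy data are hypothetical, the pagination algorithm is abstracted as a callable `paginate(tag, resource, k)` returning the displayed list, and `first_k` is a simple stand-in that ignores the resource r and always shows the first k entries. The treatment of zero vectors (similarity 1 if both vectors are empty, 0 if only one is) is also an assumption, since the text does not specify it.

```python
import math
from collections import Counter


def cooccurrence_vector(tag, resources, tags_of):
    """Count, for every other tag, the given resources it shares with `tag`."""
    counts = Counter()
    for res in resources:
        for other in tags_of[res]:
            if other != tag:
                counts[other] += 1
    return counts


def cosine(u, v):
    """Cosine similarity of two sparse vectors, as in equation (1)."""
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 and norm_v == 0.0:
        return 1.0  # assumption: no co-occurrences at all, nothing is lost
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: all co-occurrences were cut off
    dot = sum(x * v.get(key, 0) for key, x in u.items())
    return dot / (norm_u * norm_v)


def mean_similarity(tag, resources_of, tags_of, paginate, k):
    """Mean cosine similarity over all resources of one tag, equation (2)."""
    full_list = resources_of[tag]
    c_t = cooccurrence_vector(tag, full_list, tags_of)
    sims = [
        cosine(c_t, cooccurrence_vector(tag, paginate(tag, r, k), tags_of))
        for r in full_list
    ]
    return sum(sims) / len(sims)


def semantic_penalty(resources_of, tags_of, paginate, k):
    """Equations (3) and (4): one minus the mean similarity over all tags."""
    per_tag = [
        mean_similarity(tag, resources_of, tags_of, paginate, k)
        for tag in resources_of
    ]
    return 1.0 - sum(per_tag) / len(per_tag)


# Hypothetical toy data; every tag annotates at least one resource.
tags_of = {
    "r1": ["austria", "vienna"],
    "r2": ["austria", "europe"],
    "r3": ["austria", "europe"],
    "r4": ["austria", "alps"],
    "r5": ["austria", "alps"],
}
resources_of = {
    "austria": ["r1", "r2", "r3", "r4", "r5"],
    "vienna": ["r1"],
    "europe": ["r2", "r3"],
    "alps": ["r4", "r5"],
}


def first_k(tag, resource, k):
    """Stand-in paginator (assumption): ignore `resource`, show the first k."""
    return resources_of[tag][:k]


cs_austria = mean_similarity("austria", resources_of, tags_of, first_k, k=1)
sp_k1 = semantic_penalty(resources_of, tags_of, first_k, k=1)
sp_k5 = semantic_penalty(resources_of, tags_of, first_k, k=5)
```

Passing the paginator as a function keeps the measure independent of any particular pagination algorithm, so the same code can be evaluated for different list orderings and for different values of k.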
Let us illustrate the intuition behind the semantic penalty
with the following example. In a given tagging dataset, seman-
tics emerge through relations between tags, e.g. the tag Austria
might be related via co-occurrence to tags such as Vienna
(sharing a single resource), Europe (sharing two resources),
and Alps (sharing two resources). Through pagination, some of
the links disappear because resources and their corresponding
tags are omitted from the resource list, e.g. after pagination
Austria is related only to Vienna. Let Austria be the first
dimension, Vienna the second, Europe the third, and Alps the
fourth. We have:
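$$c_{austria} = \begin{pmatrix} 0 \\ 1 \\ 2 \\ 2 \end{pmatrix}, \quad c^{r}_{austria} = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}, \quad cs = \frac{1}{3}, \quad sp = \frac{2}{3}$$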
Thus, the semantic penalty measures the extent to which the
list of displayed resources is semantically different from the
global semantics of the tag.
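The values in this example follow directly from equations (1) and (4). Spelled out as a check, for the single tag Austria and a single paginated list (so that the averages in equations (2) and (3) are trivial):

```latex
\begin{align*}
c_{austria} \cdot c^{r}_{austria} &= 0 \cdot 0 + 1 \cdot 1 + 2 \cdot 0 + 2 \cdot 0 = 1 \\
\|c_{austria}\| &= \sqrt{0^2 + 1^2 + 2^2 + 2^2} = \sqrt{9} = 3 \\
\|c^{r}_{austria}\| &= \sqrt{0^2 + 1^2 + 0^2 + 0^2} = 1 \\
cs &= \frac{1}{3 \cdot 1} = \frac{1}{3} \\
sp &= 1 - \frac{1}{3} = \frac{2}{3}
\end{align*}
```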
Figure 4 compares the semantic penalty of the reverse
chronological, random, and synthetic hierarchical (see Section
V-B) pagination algorithms over all datasets. The preliminary
results show that the semantic penalty does not depend on the
selection of the pagination algorithm but only on the length k
of the paginated list. This result is consistent over all datasets.
Although the results are only preliminary they contain
an interesting observation: While the semantic penalties for
smaller k are still significant, as k grows the semantic penalty
decreases very quickly. Even though the algorithms do not
optimize for semantics, paginated lists of length 20 or more do
not induce significant semantic penalties. Consistently over all
datasets and all algorithms the semantic penalty for k greater
than 20 drops to 1%. The exception here is again the
Austria-Forum dataset (the semantic penalty is marginal even for small
k): there are only a few hub tags in the network, which reduces
the pagination effect on the semantics.
Result 5: Limiting the pagination list length to practically
feasible sizes (e.g. 20, 30, or more) does not introduce a
significant semantic penalty.
Further investigation should evaluate semantically optimized
algorithms to identify potential differences between
the algorithms observed here and new semantics-aware pagination algorithms.
However, as the semantics is not significantly impaired by
pagination (at least for higher values of k), future research
can concentrate on measuring the navigational penalty and
optimizing pagination algorithms for navigation.
VII. RELATED WORK
We start our review of related work with a brief overview
of network-related research. Research on network navigability
has been inspired by Milgram’s small world experiment [10].
In this experiment, selected persons from Nebraska received
a letter they were then asked to send through their social
networks to a stockbroker in Boston. The striking result of
the study was that, for those letters reaching the destination,
the average number of hops was around 6, i.e. the population
of the USA constituted a “small world”. While the conclusions
have been challenged [29], this experiment has attracted a
great deal of interest in the research community.
Numerous researchers analysed Milgram’s experiment try-
ing to create network models and generators able to produce
such “small world” networks (see for example [30]). The
lattice model by Watts [31] mimics a real-life social network,
where people are primarily connected to their neighbours with
a few “long-range” contacts. The networks generated by this
model have, like the random graph model [22], [23], a giant
component and a diameter bounded by log N.
Kleinberg analysed the second result of Milgram’s
experiment, the ability of people to find a short path when
there is such a path between two nodes [11], [12], [13]. He
concluded that there are structural clues in such networks,
which allow people to find a short path efficiently and argued
that for an “efficiently” navigable network there exists a
decentralised search algorithm with delivery time polynomial
in log N.
Kleinberg also designed a number of network models such
as 2D-grid models [12], hierarchical models [13], and group
models [13], and showed that for certain combinations of
parameters, efficient decentralised search algorithms exist.
Particularly, hierarchical network models [13] are based on
the idea that, in many settings, the nodes in a network might
be classified according to a taxonomy. The taxonomy can be
represented as a b-ary tree and network nodes can be attached
to the leaves of the tree. For each node v, we can create a link
to every other node w with a probability that decreases with
h(v, w), where h is the height of the least common ancestor of
v and w in the tree. For a constant out-degree, the nodes are
clustered and then the clusters are attached to the tree. The
link distribution defined by $f(h) = (h + 1)^{-2} b^{-h}$ generates a
navigable network with a decentralised search algorithm with
delivery time of $O(\log_b^4 N)$.
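To make the link distribution concrete, the following Python sketch computes the height of the least common ancestor of two leaves in a complete b-ary tree and the resulting normalised link probabilities. It is a simplified illustration, not Kleinberg's full model: it assumes one node per leaf (no clusters), leaves numbered 0 to N-1 from left to right, and the form f(h) = (h + 1)^(-2) b^(-h), i.e. with negative exponents so that the probability decreases with h as the text states. The function names are hypothetical.

```python
def lca_height(v, w, b):
    """Height of the least common ancestor of leaves v and w in a b-ary tree."""
    h = 0
    while v != w:
        # Moving one level up maps a node index to its parent's index.
        v //= b
        w //= b
        h += 1
    return h


def link_probabilities(v, n_leaves, b):
    """Normalised probability of a link from leaf v to every other leaf."""
    weights = {}
    for w in range(n_leaves):
        if w == v:
            continue
        h = lca_height(v, w, b)
        weights[w] = (h + 1) ** -2 * b ** -h  # assumed form of f(h)
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}


# Example: 8 leaves under a binary tree, links from leaf 0.
probs = link_probabilities(0, n_leaves=8, b=2)
```

Leaves in the same subtree as the source receive a higher probability than leaves that are only joined at the root, which is the "structural clue" a decentralised search algorithm can exploit.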
In related research of tagging systems, tag clouds have
been characterised as a way to translate the emergent vo-
cabulary of a folksonomy into social navigation tools [4],
[32]. Social navigation itself represents a multi-dimensional
concept, covering a range of different issues and ideas. A
distinction between direct and indirect social navigation, for
example, highlights whether navigational clues are provided by
direct communication among users (e.g. via chat), or whether
navigational clues are indirectly inferred from historical traces
left by others [33]. Based on this distinction, our work only
focuses on indirect social navigation in the sense that it studies
the effectiveness of traces (“tags”) left by users in tagging
systems. Other types of social navigation emphasise the need
to show the presence of other users, to build trust among
groups of users, or to encourage certain behaviour [33].

[Figure 4: three line plots, (a) Austria-Forum, (b) BibSonomy, (c) CiteULike, each showing the semantic penalty (percentage, axis from 0 to 5) against the pagination parameter k (5 to 30) for the reverse chronological, random, and hierarchical pagination algorithms.]
Fig. 4. The semantic penalty induced by different pagination algorithms for the three datasets. The two mature datasets (BibSonomy and CiteULike) exhibit
larger semantic penalties, while the Austria-Forum (a tagging system in an early adoption phase) exhibits significantly smaller penalties: there are fewer
paginated hub tags in the Austria-Forum and therefore the pagination effect on the semantics is marginal. The semantic penalty does not depend on the
pagination algorithm but solely on the number of resources shown in the paginated list. While the semantic penalty for smaller values of k, e.g. 5 and 10, is still
significant, limiting the paginated list to a practically feasible length, e.g. 20, does not impair semantics (the semantic penalty drops to 1%).

Researchers have discussed the advantages and drawbacks
of tag clouds, suggesting that tag clouds are a useful mecha-
nism when users’ search tasks are general and explorative (for
example, learn about Web 2.0), while tag clouds provide little
value for specific information-seeking tasks (for example, nav-
igate to www.cnn.com) [4]. While the paper at hand focuses on
network-theoretic aspects, cognitive aspects of navigation have
been studied previously using, for example, SNIF-ACT [34]
and social information foraging theory [35]. Other work has
studied the motivations of users for tagging [36], [37], and how
they influence emergent semantic (as opposed to navigational)
structures. The navigational utility of single tags has been
investigated [38] with somewhat disappointing results. With
time the tags become harder and harder to use as they lose
specificity and reference too many resources. Such tags are
exactly those paginated tags where new pagination algorithms
are needed.
Navigation models for tagging systems have also been
discussed recently. In [8] the authors describe a navigation framework
for tagging systems and apply it to
analyze possible attacks on tagging systems. In principle, the
framework identifies a navigation channel as any combination
of the basic elements of a tagging system (users, tags, and
resources). Thus, the specific combination which we
investigated in this paper can be summarized as the resource-tag or
tag-resource navigation channel.
Recent literature also discusses algorithms for the construc-
tion of tag clouds. The ELSABer algorithm [39] represents
an example of such an effort aimed towards identifying
hierarchical relationships between annotations to facilitate
browsing. The work by [40] is another example, introducing
entropy-based algorithms for the construction of interesting
tag clouds. However, these algorithms have not found wide-
spread adoption in current social tagging systems. In addition,
empirical studies of tagging systems have for example focused
on comparing navigational characteristics of tag distributions
to similar distributions produced by library terms [41].
Our work contributes to an increased theoretical understand-
ing about the navigability of current tag cloud algorithms in
social tagging systems. Our experiments identify empirical
problems related to the navigability of tag clouds in three real-
world tagging systems.
VIII. CONCLUSION
The motivation for this research was to examine and test
the widely held belief that tag clouds support efficient nav-
igation in social tagging systems. We have shown that for
certain specific, but popular, tag cloud scenarios, the so-called
Navigability Assumption does not hold. The results presented
in this paper make a theoretical and an empirical argument
against existing approaches to tag cloud construction. Our
work thereby both confirms and refutes the assumption that
current tag cloud incarnations are a useful tool for navi-
gating social tagging systems. While we confirm that tag-
resource networks have efficient navigational properties in
theory, we show that popular user interface decisions (such as
“pagination” combined with reverse-chronological listing of
resources) significantly impair navigability. Our experimental
results demonstrate that popular approaches to using tag clouds
for navigational purposes suffer from significant problems.
Building on recent research results from network theory, in
particular hierarchical network models, we have illustrated a
path towards constructing more efficiently navigable tag cloud
networks, which are less vulnerable to pagination influences.
Our findings suggest that engineers who want to design
effective tag cloud algorithms have to essentially strike a
balance between semantic and navigation penalties, in order to
make navigation in social tagging systems both efficient and
effective. We also presented a simple method for estimating
the semantic penalty. The method is based on measuring
cosine similarity between the non-paginated (ground truth) and
algorithmically generated paginated tag co-occurrence vectors.
Future work needs to investigate the possibilities for
measuring the navigational penalty.
We conclude that in order to make full use of the potential
of tag clouds for navigating social tagging systems, new and
more sophisticated ways of thinking about designing tag cloud
algorithms are needed.
REFERENCES
[1] M. A. Hearst and D. Rosner, “Tag clouds: Data analysis tool or social signaller?” in HICSS ’08: Proceedings of the 41st Annual Hawaii International Conference on System Sciences. Washington, DC, USA: IEEE Computer Society, 2008.
[2] C. S. Mesnage and M. J. Carman, “Tag navigation,” in SoSEA ’09: Proceedings of the 2nd international workshop on Social software engineering and applications. New York, NY, USA: ACM, 2009, pp. 29–32.
[3] A. W. Rivadeneira, D. M. Gruen, M. J. Muller, and D. R. Millen, “Getting our head in the clouds: toward evaluation studies of tagclouds,” in CHI ’07: Proceedings of the SIGCHI conference on Human factors in computing systems. New York, NY, USA: ACM, 2007, pp. 995–998.
[4] J. Sinclair and M. Cardew-Hall, “The folksonomy tag cloud: when is it useful?” Journal of Information Science, vol. 34, p. 15, 2008. [Online]. Available: http://jis.sagepub.com/cgi/content/abstract/34/1/15
[5] N. Neubauer and K. Obermayer, “Hyperincident connected components of tagging networks,” in HT ’09: Proceedings of the 20th ACM conference on Hypertext and hypermedia. New York, NY, USA: ACM, 2009, pp. 229–238.
[6] C. Cattuto, C. Schmitz, A. Baldassarri, V. D. P. Servedio, V. Loreto, A. Hotho, M. Grahl, and G. Stumme, “Network properties of folksonomies,” AI Commun., vol. 20, no. 4, pp. 245–262, 2007.
[7] C. Schmitz, A. Hotho, R. Jäschke, and G. Stumme, “Mining association rules in folksonomies,” in Data Science and Classification: Proc. of the 10th IFCS Conf., Studies in Classification, Data Analysis, and Knowledge Organization. Springer, 2006, pp. 261–270.
[8] M. Ramezani, J. Sandvig, T. Schimoler, J. Gemmell, B. Mobasher, and R. Burke, “Evaluating the impact of attacks in collaborative tagging environments,” in Computational Science and Engineering, 2009. CSE ’09. International Conference on, vol. 4, Aug. 2009, pp. 136–143.
[9] P. Mika, “Ontologies are us: A unified model of social networks and semantics,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 5, no. 1, pp. 5–15, 2007.
[10] S. Milgram, “The small world problem,” Psychology Today, vol. 1, pp. 60–67, 1967.
[11] J. M. Kleinberg, “Navigation in a small world,” Nature, vol. 406, no. 6798, August 2000.
[12] J. Kleinberg, “The small-world phenomenon: An algorithmic perspective,” in Proceedings of the 32nd ACM Symposium on Theory of Computing, 2000.
[13] J. M. Kleinberg, “Small-world phenomena and the dynamics of information,” in Advances in Neural Information Processing Systems (NIPS) 14. MIT Press, 2001, p. 2001.
[14] D. J. Watts, P. S. Dodds, and M. E. J. Newman, “Identity and search in social networks,” Science, vol. 296, pp. 1302–1305, 2002.
[15] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme, “Bibsonomy: A social bookmark and publication sharing system,” in Proceedings of the Conceptual Structures Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures, 2006, pp. 87–102.
[16] T. Eda, T. Uchiyama, T. Uchiyama, and M. Yoshikawa, “Signaling emotion in tagclouds,” in WWW ’09: Proceedings of the 18th international conference on World wide web. New York, NY, USA: ACM, 2009, pp. 1199–1200.
[17] O. Kaser and D. Lemire, “Tag-Cloud Drawing: Algorithms for Cloud Visualization,” Proceedings of Tagging and Metadata for Social Information Organization (WWW 2007), 2007. [Online]. Available: http://arxiv.org/abs/cs/0703109v2
[18] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: densification laws, shrinking diameters and possible explanations,” in KDD ’05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. New York, NY, USA: ACM, 2005, pp. 177–187.
[19] L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman, “Search in power-law networks,” Physical Review E, vol. 64, no. 4, pp. 046 135 1–8, Sep 2001.
[20] Y. Zhang, B. Jansen, and A. Spink, “Time series analysis of a Web search engine transaction log,” Information Processing & Management, vol. 45, no. 2, pp. 230–245, 2009.
[21] P. Erdos and A. Renyi, “On the evolution of random graphs,” Publ. Math. Inst. Hung. Acad. Sci, vol. 5, pp. 17–61, 1960.
[22] B. Bollobás and W. F. de la Vega, “The diameter of random regular graphs,” Combinatorica, vol. 2, no. 2, pp. 125–134, 1982.
[23] B. Bollobás and F. R. K. Chung, “The diameter of a cycle plus a random matching,” SIAM J. Discret. Math., vol. 1, no. 3, pp. 328–333, 1988.
[24] I. Dhillon, J. Fan, and Y. Guan, “Efficient clustering of very large document collections,” in Data Mining for Scientific and Engineering Applications, R. Grossman, C. Kamath, and R. Naburu, Eds. Kluwer Academic Publishers, 2001.
[25] B. J. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, January 2007. [Online]. Available: http://dx.doi.org/10.1126/science.1136800
[26] A. Plangprasopchok, K. Lerman, and L. Getoor, “Growing a tree in the forest: Constructing folksonomies by integrating structured metadata,” in Proc. of the International Conference on Knowledge Discovery and Data Mining (KDD), July 2010.
[27] P. Heymann and H. Garcia-Molina, “Collaborative creation of communal hierarchical taxonomies in social tagging systems,” Stanford InfoLab, Technical Report 2006-10, April 2006. [Online]. Available: http://ilpubs.stanford.edu:8090/775/
[28] D. Benz, A. Hotho, and G. Stumme, “Semantics made by you and me: Self-emerging ontologies can capture the diversity of shared knowledge,” in Proc. of the 2nd Web Science Conference (WebSci10), Raleigh, NC, USA, 2010.
[29] J. Kleinfeld, “Could it be a big world after all? The six degrees of separation myth,” Society, April 2002.
[30] M. Kochen, Ed., The Small World. Norwood, NJ: Ablex, 1989.
[31] D. J. Watts and S. H. Strogatz, “Collective dynamics of small-world networks,” Nature, vol. 393, no. 6684, pp. 440–442, June 1998.
[32] A. Dieberger, “Supporting social navigation on the world wide web,” Int. J. Hum.-Comput. Stud., vol. 46, no. 6, pp. 805–825, 1997.
[33] D. Millen and J. Feinberg, “Using social tagging to improve social navigation,” in Workshop on the Social Navigation and Community Based Adaptation Technologies. Dublin, Ireland. Citeseer, 2006. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.5563&rep=rep1&type=pdf
[34] W.-T. Fu and P. Pirolli, “Snif-act: a cognitive model of user navigation on the world wide web,” Hum.-Comput. Interact., vol. 22, no. 4, pp. 355–412, 2007.
[35] P. Pirolli, “An elementary social information foraging model,” in Proceedings of the 27th international conference on Human factors in computing systems. ACM, 2009, pp. 605–614.
[36] M. Strohmaier, C. Koerner, and R. Kern, “Why do users tag? Detecting users’ motivation for tagging in social tagging systems,” in International AAAI Conference on Weblogs and Social Media (ICWSM2010), Washington, DC, USA, May 23-26, 2010.
[37] C. Koerner, R. Kern, H. P. Grahsl, and M. Strohmaier, “Of categorizers and describers: An evaluation of quantitative measures for tagging motivation,” in 21st ACM SIGWEB Conference on Hypertext and Hypermedia (HT 2010). Toronto, Canada: ACM, June 2010.
[38] E. H. Chi and T. Mytkowicz, “Understanding the efficiency of social tagging systems using information theory,” in HT ’08: Proceedings of the nineteenth ACM conference on Hypertext and hypermedia. New York, NY, USA: ACM, 2008, pp. 81–88.
[39] R. Li, S. Bao, Y. Yu, B. Fei, and Z. Su, “Towards effective browsing of large scale social annotations,” Proceedings of the 16th international conference on World Wide Web, p. 952, 2007. [Online]. Available: http://portal.acm.org/citation.cfm?id=1242700
[40] K. Aouiche, D. Lemire, and R. Godin, “Web 2.0 OLAP: From Data Cubes to Tag Clouds,” 4th International Conference, WEBIST 2008, vol. 18, 2008. [Online]. Available: http://www.springerlink.com/index/10.1007/978-3-642-01344-7
[41] P. Heymann, A. Paepcke, and H. Garcia-Molina, “Tagging human knowledge,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining. New York, NY, USA: ACM, 2010, pp. 51–61.
... There are numerous websites that used tagging even before they were aware of its existence, but it was not until 2001 that the tag clouds began to be used as 19 Helic et al. 37 Other terminologies Usages Rivadeneira 24 ...
... Therefore, the navigation is sensitive to the adoption of the language of tagging-based systems. 37 According to Khusro et al. 4 a tag cloud only is effective when the following conditions are met: ...
Article
Full-text available
Tag clouds are tools that have been widely used on the Internet since their conception. The main applications of these textual visualizations are information retrieval, content representation and browsing of the original text from which the tags are generated. Despite the extensive use of tag clouds, their enormous popularity and the amount of research related to different aspects of them, few studies have summarized their most important features when they work as tools for information retrieval and content representation. In this paper we present a summary of the main characteristics of tag clouds found in the literature, such as their different functions, designs and negative aspects. We also present a summary of the most popular metrics used to capture the structural properties of a tag cloud generated from the query results, as well as other measures for evaluating the goodness of the tag cloud when it works as a tool for content representation. The different methods for tagging and the semantic association processes in tag clouds are also considered. Finally we give a list of alternative for visual interfaces, which makes this study a useful first help for researchers who want to study the content representation and information retrieval interfaces in greater depth.
... termine, maggiore sarà il numero di volte che questo è stato utilizzato (Helic et al., 2011). Uno degli elementi di maggiore originalità di questa dimensione contemporanea dell'omosessualità è la definizione di sé come "queer". ...
... Founding examples include Flickr, Del-icio-us, and Technorati. While the benefit for navigation has been contested on network-theoretic and user-interface grounds (Helic et al., 2011), the word cloud continues to represent a useful device to distill the most relevant single words (tags) of a body of text. In social software applications, the frequency, significance, or categorization of words are aggregated over text and used to define the font size of words in the cloud. ...
Article
Full-text available
Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including Zipf’s law, a special case of the scale-free distribution, Heaps’ law describing sublinear growth typical of economies of scale, and the Menzerath–Altmann law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a “triangle of persistence” describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A “causal” word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction. While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.
... This was also confirmed by the findings in [11]; the authors found that particularly useful tags are those with high popularity (i.e., occurring frequently in the dataset) but with low clustering, which means that it is important to consider not only the number of co-occurring tags, but also their diversity. Even though tag clouds can be an efficient means of navigation in document corpora, this ability is seriously limited if there are too many resources and pagination is used, as shown in [14]. Users rarely investigate more than the first few pages of results; if the results, i.e., the tagged resources, are sorted from newest to oldest, many of the older resources become practically unreachable. ...
Article
Full-text available
Exploratory search (in contrast to traditional lookup search) is characterized by search tasks that have exploration, learning, and investigation as their goals. An example of such a task in the domain of digital libraries is the exploration of a new domain, a task that is typically performed by a novice researcher, such as a master’s or a doctoral student. To support novice researchers in this task, we proposed an approach to exploratory search and navigation using navigation leads, with which we augment the search results, and which serve as navigation starting points allowing users to follow a specific path by filtering only documents pertinent to the selected lead. In this paper, we present a method for selecting navigation leads that considers their navigational value in the form of corpus relevance. We examined this method by means of an offline evaluation on a dataset from the bookmarking service Annota. We showed that considering corpus relevance helps to cover significantly more (relevant) documents when conducting exploratory search. In addition, our relevance metric combining the document and corpus relevance of a lead outperformed the popularity metric based on the frequency of the term in the document corpus.
... Published works on tagclouds have been about algorithms, navigation, information processing, and information retrieval interfaces (Helic et al. 2011; Knautz et al. 2010; Trattner et al. 2014; Walhout et al. 2015). Recently, tagclouds have been used and applied in decision-making and discovery. ...
Article
Full-text available
Student-generated tagclouds provided an intuitive overview of a group of learners’ collective knowledge. Although such tagclouds may have the potential to be used as effective learning tools, it has not been clear how students use this tool for knowledge construction. In this paper, we report a two-stage study that investigated college students’ experiences of using tagclouds for developing their domain knowledge, culminating in individual concept maps and research papers. Based on the results of the qualitative analyses of students’ reflections from the first stage, an intervention was introduced: group discussions on tagclouds generated from different groups. The result of Study Stage II showed that group discussions highlighted the utility of the tagclouds. Treatment group participants were more likely to use tagclouds as metacognitive strategies for planning, searching, retrieving, and organizing their learning. The two-stage study also underscored the importance of collecting students’ reflections earlier in the learning process when introducing a new technology tool to promote learning.
... Word clouds graphically display the frequency of words used by participants of qualitative methods and have become "an innovative approach to quickly summarize and present information from thematic analyses" [47]. Although they do not provide context for the words in the cloud, the visualization reduces the burden of information overload [48] [49] [50]. Word clouds, then, are useful exploratory, analytical tools that are increasingly being used to yield quick visual information about a subject [51] [52] [53] [54] [55]. ...
Article
This contribution investigates the difficulties caused by the pandemic, and the opinions on the first decisions regarding the transition from Phase 1 to Phase 2, of a group of Italian women engaged in relationships that are considered significant but lie beyond legal and blood ties. Specifically, the exploratory study combines a content analysis of the comments published on Twitter during the night the measure was announced with the statements of 12 women collected during 3 online focus groups.
Article
Full-text available
To protect against the Covid-19 pandemic, which has affected Turkey along with the rest of the world, citizens are being called on to stay at home through the slogans “Hayat Eve Sığar” (“Life Fits Into Home”) and “Evde Kal” (“Stay at Home”) launched by the Turkish Ministry of Health. During this period, curfews are also imposed from time to time. These measures have increased people’s search for alternatives to keep themselves occupied while they stay at home. This situation brings Netflix to the fore as an important actor. However, beyond being a tool for passing time, Netflix is the subject of much debate over whether it has started a new era in television history because of the changes it has brought about, such as the broadcasting technology it uses and the television viewing experience it offers. This study discusses conceptually whether Netflix has opened a new era in television history. Taking into account the elements of the conceptual framework examined, it investigates how Netflix users evaluate Netflix in their Twitter posts and whether this evaluation contributes to the conceptual framework under discussion. A dataset was obtained for the selected hashtags through a Twitter API connection using the Python programming language. A coding scheme was created to analyze the tweets in the resulting dataset. The data analysis showed that no posts were made about the technology, viewing experience, and similar aspects with which Netflix is claimed in the literature to have started a new era. The posts shared with the hashtags selected for this study were found to be mostly announcements about programs to be broadcast live. Keywords: TVIV, Netflix, Twitter, Stay at Home, Pandemic
Chapter
Allowing users to organize content by tagging resources in web-based systems has led to the emergence of the so-called Social Web. Tags have turned out to be helpful not only for giving recommendations and improving search in social tagging systems but also for enhancing information access through navigation. In this chapter, we cover much of the pioneering research that has studied tag-based navigation and visualization. After giving a short overview of the social tagging process and its specifics, we provide an extensive description of the typical user interfaces and visualization techniques characteristic of social tagging systems. As the efficiency of tag-based navigation depends on structuring tagging data, we also provide a review of state-of-the-art algorithms for tag clustering. Before we conclude, we demonstrate how tag-based navigation can be modeled and discuss the intrinsic navigability of social tagging systems from various theoretic perspectives.
Book
Full-text available
Modeling Activation Processes in Human Memory for Tag Recommendations: Using Models from Human Memory Theory to Implement Recommender Systems for Social Tagging and Microblogging Environments
Article
Full-text available
We describe the development of a computational cognitive model that explains navigation behavior on the World Wide Web. The model, called SNIF-ACT (Scent-based Navigation and Information Foraging in the ACT cognitive architecture), is motivated by Information Foraging Theory (IFT), which quantifies the perceived relevance of a Web link to a user's goal by a spreading activation mechanism. The model assumes that users evaluate links on a Web page sequentially and decide to click on a link or to go back to the previous page by a Bayesian satisficing model (BSM) that adaptively evaluates and selects actions based on a combination of previous and current assessments of the relevance of link texts to information goals. SNIF-ACT 1.0 utilizes the measure of utility, called information scent, derived from IFT to predict rankings of links on different Web pages. The model was tested against a detailed set of protocol data collected from 8 participants as they engaged in two information-seeking tasks using the World Wide Web. The model provided a good match to participants' link selections. In SNIF-ACT 2.0, we included the adaptive link selection mechanism from the BSM that sequentially evaluates links on a Web page. The mechanism allowed the model to dynamically build up the aspiration levels of actions in a satisficing process (e.g., to follow a link or leave a Web site) as it sequentially assessed link texts on a Web page. The dynamic mechanism provides an integrated account of how and when users decide to click on a link or leave a page based on the sequential, ongoing experiences with the link context on current and previous Web pages. SNIF-ACT 2.0 was validated on a data set obtained from 74 subjects. Monte Carlo simulations of the model showed that SNIF-ACT 2.0 provided better fits to human data than SNIF-ACT 1.0 and a Position model that used position of links on a Web page to decide which link to select. We conclude that the combination of the IFT and the BSM provides a good description of user-Web interaction. Practical implications of the model are discussed.
Article
According to a fundamental result of Erdős and Rényi, the structure of a random graph $G_M$ changes suddenly when $M \sim n/2$: if $M = \lfloor cn \rfloor$ and $c > \frac{1}{2}$, then a.e. $G_M$ has a giant component: a component of order $(1 - \alpha_c + o(1))n$ where $\alpha_c$ ...
Article
The amount of information available on the World Wide Web keeps growing at an exponential pace. Social tagging is a feature of various online social networks that organizes information elements by letting people label them with free-form text, called tags. The graph created by this process is often called a folksonomy and comprises the associations between people, tags, and documents. Tagging is now used to organize web pages, pictures, videos, music, books, academic publications, etc. The current ways of navigating folksonomies are limited. In most web portals, "search" is the main feature that uses tags. When browsing tags, most systems give a few tags related to the clicked tag, but none enables the user to get tags related to multiple clicked tags at the same time. A popular tag cloud displays links to the most popular tags in the folksonomy with a font size that depends on their popularity. Popular tag clouds and related tags can enable tag-based navigation. Enabling navigation through tag clouds related to multiple clicked tags in an efficient and scalable manner is a hard problem. We propose a Bayesian approach to the problem of generating related tag clouds for navigation by using social network information and probabilistic models of people's tagging behaviors. We propose two new models to generate tag clouds based on popularity, tag co-occurrence, and social relationships. The models are implemented in a prototype application to navigate empirical data from "last.fm", an online social network for music. We give an evaluation plan to compare the models regarding searchability through user evaluations.
Article
The participatory nature of many Web 2.0 platforms makes a large portion of users' interactions with each other and with information resources digitally observable. The assumption that the evolving structure of these digital records contains implicit evidence for the underlying semantics has been proven by successful approaches of making the emergent semantics explicit, e.g. in the form of light-weight ontologies. In this paper, we provide further evidence for the great potential of self-emerging ontologies from Web 2.0 data, exemplified by collaborative tagging systems. We hereby combine and extend prior research, where we identified crucial aspects for successful methods to infer tag semantics. The additional contribution of this paper is to propose an extended methodology to induce a hierarchical organization scheme from the initially flat tag space which captures the semantics and the diversity of the shared knowledge. It comprises the introduction of a synsetized folksonomy (which tackles the problem of synonymous tags) and a clustering approach for tag sense disambiguation. In order to assess the quality of the learned semantics, we compare the inferred organization scheme with manually built categorization schemes from WordNet and Wikipedia. Our results exhibit clear similarities; so in summary, our work demonstrates a successful example of self-emergent ontologies from Web 2.0 data.
Article
Collaborative tagging systems—systems where many casual users annotate objects with free-form strings (tags) of their choosing—have recently emerged as a powerful way to label and organize large collections of data. During our recent investigation into these types of systems, we discovered a simple but remarkably effective algorithm for converting a large corpus of tags annotating objects in a tagging system into a navigable hierarchical taxonomy of tags. We first discuss the algorithm and then present a preliminary model to explain why it is so effective in these types of systems.
Article
In this paper, we use time series analysis to evaluate predictive scenarios using search engine transactional logs. Our goal is to develop models for the analysis of searchers’ behaviors over time and investigate if time series analysis is a valid method for predicting relationships between searcher actions. Time series analysis is a method often used to understand the underlying characteristics of temporal data in order to make forecasts. In this study, we used a Web search engine transactional log and time series analysis to investigate users’ actions. We conducted our analysis in two phases. In the initial phase, we employed a basic analysis and found that 10% of searchers clicked on sponsored links. However, from 22:00 to 24:00, searchers almost exclusively clicked on the organic links, with almost no clicks on sponsored links. In the second and more extensive phase, we used a one-step prediction time series analysis method along with a transfer function method. The period rarely affects navigational and transactional queries, while rates for transactional queries vary during different periods. Our results show that the average length of a searcher session is approximately 2.9 interactions and that this average is consistent across time periods. Most importantly, our findings show that searchers who submit the shortest queries (i.e., in number of terms) click on the highest-ranked results. We discuss implications, including predictive value, and future research.
Conference Paper
A fundamental premise of tagging systems is that regular users can organize large collections for browsing and other tasks using uncontrolled vocabularies. Until now, that premise has remained relatively unexamined. Using library data, we test the tagging approach to organizing a collection. We find that tagging systems have three major large scale organizational features: consistency, quality, and completeness. In addition to testing these features, we present results suggesting that users produce tags similar to the topics designed by experts, that paid tagging can effectively supplement tags in a tagging system, and that information integration may be possible across tagging systems.