Conference Paper

The Role of Structural Information for Designing Navigational User Interfaces

Affiliation:
  • GESIS - Leibniz Institute for the Social Sciences

Abstract

Today, a variety of user interfaces exists for navigating information spaces, including, for example, tag clouds, breadcrumbs, and subcategories. However, such navigational user interfaces are only useful to the extent that they expose the underlying topology---or network structure---of the information space. Yet, little is known about which topological clues should be integrated into navigational user interfaces. Specifically, the aim of this paper is to identify what kind of, and how much, topological information needs to be included in user interfaces to facilitate efficient navigation. We model navigation as a variation of a decentralized search process with partial information and study its sensitivity to the quality and amount of the structural information used for navigation. We experiment with two strategies for node selection (the quality of the structural information provided to the user) and with different amounts of information (the amount of structural information provided to the user). Our experiments on four datasets from different domains show that efficient navigation depends on the kind of structural information utilized. Additionally, node properties differ in their quality for augmenting navigation, and intelligent pre-selection of which nodes to present in the interface can improve navigational efficiency. This suggests that only a limited amount of high-quality structural information needs to be exposed through the navigational user interface.
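To make the model concrete, here is a minimal Python sketch of such a partially informed decentralized search (networkx assumed; the Kleinberg-grid testbed, the degree-based ranking, and all function names are illustrative assumptions, not the paper's exact setup):

```python
import networkx as nx

def partially_informed_search(G, source, target, sim, rank, visible, max_steps=1000):
    # Greedy decentralized search with partial information: at every node,
    # the interface pre-selects the `visible` neighbors scoring highest
    # under `rank` (the structural information shown to the user); the
    # user then follows the shown neighbor most similar to the target.
    current, hops = source, 0
    while current != target and hops < max_steps:
        neighbors = list(G.neighbors(current))  # out-neighbors in a DiGraph
        if not neighbors:
            return None
        shown = sorted(neighbors, key=rank, reverse=True)[:visible]
        current = max(shown, key=lambda n: sim(n, target))
        hops += 1
    return hops if current == target else None  # None: navigation failed

# Illustrative run on a Kleinberg navigable small-world grid, where
# similarity to the target can be grounded in lattice distance:
G = nx.navigable_small_world_graph(20, seed=42)   # nodes are (x, y) tuples
dist = lambda u, v: abs(u[0] - v[0]) + abs(u[1] - v[1])
hops = partially_informed_search(
    G, source=(0, 0), target=(19, 19),
    sim=lambda n, t: -dist(n, t),   # closer to target = more similar
    rank=G.out_degree,              # expose high-out-degree nodes first
    visible=3)
print(hops)
```

Varying `visible` corresponds to the amount of structural information exposed by the interface, while swapping `rank` (e.g., degree-based vs. clustering-based scores) corresponds to the node-selection strategy.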


... They found novelty- and diversity-based strategies to be the best (in this order). This was also confirmed by the findings in [11]; the authors found that particularly useful tags are those with high popularity (i.e., occurring frequently in the dataset) but with low clustering, which means that it is important to consider not only the number of co-occurring tags but also their diversity. Even though tag clouds can be an efficient way of navigating document corpora, this ability is severely limited if there are too many resources and pagination is used, as shown in [14]. ...
... We compared it with using popularity as a measure of fitness, which we computed as the out-degree of a node, i.e., the number of edges going out of a node. This measure resembles tag frequency, which is often used in the tag navigation approach; it was also used in [11]. Thus, we had two independent variables in the experiment: ...
... It is worth noting that there are some differences between our experimental procedure and the search task used in [11,15,53]. For one, their task was searching for a path from a given node A to a given node B. In our case, we have multiple starting points (seeds) and there is no single goal to reach, since the task is exploratory in nature. ...
Article
Full-text available
Exploratory search (in contrast to traditional lookup search) is characterized by search tasks that have exploration, learning, and investigation as their goals. An example of such a task in the domain of digital libraries is the exploration of a new domain, a task typically performed by a novice researcher, such as a master's or doctoral student. To support novice researchers in this task, we proposed an approach to exploratory search and navigation using navigation leads, with which we augment the search results and which serve as navigation starting points, allowing users to follow a specific path by filtering only documents pertinent to the selected lead. In this paper, we present a method for selecting navigation leads that considers their navigational value in the form of corpus relevance. We examined this method by means of an offline evaluation on a dataset from the bookmarking service Annota. We showed that considering corpus relevance helps to cover significantly more (relevant) documents when conducting exploratory search. In addition, our relevance metric, which combines the document and corpus relevance of a lead, outperformed the popularity metric based on the frequency of the term in the document corpus.
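As a purely illustrative sketch of combining a lead's document relevance with its corpus relevance (the term-frequency and document-frequency proxies and the mixing weight `alpha` are assumptions, not the paper's actual metric):

```python
from collections import Counter

def lead_scores(doc_tokens, corpus, alpha=0.5):
    # Score candidate navigation leads (terms of the current document) by
    # mixing relevance to the document (term frequency) with relevance to
    # the corpus (how many other documents the lead would let users reach).
    tf = Counter(doc_tokens)
    df = Counter(term for doc in corpus for term in set(doc))
    n_docs = len(corpus)
    return {term: alpha * count / len(doc_tokens)
                  + (1 - alpha) * df[term] / n_docs
            for term, count in tf.items()}
```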
... In this dataset, each article is a node and wikilinks between articles are edges. When navigating a Web network, users typically browse and search links, and we therefore compute corresponding network metrics [10]. These include empirical complementary cumulative distribution functions (CCDF) for in-degree, out-degree, PageRank, reciprocity, clustering coefficient, and k-core of the 4,800 articles in our examined subset. ...
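Such per-article metrics and their CCDFs are straightforward to compute with standard tooling; a minimal sketch (networkx and numpy assumed, with a synthetic graph standing in for the wikilink subset and a naive treatment of ties):

```python
import numpy as np
import networkx as nx

def ccdf(values):
    # Empirical complementary CDF: fraction of observations >= x
    # (ties handled naively, which suffices for log-log plots).
    x = np.sort(np.asarray(values, dtype=float))
    return x, 1.0 - np.arange(len(x)) / len(x)

# Synthetic stand-in for the 4,800-article wikilink network.
G = nx.DiGraph(nx.scale_free_graph(4800, seed=0))
G.remove_edges_from(nx.selfloop_edges(G))  # core_number forbids self-loops
U = G.to_undirected()                      # for clustering and k-core
metrics = {
    "in-degree": [d for _, d in G.in_degree()],
    "out-degree": [d for _, d in G.out_degree()],
    "PageRank": list(nx.pagerank(G).values()),
    "clustering": list(nx.clustering(U).values()),
    "k-core": list(nx.core_number(U).values()),
    # per-node reciprocity can be added analogously via nx.reciprocity
}
curves = {name: ccdf(vals) for name, vals in metrics.items()}
```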
Chapter
Full-text available
When editing articles on Wikipedia, arguments between editors frequently occur. These conflicts occasionally lead to destructive behavior and diminish article quality. Currently, the relation between editing behavior, link structure, and article quality is not well understood in our community, even though a better understanding of it could help improve editing processes and article quality on Wikipedia. To shed light on this complex relation, we classify edits for 13,045 articles and perform an in-depth analysis of a 4,800-article subsample. Additionally, we build a network of wikilinks (internal Wikipedia hyperlinks) between articles. Using this data, we compute parsimonious metrics to quantify editing and linking behavior. Our analysis unveils that controversial articles differ considerably from others on almost all metrics, while slight trends are also detectable for higher-quality articles. With our work, we assist online collaboration communities, especially Wikipedia, in the long-term improvement of article quality by identifying deviant behavior via simple sequence-based edit and network-based article metrics.
... Originally, content on the Web was accessed by traversing hyperlinks [14]. This navigational user behavior on the Web and on Wikipedia is often modeled using well-established methods such as Markov chains [3,23,25,29,30] and decentralized search models [5,12]. Numerous navigational hypotheses about Wikipedia have also been presented based on, e.g., click traces stemming from navigational games and on click data from server logs. ...
Preprint
As one of the richest sources of encyclopedic information on the Web, Wikipedia generates an enormous amount of traffic. In this paper, we study large-scale article access data of the English Wikipedia in order to compare articles with respect to the two main paradigms of information seeking, i.e., search by formulating a query, and navigation by following hyperlinks. To this end, we propose and employ two main metrics, namely (i) searchshare -- the relative amount of views an article received by search --, and (ii) resistance -- the ability of an article to relay traffic to other Wikipedia articles -- to characterize articles. We demonstrate how articles in distinct topical categories differ substantially in terms of these properties. For example, architecture-related articles are often accessed through search and are simultaneously a "dead end" for traffic, whereas historical articles about military events are mainly navigated. We further link traffic differences to varying network, content, and editing activity features. Lastly, we measure the impact of the article properties by modeling access behavior on articles with a gradient boosting approach. The results of this paper constitute a step towards understanding human information seeking behavior on the Web.
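In spirit, both metrics reduce to simple ratios over an article's traffic; the following toy functions are one plausible reading (the paper's exact definitions may differ, and all names here are invented):

```python
def searchshare(search_views, total_views):
    # Fraction of an article's page views that arrived via search.
    return search_views / total_views if total_views else 0.0

def resistance(incoming_views, outgoing_clicks):
    # One plausible reading of "resistance": the share of incoming
    # traffic an article does NOT relay onward through its links;
    # a navigational "dead end" would score close to 1.
    return 1.0 - outgoing_clicks / incoming_views if incoming_views else 1.0

print(searchshare(800, 1000), resistance(1000, 150))  # 0.8, 0.85
```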
... My research focuses on modeling navigation in an information space represented as an information network. Regarding the first question, I introduce and apply partially informed decentralized search to model the extent to which a user is exposed to the network structure of the information space and can make informed decisions about her next step towards exploring the content [1]. I test different hypotheses regarding the type and the amount of network structural information used to model navigation. ...
Conference Paper
Navigation in an information space is a natural way to explore and discover its content. Information systems on the Web, like digital encyclopedias (e.g., Wikipedia), are interested in providing good navigational support to their users. To that end, navigation models can be useful for estimating the general navigability of an information space and for understanding how users interact with it. Such models can also be applied to identify problems faced by users during navigation and to improve user interfaces.

Studying navigation on the Web is a challenging task with a long tradition in our scientific community. Based on large studies, researchers have made significant steps towards understanding navigational user behavior on the Web, identifying general usage patterns, regularities, and strategies users apply during navigation. The seminal information foraging theory suggests that people follow links by constantly estimating their quality in terms of information value and the cost associated with obtaining that value by interacting with the environment. Furthermore, models describing the network structure of the Web, like the bow tie model and the small-world models, have been introduced. These models contributed valuable insights towards characterizing the underlying network topology on which users operate and the extent to which it allows efficient navigation. In the context of information networks, researchers have successfully modeled user navigation resorting to Markov chains and to decentralized search. With respect to users' navigational behavior and their click activities to traverse a link, researchers have found a valuable source of information in the log files of Web servers. Click data has also been collected by letting humans play navigational games on Wikipedia. With this data, researchers tested different navigational hypotheses, for example, (i) whether humans tend to navigate between semantically similar articles, and (ii) whether they experience a trade-off between following links leading towards semantically similar articles and following links leading towards possibly well-connected articles. For navigation with a particular target in mind, users are found to be greedy with respect to the next click if they are confident to be on the right path, whereas they tend to explore the information network at random if they feel insecure or lost and have no intuition about the next click.

Although these research lines have advanced our understanding of navigational user behavior in information networks, related work does not address the following questions, which are central to the goal of the proposed thesis (modeling navigation): (i) What is the relationship between the user's awareness of the structure and topology of the information network and the efficiency of navigation, modeled as decentralized search? (ii) How do users interact with the content to explore and discover it, i.e., are there specific links that are especially appealing, and what are their characteristics?

My research focuses on modeling navigation in an information space represented as an information network. Regarding the first question, I introduce and apply partially informed decentralized search to model the extent to which a user is exposed to the network structure of the information space and can make informed decisions about her next step towards exploring the content [1]. I test different hypotheses regarding the type and the amount of network structural information used to model navigation. My results show that only a small amount of knowledge about the network structure is sufficient for efficient navigation.

For the second question, I study large-scale click data from the English version of Wikipedia. I observe a focus of the users' attention on specific links. With this part of the proposal, I want to shed light on a different aspect of navigation and concentrate on the question of why some links are more successful than others. In particular, I study the relationship between link properties and link popularity as measured by transitional click data. To that end, I formulate navigational hypotheses based on different link features, i.e., network features, semantic features, and visual features [2, 3]. The plausibility of these hypotheses is then tested using a Markov chain-based Bayesian hypothesis testing framework. Results suggest that Wikipedia users tend to select links located at the top of the page. Furthermore, users tend to select links leading towards the periphery of the Wikipedia network. To conclude, I believe that the insights gained may have an impact on system design decisions, e.g., existing guidelines for Wikipedia contributors can be adapted to better reflect the usage of the system.
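The Markov chain-based Bayesian hypothesis testing mentioned here is presumably in the style of HypTrails, where each navigational hypothesis is encoded as a Dirichlet prior over transitions and hypotheses are compared by marginal likelihood; a compact sketch (numpy/scipy assumed):

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(transition_counts, prior_pseudo_counts):
    # Dirichlet-multinomial marginal likelihood of a first-order Markov
    # chain: both arguments are [n_states, n_states] matrices, the prior
    # encoding a navigational hypothesis as pseudo-counts.
    n = np.asarray(transition_counts, dtype=float)
    a = np.asarray(prior_pseudo_counts, dtype=float)
    return float(np.sum(gammaln(a.sum(axis=1)) - gammaln((a + n).sum(axis=1))
                        + np.sum(gammaln(a + n) - gammaln(a), axis=1)))

# A hypothesis whose prior better matches the observed clicks earns a
# higher evidence; the difference of two evidences is a log Bayes factor.
counts = np.array([[0, 9, 1], [2, 0, 8], [5, 5, 0]])
h_uniform = np.ones((3, 3))
h_biased = 1 + 4 * (counts > 4)  # toy "popular links" hypothesis
print(log_evidence(counts, h_biased) - log_evidence(counts, h_uniform))
```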
... We see that most, but not all, hypotheses achieve an improvement (marked bold); the best correlation, with an improvement of ∼0.1 across the three damping factors, is achieved by the kcore_visual hypothesis. Various models of navigation have been proposed, such as the well-known Markov chain model also utilized in this work [30,34,36,37], or models such as decentralized search [12,16] motivated by small-world navigation [24]. Insights have been utilized to infer missing links [43], to predict the break-up of the navigation process [35], for recommendations [41], or to improve the link structure of a website [31]. ...
Article
While a plethora of hypertext links exist on the Web, only a small fraction of them are regularly clicked. Starting from this observation, we set out to study large-scale click data from Wikipedia in order to understand what makes a link successful. We systematically analyze the effects of link properties on the popularity of links. By utilizing mixed-effects hurdle models supplemented with descriptive insights, we find evidence of user preference for links leading to the periphery of the network, links leading to semantically similar articles, and links in the top and left side of the screen. We integrate these findings as Bayesian priors into a navigational Markov chain model and, by doing so, successfully improve the model fits. We further adapt and improve the well-known classic PageRank algorithm, which assumes random navigation, by accounting for observed navigational preferences of users in a weighted variation. This work facilitates understanding navigational click behavior and can thus contribute to improving link structures and algorithms utilizing these structures.
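The weighted variation can be sketched with off-the-shelf tooling, since networkx's pagerank already distributes the surfer's probability proportionally to edge weights; the click counts below are invented toy data:

```python
import networkx as nx

# Toy illustration: feed empirical click preferences into PageRank as
# edge weights; `weight=None` recovers the classic uniform random surfer.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("A", "B", 30.0), ("A", "C", 5.0),   # e.g., observed transition counts
    ("B", "C", 12.0), ("C", "A", 7.0),
])
classic = nx.pagerank(G, alpha=0.85, weight=None)
weighted = nx.pagerank(G, alpha=0.85, weight="weight")
print(classic, weighted)
```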
Article
Full-text available
Currently, the relation between edit behavior, link structure, and article quality is not well understood in our community, even though a better understanding of it could help improve editing processes and content quality on Wikipedia. To shed light on this complex relation, we classify article edits and perform an in-depth analysis of editing sequences for 4,941 articles. Additionally, we build a network of internal Wikipedia hyperlinks between articles. Using this data, we compute parsimonious metrics to quantify editing and linking behavior. Our analysis unveils that conflicted articles differ substantially from others in almost all metrics, while we also detect slight trends for high-quality articles. With our network analysis, we find evidence indicating that controversial and edit-war articles frequently span structural holes in the Wikipedia network. Finally, in a prediction experiment, we demonstrate the usefulness of edit behavior patterns and network properties in predicting conflict and article quality. With our work, we assist online collaboration communities, especially Wikipedia, in the long-term improvement of content quality by offering valuable insights about the interplay of article quality, controversies and edit wars, editing behavior, and network properties via sequence-based edit and network-based article metrics.
Conference Paper
We present the Stolperwege app, a web-based framework for ubiquitous modeling of historical processes. Starting from the art project Stolpersteine of Gunter Demnig, it allows for virtually connecting these stumbling blocks with information about the biographies of victims of Nazism. According to the practice of public history, the aim of Stolperwege is to deepen public knowledge of the Holocaust in the context of our everyday environment. Stolperwege uses an information model that allows for modeling social networks of agents starting from information about portions of their life. The paper exemplifies how Stolperwege is informationally enriched by means of historical maps and 3D animations of (historical) buildings.
Conference Paper
In this paper, we explore how user-generated content, such as links (web page URLs) shared on social media, can be used to recommend relevant and popular web pages to website visitors. Based on our preliminary findings [8, 9], we developed a set of guidelines and a prototype called the Social Media Panel (SMP). The SMP displays popular web pages as page thumbnails based on aggregate information trending on social media sites. We evaluate the SMP via a focus group and a controlled user study, comparing it against conventional website navigation tools (i.e., menus, search, links, and browser tools) for effectiveness, efficiency, and user engagement. We found the SMP to be effective, efficient, and engaging for browsing tasks. ANOVA tests showed that participants needed fewer clicks to complete the task using the SMP; however, the SMP did not significantly expedite overall task completion time.
Article
Full-text available
We analyse the complex network architectures based on the k-core notion, where the k-core is the maximal subgraph in a network, whose vertices all have internal degree at least k. We explain the nature of the unusual “hybrid” phase transition of the emergence of a giant k-core in a network, which combines a jump in the order parameter and a critical singularity, and relate this transition to critical phenomena in other systems. Finally, we indicate generic features and differences between the k-core problem and bootstrap percolation.
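For concreteness, the k-core from this definition can be extracted directly with networkx (synthetic graph; parameters arbitrary):

```python
import networkx as nx

G = nx.erdos_renyi_graph(200, 0.04, seed=1)
core_of = nx.core_number(G)  # largest k such that each node is in the k-core
G3 = nx.k_core(G, k=3)       # maximal subgraph with internal degree >= 3
print(max(core_of.values()), G3.number_of_nodes())
```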
Article
Full-text available
Power laws are theoretically interesting probability distributions that are also frequently used to describe empirical data. In recent years, effective statistical methods for fitting power laws have been developed, but appropriate use of these techniques requires significant programming and statistical insight. In order to greatly decrease the barriers to using good statistical methods for fitting power law distributions, we developed the powerlaw Python package. This software package provides easy commands for basic fitting and statistical analysis of distributions. Notably, it also seeks to support a variety of user needs by being exhaustive in the options available to the user. The source code is publicly available and easily extensible.
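Basic usage of the package looks roughly like this (the Zipf sample is a synthetic stand-in for empirical data):

```python
import numpy as np
import powerlaw

# Fit a power law to a synthetic heavy-tailed sample; `Fit` estimates
# xmin automatically unless it is given explicitly.
data = np.random.zipf(2.5, 10000)
fit = powerlaw.Fit(data, discrete=True)
print(fit.power_law.alpha, fit.power_law.xmin)
# Likelihood-ratio comparison against a lognormal alternative:
R, p = fit.distribution_compare('power_law', 'lognormal')
```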
Conference Paper
Full-text available
Although many social tagging systems share a common tri-partite graph structure, the collaborative processes that are generating these structures can differ significantly. For example, while resources on Delicious are usually tagged by all users who bookmark the web page cnn.com, photos on Flickr are usually tagged just by a single user who uploads the photo. In the literature, this distinction has been described as a distinction between broad vs. narrow folksonomies. This paper sets out to explore navigational differences between broad and narrow folksonomies in social hypertextual systems. We study both kinds of folksonomies on a dataset provided by Mendeley - a collaborative platform where users can annotate and organize scientific articles with tags. Our experiments suggest that broad folksonomies are more useful for navigation, and that the collaborative processes that are generating folksonomies matter qualitatively. Our findings are relevant for system designers and engineers aiming to improve the navigability of social tagging systems.
Conference Paper
Full-text available
Decentralized search in networks is an activity that is often performed in online tasks. It refers to situations where a user has no global knowledge of a network's topology, but only local knowledge. On Wikipedia for instance, humans typically have local knowledge of the links emanating from a given Wikipedia article, but no global knowledge of the entire Wikipedia graph. This makes the task of navigation to a target Wikipedia article from a given starting article an interesting problem for both humans and algorithms. As we know from previous studies, people can have very efficient decentralized search procedures that find shortest paths in many cases, using intuitions about a given network. These intuitions can be modeled as hierarchical background knowledge that people access to approximate a network's topology. In this paper, we explore the differences and similarities between decentralized search that utilizes hierarchical background knowledge and actual human navigation in information networks. For that purpose we perform a large scale study on the Wikipedia information network with over 500,000 users and 1,500,000 click trails. As our results reveal, a decentralized search procedure based on hierarchies created directly from the link structure of the information network simulates human navigational behavior better than simulations based on hierarchies that are created from external knowledge.
Article
Full-text available
Information foraging theory is an approach to understanding how strategies and technologies for information seeking, gathering, and consumption are adapted to the flux of information in the environment. The theory assumes that people, when possible, will modify their strategies or the structure of the environment to maximize their rate of gaining valuable information. The theory is developed by (a) adaptation (rational) analysis of information foraging problems and (b) a detailed process model (adaptive control of thought in information foraging [ACT-IF]). The adaptation analysis develops (a) information patch models, which deal with time allocation and information filtering and enrichment activities in environments in which information is encountered in clusters; (b) information scent models, which address the identification of information value from proximal cues; and (c) information diet models, which address decisions about the selection and pursuit of information items. ACT-IF is instantiated as a production system model of people interacting with complex information technology. (PsycINFO Database Record (c) 2009 APA, all rights reserved)
Conference Paper
Full-text available
It is crucial to study basic principles that support adaptive and scalable retrieval functions in large networked environments such as the Web, where information is distributed among dynamic systems. We conducted experiments on decentralized IR operations on various scales of information networks and analyzed the effectiveness, efficiency, and scalability of various search methods. Results showed that network structure, i.e., how distributed systems connect to one another, is crucial for retrieval performance. Relying on partial indexes of distributed systems, some level of network clustering enabled very efficient and effective discovery of relevant information in large-scale networks. For a given network clustering level, search time was well explained by a poly-logarithmic relation to network size (i.e., the number of distributed systems), indicating high scalability potential for searching in a growing information space. In addition, network clustering only involved local self-organization and required no global control - clustering time remained roughly constant across the various scales of networks.
Conference Paper
Full-text available
It is a widely held belief among designers of social tagging systems that tag clouds represent a useful tool for navigation. This is evident in, for example, the increasing number of tagging systems offering tag clouds for navigational purposes, which hints towards an implicit assumption that tag clouds support efficient navigation. In this paper, we examine and test this assumption from a network-theoretic perspective, and show that in many cases it does not hold. We first model navigation in tagging systems as a bipartite graph of tags and resources and then simulate the navigation process in such a graph. We use network-theoretic properties to analyse the navigability of three tagging datasets with regard to different user interface restrictions imposed by tag clouds. Our results confirm that tag resource networks have efficient navigation properties in theory, but they also show that popular user interface decisions (such as “pagination” combined with reverse-chronological listing of resources) significantly impair the potential of tag clouds as a useful tool for navigation. Based on our findings, we identify a number of avenues for further research and the design of novel tag cloud construction algorithms. Our work is relevant for researchers interested in navigability of emergent hypertext structures, and for engineers seeking to improve the navigability of social tagging systems.
Article
Full-text available
Social networks have the surprising property of being “searchable”: Ordinary people are capable of directing messages through their network of acquaintances to reach a specific but distant target person in only a few steps. We present a model that offers an explanation of social network searchability in terms of recognizable personal identities: sets of characteristics measured along a number of social dimensions. Our model defines a class of searchable networks and a method for searching them that may be applicable to many network search problems, including the location of data files in peer-to-peer networks, pages on the World Wide Web, and information in distributed databases.
Article
Full-text available
Many social Web sites allow users to annotate the content with descriptive metadata, such as tags, and more recently to organize content hierarchically. These types of structured metadata provide valuable evidence for learning how a community organizes knowledge. For instance, we can aggregate many personal hierarchies into a common taxonomy, also known as a folksonomy, that will aid users in visualizing and browsing social content, and help them organize their own content. However, learning from social metadata presents several challenges, since it is sparse, shallow, ambiguous, noisy, and inconsistent. We describe an approach to folksonomy learning based on relational clustering, which exploits structured metadata contained in personal hierarchies. Our approach clusters similar hierarchies using their structure and tag statistics, then incrementally weaves them into a deeper, bushier tree. We study folksonomy learning using social metadata extracted from the photo-sharing site Flickr, and demonstrate that the proposed approach addresses the challenges. Moreover, compared to previous work, the approach produces larger, more accurate folksonomies and, in addition, scales better. (To appear in the Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2010.)
Article
Full-text available
Many communication and social networks have power-law link distributions, containing a few nodes that have a very high degree and many with low degree. The high connectivity nodes play the important role of hubs in communication and networking, a fact that can be exploited when designing efficient search algorithms. We introduce a number of local search strategies that utilize high degree nodes in power-law graphs and that have costs scaling sublinearly with the size of the graph. We also demonstrate the utility of these strategies on the GNUTELLA peer-to-peer network.
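A minimal rendering of such a degree-seeking local search strategy (a sketch only; the paper evaluates several variants of this idea):

```python
import networkx as nx

def high_degree_search(G, source, target, max_steps=10000):
    # Local strategy: at each step, move to the yet-unvisited neighbor
    # with the highest degree, checking for the target along the way;
    # hubs see many nodes, so they are found early and exploited.
    current, visited, steps = source, {source}, 0
    while steps < max_steps:
        nbrs = list(G.neighbors(current))
        if current == target or target in nbrs:
            return steps
        fresh = [n for n in nbrs if n not in visited] or nbrs
        current = max(fresh, key=G.degree)  # greedily exploit hubs
        visited.add(current)
        steps += 1
    return None

G = nx.barabasi_albert_graph(2000, 2, seed=7)  # power-law degree distribution
print(high_degree_search(G, 0, 1999))
```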
Article
Full-text available
The clustering coefficient quantifies how well connected are the neighbors of a vertex in a graph. In real networks it decreases with the vertex degree, which has been taken as a signature of the network hierarchical structure. Here we show that this signature of hierarchical structure is a consequence of degree-correlation biases in the clustering coefficient definition. We introduce a definition in which the degree-correlation biases are filtered out, and provide evidence that in real networks the clustering coefficient is constant or decays logarithmically with vertex degree.
Article
Full-text available
We analytically describe the architecture of randomly damaged uncorrelated networks as a set of successively enclosed substructures--k-cores. The k-core is the largest subgraph where vertices have at least k interconnections. We find the structure of k-cores, their sizes, and their birthpoints--the bootstrap percolation thresholds. We show that in networks with a finite mean number ζ₂ of the second-nearest neighbors, the emergence of a k-core is a hybrid phase transition. In contrast, if ζ₂ diverges, the networks contain an infinite sequence of k-cores which are ultrarobust against random damage.
Conference Paper
Full-text available
The spherical k-means algorithm, i.e., the k-means algorithm with cosine similarity, is a popular method for clustering high-dimensional text data. In this algorithm, each document as well as each cluster mean is represented as a high-dimensional unit-length vector. However, it has mainly been used in batch mode. That is, each cluster mean vector is updated only after all document vectors have been assigned, as the (normalized) average of all the document vectors assigned to that cluster. This paper investigates an online version of the spherical k-means algorithm based on the well-known winner-take-all competitive learning. In this online algorithm, each cluster centroid is incrementally updated given a document. We demonstrate that the online spherical k-means algorithm can achieve significantly better clustering results than the batch version, especially when an annealing-type learning rate schedule is used. We also present heuristics to improve the speed, with almost no loss of clustering quality.
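A compact sketch of the winner-take-all online update described here (numpy assumed; the annealing schedule and initialization are assumptions, not the paper's exact choices):

```python
import numpy as np

def online_spherical_kmeans(X, k, epochs=5, eta0=0.5, seed=0):
    # Unit-length documents and centroids; one incremental winner-take-all
    # update per document, with an annealing-type learning rate.
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    t = 0
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            j = np.argmax(C @ x)             # winner by cosine similarity
            eta = eta0 / (1 + 0.01 * t)      # assumed annealing schedule
            C[j] += eta * x                  # move winner toward the doc...
            C[j] /= np.linalg.norm(C[j])     # ...and renormalize to unit length
            t += 1
    return C

X = np.abs(np.random.default_rng(1).normal(size=(200, 50)))  # toy "documents"
centroids = online_spherical_kmeans(X, k=5)
```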
Article
Full-text available
An ecological-cognitive framework of analysis and a model-tracing architecture are presented and used in the analysis of data recorded from users browsing a large document collection. The users interacted with the Scatter/Gather browser, which clusters documents into groups of similar content and presents users with summaries of cluster content. Predictions made by a computational model of navigation and information foraging are matched against the observed activity.
Article
The scientific study of networks, including computer networks, social networks, and biological networks, has received an enormous amount of interest in the last few years. The rise of the Internet and the wide availability of inexpensive computers have made it possible to gather and analyze network data on a large scale, and the development of a variety of new theoretical tools has allowed us to extract new knowledge from many different kinds of networks. The study of networks is broadly interdisciplinary and important developments have occurred in many fields, including mathematics, physics, computer and information sciences, biology, and the social sciences. This book brings together the most important breakthroughs in each of these fields and presents them in a coherent fashion, highlighting the strong interconnections between work in different areas. Subjects covered include the measurement and structure of networks in many branches of science, methods for analyzing network data, including methods developed in physics, statistics, and sociology, the fundamentals of graph theory, computer algorithms, and spectral methods, mathematical models of networks, including random graph models and generative models, and theories of dynamical processes taking place on networks.
Conference Paper
Models of human navigation play an important role for understanding and facilitating user behavior in hypertext systems. In this paper, we conduct a series of principled experiments with decentralized search - an established model of human navigation in social networks - and study its applicability to information networks. We apply several variations of decentralized search to model human navigation in information networks and we evaluate the outcome in a series of experiments. In these experiments, we study the validity of decentralized search by comparing it with human navigational paths from an actual information network - Wikipedia. We find that (i) navigation in social networks appears to differ from human navigation in information networks in interesting ways and (ii) in order to apply decentralized search to information networks, stochastic adaptations are required. Our work illuminates a way towards using decentralized search as a valid model for human navigation in information networks in future work. Our results are relevant for scientists who are interested in modeling human behavior in information networks and for engineers who are interested in using models and simulations of human behavior to improve on structural or user interface aspects of hypertextual systems.
Article
We studied decentralized search in information networks and focused on the impact of network clustering on the findability of relevant information sources. We developed a multi-agent system to simulate peer-to-peer networks, in which peers worked with one another to forward queries to targets containing relevant information, and evaluated the effectiveness, efficiency, and scalability of the decentralized search. Experiments on a network of 181 peers showed that the RefNet method based on topical similarity cues outperformed random walks and was able to reach relevant peers through short search paths. When the network was extended to a larger community of 5890 peers, however, the advantage of the RefNet model was constrained due to noise of many topically irrelevant connections or weak ties. By applying topical clustering and a clustering exponent α to guide network rewiring, we studied the role of strong ties vs. weak ties, particularly their influence on distributed search. Interestingly, an inflection point was discovered for α, below which performance suffered from many remote connections that disoriented searches and above which performance degraded due to lack of weak ties that could move queries quickly from one segment to another. The inflection threshold for the 5890-peer network was α ≈ 3.5. Further experiments on larger networks of up to 4 million peers demonstrated that clustering optimization is crucial for decentralized search. Although overclustering only moderately degraded search performance on small networks, it led to dramatic loss in search efficiency for large networks. We explain the implication on scalability of distributed systems that rely on clustering for search.
Article
Tag clouds are text-based visual representations of a set of tags usually depicting tag importance by font size. Recent trends in social and collaborative software have greatly increased the popularity of this type of visualization. This paper proposes a family of novel algorithms for tag cloud layout and presents evaluation results obtained from an extensive user study and a technical evaluation. The algorithms address issues found in many common approaches, for example large whitespaces, overlapping tags and restriction to specific boundaries. The layouts computed by these algorithms are compact and clear, have small whitespaces and may feature arbitrary convex polygons as boundaries. The results of the user study and the technical evaluation enable designers to devise a combination of algorithm and parameters which produces satisfying tag cloud layouts for many application scenarios.
Article
Collaborative tagging systems—systems where many casual users annotate objects with free-form strings (tags) of their choosing—have recently emerged as a powerful way to label and organize large collections of data. During our recent investigation into these types of systems, we discovered a simple but remarkably effective algorithm for converting a large corpus of tags annotating objects in a tagging system into a navigable hierarchical taxonomy of tags. We first discuss the algorithm and then present a preliminary model to explain why it is so effective in these types of systems.
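The algorithm described (known from Heymann and Garcia-Molina's work) can be sketched as follows: order tags by centrality and attach each under its most similar already-placed tag, falling back to the root; the similarity measure, centrality, and threshold are caller-supplied assumptions here:

```python
import networkx as nx

def tag_taxonomy(tags, similarity, centrality, threshold=0.1):
    # Process tags in descending centrality; attach each tag under the
    # most similar tag already placed, or under the root if nothing is
    # similar enough. General, frequent tags end up near the root.
    T = nx.DiGraph()
    T.add_node("ROOT")
    placed = []
    for tag in sorted(tags, key=centrality, reverse=True):
        best = max(placed, key=lambda t: similarity(tag, t), default=None)
        parent = best if best is not None and similarity(tag, best) >= threshold else "ROOT"
        T.add_edge(parent, tag)
        placed.append(tag)
    return T
```

With, e.g., tag co-occurrence cosine as `similarity` and tag frequency as `centrality`, this yields a navigable tree from flat tagging data.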
Article
We address the question of how participants in a small world experiment are able to find short paths in a social network using only local information about their immediate contacts. We simulate such experiments on a network of actual email contacts within an organization as well as on a student social networking website. On the email network we find that small world search strategies using a contact’s position in physical space or in an organizational hierarchy relative to the target can effectively be used to locate most individuals. However, we find that in the online student network, where the data is incomplete and hierarchical structures are not well defined, local search strategies are less effective. We compare our findings to recent theoretical hypotheses about underlying social structure that would enable these simple search strategies to succeed and discuss the implications to social software design.
Conference Paper
Today, a number of algorithms exist for constructing tag hierarchies from social tagging data. While these algorithms were designed with ontological goals in mind, we know very little about their properties from an information retrieval perspective, such as whether these tag hierarchies support efficient navigation in social tagging systems. The aim of this paper is to investigate the usefulness of such tag hierarchies (sometimes also called folksonomies - from folk-generated taxonomy) as directories that aid navigation in social tagging systems. To this end, we simulate navigation of directories as decentralized search on a network of tags using Kleinberg's model. In this model, a tag hierarchy can be applied as background knowledge for decentralized search. By constraining the visibility of nodes in the directories we aim to mimic typical constraints imposed by a practical user interface (UI), such as limiting the number of displayed subcategories or related categories. Our experiments on five different social tagging datasets show that existing tag hierarchy algorithms can support navigation in theory, but our results also demonstrate that they face tremendous challenges when user interface (UI) restrictions are taken into account. Based on this observation, we introduce a new algorithm that constructs efficiently navigable directories on our datasets. The results are relevant for engineers and scientists aiming to improve navigability of social tagging systems.
Article
This book focuses on the human users of search engines and the tool they use to interact with them: the search user interface. The truly worldwide reach of the Web has brought with it a new realization among computer scientists and laypeople of the enormous importance of usability and user interface design. In the last ten years, much has become understood about what works in search interfaces from a usability perspective, and what does not. Researchers and practitioners have developed a wide range of innovative interface ideas, but only the most broadly acceptable make their way into major web search engines. This book summarizes these developments, presenting the state of the art of search interface design, both in academic research and in deployment in commercial systems. Many books describe the algorithms behind search engines and information retrieval systems, but the unique focus of this book is specifically on the user interface. It will be welcomed by industry professionals who design systems that use search interfaces as well as graduate students and academic researchers who investigate information systems.
Article
Targeted or quasi-targeted propagation of information is a fundamental process running in complex networked systems. Optimal communication in a network is easy to achieve if all its nodes have a full view of the global topological structure of the network. However many complex networks manifest communication efficiency without nodes having a full view of the network, and yet there is no generally applicable explanation of what mechanisms may render efficient such communication in the dark. In this work we model this communication as an oblivious routing process greedily operating on top of a network and making its decisions based only on distances within a hidden metric space lying underneath. Abstracting intrinsic similarities among networked elements, this hidden metric space is not easily reconstructible from the visible network topology. Yet we find that the structure of complex networks observed in reality, characterized by strong clustering and specific values of exponents of power-law degree distributions, maximizes their navigability, i.e., the efficiency of the greedy path-finding strategy in this hidden framework. We explain this observation by showing that more navigable networks have more prominent hierarchical structures which are congruent with the optimal layout of routing paths through a network. This finding has potentially profound implications for constructing efficient routing and searching strategies in communication and social networks, such as the Internet, Web, etc., and merits further research that would explain whether navigability of complex networks does indeed follow naturally from specifics of their evolution.
Article
Although degree distributions give some insight into how heterogeneous a network is, they fail to give a unique quantitative characterization of network heterogeneity. This is particularly the case when several different distributions fit the same network, when the number of data points is very scarce due to network size, or when we have to compare two networks with completely different degree distributions. Here we propose a unique characterization of network heterogeneity based on the difference of functions of node degrees for all pairs of linked nodes. We show that this heterogeneity index can be expressed as a quadratic form of the Laplacian matrix of the network, which allows a spectral representation of network heterogeneity. We give bounds for this index, which is equal to zero for any regular network and equal to one only for star graphs. Using it, we study random networks, showing that those generated by the Erdös-Rényi algorithm have zero heterogeneity and those generated by the preferential attachment method of Barabási and Albert display only 11% of the heterogeneity of a star graph. We finally study 52 real-world networks and find that they display a large variety of heterogeneities. We also show that a classification system based on degree distributions does not reflect the heterogeneity properties of real-world networks.
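Reconstructing the index from this description (it matches Estrada's heterogeneity index; stated here as an assumption), with k_i the degree of node i, E the edge set, L the graph Laplacian, and n the number of nodes:

```latex
\rho(G) = \sum_{(i,j) \in E} \left( k_i^{-1/2} - k_j^{-1/2} \right)^{2}
        = \mathbf{x}^{\top} \mathbf{L}\, \mathbf{x}
\quad \text{with } x_i = k_i^{-1/2},
\qquad
\tilde{\rho}(G) = \frac{\rho(G)}{n - 2\sqrt{n-1}}.
```

The normalization makes the index 0 for any regular network (all degrees equal, so every edge term vanishes) and exactly 1 for the star graph, consistent with the bounds described above.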
Article
The study of complex networks has emerged over the past several years as a theme spanning many disciplines, ranging from mathematics and computer science to the social and biological sciences. A significant amount of recent work in this area has focused on the development of random graph models that capture some of the qualitative properties observed in large-scale network data; such models have the potential to help us reason, at a general level, about the ways in which real-world networks are organized. We survey one particular line of network research, concerned with small-world phenomena and decentralized search algorithms, that illustrates this style of analysis. We begin by describing a well-known experiment that provided the first empirical basis for the "six degrees of separation" phenomenon in social networks; we then discuss some probabilistic network models motivated by this work, illustrating how these models lead to novel algorithmic and graph-theoretic questions, and how they are supported by recent empirical studies of large social networks.
Article
Networks of coupled dynamical systems have been used to model biological oscillators, Josephson junction arrays, excitable media, neural networks, spatial games, genetic control networks and many other self-organizing systems. Ordinarily, the connection topology is assumed to be either completely regular or completely random. But many biological, technological and social networks lie somewhere between these two extremes. Here we explore simple models of networks that can be tuned through this middle ground: regular networks 'rewired' to introduce increasing amounts of disorder. We find that these systems can be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs. We call them 'small-world' networks, by analogy with the small-world phenomenon (popularly known as six degrees of separation). The neural network of the worm Caenorhabditis elegans, the power grid of the western United States, and the collaboration graph of film actors are shown to be small-world networks. Models of dynamical systems with small-world coupling display enhanced signal-propagation speed, computational power, and synchronizability. In particular, infectious diseases spread more easily in small-world networks than in regular lattices.
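This middle ground is easy to reproduce; a small sketch (networkx assumed, parameters arbitrary) sweeps the rewiring probability and reports clustering against characteristic path length:

```python
import networkx as nx

# Sweep the rewiring probability p from a regular ring lattice (p=0)
# toward a random graph (p=1); small-world networks sit in between,
# combining high clustering with short average path lengths.
for p in (0.0, 0.01, 0.1, 1.0):
    G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=p, seed=3)
    print(p, nx.average_clustering(G), nx.average_shortest_path_length(G))
```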
The problem of searching for information in networks like the World Wide Web can be approached in a variety of ways, ranging from centralized indexing schemes to decentralized mechanisms that navigate the underlying network without knowledge of its global structure. The decentralized approach appears in a variety of settings: in the behavior of users browsing the Web by following hyperlinks; in the design of focused crawlers [4, 5, 8] and other agents that explore the Web's links to gather information; and in the search protocols underlying decentralized peer-to-peer systems such as Gnutella [10], Freenet [7], and recent research prototypes [21, 22, 23], through which users can share resources without a central server. In recent work, we have been investigating the problem of decentralized search in large information networks [14, 15]. Our initial motivation was an experiment that dealt directly with the search problem in a decidedly pre-Internet context: Stanley Milgram
Article
Long a matter of folklore, the "small-world phenomenon" --- the principle that we are all linked by short chains of acquaintances --- was inaugurated as an area of experimental study in the social sciences through the pioneering work of Stanley Milgram in the 1960's. This work was among the first to make the phenomenon quantitative, allowing people to speak of the "six degrees of separation" between any two people in the United States. Since then, a number of network models have been proposed as frameworks in which to study the problem analytically. One of the most refined of these models was formulated in recent work of Watts and Strogatz; their framework provided compelling evidence that the small-world phenomenon is pervasive in a range of networks arising in nature and technology, and a fundamental ingredient in the evolution of the World Wide Web. But existing models are insufficient to explain the striking algorithmic component of Milgram's original findings: that individuals using local information are collectively very effective at actually constructing short paths between two points in a social network. Although recently proposed network models are rich in short paths, we prove that no decentralized algorithm, operating with local information only, can construct short paths in these networks with non-negligible probability. We then define an infinite family of network models that naturally generalizes the Watts-Strogatz model, and show that for one of these models, there is a decentralized algorithm capable of finding short paths with high probability. More generally, we provide a strong characterization of this family of network models, showing that there is in fact a unique model within the family for which decentralized algorithms are effective...
Article
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented - a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
Article
Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the detection and characterization of power laws is complicated by the large fluctuations that occur in the tail of the distribution -- the part of the distribution representing large but rare events -- and by the difficulty of identifying the range over which power-law behavior holds. Commonly used methods for analyzing power-law data, such as least-squares fitting, can produce substantially inaccurate estimates of parameters for power-law distributions, and even in cases where such methods return accurate answers they are still unsatisfactory because they give no indication of whether the data obey a power law at all. Here we present a principled statistical framework for discerning and quantifying power-law behavior in empirical data. Our approach combines maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov-Smirnov statistic and likelihood ratios. We evaluate the effectiveness of the approach with tests on synthetic data and give critical comparisons to previous approaches. We also apply the proposed methods to twenty-four real-world data sets from a range of different disciplines, each of which has been conjectured to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data while in others the power law is ruled out. (Code available at http://www.santafe.edu/~aaronc/powerlaws/)