Preprint

Query for Architecture, Click through Military: Comparing the Roles of Search and Navigation on Wikipedia

Authors: Dimitrov, Lemmerich, Flöck, and Strohmaier

Abstract

As one of the richest sources of encyclopedic information on the Web, Wikipedia generates an enormous amount of traffic. In this paper, we study large-scale article access data of the English Wikipedia in order to compare articles with respect to the two main paradigms of information seeking, i.e., search, by formulating a query, and navigation, by following hyperlinks. To this end, we propose and employ two main metrics to characterize articles: (i) searchshare, the relative amount of views an article received by search, and (ii) resistance, the ability of an article to relay traffic to other Wikipedia articles. We demonstrate how articles in distinct topical categories differ substantially in terms of these properties. For example, architecture-related articles are often accessed through search and are simultaneously a "dead end" for traffic, whereas historical articles about military events are mainly navigated. We further link traffic differences to varying network, content, and editing activity features. Lastly, we measure the impact of the article properties by modeling access behavior on articles with a gradient boosting approach. The results of this paper constitute a step towards understanding human information seeking behavior on the Web.
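To make the two metrics concrete, the following Python sketch shows how they could be computed from clickstream-style counts per article. The field names and the exact operationalization of resistance (here, the share of incoming traffic that is not relayed onward) are illustrative assumptions, not the paper's published code.

```python
from dataclasses import dataclass

@dataclass
class ArticleTraffic:
    views_from_search: int   # pageviews whose referer is an external search engine
    views_total: int         # all pageviews of the article
    clicks_to_articles: int  # clicks leaving the article to other Wikipedia articles

def searchshare(t: ArticleTraffic) -> float:
    """Relative amount of views an article received by search."""
    return t.views_from_search / t.views_total

def resistance(t: ArticleTraffic) -> float:
    """Share of incoming traffic NOT relayed to other articles.
    One plausible operationalization; the paper gives the exact definition."""
    return 1.0 - t.clicks_to_articles / t.views_total

# A hypothetical architecture article: mostly reached by search, rarely left via links.
arch = ArticleTraffic(views_from_search=8000, views_total=10000, clicks_to_articles=1500)
print(searchshare(arch), resistance(arch))  # high on both: a searched "dead end"
```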


... In the ISRE-Framework instantiation, we adopted the XML-based representation of image objects provided by the I-Search Project. The XML-based representation encapsulates the URIs, tags, textual descriptions, image thumbnails, etc., associated with the real image objects accessible from the web sources. The Web Layer holds the XMLs of image objects, which can be accessed from the online Flickr image source. ...
... In multimodal retrieval, precision is commonly employed to evaluate the effectiveness of image results [21,40], while image browsing activities are usually assessed via click-through rates (CTRs) [12,40]. Taking inspiration from the existing literature, in this research we adopted the precision and CTR measures discussed in [40] to verify the correctness of clustering and the reachability of results presented/visualized in clusters, respectively. ...
... The QUIS elicits evaluation dimensions based on interface, software, terminology, system information, learning, and system capabilities [11]. It comprises 27 questions: questions 1-6 measure the overall user reaction to the software, questions 7-10 measure screen effectiveness, questions 11-16 measure the efficiency of the terminology and system information provided by the tool, questions 17-22 measure the system capabilities, and questions 23-27 measure the effectiveness of the terminology and system information provided by the tool. Each question has an associated Likert scale from 1-9, where 1 represents strongly disagree and 9 represents strongly agree. ...
Article
Full-text available
The extensive information delivery power and immense volume of image objects make them frequently used multimedia content on the web. However, accessing desired image objects to satisfy visual information needs through primitive exploration paradigms is difficult. Traditionally, the linear presentation of web image results often leads to reachability and navigation issues. Alternatively, nonlinear approaches provide navigation within web image results, but in-depth browsing to access particular web image results remains challenging. In this research, we propose an exploration framework to browse and explore web image results. We address the associated exploration issues, i.e., reachability and navigation in browsing and visualization. The framework enables nonlinear and multimodal exploration of web image results by representing them in a graph-cluster data model and providing an interactive exploration mechanism. The graph-cluster data model mainly employs and modifies Zahn’s method and particular algorithms to transform the web image results into specific nonlinear and multimodal search-result spaces. The exploration mechanism enables reachability, navigation, browsing, and visualization of web image results in an integrated way. We instantiated the proposed framework over a real dataset of image objects and employed empirical, usability, and comparison tests to evaluate it.
... Longuet-Higgins et al. [28] describe a method that helps reduce the number of keystrokes required to complete a word; in their case, the identifiers and commands entered by developers. Recently, query frequency [29], term co-occurrence, query chains, query click-through [30], and hitting time [31] have become the most commonly used types of information in the query log. Term co-occurrences often appear in search and can be predicted using the probability of co-occurrence in different feature spaces, according to the clustering or dispersion trend between terms. ...
Article
Full-text available
Distracted driving due to smartphone use is one of the key reasons for road accidents. However, 6G super-heterogeneous network systems and highly differentiated application scenarios require highly elastic and endogenous information services, involving the use of smart apps and related information retrieval by drivers in modern Vehicle-to-People (V2P) networks. The tension arising from the conflicting attention demands of driving and information retrieval can be resolved by designing information retrieval solutions that require minimal user interaction. In this paper, we construct a Personalized Search Query Generator (PSQG) to reduce driver-mobile interaction during information retrieval in the 6G era. The system has a query generator and a query recommendation component that dynamically update two sets of relationships: one between queries and titles, the other between search and recommendation. The proposed system learns a user's intent from historical query records and recommends personalized queries, thus reducing driver-mobile interaction time. We deployed the system in a real search engine and conducted several online experiments on a custom-constructed dataset comprising ten million samples, using the BLEU-score metric and A/B testing. The results demonstrate that our system can help users make precise queries efficiently. The proposed system can improve drivers' safety if used in smartphones and other in-vehicle information retrieval systems.
... Information-search behavior on Wikipedia is strongly influenced by search functionalities and navigation paths (Dimitrov, Lemmerich, Flöck, & Strohmaier, 2018;Medelyan, Milne, Legg, & Witten, 2009), and its article and hyperlink structures, in particular, explain users' navigation paths on Wikipedia quite well (Lamprecht, Lerman, Helic, & Strohmaier, 2017). Because of its structure, Wikipedia allows for both general and specific searches: articles are hyperlinked to other articles, enabling users to learn about general topics, browse from one Wikipedia site to another and explore new topics there (Medelyan et al., 2009). ...
Article
CRISPR/Cas-based genome editing is a monumental leap in genetic engineering with considerable societal implications – but it is a complex procedure that is difficult to understand for non-scientists. Wikipedia has been shown to be an important source of information about scientific topics. But research on search, selection, and reception processes on Wikipedia is scarce. By means of eye tracking and survey data, this study investigates how users find information about genetic engineering and CRISPR on Wikipedia and what influences search behaviors and outcomes. An observational study was conducted in which 67 participants searched for general information about genetic engineering and specific information about CRISPR. Results indicate that participants looking for specific information about CRISPR searched shorter, visited fewer Wikipedia pages, and followed shorter and more straightforward search paths than participants looking for general information about genetic engineering. Moreover, prior knowledge and involvement affected users’ browsing behavior. Prior knowledge and search behavior influenced search outcomes.
... One aspect of web navigation is finding the shortest path towards a specific page, be it an item on Amazon or a specific article on Wikipedia [2]. While search engines can often retrieve the desired information, they may not always do so [3]. Hence, the user needs to locate the desired information in the depths of the web by navigating along hyperlinked pages [4,5,6]. ...
Conference Paper
Understanding and modeling user navigation behaviour on the web is of interest for different applications. For example, e-commerce portals can be adjusted to strengthen customer engagement, or information sites can be optimized to improve the availability of relevant content to the user. In web navigation, the user's goal, and whether she reached it, are typically unknown. This makes navigation games particularly interesting to researchers, since they capture human navigation towards a known goal and allow building labelled datasets suitable for supervised machine learning models. In this work, we show that a recurrent neural network model can predict game success from a partial click trail without knowledge of the user's navigation goal. We evaluate our approach on data from WikiSpeedia and WikiGame, two well-known navigation games, and achieve an AUC of 86% and 90%, respectively. Furthermore, we show that our model outperforms a baseline that leverages the navigation goal on the WikiSpeedia dataset. A detailed analysis of both datasets with regard to structural and content-related properties reveals significant differences in navigation behaviour, which confirms the applicability of our approach to different settings.
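The paper's code is not reproduced here, but the kind of model it describes, a recurrent network that scores a partial click trail without seeing the goal, can be sketched in PyTorch. All dimensions, the use of plain article-id embeddings, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TrailSuccessModel(nn.Module):
    """Scores a partial navigation-game trail with a success probability."""
    def __init__(self, n_articles: int, emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(n_articles, emb_dim)     # article id -> vector
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, trails: torch.Tensor) -> torch.Tensor:
        # trails: (batch, seq_len) article ids along the partial click trail
        states, _ = self.gru(self.emb(trails))
        return torch.sigmoid(self.head(states[:, -1, :]))  # P(game success)

model = TrailSuccessModel(n_articles=1000)
print(model(torch.randint(0, 1000, (4, 7))))  # 4 partial trails of length 7
```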
Article
Despite the importance and pervasiveness of Wikipedia as one of the largest platforms for open knowledge, surprisingly little is known about how people navigate its content when seeking information. To bridge this gap, we present the first systematic large-scale analysis of how readers browse Wikipedia. Using billions of page requests from Wikipedia’s server logs, we measure how readers reach articles, how they transition between articles, and how these patterns combine into more complex navigation paths. We find that navigation behavior is characterized by highly diverse structures. Although most navigation paths are shallow, comprising a single pageload, there is much variety, and the depth and shape of paths vary systematically with topic, device type, and time of day. We show that Wikipedia navigation paths commonly mesh with external pages as part of a larger online ecosystem, and we describe how naturally occurring navigation paths are distinct from targeted navigation in lab-based settings. Our results further suggest that navigation is abandoned when readers reach low-quality pages. Taken together, these insights contribute to a more systematic understanding of readers’ information needs and allow for improving their experience on Wikipedia and the Web in general.
Poster
Full-text available
CRISPR/Cas-based genome editing is a monumental leap in genetic engineering with considerable societal implications – but it is a complex procedure that is difficult to understand for non-scientists. Wikipedia has been shown to be an important source of information about scientific topics. But research on search, selection, and reception processes on Wikipedia is scarce. By means of eye tracking and survey data, this study investigates how users find information about genetic engineering and CRISPR on Wikipedia and what influences search behaviors and outcomes. An observational study was conducted in which 67 participants searched for general information about genetic engineering and specific information about CRISPR. Results indicate that participants looking for specific information about CRISPR searched shorter, visited fewer Wikipedia pages, and followed relatively shorter and more straightforward search paths than participants looking for general information about genetic engineering. Moreover, prior knowledge and involvement affected users’ browsing behavior. Prior knowledge and search behavior influenced search outcomes.
Chapter
Full-text available
The World Wide Web (WWW) has fundamentally changed the ways billions of people are able to access information. Thus, understanding how people seek information online is an important issue of study. Wikipedia is a hugely important part of information provision on the Web, with hundreds of millions of users browsing and contributing to its network of knowledge. The study of navigational behavior on Wikipedia, due to the site’s popularity and breadth of content, can reveal more general information seeking patterns that may be applied beyond Wikipedia and the Web. Our work addresses the relative shortcomings of existing literature in relating how information structure influences patterns of navigation online. We study aggregated clickstream data for articles on the English Wikipedia in the form of a weighted, directed navigational network. We introduce two parameters that describe how articles act to source and spread traffic through the network, based on their in/out strength and entropy. From these, we construct a navigational phase space where different article types occupy different, distinct regions, indicating how the structure of information online has differential effects on patterns of navigation. Finally, we go on to suggest applications for this analysis in identifying and correcting deficiencies in the Wikipedia page network that may also be adapted to more general information networks.
Article
Full-text available
The World Wide Web (WWW) has fundamentally changed the ways billions of people are able to access information. Thus, understanding how people seek information online is an important issue of study. Wikipedia is a hugely important part of information provision on the web, with hundreds of millions of users browsing and contributing to its network of knowledge. The study of navigational behaviour on Wikipedia, due to the site's popularity and breadth of content, can reveal more general information seeking patterns that may be applied beyond Wikipedia and the Web. Our work addresses the relative shortcomings of existing literature in relating how information structure influences patterns of navigation online. We study aggregated clickstream data for articles on the English Wikipedia in the form of a weighted, directed navigational network. We introduce two parameters that describe how articles act to source and spread traffic through the network, based on their in/out strength and entropy. From these, we construct a navigational phase space where different article types occupy different, distinct regions, indicating how the structure of information online has differential effects on patterns of navigation. Finally, we go on to suggest applications for this analysis in identifying and correcting deficiencies in the Wikipedia page network that may also be adapted to more general information networks.
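The two parameters can be illustrated with a short computation over a weighted, directed clickstream network: an article's out-strength is its total outgoing clicks, and the entropy of its outgoing click distribution measures how broadly it spreads traffic. The function and toy edges below are illustrative, not the authors' implementation.

```python
import math
from collections import defaultdict

def strength_and_entropy(edges):
    """edges: (source_article, target_article, clicks). Returns
    {article: (out_strength, entropy of outgoing click distribution)}."""
    out = defaultdict(dict)
    for src, dst, w in edges:
        out[src][dst] = out[src].get(dst, 0) + w
    result = {}
    for src, targets in out.items():
        s = sum(targets.values())
        h = -sum((w / s) * math.log2(w / s) for w in targets.values())
        result[src] = (s, h)
    return result

edges = [("Architecture", "Gothic architecture", 120),
         ("Battle of Hastings", "Norman conquest", 900),
         ("Battle of Hastings", "William the Conqueror", 850)]
print(strength_and_entropy(edges))  # plotting strength vs. entropy gives the phase space
```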
Article
Full-text available
We present a dataset that contains every instance of all tokens (~ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists with all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history. This data would be exceedingly hard to create for an average potential user, as (i) it is very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task. Adapting a state-of-the-art algorithm, we have produced a dataset that allows a range of analyses and metrics, already popular in research and going beyond, to be generated on complete-Wikipedia scale, ensuring quality and allowing researchers to forego expensive text-comparison computation, which so far has hindered scalable usage. We show how this data enables, on token level, computation of provenance, measuring survival of content over time, very detailed conflict metrics, and fine-grained interactions of editors like partial reverts, re-additions and other metrics, in the process gaining several novel insights.
Article
Full-text available
In this work we study how people navigate the information network of Wikipedia and investigate (i) free-form navigation by studying all clicks within the English Wikipedia over an entire month and (ii) goal-directed Wikipedia navigation by analyzing wikigames, where users are challenged to retrieve articles by following links. To study how the organization of Wikipedia articles in terms of layout and links affects navigation behavior, we first investigate the characteristics of the structural organization and of hyperlinks in Wikipedia and then evaluate link selection models based on article structure and other potential influences in navigation, such as the generality of an article's topic. In free-form Wikipedia navigation, covering all Wikipedia usage scenarios, we find that click choices can be best modeled by a bias towards article structure, such as a tendency to click links located in the lead section. For the goal-directed navigation of wikigames, our findings confirm the zoom-out and the homing-in phases identified by previous work, where users are guided by generality at first and textual similarity to the target later. However, our interpretation of the link selection models accentuates that article structure is the best explanation for the navigation paths in all except these initial and final stages. Overall, we find evidence that users more frequently click on links that are located close to the top of an article. The structure of Wikipedia articles, which places links to more general concepts near the top, supports navigation by allowing users to quickly find the better-connected articles that facilitate navigation. Our results highlight the importance of article structure and link position in Wikipedia navigation and suggest that better organization of information can help make information networks more navigable.
Chapter
Full-text available
Literacy has become deictic (Leu, 2000); the meaning of literacy is rapidly changing as new technologies for literacy continually appear and new social practices of literacy quickly emerge. Keywords: Internet; language in the classroom; literacy; online reading comprehension; reading
Article
Full-text available
Wikipedia is a collaboratively edited online encyclopaedia that relies on thousands of editors to both contribute articles and maintain their quality. Over the last years, research has extensively investigated this group of users, while another group of Wikipedia users, the readers, their preferences, and their behavior have not been much studied. This paper makes this group and its activities visible and valuable to Wikipedia's editor community. We carried out a study on two datasets covering a 13-month period to obtain insights on users' preferences and reading behavior in Wikipedia. We show that the most read articles do not necessarily correspond to those frequently edited, suggesting some degree of non-alignment between user reading preferences and author editing preferences. We also identified that popular and often-edited articles are read according to four main patterns, and that how an article is read may change over time. We illustrate how this information can provide valuable insights to Wikipedia's editor community.
Article
Full-text available
One of the most frequently used models for understanding human navigation on the Web is the Markov chain model, where Web pages are represented as states and hyperlinks as probabilities of navigating from one page to another. Predominantly, human navigation on the Web has been thought to satisfy the memoryless Markov property stating that the next page a user visits only depends on her current page and not on previously visited ones. This idea has found its way in numerous applications such as Google's PageRank algorithm and others. Recently, new studies suggested that human navigation may better be modeled using higher order Markov chain models, i.e., the next page depends on a longer history of past clicks. Yet, this finding is preliminary and does not account for the higher complexity of higher order Markov chain models which is why the memoryless model is still widely used. In this work we thoroughly present a diverse array of advanced inference methods for determining the appropriate Markov chain order. We highlight strengths and weaknesses of each method and apply them for investigating memory and structure of human navigation on the Web. Our experiments reveal that the complexity of higher order models grows faster than their utility, and thus we confirm that the memoryless model represents a quite practical model for human navigation on a page level. However, when we expand our analysis to a topical level, where we abstract away from specific page transitions to transitions between topics, we find that the memoryless assumption is violated and specific regularities can be observed. We report results from experiments with two types of navigational datasets (goal-oriented vs. free form) and observe interesting structural differences that make a strong argument for more contextual studies of human navigation in future work.
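The order-selection trade-off the paper studies can be demonstrated with a toy maximum-likelihood fit and an AIC comparison: higher orders always fit at least as well, but their parameter count, and hence the penalty, grows exponentially. This is a hedged sketch of the idea, not the paper's (more advanced) inference methods.

```python
import math
from collections import Counter

def log_likelihood(trails, order):
    """MLE log-likelihood of click trails under an order-k Markov chain."""
    ctx, full = Counter(), Counter()
    for t in trails:
        for i in range(order, len(t)):
            h = tuple(t[i - order:i])
            ctx[h] += 1
            full[h + (t[i],)] += 1
    return sum(n * math.log(n / ctx[hs[:-1]]) for hs, n in full.items())

def aic(trails, order, n_states):
    k = (n_states ** order) * (n_states - 1)   # free parameters grow exponentially
    return 2 * k - 2 * log_likelihood(trails, order)

trails = [["A", "B", "C", "B", "A"], ["A", "B", "A", "C"]] * 50
for order in (1, 2):
    print(order, round(aic(trails, order, n_states=3), 1))  # lower AIC is better
```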
Conference Paper
Full-text available
Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.
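The document-streaming idea is the framework's core and easy to demonstrate with the gensim API the paper describes: the corpus object yields one bag-of-words document at a time, so memory use is independent of corpus size. The file name and toy parameters below are illustrative placeholders.

```python
from gensim import corpora, models

class StreamingCorpus:
    """Re-iterable corpus that never loads all documents into memory."""
    def __init__(self, path, dictionary):
        self.path, self.dictionary = path, dictionary
    def __iter__(self):
        with open(self.path) as f:
            for line in f:                      # one document per line
                yield self.dictionary.doc2bow(line.lower().split())

# First streaming pass builds the vocabulary ("corpus.txt" is a placeholder).
dictionary = corpora.Dictionary(line.lower().split() for line in open("corpus.txt"))
lda = models.LdaModel(StreamingCorpus("corpus.txt", dictionary),
                      id2word=dictionary, num_topics=10)
```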
Article
Full-text available
THE PURPOSE of this qualitative study was to explore the nature of reading comprehension processes while reading on the Internet. Eleven sixth-grade students with the highest combination of standardized reading scores, reading report card grades, and Internet reading experiences were selected from a population of 150 sixth graders in three different middle schools in the central and northeastern United States. These 11 skilled readers met individually with a researcher and completed two separate tasks that involved reading within multilayered websites or using the Yahooligans! search engine. Students answered specific questions about their strategy use in a follow-up interview after each reading session. Qualitative analysis evolved through four distinct phases, each of which involved reviewing data from think-aloud protocols, field observations, and semistructured interviews to provide insights on the nature of online reading comprehension. Findings suggested that successful Internet reading experiences appeared to simultaneously require both similar and more complex applications of (1) prior knowledge sources, (2) inferential reasoning strategies, and (3) self-regulated reading processes. The authors suggest that reading Internet text prompts a process of self-directed text construction that may explain the additional complexities of online reading comprehension. Implications for literacy theory and future research are discussed.
Working Paper
Full-text available
We introduce a model for predicting page-view dynamics of promoted content. The regularity of the content promotion process on Wikipedia provides excellent experimental conditions which favour detailed modelling. We show that the popularity of an article featured on Wikipedia's main page decays exponentially in time if the circadian cycles of the users are taken into account. Our model can be explained as the result of individual Poisson processes and is validated through empirical measurements. It provides a simpler explanation for the evolution of content popularity than previous studies.
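The shape of the model can be sketched as an exponentially decaying view rate modulated by a daily activity cycle; the constants below are illustrative placeholders, not the fitted values from the paper.

```python
import math

def expected_views(t_hours, n0=50000.0, tau=7.0, peak_hour=15):
    """Expected hourly views t hours after a main-page promotion:
    exponential decay times a simple circadian modulation."""
    decay = math.exp(-t_hours / tau)
    circadian = 1.0 + 0.5 * math.cos(2 * math.pi * (t_hours - peak_hour) / 24.0)
    return n0 * decay * circadian

for t in range(0, 48, 6):
    print(t, round(expected_views(t)))
```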
Article
Full-text available
The Internet is the defining technology for literacy and learning in the 21st century. Approximately two billion individuals use the Internet (Internet World Stats, 2010). At the current rate of growth, more than one-half of the world's population will be online in five to seven years and most of the world will be online in 10 to 15 years. Never in the history of civilization have we seen a new technology adopted by so many, in so many different places, in such a short period of time. While there are many explanations for the rapid growth in Internet usage, a primary impetus has been the economy and the workplace (Rouet, 2006; Smith, Mikulecky, Kibby, Dreher, & Dole, 2000; Organisation for Economic Co-operation and Development & the Centre for Educational Research and Innovation, 2010). Workplace settings are increasingly characterized by the effective use of information to solve important problems within a global economy (Friedman, 2006; Matteucci, O'Mahony, Robinson, & Zwick, 2005). Moreover, the efficient use of information skills in workplace contexts has become even more important as networked, digital technologies have provided greater access to larger amounts of information (Kirsch, Braun, Yamamoto, & Sum, 2007). This analysis suggests that skill with the new literacies of the Internet and other Information and Communication Technologies (ICTs) will become an important determinant of an engaged life in an online age (International Reading Association, 2009; National Council of Teachers of English, 2008). This is true because the Internet and other ICTs are increasingly an important source of information and require new literacies to effectively exploit their information potential (Coiro, Knobel, Lankshear, & Leu, 2008). Individuals, groups, and societies who can identify the most important problems, locate useful information the fastest, critically evaluate information most effectively, synthesize information most appropriately to develop the best solutions, and then communicate these solutions to others most clearly will succeed in the challenging times that await us. New technologies for literacy continually appear and new social practices of literacy quickly emerge. Historically, literacy has always changed (Manguel, 1996), but over substantial periods of time. Today, however, the emergence of the Internet has brought about a period of rapid, continuous technological change and, as a result, rapid, continuous change in the nature of literacy. The Internet is the most efficient system in the history of civilization for delivering new technologies that require new skills to read, write, and communicate effectively. It is also an amazingly efficient system for rapidly disseminating new social practices for the use of these technologies (Lankshear & Knobel, 2006). As a result, new technologies and new social practices rapidly and repeatedly redefine what it once meant, in a simpler world, to be able to read, write, and communicate effectively. To be literate today often means being able to use some combination of blogs, wikis, texting, search engines, Facebook, foursquare, Google Docs, Skype, Chrome, iMovie, Contribute, Basecamp, or many other relatively new technologies, including thousands of mobile applications, or "apps." To be literate tomorrow will be defined by even newer technologies that have yet to appear and even newer social practices that we will create to meet unanticipated needs.
Thus, the very nature of literacy continuously changes; literacy is deictic. It is becoming increasingly clear that the deictic nature of literacy will require us to continuously rethink traditional notions of literacy.
Article
Full-text available
Reading is a multi-sensory activity, entailing perceptual, cognitive and motor interactions with whatever is being read. With digital technology, reading manifests itself as being extensively multi-sensory – both in more explicit and more complex ways than ever before. In different ways from traditional reading technologies such as the codex, digital technology illustrates how the act of reading is intimately connected with and intricately dependent on the fact that we are both body and mind – a fact carrying important implications for even such an apparently intellectual activity as reading, whether recreational, educational or occupational. This article addresses some important and hitherto neglected issues concerning digital reading, with special emphasis on the vital role of our bodies, and in particular our fingers and hands, for the immersive fiction reading experience.
Article
Full-text available
Surfing the World Wide Web (WWW) involves traversing hyperlink connections among documents. The ability to predict surfing patterns could solve many problems facing producers and consumers of WWW content. We analyzed WWW server logs for a WWW site, collected over ten days, to compare different path reconstruction methods and to investigate how past surfing behavior predicts future surfing choices. Since log files do not explicitly contain user paths, various methods have evolved to reconstruct user paths. Session times, number of clicks per visit, and Levenshtein Distance analyses were performed to show the impact of various reconstruction methods. Different methods for measuring surfing patterns were also compared. Markov model approximations were used to model the probability of users choosing links conditional on past surfing paths. Information‐theoretic (entropy) measurements suggest that information is gained by using longer paths to estimate the conditional probability of link choice given surf path. The improvements diminish, however, as one increases the length of path beyond one. Information‐theoretic (total divergence to the average entropy) measurements suggest that the conditional probabilities of link choice given surf path are more stable over time for shorter paths than longer paths. Direct examination of the accuracy of the conditional probability models in predicting test data also suggests that shorter paths yield more stable models and can be estimated reliably with less data than longer paths.
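The entropy analysis can be sketched by estimating the conditional entropy of the next pageview given the previous k pages; in line with the paper, the information gain typically diminishes beyond k = 1. The paths and names below are illustrative.

```python
import math
from collections import Counter

def cond_entropy(paths, k):
    """H(next page | previous k pages), estimated from reconstructed surf paths."""
    hist, joint = Counter(), Counter()
    for p in paths:
        for i in range(k, len(p)):
            h = tuple(p[i - k:i])
            hist[h] += 1
            joint[h + (p[i],)] += 1
    total = sum(joint.values())
    return -sum((n / total) * math.log2(n / hist[hx[:-1]]) for hx, n in joint.items())

paths = [["home", "docs", "faq", "docs"], ["home", "blog", "docs", "faq"]] * 100
for k in (1, 2, 3):
    print(k, round(cond_entropy(paths, k), 3))  # gains shrink as k grows
```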
Conference Paper
Full-text available
We analyze a large query log of 2.3 million anonymous registered users from a web-scale U.S. search engine in order to jointly analyze their on-line behavior in terms of who they might be (demographics), what they search for (query topics), and how they search (session analysis). We examine basic demographics from registration information provided by the users, augmented with U.S. census data, analyze basic session statistics, classify queries into types (navigational, informational, transactional) based on click entropy, classify queries into topic categories, and cluster users based on the queries they issued. We then examine the resulting clusters in terms of demographics and search behavior. Our analysis of the data suggests that there are important differences in search behavior across different demographic groups in terms of the topics they search for, and how they search (e.g., white conservatives are those likely to have voted republican, mostly white males, who search for business, home, and gardening related topics; Baby Boomers tend to be primarily interested in Finance and a large fraction of their sessions consist of simple navigational queries related to online banking, etc.). Finally, we examine regional search differences, which seem to correlate with differences in local industries (e.g., gambling related queries are highest in Las Vegas and lowest in Salt Lake City; searches related to actors are about three times higher in L.A. than in any other region).
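Click entropy, which the authors use to separate navigational from informational queries, takes only a few lines: a query whose clicks concentrate on one result has low entropy. The classification threshold is an illustrative assumption.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Entropy of the click distribution observed for one query."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def classify(clicked_urls, threshold=1.0):
    return "navigational" if click_entropy(clicked_urls) < threshold else "informational"

print(classify(["bank.com"] * 9 + ["bank-help.com"]))          # navigational
print(classify(["a.com", "b.org", "c.net", "d.edu", "e.io"]))  # informational
```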
Article
Full-text available
This paper analyzes which pages and topics are the most popular on Wikipedia and why. For the period of September 2006 to January 2007, the 100 most visited Wikipedia pages in a month are identified and categorized in terms of the major topics of interest. The observed topics are compared with search behavior on the Web. Search queries, which are identical to the titles of the most popular Wikipedia pages, are submitted to major search engines and the positions of popular Wikipedia pages in the top 10 search results are determined. The presented data helps to explain how search engines, and Google in particular, fuel the growth and shape what is popular on Wikipedia.
Article
Full-text available
Online popularity has enormous impact on opinions, culture, policy, and profits. We provide a quantitative, large scale, temporal analysis of the dynamics of online content popularity in two massive model systems, the Wikipedia and an entire country's Web space. We find that the dynamics of popularity are characterized by bursts, displaying characteristic features of critical systems such as fat-tailed distributions of magnitude and inter-event time. We propose a minimal model combining the classic preferential popularity increase mechanism with the occurrence of random popularity shifts due to exogenous factors. The model recovers the critical features observed in the empirical analysis of the systems analyzed here, highlighting the key factors needed in the description of popularity dynamics.
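A minimal simulation of the proposed mechanism, preferential popularity increase punctuated by rare exogenous shifts, already produces the heavy-tailed popularity the paper reports. All parameters are illustrative.

```python
import random

def simulate_popularity(steps=10000, shift_prob=0.01, n_items=200, seed=7):
    """Sketch: rich-get-richer growth plus random exogenous bursts."""
    rng = random.Random(seed)
    pop = [1] * n_items
    for _ in range(steps):
        if rng.random() < shift_prob:
            pop[rng.randrange(n_items)] += rng.randint(10, 100)  # exogenous shift
        else:
            i = rng.choices(range(n_items), weights=pop)[0]      # preferential increase
            pop[i] += 1
    return sorted(pop, reverse=True)

print(simulate_popularity()[:10])  # a few items accumulate most of the popularity
```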
Article
While Wikipedia is a subject of great interest in the computing literature, very little work has considered Wikipedia’s important relationships with other information technologies like search engines. In this paper, we report the results of two deception studies whose goal was to better understand the critical relationship between Wikipedia and Google. These studies silently removed Wikipedia content from Google search results and examined the effect of doing so on participants’ interactions with both websites. Our findings demonstrate and characterize an extensive interdependence between Wikipedia and Google. Google becomes a worse search engine for many queries when it cannot surface Wikipedia content (for example, click-through rates on results pages drop significantly) and the importance of Wikipedia content is likely greater than many improvements to search algorithms. Our results also highlight Google’s critical role in providing readership to Wikipedia. However, we also found evidence that this mutually beneficial relationship is in jeopardy: changes Google has made to its search results that involve directly surfacing Wikipedia content are significantly reducing traffic to Wikipedia. Overall, our findings argue that researchers and practitioners should give deeper consideration to the interdependence between peer production communities and the information technologies that use and surface their content.
Article
Portrayals of history are never complete, and each description inherently exhibits a specific viewpoint and emphasis. In this paper, we aim to automatically identify such differences by computing timelines and detecting temporal focal points of written history across languages on Wikipedia. In particular, we study articles related to the history of all UN member states and compare them in 30 language editions. We develop a computational approach that allows us to identify focal points quantitatively, and find that Wikipedia narratives about national histories (i) are skewed towards more recent events (recency bias) and (ii) are distributed unevenly across the continents with significant focus on the history of European countries (Eurocentric bias). We also establish that national historical timelines vary across language editions, although average interlingual consensus is rather high. We hope that this paper provides a starting point for a broader computational analysis of written history on Wikipedia and elsewhere.
Conference Paper
While a plethora of hypertext links exist on the Web, only a small amount of them are regularly clicked. Starting from this observation, we set out to study large-scale click data from Wikipedia in order to understand what makes a link successful. We systematically analyze effects of link properties on the popularity of links. By utilizing mixed-effects hurdle models supplemented with descriptive insights, we find evidence of user preference towards links leading to the periphery of the network, towards links leading to semantically similar articles, and towards links in the top and left-side of the screen. We integrate these findings as Bayesian priors into a navigational Markov chain model and by doing so successfully improve the model fits. We further adapt and improve the well-known classic PageRank algorithm that assumes random navigation by accounting for observed navigational preferences of users in a weighted variation. This work facilitates understanding navigational click behavior and thus can contribute to improving link structures and algorithms utilizing these structures.
Article
Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia. The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity. Based on an initial series of user surveys, we build a taxonomy of Wikipedia use cases along several dimensions, capturing users' motivations to visit Wikipedia, the depth of knowledge they are seeking, and their knowledge of the topic of interest prior to visiting Wikipedia. Then, we quantify the prevalence of these use cases via a large-scale user survey conducted on live Wikipedia with almost 30,000 responses. Our analyses highlight the variety of factors driving users to Wikipedia, such as current events, media coverage of a topic, personal curiosity, work or school assignments, or boredom. Finally, we match survey responses to the respondents' digital traces in Wikipedia's server logs, enabling the discovery of behavioral patterns associated with specific use cases. For instance, we observe long and fast-paced page sequences across topics for users who are bored or exploring randomly, whereas those using Wikipedia for work or school spend more time on individual articles focused on topics such as science. Our findings advance our understanding of reader motivations and behavior on Wikipedia and can have implications for developers aiming to improve Wikipedia's user experience, editors striving to cater to their readers' needs, third-party services (such as search engines) providing access to Wikipedia content, and researchers aiming to build tools such as recommendation engines.
Conference Paper
In this work, we study the visual position of links and their clicks on Wikipedia: where links are visually located, at which screen positions users click on links, and which areas of the screen exhibit more or fewer clicks per link. For that purpose, we introduce a novel dataset containing the on-screen coordinate position for all links between pages in the English Wikipedia and additionally resort to navigation logs of Wikipedia users. Using this data, we can observe a preference for certain link and click locations on Wikipedia, including first evidence of positional click bias. For example, our results suggest that users have a tendency to click on the left side of the screen that exceeds what one would expect from the presence of links on pages. We believe that the presented data and research can be useful for optimizing the process of link creation and link consumption on Wikipedia and other Web platforms.
Conference Paper
Wikipedia supports its users to reach a wide variety of goals: looking up facts, researching a topic, making an edit or simply browsing to pass time. Some of these goals, such as the lookup of facts, can be effectively supported by search functions. However, for other use cases such as researching an unfamiliar topic, users need to rely on the links to connect articles. In this paper, we investigate the state of navigability in the article networks of eight language versions of Wikipedia. We find that, when taking all links of articles into account, all language versions enable mutual reachability for almost all articles. However, previous research has shown that visitors of Wikipedia focus most of their attention on the areas located close to the top. We therefore investigate different restricted navigational views that users could have when looking at articles. We find that restricting the view of articles strongly limits the navigability of the resulting networks and impedes navigation. Based on this analysis we then propose a link recommendation method to augment the link network to improve navigability in the network. Our approach selects links from a less restricted view of the article and proposes to move these links into more visible sections. The recommended links are therefore relevant for the article. Our results are relevant for researchers interested in the navigability of Wikipedia and open up new avenues for link recommendations in Wikipedia editing.
Conference Paper
Today, a variety of user interfaces exists for navigating information spaces, including, for example, tag clouds, breadcrumbs, subcategories and others. However, such navigational user interfaces are only useful to the extent that they expose the underlying topology---or network structure---of the information space. Yet, little is known about which topological clues should be integrated in navigational user interfaces. In detail, the aim of this paper is to identify what kind of and how much topological information needs to be included in user interfaces to facilitate efficient navigation. We model navigation as a variation of a decentralized search process with partial information and study its sensitivity to the quality and amount of the structural information used for navigation. We experiment with two strategies for node selection (quality of structural information provided to the user) and different amount of information (amount of structural information provided to the user). Our experiments on four datasets from different domains show that efficient navigation depends on the kind of structural information utilized. Additionally, node properties differ in their quality for augmenting navigation and intelligent pre-selection of which nodes to present in the interface to the user can improve navigational efficiency. This suggests that only a limited amount of high quality structural information needs to be exposed through the navigational user interface.
Article
Introduction. This exploratory study analyses the content of the search queries that led Australian Internet users from a search engine to a Wikipedia entry. Method. The study used transaction logs from Hitwise that matched search queries with data on the lifestyle of the searcher. A total sample of 1760 search terms, stratified by search term frequency and lifestyle, was drawn. Analysis. Each search term was coded to indicate the subject of the query and weighted according to its position in the long tail distribution. Quantitative analysis was carried out using the statistical package SPSS. Results. The results of the study suggest that Wikipedia is used more for lighter topics than for those of a more academic or serious nature. Significant differences among the various lifestyle segments were observed in the use of Wikipedia for queries on popular culture, cultural practice and science. Conclusions. The analysis provides some analytical purchase on the complex nature of information search and the difficulties inherent in assuming a valid distinction between information search and entertainment. It is suggested that the term leisure search be used to identify information search that is in itself a leisure activity and not a search for particular information.
Article
Good websites should be easy to navigate via hyperlinks, yet maintaining a high-quality link structure is difficult. Identifying pairs of pages that should be linked may be hard for human editors, especially if the site is large and changes frequently. Further, given a set of useful link candidates, the task of incorporating them into the site can be expensive, since it typically involves humans editing pages. In the light of these challenges, it is desirable to develop data-driven methods for automating the link placement task. Here we develop an approach for automatically finding useful hyperlinks to add to a website. We show that passively collected server logs, beyond telling us which existing links are useful, also contain implicit signals indicating which nonexistent links would be useful if they were to be introduced. We leverage these signals to model the future usefulness of yet nonexistent links. Based on our model, we define the problem of link placement under budget constraints and propose an efficient algorithm for solving it. We demonstrate the effectiveness of our approach by evaluating it on Wikipedia, a large website for which we have access to both server logs (used for finding useful new links) and the complete revision history (containing a ground truth of new links). As our method is based exclusively on standard server logs, it may also be applied to any other website, as we show with the example of the biomedical research site Simtk.
Article
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music play lists. Understanding the factors that drive the production of these trails can be useful for e.g., improving underlying network structures, predicting user clicks or enhancing recommendations. In this work, we present a general approach called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our approach utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to leverage the sensitivity of Bayes factors on the prior for comparing hypotheses with each other. For eliciting Dirichlet priors from hypotheses, we present an adaption of the so-called (trial) roulette method. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including website navigation, business reviews and online music played. Our work expands the repertoire of methods available for studying human trails on the Web.
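HypTrails' central computation, scoring each hypothesis by the Dirichlet-multinomial evidence it assigns to observed transitions, can be sketched compactly. The pseudo-count elicitation below is simplified (the paper uses the trial roulette method), and the data are toy values.

```python
import math

def log_evidence(counts, prior, states):
    """Dirichlet-multinomial marginal likelihood of transition counts,
    with the hypothesis expressed as Dirichlet pseudo-counts."""
    def log_beta(alpha):
        return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    ll = 0.0
    for s, nxt in counts.items():
        a = [1.0 + prior.get(s, {}).get(x, 0.0) for x in states]  # uniform base prior
        n = [nxt.get(x, 0) for x in states]
        ll += log_beta([ai + ni for ai, ni in zip(a, n)]) - log_beta(a)
    return ll

states = ["A", "B", "C"]
counts = {"A": {"B": 30, "C": 2}, "B": {"A": 25, "C": 5}}
h_structural = {"A": {"B": 5.0}, "B": {"A": 5.0}}   # belief: users bounce A <-> B
h_uniform = {}                                       # belief: all transitions equal
print(log_evidence(counts, h_structural, states) -
      log_evidence(counts, h_uniform, states))       # log Bayes factor > 0 favors h_structural
```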
Conference Paper
Models of human navigation play an important role for understanding and facilitating user behavior in hypertext systems. In this paper, we conduct a series of principled experiments with decentralized search - an established model of human navigation in social networks - and study its applicability to information networks. We apply several variations of decentralized search to model human navigation in information networks and we evaluate the outcome in a series of experiments. In these experiments, we study the validity of decentralized search by comparing it with human navigational paths from an actual information network - Wikipedia. We find that (i) navigation in social networks appears to differ from human navigation in information networks in interesting ways and (ii) in order to apply decentralized search to information networks, stochastic adaptations are required. Our work illuminates a way towards using decentralized search as a valid model for human navigation in information networks in future work. Our results are relevant for scientists who are interested in modeling human behavior in information networks and for engineers who are interested in using models and simulations of human behavior to improve on structural or user interface aspects of hypertextual systems.
Article
Navigating information spaces is an essential part of our everyday lives, and in order to design efficient and user-friendly information systems, it is important to understand how humans navigate and find the information they are looking for. We perform a large-scale study of human wayfinding, in which, given a network of links between the concepts of Wikipedia, people play a game of finding a short path from a given start to a given target concept by following hyperlinks. What distinguishes our setup from other studies of human Web-browsing behavior is that in our case people navigate a graph of connections between concepts, and that the exact goal of the navigation is known ahead of time. We study more than 30,000 goal-directed human search paths and identify strategies people use when navigating information spaces. We find that human wayfinding, while mostly very efficient, differs from shortest paths in characteristic ways. Most subjects navigate through high-degree hubs in the early phase, while their search is guided by content features thereafter. We also observe a trade-off between simplicity and efficiency: conceptually simple solutions are more common but tend to be less efficient than more complex ones. Finally, we consider the task of predicting the target a user is trying to reach. We design a model and an efficient learning algorithm. Such predictive models of human wayfinding can be applied in intelligent browsing interfaces.
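The two-phase strategy the study observes can be caricatured as a greedy walker that prefers high-degree hubs early (zoom-out) and neighbors similar to the target later (homing-in). This is a toy behavioral model, not the study's analysis code, and the similarity function is a deliberately crude stand-in for content features.

```python
def navigate(graph, similarity, start, target, hub_phase=2, max_steps=20):
    """Greedy wayfinding: hubs first, target similarity thereafter."""
    path, node = [start], start
    while node != target and len(path) <= max_steps:
        nbrs = graph[node]
        if not nbrs:
            break
        if len(path) <= hub_phase:                       # zoom-out: prefer hubs
            node = max(nbrs, key=lambda n: len(graph[n]))
        else:                                            # homing-in: prefer similarity
            node = max(nbrs, key=lambda n: similarity(n, target))
        path.append(node)
    return path

graph = {"Start": ["Hub"], "Hub": ["Cats", "History", "Castles"],
         "Cats": ["Hub"], "History": ["Hub", "Castles"], "Castles": []}
sim = lambda a, b: len(set(a.lower()) & set(b.lower()))  # toy character overlap
print(navigate(graph, sim, "Start", "Castles"))
```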
Article
User modeling on the Web has rested on the fundamental assumption of Markovian behavior --- a user's next action depends only on her current state, and not the history leading up to the current state. This forms the underpinning of PageRank web ranking, as well as a number of techniques for targeting advertising to users. In this work we examine the validity of this assumption, using data from a number of Web settings. Our main result invokes statistical order estimation tests for Markov chains to establish that Web users are not, in fact, Markovian. We study the extent to which the Markovian assumption is invalid, and derive a number of avenues for further research.
Article
In almost all computer applications, users must enter correct words for the desired objects or actions. For success without extensive training, or in first tries for new targets, the system must recognize terms that will be chosen spontaneously. We studied spontaneous word choice for objects in five application-related domains, and found the variability to be surprisingly large. In every case two people favored the same term with probability less than 0.20. Simulations show how this fundamental property of language limits the success of various design methodologies for vocabulary-driven interaction. For example, the popular approach in which access is via one designer's favorite single word will result in 80-90 percent failure rates in many common situations. An optimal strategy, unlimited aliasing, is derived and shown to be capable of several-fold improvements. (Author abstract)
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
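The algorithm itself fits in a few lines as a power iteration; this is the textbook variant with a damping factor and dangling pages spreading their rank uniformly, not necessarily the paper's exact formulation.

```python
def pagerank(links, d=0.85, iters=50):
    """links: page -> list of out-linked pages (all pages appear as keys)."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - d) / n for u in nodes}          # random-surfer teleport
        for u in nodes:
            out = links[u] or nodes                      # dangling: spread uniformly
            share = rank[u] / len(out)
            for v in out:
                new[v] += d * share
        rank = new
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}))
```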
Conference Paper
In view navigation a user moves about an information structure by selecting something in the current view of the structure. This paper explores the implications of rudimentary requirements for effective view navigation, namely that, despite the vastness of an information structure, the views must be small, moving around must not take too many steps, and the route to any target must be discoverable. The analyses help rationalize existing practice, give insight into the difficulties, and suggest strategies for design.
Conference Paper
In this paper, we undertake a large-scale study of online user behavior based on search and toolbar logs. We propose a new CCS taxonomy of pageviews consisting of Content (news, portals, games, verticals, multimedia), Communication (email, social networking, forums, blogs, chat), and Search (Web search, item search, multimedia search). We show that roughly half of all pageviews online are content, one-third are communications, and the remaining one-sixth are search. We then give further breakdowns to characterize the pageviews within each high-level category. We then study the extent to which pages of certain types are revisited by the same user over time, and the mechanisms by which users move from page to page, within and across hosts, and within and across page types. We consider robust schemes for assigning responsibility for a pageview to ancestors along the chain of referrals. We show that mail, news, and social networking pageviews are insular in nature, appearing primarily in homogeneous sessions of one type. Search pageviews, on the other hand, appear on the path to a disproportionate number of pageviews, but cannot be viewed as the principal mechanism by which those pageviews were reached. Finally, we study the burstiness of pageviews associated with a URL, and show that by and large, online browsing behavior is not significantly affected by "breaking" material with non-uniform visit frequency.
Article
Ranked lists are encountered in research and daily life and it is often of interest to compare these lists even when they are incomplete or have only some members in common. An example is document rankings returned for the same query by different search engines. A measure of the similarity between incomplete rankings should handle nonconjointness, weight high ranks more heavily than low, and be monotonic with increasing depth of evaluation; but no measure satisfying all these criteria currently exists. In this article, we propose a new measure having these qualities, namely rank-biased overlap (RBO). The RBO measure is based on a simple probabilistic user model. It provides monotonicity by calculating, at a given depth of evaluation, a base score that is non-decreasing with additional evaluation, and a maximum score that is nonincreasing. An extrapolated score can be calculated between these bounds if a point estimate is required. RBO has a parameter which determines the strength of the weighting to top ranks. We extend RBO to handle tied ranks and rankings of different lengths. Finally, we give examples of the use of the measure in comparing the results produced by public search engines and in assessing retrieval systems in the laboratory.
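The truncated (base) form of RBO is compact enough to sketch; the full measure additionally provides the extrapolated score and the tie and uneven-length extensions described above, which are omitted here.

```python
def rbo_base(list1, list2, p=0.9):
    """Truncated rank-biased overlap: (1-p) * sum_d p^(d-1) * A_d,
    where A_d is the proportion of items the two rankings share at depth d."""
    depth = min(len(list1), len(list2))
    seen1, seen2, score = set(), set(), 0.0
    for d in range(1, depth + 1):
        seen1.add(list1[d - 1])
        seen2.add(list2[d - 1])
        score += (p ** (d - 1)) * len(seen1 & seen2) / d
    return (1 - p) * score

print(rbo_base(["a", "b", "c", "d"], ["a", "c", "b", "e"]))  # top-weighted agreement
```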
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
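Simulating the three-level generative story is often the clearest way to see what LDA assumes: per-topic word distributions and per-document topic mixtures are drawn from Dirichlets, and each word is sampled topic-first. Sizes and hyperparameters below are toy values.

```python
import numpy as np

def generate_corpus(n_docs=5, doc_len=12, n_topics=3, vocab=20,
                    alpha=0.5, beta=0.1, seed=0):
    """Sample synthetic documents (as word ids) from the LDA generative model."""
    rng = np.random.default_rng(seed)
    topics = rng.dirichlet([beta] * vocab, size=n_topics)   # topic -> word distribution
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet([alpha] * n_topics)           # document's topic mixture
        docs.append([rng.choice(vocab, p=topics[rng.choice(n_topics, p=theta)])
                     for _ in range(doc_len)])
    return docs

print(generate_corpus()[0])  # one synthetic document
```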
Article
In this paper we undertake a large-scale study of online user search behavior based on search and toolbar logs. We identify three types of search: web, multimedia, and item. Together, we show that these different flavors represent almost 10% of all online pageviews, and indirectly result in over 21% of all pageviews. We study search queries themselves, and show that more than half of them contain direct references to some type of structured object; we characterize the types of objects that occur in these queries. We then revisit the relationship between search and navigation, specifically in the context of e-commerce, and consider how search aids users in online shopping tasks.
Conference Paper
THE KINDS OF FILE structures required if we are to use the computer for personal files and as an adjunct to creativity are wholly different in character from those customary in business and scientific data processing. They need to provide the capacity for intricate and idiosyncratic arrangements, total modifiability, undecided alternatives, and thorough internal documentation. I want to explain how some ideas developed and what they are. The original problem was to specify a computer system for personal information retrieval and documentation, able to do some rather complicated things in clear and simple ways. In this paper I will explain the original problem. Then I will explain why the problem is not simple, and why the solution (a file structure) must yet be very simple. The file structure suggested here is the Evolutionary List File, to be built of zippered lists. A number of uses will be suggested for such a file, to show the breadth of its potential usefulness. Finally, I want to explain the philosophical implications of this approach for information retrieval and data structure in a changing world.
Donald J. Leu, Jill Castek, D. Hartman, Julie Coiro, L. Henry, J. Kulikowich, and Stacy Lyver. 2005. Evaluating the development of scientific knowledge and new forms of reading comprehension during online learning. Final report presented to the North Central Regional Educational Laboratory/Learning Point Associates. Retrieved May 15, 2006.
Ellery Wulczyn and Dario Taraborelli. 2016. Wikipedia Clickstream. figshare. doi:10.6084/m9.figshare.1305770. Accessed: 2017-05-03.
Philipp Singer, Denis Helic, Andreas Hotho, and Markus Strohmaier.
Jacob Ratkiewicz, Santo Fortunato, Alessandro Flammini, Filippo Menczer, and Alessandro Vespignani. 2010. Characterizing and modeling the dynamics of online popularity.