Conference Paper

Analysis of Cluster Structure in Large-Scale English Wikipedia Category Networks

Abstract

In this paper we propose a framework for analysing the structure of a large-scale social media network, a topic of significant recent interest. Our study is focused on the Wikipedia category network, where nodes correspond to Wikipedia categories and edges connect two nodes if the nodes share at least one common page within the Wikipedia network. Moreover, each edge is given a weight that corresponds to the number of pages shared between the two categories that it connects. We study the structure of category clusters within the three complete English Wikipedia category networks from 2010 to 2012. We observe that category clusters appear in the form of well-connected components that are naturally clustered together. For each dataset we obtain a graph, which we call the t-filtered category graph, by retaining just a single edge linking each pair of categories for which the weight of the edge exceeds some specified threshold t. Our framework exploits this graph structure and identifies connected components within the t-filtered category graph. We studied the large-scale structural properties of the three Wikipedia category networks using the proposed approach. We found that the number of categories, the number of clusters of size two, and the size of the largest cluster within the graph all appear to follow power laws in the threshold t. Furthermore, for each network we found the value of the threshold t for which increasing the threshold to t + 1 caused the “giant” largest cluster to diffuse into two or more smaller clusters of significant size and studied the semantics behind this diffusion.
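The construction described in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the input format (a dictionary from page title to its categories), the function names `category_edge_weights` and `t_filtered_clusters`, and the use of union-find to find connected components are all choices made here for illustration. Categories left with no surviving edge are dropped from the t-filtered graph, which is also an assumption.

```python
from collections import defaultdict
from itertools import combinations


def category_edge_weights(page_categories):
    """Return {(cat_a, cat_b): number of pages shared by both categories}.

    `page_categories` maps a page title to its categories (hypothetical format).
    """
    weights = defaultdict(int)
    for categories in page_categories.values():
        # Each page contributes 1 to every pair of categories it belongs to.
        for a, b in combinations(sorted(set(categories)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def t_filtered_clusters(weights, t):
    """Connected components of the graph that keeps only edges with weight > t.

    Clusters are returned as sets, largest first. Categories with no edge
    above the threshold do not appear (an assumption of this sketch).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        if w > t:  # "exceeds" the threshold, so strictly greater
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy data, invented for illustration only.
toy_pages = {
    "Page1": ["Physics", "Science"],
    "Page2": ["Physics", "Science"],
    "Page3": ["Physics", "Science", "Chemistry"],
    "Page4": ["Chemistry", "Science"],
    "Page5": ["Art", "Painting"],
    "Page6": ["Art", "Painting"],
}
toy_weights = category_edge_weights(toy_pages)
for threshold in range(4):
    sizes = [len(c) for c in t_filtered_clusters(toy_weights, threshold)]
    print(threshold, sizes)
```

Sweeping the threshold as in the last loop gives, for each t, the number of clusters and the size of the largest one, which are the quantities the abstract reports as following power laws in t. On the toy data the largest cluster shrinks from three categories to two when t goes from 1 to 2, a small-scale analogue of the giant cluster breaking apart.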

Conference Paper
Some provinces in Thailand cannot promote their festival events directly to tourists. From a tourist's perspective, searching for tourism information across different sources is inconvenient. Therefore, we developed an online promotion system for Thai festivals with three objectives: 1) to represent events as geospatial information showing each event's location, 2) to display each event's details, and 3) to show text content describing the places of the events, aggregated from Wikipedia's web services using the 'MediaWiki Action API'. The system provides the following functions for its users: 1) tourists can filter and view events, 2) festival promoters can register events, 3) the administrator can manage users' accounts, and 4) the approver can validate and manage events' details. The system was evaluated on user satisfaction, and the result was good, with an arithmetic mean of 4.31 out of 5.00 and a standard deviation of 0.70.
Article
Twitter and other social media platforms are increasingly used as the primary way in which people speak with each other. As opposed to other platforms, Twitter is interesting in that many of these dialogues are public and so we can get a view into the dynamics of dialogues and how they differ from other tweet behaviors. We here analyze tweets gathered from 2400 Twitter streams over a one-month period. We study social interactions in three important dimensions: what are the salient user behaviors in terms of how often they have social interactions and how these interactions are spread among different people; what are the characteristics of the dialogues, or sets of tweets, that we can extract from these interactions; and what are the characteristics of the social network which emerges from considering these interactions? We find that roughly half of the users spend a fair amount of time interacting whereas 40% of users do not seem to have active interactions. We also find that the vast majority of active dialogues only involve two people despite the public nature of these tweets. We finally find that while the emerging social network does contain a giant component, the component clearly is a set of well-defined tight clusters which are loosely connected.
Article
Today, YouTube is the largest user-driven video content provider in the world; it has become a major platform for disseminating multimedia information. A major contribution to its success comes from the user-to-user social experience that differentiates it from traditional content broadcasters. This work examines the social network aspect of YouTube by measuring the full-scale YouTube subscription graph, comment graph, and video content corpus. We find YouTube to deviate significantly from network characteristics that mark traditional online social networks, such as homophily, reciprocative linking, and assortativity. However, compared with reported characteristics of another content-driven online social network, Twitter, YouTube is remarkably similar. Examining the social and content facets of user popularity, we find a stronger correlation between a user's social popularity and his/her most popular content as opposed to typical content popularity. Finally, we demonstrate an application of our measurements for classifying YouTube Partners, who are selected users that share YouTube's advertisement revenue. Results are motivating despite the highly imbalanced nature of the classification problem.
Article
The popularity of the Web has allowed individuals to communicate and interact with each other on a global scale: people connect both to close friends and acquaintances, creating ties that can bridge otherwise separated groups of people. Recent evidence suggests that spatial distance is still affecting social links established on online platforms, with online ties preferentially connecting closer people. In this work we study the relationships between interaction strength, spatial distance and structural position of ties between members of a large-scale online social networking platform, Tuenti. We discover that ties in highly connected social groups tend to span shorter distances than connections bridging together otherwise separated portions of the network. We also find that such bridging connections have lower social interaction levels than ties within the inner core of the network and ties connecting to its periphery. Our results suggest that spatial constraints on online social networks are intimately connected to structural network properties, with important consequences for information diffusion.
Article
As the Web has become integrated into daily life, understanding how individuals spend their time online impacts domains ranging from public policy to marketing. It is difficult, however, to measure even simple aspects of browsing behavior via conventional methods (including surveys and site-level analytics) due to limitations of scale and scope. In part addressing these limitations, large-scale Web panel data are a relatively novel means for investigating patterns of Internet usage. In one of the largest studies of browsing behavior to date, we pair Web histories for 250,000 anonymized individuals with user-level demographics (including age, sex, race, education, and income) to investigate three topics. First, we examine how behavior changes as individuals spend more time online, showing that the heaviest users devote nearly twice as much of their time to social media relative to typical individuals. Second, we revisit the digital divide, finding that the frequency with which individuals turn to the Web for research, news, and healthcare is strongly related to educational background, but not as closely tied to gender and ethnicity. Finally, we demonstrate that browsing histories are a strong signal for inferring user attributes, including ethnicity and household income, a result that may be leveraged to improve ad targeting.
Article
Research in computational epidemiology to date has concentrated on coarse-grained statistical analysis of populations, often synthetic ones. By contrast, this paper focuses on fine-grained modeling of the spread of infectious diseases throughout a large real-world social network. Specifically, we study the roles that social ties and interactions between specific individuals play in the progress of a contagion. We focus on public Twitter data, where we find that for every health-related message there are more than 1,000 unrelated ones. This class imbalance makes classification particularly challenging. Nonetheless, we present a framework that accurately identifies sick individuals from the content of online communication. Evaluation on a sample of 2.5 million geo-tagged Twitter messages shows that social ties to infected, symptomatic people, as well as the intensity of recent co-location, sharply increase one's likelihood of contracting the illness in the near future. To our knowledge, this work is the first to model the interplay of social activity, human mobility, and the spread of infectious disease in a large real-world population. Furthermore, we provide the first quantifiable estimates of the characteristics of disease transmission on a large scale without active user participation, a step towards our ability to model and predict the emergence of global epidemics from day-to-day interpersonal interactions. Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Data
In the last few years the size and coverage of Wikipedia, a community-edited, freely available online encyclopedia, has reached the point where it can be effectively used to identify topics discussed in a document, similarly to an ontology or taxonomy. In this paper we show that even a fairly simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting categories of Wikipedia articles themselves based on their bodies, and also by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of (or in addition to) their texts. Support from NKFP projects MOLINGV and Language Miner.
Article
Wikipedia, as a social phenomenon of collaborative knowledge creation, has been studied extensively from various points of view. The category system of Wikipedia, introduced in 2004, has attracted relatively little attention. In this study, we focus on the documentation of knowledge, and the transformation of this documentation with time. We take Wikipedia as a proxy for knowledge in general and its category system as an aspect of the structure of this knowledge. We investigate the evolution of the category structure of the English Wikipedia from its birth in 2004 to 2008. We treat the category system as if it were a hierarchical Knowledge Organization System, capturing the changes in the distributions of the top categories. We investigate how the clustering of articles, defined by the category system, matches the direct link network between the articles and show how it changes over time. We find the Wikipedia category network mostly stable, but with occasional reorganization. We show that the clustering matches the link structure quite well, except for short periods preceding the reorganizations.
Conference Paper
In this paper, we investigate the difference between Wikipedia and Web link structure with respect to their value as indicators of the relevance of a page for a given topic of request. Our experimental evidence is from two IR test-collections: the .GOV collection used at the TREC Web tracks and the Wikipedia XML Corpus used at INEX. We first perform a comparative analysis of Wikipedia and .GOV link structure and then investigate the value of link evidence for improving search on Wikipedia and on the .GOV domain. Our main findings are: First, Wikipedia link structure is similar to the Web, but more densely linked. Second, Wikipedia's outlinks behave similarly to inlinks and both are good indicators of relevance, whereas on the Web the inlinks are more important. Third, when incorporating link evidence in the retrieval model, for Wikipedia the global link evidence fails and we have to take the local context into account.
Conference Paper
While previous studies have used the Wikipedia dataset to provide an understanding of its growth, there have been few attempts to quantitatively analyze the establishment and evolution of the rich social practices that support this editing community. One such social practice is the enactment and creation of Wikipedian policies. We focus on the enactment of policies in discussions on the talk pages that accompany each article. These policy citations are a valuable micro-to-macro connection between everyday action, communal norms and the governance structure of Wikipedia. We find that policies are widely used by registered users and administrators, that their use is converging and stabilizing in and across these groups, and that their use illustrates the growing importance of certain classes of work, in particular source attribution. We also find that participation in Wikipedia's governance structure is inclusionary in practice.
Conference Paper
Wikipedia, the free online encyclopedia anyone can edit, is a live social experiment: millions of individuals volunteer their knowledge and time to collectively create it. It is hence interesting to try to understand how they do it. While most scholarly attention has focused on article pages, a less investigated share of activities happens on user talk pages, Wikipedia pages where a message can be left for a specific user. These public conversations can be studied from a Social Network Analysis perspective in order to highlight the structure of the "talk" network. In this paper we focus on this preliminary extraction step by proposing different algorithms. We then empirically validate the differences in the networks they generate on the Venetian Wikipedia against the real network of conversations extracted manually by coding every message left on all user talk pages. The comparisons show that both the algorithms and the manual process contain inaccuracies that are intrinsic to the freedom and unpredictability of Wikipedia syntax and practices. Nevertheless, a precise description of the issues involved allows one to make informed decisions and to base empirical findings on reproducible evidence. Our goal is to lay the foundation for a solid computational sociology of wikis. For this reason we release the scripts encoding our algorithms as open source, along with some datasets extracted from Wikipedia conversations, in order to let other researchers replicate and improve our initial effort.
Conference Paper
Wikipedia is an online encyclopedia which has undergone tremendous growth. However, this same growth has made it difficult to characterize its content and coverage. In this paper we develop measures to map Wikipedia using its socially annotated, hierarchical category structure. We introduce a mapping technique that takes advantage of socially-annotated hierarchical categories while dealing with the inconsistencies and noise inherent in the distributed way that they are generated. The technique is demonstrated through two applications: mapping the distribution of topics in Wikipedia and how they have changed over time; and mapping the degree of conflict found in each topic area. We also discuss the utility of the approach for other applications and datasets involving collaboratively annotated category hierarchies.
Conference Paper
In this paper, we discuss two graphs in Wikipedia: (i) the article graph, and (ii) the category graph. We perform a graph-theoretic analysis of the category graph, and show that it is a scale-free, small world graph like other well-known lexical semantic networks. We substantiate our findings by transferring semantic relatedness algorithms defined on WordNet to the Wikipedia category graph. To assess the usefulness of the category graph as an NLP resource, we analyze its coverage and the performance of the transferred semantic relatedness algorithms.
Conference Paper
Wikipedia is an online encyclopedia, available in more than 100 languages and comprising over 1 million articles in its English version. If we consider each Wikipedia article as a node and each hyperlink between articles as an arc, we have a "Wikigraph", a graph that represents the link structure of Wikipedia. The Wikigraph differs from other Web graphs studied in the literature in that there are explicit timestamps associated with each node's events. This allows us to do a detailed analysis of the Wikipedia evolution over time. In the first part of this study we characterize this evolution in terms of users, editions and articles; in the second part, we depict the temporal evolution of several topological properties of the Wikigraph. The insights obtained from the Wikigraphs can be applied to large Web graphs for which the temporal data is usually not available.
Article
In Wikipedia, good articles are wanted. While Wikipedia relies on collaborative effort from online volunteers for quality checking, the process of selecting top quality articles is time consuming. At present, the duty of decision making is shouldered by only a couple of administrators. Aiming to assist in the quality checking cycles so as to cope with the exponential growth of online contributions to Wikipedia, this work studies the task of predicting the outcome of featured article (FA) nominations. We analyze FA candidate (FAC) sessions collected over a period of 3.5 years, and examine the extent to which consensus has been practised in this process. We explore the use of interaction features between FAC reviewers to learn SVM classifiers to predict the nomination outcome. We find that calibrating the individual user's polarity of opinions as features improves the prediction accuracy significantly.
Article
In this survey we overview the definitions and methods for graph clustering, that is, finding sets of ''related'' vertices in graphs. We review the many definitions for what is a cluster in a graph and measures of cluster quality. Then we present global algorithms for producing a clustering for the entire vertex set of an input graph, after which we discuss the task of identifying a cluster for a specific seed vertex by local computation. Some ideas on the application areas of graph clustering algorithms are given. We also address the problematics of evaluating clusterings and benchmarking cluster algorithms.
Article
Wikipedia is a collaborative setting with both combative and cooperative editing. We propose a new method for investigating the types of editor interactions using a novel representation of Wikipedia's revision history as a temporal, bipartite network with multiple node and edge types for users and revisions. From this representation we identify significant author interactions as network motifs and show how the motif types capture important, diverse editing behaviors. Two experiments demonstrate the further benefit of motifs. First, we demonstrate significant performance improvement over a purely revision-based analysis in classifying pages as combative or cooperative by using motifs; and second, we use motifs as a basis for analyzing trends in the dynamics of editor behavior to explain Wikipedia's content growth.
Conference Paper
The size and coverage of Wikipedia, a freely available online encyclopedia, have reached the point where it can be used, much like an ontology or taxonomy, to identify the topics discussed in a document. In this paper we show that even a simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting categories of Wikipedia articles themselves based on their bodies, and by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of their texts.
Conference Paper
Document topic extraction aims to describe the topics of documents using several key phrases. It can be applied in web document categorization and tagging, topic description of document clusters, and information retrieval tasks. In this paper, we propose a Wikipedia category-based document topic extraction method. Each document is mapped to a set of Wikipedia categories and is represented as a graph structure in order to preserve the relationships between Wikipedia categories. Document topics can then be extracted by clustering the related Wikipedia categories in the document collection. Experiments on real data show that the Wikipedia category-based document topic extraction method achieves better results than latent topic modeling methods such as LDA.
Conference Paper
In Wikipedia, good articles are wanted. While Wikipedia relies on collaborative effort from online volunteers for quality checking, the process of selecting top quality articles is time consuming. At present, the duty of decision making is shouldered by only a couple of administrators. Aiming to assist in the quality checking cycles so as to cope with the exponential growth of online contributions to Wikipedia, this work studies the task of predicting the outcome of featured article (FA) nominations. We analyze FA candidate (FAC) sessions collected over a period of 3.5 years, and examine the extent to which consensus has been practised in this process. We explore the use of interaction features between FAC reviewers to learn SVM classifiers to predict the nomination outcome. We find that calibrating the individual user's polarity of opinions as features improves the prediction accuracy significantly.
Article
Social media sites are often guided by a core group of committed users engaged in various forms of governance. A crucial aspect of this type of governance is deliberation, in which such a group reaches decisions on issues of importance to the site. Despite its crucial, though subtle, role in how a number of prominent social media sites function, there has been relatively little investigation of the deliberative aspects of social media governance. Here we explore this issue, investigating a particular deliberative process that is extensive, public, and recorded: the promotion of Wikipedia admins, which is determined by elections that engage committed members of the Wikipedia community. We find that the group decision-making at the heart of this process exhibits several fundamental forms of relative assessment. First, we observe that the chance that a voter will support a candidate is strongly dependent on the relationship between characteristics of the voter and the candidate. Second, we investigate how both individual voter decisions and overall election outcomes can be based on models that take into account the sequential, public nature of the voting.
Social Network Analysis: a handbook
  • J. Scott