Conference Paper

Detecting Wikipedia Vandalism using WikiTrust - Lab Report for PAN at CLEF 2010.

Conference: CLEF 2010 LABs and Workshops, Notebook Papers, 22-23 September 2010, Padua, Italy
Source: DBLP

ABSTRACT: WikiTrust is a reputation system for Wikipedia authors and content. WikiTrust computes three main quantities: edit quality, author reputation, and content reputation. The edit quality measures how well each edit, that is, each change introduced in a revision, is preserved in subsequent revisions. Authors who perform good quality edits gain reputation, and text which is revised by several high-reputation authors gains reputation. Since vandalism on the Wikipedia is usually performed by anonymous or new users (not least because long-time vandals end up banned), and is usually reverted in a reasonably short span of time, edit quality, author reputation, and content reputation are obvious candidates as features to identify vandalism on the Wikipedia. Indeed, using the full set of features computed by WikiTrust, we have been able to construct classifiers that identify vandalism with a recall of 83.5%, a precision of 48.5%, and a false positive rate of 8%, for an area under the ROC curve of 93.4%. If we limit ourselves to the set of features available at the time an edit is made (when the edit quality is still unknown), the classifier achieves a recall of 77.1%, a precision of 36.9%, and a false positive rate of 12.2%, for an area under the ROC curve of 90.4%. Using these classifiers, we have implemented a simple Web API that provides the vandalism estimate for every revision of the English Wikipedia. The API can be used both to identify vandalism that needs to be reverted, and to select high-quality, non-vandalized recent revisions of any given Wikipedia article. These recent high-quality revisions can be included in static snapshots of the Wikipedia, or they can be used whenever tolerance to vandalism is low (as in a school setting, or whenever the material is widely disseminated).
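The recall, precision, and false positive rate quoted in the abstract are linked through the share of vandalism in the evaluation data, which the abstract does not state. The following is a minimal sketch of that relationship, assuming all three rates were measured on the same evaluation set; the function names are hypothetical and are not part of WikiTrust or its Web API.

```python
def precision_from_rates(recall, fpr, prevalence):
    """Precision implied by recall, false positive rate, and class prevalence."""
    true_pos = recall * prevalence          # vandalism correctly flagged
    false_pos = fpr * (1.0 - prevalence)    # regular edits wrongly flagged
    return true_pos / (true_pos + false_pos)


def implied_prevalence(recall, precision, fpr):
    """Solve the precision formula above for the prevalence of vandalism.

    Assumes recall, precision, and fpr come from the same evaluation set.
    """
    flagged_wrongly = precision * fpr
    return flagged_wrongly / (recall * (1.0 - precision) + flagged_wrongly)


if __name__ == "__main__":
    # Figures quoted in the abstract: (recall, precision, false positive rate).
    reported = {
        "all WikiTrust features": (0.835, 0.485, 0.08),
        "features available at edit time": (0.771, 0.369, 0.122),
    }
    for label, (recall, precision, fpr) in reported.items():
        p = implied_prevalence(recall, precision, fpr)
        print(f"{label}: implied vandalism share = {p:.1%}")
```

Under that assumption, both sets of figures imply a vandalism share of roughly 8 percent of the evaluated revisions, which helps explain why precision stays below 50% even with a false positive rate of 8%.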

  • ABSTRACT: This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia history of revisions. By mining (attribute, value) pairs from the revision history of the English Wikipedia we are able to collect a comprehensive knowledge base that contains data on how attributes change over time. When dealing with the Wikipedia edit history, vandalic and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that we can obtain, on this dataset, an accuracy comparable to a state-of-the-art vandalism identification method that is based on the whole article. Finally, we discuss different characteristics of the extracted dataset, which we make available for further study.
    Language Resources and Evaluation 12/2013; 47(4). DOI:10.1007/s10579-013-9232-5 · 0.52 Impact Factor
  • ABSTRACT: Vandalism, the malicious modification of articles, is a serious problem for open access encyclopedias such as Wikipedia. The use of counter-vandalism bots is changing the way Wikipedia identifies and bans vandals, but their contributions are often not considered nor discussed. In this paper, we propose novel text features capturing the invariants of vandalism across five languages to learn and compare the contributions of bots and users in the task of identifying vandalism. We construct computationally efficient features that highlight the contributions of bots and users, and generalize across languages. We evaluate our proposed features through classification performance on revisions of five Wikipedia languages, totaling over 500 million revisions of over 9 million articles. As a comparison, we evaluate these features on the small PAN Wikipedia vandalism data sets, used by previous research, which contain approximately 62,000 revisions. We show differences in the performance of our features on the PAN and the full Wikipedia data set. With the appropriate text features, vandalism bots can be effective across different languages while learning from only one language. Our ultimate aim is to build the next generation of vandalism detection bots based on machine learning approaches that can work effectively across many languages.
    IEEE Transactions on Knowledge and Data Engineering 07/2014; 27(3). DOI:10.1109/TKDE.2014.2339844 · 1.82 Impact Factor
  • CLEF 2010 LABs and Workshops, Notebook Papers, 22-23 September 2010, Padua, Italy; 01/2010

