Preprint

Talking to oneself in CMC: a study of self replies in Wikipedia talk pages

Authors:

Abstract

This study proposes a qualitative analysis of self replies in Wikipedia talk pages, focusing on cases where the first two messages of a discussion are written by the same user. This pattern occurs in more than 10% of threads containing two or more messages and can be explained by a variety of reasons. After a first examination of the lexical specificities of second messages, we propose a seven-category typology and use it to annotate two reference samples (English and French) of 100 threads each. Finally, we analyse and compare the performance of human annotators (who achieve reasonable overall performance) and instruction-tuned LLMs (which encounter substantial difficulties with several categories).
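
As a rough illustration of the pattern being counted, the following minimal Python sketch flags threads whose first two messages share an author and measures how frequent that opening is. The thread representation (a list of message dicts with a "user" key) and the toy data are assumptions made for illustration, not the authors' actual pipeline or corpus format.

def is_self_reply_thread(thread):
    """True if the first two messages of the thread have the same author."""
    if len(thread) < 2:
        return False
    return thread[0]["user"] == thread[1]["user"]

def self_reply_rate(threads):
    """Proportion of self-reply openings among threads with >= 2 messages."""
    eligible = [t for t in threads if len(t) >= 2]
    if not eligible:
        return 0.0
    return sum(is_self_reply_thread(t) for t in eligible) / len(eligible)

# Toy example: two threads, one of which opens with a self reply.
threads = [
    [{"user": "A", "text": "First message"}, {"user": "A", "text": "Follow-up"}],
    [{"user": "B", "text": "Question"}, {"user": "C", "text": "Answer"}],
]
print(f"Self-reply rate: {self_reply_rate(threads):.0%}")  # 50%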
