Collecting Facebook Posts and WhatsApp Chats
Corpus Compilation of Private Social Media Messages
Lieke Verheijen and Wessel Stoop
Radboud University, Nijmegen, The Netherlands
{lieke.verheijen,w.stoop}@let.ru.nl
Abstract. This paper describes the compilation of a social media corpus
with Facebook posts and WhatsApp chats. Authentic messages were
voluntarily donated by Dutch youths between 12 and 23 years old. Social
media nowadays constitute a fundamental part of youths’ private lives,
constantly connecting them to friends and family via computer-mediated
communication (CMC). The social networking site Facebook and mobile
phone chat application WhatsApp are currently quite popular in the
Netherlands. Several relevant issues concerning corpus compilation are
discussed, including website creation, promotion, metadata collection,
and intellectual property rights/ethical approval. The application that
was created for scraping Facebook posts from users’ timelines, of course
with their consent, can serve as an example for future data collection.
The Facebook and WhatsApp messages are collected for a sociolinguistic
study into Dutch youths’ written CMC, of which a preliminary analysis
is presented, but also present a valuable data source for further research.
Keywords: Computer-mediated communication · Social media ·
New media · Facebook · WhatsApp · Corpus compilation ·
Data collection
1 Introduction
More and more youths around the world, including in the Netherlands, regularly
and frequently use social media such as SMS text messaging, chat, instant messaging,
microblogging, and networking sites in their private lives.
This has raised worries among parents and teachers alike
that the informal, non-standard lingo used by youngsters while communicating
via social media may have a (negative) impact upon their traditional literacy
skills, i.e. writing and reading [1,2]. Before studying the possible effect of uncon-
ventional language use in social media on literacy, it is paramount to know what
that language actually looks like. Yet little is known so far about the exact lin-
guistic manifestation of Dutch social media texts, in terms of key features of
writing such as orthography (spelling), syntax (grammar and sentence struc-
ture), and lexis (vocabulary). As such, a linguistic analysis into Dutch youths’
written computer-mediated communication is an urgent matter for research. To
conduct such a study, an up-to-date corpus of social media texts is of the utmost
© Springer International Publishing Switzerland 2016
P. Sojka et al. (Eds.): TSD 2016, LNAI 9924, pp. 249–258, 2016.
DOI: 10.1007/978-3-319-45510-5_29
importance. This paper describes the compilation of such a social media corpus,
specifically of WhatsApp chats and Facebook posts. Ultimately, the corpus can
help to answer the following questions: how does Dutch youths’ language use on
WhatsApp and Facebook differ from Standard Dutch? And how do WhatsApp
and Facebook messages differ, linguistically speaking, from other new media
genres, such as SMS text messages and tweets?
First, we collected WhatsApp chats. These are private online chats, which
involve typed spontaneous communication in real time between two or more users
of the mobile phone application WhatsApp Messenger. This instant messaging
client, whose name is a contraction of ‘what’s up’ and ‘application’, was released
in 2010 and has since gained enormously in popularity among Dutch smartphone
users. It was acquired by the Facebook company in 2014. Secondly, we
have started collecting status updates, both public and non-public, posted on
Facebook timelines. This social networking service was created in 2004. Its name
comes from the ‘face book’ directories that are often given to university students
in the United States, who were the initial members of this social network. The
personal Facebook timeline was introduced in 2011, when the format of users’
individual profile pages was changed. In this paper, we describe the collection of
these two datasets. To the best of our knowledge, this is the first social media
corpus with Dutch WhatsApp and Facebook messages.
2 Related Work
The corpus compiled for this project is an addition to existing corpora of
computer-mediated communication, in particular SoNaR (‘STEVIN Nederland-
stalig Referentiecorpus’), a freely available reference corpus of written Dutch
containing some 500 million words of text that was built by Dutch and Belgian
computational linguists [3–6]. SoNaR contains a variety of text sources, including
some social media genres, namely online chats, tweets, internet forums, blogs,
and text messages. However, two media that are currently very popular in the
Netherlands are lacking, that is, Facebook and WhatsApp. As such, there is a
great need for the texts collected in the present project.
The creation and analysis of CMC corpora is currently an active research
area. Yet, most projects explore language data that are publicly available, which
are relatively easy to obtain, such as from Twitter, Wikipedia, discussion boards,
or public social networking profiles. CMC corpora with non-public language
data are still scarce: they are more time-consuming and difficult to obtain,
because they require active participation of contributors. The following pioneer-
ing projects are in the vanguard of private social media message collection.
A notable project similar to our WhatsApp data collection is the
‘What’s up, Switzerland?’ project [7,8], a follow-up of the ‘Sms4science’ project
[9]. Researchers from four universities study the language used in Swiss
WhatsApp chats. For this non-commercial large-scale project, over 838,000
WhatsApp messages (about 5 million tokens) by 419 users were collected in 2014.
Contact with the project’s coordinators provided us with information about the
set-up of their data collection; this served as an inspiration for our own collection.
A related project is ‘What’s up, Deutschland?’ [10], conducted by researchers
from seven German universities. They collected over 376,000 WhatsApp mes-
sages by 238 users in 2014 and 2015. Similar to our current project, the ‘What’s
up’ projects compare WhatsApp chats to SMS text messages, and several fea-
tures are investigated, e.g. linguistic structures, spelling, and emoticons/emoji.
Our Facebook data collection is comparable to that of the DiDi project
[11,12]. The DiDi corpus comprises German Facebook writings by 136 voluntary
participants from the Italian province of South Tyrol (around 650,000 tokens).
The corpus was collected in 2013. It contains not just status updates, but also
comments on timeline posts, private messages, and chat conversations. The data
and corresponding metadata were acquired by means of a Facebook web applica-
tion. Their linguistic analysis focuses, among other things, on the use of dialects
and age-related differences in language on social network sites.
3 Creation of Websites
We created two websites to gather WhatsApp chats and Facebook posts (see
http://cls.ru.nl/whatsapptaal and http://cls.ru.nl/facebooktaal), where youths
could donate their own WhatsApp and Facebook messages to science. The
data thus represent authentic, original, unmodified messages that were com-
posed in completely natural, non-experimental conditions. Besides the home
page, the websites contain the tabs ‘Prizes’, ‘Instructions’, ‘Consent’, ‘FAQ’,
‘About us’, and ‘Contact’. These pages present, respectively, information on the
prizes youths can win by contributing their social media messages to the research
project, instructions on how they can submit their messages, consent forms that
they should sign for us to be allowed to use the data, frequently asked questions,
brief info about ourselves (the researchers), and a contact form.
The main difference between the two websites for gathering social media data
is that the WhatsApp collection website includes an ‘Instructions’ page with
extensive explanations on how to submit chats depending on one’s mobile phone
type (Android, iPhone, or Windows Phone), whereas the Facebook collection
website prominently features a button for donating messages. This difference
stems from the technical possibilities of submitting messages: while WhatsApp
chats can be sent via email from a mobile phone (to an email address created
specifically for the purposes of this data collection, whatsapp-taal@let.ru.nl),
Facebook posts cannot easily be submitted by users themselves, so we retrieved
them by means of a self-built application.
4 Creation of Application
To automatically retrieve posts from volunteering youths’ Facebook timelines,
we created a Facebook app - a piece of software that has access to data stored
by Facebook via the Facebook Graph API (application programming interface,
https://developers.facebook.com/docs/graph-api). In practice, this means that
users only have to click on a button on our website, telling the app to make a
connection to Facebook, to collect their posts, and to save them in our database.
To protect the privacy of its users, Facebook has installed two layers of secu-
rity with which the app needed to deal. The first layer entails that the volunteer-
ing user needs to allow the app access to every piece of information it collects.
Facebook calls this allowance of access to personal data a ‘permission’. Two
permissions were required for our purpose: user_birthday (to make sure that we
collected posts of youths of the intended ages) and user_posts. Users grant these
permissions directly after they click the button on our website: a pop-up window
appears which first asks them to log in to Facebook and then explains what
the app will have access to if they proceed.
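The age check that the user_birthday permission makes possible can be sketched as follows. This is a minimal illustration, not the code of the app itself: the helper names are hypothetical, the access token is a placeholder, and the sketch assumes that the Graph API returns the birthday as an MM/DD/YYYY string (users who hide their birth year are treated as not verifiable).

```python
from datetime import date, datetime
from typing import Optional
from urllib.parse import urlencode

GRAPH_URL = "https://graph.facebook.com"  # Graph API host


def build_request_url(path: str, access_token: str, fields: str = "") -> str:
    """Build a Graph API GET URL, e.g. for /me?fields=birthday or /me/posts."""
    params = {"access_token": access_token}
    if fields:
        params["fields"] = fields
    return f"{GRAPH_URL}/{path.lstrip('/')}?{urlencode(params)}"


def age_on(birthday: str, today: date) -> Optional[int]:
    """Return the age in years, or None if the birth year is not shared.

    Assumption: the birthday arrives as MM/DD/YYYY; other shapes
    (such as MM/DD only) cannot be used to verify the age.
    """
    try:
        born = datetime.strptime(birthday, "%m/%d/%Y").date()
    except ValueError:
        return None
    # Subtract one year if this year's birthday has not yet passed.
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - before_birthday


def is_eligible(birthday: str, today: date, low: int = 12, high: int = 23) -> bool:
    """Check whether a contributor falls within the intended age range."""
    age = age_on(birthday, today)
    return age is not None and low <= age <= high
```

In such a set-up, the posts would only be requested (for instance via build_request_url("me/posts", token)) after is_eligible has confirmed the contributor's age.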
The second security layer entails that Facebook itself needs to allow the app
to ask for permissions. During development, the app only worked for a predefined
set of Facebook users for testing purposes; users that were not part of this set
could not grant any permissions and thus could not donate their data with the app. To
make the app available to all Facebook users, it had to be manually reviewed by
a Facebook employee. Our app was accepted only after we made clear that it is
of value to Facebook users because it enables volunteering users to effortlessly
donate their posts without having to manually copy and paste these one by one.
The source code of the app can be found at
https://github.com/Woseseltops/FB-data-donator. It can easily be adjusted to make
another app that collects other user data in a similar way.
5 Promotion of Websites
The websites for collecting social media messages were promoted through free
publicity in Dutch media. The data collection attracted considerable media attention,
which resulted in newspaper publications, both regional (de Gelderlander) and national
(AD.nl), radio interviews on regional (RTV Noord-Holland, Studio 040) and
national (De Taalstaat, NPO Radio 1, 3FM, Radio FunX) stations, and television
interviews on regional (NimmaTV) and national (Rtl4) TV. University
and student magazines reported on the data collection too (Vox, ANS). In addition,
it was advertised in the digital newsletters of Onze Taal (the Dutch society
for language buffs): Taalpost for adults and TLPST for adolescents. The data
collection was also promoted via the Radboud University’s web pages and by
researchers via social media channels, in particular Twitter and Facebook. We
further promoted it during lectures and master classes for young audiences, i.e.
students in secondary and tertiary education. Our aim was to promote the web-
sites nationwide, in order to gather a representative sample of messages from
youngsters throughout the country.
In order to stimulate youths to contribute their social media messages to our
project, we decided to raffle off prizes: gift certificates worth 100, 50, and
20 euros. With respect to WhatsApp, individual contributors’ odds of winning
a prize increased as they sent in more chat conversations. We felt that this raffle
was necessary to stimulate youths to donate their private messages to the corpus.
Importantly, it was emphasized on the websites that only those contributors who
completely filled in the consent form stood a chance of winning the prizes. This
was made explicit to motivate youths to give their informed consent.
6 Metadata
All WhatsApp chats and Facebook posts in our social media corpus are accom-
panied by a substantial amount of sociolinguistic information. Via the websites,
the following metadata were obtained: name, place of residence, place and date of
birth, age, gender, and educational level, as well as date and place of submission.
These parameters are useful for sociolinguistic research, since they enable one to
study the language use of different social groups in WhatsApp and Facebook.
7 IPR Issues and Ethical Approval
Intellectual property rights (IPR) were obtained by consent of both the Face-
book company and individual contributors of Facebook and WhatsApp mes-
sages, since it is key to safeguard the authors’ rights and interests [5] (p. 2270).
For underage contributors, between 12 and 17 years old, written consent was also
obtained from one of their parents or guardians. By signing the consent web form,
contributors declared the following:
- to have been informed about the purpose of the study;
- to have been able to ask questions about the study;
- to understand how the data from the study will be stored and to what ends
  they will be used;
- to have considered if they want to partake in the study;
- to voluntarily participate in the study.
Additionally, parents or guardians also declared:
- to be aware of the contents of their child’s messages;
- to agree with their child’s participation in the study.
Participants and their parents/guardians gave full permission for their
(child’s) submitted messages (i) to be used for scientific research and educa-
tional purposes; (ii) to be stored in a database, according to Radboud Univer-
sity’s rules, and to be kept available for scientific research, provided they are
anonymised and in no way traceable to the original authors; and (iii) to be used
in scientific publications and meetings. If messages appear in publications or
presentations, no parts that may harm the participants’ interests will be made
public.
Furthermore, ethical approval was obtained from our institution’s Ethical
Testing Committee (ETC). For the WhatsApp chats, it was crucial for the ETC
that messages of conversation partners were deleted, since they have not given
consent for the use of their messages. Accordingly, interlocutors’ WhatsApp
messages were immediately discarded. This procedure was explained on the FAQ
page of the websites. In accordance with the ETC’s further guidelines, we added
downloadable information documents on the home pages.
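The discarding of interlocutors’ messages can be illustrated with a short sketch. It is not the procedure actually used for the corpus: the line format below ("12-05-2015 14:32 - Name: message") is an assumed example, since WhatsApp exports differ per platform and language setting, and the function name and sample names are hypothetical.

```python
import re
from typing import List

# Assumed export line format: "12-05-2015 14:32 - Name: message text"
LINE_RE = re.compile(r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}) - ([^:]+): (.*)$")


def keep_donor_messages(chat_text: str, donor: str) -> List[str]:
    """Return only the donor's messages; all other senders are dropped."""
    kept: List[str] = []
    keeping = False  # is the message currently being read the donor's?
    for line in chat_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            _timestamp, sender, text = match.groups()
            keeping = sender == donor
            if keeping:
                kept.append(text)
        elif keeping and kept:
            # A line without a header continues the previous message.
            kept[-1] += "\n" + line
    return kept


chat = (
    "12-05-2015 14:32 - Anna: hoi\n"
    "12-05-2015 14:33 - Tim: heyy\n"
    "alles goed?\n"
    "12-05-2015 14:34 - Anna: ja hoor\n"
    "en jij?"
)
donor_only = keep_donor_messages(chat, "Anna")
```

Tracking the current sender matters because a message may span several lines; without it, the continuation line of an interlocutor’s message would slip into the corpus.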
8 Current Corpus Composition
The collection period of WhatsApp messages lasted from April until December
2015; the collection of Facebook messages started in December 2015. Up to
the time of writing, over 332,000 word tokens of WhatsApp chats have been
collected from youths between the ages of 12 and 23. By comparison, the
SoNaR subcorpora with texts by youths up to 20 years old from the Netherlands
contain 44,012 word tokens in the SMS corpus (6.08% of the total number of
words of that corpus), 219,043 in the chat corpus (29.7% of the total), and 2,458,904
in the Twitter corpus (10.6% of the total). The scale of this corpus makes it suitable
for fine-grained (manual) linguistic studies; it is not intended as a training data
set for large-scale computational research.
We excluded chain messages from our corpus. Also not included were any
visual or audio materials: since the study that prompted the data collection is
completely linguistic in nature, images, videos, and sound files were not gath-
ered, so the corpus is wholly textual rather than multimodal. Another deciding
factor in asking contributors not to add media files when sending WhatsApp
conversations from their smartphones is that adding them may prevent mails
from arriving due to an exceeded data limit. More importantly, issues of copy-
right and privacy protection would make any inclusion of pictures, videos, or
sounds highly problematic. The messages are stored as one WhatsApp chat conversation
per file. Table 1 shows demographic details on the data collected so
far, focusing on the age and gender distribution.
Table 1. Composition of WhatsApp dataset.

               Contributors     Conversations    Words
               #      %         #      %         #          %
Adolescents    11     32.4      83     38.6      63,217     19.0
Young adults   23     67.6      132    61.4      269,440    81.0
Male           12     35.3      71     33.0      98,201     29.5
Female         22     64.7      144    67.0      234,456    70.5
Total          34     100       215    100       332,657    100
For the WhatsApp dataset, a relatively small number of youths (34) have
contributed large quantities of data. At the time of writing, the number of
contributors of Facebook posts was already considerably greater: 94 contributors,
who together contributed 171,693 words. This difference may stem from the submission
procedure: while users were asked to submit WhatsApp chats via separate emails,
which required taking several steps on their mobile phones, they could easily
submit all their Facebook posts with the click of a button. Young adults (18–23
years old, avg. age 20.1) submitted many more WhatsApp messages than ado-
lescents (12–17, avg. age 14.4), not only in terms of number of contributors, but
also in terms of number of conversations as well as words. The average age of
all contributors was 18.3. In terms of gender, a higher percentage of WhatsApp
chat contributors are female, with about two thirds girls versus one third boys
(a distribution similar to that for donated text messages as reported in [4]). This
corresponds to the percentages of words and conversations that were submitted
by male versus female contributors.
9 Preliminary Data Analysis
This section presents the first findings of a linguistic corpus study of Dutch
youths’ WhatsApp chats. Their language use in social media often differs from
Standard Dutch, in various dimensions of writing. A striking orthographic feature
of written CMC is the use of textisms: unconventional spellings of various kinds. We
conducted a quantitative register analysis into the frequency of textisms, and
investigated how the independent variable age group affects this linguistic feature
by distinguishing between WhatsApp messages of adolescents and young adults.
The following textism types were found (presented here with Dutch examples):
- textisms with letters:
  - initialism: first letters of each word/element in a compound word, phrase,
    sentence, or exclamation, e.g. hvj (hou van je), omg (oh mijn God)
  - contraction: omission of letters (mostly vowels) from middle of word, e.g.
    vnv (vanavond), idd (inderdaad)
  - clipping: omission of final letter of word, e.g. lache (lachen), nie (niet)
  - shortening: dropping of ending or occasionally beginning of word, e.g.
    miss (misschien), wan (wanneer)
  - phonetic respelling: substitution of letter(s) of word by (an)other letter(s),
    while applying accurate grapheme-phoneme patterns of the standard language,
    e.g. ensow (enzo), boeiuh (boeien), okeej (oké), egt (echt)
  - single letter/number homophone: substitution of word by phonologically
    resembling or identical letter/number, e.g. n (een), t (het), 4 (for)
  - alphanumeric homophone: substitution of part of word by phonologically
    resembling or identical letter(s)/number(s), e.g. suc6 (succes), w88
    (wachten)
  - reduplication: repetition of letter(s), e.g. neeee (nee), superrr (super)
  - visual respelling: substitution of letter(s) by graphically resembling
    non-alphabetic symbol(s), e.g. Juli@n (Julian), c00l (cool)
  - accent stylisation: words from casual, colloquial, or accented speech
    spelled as they sound, e.g. hoezut (hoe is het), lama (laat maar)
  - inanity: other, e.g. laterz (later)
  - standard language abbreviations, e.g. aug (augustus), bios (bioscoop)
- textisms with diacritics:
  - missing, e.g. carriere (carrière), ideeen (ideeën), enquete (enquête)
- textisms with punctuation:
  - missing, e.g. mn (m’n), maw (m.a.w.), ovkaart (ov-kaart)
  - extra, e.g. stilte-coupé (stiltecoupé)
  - reduplication, e.g. !!!!!, ??, ..........
- textisms with spacing:
  - missing (in between words), e.g. hahaokeeedan (haha oké dan)
  - extra (in between elements of compound words), e.g. fel groen (felgroen)
- textisms with capitalisation:
  - missing (of proper names, abbreviations), e.g. tim (Tim), ok (OK)
  - extra, e.g. WOW (wow)
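Some of these types lend themselves to automatic pre-selection. The sketch below flags candidates for two of them, reduplication of letters and reduplication of punctuation, with regular expressions. It is only an illustration under our own assumptions (a threshold of three identical letters, and hypothetical function and pattern names); it does not describe how the textisms in this study were identified, and its output would still need manual checking, since a regular ellipsis, for instance, is flagged as well.

```python
import re
from typing import Dict, List

# Assumption: three or more identical letters in a row signal reduplication,
# because Standard Dutch spelling allows double letters (e.g. "nee").
LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}")
# Two or more identical punctuation marks in a row.
PUNCT_RUN = re.compile(r"([!?.])\1+")


def flag_reduplication(message: str) -> Dict[str, List[str]]:
    """Return candidate tokens for letter and punctuation reduplication."""
    letters = [tok for tok in message.split() if LETTER_RUN.search(tok)]
    punctuation = [m.group(0) for m in PUNCT_RUN.finditer(message)]
    return {"letters": letters, "punctuation": punctuation}


flags = flag_reduplication("neeee dat is superrr leuk!!!!! echt??")
```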
Figure 1 shows the results for the textisms, separating adolescents from young
adults. The frequencies shown here have been standardised per 10,000 words,
because the total number of words differs per age group in the WhatsApp
dataset. The figure makes clear that textisms with letters were by far the most
frequent in the WhatsApp chats. It also shows an age-based distinction: while
textisms with diacritics, capitalisation, punctuation, and spacing occurred with
more or less similar frequencies in the WhatsApp messages of the two age groups,
those with letters were used much more by adolescents. Their greater use of
orthographic deviations may be attributed to a desire to rebel against societal
norms, including the standard language norms, and to play with language: the
most non-conformist linguistic behaviour is said to occur around the ages of
15/16, when the ‘adolescent peak’ occurs. Young adults, on the other hand, may
feel more social pressure to conform to norms set by society, also those about
language.
Fig. 1. Five types of textisms in WhatsApp dataset.
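The standardisation per 10,000 words amounts to dividing a raw count by the number of words in the subcorpus and multiplying by 10,000. A minimal sketch follows; the word totals are taken from Table 1, whereas the raw count of 1,000 is a made-up figure used only to show the effect, not a result of this study.

```python
def per_10k(count: int, total_words: int) -> float:
    """Standardise a raw frequency to occurrences per 10,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count / total_words * 10_000


# Word totals per age group, from Table 1.
WORDS = {"adolescents": 63_217, "young adults": 269_440}

# Hypothetical raw count, identical for both groups, for illustration only.
raw_count = 1_000
standardised = {group: per_10k(raw_count, total) for group, total in WORDS.items()}
```

With the same raw count in both groups, the standardised frequency of the adolescents is more than four times that of the young adults, which is why raw counts cannot be compared directly across subcorpora of different sizes.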
This preliminary analysis is part of a larger in-depth linguistic study
of a broad range of linguistic features in WhatsApp chats. The study focuses on
orthography (misspellings, typos, emoticons, symbols), syntax (omissions;
complexity), and lexis (borrowings, interjections; diversity, density). Other lexical
features that may be interesting for online youth communication are, for exam-
ple, swearwords, intensifiers, and hyperbolic expressions. The WhatsApp data
will be compared to the Facebook data, as well as to instant messages, text
messages, and microblogs of the SoNaR corpus. This can reveal to what extent
deviations from the standard language norms in CMC depend not just on indi-
vidual user characteristics such as age, but also on genre characteristics.
10 Conclusions
The central role currently played by CMC in (especially) youths’ lives makes
social media corpora quite valuable for state-of-the-art sociolinguistic research.
This paper discussed the compilation of such a corpus in the Netherlands.
WhatsApp chats and Facebook posts were contributed by Dutch youths from
12 to 23 years old. This paper has made clear that a data collection method of
voluntary donations, with the added incentive of a prize raffle, can yield a fair
amount of data if sufficient public attention is obtained through e.g. media cov-
erage. We have presented websites created for this purpose, and have explained
how such websites can be promoted. The importance of collecting metadata and
obtaining written consent and ethical approval has been stressed. Crucially, the
application we created to gather Facebook posts, along with the process of gaining
consent from the Facebook company, can serve as a model for future corpus
builders.
11 Future Work
Eventually, if the WhatsApp and Facebook data are processed in a similar fash-
ion as the rest of SoNaR, they can be incorporated into the corpus together
with their metadata. This would require format conversion, tokenisation, and
anonymisation: the data should be (a) converted into the FoLiA XML format,
which was developed for linguistic resources, (b) tokenised by UCTO, a tokeniser
adapted for social media, and (c) anonymised, if possible automatically, so that
they contain no personal/place names, (email) addresses, telephone numbers, or
bank accounts. Such additional processing was beyond the scope of the present
project, but particularly data anonymisation is essential if the WhatsApp chats
and Facebook messages are shared with the wider scientific community and
become available for further research into social media texts. It would also be
useful to apply part-of-speech tagging to this corpus. Moreover, we recognise the
need for multimodal social media corpora: the next step in sociolinguistic social
media research may be to focus on multimodality, given the increased options
for incorporating visual materials (photographs, emoji, videos, etc.) and the use
thereof in computer-mediated communication. The number of contributors so
far suggests that youths remain hesitant to donate their private, often intimate,
social media messages to science, despite significant gift certificates; perhaps a
larger corpus could be obtained by even more publicity or even greater prizes.
Nonetheless, albeit monomodal and of modest scale, the present corpus with its
metadata can be a vital resource and an example of how social media texts can
be collected for linguistic, sociological, or other research.
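A first step towards the automatic anonymisation mentioned above could look as follows. This sketch was not part of the present project and rests on our own assumptions: the patterns cover only email addresses and Dutch-style telephone numbers, the placeholder labels are hypothetical, and personal names, place names, street addresses, and bank accounts would require further resources such as named entity recognition.

```python
import re

# Simplified email pattern; not a full address validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Assumed Dutch phone formats: a prefix (+31, 0031, or 0) followed by nine
# digits, optionally separated by spaces or hyphens, e.g. 06-12345678.
PHONE_RE = re.compile(r"(?<!\d)(?:\+31|0031|0)[\s-]?\d(?:[\s-]?\d){8}(?!\d)")


def anonymise(text: str) -> str:
    """Replace email addresses and phone numbers by placeholder labels."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


masked = anonymise("mail me op jan@example.com of bel 06-12345678")
```

Replacing sensitive items by labels rather than deleting them keeps the sentence structure intact, which matters for later syntactic analysis.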
Acknowledgments. This research was funded by a grant from the Dutch Organisation
for Scientific Research (NWO), under project number 322-70-006. Special thanks are
due to Iris Monster, who constructed the WhatsApp website. Thanks also go to Wilbert
Spooren and Ans van Kemenade, the supervisors of Lieke’s PhD project. Finally, we
thank all contributors of WhatsApp and Facebook messages to our corpus.
References
1. Thurlow, C.: From statistical panic to moral panic: the metadiscursive construction
and popular exaggeration of new media language in the print media. J. Comput.-
Mediated Commun. 11(3), 667–701 (2006)
2. Postma, K.: Geen paniek! Een analyse van de beeldvorming van sms-taal in
Nederland. Master thesis, VU University Amsterdam (2011)
3. Sanders, E.: Collecting and analysing chats and tweets in SoNaR. In: Proceedings
LREC (Language Resources and Evaluation) 2012, pp. 2253–2256 (2012)
4. Treurniet, M., Sanders, E.: Chats, tweets and SMS in the SoNaR corpus: social
media collection. In: Newman, D. (ed.) Proceedings of the 1st Annual International
Conference Language, Literature & Linguistics, pp. 268–271. Global Science and
Technology Forum, Singapore (2012)
5. Treurniet, M., De Clercq, O., van den Heuvel, H., Oostdijk, N.: Collecting a corpus
of Dutch SMS. In: Proceedings LREC 2012, pp. 2268–2273 (2012)
6. Oostdijk, N., Reynaert, M., Hoste, V., Schuurman, I.: The construction of a
500-million-word reference corpus of contemporary written Dutch. In: Spyns, P.,
Odijk, J. (eds.) Essential Speech and Language Technology for Dutch: Results by
the STEVIN Programme, pp. 219–247. Springer, Heidelberg (2013)
7. Dürscheid, C., Frick, K.: Keyboard-to-Screen-Kommunikation gestern und heute:
SMS und WhatsApp im Vergleich. In: Mathias, A., Runkehl, J., Siever, T. (eds.)
Sprachen? Vielfalt! Sprache und Kommunikation in der Gesellschaft und den
Medien. Eine Online-Festschrift zum Jubiläum von Peter Schlobinski, pp. 149–181.
Networx 64, Hannover (2014)
8. Stark, E., Ueberwasser, S., Diémoz, F., Dürscheid, C., Natale, S., Thurlow, C.,
Siebenhaar, B.: What’s up, Switzerland? Language, individuals and ideologies in
mobile messaging. Universität Zürich, Universität Bern, Université de Neuchâtel,
Universität Leipzig (2015). http://www.whatsup-switzerland.ch
9. Stark, E., Ueberwasser, S., Dürscheid, C., Béguelin, M.J., Moretti, B., Grünert, M.,
Gazin, A.-D., Pekarek Doehler, S., Siebenhaar, B.: Sms4science. Universität Zürich,
Université de Neuchâtel, Universität Bern, Universität Leipzig (2015).
http://www.sms4science.uzh.ch
10. Siebenhaar, B., et al.: What’s up, Deutschland? WhatsApp-Nachrichten
erforschen. Universität Leipzig, Technische Universität Dortmund, Technische
Universität Dresden, Leibniz Universität Hannover, Universität Mannheim,
Universität Koblenz-Landau, Universität Duisburg-Essen (2016).
http://www.whatsup-deutschland.de
11. Frey, J.-C., Stemle, E.W., Glaznieks, A.: Collecting language data of non-public
social media profiles. In: Faaß, G., Ruppenhofer, J. (eds.) Workshop Proceedings
of the 12th Edition of the KONVENS Conference, pp. 11–15. Universitätsverlag,
Hildesheim (2014)
12. Frey, J.-C., Glaznieks, A., Stemle, E.W.: The DiDi corpus of South Tyrolean CMC
data. In: Beißwenger, M., Zesch, T. (eds.) Proceedings of the 2nd Workshop of
the Natural Language Processing for Computer-Mediated Communication/Social
Media, pp. 1–6. University of Duisburg-Essen (2015)
... This aspect confronts us with one of the main problems of exporting conversations directly to institutional mailboxes (as in the case of Sampietro, 2016 andVerheijen &Stoop, 2016 ): the participants' lack of control over their data. In this regard, Sampietro (2016) notes that two participants in her data collection opted to send the conversation history to their personal email, delete a fragment they did not want to share, and then send part of the chat to the researcher. ...
... This aspect confronts us with one of the main problems of exporting conversations directly to institutional mailboxes (as in the case of Sampietro, 2016 andVerheijen &Stoop, 2016 ): the participants' lack of control over their data. In this regard, Sampietro (2016) notes that two participants in her data collection opted to send the conversation history to their personal email, delete a fragment they did not want to share, and then send part of the chat to the researcher. ...
... 3. According toVerheijen and Stoop (2016), this corpus was hosted at http://www. whatsupdeutschland.de/ . ...
Article
Full-text available
The collection of datasets from real interactions is an unavoidable step in many research works aiming to understand language use. In the field of digital discourse analysis, data collection is complex due to the fast-paced changes in the applications and the ethical decisions involved. This work has two goals. First, we seek to give an overview of the literature on datasets of digital exchanges via WhatsApp. Second, we aim to systematize the different sampling techniques used in previous research. To this end, we applied content analysis to 100 research articles and theses retrieved from open access portals. We conducted a descriptive analysis that included the amount of data collected, the technique employed in the collection of the data, the method used to contact participants, and the online access to the linguistic corpora, among other variables. The results show the existence of some annotated and available corpora in languages other than Spanish. In addition, most of the literature combines different techniques to collect a wide set of linguistic and multimodal data. Finally, we systematize the main methodological alternatives for data collection from digital interactions via WhatsApp, with the participant observation method standing out. Keywords: digital discourse; linguistic corpus; instant messaging; digital interaction.
... Data from WhatsApp provide high resolution interactional data, require little effort on the side of study participants, can be collected retrospectively from a natural setting, and objectively quantify behavior in interpersonal interactions. Consequently, WhatsApp data have been used to investigate diverse research questions in linguistics (Verheijen and Stoop 2016; Ueberwasser and Stark 2017; Dürscheid and Frick 2014), political science (Resende et al. 2019a; Narayanan et al. 2019; Garimella and Tyson 2018), education (Costa-Sánchez and Guerrero-Pico 2020; Rosenberg and Asterhan 2018), and research about social relationships (Aharony 2015; García-Gómez 2018). However, despite their potential, data from WhatsApp are a methodical and technical challenge for most researchers because they require special considerations for data collection, participant incentivization, data processing, informed consent, anonymization, and reproducibility of research. ...
... However, in the context of WhatsApp data donation, it has not seemed to be effective so far (cf. Verheijen and Stoop 2016). Another approach to motivate research participants to donate data is to provide insight about their own behavior through tailored feedback. ...
... Second, new members who join the group after researchers posted their announcement cannot see it and are unaware of their data being used for research purposes. Other studies rely on explicit opt-in procedures to guarantee informed consent, often by combining survey studies with voluntary data donation (Seufert et al. 2015; 2016; Verheijen and Stoop 2016; Ueberwasser and Stark 2017; Schwind and Seufert 2018). Typically, participants in these studies are asked to export the chat logs on their own phones and send them to the researchers via email. ...
Chapter
In this chapter, we will first give a brief overview of the mobile instant messaging landscape. Subsequently, we focus on the instant messaging application “WhatsApp” and describe its current features and which kinds of data can be extracted from it. Based on the existing literature, we provide practical advice for researchers seeking to work with WhatsApp data with respect to data collection, participant incentivization, data processing, informed consent, anonymization, and reproducibility of research. These insights might also prove useful to researchers seeking to work with other kinds of chat log data. We conclude that WhatsApp is an intriguing data source for social science research questions but that the data have to be treated with great caution to ensure ethical conduct. To facilitate this, we present several issues to contemplate for designing studies and briefly introduce the “WhatsR” package for R - our own package for parsing and visualizing data from exported WhatsApp chat logs with convenience features for tailoring, anonymizing, and extracting metadata from them.
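To make the processing steps named in this abstract (parsing, anonymizing) more concrete, the following is a minimal Python sketch. It is not the WhatsR package or its API. The line format, the `Message` type, and the function names `parse_chat` and `pseudonymize` are all assumptions made for illustration; real WhatsApp exports differ by platform, locale, and app version.

```python
import re
from typing import NamedTuple

# Assumed line format (hypothetical; real exports vary):
#   31/12/2015, 22:15 - Alice: Happy new year!
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+): (.*)$")


class Message(NamedTuple):
    date: str
    time: str
    sender: str
    text: str


def parse_chat(lines):
    """Parse chat lines; a line without a header continues the previous message."""
    messages = []
    for line in lines:
        line = line.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            messages.append(Message(*match.groups()))
        elif messages:
            last = messages[-1]
            messages[-1] = last._replace(text=last.text + "\n" + line)
    return messages


def pseudonymize(messages):
    """Replace sender names with stable placeholders, in order of first appearance."""
    aliases = {}
    result = []
    for msg in messages:
        alias = aliases.setdefault(msg.sender, f"Person_{len(aliases) + 1}")
        result.append(msg._replace(sender=alias))
    return result


sample = [
    "31/12/2015, 22:15 - Alice: Happy new year!",
    "and a second line",
    "31/12/2015, 22:16 - Bob: Thanks: you too",
]
chat = pseudonymize(parse_chat(sample))
```

This sketch only replaces the sender field. Names, phone numbers, or places mentioned inside the message text would remain and would need separate treatment before any data could be considered anonymized.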
... Not revoking consent might thus be a consequence of not having read the message instead of consenting to contributing data to the research project. Other studies rely on explicit opt-in procedures to guarantee informed consent, often by combining survey studies with voluntary data donation (Seufert et al. 2015; Verheijen and Stoop 2016; Ueberwasser and Stark 2017; Schwind and Seufert 2018). Typically, participants in these studies are asked to export the chat logs on their own phones and send them to the researchers via email. ...
Chapter
The aim of this chapter is to introduce and describe how digital technologies, in particular smartphones, can be used in research in two areas, namely (i) to conduct personality assessment and (ii) to assess and promote physical activity. This area of research is very timely, because it demonstrates how the ubiquitously available smartphone technology—next to its known advantages in day-to-day life—can provide insights into many variables, relevant for psycho-social research, beyond what is possible within the classic spectrum of self-report inventories and laboratory experiments. The present chapter gives a brief overview on first empirical studies and discusses both opportunities and challenges in this rapidly developing research area. Please note that the personality part of this chapter in the second edition has been slightly updated.
... Finally, and most important for this paper, collecting and processing WhatsApp chat log data is a massive logistical challenge for scientists, because they require a custom infrastructure to be collected, processed, and stored in a secure, anonymous, ethically reflected, and General Data Protection Regulation (GDPR)-compliant manner. Previous studies that have used donated WhatsApp chat logs (Ueberwasser & Stark, 2017; Verheijen & Stoop, 2016) typically do not publicize the tools and infrastructure they used for data collection, so that the practical hurdles for collecting donated WhatsApp chat logs are usually prohibitively high for most research projects. ...
... While other R packages (e.g., Gruber, 2023) exist for parsing extracted chat logs, this wide range of features for enabling scientific data collection is so far unique to the best of our knowledge. As such, the WhatsR package is agnostic to how WhatsApp chat logs were collected and can be used in different data collection paradigms, from donated chat logs (Seufert et al., 2015; Ueberwasser & Stark, 2017; Verheijen & Stoop, 2016) to researchers joining public chat groups (Garimella & Tyson, 2018; Machado et al., 2019; Melo et al., 2019; Narayanan et al., 2019; Resende et al., 2019), or to artificially created chats in experiments (Sprugnoli et al., 2018). In total, the WhatsR package contains 20 functions that can be grouped into functions for processing chat logs, visualizing results, computing summary statistics, keeping the package up to date, and simulating artificial chat logs for testing purposes (see Table 2). ...
Article
In this paper, we present ChatDashboard, a framework for collecting, linking, and processing donated WhatsApp chat log data. The framework consists of the WhatsR R package for parsing, anonymizing, and preprocessing donated WhatsApp chat logs, the ChatDashboard R Shiny web app for uploading, reviewing, and securely donating WhatsApp chat logs, and DashboardTester, an automated script for testing the correct setup of the framework by simulating participants. With ChatDashboard, researchers can set up their own data collections to gather transparently donated WhatsApp chat log data from consenting participants and link them to survey responses. It enables researchers to retrospectively collect highly granular data on interpersonal interactions and communication without building their own tools from scratch. We briefly discuss the advantages of donated WhatsApp chat log data for investigating social relationships and provide a detailed explanation of the ChatDashboard framework. Additionally, we provide a step-by-step guideline in the supplementary materials for researchers to set up their own data donation pipelines.
... regular participants of these migrant Facebook groups. Hence, Verheijen and Stoop's (2016) solution would not suit our research aim. ...
Article
Based on the heuristics proposed by Helen Nissenbaum to assess ethical issues surrounding research using new technologies, this paper discusses the ethics of the collection and analysis of migrants' digital traces for academic research purposes. Concretely, this paper is grounded on an empirical research that applies a topic modeling approach to a large dataset of migrants' posts written on Facebook groups. After discussing the nine aspects proposed by Nissenbaum, the paper contends that while researchers strive to comply with ethical measures by, for instance, asking adequate questions and protecting the collected data, the lack of transparency of social networking sites is harmful to critical social sciences and can hamper findings that contribute to understanding migratory patterns and decisions.
Article
Social media is a term with which most people around the world are well acquainted. The advancement of technology has provided a new medium through which we can propose, deliver, swap, and share our ideas without moving a single inch. It is a new avenue for conveying information and a trend that is currently in vogue. From infants to adults, everyone is in some way in contact with social media. The education system, too, is profoundly influenced by social media: from placement institutes, school authorities, teachers, and learners to parents, every stakeholder in the education system is in some way tied to it. Jeff Bezos, CEO of Amazon.com, once described the power of social media by asserting that “If you make customers unhappy in the physical world, they might each tell 6 friends. If you make customers unhappy on the Internet, they can each tell 6,000 friends” (Pencak 2019). This illustrates the potency and status of social media in our lives. Social media affects many significant areas of human life, and language, itself regarded as a ‘systematic means of communication’, is also being swayed by this virtual medium. Social media has strongly affected English language skills. The paper explores how social media has influenced the linguistic habits of millennials, whether it has affected upcoming academicians in a positive or negative way, and what should be done to protect their linguistic habits from the negative influence of social media.
Article
Collecting datasets of real interactions is an unavoidable step in many studies that seek to understand language use. In the field of digital discourse analysis, this is complex both because of the changing characteristics of the applications and because of the ethical decisions involved. This article has a twofold aim: first, to offer a state of the art on datasets of digital exchanges via WhatsApp and, second, to systematize the different techniques for collecting such samples used in previous research. The methodology is a content analysis of one hundred theses and research articles retrieved from scientific portals. A descriptive analysis was carried out that considered, among other variables, the amount of data collected, the data collection technique used, the way participants were contacted, and online access to the linguistic corpora. The results show the existence of some annotated and available corpora in languages other than Spanish. In addition, most of the prior work combines different techniques to collect a broad set of linguistic and multimodal data. Accordingly, the main methodological alternatives for collecting data from digital interactions via WhatsApp are systematized.
Article
This article deals with some of the distinguishing stylistic features of Generation X's SMS Afrikaans and how these features are used to create social meaning. The aim is to fill an obvious gap in the knowledge base on the SMS language of a neglected target group. The theoretical point of departure is Coupland's (2007) claim that language is “performed” to create social meaning. The source of the language data is a WhatsApp chat group that was initially created to make the arrangements for a high school reunion, but whose content soon expanded to cover a variety of topics in natural language. The stylistic features were extracted on the basis of, among others, morphological, phonological, orthographic, and typographic phenomena that reflect the spoken-language basis of SMS. An interplay of Thurlow's (2003) sociolinguistic maxims of SMS language (brevity and speed, paralinguistic restitution, and phonological approximation) appears on the surface to explain the form of Generation X's SMS Afrikaans, but a full account requires attention to the sociopragmatic function(s) of the participants' stylistic choices. This leads to the insight that the distinguishing stylistic features serve the creation of social meaning, which is reflected in the way the participants use the linguistic resources at their disposal on WhatsApp to create social identity and conduct their online relationships. Keywords: emoji; Generation X; short message services; SMS; SMS Afrikaans; creation of social meaning; sociolinguistics; sociopragmatics; stylistic features; language as performance; WhatsApp
Article
The pervasiveness of the English language in society and education in the Netherlands, as well as its status as online lingua franca, has caused concerns. English is a manifest aspect of oral youth language, as reflected in online written messages, but has in no way replaced the Dutch language. This paper presents a large-scale corpus analysis of computer-mediated communication by Dutch youths. These social media messages (392,169 words in total) were studied for the presence of code-mixing with English, in terms of amount and manner. They contained 7528 English elements: (parts of) words, interjections, textisms (typical of ‘digi-talk’), phrases, and sentences. We argue that the concept of ‘manifold code-mixing’, consisting of four pathways – discourse framing, insertion, alternation, and integration – is necessary to truly comprehend the complexity and social meaning of code-mixing. These pathways relate to the SUPER-functions of textisms (speechlike, understandable, playful, expressive, reduced) and reveal Dutch youths' high proficiency in English.
Chapter
The construction of a large and richly annotated corpus of written Dutch was identified as one of the priorities of the STEVIN programme. Such a corpus, sampling texts from conventional and new media, is invaluable for scientific research and application development. The present chapter describes how in two consecutive STEVIN-funded projects, viz. D-Coi and SoNaR, the Dutch reference corpus was developed. The construction of the corpus has been guided by (inter)national standards and best practices. At the same time through the achievements and the experiences gained in the D-Coi and SoNaR projects, a contribution was made to their further advancement and dissemination.
Conference Paper
In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After stating the particular challenges for building corpora of private, non-public computer-mediated communication, we will present our solution to these problems: a Facebook web application for the acquisition of such data and the corresponding meta data. Finally, we will discuss positive and negative implications for this method.
Article
In this paper we present the first freely available corpus of Dutch text messages containing data originating from the Netherlands and Flanders. This corpus has been collected in the framework of the SoNaR project and constitutes a viable part of this 500-million-word corpus. About 53,000 text messages were collected on a large scale, based on voluntary donations. These messages will be distributed as such. In this paper we focus on the data collection processes involved and after studying the effect of media coverage we show that especially free publicity in newspapers and on social media networks results in more contributions. All SMS are provided with metadata information. Looking at the composition of the corpus, it becomes visible that a small number of people have contributed a large amount of data, in total 272 people have contributed to the corpus during three months. The number of women contributing to the corpus is larger than the number of men, but male contributors submitted larger amounts of data. This corpus will be of paramount importance for sociolinguistic research and normalisation studies.
Article
In this paper a collection of chats and tweets from the Netherlands and Flanders is described. The chats and tweets are part of the freely available SoNaR corpus, a 500 million word text corpus of the Dutch language. Recruitment, metadata, anonymisation and IPR issues are discussed. To illustrate the difference of language use between the various text types and other parameters (like gender and age) simple text analysis in the form of unigram frequency lists is carried out. Furthermore a website is presented with which users can retrieve their own frequency lists.
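The unigram frequency lists mentioned in this abstract can be illustrated with a short Python sketch. The tokenization rule (lowercasing and splitting on non-word characters), the function name `unigram_frequencies`, and the sample messages are assumptions for illustration only. They are not taken from the SoNaR project or its website.

```python
import re
from collections import Counter


def unigram_frequencies(texts):
    """Count lowercased word tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


# Hypothetical sample messages, grouped by text type.
samples = {
    "chat": ["Hoi hoi!", "hoi, tot morgen"],
    "tweet": ["Tot morgen allemaal"],
}

# One frequency list per text type, so the lists can be compared.
frequency_lists = {kind: unigram_frequencies(texts) for kind, texts in samples.items()}
```

Calling `frequency_lists["chat"].most_common(10)` would then return the ten most frequent tokens for the chat sample. A real comparison across text types, gender, or age would normally use relative rather than raw frequencies, since the subcorpora differ in size.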
Article
As a way of tracking popular framing of CMC, this article critically reviews an international corpus of 101 print-media accounts (from 2001 to 2005) of language-use in technologies such as instant messaging and text messaging. From the combined perspective of folk linguistics and critical discourse analysis, this type of metadiscourse (i.e., discourse about discourse) reveals the conceptual and ideological assumptions by which particular communication practices come to be institutionalized and understood. The article is illustrated with multiple examples from across the corpus in order to demonstrate the most recurrent metadiscursive themes in mediatized depictions of technologically or computer-mediated discourse (CMD). Rooted in extravagant characterizations of the prevalence and impact of CMD, together with highly caricatured exemplifications of actual practice, these popular but influential (mis)representations typically exaggerate the difference between CMD and nonmediated discourse, misconstrue the "evolutionary" trajectory of language change, and belie the cultural embeddedness of CMD.
Postma, K.: Geen paniek! Een analyse van de beeldvorming van sms-taal in Nederland. Master thesis, VU University Amsterdam (2011)