Collecting Facebook Posts and WhatsApp Chats
Corpus Compilation of Private Social Media Messages
Lieke Verheijen and Wessel Stoop
Radboud University, Nijmegen, The Netherlands
{lieke.verheijen,w.stoop}@let.ru.nl
Abstract. This paper describes the compilation of a social media corpus
with Facebook posts and WhatsApp chats. Authentic messages were
voluntarily donated by Dutch youths between 12 and 23 years old. Social
media nowadays constitute a fundamental part of youths’ private lives,
constantly connecting them to friends and family via computer-mediated
communication (CMC). The social networking site Facebook and mobile
phone chat application WhatsApp are currently quite popular in the
Netherlands. Several relevant issues concerning corpus compilation are
discussed, including website creation, promotion, metadata collection,
and intellectual property rights/ethical approval. The application that
was created for scraping Facebook posts from users’ timelines, of course
with their consent, can serve as an example for future data collection.
The Facebook and WhatsApp messages are collected for a sociolinguistic
study into Dutch youths’ written CMC, of which a preliminary analysis
is presented, but also present a valuable data source for further research.
Keywords: Computer-mediated communication · Social media ·
New media · Facebook · WhatsApp · Corpus compilation ·
Data collection
1 Introduction
More and more youths around the world, including in the Netherlands, regularly
use social media such as SMS text messaging, chat, instant messaging,
microblogging, and networking sites in their private lives.
This has raised worries among parents and teachers alike
that the informal, non-standard lingo used by youngsters while communicating
via social media may have a (negative) impact upon their traditional literacy
skills, i.e. writing and reading [1,2]. Before studying the possible effect of uncon-
ventional language use in social media on literacy, it is paramount to know what
that language actually looks like. Yet little is known so far about the exact lin-
guistic manifestation of Dutch social media texts, in terms of key features of
writing such as orthography (spelling), syntax (grammar and sentence struc-
ture), and lexis (vocabulary). As such, a linguistic analysis into Dutch youths’
written computer-mediated communication is an urgent matter for research. To
conduct such a study, an up-to-date corpus of social media texts is of the utmost
importance. This paper describes the compilation of such a social media corpus,
specifically of WhatsApp chats and Facebook posts. Ultimately, the corpus can
help to answer the following questions: how does Dutch youths’ language use on
WhatsApp and Facebook differ from Standard Dutch? And how do WhatsApp
and Facebook messages differ, linguistically speaking, from other new media
genres, such as SMS text messages and tweets?
© Springer International Publishing Switzerland 2016
P. Sojka et al. (Eds.): TSD 2016, LNAI 9924, pp. 249–258, 2016.
DOI: 10.1007/978-3-319-45510-5_29
First, we collected WhatsApp chats. These are private online chats, which
involve typed spontaneous communication in real time between two or more users
of the mobile phone application WhatsApp Messenger. This instant messaging
client, whose name is a contraction of ‘what’s up’ and ‘application’, was released
in 2010 and has since gained enormously in popularity among Dutch smartphone
users. It was acquired by the Facebook company in 2014. Secondly, we
have started collecting status updates, both public and non-public, posted on
Facebook timelines. This social networking service was created in 2004. Its name
comes from the ‘face book’ directories that are often given to university students
in the United States, who were the initial members of this social network. The
personal Facebook timeline was introduced in 2011, when the format of users’
individual profile pages was changed. In this paper, we describe the collection of
these two datasets. To the best of our knowledge, this is the first social media
corpus with Dutch WhatsApp and Facebook messages.
2 Related Work
The corpus compiled for this project is an addition to existing corpora of
computer-mediated communication, in particular SoNaR (‘STEVIN Nederland-
stalig Referentiecorpus’), a freely available reference corpus of written Dutch
containing some 500 million words of text that was built by Dutch and Belgian
computational linguists [3–6]. SoNaR contains a variety of text sources, including
some social media genres, namely online chats, tweets, internet forums, blogs,
and text messages. However, two media that are currently very popular in the
Netherlands are lacking, that is, Facebook and WhatsApp. As such, there is a
great need for the texts collected in the present project.
The creation and analysis of CMC corpora is currently an active research
area. Yet, most projects explore language data that are publicly available, which
are relatively easy to obtain, such as from Twitter, Wikipedia, discussion boards,
or public social networking profiles. CMC corpora with non-public language
data are still scarce: they are more time-consuming and difficult to obtain,
because they require the active participation of contributors. The following
pioneering projects are in the vanguard of private social media message collection.
A notable project similar to our WhatsApp data collection is the
‘What’s up, Switzerland?’ project [7,8], a follow-up of the ‘Sms4science’ project
[9]. Researchers from four universities study the language used in Swiss
WhatsApp chats. For this non-commercial large-scale project, over 838,000
WhatsApp messages (about 5 million tokens) by 419 users were collected in 2014.
Contact with the project’s coordinators provided us with information about the
set-up of their data collection; this served as an inspiration for our own collection.
A related project is ‘What’s up, Deutschland?’ [10], conducted by researchers
from seven German universities. They collected over 376,000 WhatsApp mes-
sages by 238 users in 2014 and 2015. Similar to our current project, the ‘What’s
up’ projects compare WhatsApp chats to SMS text messages, and several fea-
tures are investigated, e.g. linguistic structures, spelling, and emoticons/emoji.
Our Facebook data collection is comparable to that of the DiDi project
[11,12]. The DiDi corpus comprises German Facebook writings by 136 voluntary
participants from the Italian province of South Tyrol (around 650,000 tokens).
The corpus was collected in 2013. It contains not just status updates, but also
comments on timeline posts, private messages, and chat conversations. The data
and corresponding metadata were acquired by means of a Facebook web applica-
tion. Their linguistic analysis focuses, among other things, on the use of dialects
and age-related differences in language on social network sites.
3 Creation of Websites
We created two websites to gather WhatsApp chats and Facebook posts (see
http://cls.ru.nl/whatsapptaal and http://cls.ru.nl/facebooktaal), where youths
could donate their own WhatsApp and Facebook messages to science. The
data thus represent authentic, original, unmodified messages that were com-
posed in completely natural, non-experimental conditions. Besides the home
page, the websites contain the tabs ‘Prizes’, ‘Instructions’, ‘Consent’, ‘FAQ’,
‘About us’, and ‘Contact’. These pages present, respectively, information on the
prizes youths can win by contributing their social media messages to the research
project, instructions on how they can submit their messages, consent forms that
they should sign for us to be allowed to use the data, frequently asked questions,
brief info about ourselves (the researchers), and a contact form.
The main difference between the two websites for gathering social media data
is that the WhatsApp collection website includes an ‘Instructions’ page with
extensive explanations on how to submit chats depending on one’s mobile phone
type (Android, iPhone, or Windows Phone), whereas the Facebook collection
website prominently features a button for donating messages. This difference
stems from the technical possibilities of submitting messages: while WhatsApp
chats can be sent via email from a mobile phone (to an email address created
specifically for the purposes of this data collection, whatsapp-taal@let.ru.nl),
Facebook posts cannot easily be submitted by users themselves, so we retrieved
them by means of a self-built application.
4 Creation of Application
To automatically retrieve posts from volunteering youths’ Facebook timelines,
we created a Facebook app - a piece of software that has access to data stored
by Facebook via the Facebook Graph API (application programming interface,
https://developers.facebook.com/docs/graph-api). In practice, this means that
users only have to click on a button on our website, telling the app to make a
connection to Facebook, to collect their posts, and to save them in our database.
To protect the privacy of its users, Facebook has installed two layers of security
with which the app needed to deal. The first layer entails that the volunteering
user needs to allow the app access to every piece of information it collects.
Facebook calls this allowance of access to personal data a ‘permission’. Two
permissions were required for our purpose: user_birthday (to make sure that we
collected posts of youths of the intended ages) and user_posts. Users grant these
permissions directly after they click the button on our website: a pop-up window
appears which first asks them to log in to Facebook and then explains what
the app will have access to if they proceed.
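The age check that the birthday permission makes possible can be sketched as follows. This is a minimal illustration, not the code of our app: the function names and the age bounds as constants are our own assumptions, and the sketch relies on the Graph API returning the birthday as a fixed-format string (MM/DD/YYYY), which may be shortened when a user hides the year of birth.

```python
from datetime import date
from typing import Optional

MIN_AGE, MAX_AGE = 12, 23  # age range targeted by the corpus


def age_from_birthday(birthday: str, today: date) -> Optional[int]:
    """Return the age in whole years, or None if the full date is not shared.

    Assumes the fixed format MM/DD/YYYY; a string with fewer parts
    (e.g. MM/DD or YYYY only) is treated as 'age unknown'.
    """
    parts = birthday.split("/")
    if len(parts) != 3:
        return None  # incomplete date: age cannot be determined
    month, day, year = (int(p) for p in parts)
    # Subtract one year if this year's birthday has not yet occurred.
    had_birthday = (today.month, today.day) >= (month, day)
    return today.year - year - (0 if had_birthday else 1)


def is_eligible(birthday: str, today: date) -> bool:
    """True if the contributor falls within the intended age range."""
    age = age_from_birthday(birthday, today)
    return age is not None and MIN_AGE <= age <= MAX_AGE
```

Treating an incomplete birthday as ineligible is a conservative choice made for this sketch; a real collection could instead ask such users for their age on the website.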
The second security layer entails that Facebook itself needs to allow the app
to ask for permissions. During development, the app only worked for a predefined
set of Facebook users for testing purposes; users that were not part of this set
could not grant any permissions and thus could not donate their data with the app. To
make the app available to all Facebook users, it had to be manually reviewed by
a Facebook employee. Our app was accepted only after we made clear that it is
of value to Facebook users because it enables volunteering users to effortlessly
donate their posts without having to manually copy and paste these one by one.
The source code of the app can be found at https://github.com/Woseseltops/
FB-data-donator. It can easily be adjusted to make another app that collects
other user data in a similar way.
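For readers who want a feel for what such retrieval involves, the following is a minimal sketch and not the code from the repository above. It assumes that an access token with the user_posts permission has already been obtained, that posts are requested from the Graph API's /me/posts edge, and that results are paginated through a 'paging.next' link. The API version string, the requested fields, and all function names are illustrative assumptions.

```python
import json
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

GRAPH_URL = "https://graph.facebook.com"


def first_page_url(access_token: str, version: str = "v2.5") -> str:
    """Build the URL for the first page of the user's own posts.

    The version string is an assumption; use whichever version is current.
    """
    query = urlencode({"fields": "message,created_time",
                       "access_token": access_token})
    return f"{GRAPH_URL}/{version}/me/posts?{query}"


def fetch_json(url: str) -> Dict:
    """Default fetcher: perform an HTTP GET and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_posts(access_token: str,
               fetch: Callable[[str], Dict] = fetch_json) -> Iterator[Dict]:
    """Yield every post that has text, following 'paging.next' links."""
    url: Optional[str] = first_page_url(access_token)
    while url:
        page = fetch(url)
        for post in page.get("data", []):
            if "message" in post:  # skip posts without text (e.g. photos)
                yield post
        url = page.get("paging", {}).get("next")
```

The fetch function is passed in as a parameter so that the pagination logic can be tried out with canned responses, without contacting Facebook.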
5 Promotion of Websites
The websites for collecting social media messages were promoted through free
publicity in Dutch media. The project attracted considerable media attention, which
resulted in newspaper publications, both regional (de Gelderlander) and national
(AD.nl), radio interviews on regional (RTV Noord-Holland, Studio 040) and
national (De Taalstaat, NPO Radio 1, 3FM, Radio FunX) stations, and television
interviews on regional (NimmaTV) and national (Rtl4) TV. University
and student magazines reported on the data collection too (Vox, ANS). In addition,
it was advertised in the digital newsletters of Onze Taal (the Dutch society
for language buffs): Taalpost for adults and TLPST for adolescents. The data
collection was also promoted via the Radboud University’s web pages and by
researchers via social media channels, in particular Twitter and Facebook. We
further promoted it during lectures and master classes for young audiences, i.e.
students in secondary and tertiary education. Our aim was to promote the web-
sites nationwide, in order to gather a representative sample of messages from
youngsters throughout the country.
In order to stimulate youths to contribute their social media messages to our
project, we decided to raffle off prizes: gift certificates worth 100, 50, and
20 euros. With respect to WhatsApp, individual contributors’ odds of winning
a prize increased as they sent in more chat conversations. We felt that this raffle
was necessary to stimulate youths to donate their private messages to the corpus.
Importantly, it was emphasized on the websites that only those contributors who
completely filled in the consent form stood a chance of winning the prizes. This
was made explicit to motivate youths to give their informed consent.
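A raffle in which the odds grow with the number of donated conversations, and in which only contributors with a completed consent form take part, could be sketched as below. This is an illustration under our own assumptions rather than a description of the actual draw: one ticket per conversation, at most one prize per contributor, and a fixed random seed for reproducibility are all choices made for the example.

```python
import random
from typing import Dict, List, Set


def draw_winners(conversations: Dict[str, int], consented: Set[str],
                 n_prizes: int = 3, seed: int = 0) -> List[str]:
    """Draw up to n_prizes distinct winners, weighted by conversations sent."""
    rng = random.Random(seed)
    # Only contributors with a completed consent form are eligible.
    pool = {c: n for c, n in conversations.items() if c in consented and n > 0}
    winners: List[str] = []
    while pool and len(winners) < n_prizes:
        names = list(pool)
        weights = [pool[name] for name in names]  # one ticket per conversation
        winner = rng.choices(names, weights=weights, k=1)[0]
        winners.append(winner)
        del pool[winner]  # assumption: each contributor wins at most one prize
    return winners
```

The winners are returned in the order drawn, so the first name could be matched to the highest prize.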
6 Metadata
All WhatsApp chats and Facebook posts in our social media corpus are accom-
panied by a substantial amount of sociolinguistic information. Via the websites,
the following metadata were obtained: name, place of residence, place and date of
birth, age, gender, and educational level, as well as date and place of submission.
These parameters are useful for sociolinguistic research, since they enable one to
study the language use of different social groups in WhatsApp and Facebook.
7 IPR Issues and Ethical Approval
Intellectual property rights (IPR) were obtained by consent of both the Face-
book company and individual contributors of Facebook and WhatsApp mes-
sages, since it is key to safeguard the authors’ rights and interests [5] (p. 2270).
For underage contributors, between 12 and 17 years old, written consent was also
obtained from one of their parents or guardians. By signing the consent web form,
contributors declared the following:
– to have been informed about the purpose of the study;
– to have been able to ask questions about the study;
– to understand how the data from the study will be stored and to what ends
they will be used;
– to have considered if they want to partake in the study;
– to voluntarily participate in the study.
Additionally, parents or guardians also declared:
– to be aware of the contents of their child’s messages;
– to agree with their child’s participation in the study.
Participants and their parents/guardians gave full permission for their
(child’s) submitted messages (i) to be used for scientific research and educa-
tional purposes; (ii) to be stored in a database, according to Radboud Univer-
sity’s rules, and to be kept available for scientific research, provided they are
anonymised and in no way traceable to the original authors; and (iii) to be used
in scientific publications and meetings. If messages appear in publications or
presentations, no parts that may harm the participants’ interests will be made
public.
Furthermore, ethical approval was obtained from our institution’s Ethical
Testing Committee (ETC). For the WhatsApp chats, it was crucial for the ETC
that messages of conversation partners were deleted, since they have not given
consent for the use of their messages. Accordingly, interlocutors’ WhatsApp
messages were immediately discarded. This procedure was explained on the FAQ
page of the websites. In accordance with the ETC’s further guidelines, we added
downloadable information documents on the home pages.
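Discarding interlocutors’ messages from an exported chat can be sketched as follows. The sketch is hypothetical: it assumes that each message in the export starts with a line of the form "date, time - sender: text", whereas the actual layout of WhatsApp exports differs across phone types, app versions, and language settings, so the pattern would need to be adapted per export.

```python
import re
from typing import List

# Assumed timestamp layout, e.g. "12/31/15, 22:15" (varies in practice).
STAMP = r"\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}"
MESSAGE = re.compile(rf"^{STAMP} - (?P<sender>[^:]+): (?P<text>.*)$")
NEW_ENTRY = re.compile(rf"^{STAMP} - ")


def keep_donor_messages(export: str, donor: str) -> List[str]:
    """Return only the donor's messages, discarding all other senders."""
    kept: List[str] = []
    keeping = False  # whether the current message belongs to the donor
    for line in export.splitlines():
        match = MESSAGE.match(line)
        if match:
            keeping = match.group("sender") == donor
            if keeping:
                kept.append(match.group("text"))
        elif NEW_ENTRY.match(line):
            keeping = False  # timestamped system notice without a sender
        elif keeping:
            # Continuation line of a multi-line message by the donor.
            kept[-1] += "\n" + line
    return kept
```

Continuation lines are attributed to the sender of the preceding timestamped line, so that multi-line messages by conversation partners are dropped in full.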
8 Current Corpus Composition
The collection period of WhatsApp messages lasted from April until December
2015; the collection of Facebook messages started in December 2015. Up to
the time of writing, over 332,000 word tokens of WhatsApp chats have been
collected from youths between the ages of 12 and 23. For comparison, the
SoNaR subcorpora with texts by youths up to 20 years old from the Netherlands
comprise 44,012 word tokens in the SMS corpus (6.08% of the total number of
words of that corpus); 219,043 in the chat corpus (29.7% of total); and 2,458,904
in the Twitter corpus (10.6% of total). The scale of the present corpus makes it suitable
for fine-grained (manual) linguistic studies; it is not intended as a training data
set for large-scale computational research.
We excluded chain messages from our corpus. Also not included were any
visual or audio materials: since the study that prompted the data collection is
completely linguistic in nature, images, videos, and sound files were not gath-
ered, so the corpus is wholly textual rather than multimodal. Another deciding
factor in asking contributors not to add media files when sending WhatsApp
conversations from their smartphones is that adding them may prevent mails
from arriving due to an exceeded data limit. More importantly, issues of copy-
right and privacy protection would make any inclusion of pictures, videos, or
sounds highly problematic. The messages are stored as one WhatsApp chat conversation
per file. Table 1 shows demographic details on the data collected so
far, focusing on the age and gender distribution.
Table 1. Composition of WhatsApp dataset.
              Contributors    Conversations    Words
              #      %        #      %         #         %
Adolescents   11     32.4     83     38.6      63,217    19.0
Young adults  23     67.6     132    61.4      269,440   81.0
Male          12     35.3     71     33.0      98,201    29.5
Female        22     64.7     144    67.0      234,456   70.5
Total         34     100      215    100       332,657   100
For the WhatsApp dataset, a relatively small number of youths (34) have
contributed large quantities of data. At the time of writing, the number of contributors
of Facebook posts was already considerably greater: 94 contributors, who together
contributed 171,693 words. This difference may stem from the submission procedure:
while users were asked to submit WhatsApp chats via separate emails,
which required taking several steps on their mobile phones, they could easily
submit all their Facebook posts with the click of a button. Young adults (18–23
years old, avg. age 20.1) submitted many more WhatsApp messages than ado-
lescents (12–17, avg. age 14.4), not only in terms of number of contributors, but
also in terms of number of conversations as well as words. The average age of
all contributors was 18.3. In terms of gender, a higher percentage of WhatsApp
chat contributors are female, with about two thirds girls versus one third boys
(a distribution similar to that for donated text messages as reported in [4]). This
corresponds to the percentages of words and conversations that were submitted
by male versus female contributors.
9 Preliminary Data Analysis
This section presents the first findings of a linguistic corpus study of Dutch
youths’ WhatsApp chats. Their language use in social media often differs from
Standard Dutch, in various dimensions of writing. A striking orthographic feature
of written CMC is the use of textisms: unconventional spellings of various kinds. We
conducted a quantitative register analysis of the frequency of textisms, and
investigated how the independent variable age group affects this linguistic feature
by distinguishing between WhatsApp messages of adolescents and young adults.
The following textism types were found (presented here with Dutch examples):
– textisms with letters:
  • initialism: first letters of each word/element in a compound word, phrase,
    sentence, or exclamation, e.g. hvj (hou van je), omg (oh mijn God)
  • contraction: omission of letters (mostly vowels) from middle of word, e.g.
    vnv (vanavond), idd (inderdaad)
  • clipping: omission of final letter of word, e.g. lache (lachen), nie (niet)
  • shortening: dropping of ending or occasionally beginning of word, e.g.
    miss (misschien), wan (wanneer)
  • phonetic respelling: substitution of letter(s) of word by (an)other letter(s),
    while applying accurate grapheme-phoneme patterns of the standard language,
    e.g. ensow (enzo), boeiuh (boeien), okeej (oké), egt (echt)
  • single letter/number homophone: substitution of word by phonologically
    resembling or identical letter/number, e.g. n (een), t (het), 4 (for)
  • alphanumeric homophone: substitution of part of word by phonologically
    resembling or identical letter(s)/number(s), e.g. suc6 (succes), w88
    (wachten)
  • reduplication: repetition of letter(s), e.g. neeee (nee), superrr (super)
  • visual respelling: substitution of letter(s) by graphically resembling
    non-alphabetic symbol(s), e.g. Juli@n (Julian), c00l (cool)
  • accent stylisation: words from casual, colloquial, or accented speech
    spelled as they sound, e.g. hoezut (hoe is het), lama (laat maar)
  • inanity: other, e.g. laterz (later)
  • standard language abbreviations, e.g. aug (augustus), bios (bioscoop)
– textisms with diacritics:
  • missing, e.g. carriere (carrière), ideeen (ideeën), enquete (enquête)
– textisms with punctuation:
  • missing, e.g. mn (m’n), maw (m.a.w.), ovkaart (ov-kaart)
  • extra, e.g. stilte-coupé (stiltecoupé)
  • reduplication, e.g. !!!!!, ??, ..........
– textisms with spacing:
  • missing (in between words), e.g. hahaokeeedan (haha oké dan)
  • extra (in between elements of compound words), e.g. fel groen (felgroen)
– textisms with capitalisation:
  • missing (of proper names, abbreviations), e.g. tim (Tim), ok (OK)
  • extra, e.g. WOW (wow)
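Some of these textism types lend themselves to automatic pre-selection before manual coding. As an illustration only (not the procedure used for the analysis reported here), the two reduplication types could be flagged with regular expressions. The thresholds are heuristic assumptions: three or more identical letters in a row for letter reduplication, and two or more consecutive marks for punctuation reduplication; both may over- or under-detect.

```python
import re
from typing import List

# Three or more identical letters in a row (heuristic threshold).
LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}", re.IGNORECASE)
# Two or more consecutive '!', '?' or '.' characters.
PUNCT_RUN = re.compile(r"[!?.]{2,}")


def letter_reduplications(message: str) -> List[str]:
    """Return the words that contain a run of 3+ identical letters."""
    return [w for w in re.findall(r"\w+", message) if LETTER_RUN.search(w)]


def punctuation_reduplications(message: str) -> List[str]:
    """Return every run of repeated sentence-final punctuation."""
    return PUNCT_RUN.findall(message)
```

Candidates found this way would still need manual checking, since the other textism types in the list (e.g. phonetic respellings) cannot be recognised by such simple patterns.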
Figure 1 shows the results for the textisms, separating adolescents from young
adults. The frequencies shown here have been standardised per 10,000 words,
because the total number of words differs per age group in the WhatsApp
dataset. The figure makes clear that textisms with letters were by far the most
frequent in the WhatsApp chats. It also shows an age-based distinction: while
textisms with diacritics, capitalisation, punctuation, and spacing occurred with
more or less similar frequencies in the WhatsApp messages of the two age groups,
those with letters were used much more by adolescents. Their greater use of
orthographic deviations may be attributed to a desire to rebel against societal
norms, including the standard language norms, and to play with language: the
most non-conformist linguistic behaviour is said to occur around the ages of
15/16, when the ‘adolescent peak’ occurs. Young adults, on the other hand, may
feel more social pressure to conform to norms set by society, also those about
language.
Fig. 1. Five types of textisms in WhatsApp dataset.
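The standardisation per 10,000 words amounts to dividing a raw count by the group’s total number of words and multiplying by 10,000. The sketch below uses the word totals per age group from Table 1; the raw textism counts are hypothetical placeholders chosen for illustration and are not results of this study.

```python
def per_10k(count: int, total_words: int) -> float:
    """Standardise a raw frequency to occurrences per 10,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count / total_words * 10_000


# Word totals per age group, taken from Table 1.
WORDS = {"adolescents": 63_217, "young adults": 269_440}

# Hypothetical raw counts, for illustration only (not the paper's data).
raw_counts = {"adolescents": 1_000, "young adults": 1_000}

rates = {group: per_10k(raw_counts[group], WORDS[group]) for group in WORDS}
```

With equal raw counts, the smaller adolescent subcorpus yields the higher standardised rate, which is why raw frequencies cannot be compared directly across the two age groups.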
This preliminary analysis is part of a larger in-depth linguistic study
of a broad range of linguistic features in WhatsApp chats. These features concern
orthography (misspellings, typos, emoticons, symbols), syntax (omissions; complexity),
and lexis (borrowings, interjections; diversity, density). Other lexical
features that may be interesting for online youth communication are, for example,
swearwords, intensifiers, and hyperbolic expressions. The WhatsApp data
will be compared to the Facebook data, as well as to instant messages, text
messages, and microblogs of the SoNaR corpus. This can reveal to what extent
deviations from the standard language norms in CMC depend not just on indi-
vidual user characteristics such as age, but also on genre characteristics.
10 Conclusions
The central role currently played by CMC in (especially) youths’ lives makes
social media corpora quite valuable for state-of-the-art sociolinguistic research.
This paper discussed the compilation of such a corpus in the Netherlands.
WhatsApp chats and Facebook posts were contributed by Dutch youths from
12 to 23 years old. This paper has made clear that a data collection method of
voluntary donations, with the added incentive of a prize raffle, can yield a fair
amount of data if sufficient public attention is obtained through e.g. media cov-
erage. We have presented websites created for this purpose, and have explained
how such websites can be promoted. The importance of collecting metadata and
obtaining written consent and ethical approval has been stressed. Crucially, the
application we created to gather Facebook posts, together with the process of gaining
consent from the Facebook company, can serve as a model for future corpus
builders.
11 Future Work
Eventually, if the WhatsApp and Facebook data are processed in a similar fash-
ion as the rest of SoNaR, they can be incorporated into the corpus together
with their metadata. This would require format conversion, tokenisation, and
anonymisation: the data should be (a) converted into the FoLiA XML format,
which was developed for linguistic resources, (b) tokenised by UCTO, a tokeniser
adapted for social media, and (c) anonymised, if possible automatically, so that
they contain no personal/place names, (email) addresses, telephone numbers, or
bank accounts. Such additional processing was beyond the scope of the present
project, but particularly data anonymisation is essential if the WhatsApp chats
and Facebook messages are shared with the wider scientific community and
become available for further research into social media texts. It would also be
useful to apply part-of-speech tagging to this corpus. Moreover, we recognize the
need for multimodal social media corpora: the next step in sociolinguistic social
media research may be to focus on multimodality, given the increased options
for incorporating visual materials (photographs, emoji, videos, etc.) and the use
thereof in computer-mediated communication. The number of contributors so
far suggests that youths remain hesitant to donate their private, often intimate,
social media messages to science, despite substantial gift certificates; perhaps a
larger corpus could be obtained with even more publicity or even greater prizes.
Nonetheless, albeit monomodal and of modest scale, the present corpus with its
metadata can be a vital resource and an example of how social media texts can
be collected for linguistic, sociological, or other research.
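Part of the automatic anonymisation mentioned above could start from pattern matching, as in the following sketch. It is a hypothetical first pass, not a tool used in this project: the patterns only cover email addresses, Dutch IBANs, and ten-digit Dutch telephone numbers, and the placeholder labels are our own. Personal and place names cannot be caught this way and would require named-entity recognition or manual checking.

```python
import re

# Illustrative patterns only; real data would need broader coverage.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Dutch IBAN: "NL", 2 check digits, 4-letter bank code, 10 digits.
IBAN_NL = re.compile(r"\bNL\d{2}[A-Z]{4}\d{10}\b")
# Ten digits starting with 0, optionally separated by spaces or hyphens.
PHONE_NL = re.compile(r"\b0\d(?:[ -]?\d){8}\b")


def anonymise(text: str) -> str:
    """Replace email addresses, IBANs and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = IBAN_NL.sub("[IBAN]", text)
    text = PHONE_NL.sub("[PHONE]", text)
    return text
```

Replacing matches with typed placeholders rather than deleting them keeps the message structure intact for later linguistic analysis.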
Acknowledgments. This research was funded by a grant from the Dutch Organisation
for Scientific Research (NWO), under project number 322-70-006. Special thanks are
due to Iris Monster, who constructed the WhatsApp website. Thanks also go to Wilbert
Spooren and Ans van Kemenade, the supervisors of Lieke’s PhD project. Finally, we
thank all contributors of WhatsApp and Facebook messages to our corpus.
References
1. Thurlow, C.: From statistical panic to moral panic: the metadiscursive construction
and popular exaggeration of new media language in the print media. J. Comput.-
Mediated Commun. 11(3), 667–701 (2006)
2. Postma, K.: Geen paniek! Een analyse van de beeldvorming van sms-taal in
Nederland. Master thesis, VU University Amsterdam (2011)
3. Sanders, E.: Collecting and analysing chats and tweets in SoNaR. In: Proceedings
LREC (Language Resources and Evaluation) 2012, pp. 2253–2256 (2012)
4. Treurniet, M., Sanders, E.: Chats, tweets and SMS in the SoNaR corpus: social
media collection. In: Newman, D. (ed.) Proceedings of the 1st Annual International
Conference Language, Literature & Linguistics, pp. 268–271. Global Science and
Technology Forum, Singapore (2012)
5. Treurniet, M., De Clercq, O., van den Heuvel, H., Oostdijk, N.: Collecting a corpus
of Dutch SMS. In: Proceedings LREC 2012, pp. 2268–2273 (2012)
6. Oostdijk, N., Reynaert, M., Hoste, V., Schuurman, I.: The construction of a
500-million-word reference corpus of contemporary written Dutch. In: Spyns, P.,
Odijk, J. (eds.) Essential Speech and Language Technology for Dutch: Results by
the STEVIN Programme, pp. 219–247. Springer, Heidelberg (2013)
7. Dürscheid, C., Frick, K.: Keyboard-to-Screen-Kommunikation gestern und heute:
SMS und WhatsApp im Vergleich. In: Mathias, A., Runkehl, J., Siever, T. (eds.)
Sprachen? Vielfalt! Sprache und Kommunikation in der Gesellschaft und den
Medien. Eine Online-Festschrift zum Jubiläum von Peter Schlobinski, pp. 149–181.
Networx 64, Hannover (2014)
8. Stark, E., Ueberwasser, S., Diémoz, F., Dürscheid, C., Natale, S., Thurlow, C.,
Siebenhaar, B.: What’s up, Switzerland? Language, individuals and ideologies in
mobile messaging. Universität Zürich, Universität Bern, Université de Neuchâtel,
Universität Leipzig (2015). http://www.whatsup-switzerland.ch
9. Stark, E., Ueberwasser, S., Dürscheid, C., Béguelin, M.J., Moretti, B., Grünert, M.,
Gazin, A.-D., Pekarek Doehler, S., Siebenhaar, B.: Sms4science. Universität Zürich,
Université de Neuchâtel, Universität Bern, Universität Leipzig (2015).
http://www.sms4science.uzh.ch
10. Siebenhaar, B., et al.: What’s up, Deutschland? WhatsApp-Nachrichten
erforschen. Universität Leipzig, Technische Universität Dortmund, Technische
Universität Dresden, Leibniz Universität Hannover, Universität Mannheim,
Universität Koblenz-Landau, Universität Duisburg-Essen (2016).
http://www.whatsup-deutschland.de
11. Frey, J.-C., Stemle, E.W., Glaznieks, A.: Collecting language data of non-public
social media profiles. In: Faaß, G., Ruppenhofer, J. (eds.) Workshop Proceedings
of the 12th Edition of the KONVENS Conference, pp. 11–15. Universitätsverlag,
Hildesheim (2014)
12. Frey, J.-C., Glaznieks, A., Stemle, E.W.: The DiDi corpus of South Tyrolean CMC
data. In: Beißwenger, M., Zesch, T. (eds.) Proceedings of the 2nd Workshop of
the Natural Language Processing for Computer-Mediated Communication/Social
Media, pp. 1–6. University of Duisburg-Essen (2015)