
A Crowd-sourcing Approach for Translations of Minority Language User-Generated Content (UGC)



Data sparsity is a common problem for machine translation of minority and less-resourced languages. While data collection for standard, grammatical text can be challenging enough, efforts for collection of parallel user-generated content can be even more challenging. In this paper we describe an approach to collecting English↔Irish translations of user-generated content (tweets) that overcomes some of these hurdles. We show how a crowd-sourced data collection campaign, which was tailored to our target audience (the Irish language community), proved successful in gathering data for a niche domain. We also discuss the reliability of crowd-sourcing English↔Irish tweet translations in terms of quality by reporting on a self-rating approach along with qualified reviewer ratings.
The Prague Bulletin of Mathematical Linguistics
NUMBER ??? JUNE 2017 1–12
Meghan Dowling, Teresa Lynn, Andy Way
ADAPT Centre, Dublin City University
1. Introduction
Irish is the first official language of Ireland, an official language of the European Union, and a recognised minority language in both Northern Ireland and the European Union. However, despite its status, the 2012 META-NET White Paper Series report classifies the Irish language as having “weak/no support” with regard to machine translation resources (Judge et al., 2012). Recently, in response to this, there has been notable progress in terms of gathering parallel data for English↔Irish (EN↔GA) machine translation (Arcan et al. (2016), Dowling et al. (2015)).
Of course, a robust statistical machine translation (SMT) system, which is data-driven, relies on the availability of a significant amount of parallel data suitable for the translation domain. However, there is still only a relatively small amount of parallel data available for the English↔Irish language pair,¹ the vast majority of which contains curated, grammatical and carefully translated content. As expected, this type of text differs greatly from the characteristics of user-generated content (UGC). Lynn et al. (2015) report on some interesting nuances of Irish language UGC such as code-switching, verb drop and phonetic spelling, all of which can cause challenges for text processing. Their work is part of a recent growth of interest in automated processing of Irish UGC, which also saw the creation of the only known user-generated corpus of English↔Irish (EN↔GA) text, gathered as part of the Brazilator sentiment analysis and machine translation project (CNGL-DCU Team, 2014). This project provided Twitter with a number of SMT systems (including Irish) that allowed for real-time translation of tweets during the 2014 World Cup.²
© 2017 PBML. Distributed under CC BY-NC-ND. Corresponding author:
Cite as: Meghan Dowling, Teresa Lynn, Andy Way. A Crowd-sourcing Approach for Translations of Minority Language User-Generated Content (UGC). The Prague Bulletin of Mathematical Linguistics No. ???, 2017, pp. 1–12.
There are a number of difficulties associated with collecting parallel data for machine translation of EN↔GA UGC. Firstly, UGC is relatively new and has really only become prevalent in the past 10 years following the growth in popularity of social media platforms such as Facebook, Twitter and Instagram. This means that translated data is not as readily available as it may be with other content types (e.g. public documents, educational materials, etc.). Secondly, the domains of UGC vary so much that a system would need to be tuned to the specific terminology of each topic or domain. This indeed was the case with the MT system used in the Brazilator project, due to soccer-related tweets carrying a particular register and terminology usage. Finally, compounding the challenges for Irish UGC translation is the lack of high-quality human translators in general who are available to translate English content into Irish, and vice versa. Moorkens (2016) reports that there is a relatively small number of accredited Irish language translators available, and that the demand for translation exceeds the availability of quality translators to such an extent that there is a derogation on Irish translation in the European Commission until 2022. With all these obstacles facing the development of domain-specific EN↔GA parallel data, we are faced with formulating an alternative approach to data collection.
A well-attested solution to gathering translation content is through crowd-sourcing platforms (e.g. Ambati et al. (2010), Zaidan and Callison-Burch (2011)). However, as is the case for many minority languages, English↔Irish translation requires a niche set of skills, which contributors to well-known global crowd-sourcing platforms such as CrowdFlower³ or Mechanical Turk⁴ are unlikely to hold. In this paper, we describe a crowd-sourcing campaign which allowed us to develop a user-generated Irish↔English tweet dataset (for the purposes of a study in Sentiment Analysis within MT (Afli et al., 2017)) by directly attracting altruistic contributions from the Irish-speaking community.
The remainder of this paper is divided as follows: in Section 2, we describe the motivation behind this project. In Section 3, we provide details of the design of the crowd-sourcing interface. In Section 4, the public and media responses relating to this project are discussed. Finally, in Section 6, we provide some conclusions on this work.
¹ Approx. 348,964 sentences of parallel text publicly available, according to Arcan et al. (2016). It should also be noted that a large portion of these ‘sentences’ in fact contain word-to-word translations, similar to a terminology database.
² Sentiment analysis was carried out on tweets so that the change in polarity of tweets could be viewed in real time over the course of each game depending on fans’ views of the match.
2. Motivation
2.1. Resource collection motivation
In recent years, there has been an increased awareness of the usefulness of NLP analysis of social media content when reporting on significant societal events or topical discussion (e.g. the analysis on Twitter of rioting (Lukasik et al., 2015), fake news (Gupta and Kumaraguru (2012), Mitra et al. (2017)), rumours (Jin et al., 2013) and elections (O’Connor et al. (2010), Bakliwal et al. (2013))). Particularly relevant to this work, sentiment analysis helps to provide both governments and the general public with an overview of the online community’s opinions or feelings towards events or people (for example, election candidates; e.g. Ceron et al. (2014)). In Ireland, the national broadcaster (RTÉ - Raidió Teilifís Éireann), through sentiment analysis of tweets with the hashtag #GE16, reported on opinion trends in the lead-up to the 2016 General Election.⁵ One shortcoming of this report, however, is that it covered only the English language tweets. In other words, the sentiment of the Irish-speaking online community was not represented.
Subsequent to this work, a study was carried out to investigate the sentiment of Irish language tweets from this period of time, and containing the same #GE16 hashtag (Afli et al., 2017).⁶ The study focused on sentiment analysis of Irish language tweets and on assessing whether sentiment holds across languages through translation. In order to carry out the study, a parallel corpus of EN↔GA tweets was required, on which sentiment polarity is annotated. Here, we describe the crowd-sourcing method used in the collection of data for the creation of this parallel corpus of EN↔GA tweets.
⁶ Irish tweets around the General Election (olltoghchán) tended to also include the English language hashtag #GE16, along with #togh16, #olltoghchán or #OT16.
2.2. Motivation for crowd-sourcing
As discussed, Irish language research is low in resources, both in terms of funding and in terms of skilled translators. For these reasons, professional translation of the dataset would be beyond the budget of this project. Considering the positive disposition towards Irish language promotion (Darmody et al., 2015), an approach that benefits from the altruistic nature of Irish speakers seemed more realistic and more feasible. Members of the Irish-speaking community, both online and offline, are passionate and proactive about Irish language promotion. In recent times, it has been noted that the community has a strong presence on social media (Lackaff and Moner, 2016). Social network platforms such as Facebook and Twitter have proven beneficial for online campaigns related to the Irish language (e.g. the Twitter campaign known as #AchtAnois - ‘(Legislative) Act Now’).⁷ These platforms are also positively exploited in ‘spreading the word’ about Irish language related activities (e.g. #PopUpGaeltacht tweets have helped the promotion of informal Irish-speaker social meetups both in Ireland and in cities around the world).⁸
Given the positive disposition online towards the cultivation and growth of the Irish language, it is unsurprising to note that previous crowd-sourcing campaigns have proved successful. For example, through crowd-sourcing, 1000 English tweets related to the 2014 World Cup were translated into Irish for the Brazilator project (CNGL-DCU Team, 2014).⁹ In addition, Meitheal Dúchas is a larger, more recent and ongoing campaign that has shown how this approach engages the community at large to contribute to language conservation. The Meitheal Dúchas transcription project allows the general public to transcribe The Schools’ Collection section of the digitised National Folklore Collection.¹⁰ The project’s website provides up-to-date statistics on the contributions to the collection so far. Given these previous successes, we created an online translation interface open to the public and reinforced its promotion with a social media campaign to elicit participant involvement from the online Irish-speaking community.
3. Design considerations
Our aim was to provide a practical crowd-sourcing translation platform with a suitable user interface tailored to the contributors. Two important factors we needed to bear in mind were that the translators would be (1) unpaid and (2) unacknowledged (anonymous). In order to optimise contributions, the design, therefore, needed to ensure that the translation request did not feel like a project or tedious task. This consideration in particular arose from lessons learned from the Brazilator data collection, where the provision of a shared spreadsheet with a large list of tweets for translation resulted in procrastination by some of the volunteers.
⁷ #AchtAnois is used by the Irish language community to show their annoyance at the standard of Irish language legislation in Northern Ireland.
⁹ These 1000 tweets were translated in sets of 100 tweets by 10 volunteers, whose help was enlisted through an online campaign.
3.1. Design criteria
The criteria identified for this website are as follows:
• The website needed to be user-friendly and casual, with clear instructions.
• Users should feel as though they had a trivial task, and could complete as many translations as they felt comfortable with.
• Given that the translators may not be qualified or accredited, it was vital that they could indicate how they rated the quality of the translations they provided.
• While low-quality translations would not be included in the dataset, all contributions were to be deemed valuable.
• Both native and non-native speakers should be able to contribute to the translation effort.
• An effort should be made to maintain consistency across the translations (i.e. approaches to dealing with Twitter-style language).
• Administrators should be able to easily view and access the translations and their confidence scores.
Figure 1: A portion of the translation interface. Image has been slightly altered for printing purposes.
3.2. Implementation
Given the specific criteria identified as necessary for this crowd-sourcing platform, the following features were implemented:
• The landing page of the website has a ‘no-fuss’ appearance, with just four options to choose from (two translation direction options, guidelines and an Admin login option).
• Users were presented with just one tweet to translate at a time, creating a casual opt-in/opt-out environment.
• Users were required to assign a confidence level from 0–10. The purpose of the scoring was to allow for retranslation of lower-scored tweets in an effort to achieve a high-quality translation corpus.
• It was possible for users to skip any translations that they did not feel confident translating, allowing another user to undertake them instead.
• The language direction (English→Irish or Irish→English) could be chosen and switched between at any time.
• Non-native Irish speakers could still contribute by choosing to translate into English, their (presumably) native language, with more ease.
• A set of translation guidelines, outlined in Section 3.2.1, was provided to aid users and ensure consistency.
• The Admin user interface provided a spreadsheet view of all tweets, their translations, and their confidence scores.
Figure 1 shows a screenshot of the translation interface for the translation of English→Irish tweets.
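The queueing behaviour in the feature list above (one tweet at a time, a 0–10 confidence level, skipping, and retranslation of lower-scored tweets) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the platform's actual code: the names `Tweet`, `next_tweet`, `submit` and the threshold `RETRANSLATE_BELOW` are hypothetical, and the paper does not state the score below which a tweet was re-presented.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical cut-off: the paper does not say which scores count as "lower".
RETRANSLATE_BELOW = 6


@dataclass
class Tweet:
    tweet_id: int
    source_text: str
    # Each attempt is a (translation, confidence) pair; a skip is ("", 0).
    attempts: list = field(default_factory=list)

    def needs_translation(self) -> bool:
        """True while no attempt has reached the retranslation threshold."""
        return all(conf < RETRANSLATE_BELOW for _, conf in self.attempts)


def next_tweet(tweets: list) -> Optional[Tweet]:
    """Serve one tweet at a time: the first that still needs a translation."""
    for tweet in tweets:
        if tweet.needs_translation():
            return tweet
    return None


def submit(tweet: Tweet, translation: str, confidence: int) -> None:
    """Record an attempt; a blank translation with confidence 0 is a skip."""
    if not 0 <= confidence <= 10:
        raise ValueError("confidence must be between 0 and 10")
    if confidence == 0 and translation.strip():
        raise ValueError("confidence 0 is reserved for skipped tweets")
    if confidence > 0 and not translation.strip():
        raise ValueError("a rated translation must not be blank")
    tweet.attempts.append((translation.strip(), confidence))
```

In this sketch a skipped or low-scored tweet simply stays at the front of the queue. A deployed system would presumably also track which user made each attempt, so that the same tweet is offered to a different contributor.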
3.2.1. Translation guidelines
The following are the translation guidelines provided to users to aid them in their translations:
• Placeholders: #hashtags and @twitterhandles are to be left untranslated. Emoticons have been replaced by the placeholder [emoticon]. Please retain these placeholders in your Irish translation (or English translations) also.
  e.g. My Dad [emoticon] soaked but smiling #ge16 → M’athair [emoticon] fliuch báite ach fós gealgháireach #ge16
• Case: Please keep translations case sensitive where possible.
  e.g. FULL HOUSE Great night tonight @SorchaNicC #GE16 launch. → TEACH LÁN Oíche iontach anocht ag seoladh #GE16 @SorchaNicC.
• Text speak: Where possible, please translate English text speak to Irish text speak (and vice versa), where there are equivalents.
  e.g. tnx (thanks) → grma (go raibh maith agat). If there is no shortened Irish/English equivalent that you are aware of, translate the word into its full form.
• Tweet length: Although the original tweets have been limited to 140 characters, your translations do not have to adhere to this.
• Pre-translate options: It is acceptable to use Google Translate to pre-translate the tweets and correct the output, if you find it helpful. If it is too much of a hindrance, translation from scratch might work better. Note that the translations do not have to be 100% sound. Remember that the quality of Twitter language is questionable at the best of times, so your best shot is enough. Where there is ambiguity, go with your intuitive translation.
• Confidence level: After having translated the tweet, you are asked to indicate how confident you are that your translation is accurate. Please rate your translation on a scale of 1–10 from the drop-down menu provided.
• Skip translation: If you want to skip a tweet, leave the translation field blank and submit a confidence level of 0.
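The placeholder guideline above lends itself to an automatic check. The following Python sketch flags hashtags, @handles and [emoticon] placeholders that appear in a source tweet but not in its translation. It is an illustration only: the paper does not describe any such validation step, and the function names and the regular expression are assumptions made here.

```python
import re
from collections import Counter

# Hashtags, @handles and the literal [emoticon] placeholder.
# In Python 3, \w is Unicode-aware, so #olltoghchán is matched in full.
PLACEHOLDER = re.compile(r"#\w+|@\w+|\[emoticon\]")


def placeholders(text: str) -> Counter:
    """Count placeholders case-insensitively (so #GE16 equals #ge16)."""
    return Counter(match.lower() for match in PLACEHOLDER.findall(text))


def missing_placeholders(source: str, translation: str) -> Counter:
    """Placeholders present in the source but absent from the translation."""
    return placeholders(source) - placeholders(translation)


source = "My Dad [emoticon] soaked but smiling #ge16"
translation = "M’athair [emoticon] fliuch báite ach fós gealgháireach #ge16"
print(missing_placeholders(source, translation))  # Counter()
```

Comparing counts rather than positions allows the translator to reorder placeholders, as in the guideline's own example where @SorchaNicC moves to the end of the Irish tweet.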
4. Dissemination and Public Response
4.1. Dissemination
Given the previous positive reactions to Irish language social media campaigns, a call for participation on social media sites was a natural starting point for gathering prospective translators. This approach also takes into account that this is a non-conventional¹¹ crowd-sourcing platform, and therefore participants must be actively sought out. A web-based approach is most suitable in order to spread the word rapidly and reach a wide audience. In addition, because the translation platform was new and entirely web-based, it was more effective to direct users to the website through digital means (i.e. through sharing a hyperlink).
As mentioned earlier, the Irish language community is highly active on social media, particularly Twitter¹² and Facebook.¹³ Participants with knowledge and regular use of the Irish language on social media were especially valuable to this project, as it related to translation of a specific genre of language. The language used on Twitter by Irish language users often takes a different shape to language from other domains (Lynn et al., 2015). For instance, in Example (1), taken from our collected Twitter corpus, the term fér plé is used, which is an Irish phoneticisation of the phrase ‘fair play’ (‘well done to...’), as well as fé, which uses non-standard orthography based on the dialectal pronunciation of the word faoi ‘about/on’.
(1) Fér plé do @RTERnaG as leanúint leis an gcraoltóireacht fé #GE16!
‘Fair play to @RTERnaG for following reports on #GE16!’
¹¹ As opposed to Mechanical Turk or CrowdFlower, where frequent users visit the site to seek work, we needed to invite people to visit our site.
¹² 1,681,291 Irish language tweets to date according to Indigenous Tweets, a website which provides statistics on minority language tweeting:
¹³ For example, the public group ‘Gaeilge Amhain’ Available at
4.2. Public Response
The press and broadcasting media play a central role in the Irish language community, both in Ireland and among the diaspora overseas. It was fortunate, therefore, that this crowd-sourcing campaign was picked up, endorsed and distributed by a variety of Irish-language digital media outlets, e.g. Raidió na Gaeltachta, Raidió na Life, and Tuairisc.¹⁴ This happened mainly through promotion on Twitter, through the tagging of such media bodies in tweets or retweets. Endorsements from such public-facing outlets undoubtedly helped to shape the positive public response we received towards the campaign and thus broadened the reach for soliciting contribution.
Feedback from users, however, suggested that it would have been helpful for this platform to be available as a mobile application. One possible assumption that could be made from this is that users did indeed feel as though single tweet translation was a trivial task that could be carried out on the move and during a moment of downtime.
5. Results and Evaluation
Through our crowd-sourcing platform, over 1000 tweet translations were collected from 4th July, 2016 until 18th August, 2016 (see Table 1). A larger number of GA→EN tweets were collected (720) than EN→GA (324).
Language direction   Translations collected   Average confidence value
English→Irish        324                      8.04
Irish→English        720                      8.70
Table 1: Crowd-sourced translations, including average self-score rating
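As a quick arithmetic check on the figures in Table 1, the short Python sketch below sums the two directions and derives a count-weighted mean of the self-ratings. The overall mean is computed here for illustration and is not a figure reported in the paper; the variable names are hypothetical.

```python
# Figures copied from Table 1: direction -> (translations, mean self-rating).
TABLE_1 = {
    "English->Irish": (324, 8.04),
    "Irish->English": (720, 8.70),
}

# Total number of crowd-sourced translations across both directions.
total = sum(count for count, _ in TABLE_1.values())

# Mean self-rating weighted by the number of translations per direction.
weighted_mean = sum(count * mean for count, mean in TABLE_1.values()) / total

print(total)                   # 1044
print(f"{weighted_mean:.2f}")  # 8.50
```

The total of 1044 is consistent with the "over 1000" translations stated in the text.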
A natural question that arises in a study like this is the reliability of crowd-sourcing as a method for translation, and ultimately for data set creation. As the translation contributions are anonymous and the link through which the tweets are translated is available to the general public, how can we assess whether the translations we solicit are reliable? We took two approaches to answering this question:
(1) We asked the translators to score themselves, and as such rate their own translation quality. The purpose of this was two-fold. Firstly, it allowed for lower-scored translations to be re-presented to another user for translation as part of a quality control measure. Secondly, it allowed us to assess the reliability of self-scoring as a method for evaluating (or roughly evaluating) the quality of the crowd-sourced translations. It can be seen from the results in Table 1 that the average confidence value is above 8 (out of a range of 1–10) for crowd-sourced translations in both language directions.
¹⁴ Tuairisc is an online Irish language periodical of a news/journal/magazine nature.
Language direction   Translations reviewed   Average reviewer score
English→Irish        180                     8.68
Irish→English        180                     9.22
Table 2: Reviewer quality rating for subset of crowd-sourced data: average score for both language directions
(2) A native Irish speaker reviewed a portion of the tweet translations (n=180 in each language direction; see Table 2) and assigned them a quality rating (1–10).¹⁵ This scoring gave us a true indication of translation quality.
In order to assess the reliability of the crowd-sourced self-scoring method, we compare the reviewer’s ratings to the translators’ self-ratings. The reviewer’s average quality rating is higher (by more than 0.5) than the average rating of the translators in both language directions (compare Tables 1 and 2). Furthermore, in 71% of English→Irish translations and 82% of Irish→English translations, the reviewer deemed the translations either the same or of a higher quality than the original self-rated score (see Figure 2).
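The two comparisons described above (the gap between average ratings, and the share of translations the reviewer rated the same as or higher than the translator) can be expressed as a short Python sketch. Taken at face value, the averages in Tables 1 and 2 differ by 8.68 − 8.04 = 0.64 for English→Irish and 9.22 − 8.70 = 0.52 for Irish→English. The function names below are hypothetical and the rating pairs are invented for illustration; they are not the paper's data.

```python
def share_reviewer_at_least_self(pairs):
    """Fraction of (self_rating, reviewer_rating) pairs in which the
    reviewer's score is the same as or higher than the self-rating."""
    if not pairs:
        raise ValueError("no rating pairs supplied")
    agree = sum(1 for self_score, reviewer_score in pairs
                if reviewer_score >= self_score)
    return agree / len(pairs)


def mean_gap(pairs):
    """Mean of (reviewer rating - self-rating) over all pairs."""
    if not pairs:
        raise ValueError("no rating pairs supplied")
    return sum(reviewer - own for own, reviewer in pairs) / len(pairs)


# Invented ratings, for illustration only (not the paper's data).
toy_pairs = [(8, 9), (7, 7), (10, 9), (6, 8), (9, 10)]
print(share_reviewer_at_least_self(toy_pairs))  # 0.8
print(mean_gap(toy_pairs))                      # 0.6
```

Applied to the reviewed subset, the first function would yield the kind of percentage quoted in the text, and the second the per-direction gap between reviewer and translator averages.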
6. Conclusions and Future Work
6.1. Conclusions
We have shown that for a minority language such as Irish, while traditional crowd-sourcing platforms may not be an option for the collection of data, it is possible instead to benefit from the altruistic nature of the community towards language cultivation, in a way that would not be possible for a majority language. We have presented a web interface that is tailored to the needs of this project: user-friendly, casual, and accessible.¹⁶ The success of this platform is evident in the 1000+ tweet parallel corpus of user-generated content that has been collected and quality-assessed, as well as the positive public and media response that the project received. It is clear that when presented with a project that has clear benefits for the Irish language, speakers will donate their time and efforts to participate.
¹⁵ The same scoring system as the original translator: 1 being incomprehensible, and 10 being fully acceptable in terms of fluency and adequacy.
¹⁶ The code for this platform is open-source and is available from
Deep-Senti- Analytics/tree/master/Translator
Figure 2: Quality ratings for Irish→English and English→Irish translations provided by the original translator and the reviewer. Each of the two panels plots the number of tweets (y-axis) against the quality rating (x-axis) for the original translator and the reviewer.
We have also shown that a high quality of translation can be acquired through a crowd-sourcing campaign amongst the Irish-speaking online community. Our preliminary study has shown that a self-rating approach to evaluation can be a reliable indicator of the general quality of a crowd-sourced data set. Further extensive studies of course are required before more definite conclusions can be drawn on this.
As this was an exploratory work, we have also been able to identify some lessons that should be considered in future crowd-sourcing efforts. While generating awareness online is invaluable for the initial promotion of such a project, it became clear that the “hype” can die down relatively quickly if there is not a concerted effort to continue with the promotion drive. This is understandable, as the public will assume that (without reminders) all required translations have been collected. One option to mitigate this is to provide a progress bar on the site to indicate the percentage that has already been translated, and how much is outstanding.
6.2. Future Work
In the future, we aim to extend this study in a number of ways. Firstly, we would like to further investigate Irish speakers’ self-perceptions of their translation abilities in comparison to the actual professionally-rated quality. To this end, we would ask two professional translators to provide a quality rating of all tweet translations, and compare their scores to the original self-rated scores.
It would also be interesting to analyse more closely the tweets where the self-rating score differed significantly from that of the reviewer. By analysing the disagreements, we would have an insight into whether the reasons were due to major grammatical errors, problems with adequacy, fluency or merely typos or misuse of elements such as hashtags. This would give us a better insight into how reliable self-rating is as a metric for evaluation.
It is also our aim to perform preliminary MT experiments using the crowd-sourced data with the view to creating a UGC-specific MT system for the translation of English↔Irish text.
Acknowledgements
This work is supported by the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2016) and is co-funded by the European Regional Development Fund. We would like to acknowledge the work Saurabh Gupta put into the development of this crowd-sourcing platform, and to thank him for his support in our data analysis. We would also like to thank Yvette Graham, Áine Monk, Aoife Mitchell, and Abigail Walsh for their valuable comments and advice. We would also like to thank the two anonymous reviewers for their useful comments.
Ai, Haithem, Sorcha McGuire, and Andy Way. Sentiment Translation for low-resourced lan-
guages: Experiments on Irish General Election Tweets. In Proceedings of the 8th International
Conference on Intelligent Text Processing and Computational Linguistics, Budapest, Hungary,
Ambati, Vamshi, Stephan Vogel, and Jaime G. Carbonell. Active Learning and Crowd-Sourcing
for Machine Translation. In Proceedings of the Seventh International Conference on Language
Resources and Evaluation, pages 2169–2174. European Language Resources Association, 2010.
Arcan, Mihael, Caoilfhionn Lane, Eoin Ó Droighneáin, and Paul Buitelaar. IRIS: English-Irish
Machine Translation System. In Language Resources and Evaluation Conference. Special Inter-
est Group on the Design of Communication (SIGDOC), 2016.
Bakliwal, Akshat, Jennifer Foster, Jennifer van der Puil, Ron O’Brien, Lamia Tounsi, and Mark
Hughes. Sentiment Analysis of Political Tweets: Towards an Accurate Classier. In Pro-
ceedings of the Workshop on Language Analysis in Social Media, pages 49–58, Atlanta, Georgia,
June 2013. Association for Computational Linguistics.
Ceron, Andrea, Luigi Curini, and Stefano M. Iacus. Using Sentiment Analysis to Monitor Elec-
toral Campaigns. Social Science Computer Review, 33(1):3–20, 2017/03/31 2014.
CNGL-DCU Team. Brazilator. In The 11th Conference of the Association for Machine Translation in
the Americas, 2014. URL showcase.pdf.
Darmody, Merike, Tania Daly, et al. Attitudes towards the Irish Language on the Island of
Ireland. The Economic and Social Research Institute, 2015.
PBML ??? JUNE 2017
Dowling, Meghan, Lauren Cassidy, Eimear Maguire, Teresa Lynn, Ankit Srivastava, and John
Judge. Tapadóir: Developing a Statistical Machine Translation Engine and Associated Re-
sources for Irish. 2015.
Gupta, Aditi and Ponnurangam Kumaraguru. Credibility Ranking of Tweets During High Im-
pact Events. In Proceedings of the 1st Workshop on Privacy and Security in Online Social Media,
PSOSM ’12, pages 2:2–2:8, New York, NY, USA, 2012. ACM. doi: 10.1145/2185354.2185356.
Jin, Fang, Edward Dougherty, Parang Saraf, Yang Cao, and Naren Ramakrishnan. Epidemio-
logical Modeling of News and Rumors on Twitter. In Proceedings of the 7th Workshop on Social
Network Mining and Analysis, SNAKDD ’13, pages 8:1–8:9, New York, NY, USA, 2013. ACM.
ISBN 978-1-4503-2330-7. doi: 10.1145/2501025.2501027.
Judge, John, Ailbhe Ní Chasaide, Rose Ní Dhubhda, Kevin P. Scannell, and Elaine Uí Dhon-
nchadha. The Irish Language in the Digital Age. META-NET White Paper Series: Europe’s
Languages in the Digital Age. Springer, 2012.
Lacka, Derek and William J. Moner. Local languages, global networks: Mobile design for
minority language users. In Proceedings of the 34th ACM International Conference on the Design
of Communication, page 14. ACM, 2016.
Lukasik, Michal, Trevor Cohn, and Kalina Bontcheva. Classifying Tweet Level Judgements of
Rumours in Social Media. In Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2590–2595,
Lynn, Teresa, Kevin Scannell, and Eimear Maguire. Minority language Twitter: Part-of-speech
tagging and analysis of Irish tweets. The 53rd Annual Meeting of the Association for Computa-
tional Linguistics and the 7th International Joint Conference on Natural Language Processing of the
Asian Federation of Natural Language Processing, 2015.
Mitra, Tanushree, Graham P. Wright, and Eric Gilbert. A Parsimonious Language Model of
Social Media Credibility Across Disparate Events. In Proceedings of the 2017 ACM Conference
on Computer Supported Cooperative Work and Social Computing, CSCW ’17, pages 126–145, New
York, NY, USA, 2017. ACM. doi: 10.1145/2998181.2998351.
Moorkens, Joss. Irish Translator Survey Report. Dublin City University, 2016.
O’Connor, Brendan, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith.
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. In Cohen,
William W. and Samuel Gosling, editors, ICWSM. The AAAI Press, 2010.
Zaidan, Omar F. and Chris Callison-Burch. Crowdsourcing Translation: Professional Quality
from Non-professionals. In Proceedings of the 49th Annual Meeting of the Association for Com-
putational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 1220–1229,
Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
Address for correspondence:
Meghan Dowling
ADAPT Centre, Dublin City University, Glasnevin, Dublin 9, Ireland
... In fact, according to the Indigenous Tweets website, which curates tweets from indigenous and minority languages worldwide, there have been over 3 million tweets sent in Irish to date. 4 With the increased availability of user-generated Irish language content, it is unsurprising that there has been an increased interest in the application of technology to analyse Irish language use online, in order to gain insights into how the language is used (e.g. POS-tagging (Lynn et al., 2015), machine translation (Dowling et al., 2017) and sentiment analysis (Afli et al., 2017)). ...
Conference Paper
Full-text available
As is the case with many languages, research into code-switching in Modern Irish has, until recently, mainly been focused on the spoken language. Online user-generated content (UGC) is less restrictive than traditional written text, allowing for code-switching, and as such, provides a new platform for text-based research in this field of study. This paper reports on the annotation of (English) code-switching in a corpus of 1496 Irish tweets and provides a computational analysis of the nature of code-switching amongst Irish-speaking Twitter users, with a view to providing a basis for future linguistic and socio-linguistic studies.
Conference Paper
Full-text available
Minority and indigenous languages have a complex relationship with contemporary communication media. Social media, in particular, provide new venues for language use and revitalization, but also subject minority languages to inhibiting technological and social pressures. The present study contributes to a better understanding of social media and language use dynamics via an analysis of a survey of Irish language users (n=617) and their sociotechnical contexts. We develop a typology of social, linguistic, and technical factors that provide a theoretical and analytical foundation for future work. A complex interplay of social and technical factors impact minority language use in social media, and we suggest potential interaction design strategies for language activists and technologists to promote more effective engagement.
Conference Paper
We describe IRIS, a statistical machine translation (SMT) system for translating from English into Irish and vice versa. Since Irish is considered an under-resourced language with a limited amount of machine-readable text, building a machine translation system that produces reasonable translations is rather challenging. As translation is a difficult task, current research in SMT focuses on obtaining statistics either from a large amount of parallel, monolingual or other multilingual resources. Nevertheless, we collected available English-Irish data and developed an SMT system aimed at supporting human translators and enabling cross-lingual language technology tasks.
Conference Paper
Tapadóir (from the Irish tapa 'fast' and the nominal suffix -óir) is a statistical machine translation (SMT) project, funded by the Irish government. This work was commissioned to help government translators meet the translation demands which have arisen from the Irish language's status as an official EU and national language. The development of this system, which translates English into Irish (a morphologically rich, low-resourced minority language), has produced an interesting set of challenges. These challenges have inspired a creative response to the lack of data and NLP tools available for the Irish language and have also resulted in the development of new resources for the Irish linguistic and NLP community. We show that our SMT system out-performs Google Translate™ (a widely used general-domain SMT system) as a result of steps we have taken to tailor translation output to the user's specific needs.
Conference Paper
Noisy user-generated text poses problems for natural language processing. In this paper, we show that this statement also holds true for the Irish language. Irish is regarded as a low-resourced language, with limited annotated corpora available to NLP researchers and linguists to fully analyse the linguistic patterns in language use in social media. We contribute to recent advances in this area of research by reporting on the development of a part-of-speech annotation scheme and an annotated corpus for Irish language tweets. We also report on state-of-the-art tagging results of training and testing three existing POS-taggers on our new dataset.
Conference Paper
Characterizing information diffusion on social platforms like Twitter enables us to understand the properties of underlying media and model communication patterns. As Twitter gains in popularity, it has also become a venue to broadcast rumors and misinformation. We use epidemiological models to characterize information cascades on Twitter resulting from both news and rumors. Specifically, we use the SEIZ enhanced epidemic model, which explicitly recognizes skeptics, to characterize eight events across the world and spanning a range of event types. We demonstrate that our approach is accurate at capturing diffusion in these events. Our approach can be fruitfully combined with other strategies that use content modeling and graph theoretic features to detect (and possibly disrupt) rumors.
Twitter has evolved from being a conversation or opinion sharing medium among friends into a platform to share and disseminate information about current events. Events in the real world create a corresponding spur of posts (tweets) on Twitter. Not all content posted on Twitter is trustworthy or useful in providing information about the event. In this paper, we analyzed the credibility of information in tweets corresponding to fourteen high impact news events of 2011 around the globe. From the data we analyzed, on average 30% of total tweets posted about an event contained situational information about the event while 14% was spam. Only 17% of the total tweets posted about the event contained situational awareness information that was credible. Using regression analysis, we identified the important content-based and source-based features that can predict the credibility of information in a tweet. Prominent content-based features were the number of unique characters, swear words, pronouns, and emoticons in a tweet; prominent user-based features were the number of followers and the length of the username. We adopted a supervised machine learning and relevance feedback approach using the above features to rank tweets according to their credibility score. The performance of our ranking algorithm improved significantly when we applied a re-ranking strategy. Results show that extraction of credible information from Twitter can be automated with high confidence.
Conference Paper
We connect measures of public opinion measured from polls with sentiment measured from text. We analyze several surveys on consumer confidence and political opinion over the 2008 to 2009 period, and find they correlate to sentiment word frequencies in contemporaneous Twitter messages. While our results vary across datasets, in several cases the correlations are as high as 80%, and capture important large-scale trends. The results highlight the potential of text streams as a substitute and supplement for traditional polling.
Conference Paper
Social media has increasingly become central to the way billions of people experience news and events, often bypassing journalists, the traditional gatekeepers of breaking news. Naturally, this casts doubt on the credibility of information found on social media. Here we ask: Can the language captured in unfolding Twitter events provide information about the event's credibility? By examining the first large-scale, systematically-tracked credibility corpus of public Twitter messages (66M messages corresponding to 1,377 real-world events over a span of three months), and identifying 15 theoretically grounded linguistic dimensions, we present a parsimonious model that maps language cues to perceived levels of credibility. While not deployable as a standalone model for credibility assessment at present, our results show that certain linguistic categories and their associated phrases are strong predictors surrounding disparate social media events. In other words, the language used by millions of people on Twitter has considerable information about an event's credibility. For example, hedge words and positive emotion words are associated with lower credibility.
Conference Paper
Social media is a rich source of rumours and corresponding community reactions. Rumours reflect different characteristics, some shared and some individual. We formulate the problem of classifying tweet level judgements of rumours as a supervised learning task. Both supervised and unsupervised domain adaptation are considered, in which tweets from a rumour are classified on the basis of other annotated rumours. We demonstrate how multi-task learning helps achieve good results on rumours from the 2011 England riots.
In recent years, there has been increasing attention in the literature to the possibility of analyzing social media as a useful complement to traditional off-line polls to monitor an electoral campaign. Some scholars claim that by doing so, we can also produce a forecast of the result. Relying on a proper methodology for sentiment analysis remains a crucial issue in this respect. In this work, we apply the supervised method proposed by Hopkins and King to analyze the voting intention of Twitter users in the United States (for the 2012 Presidential election) and Italy (for the two rounds of the centre-left 2012 primaries). This methodology presents two crucial advantages compared to traditionally employed alternatives: a better interpretation of the texts and more reliable aggregate results. Our analysis shows a remarkable ability of Twitter to nowcast as well as to forecast electoral results.