Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets

Teresa Lynn¹,³, Kevin Scannell², and Eimear Maguire¹

¹ADAPT Centre, School of Computing, Dublin City University, Ireland
²Department of Mathematics and Computer Science, St. Louis University, USA
³Department of Computing, Macquarie University, Sydney, Australia

¹{tlynn,emaguire}@computing.dcu.ie, ²kscanne@gmail.com, ³teresa.lynn@mq.edu.au
Abstract

Noisy user-generated text poses problems for natural language processing. In this paper, we show that this statement also holds true for the Irish language. Irish is regarded as a low-resourced language, with limited annotated corpora available to NLP researchers and linguists to fully analyse the linguistic patterns in language use in social media. We contribute to recent advances in this area of research by reporting on the development of a part-of-speech annotation scheme and annotated corpus for Irish language tweets. We also report state-of-the-art tagging results from training and testing three existing POS-taggers on our new dataset.
1 Introduction

The language style variation used on social media platforms, such as Twitter, is often referred to as noisy user-generated text. Tweets can contain typographical errors and ungrammatical structures that pose challenges for processing tools that have been designed for and tailored to high-quality, well-edited text such as that found in newswire, literature and official documents. Previous studies, Foster et al. (2011) and Petrov and McDonald (2012) for example, have explored the effect that the style of language used in user-generated content has on the performance of standard NLP tools. Other studies by Gimpel et al. (2011), Owoputi et al. (2013), Avontuur et al. (2012), Rehbein (2013) and Derczynski et al. (2013) (POS-tagging), Ritter et al. (2011) (named entity recognition), and Kong et al. (2014) and Seddah et al. (2012) (parsing) have shown that NLP tools and resources need to be adapted to cater for the linguistic differences present in such text.
When considering data-driven NLP tasks, a lack of resources can also produce additional challenges. We therefore examine the impact of noisy user-generated text on the existing resources for Irish, a low-resourced language. We also explore options for leveraging existing resources to produce a new domain-adapted POS-tagger for processing Irish Twitter data. We achieve this by:

- defining a new POS tagset for Irish tweets
- providing a mapping from the PAROLE Irish POS-tagset to this new one
- manually annotating a corpus of 1537 Irish tweets
- training three statistical taggers on our data and reporting results

This paper is divided as follows: Section 2 gives a summary of Twitter and issues specific to Irish Twitter data. Section 3 discusses the new part-of-speech tagged corpus of Irish tweets. Section 4 discusses our inter-annotator agreement study and the observations we note from annotator disagreements. Section 5 reports our tagging accuracy results for three state-of-the-art statistical taggers.
2 Irish Tweets

Irish, the official and national language of Ireland, is a minority language. While it is a second language for most speakers, everyday use outside of academic environments has seen a recent resurgence on social media platforms such as Facebook and Twitter. Twitter is a micro-blogging platform which allows users (tweeters) to create a social network through sharing or commenting on items of social interest such as ideas, opinions, events and news. Tweeters can post short messages, called tweets, of up to 140 characters in length, that can typically be seen by the general public, including the user's followers. Tweets can be classified by topic by using hashtags (e.g. #categoryname) and linked to other tweeters through the use of at-mentions (e.g. @username).

The first tweets in Irish appeared not long after the launch of Twitter in 2006, and there have been more than a million tweets in Irish since then, by over 8000 tweeters worldwide.¹

¹ http://indigenoustweets.com/ga/
The social nature of tweets can result in the use of informal text, unstructured or ungrammatical phrases, and a variety of typographical errors. The 140-character limit can also lead to truncated ungrammatical sentences, innovative spellings, and word play, such as those discussed by Eisenstein (2013) for English. From our analysis, this phenomenon appears to extend also to Irish tweets. In Figure 1, we provide an example of an Irish tweet that contains some of these NLP challenges:

Freezing i dTra Li,Ciarrai chun cinn le cuilin.
Freezing i dTrá Lí, tá Ciarraí chun cinn le cúilín.
'Freezing in Tralee, Kerry (is) ahead by a point.'

Figure 1: Example of a noisy Irish tweet
Diacritics
Irish, in its standard orthography, marks long vowels with diacritics (á, é, í, ó, ú). Our analysis of Irish tweets revealed that these diacritics are often replaced with non-accented vowels (cúilín => cuilin). There are a number of word pairs that are differentiated only by the presence or absence of these diacritics (for example, cead 'permission' : céad 'hundred'). There are many possible reasons for omitting diacritics, including shortening the time required to tweet (this tweet is from a spectator at a Gaelic Football match), a lack of knowledge on how to find diacritics on a device's keyboard, carelessness, or uncertainty about the correct spelling.
Code-switching
Alternating between English and Irish is common in our dataset. This is unsurprising as virtually all Irish speakers are fluent English speakers, and many use English as their first language in their daily lives. In the example given, there is no obvious reason why "Freezing" was used in place of various suitable Irish words (e.g. Préachta), other than perhaps seeking a more dramatic effect. Sometimes, however, English is understandably used when there is no suitable Irish term in wide use, for example 'hoodie' or 'rodeo-clown'. Aside from occurring at an intra-sentential level, code-switching at an inter-sentential level is also common in Irish: an t-am seo an t7ain seo chugainn bei 2 ag partyáil le muintir Ráth Daingin! Hope youre not too scared #upthevillage. In total, of the 1537 tweets in our gold-standard corpus, 326 (21.2%) contain at least one English word with the tag G.²
Verb drop
We can see in this example that the verb tá 'is' has been dropped. This is a common phenomenon in user-generated content for many languages. The verb is usually understood and can be interpreted through the context of the tweet.

Spacing
Spacing after punctuation is often overlooked (i) in an attempt to shorten messages or (ii) through carelessness. In certain instances, this can cause problems when tokenising tweets: Li,Ciarrai => Li, Ciarrai.
Phonetic spelling
Linguistic innovations often result from tweeters trying to fit their message into the 140-character limit. Our dataset contains some interesting examples of this phenomenon occurring in Irish. For example, t7ain is a shortened version of tseachtain 'week'. Here the word seacht 'seven' is shortened to its numeral form and the initial mutation t remains attached. Other examples are gowil (go bhfuil), beidir (b'fhéidir), v (bhí).

Abbreviations
Irish user-generated text has its own set of frequently used phrase abbreviations, sometimes referred to as text-speak. Forms such as mgl (maith go leor, 'fair enough') and grma (go raibh maith agat, 'thank you') have been widely adopted by the Irish language community.
The linguistic variation of Irish used in social media remains relatively unexplored, at least in any scientific manner. We therefore expect that the part-of-speech tagged corpus and taggers that we have developed for Irish language tweets will contribute to further research in this area.
3 Building a corpus of annotated Irish tweets

Unlike rule-based systems, statistical data-driven POS-taggers require annotated data on which they can be trained. Therefore, we build a gold-standard corpus of 1537 Irish tweets annotated with a newly defined Twitter POS tagset. The following describes this development process.

² The tag G is used for foreign words, abbreviations, items and unknowns, as shown in Table 1.
3.1 New Irish Twitter POS tagset

The rule-based Irish POS-tagger (Uí Dhonnchadha and van Genabith, 2006) for standard Irish text is based on the PAROLE Morphosyntactic Tagset (ITÉ, 2002). We used this as the basis for our Irish Twitter POS tagset. We were also inspired by the English-tweet POS tagset defined by Gimpel et al. (2011), and have aimed to stay closely aligned to it in order to facilitate any future work on cross-lingual studies.
We started by selecting a random sample of 500 Irish tweets to carry out an initial analysis. From our analysis of these tweets we concluded that our new Twitter-specific POS tagset would not require the granularity of the original standard Irish POS set. For example, we do not need to differentiate between a locative adverb and a temporal adverb, or between a vocative particle and an infinitive particle. While our tagset is also closely aligned with the English-tweet POS tagset, we introduce the following tags that the English set does not use:

VN: Verbal Noun. Progressive aspectual phrases in Irish are denoted by the preposition ag followed by a verbal noun (e.g. ag rith 'running'). We choose to differentiate between N and VN to avoid losing this verbal information in what would otherwise be a regular prepositional phrase.

#MWE: Multiword hashtag. These are hashtags containing strings of words used to categorise a text (e.g. #godhelpus). We retain information on the multi-word nature of these hashtags in order to facilitate future syntactic analysis efforts.

We also adapt the T particle tag to suit Irish linguistic features.

T: Particle. We extend the T tag to cover not only verb particles but all other Irish particles: relative particles, surname particles, infinitive particles, numeric particles, comparative particles, the vocative particle, and adverbial particles.

We do not use the following tags from the English set: S, Z, L, M, X, Y, as the linguistic cases they apply to do not occur in either standard or non-standard Irish. The final set of 21 POS-tags is presented in Table 1.
Most of the tags in the tagset are intuitive to an Irish language speaker. However, some tags require specific explanation in the guidelines. Hashtags and at-mentions can be a syntactic part of a sentence or phrase within a tweet. When this is the case, we apply the relevant syntactic POS tag. For example, Beidh mé ar chlár @SplancNewstalk anocht ag labhairt leis @AnRonanEile faoi #neknomination_N 'I will be on @SplancNewstalk tonight speaking to @AnRonanEile about #neknomination'. Otherwise, if they are not part of the syntactic structure of the tweet (typically appended or prepended to the main tweet text), they are tagged as @ and # (or #MWE). In our gold-standard corpus, 554 out of 693 hashtags (79.9%), and 1604 out of 1946 at-mentions (82.4%), are of this non-syntactic type.
With some Twitter clients, if a tweet exceeds the 140-character limit, the tweet is truncated and an ellipsis is used to indicate that some text is missing. We leave this ellipsis appended to the final (usually partial) token, which is often a URL, and mark these cases as G. For example: http://t.co/2nvQsxaIa7...
Some strings of proper nouns contain other POS elements, such as determiners and common nouns. Despite being a proper noun phrase syntactically, we tag each token as per its POS. For example, Cú na_D mBaskerville 'The Hound of the Baskervilles', where the determiner na receives the tag D.
3.2 Tweet pre-processing pipeline
About 950,000 Irish language tweets were posted between Twitter's launch in 2006 and September 2014 by approximately 8000 users identified and tracked by the Indigenous Tweets website. Non-Irish tweets from these users were filtered out using a simple character-trigram language identifier. We selected a random sample of 1550 tweets from these 950,000 tweets and processed them as follows:
(1) We tokenised the set with Owoputi et al. (2013)'s version of twokenise³, which works well on web content features such as emoticons and URLs.

(2) Using a list of multiword units from Uí Dhonnchadha (2009)'s rule-based Xerox FST tokeniser⁴, we rejoined multiword tokens that had been split by the language-independent tokeniser (e.g. the compound preposition go dtí).

(3) Using regular expressions, we then split tokens with the contraction prefixes b' (ba), d' (do), m' (mo). For example b'fhéidir 'maybe'; d'ith 'ate'; m'aigne 'my mind'.

(4) We took a bootstrapping approach by pre-tagging and lemmatising the data with the rule-based Irish POS-tagger first, and then mapped the tags to our new Twitter-specific tagset.

(5) In cases where the rule-based tagger failed to produce a unique tag, we used a simple bigram tag model (trained on the gold-standard POS-tagged corpus from Uí Dhonnchadha (2009); see Section 5.1) to choose the most likely tag from among those output by the rule-based tagger.

(6) Finally, we manually corrected both the tags and lemmas to create a gold-standard corpus.

³ Available to download from http://www.ark.cs.cmu.edu/TweetNLP/#pos
⁴ Available to download from https://github.com/stesh/apertium-gle/tree/master/dev/irishfst

Tag     Description (PAROLE tags)
N       common noun (Noun, Pron Ref, Subst)
^       proper noun (Prop Noun)
O       pronoun (Pron Pers, Pron Idf, Pron Q, Pron Dem)
VN      verbal noun (Verbal Noun)
V       verb (Cop, Verb*)
A       adjective (Adj, Verbal Adj, Prop Adj)
R       adverb (Adv*)
D       determiner (Art, Det)
P       preposition, prep. pronoun (Prep*, Pron Prep)
T       particle (Part*)
,       punctuation (Punct)
&       conjunction (Conj Coord, Conj Subord)
$       numeral, quantifier (Num)
!       interjection (Itj)
G       foreign words, abbreviations, items (Foreign, Abr, Item, Unknown)
~       discourse marker
#       hashtag
#MWE    multi-word hashtag
@       at-mention
E       emoticon
U       URL/email address/XML (Web)

Table 1: Mapping of the Irish Twitter tagset to the PAROLE tagset. (* indicates all forms of the fine-grained set for that tag.)
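To make the pre-processing steps above more concrete, the following is a minimal sketch of steps (1)–(4) in miniature. It is purely illustrative and not the code used in this work: the tokeniser is a whitespace stand-in for twokenise, and the multiword-unit list, contraction pattern and tag mapping are small hypothetical samples rather than the full resources described above.

```python
import re

# Illustrative stand-ins; the actual pipeline used Owoputi et al.'s twokenise and
# the multiword-unit list from Uí Dhonnchadha (2009)'s Xerox FST tokeniser.
MULTIWORD_UNITS = {("go", "dtí")}                 # e.g. the compound preposition "go dtí"
CONTRACTION = re.compile(r"^([BbDdMm]')(\S+)$")   # b'fhéidir, d'ith, m'aigne
PAROLE_TO_TWITTER = {                             # hypothetical fragment of the Table 1 mapping
    "Noun": "N", "Prop Noun": "^", "Verb": "V", "Cop": "V",
    "Adj": "A", "Art": "D", "Prep": "P", "Part": "T", "Punct": ",",
}

def tokenise(tweet):
    """Step (1), crudely: whitespace tokenisation standing in for twokenise."""
    return tweet.split()

def rejoin_multiwords(tokens):
    """Step (2): rejoin multiword tokens split by the language-independent tokeniser."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in MULTIWORD_UNITS:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def split_contractions(tokens):
    """Step (3): split off the b'/d'/m' contracted prefixes."""
    out = []
    for tok in tokens:
        match = CONTRACTION.match(tok)
        out.extend([match.group(1), match.group(2)] if match else [tok])
    return out

def map_tag(parole_tag):
    """Step (4): map a coarse PAROLE category onto the Twitter tagset; unknowns fall back to G."""
    return PAROLE_TO_TWITTER.get(parole_tag, "G")

tokens = split_contractions(rejoin_multiwords(tokenise("B'fhéidir go rachaidh mé go dtí an cluiche")))
print(tokens)          # ["B'", 'fhéidir', 'go', 'rachaidh', 'mé', 'go dtí', 'an', 'cluiche']
print(map_tag("Art"))  # 'D'
```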
3.3 Annotation

The annotation task was shared between two annotators. Correction of the first 500 tweets formed a basis for assessing both the intuitiveness of our tagset and the usability of our annotation guide. Several discussions and revisions were involved at this stage before finalising the tagset. The next 1000 tweets were annotated in accordance with the guidelines, while using the first 500 as a reference. At this stage, we removed a small number of tweets that contained 100% English text (errors in the language identifier). All other tweets containing non-Irish text represented valid instances of code-switching.

The annotators were also asked to verify and correct the lemma form if an incorrect form was suggested by the morphological analyser. All other tokeniser issues, often involving Irish contractions, were also addressed at this stage. For example Tá'n > Tá an.
4 Inter-Annotator Agreement

Inter-annotator agreement (IAA) studies are carried out during annotation tasks to assess consistency, levels of bias, and the reliability of the annotated data. For our study, we chose 50 random Irish tweets, which both annotators tagged from scratch. This differed from the rest of the annotation process, which was semi-automated. However, elimination of possible bias towards the pre-annotation output allowed for a more disciplined assessment of the agreement level between the annotators. We achieved an agreement rate of 90% and a κ score (Cohen, 1960) of 0.89.
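For reference, agreement figures of this kind can be computed from two parallel tag sequences in a few lines; the sketch below (using invented toy annotations, not our IAA data) computes raw agreement and Cohen's κ in the standard way.

```python
from collections import Counter

def observed_agreement(tags_a, tags_b):
    """Proportion of tokens on which the two annotators assign the same tag."""
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)

def cohens_kappa(tags_a, tags_b):
    """Cohen (1960): kappa = (p_o - p_e) / (1 - p_e), where p_e is the chance
    agreement estimated from each annotator's marginal tag distribution."""
    n = len(tags_a)
    p_o = observed_agreement(tags_a, tags_b)
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    p_e = sum((freq_a[t] / n) * (freq_b[t] / n) for t in set(tags_a) | set(tags_b))
    return (p_o - p_e) / (1 - p_e)

# Toy example: token-level tags from two annotators over the same tweets
ann1 = ["N", "V", "P", "#", "@", "N", "T", "V"]
ann2 = ["N", "V", "P", "#", "@", "^", "T", "V"]
print(observed_agreement(ann1, ann2), cohens_kappa(ann1, ann2))
```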
Smaller tagsets make an annotation task easier due to the constraint on the choices available to the annotator, and this is certainly one reason for our high IAA score. This result also suggests that the tagging guidelines were clear and easy to understand. A closer analysis of the IAA data explains some of the disagreements. The inconsistency of the conflicts suggests that the disagreements arose from human error. Some examples are given below.
Noun vs Proper Noun
The word Gaeilge 'Irish' was tagged on occasion as N (common noun) instead of ^ (proper noun). This also applied to some proper noun strings such as Áras an Uachtaráin (the official name of the President of Ireland's residence).

Syntactic at-mentions
A small number of at-mentions that were syntactically part of a tweet (e.g. mar chuid de @SnaGaeilge 'as a part of @SnaGaeilge') were incorrectly tagged as regular at-mentions (@).

Retweet colons
One annotator marked ':' as punctuation at random stages rather than using the discourse tag ~.
5 Experiments

5.1 Data

We took the finalised set of Irish POS-tagged tweets and divided them into a test set (148 tweets), a development set (147 tweets) and a training set (1242 tweets). Variations of this data are used in our experiments where we normalise certain tokens (described further in Section 5.2).
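A split of this shape can be produced deterministically, for example as in the sketch below; the seed and shuffling procedure are illustrative assumptions, not a record of how the split reported here was actually made.

```python
import random

def split_corpus(tweets, test_n=148, dev_n=147, seed=42):
    """Shuffle once with a fixed seed, then carve off test and dev sets;
    the remainder (1242 tweets for a 1537-tweet corpus) is used for training."""
    tweets = list(tweets)
    random.Random(seed).shuffle(tweets)
    return tweets[:test_n], tweets[test_n:test_n + dev_n], tweets[test_n + dev_n:]

test, dev, train = split_corpus(range(1537))
print(len(test), len(dev), len(train))   # -> 148 147 1242
```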
We also automatically converted Uí Dhonnchadha (2009)'s 3198-sentence (74,705-token) gold-standard POS-tagged corpus using our mapping scheme. This text is from the New Corpus for Ireland – Irish⁵, which is a collection of text from books, newswire, government documents and websites. The text is well-structured, well-edited and grammatical, and of course lacks Twitter-specific features like hashtags, at-mentions and emoticons, thus differing greatly from our Twitter data. The average sentence length in this corpus is 27 tokens, diverging significantly from the average tweet length of 17.2 tokens. Despite this, and despite the fact that the converted tags were not reviewed for accuracy, we were still interested in exploring the extent to which this additional training data could improve the accuracy of our best-performing model. We refer to this set as NCII 3198.

⁵ New Corpus for Ireland – Irish. See http://corpas.focloir.ie
5.2 Taggers
We trained and evaluated three state-of-the-art
POS-taggers with our data. All three taggers are
open-source tools.
Morfette
As Irish is an inflected language, inclusion of the lemma as a training feature is desirable in an effort to overcome data sparsity. Therefore we trained Morfette (Chrupala et al., 2008), a lemmatisation tool that also predicts POS tags and uses the lemma as a training feature. We report on experiments both with and without optional dictionary (Dict) information. We used the dictionary from Scannell (2003), which contains 350,418 surface forms. Our baseline Morfette data (BaseMorf) contains the token, lemma and POS-tag. The lemmas of URLs and non-syntactic hashtags have been normalised as <URL> and <#>, respectively.

We then evaluated the tagger with (non-syntactic) <#>, <@> and <URL> normalisation of both token form and lemma (NormMorf). Both experiments are re-run with the inclusion of our dictionary (BaseMorf+Dict, NormMorf+Dict).
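The normalisation used in the NormMorf setup can be pictured with a small sketch such as the one below; the URL pattern and the exact placeholder handling are illustrative assumptions rather than the precise rules applied to our data.

```python
import re

URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)

def normalise(token, tag):
    """Collapse Twitter-specific tokens into placeholder forms, in the spirit of
    NormMorf: non-syntactic hashtags/at-mentions and URLs become <#>, <@> and
    <URL> respectively; all other tokens are left untouched."""
    if tag == "U" or URL_RE.match(token):
        return "<URL>"
    if tag == "#":   # non-syntactic hashtag (syntactic ones keep their POS tag and form)
        return "<#>"
    if tag == "@":   # non-syntactic at-mention
        return "<@>"
    return token

tokens = [("Beidh", "V"), ("@username", "@"), ("#upthevillage", "#"),
          ("http://t.co/2nvQsxaIa7", "U")]
print([normalise(tok, tag) for tok, tag in tokens])   # ['Beidh', '<@>', '<#>', '<URL>']
```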
ARK
We also trained the CMU Twitter POS-tagger (Owoputi et al., 2013), which, in addition to providing pre-trained models, allows for retraining with new languages. The current release does not, however, allow for the inclusion of the lemma as a feature in training. Instead, for comparison purposes, we report on two separate experiments, one using the surface tokens as features, and the other using only the lemmas as features (ArkForm, ArkLemma). We also tested versions of our data with normalised at-mentions, hashtags and URLs, as above.
Stanford tagger
We re-trained the Stanford tagger (Toutanova et al., 2003) with our Irish data. We experimented by training models using both the surface form only (BestStanForm) and the lemma only (BestStanLemma). The best performing model was based on the feature set left3words, suffix(4), prefix(3), wordshapes(-3,3), biwords(-1,1), using the owlqn2 search option.⁶

⁶ All other default settings were used.
Baseline
Finally, to establish a baseline (Baseline), and more specifically to evaluate the importance of domain adaptation in this context, we evaluated a slightly enhanced version of the rule-based Irish tagger on the Twitter dataset. When the rule-based tagger produced more than one possible tag for a given token, we applied a bigram tag model to choose the most likely tag, as we did in creating the first draft of the gold-standard corpus. In addition, we automatically assigned the tag U to all URLs, # to all hashtags, and @ to all at-mentions.
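The bigram disambiguation step can be sketched as follows; the probability table here is invented for illustration, whereas in our setup the bigram tag model was trained on the converted Uí Dhonnchadha (2009) corpus.

```python
# Illustrative bigram tag model P(tag | previous tag); real estimates would come
# from the gold-standard corpus converted to the Twitter tagset.
BIGRAM_P = {
    ("D", "N"): 0.45, ("D", "A"): 0.10, ("P", "N"): 0.30,
    ("V", "O"): 0.25, ("<s>", "V"): 0.20,
}

def choose_tag(prev_tag, candidate_tags, fallback=1e-6):
    """When the rule-based tagger returns several possible tags for a token,
    pick the candidate with the highest bigram probability given the previous tag."""
    return max(candidate_tags, key=lambda t: BIGRAM_P.get((prev_tag, t), fallback))

def disambiguate(sentence_candidates):
    """sentence_candidates: one candidate-tag list per token (greedy, left to right)."""
    prev, chosen = "<s>", []
    for candidates in sentence_candidates:
        tag = candidates[0] if len(candidates) == 1 else choose_tag(prev, candidates)
        chosen.append(tag)
        prev = tag
    return chosen

# e.g. the rule-based tagger is unsure whether a token after a determiner is a noun or an adjective
print(disambiguate([["D"], ["N", "A"], ["P"], ["N"]]))   # -> ['D', 'N', 'P', 'N']
```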
5.3 Results

The results for all taggers and variations of the data setup are presented in Table 2.

Training Data              Dev     Test
Baseline
  Rule-Based Tagger        85.07   83.51
Morfette
  BaseMorf                 86.77   88.67
  NormMorf                 87.94   88.74
  BaseMorf+Dict            87.50   89.27
  NormMorf+Dict            88.47   90.22
ARK
  BaseArkForm              88.39   89.92
  ArkForm#@                89.36   90.94
  ArkForm#URL@             89.32   91.02
  BaseArkLemma#URL         90.74   91.62
  ArkLemma#URL@            91.46   91.89
Stanford
  BestStanForm             82.36   84.08
  BestStanLemma            87.34   88.36
Bootstrapping Best Model
  ArkLemma#URL@+NCII       92.60   93.02

Table 2: Results of evaluation of POS-taggers on the new Irish Twitter corpus
Firstly, our best performing single model (ArkLemma#URL@) on the test set achieves a score of 91.89%, which is 8 points above our rule-based baseline score of 83.51%. This confirms that tailoring training data for statistically-driven tools is a key element in processing noisy user-generated content, even in the case of minority languages. It is worth noting that the best-performing model learns from the lemma information instead of the surface form. This clearly demonstrates the effect that the inflectional nature of Irish has on data sparsity. The Twitter-specific tokens such as URLs, hashtags and at-mentions have been normalised, which demonstrates the impact that the relative uniqueness of these tokens has on the learner.

All of our results are comparable with state-of-the-art results produced by Gimpel et al. (2011) and Owoputi et al. (2013). This is interesting given that, in contrast to their work, we have not optimised our system with unsupervised word clusters, due to the lack of sufficient Irish tweet data. Nor have we included a tag dictionary, distributional similarity or phonetic normalisation, also due to a lack of resources.
We carried out a closer textual comparison of Owoputi et al. (2013)'s English tweet dataset (daily547) and our new Irish tweet dataset. After running each dataset through a language-specific spell-checker, we could see that the most highly ranked OOV (out-of-vocabulary) tokens in English are forms of text-speak, such as lol 'laugh out loud', lmao 'laugh my ass off' and ur 'your'. The most common OOVs in Irish, by contrast, are English words such as 'to', 'on', 'for', 'me', and words misspelled without diacritics. This observation shows the differences between the textual challenges of processing these two languages. It may also suggest that Irish Twitter text follows a more standard orthography than English Twitter text, and it will make for an interesting future cross-lingual study of Twitter data.
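The OOV comparison amounts to ranking, for each dataset, the tokens that a lexicon or spell-checker does not recognise; a minimal sketch (with a toy word list standing in for the language-specific spell-checker) is given below.

```python
from collections import Counter

def ranked_oov(tokens, lexicon, top_n=10):
    """Rank out-of-vocabulary tokens by frequency, ignoring Twitter-specific tokens."""
    oov = [t.lower() for t in tokens
           if t.lower() not in lexicon and not t.startswith(("#", "@", "http"))]
    return Counter(oov).most_common(top_n)

# Toy data: a tiny Irish word list and a handful of tweet tokens
lexicon = {"tá", "sé", "go", "maith", "agus"}
tokens = ["Tá", "sé", "go", "maith", "to", "cuilin", "to", "#gaeilge"]
print(ranked_oov(tokens, lexicon))   # English insertions and unaccented forms surface as OOV
```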
Finally, we explored the possibility of leveraging existing POS-tagged data by adding NCII 3198 to our best performing model, ArkLemma#URL@. We also duplicated the tweet training set to bring the weighting for both domains into balance. This brings our training set size to 5682 sentences (117,273 tokens). However, we find that this significant increase in the training set size results in just over a 1-point increase in POS-tagging accuracy. At a glance, we can see some obvious errors the combined model makes. For example, there is confusion when tagging the word an. This word functions as both a determiner and an interrogative verb particle. The lack of direct questions in the NCII corpus results in a bias towards the D (determiner) tag. In addition, many internal capitalised words (e.g. at the beginning of the second part of a tweet) are mislabelled as proper nouns. This is a result of the differing structure of the two datasets: each tweet may contain one or more phrases or sentences, while the NCII is split into single sentences.
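The domain balancing described above simply repeats the smaller tweet training set before concatenation; the sketch below is a minimal illustration that reproduces the combined set size of 5682 sentences, with the duplication factor being the only detail taken from the text.

```python
def combine_training_sets(tweet_sents, ncii_sents, tweet_weight=2):
    """Crude domain balancing: repeat the (smaller) tweet training set so that the
    two domains contribute comparable numbers of sentences, then concatenate."""
    return tweet_sents * tweet_weight + ncii_sents

# 1242 tweet training sentences duplicated (x2) plus 3198 NCII sentences -> 5682
print(len(combine_training_sets(list(range(1242)), list(range(3198)))))  # -> 5682
```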
6 Future Work

Limited resources and time prevented exploration of some options for improving our POS-tagging results. One of these options is to modify the CMU (English) Twitter POS-tagger to allow for the inclusion of lemma information as a feature. Another option, when there is more unlabelled data available (i.e. more Irish tweets online), would be to include Irish word cluster features in the training model. This approach has also been taken by Rehbein (2013) for POS tagging German tweets.

The resources we provide through this study are a valuable contribution to the Irish NLP community. Firstly, we expect that this new data resource (the POS-tagged Twitter corpus) will provide a solid basis for linguistic and sociolinguistic study of Irish on a social media platform. This new domain of Irish language use can be analysed in an empirical and scientific manner through corpus analysis by means of our data. The authors are currently working towards this follow-up study.

From a tool-development perspective, we expect this corpus and the derived POS-tagging models could be used in a domain-adaptation approach to parsing Irish tweets, similar to the work of Kong et al. (2014). This would involve adapting Lynn et al. (2012)'s Irish statistical dependency parser for use with social media text. Our corpus could provide the basis of a treebank for this work.

Following our discovery of the extent to which code-switching is present in our Irish Twitter data, we feel future studies on this phenomenon would be of interest to various research disciplines (e.g. Solorio et al. (2014)). In order to do that, we suggest updating the corpus with a separate tag for English tokens (that is, a tag other than G, which is also used for abbreviations, items and unknowns) before carrying out further experiments in this area.
7 Conclusion

We present the first dataset of gold-standard POS-tagged Irish language tweets, and we have produced training models for a selection of POS-taggers.⁷ We have also shown how we have leveraged existing work to build these resources for a low-resourced language and to achieve state-of-the-art results. We also confirm that the NLP challenges arising from noisy user-generated text can also apply to a minority language.

⁷ Our data is available to download from https://github.com/tlynn747/IrishTwitterPOS
8 Acknowledgments

The authors would like to thank the three anonymous reviewers for their helpful feedback. We would also like to thank Kevin Gimpel for his support with using the CMU English Twitter POS tagger, Djamé Seddah for his support with Morfette, and Elaine Uí Dhonnchadha and Francis Tyers for their support with the Irish rule-based POS tagger. This work was funded by the Fulbright Commission of Ireland (Fulbright Enterprise-Ireland Award 2014-2015), and supported by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University. The second author was partially supported by US NSF grant 1159174.
References

Tetske Avontuur, Iris Balemans, Laura Elshof, Nanne van Noord, and Menno van Zaanen. 2012. Developing a part-of-speech tagger for Dutch tweets. Computational Linguistics in the Netherlands Journal, 2:34–51, 12/2012.

Grzegorz Chrupala, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco.

J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37.

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Galia Angelova, Kalina Bontcheva, and Ruslan Mitkov, editors, RANLP, pages 198–206. RANLP 2011 Organising Committee / ACL.

Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 359–369. Association for Computational Linguistics.

J. Foster, Ö. Çetinoglu, J. Wagner, J. Le Roux, S. Hogan, J. Nivre, D. Hogan, J. Van Genabith, et al. 2011. #hardtoparse: POS tagging and parsing the Twitterverse. In Proceedings of the Workshop on Analyzing Microtext (AAAI 2011), pages 20–25.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 42–47, Stroudsburg, PA, USA. Association for Computational Linguistics.

ITÉ. 2002. PAROLE Morphosyntactic Tagset for Irish. Institiúid Teangeolaíochta Éireann.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1001–1012, Doha, Qatar, October. Association for Computational Linguistics.

Teresa Lynn, Jennifer Foster, Mark Dras, and Elaine Uí Dhonnchadha. 2012. Active learning and the Irish treebank. In Proceedings of the Australasian Language Technology Workshop (ALTA), pages 23–32.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–390, Atlanta, Georgia, June. Association for Computational Linguistics.

Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 Shared Task on Parsing the Web. In First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).

Ines Rehbein. 2013. Fine-grained POS tagging of German tweets. In Iryna Gurevych, Chris Biemann, and Torsten Zesch, editors, GSCL, volume 8105 of Lecture Notes in Computer Science, pages 162–175. Springer.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1524–1534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kevin P. Scannell. 2003. Automatic thesaurus generation for minority languages: an Irish example. Actes de la 10e conférence TALN à Batz-sur-Mer, 2:203–212.

Djamé Seddah, Benoit Sagot, Marie Candito, Virginie Mouilleron, and Vanessa Combet. 2012. The French Social Media Bank: a treebank of noisy user generated content. In Proceedings of COLING 2012, pages 2441–2458.

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, and Pascale Fung. 2014. Proceedings of the First Workshop on Computational Approaches to Code Switching, chapter Overview for the First Shared Task on Language Identification in Code-Switched Data, pages 62–72. Association for Computational Linguistics.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Elaine Uí Dhonnchadha and Josef van Genabith. 2006. A part-of-speech tagger for Irish using finite-state morphology and constraint grammar disambiguation. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006).

Elaine Uí Dhonnchadha. 2009. Part-of-Speech Tagging and Partial Parsing for Irish using Finite-State Transducers and Constraint Grammar. Ph.D. thesis, Dublin City University.