Published as a conference paper at ICLR 2020
SEMANTIC ENRICHMENT OF NIGERIAN PIDGIN ENGLISH FOR CONTEXTUAL SENTIMENT CLASSIFICATION
Wuraola Fisayo Oyewusi
Data Science Nigeria
Lagos, Nigeria
wuraola@datasciencenigeria.ai
Olubayo Adekanmbi
Data Science Nigeria
Lagos, Nigeria
olubayo@datasciencenigeria.ai
Olalekan Akinsande
Data Science Nigeria
Lagos, Nigeria
olalekan@datasciencenigeria.ai
ABSTRACT
Nigerian Pidgin, an adaptation of English, has evolved over the years through multi-language
code switching, code mixing and linguistic adaptation. While Pidgin preserves many words of the
standard English lexicon in both spelling and pronunciation, the fundamental meanings of these
words have changed significantly. For example, 'ginger' is not a plant but an expression of
motivation, and 'tank' is not a container but an expression of gratitude. The implication is
that the current approach of applying English sentiment analysis directly to social media text
from Nigeria is sub-optimal, as it cannot capture the semantic variation and contextual
evolution in the contemporary meaning of these words. In practice, although many words in
Nigerian Pidgin are spelled the same as in standard English, sentiment analysis models built
for English are not designed to capture the full intent of Nigerian Pidgin, whether it is used
alone or code-mixed. By augmenting scarce human-labelled code-changed text with ample synthetic
code-reformatted text and meaning, we achieve significant improvements in sentiment scoring.
Our research explores how to understand sentiment in an intrasentential code-mixing and
code-switching context where there has been significant word localization. This work presents
300 VADER-lexicon-compatible Nigerian Pidgin sentiment tokens with their scores, and 14,000
gold-standard Nigerian Pidgin tweets with their sentiment labels.
arXiv:2003.12450v1 [cs.CL] 27 Mar 2020
1 BACKGROUND
Language is evolving with the flattening world order and the pervasiveness of social media in
fusing cultures and bridging relationships at a click. One consequence of this conversational
evolution is intrasentential code switching, an alternation between two languages in a single
discourse where the switch occurs within a sentence (Koban, 2013). Increased instances of such
switching often lead to changes in the lexical and grammatical context of the language, which
are largely motivated by situational and stylistic factors (Inuwa et al., 2014). In addition,
the need to communicate effectively with different social classes has further driven this shift
in language meaning over a long period of time to serve socio-linguistic functions
(Ifechelobi, 2015).

Nigeria is estimated to have between three and five million people who primarily use Pidgin in
their day-to-day interactions, but it is said to be a second language for a much larger number,
up to 75 million people in Nigeria alone, about half the population (Carons & Onyioha, 2012).
Its meanings have diverged from Standard English through intertextuality, the shaping of a
text's meaning by another text based on the interconnection and influence of the audience's
interpretation of a text. One of the biggest social catalysts is the emerging urban youth
subculture and the growing semi-literate lower class in the chaotic medley of a converging
megacity (Igboanusi, 2008; Samanta et al., 2019).

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment
analysis tool that is specifically attuned to sentiments expressed in social media and also
works well on texts from other domains. The VADER lexicon has about 9,000 tokens, each with a
mean sentiment rating. It was built from well-established sentiment word-banks (LIWC, ANEW and
GI) and extended with a full list of Western-style emoticons, sentiment-related acronyms and
initialisms (e.g., LOL and WTF), and commonly used slang with sentiment value (e.g., nah, meh
and giggly) (Hutto & Gilbert, 2014).

Sentiment analysis of code-mixed text has been established in the literature at both word and
sub-word levels (Prabhu et al., 2016; Roncal, 2019; Jang & Shin, 2010). Improving sentiment
detection via label transfer from monolingual to synthetic code-switched text has been
demonstrated, with sentiment labelling accuracy gains of 1.5%, 5.11% and 7.20% for three
different language pairs (Samanta et al., 2019).
2 METHOD
This study uses the original and an updated VADER (Valence Aware Dictionary and sEntiment
Reasoner) lexicon to calculate compound sentiment scores for about 14,000 Nigerian Pidgin
tweets¹. The updated VADER lexicon, extended with 300 Pidgin tokens² and their sentiment
scores, performed better than the original VADER lexicon. The sentiment labels from the updated
VADER were then compared with sentiment labels assigned by expert Pidgin English speakers.
Figure 1: The semantic enrichment of Nigerian Pidgin English for contextual sentiment classification methodology.
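The lexicon update described in this section can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a stripped-down scorer that only sums token valences and applies VADER's compound normalization, x / sqrt(x^2 + 15), while omitting VADER's rules for negation, intensifiers, punctuation and capitalization. The English entries and the Pidgin scores are the values listed in Table 2; the example sentence and all function names are hypothetical.

```python
import math
import re

# Tiny stand-in for the English VADER lexicon (valences as listed in Table 2).
ENGLISH_LEXICON = {"trouble": -1.7, "angry": -2.3, "rage": -2.6}

# Pidgin tokens with their averaged scores, also taken from Table 2.
PIDGIN_UPDATE = {"kasala": -2.2, "para": -2.2, "gbege": -2.9}


def compound_score(text, lexicon, alpha=15.0):
    """Sum the valences of known tokens and squash the total into (-1, 1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = sum(lexicon.get(token, 0.0) for token in tokens)
    return total / math.sqrt(total * total + alpha)


# The "updated" lexicon is the English lexicon plus the Pidgin entries.
updated_lexicon = {**ENGLISH_LEXICON, **PIDGIN_UPDATE}

sentence = "kasala don burst"  # hypothetical example sentence
before = compound_score(sentence, ENGLISH_LEXICON)
after = compound_score(sentence, updated_lexicon)

print(f"before update: {before:.4f}")
print(f"after update:  {after:.4f}")
```

Under these assumptions the sentence scores as neutral before the update, because none of its tokens is in the English lexicon, and as negative afterwards. With a packaged VADER implementation the analogous step would presumably be adding entries to the analyzer's lexicon dictionary; which implementation the study used is not stated.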
3 RESULTS
In translating the VADER English lexicon into suitable one-word Nigerian Pidgin equivalents, a
total of 300 Nigerian Pidgin tokens were successfully translated from the standard VADER
English lexicon. One challenge of this translation is that most of the sentiment words in the
original VADER English lexicon translate directly to phrases rather than single-word tokens,
and certain Pidgin words translate to many English words (Table 2).
¹ Link to Nigerian Pidgin tweets and sentiments: https://git.io/JvHrp
² Link to 300 Nigerian Pidgin sentiments and scores: https://git.io/Jv9og
Table 1: Nigerian Pidgin tweets with different sentiment labels

| Pidgin sentence | Compound sentiment score before VADER English lexicon update | Compound sentiment score after VADER English lexicon update | Sentiment label before VADER English lexicon update | Sentiment label after VADER English lexicon update | Sentiment label by expert Pidgin speaker |
|---|---|---|---|---|---|
| som teams get black, som get purple but no one fine reach our jersey wey blue | -0.1154 | 0.7964 | negative | positive | positive |
| tiri kondoooooooooooooo! Sabi striker dzeko tear net wit pellegrini assist! | 0.0000 | 0.5080 | neutral | positive | positive |
| gooooooooooooal!!! leonardo spinazzola throw beta cross enta and davide biraschi score for inside hin own post! 0-2 | 0.0000 | 0.6209 | neutral | positive | positive |
| 39 willian try make beta pass, na beg we dey. | 0.0000 | 0.5106 | neutral | positive | positive |
| Na to delete am | 0.0000 | -0.6908 | neutral | negative | negative |
| Abed share your insight with me | 0.2960 | 0.5423 | positive | positive | positive |
| Why greenwood dey play nw?ole you don start | 0.3400 | -0.2500 | positive | negative | negative |
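The comparison in Table 1 can be sketched as follows. The sketch assumes the commonly used VADER cutoffs (a compound score of at least 0.05 is positive, at most -0.05 is negative, anything in between is neutral); the paper does not state its thresholds, although these cutoffs reproduce every label in the table. The scores and expert labels are copied from the seven rows above, and the function names are hypothetical.

```python
# (score before update, score after update, expert label) for each row of Table 1.
ROWS = [
    (-0.1154, 0.7964, "positive"),
    (0.0000, 0.5080, "positive"),
    (0.0000, 0.6209, "positive"),
    (0.0000, 0.5106, "positive"),
    (0.0000, -0.6908, "negative"),
    (0.2960, 0.5423, "positive"),
    (0.3400, -0.2500, "negative"),
]


def to_label(compound, threshold=0.05):
    """Map a compound score to a label using the assumed +/-0.05 cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def agreement(rows, column):
    """Count rows whose label derived from `column` matches the expert label."""
    matches = sum(to_label(row[column]) == row[2] for row in rows)
    return matches, len(rows)


before_matches, total = agreement(ROWS, 0)
after_matches, _ = agreement(ROWS, 1)

print(f"original lexicon agrees with the expert on {before_matches} of {total} rows")
print(f"updated lexicon agrees with the expert on {after_matches} of {total} rows")
```

These seven rows are the illustrative examples shown in the table, so the counts describe only this sample and not the full set of about 14,000 tweets.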
4 CONCLUSION
The sentiment labels generated by our updated VADER lexicon are of better quality than the
labels generated by the original VADER English lexicon (Table 1). Sentiment labels by human
annotators were able to capture nuances that rule-based sentiment labelling could not capture.
More work can be done to increase the number of instances in the dataset.
REFERENCES
Tosin Carons, Abraham and M Amaka Onyioha. The origin of pidgin. Afrostyle Magazine, 2, 2012. URL http://www.afrostylemag.com/ASM7/pidgin.html.
C.J. Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14), 2014.
Jane Ifechelobi. Code switching: a variation in language use. Mgbakoigba: Journal of African Studies, 4:1–7, 2015.
Herbert Igboanusi. Empowering Nigerian Pidgin: A challenge for status planning? World Englishes, 27:68–82, 2008. doi: 10.1111/j.1467-971X.2008.00536.x.
Yusuf Nuhu Inuwa, Anne Althea Christopher, and Haryati Bt. Bakrin. Factors motivating code switching within the social contact of Hausa bilinguals. IOSR Journal of Humanities and Social Science (IOSR-JHSS), 19:43–49, 2014.
Hayeon Jang and Hyopil Shin. Language-specific sentiment analysis in morphologically rich languages. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 498–506. Association for Computational Linguistics, 2010.
Didem Koban. Intra-sentential and inter-sentential code-switching in Turkish-English bilinguals in New York City, U.S. Procedia - Social and Behavioral Sciences, 70:1174–1179, 2013. doi: 10.1016/j.sbspro.2013.01.173.
Ameya Prabhu, Aditya Joshi, Manish Shrivastava, and Vasudeva Varma. Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text. 2016.
Iñaki San Vicente Roncal. Multilingual sentiment analysis in social media. PhD thesis, Universidad del País Vasco-Euskal Herriko Unibertsitatea, 2019.
Bidisha Samanta, Niloy Ganguly, and Soumen Chakrabarti. Improved sentiment detection via label transfer from monolingual to synthetic code-switched text. arXiv preprint arXiv:1906.05725, 2019.
A APPENDIX
Table 2: Average Sentiment Score for Nigerian Pidgin Sentiments with Multiple English Meanings

| Pidgin word | VADER sentiment tokens and scores | Average score |
|---|---|---|
| kasala | riot (-2.6), riots (-2.), trouble (-1.7) | -2.2 |
| gbege | catastrophe (-3.4), chaos (-2.7), chaotic (-2.2), problem (-1.7), problems (-1.7) | -2.9 |
| para | angry (-2.3), annoyed (-1.6), rage (-2.6) | -2.2 |
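As a worked example of the Average Score column, the sketch below recomputes the value for 'para' from its three English equivalents. It assumes the average is the arithmetic mean rounded to one decimal place, which the paper does not state explicitly; the function name is hypothetical.

```python
from statistics import mean


def average_score(english_scores):
    """Average the VADER scores of the English words a Pidgin token maps to."""
    return round(mean(english_scores.values()), 1)


# English equivalents of 'para' and their VADER scores, as listed in Table 2.
para = {"angry": -2.3, "annoyed": -1.6, "rage": -2.6}

print(average_score(para))
```

Under this assumption the result matches the -2.2 reported for 'para' in the table.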
A.1 SELECTION OF DATA LABELLERS
Three people who are indigenes of, or have lived in, the South South region of Nigeria, where
Nigerian Pidgin is a prevalent means of communication, were briefed on the fundamentals of word
sentiment. Each labelled data point was verified by at least one other person after initial
labelling.
ACKNOWLEDGMENTS
We acknowledge Kessiena Rita David, Patrick Ehizokhale Oseghale and Peter Chimaobi Onuoha for
using their mastery of Nigerian Pidgin to translate and label the datasets.