Sinhala Language Corpora and
Stopwords from a Decade of Sri
Yudhanjaya Wijeratne†, Nisansa de Silva‡
† LIRNEasia, 12 Balcombe Place, Colombo, Sri Lanka (email@example.com)
‡ University of Oregon, 1585 E 13th Ave, Eugene, OR 97403, United States (firstname.lastname@example.org)
LIRNEasia is a pro-poor, pro-market think tank whose mission is catalyzing policy change
through research to improve people’s lives in the emerging Asia Pacific by facilitating
their use of hard and soft infrastructures through the use of knowledge, information and
technology. This work was carried out with the aid of a grant from the International
Development Research Centre (IDRC), Ottawa, Canada.
This paper presents two colloquial Sinhala language corpora from the language efforts of the
Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived
stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to
29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages, spanning
politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,760 words
of only Sinhala text extracted from the larger. Both corpora have markers for their date of
creation, page of origin, and content type.
‘The limits of my language mean the limits of my world.’
– Ludwig Wittgenstein
Sinhala, as with many other languages in the Global South, currently suffers from a phenomenon
known as resource poverty. To wit, many of the fundamental tools that are required for
easy and efficient natural language analysis are unavailable; many of the more computational
components taken for granted in languages like English are either as yet unbuilt, in a nascent
stage, or, in other cases, lost or retained among select institutions.
Adding to the complexity is the fact that Sinhala exhibits diglossia, as noted by Gair, and
this diglossia extends even unto the script and symbols therein: Silva and Kariyawasam note
the existence of Suddha Sinhala, a ‘core set’ capable of representing the basic sounds of the
Sinhala vocabulary, as well as ‘misra’ or Mixed Sinhalese devised as an adaptation to influences
from Pali and Sanskrit; and also that in later ages the alphabet was influenced again by English,
adding sounds not found in the former sets, like the sound of the English letter ‘f’. When
combined with the later study of code-mixing (and the effect of English on Sinhala usage and
vice versa) by Senaratne, the literature suggests a language with multiple, stable forms of
usage, both formal and informal varieties, in a process of constant evolution. To engage in the
vernacular, corpora reflecting this evolution are essential to enrich the foundations of language
research in Sinhala.
The Facebook Crowdtangle1 platform provides default lists - collections of pages put forth by
Facebook as belonging to the information ecosystem of a particular country. The list for Sri
Lanka contains the Facebook pages of national politicians and political groups; major news and
media channels; the pages of major Sri Lankan celebrities; and sports channels. To maximize
the amount of Sinhala content available, we used a truncated version of this list, and from
them downloaded a total of 1,820,930 posts created between 01-01-2010 and 02-02-2020, thus
representing a decade of content poured out into public view on Facebook2.
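The year-by-year fetching strategy described in footnote 2 can be sketched as a date-window generator. This is only an illustration: the function name `yearly_windows` is hypothetical, and the request to the platform itself is deliberately left out because the text does not describe how it was issued.

```python
from datetime import date, timedelta


def yearly_windows(start: date, end: date):
    """Split [start, end] into consecutive windows that each stay within one calendar year."""
    windows = []
    cursor = start
    while cursor <= end:
        # A window ends on 31 December or on the overall end date, whichever comes first.
        window_end = min(date(cursor.year, 12, 31), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


# Date range taken from the text: 01-01-2010 to 02-02-2020.
windows = yearly_windows(date(2010, 1, 1), date(2020, 2, 2))
for first_day, last_day in windows:
    print(first_day.isoformat(), "to", last_day.isoformat())
```

Each window would then be requested separately and the results concatenated; narrower windows (for example monthly) follow the same pattern.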
This data contains one notable defect. Characters that sport the rakārāngśaya, such as the word
ශ්‍රී - as in “Sri Lanka” - appear as ශ්. This may be because of known issues with Unicode: the
rakārāngśaya, yanśaya and répaya are Sinhala characters that do not exist in the Unicode 13
specification3. This lack was noted by Dias as far back as 2010. Simply copying
and pasting the character on Facebook immediately results in it being decomposed into ශ්4.
What is ultimately recorded in the data is a phonetically similar, but inaccurate, spelling5.
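To make the encoding issue concrete, the sketch below lists the code points that make up the conjunct form of "Sri". It relies on the standard Unicode representation of the rakārāngśaya as a sequence (consonant, al-lakuna, zero width joiner, rayanna) rather than a single code point; the variable name `SRI` is ours. It does not attempt to explain how Facebook arrives at the truncated form, only that the form seen in the data coincides with the first two code points of the sequence.

```python
import unicodedata

# "Sri" with the rakaransaya, written as escapes so the sequence is unambiguous.
SRI = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"

# Print every code point with its Unicode name.
for ch in SRI:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

# The form recorded in the data corresponds to the consonant plus al-lakuna only.
observed = SRI[:2]
print(observed == "\u0dc1\u0dca")
```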
After stripping the data of irrelevant fields to minimize processing time, the remaining data
retained the following attributes:
• Page Name - the origin of the post.
• Create - the timestamp of creation, reduced to a date field in DD-MM-YY format.
• Type - the internal category that Facebook assigns, based on the content of the post.
• Message - the actual text content that the user typed in the post.
The categories present in the Type column are as follows: YouTube, Status, Link, Photo, Native
Video, Video, Vine, Live Video Complete, and Live Video.
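The four retained attributes can be pictured as a small record type. The sketch below is an assumption about how one might load a row, not the authors' code: the class name `Post`, the field names, and the sample row are hypothetical, while the date format and the list of categories come from the text.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Categories listed in the text for the Type column.
POST_TYPES = {
    "YouTube", "Status", "Link", "Photo", "Native Video",
    "Video", "Vine", "Live Video Complete", "Live Video",
}


@dataclass(frozen=True)
class Post:
    page_name: str   # Page Name: the origin of the post
    created: date    # Create: creation date
    post_type: str   # Type: Facebook's internal category
    message: str     # Message: the text of the post


def parse_post(page_name: str, created: str, post_type: str, message: str) -> Post:
    """Build a Post from raw strings, validating the category and the DD-MM-YY date."""
    if post_type not in POST_TYPES:
        raise ValueError(f"unknown post type: {post_type!r}")
    return Post(page_name, datetime.strptime(created, "%d-%m-%y").date(), post_type, message)


# Hypothetical row, for illustration only.
example = parse_post("Example Page", "26-12-10", "Status", "Live at 12 News Update")
print(example.created.isoformat())
```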
After stripping out entries that had no informational value in the Message column, we were
left with 1,500,917 posts from 533 pages, a mixture of Sinhala, Tamil and English content - all
three major languages spoken in Sri Lanka. We consider words to be separated by one or more
spaces6; by this count there are 29,549,672 words in this dataset.
An examination of quantiles reveals that the median status is 13 words long and the longest
is 232 words, as shown in Table 1.
This set of figures, however, is imprecise. Our goal with this corpus is to capture as much
information as possible, including the different vagaries of colloquial spelling on social media
and content-sharing habits: therefore, this data includes URLs.
2 It should be noted that as of the time of writing, the Crowdtangle platform, either by design or by
technical limitation, returns only around 200,000 posts per request, or 200 megabytes of data; this number
varies, suggesting that it is not a hardcoded cap. Thus this data fetch was performed year-by-year; a single
pull for the date range specified resulted in far less data. Likewise, a more granular set of requests may
yield more data. The mechanism for sampling is unknown to us: we must therefore assume that it is reasonably
representative of the full river of content that Facebook hosts.
3 For quick reference, see https://unicode.org/charts/PDF/U0D80.pdf
4 Indeed, the same inaccuracy occurred in the Overleaf editor in which we wrote this text.
5 While pages beginning with ශ්‍රී exist on Facebook, they can be retrieved by a text search for ශ්, leaving us
to surmise that the rendering hack proposed by Dias in 2010 is still in play a decade later.
6 As matched by the regular expression \s+
Quantile      0%   25%   50%   75%   100%
Word count     1     8    13    22    232
Table 1: Distribution of wordcounts at various quantiles.
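The word-count convention (tokens separated by one or more whitespace characters, footnote 6) and the quantile summary can be sketched as follows. The sample messages are toy examples, and the choice of the "inclusive" quantile method is an assumption, since the text does not say which definition was used.

```python
import re
from statistics import quantiles

WHITESPACE = re.compile(r"\s+")


def count_words(message: str) -> int:
    """Count tokens separated by one or more whitespace characters."""
    return len([token for token in WHITESPACE.split(message.strip()) if token])


# Toy messages for illustration; the real corpus has about 1.5 million posts.
messages = [
    "Live at 12 News Update:(26 December 2010)",
    "Do you often interact with new profiles/pages/groups on Facebook?",
    "Breaking   news",
    "Sri Lanka",
    "hello",
]

counts = [count_words(m) for m in messages]
q1, median, q3 = quantiles(counts, n=4, method="inclusive")
print(min(counts), q1, median, q3, max(counts))
```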
When cleaned of URLs7, the result stands at 28,825,820 words. The distribution of wordcounts
at various quantiles remains the same.
Due to how search operations and rulesets interact with multilingual text data, we have
footnoted some examples observed in practice8 and chosen to minimize cleaning operations here.
Therefore, we can say that this corpus contains between 28,825,820 and 29,549,672 words.
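The trade-off behind minimal cleaning (footnote 8) can be reproduced on the footnote's own examples. The patterns below are illustrative assumptions, not the rules used to build the corpus; the URL in the last example is a placeholder.

```python
import re

# Illustrative patterns; the actual cleaning rules are not given in the text.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Caution: Python's \w does not match combining marks such as Sinhala vowel signs,
# so this pattern is only suitable for the English examples used here.
NON_ALNUM = re.compile(r"[^\w\s]")


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def strip_urls(text: str) -> str:
    """Replace anything that looks like a URL with a space."""
    return URL_PATTERN.sub(" ", text)


def punct_to_space(text: str) -> str:
    """Replace every non-alphanumeric, non-space character with a space."""
    return NON_ALNUM.sub(" ", text)


headline = "Live at 12 News Update:(26 December 2010)"
print(word_count(headline), word_count(punct_to_space(headline)))  # Update:(26 is split apart
print(word_count("it's"), word_count(punct_to_space("it's")))      # but it's is split as well
print(word_count(strip_urls("Watch now http://example.com/video?id=1 tonight")))
```

The same substitution that repairs one count inflates the other, which is the reason the text gives for reporting a range rather than a single figure.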
Media pages are key contributors of data, as shown in Table 2.
Contributing page    Number of posts
HIRU FM              77786
Sun FM               75721
Shaa FM              66761
Table 2: Top five contributors to the corpus.
To better differentiate between the languages contained herein, and to more efficiently extract
data, each item of data was annotated for language using the fastText library for text
classification and representation. fastText provides neural networks trained on language
classification tasks; we specifically used the 176.ftz compressed language identification model. This
model, with the k parameter set to 2, generates an annotation of the top two most-represented
languages in a piece of text; this variable is appended as langprediction for each post.
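A minimal sketch of this annotation step with the fastText Python bindings is shown below. Several details are assumptions: the model file name `lid.176.ftz` is the name under which the compressed model is distributed, the helper names are ours, and the comma-joined format of the annotation is a guess, since the text does not specify how langprediction is stored.

```python
def clean_labels(labels) -> str:
    """Turn fastText labels such as '__label__si' into a compact 'si,en' string."""
    return ",".join(label.replace("__label__", "") for label in labels)


def annotate(model, message: str, k: int = 2) -> str:
    """Return the top-k predicted languages for one post."""
    # predict() expects a single line of text, so collapse newlines and repeated spaces.
    single_line = " ".join(message.split())
    labels, _probabilities = model.predict(single_line, k=k)
    return clean_labels(labels)


def load_language_model(path: str = "lid.176.ftz"):
    """Load the compressed language identification model (assumed to be downloaded locally)."""
    import fasttext  # imported lazily so the helpers above work without the package
    return fasttext.load_model(path)
```

With the model file in place, `annotate(load_language_model(), post_text)` would yield a value to store alongside each post.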
It should be noted that the fastText annotation is not perfect; it was noted that its base training
data9 is unable to account for posts that exhibit codemixing, and thus it erroneously assigns
false language classifications to said posts10. It is this classification error that makes it easier
to search for and extract codemixed language from this dataset; otherwise the search would be
manual and exhausting11.
We present this initial dataset as Corpus-Alpha.
7 For cases detected within the dataset by random sampling; there is a non-zero probability of minor artifacts
of truncated URL components still existing, due to the diversity of URL types in social media.
8 For example, the line Live at 12 News Update:(26 December 2010) would benefit from having its
non-alphanumeric characters replaced with spaces, as presently Update:(26 registers as one word. However, the
same operation would turn it’s into it and s, thus presenting as two words. Likewise, an operation that searches
for and removes text prefixing and suffixing /pages/ - an artefact of Facebook page URLs - will also remove
useful information from a post such as “Do you often interact with new profiles/pages/groups on Facebook?”
9 A combination of Wikipedia, Tatoeba and SETimes.
10 For example, Maxxxxxxxxxxxxxxxxxxxa, Sri Lankan street slang that borrows heavily from English, is
represented using English characters; this it chose to annotate as English and Catalan / Valencian. Likewise,
“Mama Marunath Janathawa Samagai” - Johnston - a Sinhala statement written in English script - is annotated
as English and Indonesian.
11 Its failings highlight other interesting behaviors. There are 0 posts with Sinhala and Tamil tags appearing
together: while a handful of posts do contain Sinhala and Tamil together, their Tamil content is so small that
fastText apparently cannot detect it. A qualitative analysis using a random sample of posts tagged as having
Tamil content reveals that in this dataset, Sinhala and Tamil are indeed rarely written together. However, a
select few words do find themselves neighboring each other in a statistically insignificant number of posts.
Because Corpus-Alpha is mixed, it may not be desirable for language-specific applications -
especially Sinhala, which, of the three languages - English, Sinhala and Tamil - is the most
resource-poor. Therefore, we derived from Corpus-Alpha a smaller corpus containing only
Sinhala text.
For this we first extracted from Corpus-Alpha the posts tagged for Sinhala by fastText, on the
logic that for fastText to detect Sinhala, a substantive percentage of a given post must
be Sinhala text. Out of 771,075 posts tagged for Sinhala, 87,922 also have tags
for English; this number rises to 208,702 if we add posts tagged for Sinhala and Latin. Thus,
in total, 27% of Sinhala posts contain Latinate characters, be they artifacts of codemixing or
URLs pasted in the status itself.
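The 27% figure follows directly from the counts quoted above. The short calculation below uses only those numbers; the variable names are ours, and the count of Latin-tagged posts is derived by subtraction rather than stated in the text.

```python
sinhala_posts = 771_075             # posts tagged for Sinhala
sinhala_and_english = 87_922        # of which also tagged for English
sinhala_english_or_latin = 208_702  # after adding posts tagged for Sinhala and Latin

# Derived, not stated in the text: posts tagged for Sinhala and Latin.
sinhala_and_latin = sinhala_english_or_latin - sinhala_and_english

english_share = sinhala_and_english / sinhala_posts
latinate_share = sinhala_english_or_latin / sinhala_posts

print(sinhala_and_latin)
print(f"{english_share:.1%} tagged with English")
print(f"{latinate_share:.1%} tagged with English or Latin")
```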
From these posts, we removed punctuation12, URLs, and platform artefacts such as ’featureyoutube’,
’indexph’, ’and pid and vid and page’. The pages from Rivira and Sirasa Lakshapathi
- one a news/media platform, the other a program series produced by Sirasa TV - were
removed altogether, as status text from them contained a high incidence of URL formatting
quirks (sundayrividaharahtml, editionbreakingnewshtml) and periodical ’click here to view’ text
that appeared to have been added manually to increase audience interaction with posts. All
remaining text was subjected to a purge of all remaining Latinate, Tamil and Chinese
characters13. Special characters were also removed14. This yields a corpus of 364,402 posts, containing
a combined 5,402,760 words of status text in the Message column. We show the statistics of
this data set in Table 3.
Quantile      0%   25%   50%   75%   100%
Word count     1     6     9    15    198
Table 3: Distribution of wordcounts at various quantiles.
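One way to approximate the character purge is a whitelist on the Sinhala Unicode block. This is a swapped-in technique for illustration, not the procedure described above (which removed unwanted characters individually in Perl): it keeps code points U+0D80 to U+0DFF, the zero width joiner and non-joiner used in conjuncts, and whitespace, and deletes everything else, including digits.

```python
import re

# Anything that is not in the Sinhala block, a zero width (non-)joiner, or whitespace.
NOT_SINHALA = re.compile(r"[^\u0D80-\u0DFF\u200C\u200D\s]")


def keep_sinhala(text: str) -> str:
    """Delete non-Sinhala characters, then normalise runs of whitespace."""
    stripped = NOT_SINHALA.sub("", text)
    return " ".join(stripped.split())


# "lanka" in Sinhala followed by English, digits and Tamil characters (toy input).
mixed = "\u0dbd\u0d82\u0d9a\u0dcf news 2020 \u0ba4\u0bae\u0bbf"
print(keep_sinhala(mixed))
```

Because unwanted characters are deleted rather than replaced with spaces, a word is never split in two, in line with the observation in footnote 12.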
The number of contributing pages has reduced to 420 (see Appendix B). The top contributors
of text are still media organizations that we saw in Table 2, but their ranks have shifted slightly
as we show in Table 4.
Contributing page     Number of posts
Ada Derana Sinhala    44756
Neth FM               36362
Hiru News             12549
Live at 8             10624
Shaa FM               9897
Hiru Gossip           8600
BBC News සිංහල         8589
HIRU FM               8426
Table 4: Top ten contributors to the corpus.
12 Sinhala, unlike English, does not use inverted commas in the middle of words (ie: it’s); pronouns are handled
by the ෙ suffix; thus such characters can be safely deleted without splitting a word into two.
13 As we noted before, fastText’s detection is not perfect.
14 In the process of cleaning, curious interactions were observed between the [:punct:] range in R and Sinhala
diacritics such as ◌්. Notably, the hal kirīma diacritic was removed, presumably as punctuation, while the
compound kombuva saha halkirīma (ෙ◌) was preserved. There are various possible explanations for this, from
poor support in Unicode for certain Sinhala conventions (as noted earlier) to possible range conflicts between
what R considers to be emoji blocks (which are scattered throughout the Unicode spec) and Sinhala characters. It
was further observed that removing periods individually in R also led to the removal of certain base characters,
leaving behind artifacts such as ◌ාක. Not wishing to take apart R to see what the issue might be, we switched
to Perl, using repeated unigram calculations to identify and remove specific unwanted characters, particularly
ellipses, emojis, and other ideograms, on an individual basis.
The standard bag-of-words model describes the frequencies of occurrence of individual words in
a given text; generating such a collection is thus an extremely common procedure in natural
language processing, particularly for language classification and modelling tasks. We use it here
as a descriptive tool. However, single words alone destroy basic semantic relationships, such as
the “bill gates” or “white house” examples used by Bekkerman and Allan. We therefore
compute word pairs for Corpus-Sinhala-Redux in order to better display these relationships.
Computing word frequencies shows that Corpus-Sinhala-Redux contains 228,533 unique words
and 1,868,589 unique word-pairs. The most common occurrences are shown in Table 5.
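Counting unique words and word pairs can be sketched with the standard library. The example below assumes that pairs are formed from adjacent tokens within a single post and never across post boundaries, which the text does not state; the sample posts are toy English strings that echo the "white house" example.

```python
from collections import Counter


def unigrams_and_bigrams(posts):
    """Count single words and adjacent word pairs, one post at a time."""
    words, pairs = Counter(), Counter()
    for post in posts:
        tokens = post.split()
        words.update(tokens)
        # zip pairs each token with its successor within the same post.
        pairs.update(zip(tokens, tokens[1:]))
    return words, pairs


# Toy posts for illustration only.
posts = ["white house staff", "the white house"]
words, pairs = unigrams_and_bigrams(posts)

print(len(words), "unique words;", len(pairs), "unique word pairs")
print(pairs.most_common(1))
```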
Word pair    Frequency
ශ් ලංකා 12514
වැ ස්තර 6056
රා ට 5807
එස ජාක 5268
පාෙ ම 4428
අද න 4396
මද රාජපෂ 4229
ෙගඨාභය රාජපෂ 4196
ෙයව ෙමතැ 4040
ලංකා දහස 3346
Table 5: Top ten words and word pairs by frequency
It can be readily inferred from Table 5 that the data herein contains a heavy bias towards
political conversation, especially since the top word pairs refer to parliamentary ministers, the
first two words of the United National Party (in Sinhala), the first two words of the Sri
Lanka Freedom Party, and two Presidents of Sri Lanka by name. The first two words of a
party name appear in the high-frequency list while the pair made up of the second and third
words does not; this is due to the inflected nature of Sinhala.
Depending on the case that is being used, the third word, පෂය, gets morphed into පෂෙ,
පෂෙය, පෂයට and other relevant morphological forms along with the root form පෂය.
This results in the dilution of the pairs created by the second and the third word of the party
name.
We present this cleaned dataset as Corpus-Sinhala-Redux.
Deriving stopwords from Corpus-Sinhala-Redux
It is by now an axiom of natural language processing that every corpus of text contains words
that occur in virtually every sentence, and thus have low informational value; Francis et al.
posited that in English, the most common words accounted for some 20-30% of a document.
These words are commonly referred to as stopwords, and it is customary to remove them from
analysis in most applications, especially those that rely on bag-of-words models, such as most
popular topic modelling approaches. English has robust lists of stopwords; Tamil less so; and
Sinhala least of all. Therefore it behooves us to extract stopwords from Corpus-Sinhala-Redux.
However, despite their ubiquity, building a list of stopwords is still a contentious task. Most
practitioners favor manually-constructed, domain-specific lists, which take extraordinary time
and effort to build (especially from corpora as large as ours) and may not ultimately be worth
the effort in the first place. Various algorithmic approaches, most based on the frequency of
unigrams in a text, seem to generate usable results in English, though few or none can be said
to be demonstrably superior. For some non-English languages, such as Arabic, stopwords
have been built with tightly-defined rulesets, while in languages more closely related to Sinhala,
like Sanskrit, a combination of word frequency and manual vetting has yielded suitable
results.
Given the proximity of Sinhala to Sanskrit, we too utilized a frequency based method. The most
important question in an automated stop word extraction method is the frequency threshold;
given that our study is mainly data driven, we opted to let the threshold also be dynamically
decided from the data itself.
As such, the first step is to collect the frequencies of words across the entire corpus, as done
previously; then, outliers on the lowest end of the distribution (frequency = 1) are eliminated15.
Next, we calculated the standard deviation (σ) and the mean (µ) of the word frequencies. Using
these corpus statistics, we can then standardize the frequencies using Equation 1, where z
is the standard score and x is the word frequency.
$$z = \frac{x - \mu}{\sigma} \tag{1}$$
Next we calculated −1.5 < Z < 1.5 for a 93.3% threshold, as shown in the highlighted area
under the curve in Fig. 1.
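The standardization step can be sketched as below. Two points are assumptions and not taken from the text: the population standard deviation is used for σ, and candidate stopwords are taken to be the words whose score lies at or above the upper bound of the band, since high-frequency words are the usual stopword candidates and the text does not say which side of the band is selected. As a point of reference, for a standard normal distribution the cumulative probability below z = 1.5 is about 93.3%, while the band −1.5 < z < 1.5 itself covers about 86.6%; the sketch prints both values. All function names and the toy frequencies are hypothetical.

```python
from math import erf, sqrt
from statistics import mean, pstdev


def standard_scores(frequencies, min_frequency=2):
    """Drop words seen fewer than min_frequency times, then compute z = (x - mu) / sigma."""
    kept = {word: freq for word, freq in frequencies.items() if freq >= min_frequency}
    mu = mean(kept.values())
    sigma = pstdev(kept.values())  # assumption: population standard deviation
    return {word: (freq - mu) / sigma for word, freq in kept.items()}


def above_band(scores, upper=1.5):
    """Assumed selection rule: words whose z-score is at or above the upper bound."""
    selected = [word for word, z in scores.items() if z >= upper]
    return sorted(selected, key=scores.get, reverse=True)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Toy frequencies for illustration only.
toy = {"a": 100, "b": 5, "c": 4, "d": 3, "e": 2, "f": 1}
scores = standard_scores(toy)
print(above_band(scores))
print(round(normal_cdf(1.5), 4), round(2 * normal_cdf(1.5) - 1, 4))
```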
From this process, we obtained the following word list: [ ශ් ] [ මහතා ] [ න ] [ ජාක ] [ කරන ] [
අමාය ] [ ඇ ] [ ලංකා ] [ ගැන ] [ සඳහා ] [ කට ] [ ම ] [ රාජපෂ ] [ ෙවෙව ] [ ග ] [ ෙලස
] [ අතර ] [ ට ] [ ජනාප ] [ ] [ වැ ] [ ෙමම ] [ ජනප ] [ සමග ] [ බලන ] [ ට ] [ මට ] [
ස්තර ] [ පාෙ ] [ ම ] [ සදහා ] [ මද ] [ ලබා ] [ අ ] [ ] [ දහස් ] [ එස ] [ කර
] [ ජනතාව ] [ සංවධන ] [ කරන ] [ ය ] [ මහා ] [ රා ] [ රධානවෙය ] [ පෂ ] [ අවස්ථාව ] [
ප ] [ රාය ] [ ෙකළඹ ] [ පැව ] [ වැඩ ] [ මාය ] [ ජනතා ] [ ෙ ].
Translated to English, this would be: [ sri ] [ mister ] [ day ] [ national ] [ do ] [ minister ] [
enough ] [ lanka ] [ about ] [ for ] [ affairs ] [ doing ] [ rajapaksa ] [ for ] [ the honorable ] [ as
the ] [ while ] [ from ] [ president ] [ by ] [ higher ] [ this ] [ president (alternate/poetic spelling) ]
[ with a ] [ see ] [ former ] [ to ] [ details ] [ parliamentary ] [ member (of legislative body) ] [ for
(in this case, a colloquial misspelling of the more formal variant above) ] [ mahinda ] [ obtaining
] [ we ] [ done ] [ free ] [ united ] [ does ] [ people ] [ development ] [ do (conjugation) ] [ it was ]
[ great ] [ night ] [ chiefly ] [ parties ] [ the situation ] [ appoint ] [ of the state ] [ colombo ] [
existing ] [ work ] [ the media ] [ people ] [ doing ].
Figure 1: Statistical range of stop words at −1.5 < Z < 1.5
15 In this study we have been very lenient in the definition of outliers to avoid data loss.
There are both similarities and dissimilarities to commonly used English stopwords lists, such
as those in Python’s NLTK package16 and R’s stopwords package17.
We present this list as a set of stopwords derived from Corpus-Sinhala-Redux.
Upeksha et al. created an ambitious 119-million-word corpus project (in which a co-author
of this paper took part). Though the portal links referenced in official documentation lead
to 404s, a copy is preserved in the archives of one of the co-authors of this paper, hosted
at Open Science Framework (OSF)18. Notably, Upeksha et al. drew from newspapers, text
books, Wikipedia, the Mahavamsa, fiction, blogs, magazines and gazettes, all of which require
a degree of formal Sinhala; we present our work, which uses social media data, as a colloquial
complement to this dataset.
In this paper we present two Sinhala corpora and a list of stopwords, all derived from
Facebook, which is a rich source of colloquial text data and invaluable for studying
resource-poor languages. Corpus-Alpha is extracted from Facebook pages for a period from 2010 to
2020, and consists of between 28,825,820 and 29,549,672 words of text posted by a variety of Sri
Lankan Facebook pages. Corpus-Sinhala-Redux, which is derived from Corpus-Alpha, contains
5,402,760 words of only Sinhala text from 420 of those pages.
The objective of Corpus-Alpha is to capture as much information as possible, and thus present
a rich source for discourse analysis and codemixing between the three languages in use in Sri
Lanka, with a bias towards Sinhala. It contains text in English, Sinhala and Tamil, all three
major languages spoken in Sri Lanka; additionally, it contains punctuation, URLs, ideograms
such as emojis, and serves as a snapshot of Sri Lankan discourse on Facebook19.
The objective of Corpus-Sinhala-Redux is to provide a collection of colloquial Sinhala
commonly used on social media, more immediately suited to Sinhala-specific language applications.
Corpus-Sinhala-Redux showcases specific artifacts brought about by how Facebook records
Sinhala text, but has been cleaned of ideograms, punctuation, and other symbols not directly
relevant to this process. A list of stopwords has been algorithmically derived from
Corpus-Sinhala-Redux.
In this paper we have documented both the processes involved in creating these corpora and
various platform and language-related nuances around the topic. Both corpora and the stopwords
list are made available under open access terms at
https://github.com/LIRNEasia/FacebookDecadeCorpora. It is our hope that this data will be used
to improve natural language processing applications in Sinhala.
This research has been made possible through a grant from the International Development
Research Centre, Canada (IDRC) and Facebook’s generous access to the Crowdtangle platform.
19 Given that English occupies a curious role in Sri Lanka - first as the language of colonizers, and today as
a marker of social class, the examination of language in conjunction with Page Names may yield interesting
observations on the audience that these pages seek to target.
List of highest contributors to Corpus-Sinhala-Redux
Contributing Page Number of posts
Ada Derana Sinhala 44756
Neth FM 36362
Hiru News 12549
Live at 8 10624
Shaa FM 9897
Hiru Gossip 8600
BBC News සිංහල 8589
HIRU FM 8426
Neth News 7850
Ada �� 5041
Sri Lanka Mirror 4996
Hiru TV 4419
Siyatha FM 4269
Lankadeepa ������� 4125
Neth Gossip 3863
Ranjan Ramanayake 3510
Kanaka Herath 2755
Ada Derana Biz Sinhala 2531
Wimal weerawansa 2518
United National Party 2335
Udaya Prabhath Gammanpila 2314
Anura Kumara Dissanayake 2219
FM Derana 2145
Rajitha Senaratne 2142
Mirror Arts 2125
Patali Champika Ranawaka 2101
Harshana Rajakaruna 1830
S M Marikkar 1807
Wajira Abeywardena 1763
JVP Srilanka 1624
Resa Newspaper ��� ������� 1613
Sajith Premadasa 1595
 Y. Wijeratne, N. de Silva, and Y. Shanmugarajah, “Natural language processing for government: Problems
and potential,” 2019.
 N. de Silva, “Survey on Publicly Available Sinhala Natural Language Processing Tools and Research,”
arXiv preprint arXiv:1906.02358, 2019.
 J. W. Gair, “Sinhalese diglossia,” Anthropological Linguistics, pp. 1–15, 1968.
 C. Silva and C. Kariyawasam, “Segmenting sinhala handwritten characters,” International Journal of
Conceptions on Computing and Information Technology, vol. 2, no. 4, pp. 22–26, 2014.
 C. D. Senaratne, “Sinhala-english code-mixing in sri lanka: a sociolinguistic study,” 2009.
 T. U. Consortium, “The unicode standard, version 13.0.0,” The Unicode Consortium, 2020. ISBN 978-1-
936213-26-9, http://www.unicode.org/versions/Unicode13.0.0/, 2020.
 G. Dias, “Sinhala named sequences,” https://www.unicode.org/L2/L2010/10164-sinhala-named-seq.pdf,
 A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” arXiv
preprint arXiv:1607.01759, 2016.
 A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, “Fasttext. zip: Compressing
text classification models,” arXiv preprint arXiv:1612.03651, 2016.
 R. Bekkerman and J. Allan, “Using bigrams in text categorization,” 2004.
 H. P. Luhn, “A statistical approach to mechanized encoding and searching of literary information,” IBM
Journal of research and development, vol. 1, no. 4, pp. 309–317, 1957.
 W. N. Francis, H. Kučera, and A. W. Mackie, “Frequency analysis of english usage: Lexicon and grammar,”
 A. Schofield, M. Magnusson, and D. Mimno, “Pulling out the stops: Rethinking stopword removal for topic
models,” pp. 432–436, 2017.
 R. T.-W. Lo, B. He, and I. Ounis, “Automatically building a stopword list for an information retrieval
system,” vol. 5, pp. 17–24, 2005.
 A. Alajmi, E. Saad, and R. Darwish, “Toward an arabic stop-words list generation,” International Journal
of Computer Applications, vol. 46, no. 8, pp. 8–13, 2012.
 J. K. Raulji and J. R. Saini, “Generating stopword list for sanskrit language,” pp. 799–802, 2017.
 D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. De Silva, and
G. Dias, “Implementing a Corpus for Sinhala Language,” in Symposium on Language Technology for South
Asia 2015, 2015.
 “Sinmin-web university of moratuwa, sri lanka,” https://sinhala-corpus.projects.uom.lk/sinmin-web, ac-
 C. Fernando, “The post-imperial status of english in sri lanka 1940–1990: From first to second language,”
Post-imperial English: Status change in former British and American colonies, 1940–1990, pp. 485–511,