Conference Paper

Handling of Nonstandard Spelling in GRAC

Authors:
Maria Shvedova
Kyiv National Linguistic University / Lviv Polytechnic National University
Kyiv / Lviv, Ukraine
ORCID: 0000-0002-0759-1689

Andriy Rysin
Kyiv, Ukraine / Cary, USA

Vasyl Starko
Ukrainian Catholic University
Lviv, Ukraine
ORCID: 0000-0002-2530-2107
Abstract: GRAC is a large reference corpus of
Ukrainian spanning over 200 years. The system of
morphological analysis used to mark up the corpus was
originally designed only for modern language.
Meanwhile, the corpus includes texts that sometimes
differ significantly from modern ones. Orthographic
systems different from the modern standard have been
used throughout the history of the Ukrainian language,
including regional and diaspora publications. The
article describes the algorithms used in the corpus to
handle old orthographies and some other cases of
nonstandard spelling, and discusses the prospects for
their development. The developed tools have been made
available online to the NLP and CL community.
Keywords: corpus, spelling, text preprocessing,
Ukrainian, GRAC, VESUM
I. INTRODUCTION
The General Regionally Annotated Corpus of Ukrainian
(GRAC: uacorpus.org) covers a span of more than 200
years and includes a great variety of texts written using
three distinct orthographic systems and some mixed
versions. To our knowledge, this is the only Ukrainian
corpus to include a large selection of texts published before
1930, a period when Ukrainian spelling underwent multiple
changes and exhibited the most variation. Adding to the
confusion, some publishers used their own nonstandard
sets of spelling rules. Texts that are
added to successive versions of GRAC come from a variety
of sources and formats that employ, naturally, different
conventions. Importantly, some texts, especially older ones,
are scanned, OCRed, and then manually proofread to make
sure they are consistent with the original.
The texts in GRAC must meet several requirements.
One is the necessity of preserving texts in
their original form as GRAC embodies a descriptive
approach to language and strives to faithfully reflect texts
as they were produced by their authors. (That said, the
corpus inevitably includes a number of texts from the more
easily accessible later reprints. Such texts are provided with
the appropriate markup.) From this perspective, texts in the
corpus should be as close as possible to the original as any
significant deviations would be detrimental to the overall
objective of the corpus. This sets GRAC apart from corpora
intended for machine use, which are focused primarily on
current usage and may undergo more radical modifications
at the stage of normalization.
Ukrainian texts in GRAC have to be tokenized and
lemmatized as fully and adequately as possible in order to
make them accessible to all users via a search interface and
to researchers investigating fine aspects of language use
across time and space. This is where spelling variations
pose a challenge: the different spellings of the same word
should be correctly reduced to one lemma, while
morphological variations (for example, in grammatical
gender) need to be represented as separate lemmas
(although the introduction of hyperlemmas bringing these
lemmas together is also a feasible option).
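The lemma/hyperlemma policy sketched above can be illustrated with a toy data structure (the word pairs and the hyperlemma notation below are our illustrative assumptions, not GRAC data): spelling variants are reduced to one lemma, while morphological variants keep separate lemmas that a hyperlemma can optionally unite.

```python
# Toy illustration of the lemmatization policy described above; the
# dictionaries and the hyperlemma notation are assumptions, not GRAC data.

# Spelling variants of the same word are reduced to one lemma:
LEMMAS = {
    "льогіка": "логіка",  # older phonological/orthographic variant
    "логіка": "логіка",   # modern standard spelling
}

# Morphological variants (here, grammatical gender) keep separate lemmas,
# optionally linked by a shared hyperlemma:
HYPERLEMMAS = {
    "зал": "зал(а)",   # masculine lemma
    "зала": "зал(а)",  # feminine lemma
}

assert LEMMAS["льогіка"] == LEMMAS["логіка"]      # one lemma
assert HYPERLEMMAS["зал"] == HYPERLEMMAS["зала"]  # one hyperlemma
```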
II. RELEVANT RESEARCH
An important part of preparing texts for a language
corpus is preprocessing and normalization: texts need to be
converted into the form that can be handled by the
morphological analysis program, which in many cases is
geared toward modern standard language.
Nonstandard lexical data has been one of the key
challenges for text processing in NLP and computational
linguistics. One of the contributing factors in the past
decade has been the increasing digitization of textual
heritage [1]. In Ukrainian corpus linguistics, nowhere has
this problem been more pressing than in GRAC as it
contains arguably the most varied collection of Ukrainian-
language texts. The goal in handling nonstandard lexical
units is often to normalize them; hence automatic word
correction has received significant attention in the literature
[2], [3]; see also a recent comprehensive overview of
research in this area spanning three decades [4]. Various
approaches have been proposed to handle lexical variation
caused by a number of factors, from historical spelling
changes to errors to morphological variation [5], [6],
including for Russian, a Slavic language with rich
morphology [7].
A relevant issue in this context is that of normalization.
This is a complex concept, and various approaches to
normalization have been proposed [8], including neural
networks [6]. Lexical tokens are usually normalized only
orthographically, i.e., a nonstandard spelling is converted to
the standard one [1]. In GRAC, with its emphasis on
presenting texts in their original form, normalization is
performed only for the purposes of lemmatization and part-
of-speech tagging. After the preprocessing stage, tokens in
the text remain unchanged, while nonstandard
orthographical items are lemmatized to standard lemmas
and PoS-tagged whenever possible. This makes it possible
to preserve the authentic text and, at the same time, perform
morphological analysis of nonstandard lexical items.
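This policy of normalizing only for analysis can be sketched as follows (a hypothetical simplification, not the GRAC code): the surface token is kept verbatim, and the normalized form feeds only the lemma and PoS lookup.

```python
# Hypothetical sketch: surface tokens stay unchanged; a normalized form is
# used only to obtain the lemma and PoS tag. The toy lexicon and the single
# normalization rule are assumptions for illustration.

STANDARD_LEXICON = {"світ": ("світ", "noun")}

def normalize(token: str) -> str:
    # one regular correspondence: the older сьв cluster -> modern св
    return token.lower().replace("сьв", "св")

def annotate(token: str) -> dict:
    key = token.lower()
    entry = STANDARD_LEXICON.get(key) or STANDARD_LEXICON.get(normalize(key))
    lemma, pos = entry if entry else (None, None)
    # the token itself is preserved in the corpus text
    return {"token": token, "lemma": lemma, "pos": pos}

print(annotate("сьвіт"))
# {'token': 'сьвіт', 'lemma': 'світ', 'pos': 'noun'}
```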
III. MAIN RESULTS
In this paper, we will consider several key challenges
having to do with nonstandard spelling in texts for GRAC
and describe the methods and tools we have developed to
solve them. One of the key requirements for GRAC is that
its morphological annotation needs to adequately cover
wordforms written using alternative (old, variant)
orthographic systems.
Adding old texts from the first editions to the corpus
involves adapting the morphological analyzer to their
orthographic systems. Ukrainian texts from the 19th
century were written in several different such systems,
named after their creators (Mykhailo Maksymovych,
Panteleimon Kulish, Yevhen Zhelekhivsky, and Borys
Hrinchenko), as well as the Russian-oriented yaryzhka
spelling and others. In practice, they still had further
variants as changes were introduced in later editions [9].
A similar task arises when compiling corpora of other
languages, such as historical corpora of English [10], [11],
[12], [13] or Yiddish corpora, also featuring many spelling
systems [14].
A significant problem in constructing a reference
corpus of the Ukrainian language with a historical part
covering the Early Modern literary language is that
“modernized” editions of the 20th and 21st centuries
reflect the original linguistic features inaccurately.
The texts of western Ukrainian authors have been edited
the most heavily, although, as the examples in Table 1,
quoted from [15], show, texts from other regions are also affected. The
western Ukrainian version of the literary language is found
in the texts of the 19th and the first half of the 20th century.
Compared to the central Ukrainian standard, it had
numerous distinct features [16], [17], [18], etc., in particular
its own orthography. The texts of western Ukrainian
authors were republished in the USSR, on a larger scale
after the Second World War, and their language was
brought closer to the literary standard of the time. The
western Ukrainian version of the literary Ukrainian
language was not recognized in official Soviet linguistics as
a separate linguistic phenomenon but was construed as a
language corrupted by foreign borrowings and dialectisms.
In Soviet republications, editors not only introduced
standard spelling and grammatical forms but also often
replaced, deleted, and added words, effectively making
stylistic corrections. For example, in a book of short
stories by the western Ukrainian author Stefan Kovaliv
(1848–1920) published in 1960 for a children's
audience, the preface sets forth the following principles of
correction, among others: возьмеш → візьмеш, оден →
один, мід → мед, сюда → сюди; Володко → Володько,
всего → всього; деревляний → дерев’яний; каміня →
каміння, житєм → життям; волосе → волосся,
щастє → щастя; попсуєся → попсується, піднесеся
→ піднесеться; по огородах → по городах; знарядів →
знарядь; хотять → хочуть, гудить → гуде; з
нічим → ні з чим, д’ церкви → до церкви. This list
includes standard replacements of the author's regional
phonetic, grammatical, and syntactic features [19: 3–4].
Editions intended for an adult audience, and even academic
critical editions, were modified in a similar way, in
particular the collected works of the western Ukrainian
classic Ivan Franko in 50 volumes published in the 1970s.
A comparison of this edition with the first editions of
Franko’s texts highlights numerous examples of lexical,
grammatical, and syntactic substitutions, as well as
conceptual distortions and editorial errors [20].
Thus, later republications are an unreliable source for
the historical part of the corpus where it is crucial to
faithfully represent all linguistic features. The metatextual
markup in GRAC contains the attribute
doc.publicationYear ("date of publication"), and in cases
where editorial interventions were accurately recorded, the
date of editing is indicated after the title of the work, while
the main date featured in the corpus is the year of creation,
for example, Дмитро Бузько, Льоля [редакція 2016-
2018 р.], 1924 (Dmytro Buzko, Lyola [2016-2018 edition],
1924). Thus, the researcher can take into account the year
of publication and the possibility of editorial interventions
when working with individual texts of the corpus. (This is
much more difficult to do in statistical calculations of many
texts at once). That said, there is an urgent need to include
more authentic texts in the corpus for academic use.
Texts generally suitable for processing by the system of
morphological analysis based on VESUM, a large
morphological dictionary of Ukrainian designed for the
modern language, began to appear in central Ukraine with
the release of Hrinchenko's dictionary in the 1900s. This
dictionary became a spelling guide for printed publications,
whereas in western Ukraine this spelling became accepted
only in the 1920s. The differences between the 1928
orthography, which was adopted as a common orthography
by both eastern and western Ukraine and has since been
used in the diaspora, and the 1933 orthography and its
variants used in the USSR are largely taken into account in
VESUM, and the morphological analyzer handles texts
written in both spelling systems. These subnorms also
featured differences in phonological handling of many
borrowings (like логіка vs. льогіка, implying different
pronunciation). Still, there are some spelling (and/or
phonological) variants not represented in the dictionary.
These include instances of regular variation that may occur
in hundreds and thousands of words. To handle these cases,
the morphological analyzer employs a specially developed
dynamic tagging module, which goes a long way in solving
this problem. Below is a list of main cases of variant
spelling and/or phonology (with examples) handled
dynamically:
- Latin numbers, dates/times, and hashtags
- Words with ґ instead of г (тренінґ, ґестапо)
- Words with ер instead of р in word-final position (фотометер)
- Words with льо instead of ло (ідеольогічний)
- Words with ія instead of іа (антирадіяційно)
- Words with a missing ь in ськ (Мелітопольский)
- Older spelling сьв (сьвідомий, сьвіт)
- Numbered adjectives and nouns (1-ша, 2-гий, 100-річному, 10-річчя)
- Numeric combinations in which numbers are spelled out (три-чотири, п'яти-шести, абзац-два)
- Numbered entities (Євро-2014, вибори-2012, Ан-140, ТЕЦ-1)
- Adverbs with по- (по-сибірськи, по-абхазькому)
- Words with the particles -но, -то, -от, -таки, and -бо (стривай-но, першого-таки, годі-бо)
- Nonstandard (hyphenated) spelling of the particles ж and б/би (були-б, ти-ж)
- Adjectives with -подібний and -вмісний (Ш-подібний, карбонат-вмісний)
- Nouns with regular prefixoids (VIP-будинок, PR-департамент, топ-десять, еспресо-машина, супер-Маріо)
- Name suffixes (Мустафа-ага, Ібрагім-ефенді)
- Nouns with пів- (пів-України)
- Street names ending in -авеню, -стрі(и)т, -штрас(с)е (Пенсильванія-авеню, Уолл-стрит)
- Hyphenated compounds consisting of two words belonging to the same part of speech (жило-було, учиш-учиш, ось-ось, вгору-вниз, великий-превеликий, лікар-гомеопат, міста-фортеці)
- Reduplicated interjections and onomatopoeic words (ого-го-го-го, га-га)
- Elongated words with repeated letters and/or hyphens (Та-а-ак, ва-ре-ни-ки, з-зателефоную)
- Surnames with -старший and -молодший (Алієва-старшого)
- Compound nouns without hyphens, identified based on known patterns, mostly using standard prefixes and suffixes (Лангштрассе, транс'ядерний, напівяпонка)
- Hyphenated compound adjectives in which the first part ends in о or е (патолого-анатомічний, дво-триметровий, сліпуче-яскравого, Альпійсько-Карпатського).
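A few of the regular patterns above can be sketched as candidate-generating rewrite rules. This is a simplified illustration under our own assumptions: the actual dynamic tagging module in the VESUM-based analyzer is far more extensive, and in a real pipeline each candidate would be validated against the dictionary before a lemma is assigned.

```python
import re

def candidates(token: str) -> set:
    """Generate modern-spelling candidates for a variant token.
    Each candidate would still need validation against the VESUM
    dictionary before a lemma is assigned."""
    t = token.lower()
    out = set()
    out.add(t.replace("льо", "ло"))                   # ідеольогічний -> ідеологічний
    out.add(t.replace("ія", "іа"))                    # антирадіяційно -> антирадіаційно
    out.add(re.sub(r"ск(ий|ого|ому)$", r"ськ\1", t))  # Мелітопольский -> Мелітопольський
    # elongated words: collapse hyphenated letter repeats, then letter runs
    collapsed = re.sub(r"(\w)(?:-\1)+", r"\1", t)     # та-а-ак -> таак
    out.add(re.sub(r"(\w)\1+", r"\1", collapsed))     # таак -> так
    out.discard(t)  # keep only forms that actually changed
    return out

print(candidates("Та-а-ак"))  # {'так'}
```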
Some of the words in the categories listed above are
assigned additional tags, e.g., ґестапо will receive the tag
:alt and напівяпонка will be tagged :bad due to the missing
apostrophe:
ґестапо[ґестапо/noun:inanim:n:v_naz:alt,ґестапо/noun:inanim:n:v_zna:alt]
напівяпонка[напівяпонка/noun:anim:f:v_naz:bad]
VESUM implements new official spelling rules
introduced in 2019. Spelling variants that were valid in
1992–2018 are marked with the :ua_1992 tag, while
variants that were part of the 2019 spelling reform are
tagged :ua_2019. Note, however, that some words in the
latter group (for example, священник) can be found in
older texts. Thus, this pair of spelling tags points only to the
differences between the 1992 and 2019 orthographies rather
than the time period in which such forms occur in actual
usage:
священик noun:anim:m:v_naz:ua_1992
священник noun:anim:m:v_naz:ua_2019
Байєр Байєр noun:anim:m:v_naz:prop:lname:ua_1992
Баєр Баєр noun:anim:m:v_naz:prop:lname:ua_2019
тонно-операція noun:inanim:f:v_naz:ua_1992
тоннооперація noun:inanim:f:v_naz:ua_2019
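The era tags can be used to filter dictionary entries programmatically. Below is a minimal sketch, assuming VESUM-style "form tag-string" lines like the examples above (the parsing is simplified):

```python
# Sketch: group spelling-era variants by their :ua_1992 / :ua_2019 tags in
# VESUM-style "form tag-string" lines; the parsing here is simplified.

ENTRIES = """священик noun:anim:m:v_naz:ua_1992
священник noun:anim:m:v_naz:ua_2019
тонно-операція noun:inanim:f:v_naz:ua_1992
тоннооперація noun:inanim:f:v_naz:ua_2019""".splitlines()

def forms_with_tag(entries, tag):
    result = []
    for line in entries:
        form, tags = line.split()
        if tag in tags.split(":"):
            result.append(form)
    return result

print(forms_with_tag(ENTRIES, "ua_2019"))  # ['священник', 'тоннооперація']
```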
The 2019 spelling reform also introduced a number of
subtle changes and optional forms. For example, some
geographical names (but not surnames) received a variant
ending -у (along with -а) in the singular genitive case. An
effort has been made to represent them in VESUM as
comprehensively as possible, for example,
Лондона noun:anim:m:v_rod:prop:lname:xp1
Лондона noun:inanim:m:v_rod:prop:geo:xp2
Лондону noun:inanim:m:v_rod:prop:geo:ua_2019:xp2
In order to add western Ukrainian texts to the corpus, it
is necessary to further adapt the system of morphological
analysis to the older western Ukrainian standard spelling
(from 1886 to the 1920s) known as zhelekhivka, named
after its creator Yevhen Zhelekhivsky. Of all the historical
spelling systems, this one was used in the majority of texts,
particularly in western Ukrainian newspapers. It was also
the most standardized variant, was used in the school
education system, and remained in force for a long time. Therefore, as far as
historical Ukrainian orthographies are concerned, computer
analysis is needed, above all, for zhelekhivka, even though
maksymovychivka, early kulishivka, and other systems will
also need to be covered in the future.
Zhelekhivka differs from the modern standard spelling
in several key ways. Examples include the use of ї after soft
consonants in the place of the etymological ѣ and е (снїг ←
снѣгъ, лїд ← ледъ; modern standard сніг and лід) and to
denote iotation (мілїон; modern standard мільйон);
omission of the apostrophe before я, ю, and є after labial
consonants and р (память, мякий, бю; modern standard
пам'ять, м'який, and б'ю); separation of the reflexive ся
(тримати ся; modern standard триматися) and affixes in
the complex forms of the future tense (робити му; modern
standard робитиму).
Currently, only a few types of distinct zhelekhivka
forms are recognized and tagged in GRAC:
- the reflexive ся in immediate postposition, as in називати ся;
- words with ї after consonants without a preceding apostrophe, as in цїлком;
- a missing apostrophe after labial consonants, as in мякий.
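These zhelekhivka patterns can be approximated by simple rewrite rules. The sketch below is our own illustration (the rule details and consonant classes are assumptions), and in practice each rewritten form would still need validation against the modern dictionary, since, for example, inserting an apostrophe after every labial would overgenerate.

```python
import re

# Illustrative rewrites toward modern spelling for three zhelekhivka
# patterns; the consonant classes here are simplified assumptions.

def zhel_to_modern(text: str) -> str:
    # 1. reattach the separated reflexive particle: називати ся -> називатися
    text = re.sub(r"(\w+)\s+ся\b", r"\1ся", text)
    # 2. ї after a consonant -> і: цїлком -> цілком
    text = re.sub(r"(?<=[бпвмфлнрстдзц])ї", "і", text)
    # 3. restore the apostrophe after labials before я/ю/є: мякий -> м'який
    text = re.sub(r"([бпвмф])([яює])", r"\1'\2", text)
    return text

print(zhel_to_modern("називати ся"))  # називатися
print(zhel_to_modern("цїлком"))       # цілком
print(zhel_to_modern("мякий"))        # м'який
```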
We also recognize and tag the following cases that are
not specific to zhelekhivka only but occur more widely in
early 20th-century texts:
- particles after a hyphen, e.g., могли-б (modern standard могли б);
- the suffix -ьск instead of -ьськ, e.g., україньский;
- palatalization of [z], [t͡s], [s] marked with the soft sign (ь) before labials, as in сьвято.
Other cases that do not correspond to the modern
orthography like моглиб, як би (instead of якби) and
others, particularly those that do not follow a regular
pattern of conversion to modern spelling (e.g., житє), are
currently not recognized in GRAC-12. Just like all other
forms not found in VESUM or tagged by the dynamic
module, they do not have lemmas and can only be found in
GRAC via “word search”, which matches exact word
forms.
GRAC’s metatextual markup has the ORTHOGRAPHY
attribute. Texts written in zhelekhivka are marked as such
in the metadata: doc.ORTHOGRAPHY = ZHEL. The
morphological analyzer recognizes zhelekhivka features
only in these texts. Texts in modern spelling
(doc.ORTHOGRAPHY = CONT) also sometimes contain
fragments of zhelekhivka, for example, in quotations.
Currently, such fragments within modern texts are not
identified as zhelekhivka, so zhelekhivka forms that
differ from modern spelling go unrecognized in them,
while all other words are processed properly. In the future,
historical spelling recognition can be extended to all texts
being preprocessed, lemmatized, and PoS tagged for
GRAC.
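The metadata gate described above can be sketched as follows (the function names and toy lexicon are our assumptions; GRAC's actual pipeline differs): zhelekhivka candidates are generated only for documents whose ORTHOGRAPHY attribute is ZHEL.

```python
import re

# Toy lexicon of modern forms; an assumption for illustration.
LEXICON = {"цілком": "adv"}

def zhel_normalize(token: str) -> str:
    # single illustrative zhelekhivka rule: ї after a consonant -> і
    return re.sub(r"(?<=[бпвмфлнрстдзц])ї", "і", token)

def analyze_token(token: str, orthography: str) -> dict:
    candidates = [token]
    if orthography == "ZHEL":  # gate on the doc.ORTHOGRAPHY metadata
        candidates.append(zhel_normalize(token))
    for cand in candidates:
        if cand in LEXICON:
            return {"token": token, "lemma": cand, "pos": LEXICON[cand]}
    return {"token": token, "lemma": None, "pos": None}

print(analyze_token("цїлком", "ZHEL")["lemma"])  # цілком
print(analyze_token("цїлком", "CONT")["lemma"])  # None
```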
IV. CONCLUSIONS AND FUTURE DEVELOPMENT
As a large corpus that includes Ukrainian texts written
in several different orthographies and spanning over 200
years, GRAC presents special challenges in terms of
handling nonstandard spelling. These issues include
treatment of historical and modern orthographical systems
and variants within these systems, as well as handling other
nonstandard instances that can be processed via dynamic
normalizing instruments (such as вели-и-и-кий instead of
великий). To address them, the GRAC team has developed
tailored algorithms, tools, and approaches. We have
described their scope, operation and outputs and conclude
that together they have significantly improved search in
GRAC.
The GRAC team has plans to improve these tools to
make them recognize a wider range of historical spelling
variants, including other early orthographies in addition to
zhelekhivka.
REFERENCES
[1] N. Ljubešić, K. Zupan, D. Fišer, and T. Erjavec. 2016. Normalising
Slovene data: historical texts vs. user-generated content. In
Proceedings of the 13th Conference on Natural Language Processing
(KONVENS 2016), 146–155.
[2] G. Navarro. A guided tour to approximate string matching. ACM
Comput. Surv. 33(1), 31–88 (2001)
[3] S. Cucerzan, E. Brill. Spelling correction as an iterative process that
exploits the collective knowledge of web users. In: Lin, D., Wu, D.
(eds.) Proceedings of EMNLP 2004, pp. 293–300. Association for
Computational Linguistics, Barcelona (2004)
[4] D. Hládek, J. Staš, M. Pleva. Survey of Automatic Spelling
Correction. Electronics 2020, 9, 1670.
[5] M. Reynaert. 2010. Character confusion versus focus word-based
correction of spelling and OCR variants in corpora. International
Journal on Document Analysis and Recognition 14 (2): 173–187.
[6] C. M. Veliz, O. De Clercq and V. Hoste. Comparing MT Approaches
for Text Normalization. Proceedings of Recent Advances in Natural
Language Processing, Varna, Bulgaria, Sept. 2–4, 2019, pp. 740–749.
https://www.aclweb.org/anthology/R19-1086.pdf
[7] A. Sorokin. Spelling Correction for Morphologically Rich Language:
a Case Study of Russian. Proceedings of the 6th Workshop on
Balto-Slavic Natural Language Processing, 2017 (April), pp. 45–53.
[8] J. Eisenstein. What to do about bad language on the internet. In Proc.
of the 2013 Conference of NAACL: Human Language Technologies,
pp. 359–369.
[9] Istoriia ukrainskoho pravopysu: XVI–XX stolittia [History of
Ukrainian Spelling: 16th to 20th Century]. Kyiv, 2004. (In Ukrainian).
[10] E. Pettersson, B. Megyesi, J. Nivre. (2013). Normalisation of
Historical Text Using Context-Sensitive Weighted Levenshtein
Distance and Compound Splitting. Proceedings of 19th Nordic
Conference on Computational Linguistics.
https://www.researchgate.net/publication/257921590_Normalisation_
of_Historical_Text_Using_Context-
Sensitive_Weighted_Levenshtein_Distance_and_Compound_Splitting
[11] A. Robertson. Automatic Normalisation of Historical Text.
Edinburgh. 2017.
https://homepages.inf.ed.ac.uk/s1202948/pdfs/msc_thesis.pdf.
[12] M. Bollmann. A Large-Scale Comparison of Historical Text
Normalization Systems. 2019. https://arxiv.org/pdf/1904.02036.pdf
[13] M. Bollmann. Normalization of Historical Texts with Neural Network
Models. 2018.
https://www.linguistics.rub.de/forschung/arbeitsberichte/22.pdf
[14] Y. P. Blum. Techniques for Automatic Normalization of
Orthographically Variant Yiddish Texts. New York. 2015.
https://academicworks.cuny.edu/cgi/viewcontent.cgi?article=1524&co
ntext=gc_etds
[15] V. Starko, A. Rysin, M. Shvedova. Ukrainian text preprocessing in
GRAC, this volume.
[16] Z. Franko, Variantnist chy terytorialna vidminnist ukrainskoi
literaturnoi movy [Variation or territorial difference of the Ukrainian
literary language], Ukrainska istorychna ta dialektna leksyka
[Ukrainian Historical and Dialectal Vocabulary], 2 (1991), pp. 169–173. (In Ukrainian).
[17] P. E. Hrytsenko, Nekotorye zamechanyia o dyalektnoi osnove
ukrayinskoho literaturnoho yazyka [Some remarks on the dialectal
basis of the Ukrainian literary language]. Philologia slavica: To the
70th anniversary of the Academician N.Y. Tolstoy (1993), 284–294.
(In Russian).
[18] I. Matviias, Varianty ukrainskoi literaturnoi movy v kintsi XVIII i v
XIX stolitti [Variants of the Ukrainian literary language in the late
18th and 19th centuries], Kultura slova [Culture of the Word] 48-49
(1996), pp. 11–28. (In Ukrainian).
[19] O. Varchenko. Vid redaktora [From the Editor]. Stefan Kovaliv. Svit
uchyt rozumu. Opovidannia [Stefan Kovaliv. The World Teaches
Intelligence. Stories]. Kyiv, 1960, pp. 3–4. (In Ukrainian).
[20] O. Drul. Popravliuvanyi Franko [Corrected Franko]. Zbruch. 2015.
URL: https://zbruc.eu/node/35977 (In Ukrainian).