Croatian Language N-Gram System
Šandor Dembitz, Bruno Blašković, Gordan Gledec
University of Zagreb, Faculty of Electrical Engineering and Computing
Unska 3, HR-10000 Zagreb, Croatia
{sandor.dembitz, bruno.blaskovic, gordan.gledec}@fer.hr
Abstract. Large-scale n-gram models are available for a small number of languages; so far, Croatian has not been one of them. The research presented in this paper describes the development of an n-gram database system suitable for large-scale language modeling of Croatian. The n-gram collection process relies on the Croatian academic online spellchecker Hascheck, which has been publicly available since 1993 and is today a popular language service with average daily traffic exceeding a million tokens. The approach demonstrated in this paper eliminates the need for n-gram data cleaning in the post-processing phase, a serious issue for other languages. The spellchecker’s usage dynamics allowed Heaps’ law modeling to be applied to Croatian n-grams, enabling the prediction of n-gram count growth.
Keywords: Croatian, lexical n-gram, language modeling, Heaps’ law
1 Introduction
Lexical n-grams are nowadays an important data infrastructure in many areas of
natural language processing (NLP) [1]. Many technologies are taking advantage of
large-scale language models derived from gigantic corpora. “More words and less linguistic annotation” is a trend clearly expressed in the following statement by the Google research team [2], which our research follows:
“So, follow the data. Choose a representation that can use unsupervised learning
on unlabeled data, which is so much more plentiful than labeled data. Represent all
the data with a nonparametric model rather than trying to summarize it with a
parametric model, because with very large data sources, the data holds a lot of detail.
For natural language applications, trust that human language has already evolved
words for the important concepts. See how far you can go by tying together the words
that are already there, rather than by inventing new concepts with clusters of words.
Now go out and gather some data, and see what it can do.” (p. 12)
However, large-scale language models remain the privilege of a handful of world languages [3]. Abundant linguistic data collection is a prerequisite for large-scale language modeling, but in many cases it is a hardly feasible step in the machine processing of minority languages such as Croatian. We therefore took advantage of the Croatian academic online spellchecker Hascheck [4] and started collecting n-grams, n = 1, ..., 5, in the summer of 2007. Relying on an existing and popular language service eliminated web crawling, which can be a risky method for compiling a gigantic corpus written in a minority language.
Our original intention was to use the n-grams as the basis for upgrading Hascheck into a contextual spellchecker [5], [6], but in the course of development it became clear that the results are much more broadly applicable, as demonstrated in [7]. From the respectable amount of data collected so far, we have succeeded in developing a consistent and maintainable database system for 1-grams, 2-grams and 3-grams. Extending the system to 4-grams and 5-grams is left for the time when our processing power becomes able to cope with giga- and terabyte files within an acceptable response time. The methodology for the upgrade is already prepared, as will be demonstrated throughout the paper.
The paper is organized into five sections. Section 2 gives a short overview of Hascheck and the growth of its traffic and corpus. N-gram database creation and maintenance is described in Section 3. Section 4 is devoted to Heaps’ law applied to Croatian n-grams; it also covers the n-gram counts reached by the end of 2011, which allows a comparison of Croatian language modeling power with that achieved for two better-resourced Slavic languages. Finally, Section 5 brings our concluding remarks.
2 On Hascheck and Its Traffic
Hascheck started as an e-mail embedded service in 1993, first locally, only for the
staff of the Faculty of Electrical Engineering and Computing in Zagreb, but it quickly
became a public service (in March 1994), primarily dedicated to the Croatian
academic community. In the summer of 2003 the e-mail service was converted into a web service available at http://hascheck.tel.fer.hr/. With the web interface, Hascheck became a service adopted worldwide, with users in 115 countries/territories.
Hascheck has two subsystems: the real-time subsystem, which reacts immediately
to text received for spellchecking, and the post-processing subsystem, which uses
collected process data and performs learning, system statistics and similar tasks. The
outcome of learning is the update of the dictionary, i.e. the improvement of the
spellchecker’s functionality. The learning system incorporated into the post-
processing subsystem is what makes Hascheck different from other spellcheckers.
Hascheck’s dictionary is organized in three word-list files:
Word-Type (WT) file,
Name-Type (NT) file,
English-Type (EngT) file.
The WT-file contains Croatian common word-types: words which may occur written in lower case only, with an initial upper-case letter (at the start of a sentence, for example), or in upper case only, and which were not borrowed (with their orthography) from foreign languages but belong intrinsically to the Croatian language itself. The WT-file started with approximately 100,000 entries, but through the learning process it has grown to approximately 900,000 word-types.
The NT-file contains all case-sensitive elements of writing. These are proper and other names, abbreviations and acronyms, as well as names with unusual use of lower and upper case, like “LaTeX” or “WordPerfect”. The NT-file also contains alphanumeric lexical elements like “3D”. Furthermore, it contains words from German, Hungarian, Italian and other foreign languages that appear in Croatian writing in their original orthography. The file started empty, but in the course of learning it has by now grown to approximately 700,000 different name-types.
Our decision to include an English word-list (the EngT-file) in Croatian spellchecking is based on the fact that English, as the modern lingua franca, often comes mixed with Croatian in contemporary Croatian writing. The initial word-list was compiled at the very start of the service from several reliable sources [4]. After the exclusion of words written identically in Croatian and English, like “atom” or “zebra”, which are placed in the WT-file only, this produced an EngT-file with approximately 70,000 different word-types. The EngT-file is the only static component of the dictionary; new English common word-types that appear during learning are placed into the NT-file.
The increase of Hascheck’s popularity is well reflected in the growth of its traffic. Systematic traffic data collection started in September 2003. From then until December 31, 2011, Hascheck processed a text corpus amounting to 660 Mtokens (Fig. 1).
Figure 1: Cumulative traffic growth (x-axis: months; y-axis: processed corpus in tokens). Trend line: y = 1,387,391·e^(0.0626x), R² = 0.9969.
The cumulative approach to traffic growth modeling eliminates the seasonal oscillations that the traffic normally exhibits. The data points in Fig. 1 represent the cumulative corpus reached in each of the last 72 months, while the line is the trend line that fits the data best. The growth function is exponential, with a very high coefficient of determination, R² = 0.9969. Therefore, the traffic growth function can be considered a law capable of predicting future traffic behavior. We expect to reach a corpus of a billion tokens in the summer of 2012. With this corpus size, Croatian joins the languages represented in the WEB1T corpus [3]. The size of the corpus of all web pages written in Croatian is estimated at 1.2 billion tokens [8].
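As an illustration of how the trend line in Fig. 1 supports prediction, the following sketch inverts the fitted function y = 1,387,391·e^(0.0626x) to find the month in which a given cumulative corpus size is reached. It is a minimal illustration, assuming the month index x is counted from the start of systematic data collection in September 2003.

import math

# Trend-line parameters fitted in Fig. 1: y = a * exp(b * x).
a = 1_387_391   # scale factor (tokens)
b = 0.0626      # monthly growth rate

def month_when(target_tokens: float) -> float:
    """Invert y = a * exp(b * x) to find the month index x at which
    the cumulative corpus reaches target_tokens."""
    return math.log(target_tokens / a) / b

print(f"One billion tokens reached around month {month_when(1e9):.1f}")  # ~105

Month 105 counted from September 2003 falls in mid-2012, consistent with the expectation stated above.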
3 Database Creation and Maintenance
In order to be suitable for producing an applicable language model, a lexical n-gram database system has to be as consistent as possible. In our case, system consistency means that each n-gram is built only of tokens recognized by Hascheck as legal words with evidence in Croatian writing. N-grams do not contain punctuation marks or other non-alphanumeric characters. We decided to treat the blank and the dash as the only legal token separators. Additionally, to keep up with current practice in our spellchecking service, the dash was later converted into a blank, thus producing two tokens from each semi-compound. These decisions eliminated tokens like e-mail and URL addresses, as well as decimal and ordinal numbers (in Croatian, an ordinal number ends with a full stop), from our database. A further restriction was imposed on the occurrence of numerical tokens in n-grams: no two consecutive numbers may occur as constituents of bigrams, trigrams etc., which eliminates ISBN numbers, UUIDs and similar tokens. All these restrictions made our database almost purely lexical in the conventional meaning of the term.
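A minimal sketch of how such restrictions might be enforced is given below; the function names and the exact filtering order are our own illustration, not Hascheck’s actual implementation, and the further requirement that every token be recognized by the spellchecker as a legal word is omitted here.

import re

def tokenize(line: str) -> list[str]:
    """Split a line into candidate tokens: blanks and dashes are the only
    separators, so each semi-compound yields two tokens."""
    return [t for t in re.split(r"[\s\-]+", line) if t]

def is_legal_token(token: str) -> bool:
    """Keep only purely alphanumeric tokens; punctuation, e-mail and URL
    addresses, and decimal or ordinal numbers are thereby excluded."""
    return token.isalnum()

def is_legal_ngram(ngram: tuple[str, ...]) -> bool:
    """Reject n-grams with two consecutive numerical tokens
    (e.g. fragments of ISBNs or UUIDs)."""
    return not any(a.isdigit() and b.isdigit() for a, b in zip(ngram, ngram[1:]))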
Because of processing limits, we had to create the Croatian n-gram database system as a set of disjoint files that can be handled as a program array. The bigrams are therefore split into 8 files, while the trigrams are split into 22 files. The record structure is simple: the n-gram serves as the record key and is followed by its frequency, i.e. the number of occurrences of the n-gram in the corpus.
Database creation is in fact the update of the existing n-gram base with the n-grams collected after its last creation. Let us denote by DATA-BASEi the existing database, where the index expresses its splitting into the corresponding number of files for a given n, and by data-set the new collection of n-grams. Clearly, the new n-gram collection must not exceed the size that remains manageable as a program array, and database maintainers have to take that into account. The creation procedure follows these steps (a minimal sketch in code is given after the list):
1. From each DATA-BASEi extract all records whose key is present in data-set and form from them a DBi.
2. For each given index create Dbasei = DATA-BASEi \ DBi, where “\” denotes the set difference operation.
3. For all given indexes add the DBi records to data-set and, from the obtained set, create dBASEi sets according to the file disjunction criteria for the given n.
4. Create the new DATA-BASEi = Dbasei ∪ dBASEi for each given index.
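A minimal sketch of one update cycle for a single file index is given below. It assumes that both the existing base and the new data-set fit in memory as Python dictionaries mapping n-grams to frequencies, that frequencies of shared keys are summed in step 3, and that the hypothetical helper belongs_to_file implements the file disjunction criteria; the real system performs steps 3 and 4 across all file indexes at once.

def update_database(database: dict[str, int], data_set: dict[str, int],
                    belongs_to_file) -> dict[str, int]:
    """One update cycle following steps 1-4 for a single file index i."""
    # Step 1: records of the existing base whose key occurs in the data-set.
    db_i = {k: v for k, v in database.items() if k in data_set}
    # Step 2: set difference - records untouched by the new collection.
    dbase_i = {k: v for k, v in database.items() if k not in data_set}
    # Step 3: merge the extracted records into the data-set (frequencies add up)
    # and keep only the n-grams that the disjunction criteria route to this file.
    merged = dict(data_set)
    for k, v in db_i.items():
        merged[k] = merged.get(k, 0) + v
    d_base_i = {k: v for k, v in merged.items() if belongs_to_file(k)}
    # Step 4: the new base for this index is the union of the two disjoint sets.
    dbase_i.update(d_base_i)
    return dbase_i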
The n-gram probability P(w), w = w1w2···wn, can be calculated forwards,
P(w) = P(w1) P(w2|w1) ··· P(wn|w1···wn−1)   (1)
and backwards,
P(w) = P(wn) P(wn−1|wn) ··· P(w1|wn···w2)   (2).
The bidirectional calculation is important when w1 or wn becomes a set of correction suggestions for a misspelling or typo found in a text. For example, if an error affects the first word of a sentence, only the backward calculation can be applied when looking for the optimal correction(s). Furthermore, in many applications only conditional probabilities are used. A plausible example of the usefulness of the bidirectional calculation is drawn from our bigram database. Two conditional probabilities are associated with the bigram “Eric Clapton”:
P(w2|w1) = P(Clapton|Eric) = 0.0541462,
P(w1|w2) = P(Eric|Clapton) = 0.856287.
A practical consequence: if one finds only “Clapton” in a text, the text is almost surely dealing with Eric Clapton, while if one finds only “Eric”, additional processing has to be done in order to determine to whom the name belongs.
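Both directions can be computed from the bigram base and its positional unigram sub-bases. The following is a minimal sketch under that assumption; it is an illustration, not the system’s actual code.

def conditional_probs(bigram_count: int, w1_count: int, w2_count: int):
    """Forward and backward conditional probabilities of a bigram w1 w2,
    given its frequency and the positional unigram frequencies."""
    forward = bigram_count / w1_count    # P(w2 | w1), w1 counted in first position
    backward = bigram_count / w2_count   # P(w1 | w2), w2 counted in second position
    return forward, backward

# With the actual counts behind "Eric Clapton", these ratios evaluate to the
# probabilities quoted above: P(Clapton|Eric) ~ 0.054 and P(Eric|Clapton) ~ 0.856.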
The bidirectional calculation implies that the language model for a given n must be supplied with positional m-gram sub-bases, m = n−1, ..., 1. The sub-base creation is a recursive process which follows the n-gram base creation. The order of the n-grams contained in the data-set decreases gradually by 1 until m becomes 1, and in each step the corresponding sub-base is updated following the procedure described above. In our case, this means that the bigram database has two unigram sub-bases, while the trigram database is supported by two bigram sub-bases and three unigram sub-bases. The positional bigram sub-bases are proper subsets of the original bigram database, both in terms of keys and frequencies, and the same holds for the positional unigram sub-bases. Furthermore, each bigram sub-base is also split into 8 disjoint files, like the original one, which follows from the creation procedure. This means that the complete Croatian trigram language model relies on data from 41 files (22 trigram files, 2 × 8 positional bigram files and 3 positional unigram files), while the bigram model relies on data from 8 bigram files and 2 positional unigram files. We found these numbers to be a practical upper limit for our processing environment, both in terms of system creation and maintainability.
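As an illustration of the recursive step, the sketch below derives the positional sub-bases of an in-memory trigram collection, under our reading that each positional frequency is the sum of the frequencies of the trigrams containing that bigram or unigram in the corresponding position.

from collections import Counter

def positional_sub_bases(trigrams: dict[tuple[str, str, str], int]):
    """Derive the two positional bigram bases and the three positional
    unigram bases of a trigram collection by summing trigram frequencies."""
    w1w2, w2w3 = Counter(), Counter()
    w1, w2, w3 = Counter(), Counter(), Counter()
    for (a, b, c), freq in trigrams.items():
        w1w2[(a, b)] += freq
        w2w3[(b, c)] += freq
        w1[a] += freq
        w2[b] += freq
        w3[c] += freq
    return w1w2, w2w3, w1, w2, w3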
Hascheck’s dictionary is not 100% error-free. The portion of erroneous elements in it is small (dictionary impurity is measured in permilles [4]), but they exist. A new error may enter the dictionary during learning. A special part of the learning system takes care of dictionary errors. If such an error is found, it is deleted from the corresponding dictionary file. Each dictionary error deletion triggers an update of the n-gram files, since an n-gram is legal in the database system if and only if it is composed of elements recognized by Hascheck as legal words with evidence in Croatian writing.
Let us denote by e an error found in the dictionary. This element may occur in the n-gram database system:
as the unigram e-record in a unigram file;
as a constituent of two bigram types, w1e and ew2, where the corresponding bigrams may be spread over up to 8 bigram files;
as a constituent of three trigram types, w1w2e, w1ew3 and ew2w3, where the corresponding trigrams may be spread over up to 22 trigram files.
Deletions of n-grams with an erroneous constituent also cause updates of the positional m-gram sub-bases, in order to keep the system consistent. At the trigram level this means that the bigrams w1e, w2e, ew2 and ew3 have to be deleted from the corresponding bigram sub-bases for all values of w1, w2 and w3. Furthermore, the frequencies of all w1w2 and w2w3 in the corresponding bigram sub-bases have to be reduced by the number of occurrences of w1w2e and ew2w3, respectively. The procedure continues down to the trigram-base positional unigrams, where e has to be deleted and the frequencies of each w1, w2 and w3 have to be reduced by the values attached to the deleted w1w2e, w1ew3 and ew2w3 trigrams. At the bigram level, after the deletion of all w1e and ew2, only two unigram sub-bases have to be updated in terms of content and frequencies, which is done the same way as described above for the trigram-base positional bigram and unigram sub-base updates. Finally, the unigram database update amounts to the deletion of the e-record only.
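The trigram-level cascade for one dictionary error e can be sketched as follows, operating on in-memory dictionaries; the real system spreads these records over many files, so the structure below is our simplification of the update, not its actual implementation.

def remove_error_from_trigram_base(e, trigrams, w1w2, w2w3, w1, w2, w3):
    """Delete every trigram containing the erroneous word e and propagate the
    frequency reductions to the positional bigram and unigram sub-bases."""
    for tri in [t for t in trigrams if e in t]:
        freq = trigrams.pop(tri)
        a, b, c = tri
        # Positional bigram sub-bases: delete keys containing e, otherwise
        # reduce their frequency by the deleted trigram's count.
        for base, key in ((w1w2, (a, b)), (w2w3, (b, c))):
            if e in key:
                base.pop(key, None)
            elif key in base:
                base[key] -= freq
                if base[key] <= 0:
                    del base[key]
        # Positional unigram sub-bases: the same rule, applied per position.
        for base, key in ((w1, a), (w2, b), (w3, c)):
            if key == e:
                base.pop(key, None)
            elif key in base:
                base[key] -= freq
                if base[key] <= 0:
                    del base[key]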
A program able to perform the cleaning described above has to take care of many variables: the files in which the erroneous n-grams and their sub-grams are placed, the frequency of each n-gram, together with the frequencies of its sub-gram constituents not affected by the dictionary error, and so on. The problem is solved by a program that generates another program in which all variables are properly instantiated, depending on the updating cases encountered.
4 Heaps’ Law Applied to N-Grams
Zipf’s law [9], which states that the frequency of tokens in a large corpus of natural language is inversely proportional to the token rank, can be extended to lexical n-grams, too [10]. If a phenomenon obeys Zipf’s law, it also obeys Heaps’ law [11]. The law connects the “vocabulary” size V, in terms of the number of different n-grams in it, with the size t of the corpus in which the n-grams are present:
V(t) = α·t^β   (3)
Parameters α and β are free parameters to be determined empirically. The parameter α is strongly language dependent, while β is almost language independent, subject to the condition 0 < β < 1.
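Since (3) is linear in log-log space, log V = log α + β·log t, the parameters can be estimated by a least-squares fit of the logarithms. The sketch below uses illustrative data points, not the actual monthly measurements behind Figs. 2-4.

import numpy as np

# Illustrative (corpus size, vocabulary size) pairs.
t = np.array([1e8, 2e8, 3e8, 4e8, 5e8, 6e8])
v = np.array([0.9e6, 1.2e6, 1.4e6, 1.55e6, 1.7e6, 1.8e6])

# Heaps' law V = alpha * t**beta becomes a straight line after taking logs.
beta, log_alpha = np.polyfit(np.log(t), np.log(v), 1)
alpha = np.exp(log_alpha)
print(f"alpha = {alpha:.2f}, beta = {beta:.4f}")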
From the corpus processed between the summer of 2007 and the end of 2011 (630 Mtokens in total), the 24 empirical points shown in Fig. 2 were chosen randomly. From our previous experience [4] we were confident that the law would fit them almost perfectly, as demonstrated by the extremely high coefficient of determination in the figure.
Figure 2: Heaps’ law for Croatian unigrams (x-axis: corpus size in tokens; y-axis: number of different unigrams). Fitted curve: y = 335.57·x^0.4261, R² = 0.999.
Figure 3: Heaps’ law for Croatian bigrams (x-axis: corpus size in tokens; y-axis: number of different bigrams). Fitted curve: y = 79.472·x^0.6596, R² = 0.9992.
Because of Zipf’s law deviations in the low corpus size range [12], and since for bigrams and trigrams a low corpus size may well be in the range of 100 Mtokens, for the higher n-gram orders we decided to take the empirical points for the Heaps’ law parameter calculation from the last 24 months of processing. The results are presented in Figs. 3 and 4. In both cases the R² value tells us that we have obtained a tool for predicting future n-gram count growth, which is important when dealing with modest processing resources.
Figure 4: Heaps’ law for Croatian trigrams (x-axis: corpus size in tokens; y-axis: number of different trigrams). Fitted curve: y = 20.344·x^0.7728, R² = 0.9994.
At the end of 2011, the n-gram databases reached the following volumes:
The unigram base had 1,857,301 records, representing 590,863,453 legal unigram tokens found in the corpus. The number of unigram records exceeds the dictionary size because of n-gram case sensitivity, which means that a common word may occur written in three ways (lower case only, with an initial upper-case letter, or upper case only), while the same word is counted as a unique word-type in the dictionary.
The bigram base had 50,173,968 records, representing 495,550,637 legal
bigram tokens found in the corpus. Its first-position unigram base (w1 base)
had 1,490,142 records, while the second-position unigram base (w2 base) had
1,354,993 records.
The trigram base had 127,323,579 records, representing 421,896,220 legal
trigram tokens found in the corpus. Its first-position bigram base (w1w2 base)
had 43,571,360 records, while the second-position bigram base (w2w3 base)
had 44,017,264 records. At the unigram sub-bases level the following
volumes were encountered: w1 base had 1,361,392 records, w2 base had
1,199,561 records and w3 base had 1,256,003 records.
The whole Croatian n-gram database system was 5.4 GB in size on December 31, 2011. This size will grow as the corpus size increases, as described by Fig. 1. With a corpus of a billion tokens processed by Hascheck, a size comparable to that of the Croatian web corpus [8], we expect to have acquired about 70 million different bigrams and over 180 million different trigrams.
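These expectations follow directly from evaluating the fitted curves of Figs. 3 and 4 at t = 10^9 tokens; a minimal check in code:

# Heaps' law parameters fitted in Figs. 3 and 4.
bigram_alpha, bigram_beta = 79.472, 0.6596
trigram_alpha, trigram_beta = 20.344, 0.7728

t = 1e9  # a billion-token corpus
print(f"predicted distinct bigrams:  {bigram_alpha * t ** bigram_beta:,.0f}")   # ~69 million
print(f"predicted distinct trigrams: {trigram_alpha * t ** trigram_beta:,.0f}") # ~183 million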
Comparing the above volumes with the n-gram counts for Czech [13] and Polish [14], the only two Slavic languages represented in the WEB1T corpus [3], it becomes clear that the Croatian language now has a similar infrastructure for serious language modeling. This result was possible only because of Hascheck. Relying on Hascheck also eliminates the need for n-gram data cleaning, which is a serious issue in both compared languages.
5 Conclusion
The paper presents how it was possible to produce a maintainable and upgradable linguistic data infrastructure for serious language modeling of Croatian. Instead of crawling the Web for the purpose of corpus creation, we used an existing language service, the Croatian online spellchecker Hascheck, for data collection, which proved to be an economical and reliable way of obtaining a large-scale lexical n-gram system.
The infrastructure is ready to be used in various NLP areas: contextual spellchecking, speech synthesis, automatic speech recognition, text mining, information retrieval, interactive information access, machine translation and others. In order to be exploitable to its full potential, the infrastructure will migrate to the Isabella Cluster [15], a shared resource of all scientists in Croatia dedicated to demanding data processing, and become publicly available. With this migration, extending the database system to 4-grams and 5-grams will also become easily feasible.
Even before the migration, the system’s usefulness has been demonstrated through Hascheck itself. Based on the n-gram databases, correction suggestions for run-on words have been successfully implemented. Furthermore, a systematic inspection of lexical n-grams revealed several frequent but easily solvable contextual cases, such as the usage of the preposition “with”, whose Croatian counterpart may occur in two forms, “s” and “sa”, depending on the word that follows the preposition; resolving this case is now incorporated in our spellchecking tool. Therefore, we believe that public accessibility of the Croatian language n-gram system may accelerate NLP research and development in Croatia and bring Croatian NLP technology products closer to those of the leading world languages.
References
1. Pueyo, J., Quiles-Follana, J. A.: Trends in Natural Language Processing and Text
Mining. Upgrade 11(3), 33-39 (2010)
2. Halevy, A. Y., Norvig, P., Pereira, F.: The Unreasonable Effectiveness of Data. IEEE
Intelligent Systems 24(2), 8-12 (2009)
3. Brants, T., Franz, A.: Web 1T 5-gram, 10 European Languages, Version 1. Linguistic
Data Consortium, Philadelphia (2009), http://www.ldc.upenn.edu/
4. Dembitz, Š., Randić, M., Gledec, G.: Advantages of Online Spellchecking: a Croatian
Example. Software – Practice & Experience 41, 1203-1231 (2011)
5. Wilcox-O’Hearn, A., Hirst, G., Budanitsky, A.: Real-Word Spelling Correction with
Trigrams: A Reconsideration of the Mays, Damerau and Mercer Model. In:
Proceedings of the 9th International Conference on Intelligent Text Processing and
Computational Linguistics (LNCS, vol. 4919), pp. 605–616. Haifa, Israel (2008)
6. Islam, A., Inkpen, D.: Real-Word Spelling Correction using Google Web 1T 3-grams.
In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing, pp. 1241-1249. Singapore (2009)
7. Jurić, D., Banek, M., Dembitz, Š.: Informativeness of Noun Bigrams in Croatian. Paper accepted at KES-AMSTA 2012
8. Ljubešić, N., Erjavec, T.: hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene. In: Text, Speech and Dialogue 2011 Conference Proceedings, pp. 395-402. Springer, Berlin, Heidelberg (2011)
9. Zipf, G. K.: Human Behavior and the Principle of the Least Effort. Addison-Wesley,
Cambridge, MA (1949)
10. Ha, L. O., Sicilia-Garcia, E. I., Ming, J., Smith, F. J.: Extension of Zipf's Law to Words
and Phrases. In: Proceedings of the 19th International Conference on Computational
Linguistics, pp. 315-320. Taipei, Taiwan (2002)
11. Kornai, A.: How Many Words Are There? Glottometrics 4, 61-86 (2002)
12. Kornai, A.: Zipf's Law outside the Middle Range. In: Proceedings of the Sixth Meeting on Mathematics of Language, pp. 347-356. University of Central Florida, Orlando (1999)
13. Prochazka, V., Pollak, P., Zdansky, J., Nouza, J.: Performance of Czech Speech
Recognition with Language Models Created from Public Resources. Radioengineering
20, 1002-1008 (2011)
14. Ziółko, B., Skurzok, D.: N-Gram Model for Polish. In: Ipšić, I. (ed.) Speech and
Language Technologies, pp. 107-126. InTech, Rijeka, Croatia (2011)
15. Isabella Cluster, University of Zagreb, University Computing Centre (2012),
http://www.srce.unizg.hr/homepage/products-services/computer-
resources/cluster-and-grid-technologies/isabella-cluster/