Conference PaperPDF Available

Converting Afrikaans to Dutch for technology recycling

Authors:

Abstract and Figures

HLT resource development for a resource scarce language (L2) can be expedited by recycling existing technologies for a closely related language (L1). To improve the success of L1 technologies on L2 data, one can convert L2 data to make it appear more L1-like. We explore this possibility by developing an Afrikaans-to-Dutch lexical conversion module and using it as pre-processing step before applying a Dutch part of speech tagger to Afrikaans data. The accuracy of the Dutch tagger increased from 62.6%, when tagging raw Afrikaans data, to 80.6% when tagging converted Afrikaans data. We therefore conclude that, at least in the case of Dutch and Afrikaans, the use of lexical conversion as a pre-processing step for technology recycling merits further investigation.
Content may be subject to copyright.
Converting Afrikaans to Dutch for technology
recycling
Suléne Pilon
School for Languages
North-West University (VTC)
Vanderbijlpark
Sulene.Pilon@nwu.ac.za
Gerhard van Huyssteen
Centre for Text Technology (CTexT)
North-West University (PC)
Potchefstroom
Gerhard.VanHuyssteen@nwu.ac.za
Liesbeth Augustinus
Centre for Computational Linguistics
Katholieke Universiteit
Leuven
Liesbeth.Augustinus@gmail.com
Abstract—HLT resource development for a resource scarce
language (L2) can be expedited by recycling existing technologies
for a closely related language (L1). To improve the success of L1
technologies on L2 data, one can convert L2 data to make it
appear more L1-like. We explore this possibility by developing an
Afrikaans-to-Dutch lexical conversion module and using it as
pre-processing step before applying a Dutch part of speech
tagger to Afrikaans data. The accuracy of the Dutch tagger
increased from 62.6%, when tagging raw Afrikaans data, to
80.6% when tagging converted Afrikaans data. We therefore
conclude that, at least in the case of Dutch and Afrikaans, the use
of lexical conversion as a pre-processing step for technology
recycling merits further investigation.
Keywords-technology recycling; lexical conversion; Part of
speech tagging; Afrikaans; Dutch
I. I
NTRODUCTION
To adapt/re-engineer existing technologies for language L1
to a closely related, resource-scarce language L2 is a strategy
that could be adopted to fast-track the development of
resources for L2. The rationale behind this process of
technology transfer is that if the languages are similar enough,
it should be faster and cheaper to adapt L1 technologies to L2
than to develop L2 technologies from scratch [1]. In [2] we
have argued that Dutch and Afrikaans are similar enough, and
that it would thus be easier and quicker to adapt existing Dutch
technologies to Afrikaans, rather than to develop Afrikaans
resources from scratch.
To improve the efficiency of L1 technologies on L2 data,
and thereby further reduce the need for manual intervention, an
extra layer of processing could be added during which L2 data
is manipulated to appear more L1-like. The nature of this
processing would depend on the differences between the
languages in question, and could include syntactic re-ordering,
the splitting of morphemes, or some form of machine
translation. Given the differences between Afrikaans and Dutch
(see [2]), for example, a lexical convertor could be used to
convert Afrikaans lexemes to Dutch before a Dutch POS tagger
is applied to the data. Even though this conversion would not
yield a good Dutch translation, the fact that the data appears
more Dutch-like raises the question whether this will have a
positive effect on the success of the tagger. This is the research
question we would like to address in this paper, and we
describe the development and application of such an Afrikaans-
to-Dutch convertor (A2DC).
We first give an overview of previous work in Section 2,
after which A2DC is explained in Section 3 and a word-level
and sentence-level evaluation of A2DC is described in Section
4.1. Since A2DC was developed to expedite and improve
technology recycling between Afrikaans and Dutch, we then
use a Dutch POS tagger to annotate unconverted and converted
Afrikaans data to evaluate the contribution that the conversion
step makes in the recycling process. This experiment is
described in Section 4.2. Section 5 concludes and gives a view
to future work.
II. P
REVIOUS
W
ORK
1
To facilitate the development of lexical convertors, we
developed a lexical conversion module, which consists of a
language independent algorithm and four language specific
resources, viz. a specialised bilingual dictionary (Lex.LangIn-
LangOut.txt), a target language lexicon (Lex.LangOut.txt),
language-pair specific morpheme conversion rules
(MorphRules.txt), and language-pair specific grapheme-to-
grapheme conversion rules (G2GRules.txt). Since Dutch is
generally considered to be morphologically more complex than
Afrikaans, we first used this conversion module to develop a
Dutch to Afrikaans lexical convertor (D2AC) to serve as proof-
of-concept for rule-based lexical conversion between Afrikaans
and Dutch [2].
In [2] we reported an accuracy of 71% on word-level and a
BLEU score of 0.2519 in a small-scale evaluation on running
text. After some minor improvements to D2AC, which includes
expansion of the bilingual lexicon and rule optimisation and re-
ordering, we obtained an accuracy of 71.8% in the word-level
evaluation (see [3]). To illustrate the differences between
D2AC
2009
[2] and D2AC
2010
[3], the sizes of the language
dependant components of each of the convertors are given in
Table 1.
We also conducted a sentence-level evaluation to compare
the convertor to the Dutch-Afrikaans Google Translate (GT).
The results of this evaluation are shown in Table 2. It is not
surprising that GT achieves a higher BLEU score than
D2AC
2010
in the automatic evaluation, since D2AC
2010
is only
1 Please see [2] for a detailed discussion of the D2 AC module and procedure.
intended to do lexical transfer and not fully-fledged machine
translation.
Table 1: Comparison of language specific components
D2AC
2
009
D2AC
2010
Lex.LangIn-LangOut.txt
2 696 2 740
Lex.LangOut.txt
350 943 385 599
MorphRules.txt
94 80
G2GRules.txt
53 57
A human assessment of the translation outputs shows that
the output of the two systems contains similar errors: both
leave a large number of words untranslated and have difficulty
translating specific syntactic constructions (such as the
negative construction) correctly. In addition, GT seems to
translate Dutch compounds ineffectively, since compounds are
consistently split into constituents, before being translated (e.g.
Du. vervoersituaties is incorrectly translated to *vervoer
situasies instead of the correct Afr. vervoersituasies ‘transport
situations’).
Table 2: BLEU scores of D2AC2010 and GT
D2AC
2010
GT
% of 1-gram matches 58.88 72.32
% of 2-gram matches 29.36 46.21
% of 3-gram matches 16.76 33.09
% of 4-gram matches 9.59 23.57
BLEU 0.22 0.40
III. A
N
A
FRIKAANS
-
TO
-D
UTCH
C
ONVERTOR
A. Development of A2DC
Given the relative success of D2AC, we also wanted to
experiment with the conversion of Afrikaans to Dutch. Since
A2DC would have to do the exact opposite of D2AC, the
relevant language-pair specific components of D2AC
2010
(i.e.
Lex.LangIn-LangOut.txt, G2GRules.txt and MorphRules.txt)
were reversed semi-automatically to create a base-line system
from which A2DC could be developed. After the reversal, the
rules contained in G2GRules.txt and MorphRules.txt were re-
ordered to ensure that more specific rules would be executed
first. Also, very general rules, such as the rule stating that any
word can be suffixed with a d or t, were removed and replaced
with more specific rules where possible.
Since it is sometimes not easy to decide whether a rule
should be included in MorphRules.txt or in G2GRules.txt (e.g.
the rule stating that the Afrikaans word ending -sie should be
replaced by -tie to convert Afr. petisie to Du. petitie ‘petition’),
some experiments were done in this regard during a manual
rule-order optimisation process. “Ambiguous” rules (i.e. rules
which are not clear-cut morphological or graphological rules)
were divided in a way which seemingly optimised the
performance of the convertor. However, further investigation is
needed to ensure an optimal rule division and automatic rule-
induction will also be used to ensure the optimality of the rule-
ordering and division. The resulting list of A2DC’s
MorphRules.txt contains 62 entries, while G2GRules.txt
contains 80 entries.
The Lex.LangIn-LangOut.txt used in D2AC was also
reversed so that it contained a list of Afrikaans entries (one
entry per line) with possible Dutch translation alternatives. To
enrich the Lex.LangIn-LangOut.txt a bilingual list of function
words was developed and added to the existing entries in the
lexicon. The resulting Lex.LangIn-LangOut.txt used in A2DC
contains 2 474 entries. This list is smaller than the Lex.LangIn-
LangOut.txt of D2AC
2010
due to the fact that different Dutch
lexemes were translated to the same Afrikaans lexeme (e.g. Du.
attachment and bijlage, which both translate to Afr. aanhegsel
‘attachment’). Therefore, when the list was reversed, multiple
Dutch entries were combined into one Afrikaans entry (i.e.
aanhegsel attachment//bijlage). The lemma list of e-lex [4] was
used as a Dutch lexicon (i.e. as Lex.LangOut.txt) and contains
412,683 Dutch entries.
B. A2DC as part of technology recycling
Given the fact that A2DC was not developed as a fully-
fledged machine translation system but as a lexical convertor to
aid technology transfer, we used it as part of a process of
technology recycling. In this experiment we investigated the
efficiency of A2DC as a pre-processing step for the recycling
of a Dutch part of speech tagger. To evaluate this, Tadpole, a
morphosyntactic tagger and parser for Dutch [5], was used to
tag an Afrikaans translation of the METIS II test data as a first
phase in the experiment. In the second phase, the same
Afrikaans data was converted with A2DC before being tagged
with Tadpole. The METIS II test data consists of 200 sentences
from the Parole Corpus (see http://korpus.dsl.dk/e-
resurser/parole-korpus.php). The Afrikaans translation was
done by a professional translator and the resulting Afrikaans
data set consists of 1 963 Afrikaans words.
Tadpole uses the CGN tagset [6], which consists of 12 part
of speech categories, each of which is further divided using
various tag specifications, resulting in a very large tagset
consisting of more than 300 different tags. Since many of the
tag specifications explicated in the CGN tagset are not
applicable to Afrikaans (e.g. gender specification of common
nouns), and given the fact that this evaluation is done on a
small test set (200 sentences), only the twelve main tag
categories were used in this experiment. Besides POS tags, the
Tadpole output also contains lemmas and morphological
analyses of annotated words, but those features were ignored
for the purpose of this experiment. The two phases of this
experiment will be discussed in the following sections.
IV. R
ESULTS
A. Word-level evaluation
For the word-level evaluation, 500 words were randomly
extracted from the 5,000 most frequent words in the Spoken
Dutch Corpus [7]
2
. Capitalised words (e.g. proper names and
acronyms), abbreviations, and interjections were removed and
2 This is the same dataset that was used for the word-level evaluation of D2AC in [2].
replaced with other randomly-selected words from the CGN
frequency list. The resulting wordlist was manually translated
to Afrikaans. Where more than one Afrikaans translation
alternative existed, the most probable translation alternative
was selected, resulting in a list of 500 Afrikaans words.
The resulting Afrikaans list was then translated back to
Dutch, taking care to ensure that all possible Dutch translations
were added. A2DC was then used to translate the 500
Afrikaans words and the output was compared to the manual
Dutch translations. The results of this evaluation are given in
Table 3.
Table 3: Results of word-level evaluation
# tags
assigned
# tags
correct
<Lex.LangIn-LangOut> 92 92
<Lex.LangOut> 199 199
<Translated> 77 71
<Untranslated> 132 0
TOTAL 500 362
A2DC was able to provide at least one Dutch translation
alternative for 72.4% (i.e. 362 words) of the words in the
evaluation test set. Although the conversion modules only
translated 77 words, 71 of the attempted translations were
correct. Incorrect translation attempts include Afr. uitgawe
‘expense’ translated to Du. uitgaven ‘expenses’. This means
that the conversion modules have a conversion precision of
92.21%. The high precision is, once again, due to the fact that
translated words are looked up in Lex.LangOut.txt to ensure
that the word resulting from conversion is a valid Dutch word.
Table 4: Error analysis of word-level evaluation
Cause # of
untranslated
words
% of
untranslated
words
Not in Lex.LangOut 5 3.79%
Not in Lex.LangIn-LangOut 18 13.64%
Not in Rules 20 15.15%
Past Tense Verbs 21 15.91%
Rule-ordering 68 51.51%
TOTAL 132 100
An analysis of the untranslated words shows that four
factors are responsible for the bulk of the words not being
converted (see Table 4). The first of these is the coverage of
Lex.LangIn-LangOut.txt, since almost 14% of the words left
untranslated are non-cognates or false friends and therefore
should have been included in this lexicon. The Afr. dieselfde,
for example, should be translated to Du. hetzelfde, but was not
included in Lex.LangIn-LangOut.txt.
The second problem is the comprehensiveness of the rules
included in MorphRules.txt and G2GRules.txt. More than 15%
of the words left untranslated are instances that could have
been handled by rules, and which should have been included in
the system. So, for instance, Afr. militêre ‘military’ should
have been translated to Du. militaire, but A2DC only contains
a rule that converts êr at the end of a word to air, and the
convertor is therefore unable to handle –êre correctly.
The conjugation of verbs is one major difference between
Afrikaans and Dutch, since Afrikaans has a much simpler verb
morphology. Therefore it is not surprising that the third
hampering factor concerns the fact that A2DC is not able to
handle the translation of Afrikaans past tense verbs effectively
(e.g. Afr. geken ‘knew’ which should have been translated to
Du. kende/kenden/gekend). It would, however, only be possible
to successfully convert Afrikaans verbs to Dutch if some
contextual syntactic information could be taken into
consideration, but syntactic conversion currently falls without
the scope of this research project.
The bulk of the untranslated words (51.51%) should have
been converted by rules currently included in the system. So,
for example, Afr. beset ‘occupied’ should have been translated
to Du. bezet by the same rule that translated Afr. versoek
‘request’ into Du. verzoek, but beset was left untranslated while
versoek was translated correctly. This is due to the ordering of
the rules, and optimising A2DC would therefore entail a
thorough reconsideration of the rule-ordering. Since the focus
of our current research is on expediting the development of
resources, we decided to not spend much more time on such
optimization, in order to determine the efficiency of technology
transfer even with less-than-perfect technologies.
B. Sentence level evaluation
Despite the fact that our current research is not aimed at
providing a fully-fledged Afrikaans-to-Dutch machine
translation system, we did an experiment to get an impression
of how A2DC compares to another available solution, the
Afrikaans-Dutch GT, when translating sentences. To do this
comparative evaluation, we used the development test set that
was used to evaluate the Dutch-English machine translation
system in the METIS II project [8]. The Dutch METIS II
sentences were translated to Afrikaans, and these Afrikaans
sentences were sent to different translators to prepare reference
translations with which the BLEU scores [9] could be
calculated. The results for A2DC and GT are shown in Table 5.
Table 5: Results of sentence level evaluation
A2DC GT
% of 1-gram matches 53.54 73.69
% of 2-gram matches 22.04 49.75
% of 3-gram matches 10.63 36.95
% of 4-gram matches 5.29 28.15
BLEU 0.16 0.44
In this evaluation GT significantly outperforms A2DC as
was expected, since A2DC is not able to handle any syntactic
differences between Afrikaans and Dutch. This inability of
A2DC is particularly apparent when one considers the
percentage of n-gram matches shown in Table 5. While GT has
a 4-gram match of almost 28%, only 5.29% of 4-grams in the
A2DC output also occurred in at least one of the reference
translations. The fact that A2DC only obtains a BLEU score of
0.16, while D2AC scored 0.22 in a similar experiment is also
not surprising, given that Dutch has a higher level of
morphological complexity than Afrikaans.
A human assessment of the translations shows that, apart
from the fact that A2DC leaves a large number of words (many
of which are past tense forms of verbs) untranslated, it is also
unable to handle syntactic issues like the Afrikaans double
negation. Surprisingly, GT can also not handle the Afrikaans
negation construction correctly and consistently retains the
second negation particle (the second nie in negated Afrikaans
sentences) after translating it to Du. niet. However, given the
fact that A2DC was not developed as a fully-fledged machine
translation system but as a lexical convertor to aid technology
recycling, this evaluation does not give a good indication of the
efficiency of the system. Therefore we also evaluated A2DC as
part of a process of technology recycling, by using a Dutch part
of speech (POS) tagger to annotate Afrikaans data. This
evaluation experiment is described in the following section.
Table 6: Results per POS category (raw and converted Afrikaans data)
Table 7:
Confusion matrix (raw Afrikaans data)
N ADJ V NUM PRON ART PREP CONJ ADV INTERJ SPEC
N 17.42 1.27 4.02 0.31 1.32 2.75 1.63 0.71 1.68 0 0.82
ADJ
0.97 6.16
1.78 0 0.10 0 0.31 0 0.76 0 0.10
V
1.27 0.36 12.12
0 0.05 0.05 0.05 0.05 0.05 0 0.05
NUM
0 0 0 1.32
0 0 0 0 0 0 0
PRON
0 0.10 0.66 0.20 5.40
9.16 0 0.36 0.10 0 0
ART
0 0 0.82 0 0 0.15
0 0 0 0 0
PREP
0 0 0 0 0 0 10.29
0 0 0 0
CONJ
0 0 0 0 0.92 0 0 1.73
0 0 0
ADV
0 0.10 0.05 0.05 0.66 0 0.05 0 5.20
0 0
INTERJ
0 0 0 0 0 0 0 0 0 0
0
SPEC
0.71 0.41 0.41 0 1.27 0.05 0.36 0.10 0.36 0 2.80
Table 8: Confusion matrix (converted Afrikaans data)
N ADJ V NUM PRON ART PREP CONJ ADV INTERJ SPEC
N 18.54 1.02 3.36 0.10 0.10 0 0.05 0 1.27 0 3.06
ADJ
0.31 6.52
2.09 0 0 0 0.05 0 0.82 0 0.36
V
0.92 0.46 12.38
0 0 0 0 0 0.10 0 0.10
NUM
0.05 0 0 1.43
0 0 0 0 0 0 0
PRON
0 0.10 1.07 0.25 8.61
0.05 0 0.10 0.10 0 0
ART
0 0 0.46 0 0.20 12.07
0 0 0 0 0
PREP
0 0 0 0 0 0 12.53
0.10 0 0 0
CONJ
0 0 0.10 0 0 0 0 2.55
0 0 0
ADV
0.05 0.15 0.20 0.10 0.87 0 0.05 0.20 5.70
0 0
INTERJ
0 0 0 0 0 0 0 0 0 0 0
SPEC
0.51 0.15 0.20 0 0 0 0 0 0.15 0 0.25
3 In tables 6 – 8 and in Section C the following ta g abbreviations are used: noun (N), adjective (ADJ), verb (V), numeral (NUM), pronoun (PRON ), article (ART), preposition (PREP),
conjunction (CONJ), adverb (ADV), interje ction (INTERJ) and special tokens (SPEC).
Results on raw
Afrikaans data
Results on converted
Afrikaans data
Precision Recall f-score Precision Recall f-
score
N
0.54 0.86 0.67 0.67 0.91 0.77
ADJ 0.61 0.73 0.66 0.64 0.78 0.7
V 0.86 0.61 0.71 0.89 0.62 0.73
NUM 1 0.79 0.88 0.97 0.76 0.86
PRON 0.34 0.55 0.42 0.84 0.88 0.86
ART 0.16 0.01 0.02 0.95 1 0.97
PREP 1 0.81 0.9 0.99 0.99 0.99
CONJ 0.65 0.59 0.62 0.96 0.86 0.91
ADV 0.64 0.85 0.73 0.78 0.7 0.74
INTERJ 0 0 0 0 0 0
SPEC 0.43 0.74 0.54 0.2 0.07 0.1
C. A2DC as part of technology recycling
1) Tagging raw Afrikaans data with a Dutch POS
tagger
In the first phase of the experiment, the complete Afrikaans
test set was tagged with Tadpole. Tag specifications, lemmas
and morphological analyses were removed from the Tadpole
output, and clustered words were manually separated (e.g.
onder_andere VZ_ADJ were separated into onder VZ andere
ADJ). The POS annotations provided by Tadpole were then
manually checked and corrected to create a Gold Standard. The
Tadpole output was compared to the Gold Standard and
accuracy, recall, precision and f-scores were calculated for each
POS category. These results are given in Table 6 while
Table 7 shows a confusion matrix with relative values for
the POS categories
4
. For the calculations, punctuation was not
included.
The all-over accuracy of the Tadpole output on unconverted
Afrikaans data is 62.6%. Most POS categories have a rather
low f-score (see Table 6), except for NUM and PREP, which
have f-scores of 0.88 and 0.90 respectively. The confusion
matrix (Table 7) also indicates that instances of NUM and
PRON were very seldom mistagged. The good results for these
categories are due to the fact that many Afrikaans and Dutch
numerals are identical in form (e.g. eerste ‘first’, twintig
‘twenty’, op ‘on’, onder ‘below’, etc.). In contrast, ART have a
very low f-score (0.02), because the Afrikaans definite article
die ‘the’ is an unambiguous pronoun in Dutch. The very
frequently occurring Afrikaans die is therefore consistently
tagged as a pronoun, resulting in the high ART-PRON
confusion (see
Table 7). N, ADJ, V, PRON, CONJ, ADV and SPEC have
relatively average f-scores, ranging from 0.42 to 0.73.
Tadpole uses N as default tag and therefore N is the tag that
is assigned to all instances that Tadpole is unable to
disambiguate correctly. This leads to a relatively high rate of N
confusion for all categories (see Table 7). This is especially
true in the case of verbs. Because Afrikaans verbs are not
conjugated in the same way as Dutch verbs, many verbs are not
in the correct form, given the context in which they occur. This
makes it impossible for Tadpole to determine that a word
should be assigned the tag V and Tadpole then assumes that the
word is a noun and should be tagged with N. This results in a
high V-N confusion.
A manual assessment of the data showed that, apart from
the definite article, Tadpole also persistently tags the Afrikaans
verb het ‘have’ (Du. hebben) incorrectly. This is because het
can only be a definite article or a pronoun (but never a verb) in
Dutch. Furthermore, Tadpole erroneously tags the Afrikaans
first person personal pronoun ek ‘I’ (Du. ik) as a conjunction
and the Afrikaans indefinite article ʼn ‘a’ (Du. een) as a noun.
The fact that function words, such as PRON, ART, PREP and
CONJ are often erroneously tagged is problematic, since these
classes are used as anchors for the rest of the tagging task.
4
Rows indicate the Tadpole output; columns refer to the true classes.
These categories are closed classes and therefore Tadpole
assigns the tags for these classes with a high level of certainty.
They then serve as contextual information to assign the tags of
open classes (N, V, ADJ and ADV). The erroneous tagging of
these words leads to incorrect contextual information which, in
turn, has a negative effect on the accuracy of the tagger on
open classes.
2) Tagging converted Afrikaans data with a Dutch POS
tagger
The Afrikaans translation of the METIS II test set was once
again used as data in this second phase of the experiment. This
time, instead of sending raw Afrikaans data through Tadpole
(as was done in the first phase described in the previous
section), we first converted the Afrikaans test set with A2DC to
make the data more “Dutch-like”. The basic idea here is to
eliminate errors such as those described above by first
converting the Afrikaans text to Dutch. Even though this
conversion will not yield a good Dutch translation, we
hypothesise that a Dutch POS tagger will benefit from the
conversion, since frequently occurring false friends and non-
cognates could be handled better (or even correctly) after they
had been converted to Dutch. After conversion the A2DC
output was tagged with Tadpole, and the Tadpole output was
once again compared to the Gold Standard. Precision, recall
and f-scores are shown in Table 6, while Table 8 shows the
confusion matrix.
The accuracy for the A2DC Tadpole output is 80.6%,
which is a considerable improvement over the 62.6% obtained
in the first phase of this experiment, and which clearly
illustrates the value of using conversion as a pre-processing
step for technology recycling. The confusion matrix shows that
the accuracy for all classes, except for SPEC, increased. The f-
scores (see Table 6) for PRON and especially for ART are
much higher than in the experiment with the raw Afrikaans
data. A2DC successfully converted Afrikaans articles into their
Dutch equivalents (i.e. A2DC translated Afr. ʼn to Du. een ‘a’
and Afr. die to Du. de or het ‘the’), causing the f-score for ART
to increase from 0.02 to 0.97.
Other POS categories that show an evident f-score
improvement are PRON (from 0.42 to 0.86), CONJ (from 0.62
to 0.91) and PREP (from 0.9 to 0.99). In the case of N, ADJ,
ADV and V a slight gain is noticeable. The score for numerals
is almost the same as in the first part of the experiment. The
only category that shows a decrease in f-score (from 0.54 to
0.10) is the category of special tokens. The words that receive
tag SPEC are mainly parts of multiword proper nouns, e.g.
Afr. and Du. Europese SPEC Unie SPEC ‘European SPEC
Union SPEC’. Many such instances in the converted data are
tagged incorrectly, since A2DC changes all words into lower
case, which makes it hard for Tadpole to recognize these words
as proper nouns. Given the lower case version, Tadpole tags
instances like these as europese ADJ unie N.
Table 8 reiterates the improvements over all categories
except SPEC. It is apparent though that a large number of verbs
are still erroneously tagged as nouns. This is because A2DC is
unable to convert verbs into the correct form, and therefore
these verbs are still assigned the default tag, N. Other
categories are consistently confused with N less often and ART
and CONJ are never mistakenly tagged as N after conversion
with A2DC. This is because articles and conjunctions are
closed classes which are correctly converted by the bilingual
lexicon included in A2DC. The fact that closed classes are
assigned the correct tag more often after conversion, improves
the tagger’s ability to correctly assign tags to the open classes.
The fact that it was possible to develop an Afrikaans POS
tagger achieving an accuracy of more than 80% using the
technology recycling approach, is especially noteworthy when
it is compared to the Afrikaans POS tagger development
described in [10]. In this project an Afrikaans POS tagger was
developed using the TnT tagger 0. Given the fact that no
annotated Afrikaans data was available at the time, the
development of the POS tagger necessitated the manual
annotation of data to train the TnT algorithm. The Afrikaans
TnT tagger, using 13 tags, was only able to achieve an
accuracy of more than 80% after the manual annotation of
1 999 words. The use of A2DC and technology recycling could
therefore have expedited the development described in [10],
since it would not have been necessary to manually tag the first
batch of training data. Of course, to have developed A2DC also
required some effort and time, but now that a conversion
module is available in the open source domain
(sourceforge.net/projects/d2ac-a2dc), it could benefit the
development of technologies for Afrikaans that already exist
for Dutch (e.g. semantic parsers, chunkers, etc.).
V. C
ONCLUSION AND
F
UTURE WORK
In this article we described the development and
performance of an A2DC, which achieved an accuracy of
72.4% in a word-level evaluation and a BLEU score of 0.16 in
a sentence level evaluation. A2DC was then used as part of a
technology recycling process and the use of the convertor
improved part of speech tagging accuracy by 18% (from 62.6%
without conversion to 80.6% when using conversion). In the
evaluation experiments we found that handling of especially
verb conjugations, false friends and non-cognates require close
attention. Most of these issues could be addressed by refining
and/or extending the bilingual translation list (Lex.LangIn-
LangOut.txt) (semi-) automatically – for instance by using the
CELEX database to complete verb paradigms and plural forms
of nouns, and by using semi-automatic processes to extend the
list of false friends. In addition, some more attention should be
paid to rule-ordering addition, as well as iteration of rules and
the effect thereof on the greediness of the conversion modules.
Automatic rule-induction should also be investigated.
An important conclusion of this paper is that rule-based
conversion could play a major role in expediting technology
development for resource-scarce languages, specifically when
technology recycling is used. Further research needs to be done
in order to determine whether this observation holds true for
tasks other than part of speech tagging. Work in the near future
will therefore include using A2DC in technology recycling
tasks involving other resources, such as an Afrikaans
lemmatiser, chunker and morphological analyser. The approach
used here could, in principle, also be extended to other
resource-scarce languages. Given the fact that all the
indigenous South African languages are considered resource
scarce, future work will include the use of the approach
described in this paper as a way in which technology
development for these languages can be expedited.
VI. A
CKNOWLEDGEMENTS
Part of this research was made possible through a research
grant by the South African National Research Foundation
(FA207041600015). We would like to extend our gratitude to
the following people who were involved in various aspects of
the project: Kirsten Arnauts, Veronique de Gres, Shanna
Pettens, Martin Puttkammer, Carla-Mari van den Heever, and
Daan Wissing. All fallacies remain ours.
R
EFERENCES
[1] Rayner, M., Carter, D., Bretan, I., Eklund, R., Wiren, M.,
Hansen, S.L., Kirchmeier-Andersen, S., Philp, C., Sorensen, F.,
and Thomsen, H.E., “Recycling lingware in a multilingual MT
system”, In Burstein, J., and Leacock, C. (eds.), From research
to commercial applications: making NLP work in practice,
ACL, Somerset, 1997, pp 65-70.
[2] Van Huyssteen, G.B. & Pilon, S., “Rule-based conversion of
closely-related languages: A Dutch to Afrikaans convertor”,
Proc. of the 20th Annual Symposium of the PRASA,
Stellenbosch, South Africa, 2009, pp 23-28.
[3] Van Huyssteen, G.B. & Pilon, S, “A Dutch-to-Afrikaans
convertor”, Poster presented at the 20
th
CLIN Meeting, Utrecht,
The Netherlands, 2010.
[4] Dutch Language Union (Nederlandse Taalunie, NTU). 2006. E-
Lex Version 1.1. Institute for Dutch Lexicology, Leiden, The
Netherlands.
[5] Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius,
S., “An efficient memory-based morphosyntactic tagger and
parser for Dutch”, F. van Eynde, P. Dirix, I. Schuurman, and V.
Vandeghinste (eds.), Selected Papers of the 17th CLIN Meeting,
Leuven, Belgium, 2007, pp 99-114.
[6] Van Eynde, F., Zavrel, J. & Daelemans, W., “Part of speech
tagging and lemmatisation for the Spoken Dutch Corpus”, In M.
Gavrilidou et al. (eds.), Proc. of the 2
nd
Intern. LREC, Athens,
Greece, 2010, pp 1427-1433.
[7] Nederlandse Taalunie, Corpus gesproken Nederlands 1.0, TST-
Central, Leiden, 2004, [Web:]
http://www.tst.inl.nl/cgndocs/doc_English/topics/index.htm
[Accessed on 2009/09/26].
[8] Vandeghinste, V., Dirix, P., Schuurman, I., Markantonatou, S.,
Sofianopoulos, S., Vassiliou, M., Yannoutsou, O., Badia, T.,
Melero, M., Boleda, G., Carl, M., and Schmidt, P., “Evaluation
of a machine translation system for low resource languages:
METIS-II”, Proc. of the 6
th
intern. LREC, Marrakech, 2008, pp
449-456.
[9] Papineni, K., Roukos, S., Ward, T., and Zhu, W.J., “BLEU: a
method for automatic evaluation of machine translation”, Proc.
of the 40
th
an. meeting of the ACL, Philadelphia, 2002, pp 311-
318.
[10] Pilon, S. 2005. Outomatiese Afrikaanse woordsoortetikettering.
North-West University (Potchefstroom Campus).
Potchefstroom. (Dissertation – MA).
Brants, T., “TnT – A statistical part-of-speech tagger”, Proc.of the
Sixth Applied Natural Language Processing Conference, 2000,
[Web:] http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf [Date
accessed: 2004-03-03]
... By adapting existing technologies for closely related languages, the development of resources for resource-scarce languages can be fast-tracked. This process is known as technology recycling [1]. Given a technology, created for a well sourced language L1, which is needed in another language L2 which is resource-scarce, it would be faster and cheaper to adapt the L1 technology for L2 than to redevelop the L2 technology from the ground up [1]. ...
... This process is known as technology recycling [1]. Given a technology, created for a well sourced language L1, which is needed in another language L2 which is resource-scarce, it would be faster and cheaper to adapt the L1 technology for L2 than to redevelop the L2 technology from the ground up [1]. ...
... We evaluate a genre classification system when classifying a strange language and then implement approaches to enhance its performance. Dutch and Afrikaans have been used successfully in technology recycling experiments because these two languages are similar enough [1] [5] and as a result thereof, Dutch and Afrikaans will be used as the languages in question for this article. ...
... Volgens Pilon et al. [39] sou dit goedkoper en vinniger wees om bestaande tegnologieë van ʼn nabyverwante taal te hergebruik om die ontwikkeling vir ʼn hulpbronskaars taal te bespoedig, eerder as om hierdie tegnologieë van voor af te ontwikkel. Die bestaande tegnologieë kan hergebruik word deur verskeie aanpassings te maak aan die taal waarvoor die tegnologieë nog nie bestaan nie, as die taal waarin ʼn tegnologie reeds bestaan, nabyverwant genoeg is [39]. ...
... Volgens Pilon et al. [39] sou dit goedkoper en vinniger wees om bestaande tegnologieë van ʼn nabyverwante taal te hergebruik om die ontwikkeling vir ʼn hulpbronskaars taal te bespoedig, eerder as om hierdie tegnologieë van voor af te ontwikkel. Die bestaande tegnologieë kan hergebruik word deur verskeie aanpassings te maak aan die taal waarvoor die tegnologieë nog nie bestaan nie, as die taal waarin ʼn tegnologie reeds bestaan, nabyverwant genoeg is [39]. Die taal waarvoor die tegnologie herwin word, word dus verander sodat dit meer soos die nabyverwante taal "lyk" deur metodes soos sintaktiese herordening, aanpassing van morfeme, of volledige of gedeeltelike masjienvertaling [39]. ...
... Die bestaande tegnologieë kan hergebruik word deur verskeie aanpassings te maak aan die taal waarvoor die tegnologieë nog nie bestaan nie, as die taal waarin ʼn tegnologie reeds bestaan, nabyverwant genoeg is [39]. Die taal waarvoor die tegnologie herwin word, word dus verander sodat dit meer soos die nabyverwante taal "lyk" deur metodes soos sintaktiese herordening, aanpassing van morfeme, of volledige of gedeeltelike masjienvertaling [39]. In die konteks van die Suid-Afrikaanse hulpbronskaars tale, sou hierdie benadering moontlik van waarde kon wees in gevalle waar daar bestaande hulpbronne vir een taal in ʼn taalfamilie is, maar nie vir die ander nie, of waar die een taal oor meer hulpbronne beskik as die ander en ontwikkeling dan eers vir dié taal gedoen word, met die idee om oordrag na die ander tale te vergemaklik. ...
... Uit vorige studies (Pilon et al., 2010;Van Huyssteen & Pilon, 2009) ...
... Tegnologieherwinning is nog nooit gebruik in die ontwikkeling van ʼn Afrikaanse BEH nie, maar gegewe die goeie resultate wat reeds deur tegnologieherwinning verkry is, vir ander kerntegnologieë (Pilon et al., 2010) ...
... In ʼn ander studie deur Black en Vasilakopoulos (2002) Die elf woordsoort-kategorieë in Tabel5 word soos volg beskryf (Pilon et al., 2010): ...
Article
Full-text available
Named entity recognition is a core technology, widely used in applications such as question-and-answer systems, automated text summarisation systems, information retrieval, machine translation, bioinformatics and search engines. Owing to the resources needed, the labour-intensive methods required and the difficulty of developing a named entity recogniser (NER), the focus has mainly been on developing NERs for resource-rich languages such as Dutch, Swedish and Spanish, while neglecting the development for related resource-scarce languages such as Afrikaans, Norwegian and Portuguese. The lack of NERs limits access to information and also lowers the general development thrust of resources for such languages. This article presents a time- and cost-efficient method based on the principle of technology recycling (or technology transfer) that can be used to develop NERs for suitable resource-scarce languages; its application to Afrikaans is used here as a case study. Six experiments are described, which differ only by the pre-processing methods (A2DC and gazetteers) and the language of the input data used. The final experiment yielded an f-score of 0.72 for the identification and an f-score of 0.65 for the classification of named entities. This study provides evidence for the usefulness of technology recycling in developing an NER for Afrikaans and potentially for other languages that have so far been neglected.
... With regard to lexical similarity, approximately 90 to 95% of Afrikaans vocabulary originated in Dutch dialects (Mesthrie, 2002;Kamwangamalu, 2004), while other words in Afrikaans are originally borrowed from other languages such as Bantu, Khoisan, Malay and Portuguese (Sebba, 1997;Niesler et al., 2005). Many of the words in Afrikaans from Dutch origin are not graphologically identical to their original cognates anymore, which means that it is impossible to blindly use Dutch technologies to process Afrikaans text (Pilon et al., 2010). One therefore needs to distinguish between identical cognates (i.e., etymologically-related words from two different languages), non-identical cognates, false friends, and noncognates when dealing with these two languages. ...
Conference Paper
Full-text available
In most languages, new words can be created through the process of compounding, which combines two or more words into a new lexical unit. Whereas in languages such as English the components that make up a compound are separated by a space, in languages such as Finnish, German, Afrikaans and Dutch these components are concatenated into one word. Compounding is very productive and leads to practical problems in developing machine translators and spelling checkers, as newly formed compounds cannot be found in existing lexicons. The Automatic Compound Processing (AuCoPro) project deals with the analysis of compounds in two closely-related languages, Afrikaans and Dutch. In this paper, we present the development and evaluation of two datasets, one for each language, that contain compound words with annotated compound boundaries. Such datasets can be used to train classifiers to identify the compound components in novel compounds. We describe the process of annotation and provide an overview of the annotation guidelines as well as global properties of the datasets. The inter-rater agreements between the annotators are considered highly reliable. Furthermore, we show the usability of these datasets by building an initial automatic compound boundary detection system, which assigns compound boundaries with approximately 90% accuracy.
... In another part of the project, we have demonstrated that this approach yields favourable results for the development of part-of-speech (POS) annotated data for Afrikaans, using a Dutch POS tagger [15]. Along the same lines, the development of a syntactic parser and chunker for Afrikaans is currently ongoing. ...
Conference Paper
Full-text available
Afrikaans is one of the eleven official languages of South Africa. It is classified as an under-resourced language. No annotated broadband speech corpora currently exist for Afrikaans. This article reports on the development of speech resources for Afrikaans, specifically a broadband speech corpus and an extended pronunciation dictionary. Baseline results for an ASR system that was built using these resources are also presented. In addition, the article suggests different strategies to exploit the close relationship between Afrikaans and Dutch for the purposes of technology development.
Article
Full-text available
We describe TADPOLE, a modular memory-based morphosyntactic tagger and dependency parser for Dutch. Though primarily aimed at being accurate, the design of the system is also driven by optimizing speed and memory usage, using a trie-based approximation of k-nearest neighbor classification as the basis of each module . We perform an evaluation of its three main modules: a part-of-speech tagger, a morphological analyzer, and a depen- dency parser, trained on manually annotated material available for Dutch - the parser is additionally trained on automatically parsed data. A global analysis of the system shows that it is able to process text in linear time close to an estim ated 2,500 words per second, while maintaining sufficient accuracy.
Conference Paper
Full-text available
In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required and in which no full parser or extensive rule sets are needed. We describe the evaluation on a development test set and on a test set taken from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, namely researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.
Article
Full-text available
20th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). Stellenbosch, South Africa, 30 November - 01 December 2009 For fast-tracking the development of resources for resource-scarce languages, one could transfer existing technologies from one language to another well-sourced, closely-related language. In this contribution, the authors describe the development and performance of a rule-based Dutch-to-Afrikaans converter, with the aim to transform Dutch text so that it looks more like an Afrikaans text (even though it might not even be a good Dutch translation). The rules we used is based on systematic orthographic, morphosyntactic and lexical differences between the two languages. The authors report on an accuracy of 71% on word-level, after minor optimisation with regard to iteration of rules. In a small-scale evaluation on running text, the authors obtain a BLEU score of 0.2519. They conclude that such a rule-based approach to conversion of closely-related languages holds much promise, with potential application in technology transfer (or even machine translation) between such languages.
Article
Full-text available
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused.
Article
Full-text available
This paper describes the lemmatisation and tagging guidelines developed for the "Spoken Dutch Corpus", and lays out the philosophy behind the high granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high quality implementation of a Hidden Markov Model tagger generator. 1. Introduction The Dutch-Flemish project "Corpus Gesproken Nederlands " (1998-2003) aims at the collection, transcription and annotation of ten million words of spoken Dutch (Oostdijk, 2000). The first layer of linguistic annotation concerns the assignment of base forms and morphosyntactic tags to each of those ten million words. The first part of this paper presents the lemmatisation guidelines and the tagset which have been devised for thi...
Article
Full-text available
We describe two methods relevant to multi-lingual machine translation systems, which can be used to port linguistic data (grammars, lexicons and transfer rules) between systems used for processing related languages. The methods are fully implemented within the Spoken Language Translator system, and were used to create versions of the system for two new language pairs using only a month of expert effort.
Article
Thesis (M.A.)--North-West University, Potchefstroom Campus, 2005. Includes bibliographical references (leaves [134]-142)
Article
Trigrams'n'Tags (TnT) is an efficient statistical part-of-speech tagger. Contrary to claims found elsewhere in the literature, we argue that a tagger based on Markov models performs at least as well as other current approaches, including the Maximum Entropy framework. A recent comparison has even shown that TnT performs significantly better for the tested corpora. We describe the basic model of TnT, the techniques used for smoothing and for handling unknown words. Furthermore, we present evaluations on two corpora.
Corpus gesproken Nederlands 1.0, TST- Central
  • Nederlandse Taalunie
Nederlandse Taalunie, Corpus gesproken Nederlands 1.0, TST- Central, Leiden, 2004, [Web:] http://www.tst.inl.nl/cgndocs/doc_English/topics/index.htm [Accessed on 2009/09/26].