A Factory of Comparable Corpora from Wikipedia

Alberto Barrón-Cedeño¹, Cristina España-Bonet², Josu Boldoba² and Lluís Màrquez¹
¹Qatar Computing Research Institute, HBKU, Doha, Qatar
²TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain
{albarron,lmarquez}@qf.org.qa
cristinae@cs.upc.edu jboldoba08@gmail.com
Abstract

Multiple approaches to grab comparable data from the Web have been developed to date. Nevertheless, coming up with a high-quality comparable corpus on a specific topic is not straightforward. We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. In order to prove the value of the model, we automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English–Spanish pair in the domains of Computer Science, Science, and Sports show that our in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles. Moreover, we show that these corpora can help when translating out-of-domain texts.
1 Introduction

Multilingual corpora with different levels of comparability are useful for a range of natural language processing (NLP) tasks. Comparable corpora were first used for extracting parallel lexicons (Rapp, 1995; Fung, 1995). Later they were used for feeding statistical machine translation (SMT) systems (Uszkoreit et al., 2010) and in multilingual retrieval models (Schönhofen et al., 2007; Potthast et al., 2008). SMT systems estimate their statistical models from bilingual texts (Koehn, 2010). Since only the words that appear in the corpus can be translated, having a corpus of the right domain is important for high coverage. However, it is evident that no large collections of parallel texts exist for all domains and language pairs. In some cases, only general-domain parallel corpora are available; in others there are no parallel resources at all.

One of the main sources of parallel data is the Web: websites in multiple languages are crawled and their contents retrieved to obtain multilingual data. Wikipedia, an on-line community-curated encyclopædia with editions in multiple languages, has been used as a source of data for these purposes — for instance (Adafre and de Rijke, 2006; Potthast et al., 2008; Otero and López, 2010; Plamada and Volk, 2012). Due to its encyclopædic nature, editors aim at organising its content within a dense taxonomy of categories.¹ Such a taxonomy can be exploited to extract comparable and parallel corpora on specific topics and knowledge domains. This makes it possible to study how different topics are analysed in different languages, to extract multilingual lexicons, or to train specialised machine translation systems, to mention just a few applications. Nevertheless, the process is not straightforward. The community-generated nature of Wikipedia has produced a reasonably good —yet chaotic— taxonomy in which categories are linked to each other at will, even if sometimes no relationship exists among them, and the borders dividing different areas are far from clearly defined.

The rest of the paper is organised as follows. We briefly overview the definitions of comparability levels in the literature and show the difficulties inherent to extracting comparable corpora from Wikipedia (Section 2). We propose a simple and effective platform for the extraction of comparable corpora from Wikipedia (Section 3). We describe a simple model for the extraction of parallel sentences from comparable corpora (Section 4). Experimental results are reported on each of these sub-tasks for three domains using the English and Spanish Wikipedia editions. We present an application-oriented evaluation of the comparable corpora by studying the impact of the extracted parallel sentences on a statistical machine translation system (Section 5). Finally, we draw conclusions and outline ongoing work (Section 6).

¹http://en.wikipedia.org/wiki/Help:Category
2 Background

Comparability in multilingual corpora is a fuzzy concept that has received alternative definitions without reaching an overall consensus (Rapp, 1995; Eagles Document Eag–Tcwg–Ctyp, 1996; Fung, 1998; Fung and Cheung, 2004; Wu and Fung, 2005; McEnery and Xiao, 2007; Sharoff et al., 2013). Ideally, a comparable corpus should contain texts in multiple languages which are similar in terms of form and content. Regarding content, they should observe similar structure, function, and a long list of characteristics: register, field, tenor, mode, time, and dialect (Maia, 2003).

Nevertheless, finding these characteristics in real-life data collections is virtually impossible. Therefore, we adhere to the following simpler four-class classification (Skadiņa et al., 2010): (i) Parallel texts are true and accurate translations or approximate translations with minor language-specific variations. (ii) Strongly comparable texts are closely related texts reporting the same event or describing the same subject. (iii) Weakly comparable texts include texts in the same narrow subject domain and genre, but describing different events, as well as texts within the same broader domain and genre, but varying in sub-domains and specific genres. (iv) Non-comparable texts are pairs of texts drawn at random from a pair of very large collections of texts in two or more languages.

Wikipedia is a particularly suitable source of multilingual text with different levels of comparability, given that it covers a large number of languages and topics.² Articles can be connected via interlanguage links (i.e., a link from a page in one Wikipedia language to an equivalent page in another language). Although there are some missing links and an article can be linked by two or more articles from the same language (Hecht and Gergle, 2010), the number of available links makes it possible to exploit the multilinguality of Wikipedia.

Still, extracting a comparable corpus on a specific domain from Wikipedia is not so straightforward. One can take advantage of the user-generated categories associated to most articles. Ideally, the categories and sub-categories would compose a hierarchically organised taxonomy, e.g., in the form of a category tree.

²Wikipedia contains 288 language editions, out of which 277 are active and 12 have more than 1M articles at the time of writing, June 2015 (http://en.wikipedia.org/wiki/List_of_Wikipedias).
[Figure 1: category graph whose nodes include Sport, Sports, Mountain sports, Mountaineering, Mountains, Mountains of Andorra, Pyrenees, Mountains by country, Mountain ranges of Spain, Mountains of the Pyrenees, Science, Scientific disciplines, Natural sciences, Earth sciences, Geology, Geology by country, and Geology of Spain.]

Figure 1: Slice of the Spanish Wikipedia category graph (as in May 2015) departing from the categories Sport and Science. Translated for clarity.
Nevertheless, the categories in Wikipedia compose a densely-connected graph with highly overlapping categories, cycles, etc. As they are manually crafted, the categories are somewhat arbitrary and, among other consequences, the resulting categorisation of articles does not fulfil the properties required for a sufficiently trustworthy categorisation of articles into different domains. Moreover, many articles are not associated to the categories they should belong to, and there is a phenomenon of over-categorisation.³

Figure 1 is an example of the complexity of Wikipedia's category graph topology. Although this particular example comes from the Spanish Wikipedia, similar phenomena exist in other editions. Firstly, the paths departing from two apparently unrelated categories —Sport and Science— soon converge in a common node (Pyrenees). As a result, not only Pyrenees but also all of its descendants could be considered sub-categories of both Sport and Science. Secondly, cycles exist among the categories, as in the sequence Mountains of Andorra → Pyrenees → Mountains of the Pyrenees → Mountains of Andorra. Ideally, every sub-category of a category should share the same attributes, since the "failure to observe this principle reduces the predictability [of the taxonomy] and can lead to cross-classification" (Rowley and Hartley, 2000, p. 196). Although fixing this issue —inherent to all Wikipedia editions— falls out of the scope of our research, some heuristic strategies are necessary to diminish its impact on the domain definition process.

³This is a phenomenon especially stressed in Wikipedia itself: http://en.wikipedia.org/wiki/Wikipedia:Overcategorization.
Plamada and Volk (2012) sidestep this issue by extracting a domain comparable corpus using IR techniques. They use the characteristic vocabulary of the domain (100 terms extracted from an external in-domain corpus) to query a Lucene search engine⁴ over the whole encyclopædia. Our approach is completely different: we work directly with Wikipedia's structure, with a strategy to walk through the category graph departing from a root or pseudo-root category, which defines our domain of interest. We empirically set a threshold to stop exploring the graph such that the included categories most likely represent an entire domain (cf. Section 3). This approach is more similar to Cui et al. (2008), who explore the Wiki-Graph and score every category in order to assess its likelihood of belonging to the domain.

Other tools have been developed to extract corpora from Wikipedia. Linguatools⁵ released a comparable corpus extracted from Wikipedias in 253 language pairs. Unfortunately, neither their tool nor a description of the applied methodology is available. CatScan2⁶ is a tool that allows exploring and searching categories recursively. The Accurat toolkit (Pinnis et al., 2012; Ștefănescu et al., 2012)⁷ aligns comparable documents and extracts parallel sentences, lexicons, and named entities. Finally, the tool most closely related to ours, CorpusPedia,⁸ extracts non-aligned, softly-aligned, and strongly-aligned comparable corpora from Wikipedia (Otero and López, 2010). The difference with respect to our model is that it only considers the articles associated to one specific category and not to an entire domain.

The inter-connection among Wikipedia editions in different languages has been exploited for multiple tasks, including lexicon induction (Erdmann et al., 2008), extraction of bilingual dictionaries (Yu and Tsujii, 2009), and identification of particular translations (Chu et al., 2014; Prochasson and Fung, 2011). Different cross-language NLP tasks have particularly taken advantage of Wikipedia. Articles have been used for query translation (Schönhofen et al., 2007) and for cross-language semantic representations for similarity estimation (Cimiano et al., 2009; Potthast et al., 2008; Sorg and Cimiano, 2012). The extraction of parallel corpora from Wikipedia has been a hot topic during the last years (Adafre and de Rijke, 2006; Patry and Langlais, 2011; Plamada and Volk, 2012; Smith et al., 2010; Tomás et al., 2008; Yasuda and Sumita, 2008).

⁴https://lucene.apache.org/
⁵http://linguatools.org
⁶http://tools.wmflabs.org/catscan2/catscan2.php
⁷http://www.accurat-project.eu
⁸http://gramatica.usc.es/pln/tools/CorpusPedia.html
3 Domain-Specific Comparable Corpora Extraction

In this section we describe our proposal to extract domain-specific comparable corpora from Wikipedia. The input to the pipeline is the top category of the domain (e.g., Sport). The terminology used in this description is as follows. Let c be a Wikipedia category and ĉ be the top category of a domain. Let a be a Wikipedia article; a ∈ c if a contains c among its categories. Let G be the Wikipedia category graph.
Vocabulary definition. The domain vocabulary represents the set of terms that best characterises the domain. We do not expect to have at our disposal the vocabulary associated to every category. Therefore, we build it from Wikipedia itself. We collect every article a ∈ ĉ and apply standard pre-processing, i.e., tokenisation, stopwording, filtering of numbers and punctuation marks, and stemming (Porter, 1980). In order to reduce noise, tokens shorter than four characters are discarded as well. The vocabulary is then composed of the top n terms, ranked by term frequency; this value is determined empirically.
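As an illustration, the following is a minimal sketch of this vocabulary-building step (not the authors' implementation): the NLTK stopword list, the Snowball stemmer, the 10% cut-off and the way article texts are obtained are all assumptions made for the example.

import re
from collections import Counter

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import SnowballStemmer      # covers both English and Spanish

def build_domain_vocabulary(articles, lang="english", top_fraction=0.10):
    """Return the most frequent stemmed terms of the articles under a root category.

    `articles` is an iterable of plain-text article bodies; how they are obtained
    (e.g., via JWPL) is outside the scope of this sketch.
    """
    stop = set(stopwords.words(lang))
    stemmer = SnowballStemmer(lang)
    counts = Counter()
    for text in articles:
        # Tokenise, lowercase, drop punctuation/numbers, stopwords and short tokens
        for tok in re.findall(r"[a-záéíóúñü]+", text.lower()):
            if tok in stop or len(tok) < 4:
                continue
            counts[stemmer.stem(tok)] += 1
    n = max(1, int(len(counts) * top_fraction))   # e.g., top 10% of the ranked terms
    return [term for term, _ in counts.most_common(n)]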
Graph exploration. The input for this step is G, ĉ (i.e., the departing node in the graph), and the domain vocabulary. Departing from ĉ, we perform a breadth-first search, looking for all those categories which most likely belong to the required domain. Two constraints are applied in order to make a controlled exploration of the graph: (i) in order to avoid loops and exploring already traversed paths, a node can only be visited once; (ii) in order to avoid exploring the whole category graph, a stopping criterion is pre-defined. Our stopping criterion is inspired by the classification tree-breadth first search algorithm (Cui et al., 2008). The core idea is scoring the explored categories to determine whether they belong to the domain. Our heuristic assumes that a category belongs to the domain if its title contains at least one of the terms in the characteristic vocabulary. Nevertheless, many categories exist that do not include any of the terms in the vocabulary (e.g., consider the category pato in Spanish —literally "duck" in English— which, somehow surprisingly, refers to a sport rather than an animal). Our naïve solution to this issue is to consider subsets of categories according to their depth with respect to the root. An entire level of categories is considered part of the domain if a minimum percentage of its elements include vocabulary terms. A minimal sketch of this level-wise exploration is given after Table 1.

Edition        Articles    Categories   Ratio
English        4,123,676   1,032,222    4.0
Spanish        965,543     210,803      4.6
Intersection   631,710     107,313      –

Table 1: Number of articles and categories in the two Wikipedia editions and in their intersection (i.e., pages linked across languages).
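The sketch announced above is a minimal, illustrative version of the level-wise breadth-first exploration. It assumes G is available as a mapping from each category to its sub-categories and that titles and vocabulary share the same stemming; the 50% default mirrors the threshold finally adopted below, but neither the data structure nor the helper names come from the authors' code.

def explore_domain(G, root, vocabulary, min_ratio=0.5, stem=lambda t: t):
    """Breadth-first search over the category graph departing from `root`.

    A whole depth level is accepted while at least `min_ratio` of its category
    titles contain some vocabulary term; exploration stops at the first level
    falling below that threshold. Returns the set of accepted categories.
    """
    vocab = {stem(term) for term in vocabulary}

    def in_domain(title):
        return any(stem(tok) in vocab for tok in title.lower().split())

    accepted, visited = set(), {root}
    frontier = [root]
    while frontier:
        positives = sum(1 for cat in frontier if in_domain(cat))
        if positives / len(frontier) < min_ratio:
            break                              # stopping criterion reached
        accepted.update(frontier)
        next_level = []
        for cat in frontier:
            for child in G.get(cat, []):       # sub-categories of `cat`
                if child not in visited:       # each node is visited only once
                    visited.add(child)
                    next_level.append(child)
        frontier = next_level                  # move one level deeper
    return accepted

The articles attached to the accepted categories, restricted to those available in both languages via interlanguage links, would then form the comparable corpus.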
In our experiments we use the English and Spanish Wikipedia editions.⁹ Table 1 shows some statistics, after filtering out disambiguation and redirect pages. The intersection of articles and categories between the two languages represents the ceiling for the amount of parallel corpora one can gather for this pair. We focus on three domains: Computer Science (CS), Science (Sc), and Sports (Sp) — the top categories ĉ from which the graph is explored in order to extract the corresponding comparable corpora.

Table 2 shows the number of root articles associated to ĉ for each domain and language. From them, we obtain domain vocabularies with sizes between 100 and 400 lemmas (right-side columns) when using the top 10% of terms. We ran experiments using the top 10%, 15%, 20% and 100%. The relatively small size of these vocabularies allows us to check manually that 10% is the best option to characterise the desired category; higher percentages add more noise than in-domain terms.

⁹Dumps downloaded from https://dumps.wikimedia.org in July 2013 and pre-processed with JWPL (Zesch et al., 2008) (https://code.google.com/p/jwpl/).
      Articles     Vocabulary
      en    es     en    es
CS    4     130    106   447
Sc    29    3      464   140
Sp    3     10     122   100

Table 2: Number of articles in the root categories and size of the resulting domain vocabulary.
[Figure 2: two plots (ENGLISH and SPANISH) of in-vocabulary categories (%) against depth, with one curve per domain (CS, Sc, Sp).]

Figure 2: Percentage of categories with at least one domain term in the title for the two languages and the three domains under study.
The plots in Figure 2 show the percentage of categories with at least one domain term in the title: the starting point for our graph-based method for selecting the in-domain articles. As expected, nearly 100% of the categories in the root include domain terms, and this percentage decreases with increasing depth in the tree.

When extracting the corpus, one must decide the adequate percentage of positive categories allowed. High thresholds lead to small corpora, whereas low thresholds lead to larger —but noisier— corpora. As in many applications, this is a trade-off between precision and recall and depends on the intended use of the corpus. Table 3 shows some numbers for two different thresholds.
      Articles              Distance from the root
      50%       60%         50%          60%
      en–es     en–es       en    es     en    es
CS    18,168    8,251       6     5      5     5
Sc    161,130   21,459      6     4      4     4
Sp    72,315    1,980       8     8      3     4

Table 3: Number of article pairs according to the percentage of positive categories used to select the levels of the graph, and distance from the root at which the percentage falls below the desired one.
Increasing the threshold does not always mean lowering the selected depth but, when it does, the difference in the number of extracted articles can be significant. The same table shows the number of article pairs extracted for each value: the resulting comparable corpus for each domain. The stopping level is selected for every language independently but, in order to reduce noise, the comparable corpus is only built from those articles that appear in both languages and are related via an interlanguage link. We validate the quality, in terms of application-based utility, of the generated comparable corpora when used in a translation system (cf. Section 5). Therefore, we choose to give more importance to recall and opt for the corpora obtained with a threshold of 50%.
4 Parallel Sentence Extraction

In this section we describe a simple technique for extracting parallel sentences from a comparable corpus.

Given a pair of articles related by an interlanguage link, we estimate the similarity between all their pairs of cross-language sentences with different text similarity measures. We repeat the process for all the pairs of articles and rank the resulting sentence pairs according to their similarity. After defining a threshold for each measure, those sentence pairs with a similarity higher than the threshold are extracted as parallel sentences. This is an unsupervised method that generates a noisy parallel corpus. The quality of the similarity measures therefore affects the purity of the parallel corpus and, in turn, the quality of the translator. However, we do not need to be very restrictive with the measures here and still favour a large corpus, since the word alignment process in the SMT system can take care of part of the noise.
Similarity computation. We compute similarities between pairs of sentences by means of cosine and length factor measures. The cosine similarity is calculated on three well-known characterisations used in cross-language information retrieval and parallel corpora alignment: (i) character n-grams (cng) (McNamee and Mayfield, 2004); (ii) pseudo-cognates (cog) (Simard et al., 1992); and (iii) word 1-grams, after translation into a common language, both from English to Spanish and vice versa (mono_en, mono_es). We add the (iv) length factor (len) (Pouliquen et al., 2003) as an independent measure and as a penalty (multiplicative factor) on the cosine similarity.
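To make the measures concrete, the following is a small sketch of the character n-gram cosine and of the length factor. The Gaussian form of the length factor follows Pouliquen et al. (2003), but the µ and σ values, like the function names, are placeholders rather than the parameters used in the paper.

import math
from collections import Counter

def char_ngrams(sentence, n=3):
    """Bag of character n-grams of a lowercased, space-normalised sentence."""
    s = " ".join(sentence.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(bag1, bag2):
    """Cosine similarity between two bags of features."""
    dot = sum(bag1[f] * bag2[f] for f in set(bag1) & set(bag2))
    norm = math.sqrt(sum(v * v for v in bag1.values())) * \
           math.sqrt(sum(v * v for v in bag2.values()))
    return dot / norm if norm else 0.0

def length_factor(src, trg, mu=1.1, sigma=0.25):
    """Gaussian penalty on the character-length ratio (Pouliquen et al., 2003).
    mu and sigma are corpus-dependent; the values here are only placeholders."""
    ratio = len(trg) / max(len(src), 1)
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)

def c3g_len(src, trg):
    """Character 3-gram cosine penalised by the length factor."""
    return cosine(char_ngrams(src, 3), char_ngrams(trg, 3)) * length_factor(src, trg)

A cross-language sentence pair would then be kept as parallel if such a score exceeds the threshold tuned below.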
The threshold for each of the measures just introduced is set empirically on a manually annotated corpus; we define it as the value that maximises the F1 score on this development set. To create this set, we manually annotated a corpus of 30 article pairs (10 per domain) at sentence level. We considered three sentence classes: parallel, comparable, and other. The volunteers of the exercise were given as guidelines the definitions by Skadiņa et al. (2010) of parallel text and strongly comparable text (cf. Section 2). A pair that did not match any of these definitions had to be classified as other. Each article pair was annotated by two volunteers, native speakers of Spanish with a high command of English (a total of nine volunteers participated in the process). The mean agreement between annotators had a kappa coefficient (Cohen, 1960) of κ = 0.7. A third annotator resolved the disagreements.¹⁰
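As a sketch of how such thresholds can be set, assume a list of similarity scores and binary labels (1 for parallel pairs) from the annotated development set; the 0.05 grid step is an arbitrary choice for the example.

def best_threshold(scores, labels, step=0.05):
    """Pick the threshold that maximises F1 on a labelled development set."""
    best_t, best_f1 = 0.0, 0.0
    t = 0.0
    while t <= 1.0:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l == 1)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
        t += step
    return best_t, best_f1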
Table 4 shows the thresholds that obtain the maximum F1 scores. It is worth noting that, even if the values of precision and recall are relatively low —the maximum recall is 0.57 for len—, our intention with these simple measures is not to obtain the highest performance in terms of retrieval, but to inject the most useful data into the translator, even at the cost of some noise. The performance with character 3-grams is the best one, comparable to that of mono, with an F1 of 0.36. This suggests that a translator is not mandatory for performing the sentence selection. Len and character 1-grams have no discriminating power and lead to the worst scores (F1 of 0.14 and 0.21, respectively).
We ran a second set of experiments to explore the combination of the measures. Table 5 shows the performance obtained by averaging all the similarities (S̄), also after multiplying them by the length factor and/or by the F1 observed in the previous experiment. Even if the length factor showed a poor performance in isolation, it consistently lifts the F1 figures when applied to the similarities; in this case, F1 grows up to 0.43. The impact is not as relevant when the individual F1 values are used for weighting S̄.

¹⁰The corpus is publicly available at http://www.cs.upc.edu/~cristinae/CV/recursos.php.

        c1g    c2g    c3g    c4g    c5g    cog    mono_en  mono_es  len
Thres.  0.95   0.60   0.25   0.20   0.15   0.30   0.20     0.15     0.90
P       0.18   0.29   0.28   0.24   0.23   0.16   0.30     0.26     0.08
R       0.25   0.31   0.53   0.47   0.47   0.49   0.46     0.34     0.57
F1      0.21   0.30   0.36   0.32   0.31   0.24   0.36     0.30     0.14

Table 4: Best thresholds and their associated precision (P), recall (R) and F1.

        S̄      S̄·len   S̄·F1   S̄·F1·len
Thres.  0.25   0.15    0.05   0.05
P       0.27   0.33    0.18   0.32
R       0.50   0.62    0.77   0.65
F1      0.35   0.43    0.29   0.43

Table 5: Precision, recall, and F1 for the average of the similarities weighted by the length model (len) and/or their F1.
We applied all the measures —both combined and in isolation— to the entire comparable corpora previously extracted. Table 6 shows the amount of parallel sentences extracted by applying the empirically defined thresholds of Tables 4 and 5. As expected, more flexible alternatives, such as low-level n-grams or the length factor, result in a larger amount of retrieved instances, but in all cases the size of the corpora is remarkable. For the most restricted domain, CS, we get around 200k parallel sentences for a given similarity measure. For the widest domain, Sc, we surpass 1M sentence pairs. As will be shown in the following section, these sizes are already useful for training SMT systems; some standard parallel corpora have the same order of magnitude. For tasks other than MT, where precision on the extracted pairs can be more important than recall, one can obtain cleaner corpora by using a threshold that maximises precision instead of F1.
          CS        Sc          Sp
c1g       207,592   1,585,582   404,656
c2g       99,964    745,821     326,882
c3g       96,039    724,210     335,147
c4g       110,701   863,090     394,105
c5g       126,692   1,012,993   466,007
cog       182,981   1,215,008   451,941
len       271,073   1,941,866   550,338
mono_en   211,209   1,367,917   461,731
mono_es   183,439   1,273,509   435,671
S̄         154,917   1,098,453   450,933
S̄·len     121,697   957,662     390,783
S̄·F1      153,056   1,085,502   448,076
S̄·F1·len  121,407   957,967     392,241

Table 6: Size of the parallel corpora extracted with each similarity measure.
5 Evaluation: Statistical Machine Translation Task

In this section we validate the quality of the obtained corpora by studying their impact on statistical machine translation. There are several parallel corpora for the English–Spanish language pair. We select Europarl v7 (Koehn, 2005), with 1.97M parallel sentences, as the general-purpose corpus. Its order of magnitude is similar to that of the largest corpus we have extracted from Wikipedia, so we can compare the results in a size-independent way. If the corpus extracted from Wikipedia is indeed made up of parallel fragments of the desired domain, it should be the most adequate one for translating these domains. If the quality of the parallel fragments is acceptable, it should also help when translating out-of-domain texts. In order to test these hypotheses we analyse three settings: (i) train SMT systems only with Wikipedia (WP) or Europarl (EP) to translate domain-specific texts, (ii) train SMT systems with Wikipedia and Europarl to translate domain-specific texts, and (iii) train SMT systems with Wikipedia and Europarl to translate out-of-domain texts (news).
For the out-of-domain evaluation we use the News Commentaries 2011 test set and the News Commentaries 2009 set for development.¹¹ For the in-domain evaluation we build the test and development sets in a semi-automatic way. We depart from the parallel corpora gathered in Section 4, from which we select sentences with more than four tokens that begin with a letter. We estimate their perplexity with respect to a language model obtained from Europarl in order to select the most fluent sentences, and then we rank the parallel sentences according to their similarity and perplexity. The top-n fragments were manually revised and extracted to build the Wikipedia test (WPtest) and development (WPdev) sets. We repeated the process for the three studied domains and drew 300 parallel fragments for development and 500 for test for every domain. We removed these sentences from the corresponding training corpora. For one of the domains, CS, we also gathered a test set from a parallel corpus of GNOME localisation files (Tiedemann, 2012). Table 7 shows the size, in number of sentences, of these test sets and of the 20 Wikipedia training sets used for translation. Only one measure, the one with the highest F1 score, is selected from each family: c3g, cog, mono_en and S̄·len (cf. Tables 4 and 5). We also compile the corpus that results from the union of the previous four. Notice that, although we eliminate duplicates from this corpus, the size of the union is close to the sum of the individual corpora; this indicates that every similarity measure selects a different set of parallel fragments. Besides the specialised corpus for each domain, we build a larger corpus with all the data (Un). Again, duplicate fragments coming from articles belonging to more than one domain are removed.

¹¹Both are available at http://www.statmt.org/wmt14/translation-task.html.
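A compact sketch of the semi-automatic selection just described, assuming KenLM Python bindings and a language model estimated on Europarl; the way similarity and perplexity are combined into a single ranking score is an illustrative choice, not necessarily the paper's exact criterion.

import kenlm

def candidate_test_fragments(pairs, lm_path, top_n=800):
    """Rank extracted (en, es, similarity) fragments for manual revision.

    Keeps sentence pairs with more than four tokens that start with a letter,
    and prefers high similarity and low perplexity w.r.t. a Europarl LM.
    """
    lm = kenlm.Model(lm_path)                    # e.g., a 5-gram Europarl model

    def keep(sentence):
        return len(sentence.split()) > 4 and sentence[0].isalpha()

    scored = []
    for en, es, sim in pairs:
        if keep(en) and keep(es):
            ppl = lm.perplexity(en)              # fluency proxy on the English side
            scored.append((sim / ppl, en, es))   # illustrative ranking criterion
    scored.sort(reverse=True)
    return [(en, es) for _, en, es in scored[:top_n]]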
          CS        Sc          Sp          Un
c3g       95,715    723,760     334,828     883,366
cog       182,283   1,213,965   451,324     1,430,962
mono_en   210,664   1,367,169   461,237     1,638,777
S̄·len     120,835   956,346     389,975     1,160,977
union     577,428   3,847,381   1,181,664   4,948,241
WPdev     300       300         300         900
WPtest    500       500         500         1500
GNOME     1000      –           –           –

Table 7: Number of sentences of the Wikipedia parallel corpora used to train the SMT systems (top rows) and of the sets used for development and test.

          CS      Sc      Sp      Un      Comp.
Europarl  27.99   34.00   30.02   30.63   –
c3g       38.81   40.53   46.94   43.68   43.68
cog       57.32   56.17   57.60   58.14   54.89
mono_en   54.27   52.96   55.74   55.17   52.45
S̄·len     56.14   57.40   58.39   58.80   56.78
union     64.65   62.95   62.65   64.47   –

Table 8: BLEU scores obtained on the Wikipedia test sets for the 20 specialised systems described in Section 5. A comparison column (Comp.), where all the systems are trained with corpora of the same size, is also included (see text).

SMT systems are trained using standard freely available software. We estimate a 5-gram language model using interpolated Kneser–Ney discounting with SRILM (Stolcke, 2002). Word alignment is done with GIZA++ (Och and Ney, 2003), and both phrase extraction and decoding are done with Moses (Koehn et al., 2007). We optimise the feature weights of the model with Minimum Error Rate Training (MERT) (Och, 2003) against the BLEU evaluation metric (Papineni et al., 2002). Our model considers the language model, direct and inverse phrase probabilities, direct and inverse lexical probabilities, phrase and word penalties, and a lexicalised reordering model.
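For readers who want to reproduce a comparable setup, the following is a sketch of such an SRILM/Moses pipeline driven from Python. The installation paths, working-directory layout and several option values are assumptions, so the commands should be adapted to the local setup rather than taken as the exact configuration used here.

import subprocess

MOSES = "/opt/mosesdecoder"          # assumed installation paths
EXT_BIN = "/opt/giza-tools"          # directory containing GIZA++ and mkcls

def train_smt(corpus_prefix, src="en", tgt="es", work_dir="work"):
    """Train a phrase-based English->Spanish system: LM, alignment, phrases, MERT."""
    # 1. 5-gram LM with interpolated Kneser-Ney discounting (SRILM)
    subprocess.run(["ngram-count", "-order", "5", "-interpolate", "-kndiscount",
                    "-text", f"{corpus_prefix}.{tgt}",
                    "-lm", f"{work_dir}/lm.{tgt}.arpa"], check=True)

    # 2. Word alignment (GIZA++), phrase extraction and model building (Moses)
    subprocess.run([f"{MOSES}/scripts/training/train-model.perl",
                    "-root-dir", work_dir,
                    "-corpus", corpus_prefix, "-f", src, "-e", tgt,
                    "-alignment", "grow-diag-final-and",
                    "-reordering", "msd-bidirectional-fe",
                    "-lm", f"0:5:{work_dir}/lm.{tgt}.arpa:0",
                    "-external-bin-dir", EXT_BIN], check=True)

    # 3. Tune feature weights with MERT against BLEU on the development set
    subprocess.run([f"{MOSES}/scripts/training/mert-moses.pl",
                    f"dev.{src}", f"dev.{tgt}",
                    f"{MOSES}/bin/moses", f"{work_dir}/model/moses.ini",
                    "--mertdir", f"{MOSES}/bin"], check=True)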
(i) Training systems with Wikipedia or Europarl for domain-specific translation. Table 8 shows the evaluation results on WPtest. All the specialised systems obtain significant improvements with respect to the Europarl system, regardless of their size. For instance, the worst specialised system (c3g, with only 95,715 sentences for CS) outperforms the general Europarl translator by more than 10 BLEU points. The most complete system (the union of the four representatives) doubles the BLEU score for all the domains, with an impressive improvement of 30 points. This is of course possible because the test set has been extracted from the same collection as the training data and therefore shares its structure and vocabulary.
          CS      Un      Comp.
c3g       11.08   9.56    9.56
cog       18.48   17.66   16.31
mono_en   19.48   20.58   18.84
S̄·len     20.71   20.56   19.76
union     22.41   20.63   –

Table 9: BLEU scores obtained on the GNOME test set for systems trained only with Wikipedia. A system trained on Europarl achieves a score of 18.15.
To put these high numbers in perspective, we evaluate the systems trained on the CS domain against the GNOME dataset (Table 9). Except for c3g, the Wikipedia translators always outperform the baseline trained on EP; the union system improves it by 4 BLEU points (22.41 compared to 18.15) with a corpus four times smaller. This confirms that a corpus automatically extracted with an F1 smaller than 0.5 is still useful for SMT. Notice also that using only the in-domain data (CS) is always better than using the whole WP corpus (Un), even if the former is in general ten times smaller (cf. Table 7).

According to this indirect evaluation of the similarity measures, character n-grams (c3g) represent the worst alternative. These results contradict the direct evaluation, where c3g and mono_en had the highest F1 scores on the development set among the individual similarity measures. The size of the corpus is not relevant here: when we train all the systems with the same amount of data, the ranking in the quality of the measures remains the same. To see this, we trained four additional systems with the top m parallel fragments, where m is the size of the smallest corpus for the union of domains: Un-c3g. This new comparison is reported in the columns "Comp." of Tables 8 and 9. In this fair comparison c3g is still the worst measure and S̄·len the best one. The translator built from its associated corpus, trained with less than half of the data used for the general one (883,366 vs. 1,965,734 parallel fragments), outperforms it both on WPtest (56.78 vs. 30.63) and on GNOME (19.76 vs. 18.15).
(ii) Training systems on Wikipedia and Europarl for domain-specific translation. Now we enrich the general translator with Wikipedia data or, equivalently, complement the Wikipedia translator with out-of-domain data. Table 10 shows the results. Augmenting the size of the in-domain corpus by 2 million fragments improves the results even more, by about 2 BLEU points when using all the union data. System c3g benefits the most from the inclusion of the Europarl data; the reason is that it is the individual system with the least training data available and the one obtaining the worst results. In fact, the better the Wikipedia system, the less important the contribution from Europarl is. For the independent test set GNOME, Table 11 shows that the union corpus on CS is better than any combination of Wikipedia and Europarl. Still, as mentioned above, the best performance on this test set is obtained with a pure in-domain system (cf. Table 9).

            CS      Sc      Sp      Un
Europarl    27.99   34.00   30.02   30.63
union       64.65   62.95   62.65   64.47
EP+c3g      46.07   48.29   50.40   49.34
EP+cog      58.39   57.70   59.05   58.98
EP+mono_en  54.44   53.93   56.05   55.88
EP+S̄·len    56.05   57.53   59.78   58.72
EP+union    66.22   64.24   64.39   65.67

Table 10: BLEU scores obtained on the Wikipedia test set for the 20 systems trained with the combination of the Europarl (EP) and Wikipedia corpora. The results of the Europarl system and of the best system from Table 8 (union) are shown for comparison.

            CS      Un
EP+c3g      19.78   19.49
EP+cog      21.09   20.14
EP+mono_en  21.27   20.66
EP+S̄·len    21.58   20.65
EP+union    22.37   21.43

Table 11: BLEU scores obtained on the GNOME test set for systems trained with Europarl and Wikipedia. A system trained only on Europarl achieves a score of 18.15.
(iii) Training systems on Wikipedia and Europarl for out-of-domain translation. Now we check the performance of the Wikipedia translators on the out-of-domain news test set. Table 12 shows the results. In this domain, which is neutral for both Europarl and Wikipedia, the in-domain Wikipedia systems show a lower performance: the BLEU score obtained with the Europarl system is 27.02, whereas the Wikipedia union system achieves 22.16. When combining the two corpora, results are dominated by the Europarl baseline. In general, systems in which we include only texts from an unrelated domain do not improve the performance of the Europarl system alone; results of the combined system are better when we use Wikipedia texts from all the domains together (column Un) for training. This suggests that, as expected, a general Wikipedia corpus is necessary to build a general translator. That is a different problem to deal with.

            CS      Sc      Sp      Un
union       16.74   22.28   15.82   22.16
EP+c3g      26.06   26.35   26.81   27.07*
EP+cog      26.61   27.33*  26.71   27.08*
EP+mono_en  27.18*  26.80   26.96   27.44*
EP+S̄·len    27.59*  26.80   27.58*  27.22*
EP+union    26.76   27.52*  27.35*  26.72

Table 12: BLEU scores for the out-of-domain evaluation on the News Commentaries 2011 test set. Scores marked with * improve over the Europarl translator, which achieves a score of 27.02.
6 Conclusions and Ongoing Work

In this paper we presented a model for the automatic extraction of in-domain comparable corpora from Wikipedia. It enables the automatic extraction of monolingual and comparable article collections, as well as one-click parallel corpus generation for on-demand language pairs and domains. Given a pair of languages and a main category, the model explores the Wikipedia category graph and identifies a subset of categories (and their associated articles) to generate a document-aligned comparable corpus. The resulting corpus can be exploited for multiple natural language processing tasks. Here we applied it as part of a pipeline for the extraction of domain-specific parallel sentences. These parallel instances allowed for a significant improvement in machine translation quality, compared to a generic system, when translating in-domain texts. The experiments were carried out on the English–Spanish language pair and the domains Computer Science, Science, and Sports, but the approach can be applied to other language pairs and domains.

The prototype is currently operating in other languages. The only prerequisite is the existence of the corresponding Wikipedia edition and some basic processing tools such as a tokeniser and a lemmatiser. Our current efforts aim at a more robust model for parallel sentence identification and at the design of other indirect evaluation schemes to validate the model's performance.

Acknowledgments

This work was partially funded by the TACARDI project (TIN2012-38523-C02) of the Spanish Ministerio de Economía y Competitividad (MEC).
References

Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding Similar Sentences across Multiple Languages in Wikipedia. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 62–69.

Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors. 2008. Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008), Marrakech, Morocco. European Language Resources Association (ELRA).

Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2014. Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 8404 of Lecture Notes in Computer Science, pages 296–309. Springer Berlin Heidelberg.

Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, and Steffen Staab. 2009. Explicit Versus Latent Concept Models for Cross-language Information Retrieval. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI '09, pages 1513–1518, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Gaoying Cui, Qin Lu, Wenjie Li, and Yirong Chen. 2008. Corpus Exploitation from Wikipedia for Ontology Construction. In Calzolari et al. (2008), pages 2126–2128.

Eagles Document Eag–Tcwg–Ctyp. 1996. EAGLES Preliminary Recommendations on Corpus Typology.

Maike Erdmann, Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. 2008. An Approach for Extracting Bilingual Terminology from Wikipedia. In Proceedings of the 13th International Conference on Database Systems for Advanced Applications, DASFAA '08, pages 380–392, Berlin, Heidelberg. Springer-Verlag.

Pascale Fung and Percy Cheung. 2004. Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of EMNLP, pages 57–63, Barcelona, Spain, July 25–26.

Pascale Fung. 1995. Compiling Bilingual Lexicon Entries from a Non-Parallel English-Chinese Corpus. In Proceedings of the Third Annual Workshop on Very Large Corpora, pages 173–183.

Pascale Fung. 1998. A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora. Lecture Notes in Computer Science, 1529:1–17.

Brent Hecht and Darren Gergle. 2010. The Tower of Babel Meets Web 2.0: User-generated Content and Its Applications in a Multilingual Context. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 291–300, New York, NY, USA. ACM.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Machine Translation Summit X, pages 79–86.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Belinda Maia. 2003. What are comparable corpora? In Proceedings of the Corpus Linguistics Workshop on Multilingual Corpora: Linguistic Requirements and Technical Perspectives.

Anthony M. McEnery and Zhonghua Xiao. 2007. Incorporating Corpora: Translation and the Linguist, chapter Parallel and comparable corpora: What are they up to? Translating Europe. Multilingual Matters.

Paul McNamee and James Mayfield. 2004. Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval, 7(1-2):73–97.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51. See also http://www.fjoch.com/GIZA++.html.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan.

Pablo Gamallo Otero and Isaac González López. 2010. Wikipedia as multilingual source of comparable corpora. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, pages 21–25, 22 May. Available at http://www.fb06.uni-mainz.de/lk/bucc2010/documents/Proceedings-BUCC-2010.pdf.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 311–318, Philadelphia, PA. Association for Computational Linguistics.

Alexandre Patry and Philippe Langlais. 2011. Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia. In Pierre Zweigenbaum, Reinhard Rapp, and Serge Sharoff, editors, Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 87–95, Portland, Oregon. Association for Computational Linguistics.

Mārcis Pinnis, Radu Ion, Dan Ștefănescu, Fangzhong Su, Inguna Skadiņa, Andrejs Vasiļjevs, and Bogdan Babych. 2012. Accurat toolkit for multi-level alignment and information extraction from comparable corpora. In Proceedings of the ACL 2012 System Demonstrations, ACL '12, pages 91–96, Stroudsburg, PA, USA. Association for Computational Linguistics.

Magdalena Plamada and Martin Volk. 2012. Towards a Wikipedia-extracted alpine corpus. In The Fifth Workshop on Building and Using Comparable Corpora, May.

Martin F. Porter. 1980. An Algorithm for Suffix Stripping. Program, 14:130–137.

Martin Potthast, Benno Stein, and Maik Anderka. 2008. A Wikipedia-Based Multilingual Retrieval Model. In Advances in Information Retrieval, 30th European Conference on IR Research, LNCS 4956, pages 522–530. Springer-Verlag.

Bruno Pouliquen, Ralf Steinberger, and Camelia Ignat. 2003. Automatic Identification of Document Translations in Large Multilingual Document Collections. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-2003), pages 401–408, Borovets, Bulgaria.

Emmanuel Prochasson and Pascale Fung. 2011. Rare Word Translation Extraction from Aligned Comparable Documents. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, HLT '11, pages 1327–1335, Stroudsburg, PA, USA. Association for Computational Linguistics.

Reinhard Rapp. 1995. Identifying Word Translations in Non-Parallel Texts. CoRR, cmp-lg/9505037.

Jennifer Rowley and Richard Hartley. 2000. Organizing Knowledge. An Introduction to Managing Access to Information. Ashgate, 3rd edition.

Péter Schönhofen, András A. Benczúr, István Bíró, and Károly Csalogány. 2007. Cross-language retrieval with Wikipedia. In Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19–21, 2007, Revised Selected Papers, pages 72–79.

Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum. 2013. Building and Using Comparable Corpora, chapter Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora. Springer.

Michel Simard, George F. Foster, and Pierre Isabelle. 1992. Using Cognates to Align Sentences in Bilingual Corpora. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation.

Inguna Skadiņa, Ahmet Aker, Voula Giouli, Dan Tufiş, Robert Gaizauskas, Madara Mieriņa, and Nikos Mastropavlos. 2010. A collection of comparable corpora for under-resourced languages. In Proceedings of the 2010 Conference on Human Language Technologies – The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010, pages 161–168, Amsterdam, The Netherlands. IOS Press.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 403–411, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Sorg and Philipp Cimiano. 2012. Exploiting Wikipedia for Cross-lingual and Multilingual Information Retrieval. Data & Knowledge Engineering, 74:26–45, April.

Dan Ștefănescu, Radu Ion, and Sabine Hunsicker. 2012. Hybrid Parallel Sentence Mining from Comparable Corpora. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy. European Association for Machine Translation.

Andreas Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, May. European Language Resources Association (ELRA).

Jesús Tomás, Jordi Bataller, Francisco Casacuberta, and Jaime Lloret. 2008. Mining Wikipedia as a parallel and comparable corpus. Language Forum, 34(1). Article presented at CICLing-2008, 9th International Conference on Intelligent Text Processing and Computational Linguistics, February 17–23, 2008, Haifa, Israel.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Chu-Ren Huang and Dan Jurafsky, editors, Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 1101–1109, Beijing, China, August. COLING 2010 Organizing Committee.

Dekai Wu and Pascale Fung. 2005. Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In Natural Language Processing – IJCNLP 2005, Second International Joint Conference, pages 257–268, Jeju Island, Korea, October 11–13.

Keiji Yasuda and Eiichiro Sumita. 2008. Method for Building Sentence-Aligned Corpus from Wikipedia. In Association for the Advancement of Artificial Intelligence.

Kun Yu and Junichi Tsujii. 2009. Bilingual dictionary extraction from Wikipedia. In Proceedings of Machine Translation Summit XII.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Calzolari et al. (2008).
... Wikipedia's inter-language links are crucial to obtain an aligned comparable corpus. The value of the Wikipedia as a source of highly comparable and parallel sentences has been appreciated over the years [1,5,9,37,[47][48][49]55]. With the rise of deep learning for NLP and the need of large amounts of clean data, the use of Wikipedia has grown exponentially not only for parallel sentence extraction and machine translation [25,44,46,53], but also for semantics. ...
... This results in a collection of 741 categories. For comparison purposes, categories used in previous research are added if not already present: Archaeology, Linguistics, Physics, Biology, and Sport [22]; Mountaineering [38] and Computer Science [5]. Observe that Computer Science does not exist in the Greek edition nor Mountaineering in the Occitan one. ...
Article
Full-text available
We propose a language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopedia’s category graph and can produce both mono- and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph model reaches an average precision of 84%84%84\% on in-domain articles, outperforming an alternative model based on information retrieval techniques. As manual evaluations are costly, we introduce the concept of domainness and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with human judgments, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities.
... Similar to our proposed approach, Barrón-Cedeño et al. (2015) showed how using parallel documents from Wikipedia for domain specific alignment would improve translation quality of SMT systems on in-domain data. In this method, similarity between all pairs of cross-language sentences with different text similarity measures are estimated. ...
Preprint
Resources for the non-English languages are scarce and this paper addresses this problem in the context of machine translation, by automatically extracting parallel sentence pairs from the multilingual articles available on the Internet. In this paper, we have used an end-to-end Siamese bidirectional recurrent neural network to generate parallel sentences from comparable multilingual articles in Wikipedia. Subsequently, we have showed that using the harvested dataset improved BLEU scores on both NMT and phrase-based SMT systems for the low-resource language pairs: English--Hindi and English--Tamil, when compared to training exclusively on the limited bilingual corpora collected for these language pairs.
... In this paper, we used five datasets PAN11 [22], JRC-ACQUIS [23], EUROPARL [24], Wikipedia [25], and conference papers [26] for three language pairs En-Fr, En-Es, En-De. In the case of English-French, we used 10,620 plagiarized (P) documents, combining data from two sources: conference papers and JRC-Acquis. ...
Article
Full-text available
span lang="EN-US">The pervasive availability of vast online information has fundamentally altered our approach to acquiring knowledge. Nevertheless, this wealth of data has also presented significant challenges to academic integrity, notably in the realm of cross-lingual plagiarism. This type of plagiarism involves the unauthorized copying, translation, ideas, or works from one language into others without proper citation. This research introduces a methodology for identifying multilingual plagiarism, utilizing a pre-trained multilingual bidirectional and auto-regressive transformers (mBART) model for document feature extraction. Additionally, a siamese long short-term memory (SLSTM) model is employed for classifying pairs of documents as either "plagiarized" or "non-plagiarized". Our approach exhibits notable performance across various languages, including English (En), Spanish (Es), German (De), and French (Fr). Notably, experiments focusing on the En-Fr language pair yielded exceptional results, with an accuracy of 98.83%, precision of 98.42%, recall of 99.32%, and F-score of 98.87%. For En-Es, the model achieved an accuracy of 97.94%, precision of 98.57%, recall of 97.47%, and an F-score of 98.01%. In the case of En-De, the model demonstrated an accuracy of 95.59%, precision of 95.21%, recall of 96.85%, and F-score of 96.02%. These outcomes underscore the effectiveness of combining the MBART transformer and SLSTM models for cross-lingual plagiarism detection.</span
... This problem has been addressed in two main ways in previous work, not specifically related to news translation. Computational techniques have been employed to mine parallel sentences from comparable corpora and noisy parallel corpora (Barrón-Cedeño, España-Bonet, Boldoba, & Màrquez, 2015;Gete et al., 2022). Extracting parallel sentences from similar multilingual corpora is a well-known problem, addressed as a necessary step when gathering data for training and testing of machine translation systems, as well as for cross-lingual information retrieval algorithms. ...
Article
Full-text available
This contribution addresses the challenging issue of building corpus resources for the study of news translation, a domain in which the coexistence of radical rewriting and close translation makes the use of established corpus-assisted analytical techniques problematic. In an attempt to address these challenges, we illustrate and test two related methods for identifying translated segments within trilingual (Spanish, French and English) sets of dispatches issued by the global news agency Agence France-Press. One relies on machine translation and semantic similarity scores, the other on multilingual sentence embeddings. To evaluate these methods, we apply them to a benchmark dataset of translations from the same domain and perform manual evaluation of the dataset under study. We finally leverage the cross-linguistic equivalences thus identified to build a ‘comparallel’ corpus, which combines the parallel and comparable corpus architectures, highlighting its affordances and limitations for the study of news translation. We conclude by discussing the theoretical and methodological implications of our findings both for the study of news translation and more generally for the study of contemporary, novel forms of translation.
... In order to have sufficient training data, we have gathered four datasets: PAN-PC-11, JRC-Acquis, Europarl, and Wikipedia (Spanish-English) [36,37,38,39]. An evaluation corpus for automatic plagiarism detection algorithms is the PAN 2011 (PAN-PC-11). ...
Article
Full-text available
Academic plagiarism has become a serious concern as it leads to the retardation of scientific progress and violation of intellectual property. In this context, we make a study aiming at the detection of cross-linguistic plagiarism based on Natural language Preprocessing (NLP), Embedding Techniques, and Deep Learning. Many systems have been developed to tackle this problem, and many rely on machine learning and deep learning methods. In this paper, we propose Cross-language Plagiarism Detection (CL-PD) method based on Doc2Vec embedding techniques and a Siamese Long Short-Term Memory (SLSTM) model. Embedding techniques help capture the text's contextual meaning and improve the CL-PD system's performance. To show the effectiveness of our method, we conducted a comparative study with other techniques such as GloVe, FastText, BERT, and Sen2Vec on a dataset combining PAN11, JRC-Acquis, Europarl, and Wikipedia. The experiments for the Spanish-English language pair show that Doc2Vec+SLSTM achieve the best results compared to other relevant models, with an accuracy of 99.81%, a precision of 99.75%, a recall of 99.88%, an f-score of 99.70%, and a very small loss in the test phase.
... Unlike parallel corpora, so-called comparable corpora do not necessarily possess parallel structures, but merely share the same topics per corresponding unit (e.g., articles). Wikipedia 2 can be seen as a comparable corpus, since a correspondence relation between languages can be established for individual articles (McEnery and Xiao, 2007;Otero and López, 2010;Barrón-Cedeno et al., 2015). ...
Article
Full-text available
This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source. Several challenges need to be addressed for such an application to work, and we will discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we will detail what the structure of parallel corpora implies for that selection. Secondly, we will consider which type of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. And thirdly, we will highlight the potential of employing users, that is both teachers and learners, as crowdsourcers to help improve the material.
... Most likely, adding more language pairs and using ideas from recent work should help improve the accuracy of our models. Wikipedia has always been an interesting dataset for solving NLP problems, including machine translation (Li et al., 2012; Patry and Langlais, 2011; Lin et al., 2011; Tufiş et al., 2013; Barrón-Cedeño et al., 2015; Ruiter et al., 2019). The WikiMatrix data (Schwenk et al., 2019a) is the most similar effort to ours in terms of using Wikipedia, but it uses supervised translation models. ...
Preprint
Full-text available
We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as the cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for seed parallel data to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. a supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our wikily translation models to unsupervised image captioning and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a wikily translation of the English captioning data. Our captioning results in Arabic are slightly better than those of its supervised counterpart. In dependency parsing, we translate a large amount of monolingual text and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.
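A hedged sketch of the seed-extraction idea: titles and first sentences of cross-lingually linked Wikipedia pages are paired as pseudo-parallel seed data. The page dictionaries and the langlinks mapping are assumed to come from pre-parsed dumps; this is not the authors' released code.

```python
# Sketch of pairing titles and first sentences of linked Wikipedia pages
# as seed parallel data. Inputs are assumed to be pre-extracted from dumps.
def seed_parallel_pairs(en_pages, kk_pages, langlinks):
    """en_pages/kk_pages: {title: first_sentence}; langlinks: {en_title: kk_title}."""
    seeds = []
    for en_title, kk_title in langlinks.items():
        if en_title in en_pages and kk_title in kk_pages:
            seeds.append((en_title, kk_title))                       # title pair
            seeds.append((en_pages[en_title], kk_pages[kk_title]))   # first-sentence pair
    return seeds
```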
... Experimental Setup We use Wikipedia (WP) as a comparable corpus and download the English, French, German and Spanish dumps, pre-process them and extract comparable articles per language pair using WikiTailor (Barrón-Cedeño et al., 2015; España-Bonet et al., 2020). All articles are normalized, tokenized and truecased using standard Moses (Koehn et al., 2007) scripts. ...
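For readers who prefer Python, the same normalize/tokenize/truecase preprocessing can be approximated with the sacremoses port of the Moses scripts; the excerpt itself uses the original Perl scripts, so treat the following as an illustration only.

```python
# Approximate Python equivalent (via sacremoses) of the Moses normalization,
# tokenization and truecasing steps mentioned in the excerpt.
from sacremoses import MosesPunctNormalizer, MosesTokenizer, MosesTruecaser

def preprocess(lines, lang="en", truecase_model=None):
    normalizer = MosesPunctNormalizer(lang=lang)
    tokenizer = MosesTokenizer(lang=lang)
    truecaser = MosesTruecaser(truecase_model) if truecase_model else None
    out = []
    for line in lines:
        tok = tokenizer.tokenize(normalizer.normalize(line), return_str=True)
        if truecaser:
            tok = truecaser.truecase(tok, return_str=True)
        out.append(tok)
    return out
```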
Article
The number of sentence pairs in a bilingual corpus is key to the translation accuracy of machine translation systems. However, beyond a certain amount, additional sentence pairs have less impact on translation quality while considerably increasing the time and effort required to build the system, which hinders the development of statistical translation. This article proposes several measures of the amount of information carried by each sentence pair and uses a Heuristic Bilingual Graph Corpus Network (HBGCN) to form an improved corpus selection method that takes into account the differences in information content between sentence pairs. Using the graph-based selection method to build the training set, the experiments achieve translation results close to those obtained with the whole corpus and outperform a baseline based on the Document Inverse Frequency (DIF) ranking approach.
Article
Full-text available
This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats, together with basic annotation, to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report on new data sets and their features, additional annotation tools and models provided from the website, and essential interfaces and on-line services included in the project.
Book
The fourth edition of this standard student text, Organizing Knowledge, incorporates extensive revisions reflecting the increasing shift towards a networked and digital information environment, and its impact on documents, information, knowledge, users and managers. Offering a broad-based overview of the approaches and tools used in the structuring and dissemination of knowledge, it is written in an accessible style and well illustrated with figures and examples. The book has been structured into three parts and twelve chapters and has been thoroughly updated throughout. Part I discusses the nature, structuring and description of knowledge. Part II, with its five chapters, lies at the core of the book focusing as it does on access to information. Part III explores different types of knowledge organization systems and considers some of the management issues associated with such systems. Each chapter includes learning objectives, a chapter summary and a list of references for further reading. This is a key introductory text for undergraduate and postgraduate students of information management.
Book
This introductory text to statistical machine translation (SMT) provides all of the theories and methods needed to build a statistical machine translator, such as Google Language Tools and Babelfish. In general, statistical techniques allow automatic translation systems to be built quickly for any language-pair using only translated texts and generic software. With increasing globalization, statistical machine translation will be central to communication and commerce. Based on courses and tutorials, and classroom-tested globally, it is ideal for instruction or self-study, for advanced undergraduates and graduate students in computer science and/or computational linguistics, and researchers in natural language processing. The companion website provides open-source corpora and tool-kits.
Book
The 1990s saw a paradigm change in the use of corpus-driven methods in NLP. In the field of multilingual NLP (such as machine translation and terminology mining) this implied the use of parallel corpora. However, parallel resources are relatively scarce: many more texts are produced daily by native speakers of any given language than translated. This situation resulted in a natural drive towards the use of comparable corpora, i.e. non-parallel texts in the same domain or genre. Nevertheless, this research direction has not produced a single authoritative source suitable for researchers and students coming to the field. The proposed volume provides a reference source, identifying the state of the art in the field as well as future trends. The book is intended for specialists and students in natural language processing, machine translation and computer-assisted translation.
Chapter
The beginning of the 1990s marked a radical turn in various NLP applications towards using large collections of texts.
Article
In the literature, two main categories of methods have been proposed for bilingual lexicon extraction from comparable corpora, namely topic-model-based and context-based methods. In this paper, we present a bilingual lexicon extraction system that is based on a novel combination of these two methods in an iterative process. Our system does not rely on any prior knowledge and its performance can be iteratively improved. To the best of our knowledge, this is the first study that iteratively exploits both topical and contextual knowledge for bilingual lexicon extraction. Experiments conducted on Chinese-English and Japanese-English Wikipedia data show that our proposed method performs significantly better than a state-of-the-art method that only uses topical knowledge.
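The iterative combination can be pictured schematically as follows: two similarity signals are interpolated, the most confident translation pairs are added to the seed lexicon, and the process repeats. The scoring functions below are placeholders standing in for the paper's actual topic and context models.

```python
# Schematic sketch of iteratively combining topic-based and context-based
# similarity for bilingual lexicon extraction. topic_sim and context_sim
# are placeholder callables, not the paper's models.
def iterative_lexicon_extraction(src_words, tgt_words, topic_sim, context_sim,
                                 seed_lexicon, iterations=5, top_k=50, alpha=0.5):
    lexicon = dict(seed_lexicon)
    for _ in range(iterations):
        scored = []
        for s in src_words:
            if s in lexicon:
                continue
            best = max(
                tgt_words,
                key=lambda t: alpha * topic_sim(s, t)
                              + (1 - alpha) * context_sim(s, t, lexicon),
            )
            score = (alpha * topic_sim(s, best)
                     + (1 - alpha) * context_sim(s, best, lexicon))
            scored.append((score, s, best))
        # grow the lexicon with the most confident new pairs and iterate
        for _, s, t in sorted(scored, reverse=True)[:top_k]:
            lexicon[s] = t
    return lexicon
```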
Article
This paper describes a method for extracting parallel sentences from comparable texts. We present the main challenges in creating a German-French corpus for the Alpine domain. We demonstrate that it is difficult to use the Wikipedia categorization for the extraction of domain-specific articles from Wikipedia, and we therefore introduce an alternative information retrieval approach. Sentence alignment algorithms were used to identify semantically equivalent sentences across the Wikipedia articles. Using this approach, we create a corpus of sentence-aligned Alpine texts, which is evaluated both manually and automatically. Results show that even a small collection of extracted texts (approximately 10,000 sentence pairs) can partially improve the performance of a state-of-the-art statistical machine translation system. Thus, the approach is worth pursuing on a larger scale, as well as for other language pairs and domains.
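A simplified stand-in for the sentence-alignment step: score cross-article sentence pairs with a length-ratio and bilingual-dictionary overlap heuristic and keep the best matches. The paper uses dedicated alignment algorithms, so this only sketches the general idea.

```python
# Heuristic sketch of extracting parallel sentences from a pair of linked
# articles using a bilingual dictionary; not the paper's actual aligner.
def extract_parallel_sentences(de_sents, fr_sents, lexicon, min_score=0.3):
    """lexicon: dict mapping German words to sets of French translations."""
    pairs = []
    for de in de_sents:
        de_tok = de.lower().split()
        best, best_score = None, 0.0
        for fr in fr_sents:
            fr_tok = set(fr.lower().split())
            covered = sum(1 for w in de_tok if lexicon.get(w, set()) & fr_tok)
            length_penalty = min(len(de_tok), len(fr_tok)) / max(len(de_tok), len(fr_tok))
            score = (covered / max(len(de_tok), 1)) * length_penalty
            if score > best_score:
                best, best_score = fr, score
        if best is not None and best_score >= min_score:
            pairs.append((de, best, best_score))
    return pairs
```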
Conference Paper
While several recent works dealing with large bilingual collections of texts, e.g. (Smith et al., 2010), seek to extract parallel sentences from comparable corpora, we present Paradocs, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of controlled tasks. We applied it to the French-English cross-language linked article pairs of Wikipedia in order to see whether parallel articles are available in this resource, and whether our system is able to locate them. According to a manual evaluation we conducted, about a fourth of the article pairs in Wikipedia are indeed in a translation relation, and Paradocs identifies parallel or noisy-parallel article pairs with a precision of 80%.
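To illustrate the baseline signal mentioned above, the sketch below computes the hapax-word overlap used by Enright and Kondrak (2007): document pairs sharing many words that occur exactly once on each side are likely translations. Paradocs itself relies on richer, learned features; this is only the baseline idea.

```python
# Sketch of the hapax-overlap baseline for detecting parallel document pairs.
from collections import Counter

def hapaxes(tokens):
    """Words occurring exactly once in the document."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c == 1}

def hapax_overlap(doc_a_tokens, doc_b_tokens):
    ha, hb = hapaxes(doc_a_tokens), hapaxes(doc_b_tokens)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / min(len(ha), len(hb))
```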
Article
In this article we show how Wikipedia as a multilingual knowledge resource can be exploited for Cross-Language and Multilingual Information Retrieval (CLIR/MLIR). We describe an approach we call Cross-Language Explicit Semantic Analysis (CL-ESA), which indexes documents with respect to explicit interlingual concepts. These concepts are considered interlingual and universal, and in our case correspond either to Wikipedia articles or categories. Each concept is associated with a text signature in each language, which can be used to estimate language-specific term distributions for each concept. This knowledge can then be used to calculate the strength of association between a term and a concept, which is used to map documents into the concept space. With CL-ESA we thus move from a bag-of-words model to a bag-of-concepts model that allows language-independent document representations in the vector space spanned by interlingual and universal concepts. We show how different vector-based retrieval models and term weighting strategies can be used in conjunction with CL-ESA and experimentally analyze the performance of the different choices. We evaluate the approach on a mate retrieval task on two datasets: JRC-Acquis and Multext. We show that in the MLIR setting, CL-ESA benefits from a certain level of abstraction, in the sense that using categories instead of articles (as in the original ESA model) delivers better results.
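A hedged illustration of the bag-of-concepts mapping: each language maintains a TF-IDF index over the same ordered set of interlingual concepts (linked Wikipedia articles), documents are projected into that shared concept space, and cross-language similarity is computed there. Weighting schemes and the retrieval model differ in the actual paper.

```python
# Sketch of an ESA-style projection into an interlingual concept space.
# Both language indexes must cover the same concepts in the same order.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_esa_index(concept_texts):
    """concept_texts: list of article texts, one per interlingual concept."""
    vectorizer = TfidfVectorizer()
    term_concept = vectorizer.fit_transform(concept_texts)   # concepts x terms
    return vectorizer, term_concept

def to_concept_space(doc, vectorizer, term_concept):
    doc_vec = vectorizer.transform([doc])                     # 1 x terms
    return doc_vec @ term_concept.T                           # 1 x concepts

def cl_similarity(doc_en, doc_es, en_index, es_index):
    """Cross-language similarity of two documents in the shared concept space."""
    en_concepts = to_concept_space(doc_en, *en_index)
    es_concepts = to_concept_space(doc_es, *es_index)
    return float(cosine_similarity(en_concepts, es_concepts)[0, 0])
```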