United Nations General Assembly Resolutions: A Six-Language Parallel Corpus

Article · January 2009
Alexandre Rafalovitch (1, 2)
arafalov@gmail.com
Robert Dale (2)
rdale@science.mq.edu.au
(1) United Nations, New York, USA
(2) Centre for Language Technology, Macquarie University, Sydney, Australia
Abstract
In this paper we describe a six-way parallel public-domain corpus consisting of 2100 United Nations General Assembly Resolutions with translations in the six official languages of the United Nations, with an average of around 3 million tokens per language. The corpus is available in a pre-processed, formatting-normalized TMX format with paragraphs aligned across multiple languages. We describe the background to the corpus and its content, the process of its construction, and some of its interesting properties.
1 Introduction
Parallel corpora are a useful resource for a wide variety of purposes including the training of machine translation algorithms (Koehn, 2005), multilingual terminology extraction (Le An Ha and Corpas, 2008) and even bootstrapping algorithms for languages that do not enjoy the research resources of English (Yarowsky et al., 2001). Multi-parallel corpora can be more useful than bilingual corpora when the additional languages may be used to assist text alignment (Simard, 1999) or as translation bridge languages, in the sense of Kumar et al. (2007).
While many European and Germanic languages already have good parallel research corpus resources, such as JRC-Acquis (Steinberger et al., 2006) and EuroParl (Koehn, 2005), material in the Slavic, Sino-Tibetan and Semitic language families is much rarer. The corpus presented here consists of a collection of documents containing manually translated official resolutions of the General Assembly of the United Nations (UN) in the six official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. Since resolutions are legally significant, they pass through multiple levels of human translation and verification, and so the translations can be expected to be of high quality.
While documents of the United Nations are already available online via the Official Document System of the UN [1] and are mostly in the public domain (see STAI, 1987), they are not in the most convenient form for machine processing. The documents are typically PDF files with reasonably complex typography consisting of two-column text and extended footnotes, and some are available only as images without a text layer. The last public research-ready corpus of UN documents was produced by the Linguistic Data Consortium (LDC; see Graff, 1994) and included English, French and Spanish only. Given the linguistic variety encompassed by the UN’s official languages, it is therefore somewhat disappointing that this material is not easily available for research in machine translation.
This paper describes some steps towards addressing this concern. The corpus described here contains the text of 2100 resolutions for each language aligned at the level of paragraphs, with just over 74000 paragraphs in each language. The corpus contains an average of around 3 million tokens for each language. [2] The corpus is encoded in XML using Translation Memory eXchange (TMX) format, with some of the significant sections and text segments marked to assist future research. TMX format was selected as a storage format as it is a standard used in Computer-Assisted Translation tools and has a structure that, while simple, is nonetheless sufficient for our needs. [3]
[1] See http://documents.un.org.
[2] Counting tokens in the Chinese corpus is difficult; not including the Chinese data, the average token count across the other five languages is 3.11 million.
In this paper, we first provide in Section 2 some background information on the context in which the documents in this corpus appear, before going on to describe the corpus itself and the motivation for the selection of this subset of UN documents in Section 3. Section 4 discusses a number of interesting properties exhibited by the corpus that may encourage specific research directions. Section 5 provides details regarding access and availability.
2 The UN Document Space
In order to understand the nature of this corpus, it may be useful to first understand the overall structure of the United Nations and the range of documents that the organisation produces.
The United Nations is a large international organisation consisting of six principal organs: the General Assembly, the Security Council, the Economic and Social Council, the Trusteeship Council, the International Court of Justice, and the Secretariat. These, together with a number of other agencies, programmes and bodies (such as UNICEF, UNITAR and UNU), form the United Nations family.
Collectively, these bodies generate a massive quantity of documentation: over the last 15 years, about 14000 documents have been published each year, with around half of these being available in all six languages (these are mostly public documents), and the other half in some subset of the languages (these are generally internal documents).
Documents belong to a number of categories:
1. Official resolutions, decisions, statements and legal instruments: these are the outputs of the work by the bodies that define the official positions and provide internal and external guidance. These documents often carry significant legal weight and are widely published and distributed.
2. Reports: these are documents reporting work done; they often serve as inputs for the deliberations of organs, and provide assistance in decision making. A report may also be delivered by one body or organisational unit to another as a way of summarising the decisions taken by the originating unit; for example, the Secretary-General reports to the General Assembly on the work of the Secretariat.
3. Records of meetings and discussions: these may either be verbatim reports or summaries.
4. Letters and notes from the Member States to the organisations.
5. Internal records such as daily journals, agendas of work and draft resolutions.
6. Sales publications, such as books and key reports (e.g., World Economic Situation and Prospects 2009).
For non-repudiation reasons, changes and additions to the documents are recorded as separate corrigenda and addenda documents.
[3] We use TMX Version 1.4b: see http://www.lisa.org.
The General Assembly (GA) is the main deliberative organ of the United Nations, with a current membership of 192 Member States. Final deliberations of the General Assembly are made in the plenary sessions, but most of the work is done in one of the six main committees or many subcommittees, boards, commissions, working groups and other bodies. [4]
The official output of the GA is collected together as Official Records, which consist of resolutions, decisions and key reports. The GA meets in sessions described as regular, special and emergency special. Regular sessions start in September and last as long as required, often right until the start of the next session; for example, the 62nd regular session started on September 18, 2007 and lasted until September 15, 2008. Regular sessions are divided into two parts: a main session, which lasts until the end of the year, and a resumed session, which starts in January. Special and emergency special sessions have more flexibility in the organisation of their work.
[4] The six main committees of the General Assembly are: Disarmament and International Security; Economic and Financial; Social, Humanitarian and Cultural; Special Political and Decolonization; Administrative and Budgetary; and Legal.
Draft resolutions are introduced into the discussion under agenda items and may be amended, merged or withdrawn during the course of discussion. Draft resolutions adopted by a committee are then published in that committee’s report for further discussion and approval in one of the GA’s plenary meetings. Resolutions can be adopted at a plenary meeting by a vote or by acclamation. The GA adopts around 300 resolutions in any given session. While non-binding, resolutions of the General Assembly carry legal weight. Final resolutions and decisions of the General Assembly are issued in three volumes of the Official Records: Volume I contains resolutions from the main part of the session; Volume II contains decisions from the main part; and Volume III contains both resolutions and decisions from the resumed part of the session.
3 The Resolutions Corpus
Of the various types of documents present in the UN document collection, the resolutions are particularly interesting from a machine translation perspective because of their high quality of translation and strict adherence to editorial conventions; they also cover an extremely broad range of topics, whereas other document subsets are more focussed. We chose, therefore, to begin the construction of an NLP-friendly UN corpus with the resolutions data.
At around 300 per year, the final resolutions documents are a relatively small proportion of the total number of documents produced by the UN, but the full document count also includes draft resolutions and a large number of other documents that ultimately lead to the final resolutions.
The corpus described here consists of the resolutions in Volume I of the regular sessions of the General Assembly for sessions 55 through 62, corresponding to the period 2000–2007. Prior to this period, and going back to the UN’s first session in 1946, the only electronic versions of the resolutions available are scans with no text layers.
3.1 The Nature of Resolutions
An example of the initial fragment of a resolution is shown in Figure 1. A resolution consists of the following parts:
1. A symbol that identifies the resolution, consisting of a number corresponding to the session, and a number corresponding to the ordinal position of this resolution in the series of resolutions adopted in this session; in the present example, this is 59/34.
2. Information regarding the adoption of the resolution, which may include a list of the Member States that voted on the resolution.
3. The title of the resolution; in the present example this is Nationality of natural persons in relation to the succession of States.
4. The name of the organ stating the resolution; in the corpus described here, this is always The General Assembly.
5. Zero or more preambulatory paragraphs; these set the context for the rest of the resolution. By editorial convention, each of these begins with the present participle form of a verb or verb phrase in italics.
6. One or more operative paragraphs that make up the essence of the resolution: generally speaking, these are the actions the GA wants to see take place. Each is introduced by a present tense verb or verb phrase in italics; the specific choice of verb has some significance. By editorial convention, the operative paragraphs are numbered when there is more than one.
Figure 1: Initial fragment of a resolution [English]
Figure 2: Initial fragment of a resolution [Russian]
For comparison, the Russian translation of the initial part of this resolution is shown in Figure 2.
Resolutions have an unconventional syntactic and orthographic structure. In each language, from the name of the organ onwards, the resolution takes the form of a single, extended sentence, where the sentence is broken into a series of distinct paragraphs. Each orthographic paragraph is therefore really what we would normally think of as a complex clause. These characteristics, and a number of other typographic features, are dictated by the resolution editing conventions (United Nations, 1983).
Language   # tokens   # characters (M)
English    3067550    20.7
French     3442254    22.8
Spanish    3581566    22.9
Russian    2748898    22.0
Chinese    –          5.7
Arabic     2721463    17.2
Table 1: Corpus statistics for the six languages. Token count for the Chinese data is omitted because of the difficulty in providing a reliable or meaningful number for comparison purposes.
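The five-language average quoted in the footnote on token counts (3.11 million, excluding Chinese) can be checked directly against Table 1. The short sketch below only redoes that arithmetic; the dictionary and function names are illustrative and are not part of the corpus utilities.

```python
# Token counts copied from Table 1 (Chinese has no token count there).
TOKENS = {
    "English": 3067550,
    "French": 3442254,
    "Spanish": 3581566,
    "Russian": 2748898,
    "Arabic": 2721463,
}


def average_tokens(counts):
    """Return the mean token count over the given languages."""
    return sum(counts.values()) / len(counts)


AVERAGE = average_tokens(TOKENS)
print(f"{AVERAGE / 1e6:.2f} million tokens on average")
```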
While simple resolutions contain only the elements listed above, more complex resolutions can contain additional sections, often with their own titles and/or preambles. Furthermore, some resolutions contain annexes and embedded texts that may not follow the editorial conventions; and tables may also appear. [5]
3.2 The Content of the Corpus
For each language, the corpus contains just over 74000 paragraphs of text, and, for English, around 3 million tokens. Table 1 provides statistics on the data for the six languages; Figure 3 shows the 20 most common tokens that appear in five of the languages. The most frequent words in this corpus are consistent with those found in other corpora, with the unsurprising exception of the appearance of the terms United and Nations and a few other domain-specific elements.
Within the UN’s own document processing environment, the resolutions that make up Volume I are grouped together in seven large Microsoft Word files: six of these contain resolutions that came through one of the six main committees, and the seventh contains those that were introduced directly to the plenary meeting.
[5] These items are relatively rare. In the 2100-resolution corpus described here, 41 (1.95%) contain tables and 71 (3.38%) contain annexes; 27 (1.29%) of these contain both.
Figure 3: The 20 most frequent tokens in five of the languages
3.3 Building the Corpus
We extracted the individual resolutions from these files and converted them into basic HTML format using Word’s HTML export capability. The HTML output was then run through a cleanup process that allowed only basic paragraph marking and typographic markup (normally italics) at the start of preambulatory and operative paragraphs, indicating lead-in phrases. Additionally, some normalization was performed to account for the fact that formatting that looks continuous within MS Word may actually consist of multiple formatting segments at the HTML code level; for example, a contiguous sequence of italicised words may appear in the source as a sequence of distinct italicisation events. Inconsistencies in the use of quotes and non-breaking spaces were also normalized. Finally, tables were stripped from the text as they mostly contain numbers and form nested paragraph structures, which are difficult to represent in TMX form.
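The merging of fragmented italic runs and the character normalization described above can be sketched as follows. This is a minimal illustration rather than the cleanup code actually used: it assumes that italics survive the export as plain <i> elements, and the choice of ASCII targets for quotes and non-breaking spaces is likewise an assumption, since the paper does not say which forms were chosen.

```python
import re

# Adjacent italic runs separated only by whitespace,
# e.g. "<i>Calls</i> <i>upon</i>" (assumed shape of the exported HTML).
_SPLIT_ITALICS = re.compile(r"</i>(\s*)<i>")

# Assumed normalization targets: plain space and straight quotes.
_CHAR_MAP = str.maketrans({
    "\u00a0": " ",   # non-breaking space
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
})


def merge_italic_runs(html: str) -> str:
    """Collapse consecutive <i>...</i> runs into one run, keeping the whitespace."""
    return _SPLIT_ITALICS.sub(r"\1", html)


def normalize_spaces_and_quotes(text: str) -> str:
    """Replace non-breaking spaces and curly quotes with the assumed plain forms."""
    return text.translate(_CHAR_MAP)


MERGED = merge_italic_runs("<p><i>Calls</i> <i>upon</i> all States</p>")
print(MERGED)
```

A real cleanup pass would more likely work on a parsed HTML tree than on raw strings; the regular expression is used here only to keep the sketch short.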
From the HTML format, the multiple language versions for the same resolution symbol (the identification numbers introduced earlier) were aligned, using the assumption that the translations were strict at the level of formatting as well as at the level of content. In a small number of cases, the Word formatting caused problems (typically the introduction of spurious paragraph breaks); these were fixed manually. The aligned resolution texts were then converted into TMX format, while at the same time marking the adoption information section, incorporating and marking footnotes, and converting lead-in phrase marking into standard TMX markup. An example is provided in Figure 4. [6]
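Under the strict-formatting assumption, alignment reduces to pairing the n-th paragraph of each language version, with a mismatch in paragraph counts flagging a document for manual repair. The sketch below illustrates that idea and a bare-bones TMX serialisation using Python's standard xml.etree.ElementTree. The function names, the lowercase language codes and the header values are assumptions made for illustration; the special markup for adoption information, footnotes and lead-in phrases is not reproduced.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def align_paragraphs(versions):
    """Pair paragraphs by position; reject documents whose paragraph counts differ."""
    counts = {lang: len(paras) for lang, paras in versions.items()}
    if len(set(counts.values())) != 1:
        raise ValueError(f"paragraph counts differ: {counts}")
    langs = sorted(versions)
    rows = zip(*(versions[lang] for lang in langs))
    return [dict(zip(langs, row)) for row in rows]


def to_tmx(units, srclang="en"):
    """Build a minimal TMX tree with one <tu> per aligned paragraph."""
    tmx = ET.Element("tmx", version="1.4")
    # Header values below are placeholders, not those of the released corpus.
    ET.SubElement(tmx, "header", {
        "creationtool": "example", "creationtoolversion": "0",
        "segtype": "paragraph", "o-tmf": "none", "adminlang": "en",
        "srclang": srclang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for unit in units:
        tu = ET.SubElement(body, "tu")
        for lang, text in unit.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


UNITS = align_paragraphs({
    "en": ["The General Assembly,", "Recalling its resolution 55/153,"],
    "fr": ["L'Assemblée générale,", "Rappelant sa résolution 55/153,"],
})
TMX = to_tmx(UNITS)
print(ET.tostring(TMX, encoding="unicode"))
```

Raising an error on a count mismatch, rather than guessing at a repair, mirrors the manual fixing of spurious paragraph breaks mentioned above.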
Given the availability of a number of existing tokenisers that users might wish to apply to the texts for different purposes, we have not carried out a complete tokenization of the corpus. However, we have marked document symbols, since these tokens might be problematic for standard tokenisers.
Figure 4: A six-way aligned resolution paragraph.
[6] Note that the Arabic token ordering displays incorrectly here as a consequence of the XML labels.
4 Interesting Properties of the Corpus
4.1 Document Symbols
Because of their importance, it is essential that the scope for misinterpretation of resolutions be minimised. To this end, these documents generally contain all relevant context and make heavy use of fully-explicit and unambiguous document symbol references. Similar to complex symbols in the biological domain (Proux et al., 1998), these symbols may include slashes, dashes, full stops, brackets and spaces. Some examples are shown in Table 2.
Symbol used            Document description
A/62/100               A document from the 62nd session of the General Assembly
52/215 A to D          A resolution from the 52nd session of the General Assembly
1325 (2000)            A Security Council resolution adopted in the year 2000
E/CN.4/1998/53/Add.2   Second addendum to the document of the Commission on Human Rights (CN.4) of the Economic and Social Council
A/50/60-S/1995/1       A document with a dual symbol, one from the General Assembly and one from the Security Council
GC(50)/RES/16          A resolution from the 50th session of the International Atomic Energy Agency
Table 2: Document symbols
While it is not practical to identify all possible symbol variations, [7] we have developed a set of regular expressions to locate and mark a significant proportion of the symbols in the corpus.
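To give a flavour of what such regular expressions might look like, the sketch below matches the six forms listed in Table 2. It is an illustrative pattern written for this purpose, not the set of expressions used to build the corpus, and it would both miss real symbols and produce false positives (a numeric date such as 1/1/2000, for instance) on unrestricted text.

```python
import re

# One slash-delimited segment: alphanumerics, an optional "(50)"-style
# session marker, and optional dotted parts such as "CN.4" or "Add.2".
SEG = r"[A-Za-z0-9]+(?:\(\d+\))?(?:\.[A-Za-z0-9]+)*"
SLASHED = rf"{SEG}(?:/{SEG})+"

SYMBOL = re.compile(
    # Slashed symbols must start with a capital or digit and contain a digit,
    # which filters out strings such as "and/or"; an optional dual symbol
    # ("-S/1995/1") and an optional part range (" A to D") may follow.
    rf"\b(?:(?=[A-Z0-9])(?=\S*\d){SLASHED}(?:-{SLASHED})?(?: [A-Z] to [A-Z])?"
    # Security Council style: number followed by a year in parentheses.
    rf"|\d{{1,4}} \((?:19|20)\d{{2}}\))"
)


def find_symbols(text):
    """Return all substrings that look like UN document symbols."""
    return [m.group(0) for m in SYMBOL.finditer(text)]


SAMPLE = (
    "Recalling its resolutions 52/215 A to D and 1325 (2000), and documents "
    "A/62/100, E/CN.4/1998/53/Add.2, A/50/60-S/1995/1 and GC(50)/RES/16, "
    "and/or other texts."
)
FOUND = find_symbols(SAMPLE)
print(FOUND)
```

Requiring each dotted part to end in an alphanumeric character keeps sentence-final full stops and commas out of the match.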
4.2 Lead-in Phrases
As noted above, preambulatory and operative paragraphs begin with specially-marked lead-in phrases based on verbs whose meaning carries some significance. While there are no official guidelines on what constitutes an acceptable phrase, the requirements of reliable translation tend to limit the chosen words and phrase forms to a number of popular choices used in the majority of the cases. Table 3 shows the ten most frequent lead-in phrases across three of the six languages.
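Frequency counts of this kind can be gathered with a few lines of code once the lead-in phrase is marked. The sketch below assumes, purely for illustration, that each paragraph is available as a string in which the lead-in is the opening <i>...</i> run, optionally preceded by an operative-paragraph number; the markup actually used in the TMX files may differ.

```python
import re
from collections import Counter

# Assumption: the lead-in phrase is the italic run that opens the paragraph,
# optionally preceded by an operative-paragraph number such as "3. ".
_LEAD_IN = re.compile(r"^\s*(?:\d+\.\s*)?<i>(.+?)</i>")


def lead_in_phrase(paragraph):
    """Return the lead-in phrase of a paragraph, or None if it has none."""
    match = _LEAD_IN.match(paragraph)
    return match.group(1).strip() if match else None


def count_lead_ins(paragraphs):
    """Count lead-in phrases over a sequence of paragraphs."""
    phrases = (lead_in_phrase(p) for p in paragraphs)
    return Counter(p for p in phrases if p)


PARAGRAPHS = [
    "The General Assembly,",
    "<i>Recalling</i> its resolution 55/153,",
    "1. <i>Requests</i> the Secretary-General to report on the matter;",
    "2. <i>Requests</i> Member States to submit their comments;",
]
COUNTS = count_lead_ins(PARAGRAPHS)
print(COUNTS.most_common(2))
```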
[7] Griffiths (2005) presents an analysis of 14000 symbols assigned in one year, and identifies a number of flaws that include inconsistent application of editorial conventions.
4.3 Named Entity Mentions
United Nations documentation in general, and the resolutions of the General Assembly in particular, include a large number of complex named entity mentions referring to a broad variety of entity types. Here are some examples, separated by semi-colons:
Bodies: United Nations; International Atomic Energy Agency; United Nations Educational, Scientific and Cultural Organization.
Organisational units: General Assembly; Economic and Social Council; the Advisory Committee on the United Nations Programme of Assistance in the Teaching, Study, Dissemination and Wider Appreciation of International Law; Open-ended Ad Hoc Working Group on the Causes of Conflict and the Promotion of Durable Peace and Sustainable Development in Africa.
Agents: Secretary-General; United Nations Special Representative for Children and Armed Conflict; the Special Rapporteur of the Commission on Human Rights.
As can be seen from these examples, named entity mentions can be very long and may contain tokens, such as commas, that are normally treated as delimiters.
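The difficulty with delimiter tokens can be made concrete with a small example. The sketch below is only an illustration of the problem, using a hypothetical four-entry gazetteer drawn from the examples above: splitting on commas breaks one of the names apart, whereas a longest-match lookup keeps it whole.

```python
import re

# Hypothetical mini-gazetteer built from the examples in this section.
GAZETTEER = [
    "United Nations",
    "United Nations Educational, Scientific and Cultural Organization",
    "International Atomic Energy Agency",
    "Economic and Social Council",
]

# Longer names come first in the alternation so that "United Nations"
# does not pre-empt the longer name that starts with the same words.
_ENTITY = re.compile(
    "|".join(re.escape(name) for name in sorted(GAZETTEER, key=len, reverse=True))
)


def find_entities(text):
    """Return gazetteer names found in the text, preferring the longest match."""
    return [m.group(0) for m in _ENTITY.finditer(text)]


TEXT = (
    "the United Nations Educational, Scientific and Cultural Organization, "
    "the International Atomic Energy Agency and the United Nations"
)
NAIVE = TEXT.split(", ")        # comma splitting cuts the first name in two
ENTITIES = find_entities(TEXT)  # longest match keeps it intact
print(NAIVE)
print(ENTITIES)
```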
5 Corpus Availability
The corpus described here is available from http://www.uncorpora.org as a 49.4 MB zip file that contains just over 74000 pre-processed, formatting-normalised aligned paragraphs in Translation Memory eXchange (TMX) 1.4b format. As noted earlier, special markup is included for document symbols and the lead-in phrases in preambulatory and operative paragraphs; footnote content is also marked. Tables have been removed. The voting information is also marked specially, as it contains country lists in alphabetical order, and may not be particularly useful for alignment purposes.
Basic utilities are provided to manipulate, extract, or delete specially tagged areas as well as to extract specific languages from the six-language set.
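Readers who prefer to work without the supplied utilities can pull a language pair out of a TMX file with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a tiny inline sample; it is not one of the distributed utilities, and the language codes shown ("EN", "FR", "ES") are assumptions about how the corpus labels its variants.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny stand-in for a corpus file; the language codes are assumed.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="EN"><seg>The General Assembly,</seg></tuv>
<tuv xml:lang="FR"><seg>L'Assemblée générale,</seg></tuv>
<tuv xml:lang="ES"><seg>La Asamblea General,</seg></tuv></tu>
</body></tmx>"""


def extract_pairs(tmx_root, src, tgt):
    """Return (src, tgt) text pairs for every <tu> that has both languages."""
    pairs = []
    for tu in tmx_root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segs[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext())
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


PAIRS = extract_pairs(ET.fromstring(SAMPLE), "en", "fr")
print(PAIRS)
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`; for a file of this size a streaming parser such as `ET.iterparse` may be preferable.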
Rank  English            French             Spanish
1     3707 Requests      3509 Prie          3905 Pide
2     3325 Recalling     3383 Rappelant     3323 Recordando
3     2177 Calls upon    1973 Demande       1877 Exhorta
4     1927 Welcomes      1870 Décide        1871 Insta
5     1797 Decides       1738 Souligne      1796 Decide
6     1688 Urges         1660 Invite        1587 Alienta
7     1604 Encourages    1446 Réaffirme     1502 Reconociendo
8     1402 Invites       1407 Réaffirmant   1396 Invita
9     1350 Recognizing   1361 Se félicite   1291 Reafirmando
10    1269 Reaffirming   1273 Encourage     1257 Reafirma
Table 3: The 10 most frequent lead-in phrases in the three languages
6 Conclusions
In this paper we have described a unique six-language parallel corpus consisting of 2100 UN resolutions, multiply-aligned and marked-up for a number of constituent phenomena. The variety of language families present is of particular interest for work based on the use of bridge languages (Kumar et al., 2007).
We see this as the first step in the construction of a constantly growing corpus of aligned documents harvested from the UN’s document collection. We encourage wide use of the corpus; our sponsors are likely to support further extension if there is a perceived value in its availability.
Acknowledgments
We would like to thank the Department for General Assembly and Conference Management of the United Nations Secretariat for providing access to the source documents.
References
D Graff. 1994. UN parallel text (complete). Linguistic Data Consortium, Philadelphia.
D Griffiths. 2005. The United Nations classification scheme: A critique and recommendations. Cataloging and Classification Quarterly, 40(1):19–41.
P Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, Phuket, Thailand, September.
Shankar Kumar, Franz J. Och, and Wolfgang Macherey. 2007. Improving word alignment with bridge languages. In Proceedings of the 2007 Joint EMNLP-CoNLL Conference, pages 42–50, Prague, Czech Republic, June. Association for Computational Linguistics.
Ruslan Mitkov Le An Ha, Gabriela Fernandez and Gloria Corpas. 2008. Mutual bilingual terminology extraction. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, May.
D Proux, F Rechenmann, L Julliard, V Pillet, and B Jacq. 1998. Detecting gene symbols and names in biological texts: A first step toward pertinent information extraction. In S. Miyano and T. Takagi, editors, Genome Informatics: Workshop on Genome Informatics, volume 9, pages 72–80, Tokyo, Japan, December.
Michel Simard. 1999. Text-translation alignment: Three languages are better than two. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 2–11.
STAI. 1987. UN Administrative Instruction ST/AI/189/Add.9/Rev.2.
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, and Daniel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), pages 2142–2147, Genoa, Italy, May.
United Nations. 1983. United Nations Editorial Manual. Department of Conference Services ST/DCS/2, Sales No. E.83.I.16.
David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research, pages 1–8, Morristown, NJ, USA. Association for Computational Linguistics.
  • ... In order to carry out this discourse comparative study, a Spanish-Chinese parallel corpus is necessary. Currently, Spanish-Chinese parallel corpora without annotation exist (Resnik, Olsen & Diab, 1999; Rafalovitch & Dale, 2009; Eisele & Chen, 2010; Wang et al., 2013). These parallel corpora do not have any discourse information, neither syntactic information nor part-of-speech (POS) information. ...
    ... As we have mentioned in the previous section, there are currently few parallel Spanish-Chinese corpora. To our knowledge, the already existing parallel corpora are: (a) The Holy Bible (Resnik, Olsen & Diab, 1999), (b) The United Nations Multilingual Corpus (UN) (Rafalovitch & Dale, 2009) and (c) Sina Weibo Parallel Corpus (Wang et al., 2013). An analysis of each corpus will be given in this section, with the purpose of explaining why they are not adequate for translation and language learning purposes between Spanish and Chinese. ...
    Article
    Full-text available
    Spanish and Chinese are two very different languages in all language levels. Therefore, translation (both human and machine translation) from one to another and learning one of them as a foreign language are challenging tasks. Some automatic translate systems exist for this pair of languages, but there is enough room to improve the translation quality between Spanish and Chinese. Besides, the accessible sources, such as a parallel corpus for studying and understanding this language pair, are still few. In this paper, we present how we have created a Spanish-Chinese parallel corpus designed for translation and language learning tasks at the discourse level. This corpus has been enriched automatically with part-of-speech (POS) and several queries based on morpho-syntactic information can be done. We have made available the parallel corpus to the academic community.
  • ... Then, based on the activity results, they give some linguistic suggestions for English-Spanish teaching, which can also help the English- Spanish language learners to comprehend the language differences between both languages. On the other hand, corpus-based studies for Spanish-Chinese language learning are still few: v) Cao, da Cunha and Bel (2016) annotate all the cases of the Spanish DM aunque ('although') and their corresponding Chinese translations in The United Nations Multilingual Corpus (UN) (Rafalovitch and Dale, 2009; Eisele and Yu, 2010). They analyze the used translation strategies in the translation process and give some suggestions for how to translate this Spanish DM into Chinese. ...
    ... They analyze the used translation strategies in the translation process and give some suggestions for how to translate this Spanish DM into Chinese. vi) Several Spanish-Chinese parallel corpus exist and have been used for different research purposes, including Spanish-Chinese language learning, these corpora are: (a) The Holy Bible (Resnik, Olsen and Diab, 1999), (b) The United Nations Multilingual Corpus (UN) (Rafalovitch and Dale, 2009; Eisele and Yu, 2010) and (c) Sina Weibo Parallel Corpus (Wang et al., 2013). The above mentioned works are great achievements that offer different approaches for language learning. ...
    Conference Paper
    Full-text available
    Due to the huge population that speaks Spanish and Chinese, these languages occupy an important position in the language learning studies. Although there are some automatic translation systems that benefit the learning of both languages, there is enough space to create resources in order to help language learners. As a quick and effective resource that can give large amount language information, corpus-based learning is becoming more and more popular. In this paper we enrich a Spanish-Chinese parallel corpus automatically with part-of-speech (POS) information and manually with discourse segmentation (following the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988)). Two search tools allow the Spanish-Chinese language learners to carry out different queries based on tokens and lemmas. The parallel corpus and the research tools are available to the academic community. We propose some examples to illustrate how learners can use the corpus to learn Spanish and Chinese.
  • ... The corpus used in this work is "The United Nations General Assembly Resolutions: A Six-Language Parallel Corpus", created by [25]. This corpus consists of a collection of documents containing official resolutions of the General Assembly of the United Nations translated manually into the six official languages: Arabic, Chinese, English, French, Russian and Spanish. ...
    Conference Paper
    Full-text available
    This work focuses on the extraction of collocations from Arabic textual corpora. The method that we propose is hybrid. It uses syntactico-semantic information by combining an enriched linguistic filtering with a statistical filtering. Linguistic filtering applies syntactic patterns represented by finite state automata and considers both elementary collocations and augmented ones. Statistical filtering is novel since it does not use classical association measures and applies the Latent Semantic Analysis (LSA) method to infer deeper semantic relations than those derived by contiguity frequencies, co-occurrence counts, or correlations in usage. The experiments showed that LSA gives globally better results than those achieved by Mutual Information (MI), Likelihood ration (LLR), Khi-Carré (X2), z score and t-test.
  • ... However, each stem or word in this ontology can have multi-labels or multi descriptions those we can use to add the Arabic translation to each stem and can be used in bilingual processing. Rafalovitch et al.[88], described a six-ways parallel public-domain corpus from which many of the available ontology has started. This six-ways parallel public-domain corpus contains vocabularies of six languages: Arabic, Chinese, English, French, Russian, and Spanish. ...
    Article
    Full-text available
    As a result of the rapid changes in information and communication technology (ICT), the world has become a small village where people from all over the world connect with each other in dialogue and communication via the Internet. Also, communications have become a daily routine activity due to the new globalization where companies and even universities become global residing cross countries' borders. As a result, translation becomes a needed activity in this connected world. ICT made it possible to have a student in one country take a course or even a degree from a different country anytime anywhere easily. The resulted communication still needs a language as a means that helps the receiver understands the contents of the sent message. People need an automated translation application because human translators are hard to find all the times, and the human translations are very expensive comparing to the translations automated process. Several types of research describe the electronic process of the Machine-Translation. In this paper, the authors are going to study some of these previous researches, and they will explore some of the needed tools Original Research Article Alsohybe et al.; CJAST, 23(4): 1-19, 2017; Article no.CJAST.36124 2 for the Machine-Translation. This research is going to contribute to the Machine-Translation area by helping future researchers to have a summary for the Machine-Translation groups of research and to let lights on the importance of the translation mechanism.
  • ... However, each stem or word in this ontology can have multi-labels or multi descriptions those we can use to add the Arabic translation to each stem and can be used in bilingual processing. Rafalovitch et al.[88], described a six-ways parallel public-domain corpus from which many of the available ontology has started. This six-ways parallel public-domain corpus contains vocabularies of six languages: Arabic, Chinese, English, French, Russian, and Spanish. ...
    Article
    Full-text available
    As a result of the rapid changes in information and communication technology (ICT), the world has become a small village where people from all over the world connect with each other in dialogue and communication via the Internet. Also, communications have become a daily routine activity due to the new globalization where companies and even universities become global residing cross countries' borders. As a result, translation becomes a needed activity in this connected world. ICT made it possible to have a student in one country take a course or even a degree from a different country anytime anywhere easily. The resulted communication still needs a language as a means that helps the receiver understands the contents of the sent message. People need an automated translation application because human translators are hard to find all the times, and the human translations are very expensive comparing to the translations automated process. Several types of research describe the electronic process of the Machine-Translation. In this paper, the authors are going to study some of these previous researches, and they will explore some of the needed tools Original Research Article Alsohybe et al.; CJAST, 23(4): 1-19, 2017; Article no.CJAST.36124 2 for the Machine-Translation. This research is going to contribute to the Machine-Translation area by helping future researchers to have a summary for the Machine-Translation groups of research and to let lights on the importance of the translation mechanism.
  • ... Although the idea of crawling the web indiscriminately for parallel data goes back to the 20th century (Resnik, 1999), work in the academic community on extraction of parallel corpora from the web has so far mostly focused on large stashes of multilingual content in homogeneous form, such as the Canadian Hansards, Europarl (Koehn, 2005), the United Nations (Rafalovitch and Dale, 2009; Ziemski et al., 2015), or European Patents (Täger, 2011). A nice collection of the products of these efforts is the OPUS web site (Skadiņš et al., 2014). ...
  • ... A breakdown of the preprocessing tools used for each language is given in Table 2. Data sets. In order to generate training data for POLYGLOT, we used the following sources of parallel data: The UN corpus of official United Nations documents (Rafalovitch et al., 2009), the Europarl corpus of European parliament proceedings (Koehn, 2005), the OpenSubtitles corpus of movie subtitles (Lison and Tiedemann, 2016), the Hindencorp corpus automatically gathered from web sources (Bojar et al., 2014) and the Tatoeba corpus of language learning examples. The data sets were obtained from the OPUS project (Tiedemann, 2012) and word aligned using the Berkeley Aligner. ...
  • ... The embeddings are trained using the default parameters described by the authors and with a dimension of 200. We constructed an English-Spanish parallel corpus from Europarl (Koehn, 2005), UN (Rafalovitch et al., 2009), data sets of the quality estimation shared task in WMT 2012 (Callison-Burch et al., 2012), as well as the training data of the monolingual STS task from previous years (See Subsection 3.1) and used this data to train our bilingual embeddings. Sentence embeddings are then generated by averaging the word embeddings in each sentence. ...
  • Article
    The availability of corpora is a major factor in building natural language processing applications. However, the costs of acquiring corpora can prevent some researchers from going further in their endeavours. Ease of access to freely available corpora is urgently needed in the NLP research community, especially for languages such as Arabic. Currently, there is no easy way to access a comprehensive and updated list of freely available Arabic corpora. We present in this paper the results of a recent survey conducted to identify the freely available Arabic corpora and language resources. Our preliminary results showed an initial list of 66 sources. We present our findings in the various categories studied and provide direct links to the data when possible.
  • Article
    In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We start in the article with a background description that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text to guarantee a variety of representative domains and genres. We also discuss the annotation process we used, which extended previous efforts for annotation guideline development, and utilized existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation meta-data are described in detail. The Curras Palestinian Arabic corpus consists of more than 56 K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.
  • Article
    Anecdotal evidence suggests that dissatisfaction with the United Nations Classification Scheme (UNCS), a notational system in continuous use since 1946, has been widespread among researchers and government information specialists. Through the examination of over fourteen thousand document symbols assigned over the course of a year, this study identifies flaws in the notation that have limited its effectiveness. The criteria for this evaluation, which are drawn from both archival and library classification literature, include simplicity, the appropriate use of mnemonics, brevity, serial piece collocation, and the appropriate representation of administrative origin. The author concludes that the scheme satisfies none of these criteria consistently, due in part to the lack of centralized control over its development, and offers recommendations for correcting its defects.
  • Article
    We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
  • Conference Paper
    We describe an approach to improve Statistical Machine Translation (SMT) performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. Our approach consists of a simple method for utilizing a bridge language to create a word alignment system and a procedure for combining word alignment systems from multiple bridge languages. The final translation is obtained by consensus decoding that combines hypotheses obtained using all bridge language word alignments. We present experiments showing that multilingual, parallel text in Spanish, French, Russian, and Chinese can be utilized in this framework to improve translation performance on an Arabic-to-English task.
  • Conference Paper
    This paper investigates the potential for projecting linguistic annotations including part-of-speech tags and base noun phrase bracketings from one language to another via automatically word-aligned parallel corpora. First, experiments assess the accuracy of unmodified direct transfer of tags and brackets from the source language English to the target languages French and Chinese, both for noisy machine-aligned sentences and for clean hand-aligned sentences. Performance is then substantially boosted over both of these baselines by using training techniques optimized for very noisy data, yielding 94-96% core French part-of-speech tag accuracy and 90% French bracketing F-measure for stand-alone monolingual tools trained without the need for any human-annotated data in the given language.
  • Conference Paper
    This paper describes a novel methodology to perform bilingual terminology extraction, in which automatic alignment is used to improve the performance of terminology extraction for each language. The strengths of monolingual terminology extraction for each language are exploited to improve the performance of terminology extraction in the other language, thanks to the availability of a sentence-level aligned bilingual corpus, and an automatic noun phrase alignment mechanism. The experiment indicates that weaknesses in monolingual terminology extraction due to the limitation of resources in certain languages can be overcome by using another language which has no such limitation.
  • Article
    In this article, we show how a bilingual text-translation alignment method can be adapted to deal with more than two versions of a text. Experiments on a trilingual corpus demonstrate that this method yields better bilingual alignments than can be obtained with bilingual text-alignment methods. Moreover, for a given number of texts, the computational complexity of the multilingual method is the same as for bilingual alignment.
  • Article
    Gathering data on molecular interactions to be fed into a specialized database has motivated the development of a computer system to help extracting pertinent information from texts, relying on advanced linguistic tools, completed with object-oriented knowledge modeling capabilities. As a first step toward this challenging objective, a program for the identification of gene symbols and names inside sentences has been devised. The main difficulty is that these names and symbols do not appear to follow construction rules. The program is thus made up of a series of sieves of different natures, lexical, morphological and semantic, to distinguish among the words of a sentence those which can only be potential gene symbols or names. Its performance has been evaluated, in terms of coverage and precision ratios, on a corpus of texts concerning D. melanogaster for which the list of names of known genes is available for checking.
  • Article
    This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections. Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system. This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection. Keywords: multilingual, text analysis, part-of-speech tagging, noun phrase bracketing, named entity, morphology, lemmatization, parallel corpora. 1. Task Overview: A fundamental roadblock to developing statistical taggers, bracketers and other analyzers for many of the world's 200 major languages is the shortage or absence of annotated training data for the large majority of these languages. Ideally, one would like to lever- ...
  • Article
    We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable to carry out all types of cross-language research, as well as to test and benchmark text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).
  • United Nations. 1983. United Nations Editorial Manual. Department of Conference Services ST/DCS/2, Sales No. E.83.I.16.