
United Nations General Assembly Resolutions: A Six-Language Parallel Corpus

  • Language Technology Group

Abstract

In this paper we describe a six-way parallel public-domain corpus consisting of 2100 United Nations General Assembly Resolutions with translations in the six official languages of the United Nations, with an average of around 3 million tokens per language. The corpus is available in a pre-processed, formatting-normalized TMX format with paragraphs aligned across multiple languages. We describe the background to the corpus and its content, the process of its construction, and some of its interesting properties.
United Nations General Assembly Resolutions:
A Six-Language Parallel Corpus
Alexandre Rafalovitch1,2
1United Nations
New York
Robert Dale2
2Centre for Language Technology
Macquarie University
Sydney, Australia
1 Introduction
Parallel corpora are a useful resource for a wide
variety of purposes including the training of ma-
chine translation algorithms (Koehn, 2005), multi-
lingual terminology extraction (Le An Ha and Cor-
pas, 2008) and even bootstrapping algorithms for
languages that do not enjoy the research resources of
English (Yarowsky et al., 2001). Multi-parallel cor-
pora can be more useful than bilingual corpora when
the additional languages may be used to assist text
alignment (Simard, 1999) or as translation bridge
languages, in the sense of (Kumar et al., 2007).
While many European and Germanic languages
already have good parallel research corpus re-
sources, such as JRC-Acquis (Steinberger et al.,
2006) and EuroParl (Koehn, 2005), material in the
Slavic, Sino-Tibetan and Semitic language families
is much rarer. The corpus presented here consists
of a collection of documents containing manually
translated official resolutions of the General Assem-
bly of the United Nations (UN) in the six official lan-
guages of the UN: Arabic, Chinese, English, French,
Russian, and Spanish. Since resolutions are legally
significant, they pass through multiple levels of hu-
man translation and verification, and so the transla-
tions can be expected to be of high quality.
While documents of the United Nations are al-
ready available online via the Official Document
System of the UN and are mostly in the public domain
(see STAI, 1987), they are not in the most
convenient form for machine processing. The docu-
ments are typically PDF files with reasonably com-
plex typography consisting of two-column text and
extended footnotes, and some are available only as
images without a text layer. The last public research-
ready corpus of UN documents was produced by
the Linguistic Data Consortium (LDC; see Graff,
1994) and included English, French and Spanish
only. Given the linguistic variety encompassed by
the UN’s official languages, it is therefore somewhat
disappointing that this material is not easily avail-
able for research in machine translation.
This paper describes some steps towards address-
ing this concern. The corpus described here con-
tains the text of 2100 resolutions for each language
aligned at the level of paragraphs, with just over
74000 paragraphs in each language. The corpus
contains an average of around 3 million tokens for
each language.2 The corpus is encoded in XML
using Translation Memory eXchange format, with
some of the significant sections and text segments
marked to assist future research. TMX format was
selected as a storage format as it is a standard used
in Computer-Assisted Translation tools and has a
structure that, while simple, is nonetheless sufficient
for our needs.3
2 Counting tokens in the Chinese corpus is difficult; not
including the Chinese data, the average token count across
the other five languages is 3.11 million.
In this paper, we first provide in Section 2 some
background information on the context in which the
documents in this corpus appear, before going on
to describe the corpus itself and the motivation for
the selection of this subset of UN documents in Sec-
tion 3. Section 4 discusses a number of interesting
properties exhibited by the corpus that may encour-
age specific research directions. Section 5 provides
details regarding access and availability.
2 The UN Document Space
In order to understand the nature of this corpus, it
may be useful to first understand the overall struc-
ture of the United Nations and the range of docu-
ments that the organisation produces.
The United Nations is a large international organ-
isation consisting of six principal organs: the Gen-
eral Assembly, the Security Council, the Economic
and Social Council, the Trusteeship Council, the
International Court of Justice, and the Secretariat.
These, together with a number of other agencies,
programmes and bodies (such as UNICEF, UNITAR
and UNU), form the United Nations family.
Collectively, these bodies generate a massive
quantity of documentation: over the last 15 years,
about 14000 documents have been published each
year, with around half of these being available in all
six languages (these are mostly public documents),
and the other half in some subset of the languages
(these are generally internal documents).
Documents belong to a number of categories:
1. Official resolutions, decisions, statements and
legal instruments: these are the outputs of the
work by the bodies that define the official
positions and provide internal and external
guidance. These documents often carry significant
legal weight and are widely published and
disseminated.
3 We use TMX Version 1.4b: see
2. Reports: these are documents reporting work
done; they often serve as inputs for the deliber-
ations of organs, and provide assistance in de-
cision making. A report may also be delivered
by one body or organisational unit to another
as a way of summarising the decisions taken by
the originating unit; for example, the Secretary-
General reports to the General Assembly on the
work of the Secretariat.
3. Records of meetings and discussions: these
may either be verbatim reports or summaries.
4. Letters and notes from the Member States to
the organisations.
5. Internal records such as daily journals, agendas
of work and draft resolutions.
6. Sales publications, such as books and key
reports (e.g., World Economic Situation and
Prospects 2009).
For non-repudiation reasons, changes and additions
to the documents are recorded as separate corrigenda
and addenda documents.
The General Assembly (GA) is the main delib-
erative organ of the United Nations, with a current
membership of 192 Member States. Final delib-
erations of the General Assembly are made in the
plenary sessions, but most of the work is done in
one of the six main committees or many subcom-
mittees, boards, commissions, working groups and
other bodies.4
The official output of the GA is collected together
as Official Records, which consist of resolutions, de-
cisions and key reports. The GA meets in sessions
described as regular, special and emergency special.
Regular sessions start in September and last as long
as required, often right until the start of the next ses-
sion; for example, the 62nd regular session started
on September 18, 2007 and lasted until September
15, 2008. Regular sessions are divided into two
parts: a main session, which lasts until the end of
the year, and a resumed session, which starts in Jan-
uary. Special and emergency special sessions have
more flexibility in the organisation of their work.
4 The six main committees of the General Assembly are:
Disarmament and International Security; Economic and
Financial; Social, Humanitarian and Cultural; Special
Political and Decolonization; Administrative and
Budgetary; and Legal.
Draft resolutions are introduced into the discus-
sion under agenda items and may be amended,
merged or withdrawn during the course of discus-
sion. Draft resolutions adopted by a committee are
then published in that committee’s report for further
discussion and approval in one of the GA’s plenary
meetings. Resolutions can be adopted at a plenary
meeting by vote or by acclamation. The GA
adopts around 300 resolutions in any given session.
While non-binding, resolutions of the General As-
sembly carry legal weight. Final resolutions and de-
cisions of the General Assembly are issued in three
volumes of the Official Records: Volume I contains
resolutions from the main part of the session;
Volume II contains decisions from the main part; and
Volume III contains both resolutions and decisions
from the resumed part of the session.
3 The Resolutions Corpus
Of the various types of documents present in the
UN document collection, the resolutions are partic-
ularly interesting from a machine translation per-
spective because of their high quality of transla-
tion and strict adherence to editorial conventions;
they also cover an extremely broad range of topics,
whereas other document subsets are more focussed.
We chose, therefore, to begin the construction of an
NLP-friendly UN corpus with the resolutions data.
At around 300 per year, the final resolutions doc-
uments are a relatively small proportion of the to-
tal number of documents produced by the UN, but
the full document count also includes draft resolu-
tions and a large number of other documents that ul-
timately lead to the final resolutions.
The corpus described here consists of the reso-
lutions in Volume I of the regular sessions of the
General Assembly for sessions 55 through 62, cor-
responding to the period 2000–2007. Prior to this
period, and going back to the UN’s first session in
1946, the only electronic versions of the resolutions
available are scans with no text layers.
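The correspondence between session numbers and years stated above (sessions 55 through 62 for 2000–2007, and a first session in 1946) can be checked with a short sketch. It assumes that exactly one regular session opens in each calendar year; the function name is illustrative and is not part of the corpus utilities.

```python
def session_to_year(session: int) -> int:
    """Return the year in which a regular GA session opens.

    Assumption: one regular session per calendar year, with the
    first session opening in 1946.
    """
    if session < 1:
        raise ValueError("session numbers start at 1")
    return 1945 + session


# Sessions covered by the corpus and the years in which they opened.
corpus_years = {s: session_to_year(s) for s in range(55, 63)}
print(corpus_years[55], corpus_years[62])
```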
3.1 The Nature of Resolutions
An example of the initial fragment of a resolution
is shown in Figure 1.
Figure 1: Initial fragment of a resolution [English]
A resolution consists of the following parts:
1. A symbol that identifies the resolution, consisting
of a number corresponding to the session, and a
number corresponding to the ordinal position of this
resolution in the series of resolutions adopted in
this session; in the present example, this is 59/34.
2. Information regarding the adoption of the reso-
lution, which may include a list of the Member
States that voted on the resolution.
3. The title of resolution; in the present example
this is Nationality of natural persons in relation
to the succession of States.
4. The name of the organ stating the resolution;
in the corpus described here, this is always The
General Assembly.
5. Zero or more preambulatory paragraphs; these
set the context for the rest of the resolution. By
editorial convention, each of these begins with
the present participle form of a verb or verb
phrase in italics.
6. One or more operative paragraphs that make
up the essence of the resolution: generally
speaking, these are the actions the GA wants to
see take place. Each is introduced by a present
tense verb or verb phrase in italics; the specific
choice of verb has some significance. By
editorial convention, the operative paragraphs are
numbered when there is more than one.
For comparison, the Russian translation of the initial
part of this resolution is shown in Figure 2.
Figure 2: Initial fragment of a resolution [Russian]
Resolutions have an unconventional syntactic and
orthographic structure. In each language, from the
name of the organ onwards, the resolution takes the
form of a single, extended sentence, where the sen-
tence is broken into a series of distinct paragraphs.
Each orthographic paragraph is therefore really what
we would normally think of as a complex clause.
These characteristics, and a number of other typo-
graphic features, are dictated by the resolution edit-
ing conventions (United Nations, 1983).
Language   Tokens     Characters (millions)
English    3067550    20.7
French     3442254    22.8
Spanish    3581566    22.9
Russian    2748898    22.0
Chinese    –          5.7
Arabic     2721463    17.2
Table 1: Corpus statistics for the six languages. Token
count for the Chinese data is omitted because of the
difficulty in providing a reliable or meaningful number
for comparison purposes.
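As a quick check on the figures in Table 1, the average token count over the five languages with a reported count can be recomputed directly. The snippet below is only a worked calculation using the table's numbers.

```python
# Token counts copied from Table 1 (Chinese omitted, as in the table).
token_counts = {
    "English": 3067550,
    "French": 3442254,
    "Spanish": 3581566,
    "Russian": 2748898,
    "Arabic": 2721463,
}

average = sum(token_counts.values()) / len(token_counts)
print(f"{average / 1e6:.2f} million tokens on average")
```

The result agrees with the 3.11 million quoted for the five languages other than Chinese.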
While simple resolutions contain only the ele-
ments listed above, more complex resolutions can
contain additional sections, often with their own ti-
tles and/or preambles. Furthermore, some resolu-
tions contain annexes and embedded texts that may
not follow the editorial conventions; and tables may
also appear.5
3.2 The Content of the Corpus
For each language, the corpus contains just over
74000 paragraphs of text, and, for English, around
3 million tokens. Table 1 provides statistics on the
data for the six languages; Figure 3 shows the 20
most common tokens that appear in five of the lan-
guages. The most frequent words in this corpus are
consistent with those found in other corpora, with
the unsurprising exception of the appearance of the
terms United and Nations and a few other
domain-specific elements.
Within the UN’s own document processing en-
vironment, the resolutions that make up Volume I
are grouped together in seven large Microsoft Word
files: six of these contain resolutions that came
through one of the six main committees, and the sev-
enth contains those that were introduced directly to
the plenary meeting.
5 These items are relatively rare. In the 2100-resolution
corpus described here, 41 (1.95%) contain tables and 71 (3.38%)
contain annexes; 27 (1.29%) of these contain both.
Figure 3: The 20 most frequent tokens in five of the languages
3.3 Building the Corpus
We extracted the individual resolutions from these
files and converted them into basic HTML format
using Word’s HTML export capability. The HTML
output was then run through a cleanup process that
would allow only basic paragraph marking and ty-
pographic markup (normally italics) at the start of
preambulatory and operative paragraphs, indicating
lead-in phrases. Additionally, some normalization
was performed to account for the fact that format-
ting that looks continuous within MS Word may ac-
tually consist of multiple formatting segments on the
HTML code level; for example, a contiguous se-
quence of italicised words may appear in the source
as a sequence of distinct italicisation events. Incon-
sistencies with the use of quotes and non-breaking
spaces were also normalized. Finally, tables were
stripped from the text as they mostly contain num-
bers and form nested paragraph structures, which are
difficult to represent in TMX form.
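The normalization step described above can be illustrated with a minimal sketch. It assumes that the cleaned HTML marks italics with plain <i> tags and that quotes are normalized to straight ASCII quotes; neither detail is stated in the text, and the function name is hypothetical.

```python
import re

# Adjacent italic runs, optionally separated by whitespace.
_ADJACENT_ITALICS = re.compile(r"</i>(\s*)<i>")

# Assumed target forms: straight quotes and ordinary spaces.
_REPLACEMENTS = {
    "\u00a0": " ",   # non-breaking space
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
}


def normalise_fragment(html: str) -> str:
    """Normalize spaces and quotes, then merge split italic runs."""
    for old, new in _REPLACEMENTS.items():
        html = html.replace(old, new)
    # "<i>Calls</i> <i>upon</i>" becomes "<i>Calls upon</i>".
    return _ADJACENT_ITALICS.sub(r"\1", html)


print(normalise_fragment("<p><i>Calls</i>\u00a0<i>upon</i> all States</p>"))
```

Replacing non-breaking spaces before merging means that italic runs separated only by such a space are also joined.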
From the HTML format, the multiple language
versions for the same resolution symbol (the identifi-
cation numbers introduced earlier) were aligned, us-
ing the assumption that the translations were strict at
the level of formatting as well as at the level of con-
tent. In a small number of cases, the Word format-
ting caused problems (typically the introduction of
spurious paragraph breaks); these were fixed man-
ually. The aligned resolution texts were then con-
verted into TMX format, while at the same time
marking the adoption information section, incorpo-
rating and marking footnotes, and converting lead-in
phrase marking into standard TMX markup. An example
is provided in Figure 4 (see footnote 6).
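The alignment assumption described above (translations that are strict at the level of paragraph formatting) amounts to pairing paragraphs by position. The following is a minimal sketch of that idea using the Python standard library, not the authors' actual pipeline; the function names and the language codes are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree's spelling of the xml:lang attribute used on TMX <tuv> elements.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def align_paragraphs(versions):
    """Pair paragraphs by position; versions maps language -> list of paragraphs."""
    lengths = {lang: len(paras) for lang, paras in versions.items()}
    if len(set(lengths.values())) != 1:
        # For example a spurious paragraph break; such cases need manual repair.
        raise ValueError(f"paragraph counts differ: {lengths}")
    langs = list(versions)
    return [dict(zip(langs, row)) for row in zip(*versions.values())]


def to_tmx_body(units):
    """Build a TMX <body> with one <tu> per aligned paragraph."""
    body = ET.Element("body")
    for unit in units:
        tu = ET.SubElement(body, "tu")
        for lang, text in unit.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return body


units = align_paragraphs({
    "EN": ["The General Assembly,", "Recalling its resolution 55/153,"],
    "FR": ["L'Assemblée générale,", "Rappelant sa résolution 55/153,"],
})
print(ET.tostring(to_tmx_body(units), encoding="unicode"))
```

Raising an error on mismatched paragraph counts mirrors the manual fixing step: the sketch makes no attempt to guess where a spurious break occurred.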
Given the availability of a number of existing to-
kenisers that users might wish to apply to the texts
for different purposes, we have not carried out a
complete tokenization of the corpus. However, we
have marked document symbols, since these tokens
might be problematic for standard tokenisers.
Figure 4: A six-way aligned resolution paragraph.
6 Note that the Arabic token ordering displays incorrectly
here as a consequence of the XML labels.
4 Interesting Properties of the Corpus
4.1 Document Symbols
Because of their importance, it is essential that the
scope for misinterpretation of resolutions be
minimised. To this end, these documents generally
contain all relevant context and make heavy use of
fully-explicit and unambiguous document symbol
references. Similar to complex symbols in the
biological domain (Proux et al., 1998), these symbols
may include slashes, dashes, full stops, brackets and
spaces. Some examples are shown in Table 2.
Document Description                                                                                         Symbol Used
A document from the 62nd session of the General Assembly                                                     A/62/100
A resolution from the 52nd session of the General Assembly                                                   52/215 A to D
A Security Council resolution adopted in the year 2000                                                       1325 (2000)
Second addendum to the document of the Commission on Human Rights (CN.4) of the Economic and Social Council
A document with a dual symbol, one from the General Assembly and one from the Security Council
A resolution from the 50th session of the International Atomic Energy Agency                                 GC(50)/RES/16
Table 2: Document symbols
While it is not practical to identify all possible
symbol variations,7 we have developed a set of regular
expressions to locate and mark a significant
proportion of the symbols in the corpus.
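The regular expressions used for the corpus are not reproduced in this paper. As a rough illustration of the approach, the sketch below covers only the four symbol shapes that appear with a symbol in Table 2; the patterns and the function name are assumptions written for this example and would miss many real variations.

```python
import re

SYMBOL_PATTERN = re.compile(
    r"""
    \b[A-Z]/\d+/\d+\b                               # e.g. A/62/100
  | \bGC\(\d+\)/RES/\d+\b                           # e.g. GC(50)/RES/16
  | \b\d{1,4}\ \((?:19|20)\d{2}\)                   # e.g. 1325 (2000)
  | \b\d{1,3}/\d+(?:\ [A-Z](?:\ to\ [A-Z])?)?\b     # e.g. 59/34, 52/215 A to D
    """,
    re.VERBOSE,
)


def find_symbols(text: str) -> list:
    """Return the document symbols found in text, in order of appearance."""
    return [match.group(0) for match in SYMBOL_PATTERN.finditer(text)]


sample = (
    "Recalling its resolutions 52/215 A to D and 59/34, document A/62/100, "
    "Security Council resolution 1325 (2000) and GC(50)/RES/16,"
)
print(find_symbols(sample))
```

The alternatives are ordered so that a full symbol such as A/62/100 is matched before its tail could be mistaken for a bare resolution number.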
4.2 Lead-in Phrases
As noted above, preambulatory and operative para-
graphs begin with specially-marked lead-in phrases
based on verbs whose meaning carries some signifi-
cance. While there are no official guidelines on what
constitutes an acceptable phrase, the requirements of
reliable translation tend to limit the chosen words
and phrase forms to a number of popular choices
used in the majority of the cases. Table 3 shows the
ten most frequent lead-in phrases across three of the
six languages.
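Frequency counts of this kind can be gathered with a few lines of code once the lead-in phrases are marked. The sketch below assumes cleaned HTML paragraphs in which the lead-in phrase is the first italic run, optionally preceded by a paragraph number; this input format and the function name are assumptions, not a description of the released markup.

```python
import re
from collections import Counter

# Optional operative paragraph number, then the leading italic phrase.
LEAD_IN = re.compile(r"^\s*(?:\d+\.\s*)?<i>([^<]+)</i>")


def count_lead_ins(paragraphs):
    """Count the lead-in phrase of each paragraph that has one."""
    counts = Counter()
    for paragraph in paragraphs:
        match = LEAD_IN.match(paragraph)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


paragraphs = [
    "The General Assembly,",
    "<i>Recalling</i> its resolution 55/153,",
    "1. <i>Requests</i> the Secretary-General to report on the matter;",
    "2. <i>Requests</i> Member States to submit their views;",
]
print(count_lead_ins(paragraphs).most_common(2))
```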
4.3 Named Entity Mentions
United Nations documentation in general, and the
resolutions of the General Assembly in particular,
include a large number of complex named entity
mentions referring to a broad variety of entity types.
Here are some examples, separated by semicolons:
Bodies: United Nations; International Atomic Energy
Agency; United Nations Educational, Scientific and
Cultural Organization.
Organisational units: General Assembly; Economic and
Social Council; the Advisory Committee on the United
Nations Programme of Assistance in the Teaching, Study,
Dissemination and Wider Appreciation of International
Law; Open-ended Ad Hoc Working Group on
the Causes of Conflict and the Promotion of
Durable Peace and Sustainable Development in
Agents: Secretary-General; United Nations Special
Representative for Children and Armed Conflict;
the Special Rapporteur of the Commission
on Human Rights.
As can be seen from these examples, named entity
mentions can be very long and may contain tokens,
such as commas, that are normally treated as
delimiters.
7 Griffiths (2005) presents an analysis of 14000 symbols
assigned in one year, and identifies a number of flaws that
include inconsistent application of editorial conventions.
5 Corpus Availability
The corpus described here is available as a 49.4 MB zip
file that contains just over 74000 pre-processed,
formatting-normalised aligned paragraphs in
Translation Memory eXchange (TMX) 1.4b format. As
noted earlier, special markup is included for
document symbols and the lead-in phrases in
preambulatory and operative paragraphs; footnote content is
also marked. Tables have been removed. The voting
information is also marked specially, as it contains
country lists in alphabetic order, and may not be
particularly useful for alignment purposes.
Basic utilities are provided to manipulate, extract,
or delete specially tagged areas as well as to extract
specific languages from the six-language set.
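The utilities distributed with the corpus are not shown here. As an independent illustration of language extraction, the sketch below streams a TMX file with the Python standard library and keeps only the requested languages. It relies on the standard TMX layout of tu, tuv and seg elements with an xml:lang attribute; the language codes used in the example are assumptions about how the corpus labels its languages.

```python
import xml.etree.ElementTree as ET

# ElementTree's spelling of the xml:lang attribute used on TMX <tuv> elements.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_languages(tmx_source, wanted):
    """Yield one dict per <tu>, mapping each wanted language to its text.

    tmx_source is a file name or a binary file object; inline markup
    inside <seg> is flattened to plain text.
    """
    for _, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang in wanted and seg is not None:
                unit[lang] = "".join(seg.itertext())
        if unit:
            yield unit
        elem.clear()  # release the processed unit to keep memory use low
```

A call such as `extract_languages("corpus.tmx", {"EN", "RU"})`, where the file name is a placeholder, would then yield English and Russian paragraph pairs one at a time without loading the whole file.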
Rank  English            French              Spanish
1     3707 Requests      3509 Prie           3905 Pide
2     3325 Recalling     3383 Rappelant      3323 Recordando
3     2177 Calls upon    1973 Demande        1877 Exhorta
4     1927 Welcomes      1870 Décide         1871 Insta
5     1797 Decides       1738 Souligne       1796 Decide
6     1688 Urges         1660 Invite         1587 Alienta
7     1604 Encourages    1446 Réaffirme      1502 Reconociendo
8     1402 Invites       1407 Réaffirmant    1396 Invita
9     1350 Recognizing   1361 Se félicite    1291 Reafirmando
10    1269 Reaffirming   1273 Encourage      1257 Reafirma
Table 3: The 10 most frequent lead-in phrases in the three languages
6 Conclusions
In this paper we have described a unique six-language
parallel corpus consisting of 2100 UN resolutions,
multiply-aligned and marked-up for a number of
constituent phenomena. The variety of language
families present is of particular interest for
work based on the use of bridge languages (Kumar
et al., 2007).
We see this as the first step in the construction of
a constantly growing corpus of aligned documents
harvested from the UN’s document collection. We
encourage wide use of the corpus; our sponsors are
likely to support further extension if there is a
perceived value in its availability.
Acknowledgements
We would like to thank the Department for General
Assembly and Conference Management of the
United Nations Secretariat for providing access to
the source documents.
References
D Graff. 1994. UN parallel text (complete). Linguistic
Data Consortium, Philadelphia.
D Griffiths. 2005. The United Nations classification
scheme: A critique and recommendations. Cataloging
and Classification Quarterly, 40(1):19–41.
P Koehn. 2005. Europarl: A parallel corpus for statisti-
cal machine translation. In MT Summit 2005, Phuket,
Thailand, September.
Shankar Kumar, Franz J. Och, and Wolfgang Macherey.
2007. Improving word alignment with bridge
languages. In Proceedings of the 2007 Joint EMNLP-CoNLL
Conference, pages 42–50, Prague, Czech Republic,
June. Association for Computational Linguistics.
Le An Ha, Gabriela Fernandez, Ruslan Mitkov, and
Gloria Corpas. 2008. Mutual bilingual terminology
extraction. In Proceedings of the Sixth International
Conference on Language Resources and Evaluation
(LREC’08), Marrakech, Morocco, May.
D Proux, F Rechenmann, L Julliard, V Pillet, and B Jacq.
1998. Detecting gene symbols and names in biologi-
cal texts: A first step toward pertinent information ex-
traction. In S. Miyano and T. Takagi, editors, Genome
informatics: Workshop on Genome Informatics, vol-
ume 9, pages 72–80, Tokyo, Japan, December.
Michel Simard. 1999. Text-translation alignment: Three
languages are better than two. In Proceedings of the
1999 Joint SIGDAT Conference on Empirical Meth-
ods in Natural Language Processing and Very Large
Corpora, pages 2–11.
STAI. 1987. UN Administrative Instruction
Ralf Steinberger, Bruno Pouliquen, Anna Widiger,
Camelia Ignat, Tomaz Erjavec, Dan Tufis, and Daniel
Varga. 2006. The JRC-Acquis: A multilingual aligned
parallel corpus with 20+ languages. In Proceedings
of the 5th International Conference on Language
Resources and Evaluation (LREC’2006), pages
2142–2147, Genoa, Italy, May.
United Nations. 1983. United Nations Editorial Manual.
Department of Conference Services ST/DCS/2, Sales
No. E. 83.I.16.
David Yarowsky, Grace Ngai, and Richard Wicentowski.
2001. Inducing multilingual text analysis tools via ro-
bust projection across aligned corpora. In Proceedings
of the First International Conference on Human Lan-
guage Technology Research, pages 1–8, Morristown,
NJ, USA. Association for Computational Linguistics.
... We have carried out experiments in order to evaluate the proposed methods. The corpus we used is "The United Nations General Assembly Resolutions: A Six-Language Parallel Corpus" [28]. It is a collection of documents containing official resolutions of the General Assembly of the United Nations translated manually into six official languages: Arabic, Chinese, English, French, Russian and Spanish. ...
Full-text available
Bilingual collocation extraction could improve the performance of monolingual extraction. This is especially true for the English–Arabic pair, as difficulties of Arabic collocation extraction can be overcome. We present in this paper two novel approaches for extracting both monolingual and bilingual collocations. The monolingual extraction approach is hybrid, based on linguistic patterns and statistical measures. We propose during statistical filtering to combine vector-based measures with different association measures via a voting procedure. The bilingual extraction capitalizes on different cues (position, frequency, cross-language correspondence between POS-patterns, distribution, translation). It allows enhancing the monolingual collocation extraction by considering not only collocation equivalents with direct translation. Indeed, it can validate unconfirmed collocations because they translate confirmed ones. The results showed, in particular, how the extraction of Arabic collocations can be improved by extracting English–Arabic ones. The precision of extracting Arabic collocations moved upward, respectively, from about 86 to 96%.
... The concept of crawling and mining the web to identify sources of parallel data has been previously explored (Resnik, 1999). A large body of this work has focused on identifying parallel text from multilingual data obtained from a single source: for example the United Nations General Assembly Resolutions (Rafalovitch et al., 2009;Ziemski et al., 2016) or European Parliament parallel corpus (Koehn, 2005). These parallel corpora were curated from specific, homogeneous sources by examining the content and deriving domainspecific rules for aligning documents. ...
Full-text available
Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of over 392 million URL pairs from Common Crawl covering documents in 8144 language pairs of which 137 pairs include English. In addition to curating this massive dataset, we introduce baseline methods that leverage cross-lingual representations to identify aligned documents based on their textual content. Finally, we demonstrate the value of this parallel documents dataset through a downstream task of mining parallel sentences and measuring the quality of machine translations from models trained on this mined data. Our objective in releasing this dataset is to foster new research in cross-lingual NLP across a variety of low, medium, and high-resource languages.
... Few English-Arabic/French-Arabic parallel corpora are available that support research in contrastive linguistics, machine translation, or lexicography. One of the available Arabic parallel corpora is the English-Arabic Parallel Corpus of United Nations Texts (EAPCOUNT), a 5,392,490-word parallel corpus comprised of United Nations (UN) annual reports for supporting linguistic research (Rafalovitch and Dale 2009;Salhi 2013). Another resource is the Open Parallel Corpus (OPUS), which presents different parallel texts collected from the web to support machine translation research (Tiedemann 2012). ...
Full-text available
Around the world, a growing interest has been seen in learner translator corpora, which are invaluable resources for teaching and research. This paper introduces a new resource to support researchers from different interdisciplinary areas such as computational linguistics, descriptive translation studies, computer-aided translation technology, Arabic machine translation applications, cognitive science, and translation pedagogy. Motivated by the lack of learner translator resources that provide data about learners of translation from and into Arabic, the undergraduate learner translator corpus (ULTC) is an ongoing, error-tagged sentence-aligned parallel corpus of English, Arabic, and French, with Arabic as its main language. The present corpus, consisting of parallel texts of female learners of translation from English or French into Arabic, is the first of its kind in terms of the languages represented, tasks covered, and number of students involved. It is also unique in terms of combining many complementary corpora of cross-lingual data, each of which has its own web-based query interface and corpus analysis tools. This paper describes the ULTC compilation process, preliminary findings, and planned future expansion and research.
... Based on the presence of ambiguous meanings of the words, there may be many candidate sentences for source sentence. Then we use a corpus to match most suitable sentence from the given candidate sentences we have used United Nations (Arabic-English) parallel corpus [20] same training and test data split was used as in [31]: 1,000,000 training sentence pairs and tested on 994 test sentences. ...
Full-text available
It is practically impossible for pure machine translation approach to process all of translation problems; however, Rule Based Machine Translation and Statistical Machine translation (RBMT and SMT) use different architectures for performing translation task. Lexical analyser and syntactic analyser are solved by Rule Based and some amount of ambiguity is left to be solved by Expectation-Maximization (EM) algorithm, which is an iterative statistic algorithm for finding maximum likelihood. In this paper we have proposed an integrated Hybrid Machine Translation (HMT) system. The goal is to combine the best properties of each approach. Initially, Arabic text is keyed into RBMT; then the output will be edited by EM algorithm to generate the final translation of English text. As we have seen in previous works, the performance and enhancement of EM algorithm, the key of EM algorithm performance is the ability to accurately transform a frequency from one language to another. Results showing that, as proved by BLEU system, the proposed method can substantially outperform standard Rule Based approach and EM algorithm in terms of frequency and accuracy. The results of this study have been showed that the score of HMT system is higher than SMT system in all cases. When combining two approaches, HMT outperformed SMT in Bleu score.
... The corpus used in this work is "The United Nations General Assembly Resolutions: A Six-Language Parallel Corpus", created by [25]. This corpus consists of a collection of documents containing official resolutions of the General Assembly of the United Nations translated manually into the six official languages: Arabic, Chinese, English, French, Russian and Spanish. ...
Conference Paper
Full-text available
This work focuses on the extraction of collocations from Arabic textual corpora. The method that we propose is hybrid. It uses syntactico-semantic information by combining an enriched linguistic filtering with a statistical filtering. Linguistic filtering applies syntactic patterns represented by finite state automata and considers both elementary collocations and augmented ones. Statistical filtering is novel since it does not use classical association measures and applies the Latent Semantic Analysis (LSA) method to infer deeper semantic relations than those derived by contiguity frequencies, co-occurrence counts, or correlations in usage. The experiments showed that LSA gives globally better results than those achieved by Mutual Information (MI), Likelihood ration (LLR), Khi-Carré (X2), z score and t-test.
Software tools are of vital importance in corpus-based research, but they can also lead to restrictions on the type of supported corpora and the range of analyses that can be performed. For example, corpus analysis tools, as general purpose software, do not include specific features to process corpora of theatre plays. This situation is even worse for parallel corpora of theatrical texts, in that there is currently a lack of software that allows for both the alignment and analysis of parallel corpora here. In this contribution, we will first outline the peculiarities of theatre texts and suggest three software features to address them: annotation of the structural units of plays, alignment at the utterance level, and concordances and statistics using the annotated units. Second, we will present the specific functionalities of TAligner and ACM to build and analyse parallel corpora of play texts, showing how new avenues of research are opening up with the development of these tools.
Language is the primary medium through which human information is communicated and coordination is achieved. One of the most important language functions is to categorize the world so messages can be communicated through conversation. While we know a great deal about how human languages vary in their encoding of information within semantic domains such as color, sound, number, locomotion, time, space, human activities, gender, body parts and biology, little is known about the global structure of semantic information and its effect on human communication. Using large-scale computation, artificial intelligence techniques, and massive, parallel corpora across 15 subject areas--including religion, economics, medicine, entertainment, politics, and technology--in 999 languages, here we show substantial variation in the information and semantic density of languages and their consequences for human communication and coordination. In contrast to prior work, we demonstrate that higher density languages communicate information much more quickly relative to lower density languages. Then, using over 9,000 real-life conversations across 14 languages and 90,000 Wikipedia articles across 140 languages, we show that because there are more ways to discuss any given topic in denser languages, conversations and articles retrace and cycle over a narrower conceptual terrain. These results demonstrate an important source of variation across the human communicative channel, suggesting that the structure of language shapes the nature and texture of conversation, with important consequences for the behavior of groups, organizations, markets, and societies.
Although deep neural networks have recently led to great achievements in machine translation (MT), various challenges are still encountered during the development of Korean-Vietnamese MT systems. Because Korean is a morphologically rich language and Vietnamese is an analytic language, neither have clear word boundaries. The high rate of homographs in Korean causes word ambiguities, which causes problems in neural MT (NMT). In addition, as a low-resource language pair, there is no freely available, adequate Korean-Vietnamese parallel corpus that can be used to train translation models. In this paper, we manually established a lexical semantic network for the special characteristics of Korean as a knowledge base that was used for developing our Korean morphological analysis and word-sense disambiguation system: UTagger. We also constructed a large Korean-Vietnamese parallel corpus, in which we applied the state-of-the-art Vietnamese word segmentation method RDRsegmenter to Vietnamese texts and UTagger to Korean texts. Finally, we built a bi-directional Korean-Vietnamese NMT system based on the attention-based encoder-decoder architecture. The experimental results indicated that UTagger and RDRsegmenter could significantly improve the performance of the Korean-Vietnamese NMT system, achieving remarkable results by 27.79 BLEU points and 58.77 TER points in Korean-to-Vietnamese direction and 25.44 BLEU points and 58.72 TER points in the reverse direction.
In most languages, many words have multiple senses, thus machine translation systems have to choose between several candidates representing different senses of an input word. Although neural machine translation has recently become a dominant paradigm and achieved great progress, it still has to confront the challenge of word sense disambiguation. Neural machine translation models are trained to identify the correct sense of a word as part of an end-to-end translation task, and their performances on word sense disambiguation are not satisfactory. This paper presents a case study of machine translation for Korean language. We have manually built a Korean lexical semantic network - UWordMap - as a large-scale lexical semantic knowledge base in which each sense of every polysemous word is associated with a sense-code constituting a network node. Then, based on UWordMap, we determine the correct sense and tag the appropriate sense-code for polysemous words of the training corpus before training neural machine translation models. Experiments on translation from Korean to English and Vietnamese show that UWordMap can significantly improve quality of Korean neural machine translation systems in terms of BLEU and TER scores.
This paper investigates the potential for projecting linguistic annotations including part-of-speech tags and base noun phrase bracketings from one language to another via automatically word-aligned parallel corpora. First, experiments assess the accuracy of unmodified direct transfer of tags and brackets from the source language English to the target languages French and Chinese, both for noisy machine-aligned sentences and for clean hand-aligned sentences. Performance is then substantially boosted over both of these baselines by using training techniques optimized for very noisy data, yielding 94-96% core French part-of-speech tag accuracy and 90% French bracketing F-measure for stand-alone monolingual tools trained without the need for any human-annotated data in the given language.
This paper describes a novel methodology to perform bilingual terminology extraction, in which automatic alignment is used to improve the performance of terminology extraction for each language. The strengths of monolingual terminology extraction for each language are exploited to improve the performance of terminology extraction in the other language, thanks to the availability of a sentence-level aligned bilingual corpus, and an automatic noun phrase alignment mechanism. The experiment indicates that weaknesses in monolingual terminology extraction due to the limitation of resources in certain languages can be overcome by using another language which has no such limitation.
In this article, we show how a bilingual text-translation alignment method can be adapted to deal with more than two versions of a text. Experiments on a trilingual corpus demonstrate that this method yields better bilingual alignments than can be obtained with bilingual text-alignment methods. Moreover, for a given number of texts, the computational complexity of the multilingual method is the same as for bilingual alignment.
This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections. Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system. This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection. Keywords: multilingual, text analysis, part-of-speech tagging, noun phrase bracketing, named entity, morphology, lemmatization, parallel corpora. 1. Task Overview: A fundamental roadblock to developing statistical taggers, bracketers and other analyzers for many of the world's 200 major languages is the shortage or absence of annotated training data for the large majority of these languages. Ideally, one would like to lever-...
We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable to carry out all types of cross-language research, as well as to test and benchmark text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).
Anecdotal evidence suggests that dissatisfaction with the United Nations Classification Scheme (UNCS), a notational system in continuous use since 1946, has been widespread among researchers and government information specialists. Through the examination of over fourteen thousand document symbols assigned over the course of a year, this study identifies flaws in the notation that have limited its effectiveness. The criteria for this evaluation, which are drawn from both archival and library classification literature, include simplicity, the appropriate use of mnemonics, brevity, serial piece collocation, and the appropriate representation of administrative origin. The author concludes that the scheme satisfies none of these criteria consistently, due in part to the lack of centralized control over its development, and offers recommendations for correcting its defects.
We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
We describe an approach to improve Statistical Machine Translation (SMT) performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. Our approach consists of a simple method for utilizing a bridge language to create a word alignment system and a procedure for combining word alignment systems from multiple bridge languages. The final translation is obtained by consensus decoding that combines hypotheses obtained using all bridge language word alignments. We present experiments showing that multilingual, parallel text in Spanish, French, Russian, and Chinese can be utilized in this framework to improve translation performance on an Arabic-to-English task.
Gathering data on molecular interactions to be fed into a specialized database has motivated the development of a computer system to help extracting pertinent information from texts, relying on advanced linguistic tools, completed with object-oriented knowledge modeling capabilities. As a first step toward this challenging objective, a program for the identification of gene symbols and names inside sentences has been devised. The main difficulty is that these names and symbols do not appear to follow construction rules. The program is thus made up of a series of sieves of different natures, lexical, morphological and semantic, to distinguish among the words of a sentence those which can only be potential gene symbols or names. Its performance has been evaluated, in terms of coverage and precision ratios, on a corpus of texts concerning D. melanogaster for which the list of names of known genes is available for checking.
United Nations. 1983. United Nations Editorial Manual. Department of Conference Services ST/DCS/2, Sales No. E. 83.I.16.