A STATISTICAL APPROACH TO LANGUAGE TRANSLATION
P. BROWN, J. COCKE, S. DELLA PIETRA, V. DELLA PIETRA,
F. JELINEK, R. MERCER, and P. ROOSSIN
IBM Research Division
T.J. Watson Research Center
Department of Computer Science
P.O. Box 218
Yorktown Heights, N.Y. 10598
ABSTRACT
An approach to automatic translation is outlined that utilizes techniques of statistical information extraction from large data bases. The method is based on the availability of pairs of large corresponding texts that are translations of each other. In our case, the texts are in English and French.
Fundamental to the technique is a complex glossary of correspondence of fixed locutions. The steps of the proposed translation process are: (1) Partition the source text into a set of fixed locutions. (2) Use the glossary plus contextual information to select the corresponding set of fixed locutions in the target language. (3) Arrange the words of the target fixed locutions into a sequence forming the target sentence.
We have developed statistical techniques facilitating both the automatic creation of the glossary, and the performance of the three translation steps, all on the basis of an alignment of corresponding sentences in the two texts.
While we are not yet able to provide examples of French / English translation, we present some encouraging intermediate results concerning glossary creation and the arrangement of target word sequences.
1. INTRODUCTION
In this paper we will outline an approach to automatic translation
that utilizes techniques of statistical information extraction from
large data bases. These self-organizing techniques have proven
successful in the field of automatic speech recognition [1,2,3].
Statistical approaches have also been used recently in
lexicography [4] and natural language processing [3,5,6]. The idea of automatic translation by statistical (information theoretic) methods was proposed many years ago by Warren Weaver [7]. As will be seen in the body of the paper, the suggested technique is based on the availability of pairs of large corresponding texts that are translations of each other.
In particular, we have chosen to work with the English and French languages because we were able to obtain the bilingual Hansard corpus of proceedings of the Canadian parliament containing 30 million words of text [8]. We also prefer to apply our ideas initially to two languages whose word order is similar, a condition that French and English satisfy.
Our approach eschews the use of an intermediate mechanism (language) that would encode the "meaning" of the source text. The proposal will seem especially radical since very little will be said about employment of conventional grammars. This omission, however, is not essential, and may only reflect our relative lack of tools as well as our uncertainty about the degree
of grammar sophistication required. We are keeping an open
mind!
In what follows we will not be able to give actual results of French / English translation: our less than a year old project is not far enough along. Rather, we will outline our current thinking, sketch certain techniques, and substantiate our optimism by presenting some intermediate quantitative data. We wrote this somewhat speculative paper hoping to stimulate interest in applications of statistics to translation and to seek cooperation in achieving this difficult task.
2. A HEURISTIC OUTLINE OF THE BASIC PHILOSOPHY
Figure 1 juxtaposes a rather typical pair of corresponding English and French sentences, as they appear in the Hansard corpus. They are arranged graphically so as to make evident that (a) the literal word order is on the whole preserved, (b) the clausal (and perhaps phrasal) structure is preserved, and (c) the sentence pairs contain stretches of essentially literal correspondence interrupted by fixed locutions. In the latter category are [I rise on = je souleve], [affecting = a propos], and [one which reflects on = pour mettre en doute].
It can thus be argued that translation ought to be based on a complex glossary of correspondence of fixed locutions. Included would be single words as well as phrases consisting of contiguous or non-contiguous words. E.g., [word = mot], [word = propos], [not = ne ... pas], [no = ne ... pas], [seat belt = ceinture], [ate = a mange], and even (perhaps) [one which reflects on = pour mettre en doute], etc.
Translation can be somewhat naively regarded as a three stage process:
(1) Partition the source text into a set of fixed locutions.
(2) Use the glossary plus contextual information to select the
corresponding set of fixed locutions in the target language.
(3) Arrange the words of the target fixed locutions into a
sequence that forms the target sentence.
This naive approach forms the basis of our work. In fact, we have
developed statistical techniques facilitating the creation of the
glossary, and the performance of the three translation steps.
While the only way to refute the many weighty objections to our ideas would be to construct a machine that actually carries out satisfactory translation, some mitigating comments are in order.
We do not hope to partition uniquely the source sentence into
locutions. In most cases, many partitions will be possible, each
having a probability attached to it.
Whether "affecting" is to be translated as "a propos" or "concernant," or, as our dictionary has it, "touchant" or "emouvant," or in a variety of other ways, depends on the rest of the sentence. However, a statistical indication may be obtained from the presence or absence of particular guide words in that sentence. The statistical technique of decision trees [9] can be used to determine the guide word set, and to estimate the probability to be attached to each possible translate.
The sequential arrangement of target words obtained from the
glossary may depend on an analysis of the source sentence. For instance, clause correspondence may be insisted upon, in which case only permutations of words which originate in the same source clause would be possible. Furthermore, the character of the source clause may affect the probability of use of certain function words in the target clause. There is, of course, nothing to prevent the use of more detailed information about the structure of the parse of the source sentence. However, preliminary experiments presented below indicate that only a very crude grammar may be needed (see Section 6).
3. CREATING THE GLOSSARY: FIRST ATTEMPT
We have already indicated in the previous section why creating a glossary is not just a matter of copying some currently available dictionary into the computer. In fact, in the paired sentences of Figure 1, "affecting" was translated as "a propos," a correspondence that is not ordinarily available. Laying aside for the time being the desirability of (idiomatic) word cluster - to - word cluster translation, what we are after at first is to find for each word f in the (French) source language the list of words {e_1, e_2, ..., e_n} of the (English) target language into which f can translate, and the probability P(e_i | f) that such a translation takes place.
A first approach to a solution that takes advantage of a large data base of paired sentences (referred to as 'training text') may be as follows. Suppose for a moment that in every French / English sentence pair each French word f translates into one and only one English word e, and that this word is somehow revealed to the computer. Then we could proceed by:
1. Establish a counter C(e_i, f) for each word e_i of the English vocabulary. Initially set C(e_i, f) = 0 for all words e_i. Set J = 1.
2. Find the Jth occurrence of the word f in the French text. Let it take place in the Kth sentence, and let its translate be the qth word in the Kth English sentence E = e_{k1}, e_{k2}, ..., e_{kn}. Then increment by 1 the counter C(e_{kq}, f).
3. Increase J by 1 and repeat steps 2 and 3.
Setting M(f) equal to the sum of all the counters C(e_i, f) at the conclusion of the above operation (in fact, it is easy to see that M(f) is the number of occurrences of f in the total French text), we could then estimate the probability P(e_i | f) of translating the word f by the word e_i by the fraction C(e_i, f) / M(f).
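As a concrete illustration, here is a minimal Python sketch of this counting scheme. It assumes, hypothetically, that the oracle-revealed translates are already available as a stream of (f, e) occurrence pairs; this input format is our assumption, not part of the paper.

```python
from collections import defaultdict

def estimate_translation_probs(occurrences):
    """Counting scheme of Section 3, assuming an oracle has revealed,
    for every occurrence of a French word f, the English word e it
    translates into (hypothetical input format)."""
    C = defaultdict(float)   # C(e, f): counter per English word and f
    M = defaultdict(float)   # M(f): number of occurrences of f
    for f, e in occurrences:
        C[(e, f)] += 1.0
        M[f] += 1.0
    # estimate P(e | f) by the relative frequency C(e, f) / M(f)
    return {(e, f): c / M[f] for (e, f), c in C.items()}
```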
The problem with the above approach is that it relies on correct
identification of the translates of French words, i.e., on the
solution of a significant part of the translation problem. In the
absence of such identification, the obvious recourse is to profess
complete ignorance, beyond knowing that the translate is one of
the words of the corresponding English sentence, each of its
words being equally likely. Step 2 of the above algorithm then
must be changed to
2'. Find the Jth occurrence of the word f in the French text. Let it take place in the Kth sentence, and let the Kth English sentence consist of words e_{k1}, e_{k2}, ..., e_{kn}. Then increment the counters C(e_{k1}, f), C(e_{k2}, f), ..., C(e_{kn}, f) by the fraction 1/n.
This second approach is based on the faith that in a large corpus, the frequency of occurrence of true translates of f in corresponding English sentences would overwhelm that of other candidates whose appearance in those sentences is accidental.
This belief is obviously flawed. In particular, the article "the"
would get the highest count since it would appear multiply in
practically every English sentence, and similar problems would
exist with other function words as well.
What needs to be done is to introduce some sort of normalization that would appropriately discount for the expected frequency of occurrence of words. Let P(e_i) denote the probability (based on the above procedure) that the word e_i is a translate of a randomly chosen French word. P(e_i) is given by
P(e_i) = Σ_{f'} P(e_i | f') P(f') = Σ_{f'} P(e_i | f') M(f') / M   (3.1)

where M is the total length of the French text, and M(f') is the number of occurrences of f' in that text (as before). The fraction
P(e_i | f) / P(e_i) is an indicator of the strength of association of e_i with f, since P(e_i | f) is normalized by the frequency P(e_i) of associating e_i with an average word. Thus it is reasonable to consider e_i a likely translate of f if the ratio P(e_i | f) / P(e_i) is sufficiently large.
The above normalization may seem arbitrary, but it has a sound underpinning from the field of Information Theory [10]. In fact, the quantity

I(e_i; f) = log [ P(e_i | f) / P(e_i) ]   (3.2)

is the mutual information between the French word f and the English word e_i.
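The fractional-count step 2' and the normalization (3.1)-(3.2) are straightforward to render in code. The following Python sketch assumes the training text is given as a list of (French words, English words) sentence pairs; the representation and the absence of any smoothing are our assumptions.

```python
import math
from collections import defaultdict

def mutual_information_scores(sentence_pairs):
    """Step 2' with fractional counts, followed by the normalization
    (3.1)-(3.2). `sentence_pairs` is a hypothetical list of
    (french_words, english_words) token-list pairs."""
    C = defaultdict(float)   # fractional counts C(e, f)
    M = defaultdict(float)   # M(f): occurrences of f
    total = 0.0              # M: total length of the French text
    for f_words, e_words in sentence_pairs:
        n = len(e_words)
        for f in f_words:
            M[f] += 1.0
            total += 1.0
            for e in e_words:
                C[(e, f)] += 1.0 / n
    # P(e) of (3.1) reduces to the total fractional count of e over M
    P_e = defaultdict(float)
    for (e, f), c in C.items():
        P_e[e] += c / total
    # I(e; f) = log P(e|f)/P(e), formula (3.2)
    return {(e, f): math.log((c / M[f]) / P_e[e])
            for (e, f), c in C.items()}
```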
Unfortunately, while normalization yields ordered lists of likely English word translates of French words, it does not provide us with the desired probability values. Furthermore, we get no guidance as to the size of a threshold T such that e_i would be a candidate translate of f if and only if

I(e_i; f) > T   (3.3)
Various ad hoc modifications exist to circumvent the two problems. One might, for instance, find the pair e_i, f with the highest mutual information, eliminate e_i and f from all corresponding sentences in which they occur (i.e., decide once and for all that in those sentences e_i is the translate of f!), then re-compute all the quantities over the shortened texts, determine the new maximizing pair e_i', f', and continue the process until some arbitrary stopping rule is invoked.
Before the next section introduces a better approach that yields probabilities, we present in Figure 2 a list of high mutual information English words for some selected French words. The reader will agree that even the flawed technique is quite powerful.
4. A SIMPLE GLOSSARY BASED ON A MODEL OF THE TRANSLATION PROCESS
We will now revert to our original ambition of deriving probabilities of translation, P(e_i | f). Let us start by observing that the algorithm of the previous section has the following flaw: Should it be "decided" that the qth word, e_q, of the English sentence is the translate of the rth word, f_r, of the French sentence, that process makes no provision for removing e_q from consideration as a candidate translate of any of the remaining French words (those not in the rth position)! We need to find a method to decide (probabilistically!) which English word was generated by which French one, and then estimate P(e_i | f) by the relative frequency with which f gave rise to e_i as "observed" in the texts of paired French / English sentence translates. Our procedure will be based on a model (an admittedly crude one) of how English words are generated from their French counterparts.
With a slight additional refinement to be specified in the next
section (see the discussion on position distortion), the following
model will do the trick. Augment the English vocabulary by the
NULL word e_0 that leaves no trace in the English text. Then each French word f will produce exactly one 'primary' English word (which may be, however, invisible). Furthermore, primary English words can produce a number of secondary ones.
The provisions for the null word and for the production of secondary words will account for the unequal length of corresponding French and English sentences. It would be expected that some (but not all) French function words would be killed by producing null words, and that English ones would be created by secondary production. In particular, in the example of Figure 1, one would expect that "reflects" would generate both "which" and "on" by secondary production, and "rise" would similarly generate "on." On the other hand, the article "l'" of "l'Orateur" and the preposition "a" of "a propos" would both be expected to generate a null word in the primary process.
This model of generation of English words from French ones then requires the specification of the following quantities:
1. The probabilities P(e_i | f) that the ith word of the English dictionary was generated by the French word f.
2. The probabilities Q(e_j | e_i) that the jth English word is generated from the ith one in a secondary generation process.
3. The probabilities R(k | e_i) that the ith English word generates exactly k other words in the secondary process. By convention, we set R(0 | e_0) = 1 to assure that the null word does not generate any other words.
The model probability that the word f generates e_{i1} in the primary process, and e_{i2}, ..., e_{ik} in the secondary one, is equal to the product

P(e_{i1} | f) R(k-1 | e_{i1}) Q(e_{i2} | e_{i1}) Q(e_{i3} | e_{i1}) ... Q(e_{ik} | e_{i1})   (4.1)
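A sketch of the factor (4.1) in Python, assuming the three model distributions are stored in plain dictionaries (our representation, not the paper's):

```python
import math

def primary_secondary_log_prob(f, e1, secondaries, P, R, Q):
    """The factor (4.1): f generates the primary word e1, which in
    turn generates the k-1 secondary words in `secondaries`.
    P, R and Q are hypothetical dictionaries keyed as in the text:
    P[(e, f)] = P(e|f), R[(k, e)] = R(k|e), Q[(e2, e1)] = Q(e2|e1).
    A missing key simply raises, as befits a sketch."""
    logp = math.log(P[(e1, f)]) + math.log(R[(len(secondaries), e1)])
    for e in secondaries:
        logp += math.log(Q[(e, e1)])
    return logp
```

The probability P(E, $ | F) of a full generation pattern, introduced next, is then just the sum of such log-factors, one per French word.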
Given a pair of English and French sentences E and F, by the term generation pattern $ we understand the specification of which English words were generated from which French ones, and which secondary words from which primary ones. Therefore, the probability P(E, $ | F) of generating the words of E in a pattern $ from those of F is given simply by a product of factors like (4.1), one for each French word. We can then think of estimating the probabilities P(e_i | f), R(k | e_i), and Q(e_j | e_i) by the following algorithm, at the start of which all counters are set to 0:
1. For a sentence pair E, F of the texts, find that pattern $ that gives the maximal value of P(E, $ | F), and then make the (somewhat impulsive) decision that that pattern $ actually took place.
2. If in the pattern $, f gave rise to e_i, augment counter CP(e_i, f) by 1; if e_i gave rise to k secondary English words, augment counter CR(k, e_i) by 1; if e_j is any (secondary) word that was given rise to by e_i, augment counter CQ(e_j, e_i) by 1.
3. Carry out steps 1 and 2 for all sentence pairs of the training text.
4. Estimate the model probabilities by normalizing the corresponding counters, i.e.,

P(e_i | f) = CP(e_i, f) / CP(f)   where CP(f) = Σ_i CP(e_i, f)
R(k | e_i) = CR(k, e_i) / CR(e_i)   where CR(e_i) = Σ_k CR(k, e_i)
Q(e_j | e_i) = CQ(e_j, e_i) / CQ(e_i)   where CQ(e_i) = Σ_j CQ(e_j, e_i)
The problem with the above algorithm is that it is circular: in order to evaluate P(E, $ | F) one needs to know the probabilities P(e_i | f), R(k | e_i), and Q(e_j | e_i) in the first place! Fortunately, the difficulty can be alleviated by use of iterative re-estimation, which is a technique that starts out by guessing the values of unknown quantities and gradually re-adjusts them so as to account better and better for given data [11].

More precisely, given any specification of the probabilities P(e_i | f), R(k | e_i), and Q(e_j | e_i), we compute the probabilities P(E, $ | F) needed in step 1, and after carrying out step 4, we use the freshly obtained probabilities P(e_i | f), R(k | e_i), and Q(e_j | e_i) to repeat the process from step 1 again, etc. We halt the computation when the obtained estimates stop changing from iteration to iteration.
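To make the loop concrete, here is a deliberately stripped-down Python sketch of the iterative re-estimation. It assumes a reduced model with no secondary generation and no null word, so that step 1's maximal pattern reduces to an independent argmax per French word; the paper's full pattern search is considerably more involved.

```python
from collections import defaultdict

def reestimate(sentence_pairs, P, iterations=10):
    """Iterative re-estimation for a stripped-down model: every
    French word generates exactly one English word, with no secondary
    generation and no null word (our simplification). Step 1's maximal
    pattern then picks, for each f, the English word of the paired
    sentence with the largest current P(e|f); step 4 renormalizes.
    `P` maps (e, f) to an initial probability, e.g. the Section 3
    estimate."""
    for _ in range(iterations):
        CP = defaultdict(float)    # CP(e, f)
        CPf = defaultdict(float)   # CP(f) = sum over e of CP(e, f)
        for f_words, e_words in sentence_pairs:
            for f in f_words:
                # step 1: the (somewhat impulsive) maximal pattern
                e = max(e_words, key=lambda w: P.get((w, f), 1e-12))
                CP[(e, f)] += 1.0   # step 2: augment the counter
                CPf[f] += 1.0
        # step 4: P(e|f) = CP(e, f) / CP(f)
        P = {(e, f): c / CPf[f] for (e, f), c in CP.items()}
    return P
```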
While it can be shown that the probability estimates obtained in the above process will converge [11,12], it cannot be proven that the values obtained will be the desired ones. A heuristic argument can be formulated making it plausible that a more complex but computationally excessive version [13] will succeed. Its truncated modification leads to a glossary that seems a very satisfactory one. We present some interesting examples of its P(e_i | f) entries in Figure 3.
Two important aspects of this process have not yet been dealt with: the initial selection of values of P(e_i | f), R(k | e_i), and Q(e_j | e_i), and a method of finding the pattern $ maximizing P(E, $ | F).
A good starting point is as follows:
A. Make Q(e_j | e_i) = 1/K, where K is the size of the English vocabulary.
B. Let R(1 | e_i) = 0.8, R(0 | e_i) = 0.1, R(2 | e_i) = R(3 | e_i) = R(4 | e_i) = R(5 | e_i) = 0.025 for all words e_i except the null word e_0. Let R(0 | e_0) = 1.0.
C. To determine the initial distribution P(e_i | f) proceed as follows:
(i) Estimate first P(e_i | f) by the algorithm of Section 3.
(ii) Compute the mutual information values I(e_i; f) by formula (3.2), and for each f find the 20 words e_i for which I(e_i; f) is largest.
(iii) Let P(e_0 | f) = P(e_i | f) = (1/21) - ε for all words e_i on the list obtained in (ii), where ε is some small positive number. Distribute the remaining probability 21ε uniformly over all the English words not on the list.
Finding the maximizing pattern $ for a given sentence pair E, F is a well-studied technical problem with a variety of computationally feasible solutions that are suboptimal in some practically unimportant respects [14]. Not to interrupt the flow of intuitive ideas, we omit the discussion of the corresponding algorithms.
5. TOWARD A COMPLEX GLOSSARY
In the previous section we have introduced a technique that derives a word - to - word translation glossary. We will now refine the model to make the probabilities a better reflection of reality, and then outline an approach for including in the glossary the fixed locutions discussed in Section 2.
It should be noted that while English / French translation is quite local (as illustrated by the alignment of Figure 1), the model leading to (4.1) did not take advantage of this affinity of the two languages: the relative position of the word translate pairs in their respective sentences was not taken into account. If m and n denote the respective lengths of corresponding French and English sentences, then the probability that e_k (the kth word in the English sentence) is a primary translate of f_h (the hth word in the French sentence) should more accurately be given by a probability P(e_i, k | f_h, h, m, n) that depends both on word positions and sentence lengths. To keep the formulation as simple as possible, we can restrict ourselves to the functional form

P(e_i, k | f_h, h, m, n) = PW(e_i | f_h) PD(k | h, m, n)   (5.1)
In (5.1) we make the 'distortion' distribution PD(k | h, m, n) independent of the identity of the words whose positional discrepancy it describes.
As far as secondary generation is concerned, it is first clear that the production of preceding words differs from that of those that follow. So the R and Q probabilities should be split into left and right probabilities RL and QL, and RR and QR. Furthermore, we should provide the Q-probabilities with their own distortion components that would depend on the distance of the secondary word from its primary 'parent'. As a result of these considerations, the probability that f_h generates (for instance) the primary word e_k and the preceding and following secondary words e_{k-3}, e_{k-1}, e_{k+2} would be given by

PW(e_k | f_h) PD(k | h, m, n) RL(2 | e_k) RR(1 | e_k) QL(e_{k-3}, 3 | e_k) QL(e_{k-1}, 1 | e_k) QR(e_{k+2}, 2 | e_k)   (5.2)
Obviously, other distortion formulations are possible. The purpose of any is to sharpen the derivation process by restricting the choice of translates to the positionally likely candidates in the corresponding sentence.
To find fixed locutions in English, we can use the final probabilities QL and QR obtained by the method of the previous section to compute mutual informations between primary and secondary word pairs,

IR(e; e') = log [ QR(e' | e) / P(e') ]   (5.3)

and

IL(e; e') = log [ QL(e' | e) / P(e') ]
where P(e') = C(e')/N is the relative frequency of occurrence of the secondary word e' in the English text (C(e') denotes the number of occurrences of e' in the text of size N), and QR and QL are the average secondary generation probabilities,

QR(e' | e) = Σ_i QR(e', i | e)   (5.4)

and

QL(e' | e) = Σ_i QL(e', i | e)
We can then establish an experimentally appropriate threshold T, and include in the glossary all pairs (e, e') and (e', e) whose mutual information exceeds T.
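A minimal sketch of this thresholding step, assuming QR and QL have already been summed over positions as in (5.4) and are stored as dictionaries (the keying is our assumption):

```python
import math

def two_word_locutions(QR, QL, P_e, T):
    """Collect word pairs whose mutual information (5.3) exceeds the
    threshold T. QR and QL are hypothetical dictionaries of the
    position-summed probabilities (5.4), keyed by (e_prime, e);
    P_e[e_prime] = C(e')/N is the secondary word's relative
    frequency."""
    pairs = set()
    for (e_prime, e), q in QR.items():
        if q > 0 and math.log(q / P_e[e_prime]) > T:
            pairs.add((e, e_prime))   # e' following e
    for (e_prime, e), q in QL.items():
        if q > 0 and math.log(q / P_e[e_prime]) > T:
            pairs.add((e_prime, e))   # e' preceding e
    return pairs
```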
While the process above results in two-word fixed locutions, longer locutions can be obtained iteratively in the next round after the two-word variety has been included in the glossary and in the formulation of its creation.
To obtain French locutions, one must simply reverse the direction
of the translation process, making English and French the source
and target languages, respectively.
With two-word locutions present in both the English and French
parts of the glossary, it is necessary to reformulate the generation
process (4.1). The change would be minimal if we could decide
to treat the words of a locution (f, f') as a single word f* = (f, f') rather than as two separate words f and f' whenever both are found in a sentence. In such a case nothing more than a recoding of the French text would be required. However, such a radical step would almost certainly be wrong: it could well connect auxiliaries and participles that were not part of a single past construction. Clearly then, the choice between separateness and unity should be statistical, with probabilities estimated in the overall glossary construction process and initialized according to the frequencies with which elements of the pair f, f' were associated or not by secondary generation when they appeared in the same sentence.
Since the approach of this section was not yet used to obtain any
results, we will leave its complete mathematical specification to a
future report.
6. GENERATION OF TRANSLATED TEXT
We have pointed out in Section 2 that translation can be somewhat naively regarded as a three stage process:
(1) Partition the source text into a set of fixed locutions.
(2) Use the glossary plus contextual information to select the corresponding set of fixed locutions in the target language.
(3) Arrange the words of the target fixed locutions into a sequence forming the target sentence.
We have just finished arguing in Section 5 that the partitioning of source text into locutions is somewhat complex, and that it must be approached statistically. The basic idea of using contextual information to select the correct 'sense' of a locution is to construct a contextual glossary based on a probability of the form P(e | f; g(F)), where e and f are English and French locutions, and g(F) denotes a 'lexical' equivalence class of the sentence F. The test of class membership would typically depend on the presence of some combination of words in F. The choice of an appropriate equivalence classification scheme would, of course, be the subject of research based on yet another statistical formulation. The estimate of P(e | f; g(F)) would be derived from counts of locution alignments in sentence translate pairs, the alignments being estimated based on non-contextual glossary probabilities of the form (5.2).
The last step in our translation scheme is the re-arrangement of the words of the generated English locutions into an appropriate sequence. To see whether this can be done statistically, we explored what would happen in the impossibly optimistic case where the words generated in (2) were exactly those of the English sentence (only their order would be unknown):

From a large English corpus we derived estimates of trigram probabilities, P(e_3 | e_1, e_2), that the word e_3 immediately follows the sequence pair e_1, e_2. A model of English sentence production based on a trigram estimate would conclude that a sentence e_1, e_2, ..., e_n is generated with probability

P(e_1, e_2) P(e_3 | e_1, e_2) P(e_4 | e_2, e_3) ... P(e_n | e_{n-2}, e_{n-1})   (6.1)
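The exhaustive-arrangement experiment can be sketched as follows. Here `logprob_trigram` is a hypothetical smoothed estimate of log P(e3 | e1, e2), and boundary padding stands in for the initial factor P(e1, e2); brute force over n! permutations is feasible only for the short sentences used in the experiment.

```python
import itertools

def most_likely_arrangement(words, logprob_trigram):
    """Score every permutation of `words` with the trigram product
    (6.1) and return the best one. `logprob_trigram(e1, e2, e3)` is a
    hypothetical smoothed estimate of log P(e3 | e1, e2)."""
    BOUNDARY = "<s>"
    def score(seq):
        # pad so the first two words are conditioned on the boundary
        padded = (BOUNDARY, BOUNDARY) + seq
        return sum(logprob_trigram(padded[i], padded[i + 1], padded[i + 2])
                   for i in range(len(seq)))
    return max(itertools.permutations(words), key=score)
```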
We then took other English sentences (not included in the training corpus) and determined which of the n! different arrangements of their n words was most likely, using the formula (6.1). We found that in 63% of sentences of 10 words or less, the most likely arrangement was the original English sentence. Furthermore, the most likely arrangement preserved the meaning of the original sentence in 79% of the cases.
Figure 4 shows examples of synonymous and non-synonymous re-arrangements.
We realize that very little hope exists of the glossary yielding the words and only the words of an English sentence translating the original French one, and that, furthermore, English sentences are typically longer than 10 words. Nevertheless, we feel that the above result is a hopeful one for future statistical translation methods incorporating the use of appropriate syntactic structure information.
REFERENCES
[1] L.R. Bahl, F. Jelinek, and R.L. Mercer: A maximum likelihood approach to continuous speech recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2): 179-190, March 1983.
[2] J.K. Baker: Stochastic modeling for automatic speech understanding. In R.A. Reddy, editor, Speech Recognition, pages 521-541, Academic Press, New York, 1979.
[3] J.D. Ferguson: Hidden Markov analysis: An introduction. In J.D. Ferguson, Ed., Hidden Markov Models for Speech, Princeton, New Jersey, IDA-CRD, Oct. 1980, pp. 8-15.
[4] J. McH. Sinclair: "Lexicographic Evidence," in Dictionaries, Lexicography and Language Learning (ELT Documents: 120), editor R. Ilson, New York: Pergamon Press, pp. 81-94, 1985.
[5] R.G. Garside, G.N. Leech and G.R. Sampson: The Computational Analysis of English: a Corpus-Based Approach, Longman, 1987.
[6] G.R. Sampson: "A Stochastic Approach to Parsing," in Proceedings of the 11th International Conference on Computational Linguistics (COLING '86), Bonn, 151-155, 1986.
[7] W. Weaver: Translation (1949). Reproduced in: Locke, W.N. & Booth, A.D., eds.: Machine translation of languages, Cambridge, MA: MIT Press, 1955.
[8] Hansards: Official Proceedings of the House of Commons of Canada, 1974-78, Canadian Government Printing Bureau, Hull, Quebec, Canada.
[9] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone: Classification and Regression Trees, Wadsworth and Brooks, Monterey, CA, 1984.
[10] R.G. Gallager: Information Theory and Reliable Communication, John Wiley and Sons, Inc., New York, 1968.
[11] A.P. Dempster, N.M. Laird, and D.B. Rubin: Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39(B): 1-38, 1977.
[12] A.J. Viterbi: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory, IT-13: 260-269, 1967.
[13] L.E. Baum: An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process, Inequalities, 3: 1-8, 1972.
[14] F. Jelinek: A fast sequential decoding algorithm using a stack, IBM Journal of Research and Development, vol. 13, pp. 675-685, Nov. 1969.
Mr. Speaker, I rise on a question of privilege
Monsieur l'Orateur, je souleve la question de privilege
affecting the rights and prerogatives of parliamentary committees
a propos des droits et des prerogatives des comites parlementaires
and one which reflects on the word of two ministers
et pour mettre en doute les propos de deux ministres
of the Crown.
de la Couronne.
FIGURE 1
ALIGNMENT OF A FRENCH AND ENGLISH SENTENCE PAIR
eau water
lait milk
banque bank
banques banks
hier yesterday
janvier January
jours days
votre your
enfants children
trop too
toujours always
trois three
monde world
pourquoi why
aujourd'hui today
sans without
lui him
mais but
suis am
seulement only
peut cannot
ceintures seat
ceintures belts
bravo !
FIGURE 2
A LIST OF HIGH MUTUAL INFORMATION FRENCH-ENGLISH WORD PAIRS
WHICH QUI
1. qui 0.380 who 0.188
2. que 0.177 which 0.161
3. dont 0.082 that 0.084
4. de 0.060 0.038
5. d' 0.035 to 0.032
6. laquelle 0.031 of 0.027
7. ou 0.027 the 0.026
8. et 0.022 what 0.018
THEREFORE DONC
1. donc 0.514 therefore 0.322
2. consequent 0.075 so 0.147
3. par 0.074 is 0.034
4. ce 0.066 then 0.024
5. pourquoi 0.064 thus 0.022
6. alors 0.025 the 0.018
7. il 0.025 that 0.013
8. aussi 0.015 us 0.012
STILL ENCORE
1. encore 0.435 still 0.181
2. toujours 0.230 again 0.174
3. reste 0.027 yet 0.148
4. *** 0.020 even 0.055
5. quand 0.018 more 0.046
6. meme 0.017 another 0.030
7. de 0.015 further 0.021
8. de 0.014 once 0.013
FIGURE 3 (PART I)
EXAMPLES OF PARTIAL GLOSSARY LISTS OF MOST LIKELY WORD TRANSLATES AND THEIR PROBABILITIES
Note: *** denotes miscellaneous words not belonging to the lexicon.
PEOPLE GENS
1. les 0.267 people 0.781
2. gens 0.244 they 0.013
3. personnes 0.100 those 0.009
4. population 0.055 individuals 0.008
5. peuple 0.035 persons 0.005
6. canadiens 0.031 people's 0.004
7. habitants 0.024 men 0.004
8. ceux 0.023 person 0.003
OBTAIN OBTENIR
1. obtenir 0.457 get 0.301
2. pour 0.050 obtain 0.108
3. les 0.033 have 0.036
4. de 0.031 getting 0.032
5. trouver 0.026 seeking 0.023
6. se 0.025 available 0.021
7. obtenu 0.020 obtaining 0.021
8. procurer 0.020 information 0.016
QUICKLY RAPIDEMENT
1. rapidement 0.508 quickly 0.389
2. vite 0.130 rapidly 0.147
3. tot 0.042 fast 0.052
4. rapide 0.021 quick 0.042
5. brievement 0.019 soon 0.036
6. aussitot 0.013 faster 0.035
7. plus 0.012 speedy 0.026
8. bientot 0.012 briefly 0.025
FIGURE 3 (PART II)
EXAMPLES OF PARTIAL GLOSSARY LISTS OF MOST LIKELY
WORD TRANSLATES AND THEIR PROBABILITIES
EXAMPLES OF RECONSTRUCTIONS THAT PRESERVE MEANING:
would I report directly to you?
I would report directly to you?
now let me mention some of the disadvantages.
let me mention some of the disadvantages now.
he did this several hours later.
this he did several hours later.
EXAMPLES OF RECONSTRUCTIONS THAT DO NOT PRESERVE MEANING:
these people have a fairly large rate of turnover.
of these people have a fairly large turnover rate.
in our organization research has two missions.
in our missions research organization has two.
exactly how this might be done is not clear.
clear is not exactly how this might be done.
FIGURE 4
STATISTICAL ARRANGEMENT OF WORDS BELONGING TO
ENGLISH SENTENCES