Proceedings of the 12th Conference of the European Chapter of the ACL, pages 166–174,

Athens, Greece, 30 March – 3 April 2009. © 2009 Association for Computational Linguistics

An Alignment Algorithm using Belief Propagation and a Structure-Based

Distortion Model

Fabien Cromières

Graduate school of informatics

Kyoto University

Kyoto, Japan

fabien@nlp.kuee.kyoto-u.ac.jp

Sadao Kurohashi

Graduate school of informatics

Kyoto University

Kyoto, Japan

kuro@i.kyoto-u.ac.jp

Abstract

In this paper, we first demonstrate the in-

terest of the Loopy Belief Propagation al-

gorithm to train and use a simple align-

ment model where the expected marginal

values needed for an efficient EM-training

are not easily computable. We then im-

prove this model with a distortion model

based on structure conservation.

1 Introduction and Related Work

Automatic word alignment of parallel corpora is

an important step for data-oriented Machine trans-

lation (whether Statistical or Example-Based) as

well as for automatic lexicon acquisition. Many

algorithms have been proposed in the last twenty

years to tackle this problem. One of the most

successful alignment procedures so far seems to be

the so-called “IBM model 4” described in (Brown

et al., 1993). It involves a very complex distor-

tion model (here and in subsequent usages “dis-

tortion” will be a generic term for the reordering

of the words occurring in the translation process)

with many parameters that make it very complex

to train.

By contrast, the first alignment model we are going to propose is fairly simple. But this simplicity will allow us to experiment with different ideas for making better use of the sentence structures in the alignment process. This model (and even more so its subsequent variations), although simple, does not have a computationally efficient procedure for exact EM-based training. However, we will give some theoretical and empirical evidence that Loopy Belief Propagation can give us a good approximation procedure.

Although we do not have the space to review the

many alignment systems that have already been

proposed, we will briefly refer to works that share

some similarities with our approach. In particu-

lar, the first alignment model we will present has

already been described in (Melamed, 2000). We

differ however in the training and decoding pro-

cedure we propose. The problem of making use

of syntactic trees for alignment (and translation),

which is the object of our second alignment model,

has already received some attention, notably by

(Yamada and Knight, 2001) and (Gildea, 2003).

2 Factor Graphs and Belief Propagation

In this paper, we will make use of Factor Graphs several times. A Factor Graph is a graphical model, much like a Bayesian Network. The three most common types of graphical models (Factor Graphs, Bayesian Networks and Markov Networks) share the same purpose: intuitively, they allow us to represent the dependencies among random variables; mathematically, they represent a factorization of the joint probability of these variables.

Formally, a factor graph is a bipartite graph with 2 kinds of nodes. On one side, the Variable Nodes (abbreviated as V-Nodes from here on), and on the other side, the Factor Nodes (abbreviated as F-Nodes). If a Factor Graph represents a given joint distribution, there will be one V-Node for every random variable in this joint distribution. Each F-Node is associated with a function of the V-Nodes to which it is connected (more precisely, a function of the values of the random variables associated with the V-Nodes, but for brevity, we will frequently mix the notions of V-Node, Random Variable and their values). The joint distribution is then the product of these functions (and of a normalizing constant). Therefore, each F-Node actually represents a factor in the factorization of the joint distribution.

As a short example, let us consider a problem classically used to introduce Bayesian Networks. We want to model the joint probability of the Weather (W) being sunny or rainy, the Sprinkler (S) being on or off, and the Lawn (L) being wet or dry. Figure 1 shows the dependencies of the variables represented with a Factor Graph and with a Bayesian Network. Mathematically, the Bayesian Network implies that the joint probability has the following factorization: P(W,L,S) = P(W) · P(S|W) · P(L|W,S). The Factor Graph implies there exist two functions ϕ1 and ϕ2 as well as a normalization constant C such that we have the factorization: P(W,L,S) = C · ϕ2(W,S) · ϕ1(L,W,S). If we set C = 1, ϕ2(W,S) = P(W) · P(S|W) and ϕ1(L,W,S) = P(L|W,S), the Factor Graph expresses exactly the same factorization as the Bayesian Network.

Figure 1: A classical example
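The equivalence of the two factorizations can be checked numerically. The following is a minimal sketch; all probability values are hypothetical and chosen only for illustration, and the function names are ours.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_W = {"sunny": 0.7, "rainy": 0.3}
P_S_given_W = {("on", "sunny"): 0.4, ("off", "sunny"): 0.6,
               ("on", "rainy"): 0.05, ("off", "rainy"): 0.95}
# P(L = wet | W, S); P(L = dry | W, S) is the complement.
P_wet_given_WS = {("sunny", "on"): 0.9, ("sunny", "off"): 0.05,
                  ("rainy", "on"): 0.99, ("rainy", "off"): 0.95}

def p_l_given_ws(l, w, s):
    p_wet = P_wet_given_WS[(w, s)]
    return p_wet if l == "wet" else 1.0 - p_wet

def phi2(w, s):
    # Factor over (W, S): P(W) * P(S|W)
    return P_W[w] * P_S_given_W[(s, w)]

def phi1(l, w, s):
    # Factor over (L, W, S): P(L|W,S)
    return p_l_given_ws(l, w, s)

def joint_factor_graph(w, l, s, c=1.0):
    return c * phi2(w, s) * phi1(l, w, s)

def joint_bayes_net(w, l, s):
    return P_W[w] * P_S_given_W[(s, w)] * p_l_given_ws(l, w, s)

# Every joint assignment of (W, L, S).
STATES = list(product(P_W, ["wet", "dry"], ["on", "off"]))
total = sum(joint_factor_graph(w, l, s) for w, l, s in STATES)
```

With C = 1 both functions return the same value for every assignment, and `total` sums to 1 since the factors are built from proper conditional distributions.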

A reason to use Graphical Models is that we can use with them an algorithm called Belief Propagation (abbreviated as BP from here on) (Pearl, 1988). The BP algorithm comes in two flavors: sum-product BP and max-product BP. They respectively solve two problems that arise often (and are often intractable) in the use of a probabilistic model: “what are the marginal probabilities of each individual variable?” and “what is the set of values with the highest probability?”. More precisely, the BP algorithm will give the correct answer to these questions if the graph representing the distribution is a forest. If it is not the case, the BP algorithm is not even guaranteed to converge. It has been shown, however, that the BP algorithm does converge in many practical cases, and that the results it produces are often surprisingly good approximations (see, for example, (Murphy et al., 1999) or (Weiss and Freeman, 2001)).

(Yedidia et al., 2003) gives a very good presentation of the sum-product BP algorithm, as well as some theoretical justifications for its success. We will just give an outline of the algorithm. The BP algorithm is a message-passing algorithm. Messages are sent during several iterations until convergence. At each iteration, each V-Node sends to its neighboring F-Nodes a message representing an estimation of its own marginal values. The message sent by the V-Node $V_i$ to the F-Node $F_j$ estimating the marginal probability of $V_i$ to take the value x is:

$$m_{V_i \to F_j}(x) = \prod_{F_k \in N(V_i) \setminus F_j} m_{F_k \to V_i}(x)$$

($N(V_i)$ represents the set of the neighbours of $V_i$.) Also, every F-Node sends a message to its neighboring V-Nodes that represents its estimates of the marginal values of the V-Node:

$$m_{F_j \to V_i}(x) = \sum_{v_1,\ldots,v_n} \phi_j(v_1,\ldots,x,\ldots,v_n) \cdot \prod_{V_k \in N(F_j) \setminus V_i} m_{V_k \to F_j}(v_k)$$

At any point, the belief of a V-Node $V_i$ is given by

$$b_i(x) = \prod_{F_k \in N(V_i)} m_{F_k \to V_i}(x),$$

$b_i$ being normalized so that $\sum_x b_i(x) = 1$. The belief $b_i(x)$ is expected to converge to the marginal probability (or an approximation of it) of $V_i$ taking the value x.

An interesting point to note is that each message can be “scaled” (that is, multiplied by a constant) by any factor at any point without changing the result of the algorithm. This is very useful both for preventing overflow and underflow during computation, and also sometimes for simplifying the algorithm (we will use this in section 3.2). Also, damping schemes such as the ones proposed in (Murphy et al., 1999) or (Heskes, 2003) are useful for decreasing the cases of non-convergence.

As for the max-product BP, it is best explained as “sum-product BP where each sum is replaced by a maximization”.
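The message-passing scheme outlined in this section can be sketched in a few lines of Python. This is an illustrative implementation under our own assumptions (dictionary-based data structures, sequential in-place updates, rescaling of the F-Node messages at every step); it is not the implementation used for the experiments. The example graph is a small chain, that is a forest, so the beliefs should match the exact marginals.

```python
from itertools import product
import math

def sum_product_bp(domains, factors, n_iters=10):
    """domains: {variable: list of values}.
    factors: list of (tuple of variables, function of their values).
    Returns normalized beliefs {variable: {value: belief}}."""
    # All messages start at 1, in both directions.
    m_vf = {(v, k): {x: 1.0 for x in domains[v]}
            for k, (vs, _) in enumerate(factors) for v in vs}
    m_fv = {(k, v): {x: 1.0 for x in domains[v]}
            for k, (vs, _) in enumerate(factors) for v in vs}
    neighbours = {v: [k for k, (vs, _) in enumerate(factors) if v in vs]
                  for v in domains}
    for _ in range(n_iters):
        # V-Node -> F-Node: product of the messages from the other F-Nodes.
        for (v, k) in list(m_vf):
            for x in domains[v]:
                m_vf[(v, k)][x] = math.prod(
                    m_fv[(k2, v)][x] for k2 in neighbours[v] if k2 != k)
        # F-Node -> V-Node: sum over the other variables of the factor.
        for (k, v) in list(m_fv):
            vs, phi = factors[k]
            others = [u for u in vs if u != v]
            new = {}
            for x in domains[v]:
                total = 0.0
                for vals in product(*(domains[u] for u in others)):
                    assign = dict(zip(others, vals))
                    assign[v] = x
                    weight = phi(*(assign[u] for u in vs))
                    for u in others:
                        weight *= m_vf[(u, k)][assign[u]]
                    total += weight
                new[x] = total
            scale = sum(new.values())  # messages may be rescaled freely
            m_fv[(k, v)] = {x: val / scale for x, val in new.items()}
    beliefs = {}
    for v in domains:
        b = {x: math.prod(m_fv[(k, v)][x] for k in neighbours[v])
             for x in domains[v]}
        z = sum(b.values())
        beliefs[v] = {x: val / z for x, val in b.items()}
    return beliefs

# Hypothetical chain A - B - C with a prior on A (a forest, so BP is exact).
domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
factors = [
    (("A",), lambda a: [0.6, 0.4][a]),
    (("A", "B"), lambda a, b: 0.9 if a == b else 0.1),
    (("B", "C"), lambda b, c: 0.8 if b == c else 0.2),
]
beliefs = sum_product_bp(domains, factors)
```

Replacing the inner sum by a maximum would give a max-product variant of the same sketch.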

3 The monolink model

We are now going to present a simple alignment model that will serve both to illustrate the efficiency of the BP algorithm and as a basis for further improvement. As previously mentioned, this model is mostly identical to one already proposed in (Melamed, 2000). The training and decoding procedures we propose are however different.

3.1 Description

Following the usual convention, we will designate the two sides of a sentence pair as French and English. A sentence pair will be noted (e,f). $e_i$ represents the word at position i in e.

In this first simple model, we will pay little at-

tention to the structure of the sentence pair we

want to align. Actually, each sentence will be re-

duced to a bag of words.

Intuitively, the two sides of a sentence pair express the same set of meanings. What we want to do in the alignment process is find the parts of the sentences that originate from the same meaning. We will suppose here that each meaning generates at most one word on each side, and we will name concept the pair of words generated by a meaning. It is possible for a meaning to be expressed in only one side of the sentence pair. In that case, we will have a “one-sided” concept consisting of only one word. In this view, a sentence pair appears “superficially” as a pair of bags of words, but the bags of words are themselves the visible part of an underlying bag of concepts.

We propose a simple generative model to de-

scribe the generation of a sentence pair (or rather,

its underlying bag of concepts):

• First, an integer n, representing the number of concepts of the sentence, is drawn from a distribution $P_{size}$.

• Then, n concepts are drawn independently from a distribution $P_{concept}$.

The probability of a bag of concepts C is then:

$$P(C) = P_{size}(|C|) \prod_{(w_1,w_2) \in C} P_{concept}((w_1,w_2))$$

We can alternatively represent a bag of concepts

as a pair of sentences (e,f), plus an alignment a.

a is a set of links, a link being represented as a

pair of positions in each side of the sentence pair

(the special position -1 indicating the empty side

of a one-sided concept). This alternative represen-

tation has the advantage of better separating what

is observed (the sentence pair) and what is hidden

(the alignment). It is not a strictly equivalent rep-

resentation (it also contains information about the

word positions) but this will not be relevant here.

The joint distribution of e, f and a is then:

$$P(e,f,a) = P_{size}(|a|) \prod_{(i,j) \in a} P_{concept}(e_i,f_j) \qquad (1)$$
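Equation (1) translates directly into code. The sketch below assumes a dictionary `p_concept` indexed by word pairs (with `None` for the empty side of a one-sided concept) and a callable `p_size`; these names and all numeric values are hypothetical.

```python
import math

NULL = -1  # special position marking the empty side of a one-sided concept

def joint_probability(e, f, alignment, p_concept, p_size):
    """P(e, f, a) as in equation (1); alignment is a set of (i, j) links."""
    prob = p_size(len(alignment))
    for i, j in alignment:
        w_e = e[i] if i != NULL else None
        w_f = f[j] if j != NULL else None
        prob *= p_concept.get((w_e, w_f), 0.0)
    return prob

# Hypothetical toy values.
e = ["the", "cat"]
f = ["le", "chat"]
p_concept = {("the", "le"): 0.3, ("cat", "chat"): 0.2, ("the", None): 0.05}
p_size = lambda n: 1.0 / 100 if n < 100 else 0.0  # uniform over 0..99
p = joint_probability(e, f, {(0, 0), (1, 1)}, p_concept, p_size)
```

Because `p_size` is uniform, it contributes the same constant to every alignment of reasonable size, which is why it drops out when alignments are compared.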

This model only takes into consideration one-to-one alignments. Therefore, from now on, we will call this model “monolink”. Considering only one-to-one alignments can be seen as a limitation compared to other models that can often produce at least one-to-many alignments, but on the positive side, this allows the monolink model to be nicely symmetric. Additionally, as already argued in (Melamed, 2000), there are ways to determine the boundaries of some multi-word phrases (Melamed, 2002), allowing several words to be treated as a single token. Alternatively, a procedure similar to the one described in (Cromieres, 2006), where substrings instead of single words are aligned (thus considering every possible segmentation), could be used.

With the monolink model, we want to do two things: first, we want to find good values for the distributions $P_{size}$ and $P_{concept}$. Then we want to be able to find the most likely alignment a given the sentence pair (e,f).

We will consider $P_{size}$ to be a uniform distribution over the integers up to a sufficiently big value (since it is not possible to have a uniform distribution over an infinite discrete set). We will not need to determine the exact value of $P_{size}$. The assumption that it is uniform is actually enough to “remove” it from the computations that follow.

In order to determine the $P_{concept}$ distribution, we can use an EM procedure. It is easy to show that, at every iteration, the EM procedure will require setting $P_{concept}(w_e,w_f)$ proportional to the sum of the expected counts of the concept $(w_e,w_f)$ over the training corpus. This, in turn, means we have to compute the conditional expectation:

$$E((i,j) \in a \mid e,f) = \sum_{a \mid (i,j) \in a} P(a \mid e,f)$$

for every sentence pair (e,f). This computation requires a sum over all the possible alignments, whose number grows exponentially with the size of the sentences. As noted in (Melamed, 2000), it does not seem possible to compute this expectation efficiently with dynamic programming tricks like the one used in the IBM models 1 and 2 (as a passing remark, these “tricks” can actually be seen as instances of the BP algorithm).
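For very short sentences, the expectation above can be computed exactly by enumerating every one-to-one alignment, which makes the combinatorial growth concrete. The sketch below is only meant as a reference for tiny inputs; the table `p_concept` and its values are hypothetical, and unaligned words are scored as one-sided concepts.

```python
from itertools import combinations, permutations

def one_to_one_alignments(n_e, n_f):
    """Yield every one-to-one alignment as a frozenset of (i, j) links;
    positions that appear in no link are one-sided concepts."""
    for k in range(min(n_e, n_f) + 1):
        for es in combinations(range(n_e), k):
            for fs in permutations(range(n_f), k):
                yield frozenset(zip(es, fs))

def link_posteriors(e, f, p_concept):
    """Exact E((i,j) in a | e,f) by brute-force enumeration (P_size uniform)."""
    weights = {}
    for a in one_to_one_alignments(len(e), len(f)):
        aligned_e = {i for i, _ in a}
        aligned_f = {j for _, j in a}
        w = 1.0
        for i, j in a:
            w *= p_concept[(e[i], f[j])]
        for i in range(len(e)):
            if i not in aligned_e:
                w *= p_concept[(e[i], None)]
        for j in range(len(f)):
            if j not in aligned_f:
                w *= p_concept[(None, f[j])]
        weights[a] = w
    z = sum(weights.values())
    return {(i, j): sum(w for a, w in weights.items() if (i, j) in a) / z
            for i in range(len(e)) for j in range(len(f))}

# Hypothetical toy table.
e = ["the", "cat"]
f = ["le", "chat"]
p_concept = {("the", "le"): 0.3, ("the", "chat"): 0.01,
             ("cat", "le"): 0.01, ("cat", "chat"): 0.2,
             ("the", None): 0.05, ("cat", None): 0.05,
             (None, "le"): 0.05, (None, "chat"): 0.05}
post = link_posteriors(e, f, p_concept)
n_2x3 = sum(1 for _ in one_to_one_alignments(2, 3))
```

A 2-word by 3-word pair already has 13 alignments, and the count grows very quickly with sentence length, which is the motivation for the approximation described next.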

We propose to solve this problem by applying the BP algorithm to a Factor Graph representing the conditional distribution P(a|e,f). Given a sentence pair (e,f), we build this graph as follows.

We create a V-Node $V^e_i$ for every position i in the English sentence. This V-Node can take for value any position in the French sentence, or the special position −1 (meaning this position is not aligned, corresponding to a one-sided concept). We create symmetrically a V-Node $V^f_j$ for every position in the French sentence.

We have to enforce a “reciprocal love” condition: if a V-Node at position i chooses a position j on the opposite side, the opposite V-Node at position j must choose the position i. This is done by adding an F-Node $F^{rec}_{i,j}$ between every opposite node $V^e_i$ and $V^f_j$, associated with the function:

$$\phi^{rec}_{i,j}(k,l) = \begin{cases} 1 & \text{if } (i = l \text{ and } j = k) \text{ or } (i \neq l \text{ and } j \neq k) \\ 0 & \text{else} \end{cases}$$

We then connect a “translation probability” F-Node $F^{tp.e}_i$ to every V-Node $V^e_i$, associated with the function:

$$\phi^{tp.e}_i(j) = \begin{cases} \sqrt{P_{concept}(e_i,f_j)} & \text{if } j \neq -1 \\ P_{concept}(e_i,\emptyset) & \text{if } j = -1 \end{cases}$$

We add symmetrically on the French side F-Nodes $F^{tp.f}_j$ to the V-Nodes $V^f_j$.

It should be fairly easy to see that such a Factor Graph represents P(a|e,f). See figure 2 for an example.

Figure 2: A Factor Graph for the monolink model in the case of a 2-word English sentence and a 3-word French sentence ($F^{rec}_{ij}$ nodes are noted Fri-j)

Using the sum-product BP, the beliefs of every V-Node $V^e_i$ to take the value j and of every node $V^f_j$ to take the value i should converge to the marginal expectation $E((i,j) \in a \mid e,f)$ (or rather, a hopefully good approximation of it).

We can also use max-product BP on the same graph to decode the most likely alignment. In the monolink case, decoding is actually an instance of the “assignment problem”, for which efficient algorithms are known. However this will not be the case for the more complex model of the next section. Actually, (Bayati et al., 2005) has recently proved that max-product BP always gives the optimal solution to the assignment problem.
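To see why this graph represents P(a|e,f), one can multiply all the factors for a given assignment of the V-Nodes and check two things: inconsistent assignments get a zero score, and a consistent assignment gets the product of its concept probabilities (the two square-root factors of a two-sided link multiply back to $P_{concept}(e_i,f_j)$). The following sketch does this on a 2 by 2 example; the tables `p_link`, `p_null_e`, `p_null_f` and their values are hypothetical.

```python
from itertools import product
import math

NULL = -1

def phi_rec(i, j, k, l):
    """Reciprocity factor between V^e_i (taking value k) and V^f_j (value l)."""
    return 1.0 if (i == l and j == k) or (i != l and j != k) else 0.0

def graph_score(e_vals, f_vals, p_link, p_null_e, p_null_f):
    """Product of all factors for one assignment of the V-Nodes.
    e_vals[i] is the value of V^e_i, f_vals[j] the value of V^f_j,
    p_link[i][j] stands for P_concept(e_i, f_j)."""
    score = 1.0
    for i, k in enumerate(e_vals):   # translation factors, English side
        score *= p_null_e[i] if k == NULL else math.sqrt(p_link[i][k])
    for j, l in enumerate(f_vals):   # translation factors, French side
        score *= p_null_f[j] if l == NULL else math.sqrt(p_link[l][j])
    for i, k in enumerate(e_vals):   # reciprocity factors
        for j, l in enumerate(f_vals):
            score *= phi_rec(i, j, k, l)
    return score

# Hypothetical 2-word by 2-word example.
p_link = [[0.4, 0.01], [0.01, 0.4]]
p_null_e = [0.05, 0.05]
p_null_f = [0.05, 0.05]
values = [NULL, 0, 1]
nonzero = [(ev, fv)
           for ev in product(values, repeat=2)
           for fv in product(values, repeat=2)
           if graph_score(ev, fv, p_link, p_null_e, p_null_f) > 0.0]
```

The assignments with a non-zero score are exactly the one-to-one alignments (7 of them in the 2 by 2 case).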

3.2 Efficient BP iterations

Applying naively the BP algorithm would lead us to a complexity of $O(|e|^2 \cdot |f|^2)$ per BP iteration. While this is not intractable, it could turn out to be a bit slow. Fortunately, we found it is possible to reduce this complexity to $O(|e| \cdot |f|)$ by making two useful observations.

Let us note $m^e_{ij}$ the resulting message from $V^e_i$ to $V^f_j$ (that is the message sent by $F^{rec}_{i,j}$ to $V^f_j$ after it received its own message from $V^e_i$). $m^e_{ij}(x)$ has the same value for every x different from i:

$$m^e_{ij}(x \neq i) = \sum_{k \neq j} \frac{b^e_i(k)}{m^f_{ji}(k)}$$

We can divide all the messages $m^e_{ij}$ by $m^e_{ij}(x \neq i)$, so that $m^e_{ij}(x) = 1$ except if x = i; and the same can be done for the messages coming from the French side $m^f_{ij}$. It follows that $m^e_{ij}(x \neq i) = \sum_{k \neq j} b^e_i(k) = 1 - b^e_i(j)$ if the $b^e_i$ are kept normalized. Therefore, at every step, we only need to compute $m^e_{ij}(j)$, not $m^e_{ij}(x \neq j)$.

Hence the following algorithm ($m^e_{ij}(j)$ will be here abbreviated to $m^e_{ij}$ since it is the only value of the message we need to compute). We describe the process for computing the English-side messages and beliefs ($m^e_{ij}$ and $b^e_i$), but the process must also be done symmetrically for the French-side messages and beliefs ($m^f_{ij}$ and $b^f_i$) at every iteration.

0- Initialize all messages and beliefs with: $m^{e(0)}_{ij} = 1$ and $b^{e(0)}_i(j) = \phi^{tp.e}_i(j)$

Until convergence (or for a set number of iterations):

1- Compute the messages $m^e_{ij}$: $m^{e(t+1)}_{ij} = b^{e(t)}_i(j) / ((1 - b^{e(t)}_i(j)) \cdot m^{f(t)}_{ji})$

2- Compute the beliefs $b^e_i(j)$: $b^{e(t+1)}_i(j) = \phi^{tp.e}_i(j) \cdot m^{f(t+1)}_{ji}$

3- And then normalize the $b^{e(t+1)}_i(j)$ so that $\sum_j b^{e(t+1)}_i(j) = 1$.

A similar algorithm can be found for the max-product BP.
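Steps 0 to 3 can be sketched as follows in plain Python. Several details are our own assumptions and not part of the description above: the null position is stored as a last column, the initial beliefs are normalized so that $1 - b$ is meaningful from the first iteration, a small `eps` guards against division by zero, and no damping is applied.

```python
import math

def monolink_bp(phi_e, phi_f, n_iters=10, eps=1e-12):
    """phi_e[i][j]: local factor for English position i choosing French
    position j; the last column (index n_f) stands for the null position.
    phi_f[j][i] is the symmetric table for the French side.
    Returns the English-side and French-side beliefs, each row normalized."""
    n_e, n_f = len(phi_e), len(phi_f)

    def normalize(row):
        z = sum(row)
        return [x / z for x in row]

    # Step 0: messages at 1, beliefs at the (normalized) local factors.
    m_e = [[1.0] * n_f for _ in range(n_e)]  # m_e[i][j]: from V^e_i to V^f_j
    m_f = [[1.0] * n_e for _ in range(n_f)]  # m_f[j][i]: from V^f_j to V^e_i
    b_e = [normalize(row) for row in phi_e]
    b_f = [normalize(row) for row in phi_f]

    for _ in range(n_iters):
        # Step 1: new messages from the previous beliefs and messages.
        new_m_e = [[b_e[i][j] / (max(1.0 - b_e[i][j], eps) * m_f[j][i])
                    for j in range(n_f)] for i in range(n_e)]
        new_m_f = [[b_f[j][i] / (max(1.0 - b_f[j][i], eps) * m_e[i][j])
                    for i in range(n_e)] for j in range(n_f)]
        m_e, m_f = new_m_e, new_m_f
        # Steps 2 and 3: beliefs from the opposite-side messages, normalized.
        # The null position receives no reciprocity message.
        b_e = [normalize([phi_e[i][j] * m_f[j][i] for j in range(n_f)]
                         + [phi_e[i][n_f]]) for i in range(n_e)]
        b_f = [normalize([phi_f[j][i] * m_e[i][j] for i in range(n_e)]
                         + [phi_f[j][n_e]]) for j in range(n_f)]
    return b_e, b_f

# Hypothetical 2 by 2 example: strong diagonal concept probabilities.
p = [[0.4, 0.01], [0.01, 0.4]]
p_null = 0.05
phi_e = [[math.sqrt(p[i][j]) for j in range(2)] + [p_null] for i in range(2)]
phi_f = [[math.sqrt(p[i][j]) for i in range(2)] + [p_null] for j in range(2)]
b_e, b_f = monolink_bp(phi_e, phi_f)
```

Each iteration touches every (i, j) pair a constant number of times, which corresponds to the $O(|e| \cdot |f|)$ cost per iteration stated above.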

3.3 Experimental Results

We evaluated the monolink algorithm with two language pairs: French-English and Japanese-English.

For the English-French pair, we used 200,000 sentence pairs extracted from the Hansard corpus (Germann, 2001). Evaluation was done with the scripts and gold standard provided during the workshop HLT-NAACL 2003¹ (Mihalcea and Pedersen, 2003). Null links are not considered for the evaluation.

For the English-Japanese evaluation, we used 100,000 sentence pairs extracted from a corpus of English/Japanese news. We used 1000 sentence pairs extracted from pre-aligned data (Utiyama and Isahara, 2003) as a gold standard. We segmented all the Japanese data with the automatic segmenter Juman (Kurohashi and Nagao, 1994). There is a caveat to this evaluation, though. The reason is that the segmentation and alignment scheme used in our gold standard is not very fine-grained: mostly, big chunks of the Japanese sentence covering several words are aligned to big chunks of the English sentence. For the evaluation, we had to consider that when two chunks are aligned, there is a link between every pair of words belonging to each chunk. A consequence is that our gold standard will contain a lot more links than it should, some of them not relevant. This means that the recall will be largely underestimated and the precision will be overestimated.

For the BP/EM training, we used 10 BP iterations for each sentence, and 5 global EM iterations. By using a damping scheme for the BP algorithm, we never observed a problem of non-convergence (such problems do commonly appear without damping). With our Python/C implementation, training time was approximately 1 hour. But with a better implementation, it should be possible to reduce this time to something comparable to the model 1 training time with GIZA++.

For the decoding, although the max-product BP

should be the algorithm of choice, we found we

could obtain slightly better results (by between 1

and 2 AER points) by using the sum-product BP,

choosing links with high beliefs, and cutting-off

links with very small beliefs (the cut-off was cho-

sen roughly by manually looking at a few aligned

sentences not used in the evaluation, so as not to

create too much bias).

Due to space constraints, all of the results of this

section and the next one are summarized in two

tables (tables 1 and 2) at the end of this paper.

In order to compare the efficiency of the BP training procedure to a simpler one, we reimplemented the Competitive Link Algorithm (abbreviated as CLA from here on) that is used in (Melamed, 2000) to train an identical model. This algorithm starts with some relatively good estimates found by computing correlation scores (we used the G-test score) between words based on their number of co-occurrences. A greedy Viterbi training is then applied to improve this initial guess. In contrast, our BP/EM training does not need to compute correlation scores and starts the training with uniform parameters.

We only evaluated the CLA on the French/English pair. The first iteration of CLA did improve alignment quality, but subsequent ones decreased it. The reported score for CLA is therefore the one obtained during the best iteration. The BP/EM training demonstrates a clear superiority over the CLA here, since it produces almost 7 points of AER improvement over CLA.

In order to have a comparison with a well-known and state-of-the-art system, we also used the GIZA++ program (Och and Ney, 1999) to align the same data. We tried alignments in both directions and provide the results for the direction that gave the best results. The settings used were the ones used by the training scripts of the Moses system², which we assumed to be fairly optimal. We tried alignment with the default Moses settings (5 iterations of model 1, 5 of HMM, 3 of model 3, 3 of model 4) and also tried with an increased number of iterations for each model (up to 10 per model).

We are aware that the score we obtained for model 4 in English-French is slightly worse than what is usually reported for a similar size of training data. At the time of this paper, we did not have the time to investigate if it is a problem of non-optimal settings in GIZA++, or if the training data we used was “difficult to learn from” (it is common to extract sentences of moderate length for the training data but we didn’t, and some sentences of our training corpus do have more than 200 words; also, we did not use any kind of preprocessing). In any case, GIZA++ is compared here with an algorithm trained on the same data and with no possibilities for fine-tuning; therefore the comparison should be fair.

The comparison shows that performance-wise, the monolink algorithm is between the model 2 and the model 3 for English/French. Considering our model has the same number of parameters as the model 1 (namely, the word translation probabilities, or concept probabilities in our model), these are pretty good results. Overall, the monolink model tends to give better precision and worse recall than the GIZA++ models, which was to be expected given the different type of alignments produced (1-to-1 and 1-to-many).

¹ http://www.cs.unt.edu/~rada/wpt/

² http://www.statmt.org/moses/

For English/Japanese, monolink is at just about the level of model 1, but models 1, 2 and 3 have very close performances for this language pair (interestingly, this is different from the English/French pair). Incidentally, these performances are very poor. Recall was expected to be low, due to the previously mentioned problem with the gold standard. But precision was expected to be better. It could be that the algorithms are confused by the very fine-grained segmentation produced by Juman.

4 Adding distortion through structure

4.1 Description

While the simple monolink model gives interesting results, it is somewhat limited in that it does not use any model of distortion. We will now try to add a distortion model; however, rather than directly modeling the movement of the positions of the words, as is the case in the IBM models, we will try to design a distortion model based on the structures of the sentences. In particular, we are interested in using the trees produced by syntactic parsers.

The intuition we want to use is that, much like there is a kind of “lexical conservation” in the translation process, meaning that a word on one side usually has an equivalent on the other side, there should also be a kind of “structure conservation”, with most structures on one side having an equivalent on the other.

Before going further, we should make precise the idea of “structure” we are going to use. As we said, our prime (but not only) interest will be to make use of the syntactic trees of the sentences to be aligned. However, these kinds of trees come in very different shapes depending on the language and the type of parser used (dependency, constituents, ...). This is why we decided the only information we would keep from a syntactic tree is the set of its sub-nodes. More specifically, for every sub-node, we will only consider the set of positions it covers in the underlying sentence. We will call such a set of positions a P-set. This simplification will allow us to process dependency trees, constituent trees and other structures in a uniform way. Figure 3 gives an example of a constituent tree and the P-sets it generates.

Figure 3: A small syntactic tree and the 3 P-sets it generates
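As an illustration of how P-sets can be read off a tree, here is a minimal sketch. It assumes a constituent tree encoded as nested Python lists with strings as leaves, uses 0-based positions, and keeps only the P-sets of internal nodes (singleton P-sets for the leaves are omitted); this encoding and these choices are ours, not taken from the parsers used in the experiments.

```python
def p_sets(tree):
    """Return the P-sets (frozensets of 0-based positions) covered by the
    internal nodes of a tree given as nested lists with string leaves."""
    result = []

    def visit(node, start):
        # Returns the position just after the last leaf covered by `node`.
        if isinstance(node, str):
            return start + 1
        end = start
        for child in node:
            end = visit(child, end)
        result.append(frozenset(range(start, end)))
        return end

    visit(tree, 0)
    return result

# Hypothetical example: [[the cat] [chases [a mouse]]]
tree = [["the", "cat"], ["chases", ["a", "mouse"]]]
sets = p_sets(tree)
```

For this tree the function yields the P-sets {0,1}, {3,4}, {2,3,4} and {0,1,2,3,4}. A dependency tree could be handled the same way by collecting, for each node, the positions of the node and of all its descendants.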

According to our intuition about the “conservation of structure”, some (not all) of the P-sets on one side should have an equivalent on the other side. We can model this in a way similar to how we represented equivalence between words with concepts. We postulate that, in addition to a bag of concepts, sentence pairs are underlaid by a set of P-concepts, P-concepts being actually pairs of P-sets (a P-set for each side of the sentence pair). We also allow the existence of one-sided P-concepts.

In the previous model, sentence pairs were just bags of words underlaid by a bag of concepts, and there was no modeling of the position of the words. P-concepts bring a notion of word position to the model. Intuitively, there should be coherence between P-concepts and concepts. This coherence will come from a compatibility constraint: if a sentence contains a two-sided P-concept $(PS_e,PS_f)$, and if a word $w_e$ covered by $PS_e$ comes from a two-sided concept $(w_e,w_f)$, then $w_f$ must be covered by $PS_f$.

Let us describe the model more formally. In the view of this model, a sentence pair is fully described by: e and f (the sentences themselves), a (the word alignment giving us the underlying bag of concepts), $s^e$ and $s^f$ (the sets of P-sets on each side of the sentence) and $a_s$ (the P-set alignment that gives us the underlying set of P-concepts). e, f, $s^e$, $s^f$ are considered to be observed (even if we will need parsing tools to observe $s^e$ and $s^f$); a and $a_s$ are hidden. The probability of a sentence pair is given by the joint probability of these variables: $P(e,f,s^e,s^f,a,a_s)$. By making some simple independence assumptions, we can write:

$$P(a,a_s,e,f,s^e,s^f) = P_{ml}(a,e,f) \cdot P(s^e,s^f \mid e,f) \cdot P(a_s \mid a,s^e,s^f)$$

$P_{ml}(a,e,f)$ is taken to be identical to the monolink model (see equation (1)). We are not interested in $P(s^e,s^f \mid e,f)$ (parsers will deal with it for us). In our model, $P(a_s \mid a,s^e,s^f)$ will be equal to:

$$P(a_s \mid a,s^e,s^f) = C \cdot \prod_{(i,j) \in a_s} P_{pc}(s^e_i,s^f_j) \cdot comp(a,a_s,s^e,s^f)$$

where $comp(a,a_s,s^e,s^f)$ is equal to 1 if the compatibility constraint is verified, and 0 otherwise. C is a normalizing constant. $P_{pc}$ describes the probability of each P-concept.
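The compatibility function can be written down directly from the constraint stated earlier. The sketch below is one possible reading: it assumes the constraint is applied in both directions (English to French and French to English), that links are pairs of indices with −1 for an empty side, and that `s_e` and `s_f` are lists of position sets; all of these are our assumptions.

```python
NULL = -1

def comp(a, a_s, s_e, s_f):
    """Return 1 if the compatibility constraint holds, else 0.
    a: word links (i, j); a_s: P-set links (k, l), indices into s_e and s_f."""
    e_to_f = {i: j for i, j in a if i != NULL and j != NULL}
    f_to_e = {j: i for i, j in e_to_f.items()}
    for k, l in a_s:
        if k == NULL or l == NULL:
            continue  # one-sided P-concepts impose no constraint
        for i in s_e[k]:
            # A linked word covered by the English P-set must land in s_f[l].
            if i in e_to_f and e_to_f[i] not in s_f[l]:
                return 0
        for j in s_f[l]:
            # Symmetric check from the French side.
            if j in f_to_e and f_to_e[j] not in s_e[k]:
                return 0
    return 1

# Hypothetical example: one P-set on each side, aligned together.
s_e = [{0, 1}]
s_f = [{0, 1}]
ok = comp({(0, 0), (1, 1)}, {(0, 0)}, s_e, s_f)
broken = comp({(0, 2)}, {(0, 0)}, s_e, s_f)
```

Here `ok` is 1 because both linked words stay inside the aligned P-sets, while `broken` is 0 because the word at English position 0 is linked to French position 2, outside the aligned French P-set.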

Although it would be possible to learn parameters for the distribution $P_{pc}$ depending on the characteristics of each P-concept, we want to keep our model simple. Therefore, $P_{pc}$ will have only two different values: one for the one-sided P-concepts, and one for the two-sided ones. Considering the constraint of normalization, we then have actually one parameter: $\alpha = \frac{P_{pc}(\text{1-sided})}{P_{pc}(\text{2-sided})}$. Although it would be possible to learn the parameter α during the EM-training, we choose to set it at a preset value. Intuitively, we should have 0 < α < 1, because if α is greater than 1, then the one-sided P-concepts will be favored by the model, which is not what we want. Some empirical experiments showed that all values of α in the range [0.5,0.9] were giving good results, which leads us to think that α can be set mostly independently from the training corpus.

We still need to train the concept probabilities (used in $P_{ml}(a,e,f)$), and to be able to decode the most probable alignments. This is why we are again going to represent $P(a,a_s \mid e,f,s^e,s^f)$ as a Factor Graph.

This Factor Graph will contain two instances of the monolink Factor Graph as subgraphs: one for a, the other for $a_s$ (see figure 4). More precisely, we create again a V-Node for every position on each side of the sentence pair. We will call these V-Nodes “Word V-Nodes”, to differentiate them from the new “P-set V-Nodes”. We will create a “P-set V-Node” $V^{ps.e}_i$ for every P-set in $s^e$, and a “P-set V-Node” $V^{ps.f}_j$ for every P-set in $s^f$. We inter-connect all of the Word V-Nodes so that we have a subgraph identical to the Factor Graph used in the monolink case. We also create a “monolink subgraph” for the P-set V-Nodes.

Figure 4: A part of a Factor Graph showing the connections between P-set V-Nodes and Word V-Nodes on the English side. The V-Nodes are connected to the French side through the 2 monolink subgraphs.

We now have 2 disconnected subgraphs. However, we need to add F-Nodes between them to enforce the compatibility constraint between $a_s$ and a. On the English side, for every P-set V-Node $V^{ps.e}_k$, and for every position i that the corresponding P-set covers, we add an F-Node $F^{comp.e}_{k,i}$ between $V^{ps.e}_k$ and $V^e_i$, associated with the function:

$$\phi^{comp.e}_{k,i}(l,j) = \begin{cases} 1 & \text{if } j \in s^f_l \text{ or } j = -1 \text{ or } l = -1 \\ 0 & \text{else} \end{cases}$$

We proceed symmetrically on the French side.

Messages inside each monolink subgraph can still be computed with the efficient procedure described in section 3.2. We do not have the space to describe in detail the messages sent between P-set V-Nodes and Word V-Nodes, but they are easily computed from the principles of the BP algorithm. Let $N_E = \sum_{ps \in s^e} |ps|$ and $N_F = \sum_{ps \in s^f} |ps|$. Then the complexity of one BP iteration will be $O(N_E \cdot N_F + |e| \cdot |f|)$.

An interesting aspect of this model is that it is flexible towards enforcing the respect of the structures by the alignment, since not every P-set needs to have an equivalent in the opposite sentence. (Gildea, 2003) has shown that too strict an enforcement can easily degrade alignment quality and that a good balance was difficult to find.

Another interesting aspect is the fact that we have a somewhat “parameterless” distortion model. There is only one real-valued parameter to control the distortion: α. And even this parameter is actually pre-set before any training on real data. The distortion is therefore totally controlled by the two sets of P-sets on each side of the sentence.

Finally, although we introduced the P-sets as being generated from a syntactic tree, they do not need to be. In particular, we found it interesting to use P-sets consisting of every pair of adjacent positions in a sentence. For example, with a sentence of length 5, we generate the P-sets {1,2},{2,3},{3,4} and {4,5}. The underlying intuition is that “adjacency” is often preserved in translation (we can see this as another case of “conservation of structure”). Practically, using P-sets of adjacent positions creates a distortion model where permutations of words are not penalized, but gaps are penalized.
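Generating the adjacency P-sets requires no parser at all. A minimal sketch, using 1-based positions to match the example in the text (the function name is ours):

```python
def adjacency_p_sets(length):
    """P-sets made of every pair of adjacent positions (1-based)."""
    return [{p, p + 1} for p in range(1, length)]

example = adjacency_p_sets(5)
```

For a sentence of length 5 this gives the four P-sets listed above; a sentence of length n yields n − 1 P-sets, each of size 2.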

4.2 Experimental Results

The evaluation setting is the same as in the previous section. We created syntactic trees for every sentence. For English, we used the Dan Bikel implementation of the Collins parser (Collins, 2003). For French, the SYGMART parser (Chauché, 1984) and for Japanese, the KNP parser (Kurohashi and Nagao, 1994).

The line SDM:Parsing (SDM standing for “Structure-based Distortion Monolink”) shows the results obtained by using P-sets from the trees produced by these parsers. The line SDM:Adjacency shows results obtained by using adjacent positions P-sets, as described at the end of the previous section (therefore, SDM:Adjacency does not use any parser).

Algorithm          AER     P       R
Monolink           0.197   0.881   0.731
SDM:Parsing        0.166   0.882   0.813
SDM:Adjacency      0.135   0.887   0.851
CLA                0.26    0.819   0.665
GIZA++ / Model 1   0.281   0.667   0.805
GIZA++ / Model 2   0.205   0.754   0.863
GIZA++ / Model 3   0.162   0.806   0.890
GIZA++ / Model 4   0.121   0.849   0.927

Table 1: Results for English/French (AER: alignment error rate; P: precision; R: recall).

Algorithm          F       P       R
Monolink           0.263   0.594   0.169
SDM:Parsing        0.291   0.662   0.186
SDM:Adjacency      0.279   0.636   0.179
GIZA++ / Model 1   0.263   0.555   0.172
GIZA++ / Model 2   0.268   0.566   0.176
GIZA++ / Model 3   0.267   0.589   0.173
GIZA++ / Model 4   0.299   0.658   0.193

Table 2: Results for Japanese/English (F: F-measure; P: precision; R: recall).

Several interesting observations can be made from the results. First, our structure-based distortion model did improve the results of the monolink model. There are, however, some surprising results. In particular, SDM:Adjacency produced surprisingly good results. It comes close to the results of IBM model 4 in both language pairs (for English/French, an AER of 0.135 against 0.121), while it actually uses exactly the same parameters as model 1. The fact that an assumption as simple as “allow permutations, penalize gaps” can produce results almost on par with the complicated distortion model of model 4 might be an indication that this model is unnecessarily complex for languages with similar structure. Another surprising result is that SDM:Adjacency gives better results for the English-French language pair than SDM:Parsing (AER of 0.135 against 0.166), whereas we expected the information provided by parsers to be more relevant for the distortion model. It might be an indication that the structures of English and French are so close that knowing them provides only moderate information for word reordering. The contrast with the English-Japanese pair is, in this respect, very interesting. For this language pair, SDM:Adjacency did provide a strong improvement, but significantly less so than SDM:Parsing (F-measure of 0.279 against 0.291, from 0.263 for the monolink model). This tends to show that for language pairs that have very different structures, the information provided by syntactic trees is much more relevant.
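As a reminder of how scores such as those of Tables 1 and 2 are usually computed, here is a minimal sketch. It assumes the standard definitions used in word alignment evaluation with "sure" and "possible" reference links; the function name `alignment_scores` and the toy links are illustrative assumptions, and this is not the evaluation script used for the tables. The F-measure is taken as the harmonic mean of precision and recall, which is consistent with Table 2 (for Monolink, P = 0.594 and R = 0.169 give F ≈ 0.263).

```python
def alignment_scores(predicted, sure, possible):
    """Return (precision, recall, f_measure, aer) for one set of predicted links.

    predicted, sure, possible: sets of (source_index, target_index) pairs.
    Sure links are treated as a subset of the possible links.
    """
    possible = possible | sure
    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")
    in_possible = len(predicted & possible)
    in_sure = len(predicted & sure)
    precision = in_possible / len(predicted)
    recall = in_sure / len(sure)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    # Alignment error rate: 1 - (|A & S| + |A & P|) / (|A| + |S|)
    aer = 1.0 - (in_sure + in_possible) / (len(predicted) + len(sure))
    return precision, recall, f_measure, aer


# Toy example (made-up links, not data from the paper).
sure = {(0, 0), (1, 1)}
possible = {(0, 0), (1, 1), (2, 1)}
predicted = {(0, 0), (2, 1), (3, 3)}
precision, recall, f_measure, aer = alignment_scores(predicted, sure, possible)
```

When the reference contains no distinction between sure and possible links, the two sets coincide and AER reduces to one minus the F-measure.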

5 Conclusion and Future Work

We summarize here what we think are the four most interesting contributions of this paper. The BP algorithm has been shown to be useful and flexible for training and decoding complex alignment models. An original, mostly non-parametric distortion model based on a simplified structure of the sentences has been described. Adjacency constraints have been shown to produce a very effective distortion model. Finally, the empirical differences in performance between the English-French and English-Japanese alignment tasks hint that considering different paradigms depending on the language pair could be an improvement on the “one-size-fits-all” approach generally used in statistical alignment and translation.

Several interesting improvements could also be made to the model we presented. In particular, one could use a more elaborate Ppc that would take into account the nature of the nodes (NP, VP, head, ...) to parametrize the P-set alignment probability, and would use the EM algorithm to learn those parameters.


References

M. Bayati, D. Shah, and M. Sharma. 2005. Maxi-

mum weight matching via max-product belief prop-

agation. Information Theory, 2005. ISIT 2005. Pro-

ceedings. International Symposium on, pages 1763–

1767.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

J. Chauché. 1984. Un outil multidimensionnel de l'analyse du discours. Coling84. Stanford University, California.

M. Collins. 2003. Head-driven statistical models for

natural language parsing. Computational Linguis-

tics.

Fabien Cromieres. 2006. Sub-sentential alignment us-

ing substring co-occurrence counts. In Proceedings

of ACL. The Association for Computer Linguistics.

U. Germann. 2001. Aligned Hansards of the 36th Parliament of Canada. http://www.isi.edu/naturallanguage/download/hansard/.

D. Gildea. 2003. Loosely tree-based alignment for

machine translation. Proceedings of ACL, 3.

T. Heskes. 2003. Stable fixed points of loopy belief propagation are minima of the Bethe free energy. Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference.

S. Kurohashi and M. Nagao. 1994. A syntactic analy-

sis method of long japanese sentences based on the

detection of conjunctive structures. Computational

Linguistics, 20(4):507–534.

I. D. Melamed. 2000. Models of translational equivalence among words. Computational Linguistics, 26(2):221–249.

I. Melamed. 2002. Empirical Methods for Exploiting

Parallel Texts. The MIT Press.

Rada Mihalcea and Ted Pedersen. 2003. An evaluation

exercise for word alignment. In Rada Mihalcea and

Ted Pedersen, editors, HLT-NAACL 2003 Workshop:

Building and Using Parallel Texts: Data Driven Ma-

chine Translation and Beyond, pages 1–10, Edmon-

ton, Alberta, Canada, May 31. Association for Com-

putational Linguistics.

Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of Uncertainty in AI, pages 467–475.

Franz Josef Och and Hermann Ney. 1999. Improved alignment models for statistical machine translation. University of Maryland, College Park, MD, pages 20–28.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent

Systems: Networks of Plausible Inference. Morgan

Kaufmann Publishers.

M. Utiyama and H. Isahara. 2003. Reliable measures

for aligning japanese-english news articles and sen-

tences. Proceedings of the 41st Annual Meeting on

Association for Computational Linguistics-Volume

1, pages 72–79.

Y. Weiss and W. T. Freeman. 2001. On the optimality

of solutions of the max-product belief propagation

algorithm in arbitrary graphs. IEEE Trans. on Infor-

mation Theory, 47(2):736–744.

K. Yamada and K. Knight. 2001. A syntax-based sta-

tistical translation model. Proceedings of ACL.

Jonathan S. Yedidia, William T. Freeman, and Yair

Weiss, 2003. Understanding belief propagation and

its generalizations, pages 239–269. Morgan Kauf-

mann Publishers Inc., San Francisco, CA, USA.
