
Proceedings of the 12th Conference of the European Chapter of the ACL, pages 166–174,

Athens, Greece, 30 March – 3 April 2009. © 2009 Association for Computational Linguistics


An Alignment Algorithm using Belief Propagation and a Structure-Based

Distortion Model

Fabien Cromières

Graduate school of informatics

Kyoto University

Kyoto, Japan

fabien@nlp.kuee.kyoto-u.ac.jp

Sadao Kurohashi

Graduate school of informatics

Kyoto University

Kyoto, Japan

kuro@i.kyoto-u.ac.jp

Abstract

In this paper, we first demonstrate the value of the Loopy Belief Propagation algorithm for training and using a simple alignment model in which the expected marginal values needed for efficient EM training are not easily computable. We then improve this model with a distortion model based on structure conservation.

1 Introduction and Related Work

Automatic word alignment of parallel corpora is

an important step for data-oriented Machine trans-

lation (whether Statistical or Example-Based) as

well as for automatic lexicon acquisition. Many

algorithms have been proposed in the last twenty

years to tackle this problem. One of the most successful alignment procedures so far seems to be

the so-called “IBM model 4” described in (Brown

et al., 1993). It involves a very complex distor-

tion model (here and in subsequent usages “dis-

tortion” will be a generic term for the reordering

of the words occurring in the translation process)

with many parameters that make it very complex

to train.

By contrast, the first alignment model we are going to propose is fairly simple. But this simplicity will allow us to experiment with different ideas for making better use of the sentence structures in the alignment process. This model (and even more so its subsequent variations), although simple, does not have a computationally efficient procedure for exact EM-based training. However, we will give some theoretical and empirical evidence that Loopy Belief Propagation can give us a good approximation procedure.

Although we do not have the space to review the many alignment systems that have already been proposed, we will briefly refer to works that share some similarities with our approach. In particular, the first alignment model we will present has already been described in (Melamed, 2000). We differ, however, in the training and decoding procedures we propose. The problem of making use of syntactic trees for alignment (and translation), which is the object of our second alignment model, has already received some attention, notably by (Yamada and Knight, 2001) and (Gildea, 2003).

2 Factor Graphs and Belief Propagation

In this paper, we will make repeated use of Factor Graphs. A Factor Graph is a graphical model, much like a Bayesian Network. The three most common types of graphical models (Factor Graphs, Bayesian Networks and Markov Networks) share the same purpose: intuitively, they allow us to represent the dependencies among random variables; mathematically, they represent a factorization of the joint probability of these variables.

Formally, a factor graph is a bipartite graph with

2 kinds of nodes. On one side, the Variable Nodes

(abbreviated as V-Node from here on), and on the

other side, the Factor Nodes (abbreviated as F-

Node). If a Factor Graph represents a given joint

distribution, there will be one V-Node for every

random variable in this joint distribution. Each F-

Node is associated with a function of the V-Nodes

to which it is connected (more precisely, a func-

tion of the values of the random variables associ-

ated with the V-Nodes, but for brevity, we will fre-

quently mix the notions of V-Node, Random Vari-

ables and their values). The joint distribution is

then the product of these functions (and of a normalizing constant). Therefore, each F-Node actually represents a factor in the factorization of the joint distribution.

As a short example, let us consider a problem classically used to introduce Bayesian Networks. We want to model the joint probability of the Weather (W) being sunny or rainy, the Sprinkler (S) being on or off, and the Lawn (L) being wet or dry. Figure 1 shows the dependencies of the variables represented with a Factor Graph and with a Bayesian Network. Mathematically, the Bayesian Network implies that the joint probability has the following factorization: P(W,L,S) = P(W) · P(S|W) · P(L|W,S). The Factor Graph implies that there exist two functions ϕ1 and ϕ2 as well as a normalization constant C such that we have the factorization: P(W,L,S) = C · ϕ2(W,S) · ϕ1(L,W,S). If we set C = 1, ϕ2(W,S) = P(W) · P(S|W) and ϕ1(L,W,S) = P(L|W,S), the Factor Graph expresses exactly the same factorization as the Bayesian Network.

Figure 1: A classical example
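The equivalence of the two factorizations can be checked numerically. The following is only a minimal sketch: the probability tables are hypothetical values chosen for illustration (the paper gives none), and the function names are ours.

```python
import itertools

# Hypothetical probability tables, chosen only for illustration (not from the paper).
P_W = {"sunny": 0.7, "rainy": 0.3}
P_S_ON = {"sunny": 0.4, "rainy": 0.05}                      # P(S = on | W)
P_L_WET = {("sunny", "on"): 0.9, ("sunny", "off"): 0.05,
           ("rainy", "on"): 0.99, ("rainy", "off"): 0.9}    # P(L = wet | W, S)

def p_s_given_w(s, w):
    return P_S_ON[w] if s == "on" else 1.0 - P_S_ON[w]

def p_l_given_ws(l, w, s):
    return P_L_WET[(w, s)] if l == "wet" else 1.0 - P_L_WET[(w, s)]

def bayes_net_joint(w, l, s):
    # Bayesian Network factorization: P(W) * P(S|W) * P(L|W,S)
    return P_W[w] * p_s_given_w(s, w) * p_l_given_ws(l, w, s)

def phi2(w, s):
    # Factor attached to the V-Nodes W and S
    return P_W[w] * p_s_given_w(s, w)

def phi1(l, w, s):
    # Factor attached to the V-Nodes L, W and S
    return p_l_given_ws(l, w, s)

def factor_graph_joint(w, l, s, c=1.0):
    # Factor Graph factorization: C * phi2(W,S) * phi1(L,W,S)
    return c * phi2(w, s) * phi1(l, w, s)

# All 8 configurations (W, L, S)
CONFIGS = list(itertools.product(["sunny", "rainy"], ["wet", "dry"], ["on", "off"]))
```

With C = 1, both functions should agree on each of the 8 configurations, and the values should sum to 1 since the factors were built from proper conditional distributions.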

A reason to use Graphical Models is that we can use with them an algorithm called Belief Propagation (abbreviated as BP from here on) (Pearl, 1988). The BP algorithm comes in two flavors: sum-product BP and max-product BP. They respectively solve two problems that arise often (and are often intractable) in the use of a probabilistic model: “what are the marginal probabilities of each individual variable?” and “what is the set of values with the highest probability?”. More precisely, the BP algorithm will give the correct answer to these questions if the graph representing the distribution is a forest. If this is not the case, the BP algorithm is not even guaranteed to converge. It has been shown, however, that the BP algorithm does converge in many practical cases, and that the results it produces are often surprisingly good approximations (see, for example, (Murphy et al., 1999) or (Weiss and Freeman, 2001)).

(Yedidia et al., 2003) gives a very good presentation of the sum-product BP algorithm, as well as some theoretical justifications for its success. We will just give an outline of the algorithm. The BP algorithm is a message-passing algorithm. Messages are sent during several iterations until convergence. At each iteration, each V-Node sends to its neighboring F-Nodes a message representing an estimation of its own marginal values. The message sent by the V-Node $V_i$ to the F-Node $F_j$, estimating the marginal probability of $V_i$ taking the value $x$, is:

$$m_{V_i \to F_j}(x) = \prod_{F_k \in N(V_i) \setminus F_j} m_{F_k \to V_i}(x)$$

($N(V_i)$ represents the set of the neighbours of $V_i$.)

Also, every F-Node sends a message to its neighboring V-Nodes that represents its estimates of the marginal values of the V-Node:

$$m_{F_j \to V_i}(x) = \sum_{v_1,\ldots,v_n} \varphi_j(v_1,\ldots,x,\ldots,v_n) \cdot \prod_{V_k \in N(F_j) \setminus V_i} m_{V_k \to F_j}(v_k)$$

At any point, the belief of a V-Node $V_i$ is given by

$$b_i(x) = \prod_{F_k \in N(V_i)} m_{F_k \to V_i}(x),$$

$b_i$ being normalized so that $\sum_x b_i(x) = 1$. The belief $b_i(x)$ is expected to converge to the marginal probability (or an approximation of it) of $V_i$ taking the value $x$.

An interesting point to note is that each message can be “scaled” (that is, multiplied by a constant) by any factor at any point without changing the result of the algorithm. This is very useful both for preventing overflow and underflow during computation, and also sometimes for simplifying the algorithm (we will use this in section 3.2). Also, damping schemes such as the ones proposed in (Murphy et al., 1999) or (Heskes, 2003) are useful for reducing the number of cases of non-convergence.

As for the max-product BP, it is best explained as “sum-product BP where each sum is replaced by a maximization”.
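The message and belief updates of sum-product BP can be sketched in a few lines. This is an illustrative sketch rather than the implementation used in the paper: the data layout (`domains`, `factors`), the function names and the toy chain A - B - C with its factor values are all assumptions of ours. Because the toy graph is a tree, the beliefs should match the exact marginals obtained by brute-force enumeration.

```python
import itertools
from math import prod

def sum_product(domains, factors, n_iter=10):
    """domains: {variable: list of values}; factors: list of (variables, function).
    Returns the normalized beliefs {variable: {value: belief}}."""
    neighbours = {v: [k for k, (vs, _) in enumerate(factors) if v in vs] for v in domains}
    # All messages start at 1
    m_vf = {(v, k): {x: 1.0 for x in domains[v]}
            for k, (vs, _) in enumerate(factors) for v in vs}
    m_fv = {(k, v): {x: 1.0 for x in domains[v]}
            for k, (vs, _) in enumerate(factors) for v in vs}
    for _ in range(n_iter):
        # V-Node -> F-Node: product of the messages from the other neighbouring F-Nodes
        for (v, k) in m_vf:
            m_vf[(v, k)] = {x: prod(m_fv[(k2, v)][x] for k2 in neighbours[v] if k2 != k)
                            for x in domains[v]}
        # F-Node -> V-Node: sum over the other variables of factor * incoming messages
        for (k, v) in m_fv:
            vs, phi = factors[k]
            out = {x: 0.0 for x in domains[v]}
            for values in itertools.product(*(domains[u] for u in vs)):
                assign = dict(zip(vs, values))
                out[assign[v]] += phi(*values) * prod(
                    m_vf[(u, k)][assign[u]] for u in vs if u != v)
            m_fv[(k, v)] = out
    beliefs = {}
    for v in domains:
        b = {x: prod(m_fv[(k, v)][x] for k in neighbours[v]) for x in domains[v]}
        z = sum(b.values())
        beliefs[v] = {x: b[x] / z for x in b}
    return beliefs

def brute_force(domains, factors):
    """Exact marginals by enumerating every joint configuration."""
    names = list(domains)
    marg = {v: {x: 0.0 for x in domains[v]} for v in names}
    for values in itertools.product(*(domains[v] for v in names)):
        assign = dict(zip(names, values))
        weight = prod(phi(*(assign[u] for u in vs)) for vs, phi in factors)
        for v in names:
            marg[v][assign[v]] += weight
    return {v: {x: m[x] / sum(m.values()) for x in m} for v, m in marg.items()}

# Hypothetical toy chain A - B - C (a tree, so BP is exact here)
DOMAINS = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
FACTORS = [
    (("A",), lambda a: (0.6, 0.4)[a]),
    (("A", "B"), lambda a, b: 0.9 if a == b else 0.1),
    (("B", "C"), lambda b, c: 0.8 if b == c else 0.2),
]
```

Replacing the sum in the F-Node update by a maximization would give the max-product variant; no scaling or damping is applied in this sketch.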

3 The monolink model

We are now going to present a simple alignment model that will serve both to illustrate the efficiency of the BP algorithm and as a basis for further improvement. As previously mentioned, this model is mostly identical to one already proposed in (Melamed, 2000). The training and decoding procedures we propose are, however, different.

3.1 Description

Following the usual convention, we will designate the two sides of a sentence pair as French and English. A sentence pair will be noted (e,f); $e_i$ represents the word at position i in e.

In this first simple model, we will pay little at-

tention to the structure of the sentence pair we

want to align. Actually, each sentence will be re-

duced to a bag of words.

Intuitively, the two sides of a sentence pair express the same set of meanings. What we want to do in the alignment process is find the parts of the sentences that originate from the same meaning. We will suppose here that each meaning generates at most one word on each side, and we will name concept the pair of words generated by a meaning. It is possible for a meaning to be expressed in only one side of the sentence pair. In that case, we will have a “one-sided” concept consisting of only one word. In this view, a sentence pair appears “superficially” as a pair of bags of words, but the bags of words are themselves the visible part of an underlying bag of concepts.

We propose a simple generative model to de-

scribe the generation of a sentence pair (or rather,

its underlying bag of concepts):

• First, an integer n, representing the number

of concepts of the sentence is drawn from a

distribution Psize

• Then, n concepts are drawn independently

from a distribution Pconcept

The probability of a bag of concepts C is then:

$$P(C) = P_{size}(|C|) \prod_{(w_1,w_2) \in C} P_{concept}((w_1,w_2))$$

We can alternatively represent a bag of concepts as a pair of sentences (e,f), plus an alignment a. The alignment a is a set of links, a link being represented as a

pair of positions in each side of the sentence pair

(the special position -1 indicating the empty side

of a one-sided concept). This alternative represen-

tation has the advantage of better separating what

is observed (the sentence pair) and what is hidden

(the alignment). It is not a strictly equivalent rep-

resentation (it also contains information about the

word positions) but this will not be relevant here.

The joint distribution of e, f and a is then:

$$P(e,f,a) = P_{size}(|a|) \prod_{(i,j) \in a} P_{concept}(e_i,f_j) \qquad (1)$$
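As an illustration of equation (1), the joint probability of a small sentence pair and alignment can be computed directly. This is a sketch under stated assumptions: the concept probabilities in `P_CONCEPT` are hypothetical values, `None` stands for the empty side of a one-sided concept, and the uniform $P_{size}$ is passed as a plain constant.

```python
from math import prod

# Hypothetical concept probabilities (None marks the empty side of a one-sided concept).
P_CONCEPT = {
    ("the", "le"): 0.2,
    ("cat", "chat"): 0.1,
    ("cat", None): 0.05,
    (None, "chat"): 0.04,
}

E = ["the", "cat"]
F = ["le", "chat"]

def word(sentence, position):
    """Word at a position, or None for the special position -1."""
    return None if position == -1 else sentence[position]

def joint_probability(e, f, alignment, p_concept, p_size=1.0):
    """Equation (1): P_size(|a|) times the product of P_concept over the links of a.
    p_size is a constant because P_size is assumed uniform."""
    return p_size * prod(p_concept[(word(e, i), word(f, j))] for i, j in alignment)
```

For instance, `joint_probability(E, F, {(0, 0), (1, 1)}, P_CONCEPT)` multiplies the probabilities of the two linked concepts, while `{(0, 0), (1, -1), (-1, 1)}` leaves "cat" and "chat" as two one-sided concepts.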

This model only takes into consideration one-to-one alignments. Therefore, from now on, we will call this model “monolink”. Considering only one-to-one alignments can be seen as a limitation compared to other models that can often produce at least one-to-many alignments, but on the positive side, this allows the monolink model to be nicely symmetric. Additionally, as already argued in (Melamed, 2000), there are ways to determine the boundaries of some multi-word phrases (Melamed, 2002), allowing several words to be treated as a single token. Alternatively, a procedure similar to the one described in (Cromieres, 2006), where substrings instead of single words are aligned (thus considering every possible segmentation), could be used.

With the monolink model, we want to do two things: first, we want to find good values for the distributions $P_{size}$ and $P_{concept}$. Then we want to be able to find the most likely alignment a given the sentence pair (e,f).

We will consider $P_{size}$ to be a uniform distribution over the integers up to a sufficiently big value (since it is not possible to have a uniform distribution over an infinite discrete set). We will not need to determine the exact value of $P_{size}$. The assumption that it is uniform is actually enough to “remove” it from the computations that follow.

In order to determine the $P_{concept}$ distribution, we can use an EM procedure. It is easy to show that, at every iteration, the EM procedure will require setting $P_{concept}(w_e,w_f)$ proportional to the sum of the expected counts of the concept $(w_e,w_f)$ over the training corpus. This, in turn, means we have to compute the conditional expectation:

$$E((i,j) \in a \mid e,f) = \sum_{a \mid (i,j) \in a} P(a \mid e,f)$$

for every sentence pair (e,f). This computation requires a sum over all the possible alignments, whose number grows exponentially with the size of the sentences. As noted in (Melamed, 2000), it does not seem possible to compute this expectation efficiently with dynamic programming tricks like the ones used in the IBM models 1 and 2 (as a passing remark, these “tricks” can actually be seen as instances of the BP algorithm).
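To make the cost of this expectation concrete, it can be computed by brute force for very short sentences. The sketch below is only an illustration: `all_alignments`, `expected_links` and the toy distribution `toy_p_concept` are hypothetical names and values of ours, and the uniform $P_{size}$ is dropped because it cancels in the normalization.

```python
from itertools import product
from math import prod

def all_alignments(n_e, n_f):
    """Yield every one-to-one alignment as a frozenset of links (i, j);
    -1 marks the empty side of a one-sided concept."""
    for choice in product(range(-1, n_f), repeat=n_e):
        used = [j for j in choice if j != -1]
        if len(used) != len(set(used)):
            continue  # two English positions linked to the same French position
        links = {(i, j) for i, j in enumerate(choice)}
        links |= {(-1, j) for j in range(n_f) if j not in used}
        yield frozenset(links)

def expected_links(e, f, p_concept):
    """E((i,j) in a | e,f) for every pair of positions, by full enumeration."""
    weights = {}
    for a in all_alignments(len(e), len(f)):
        weights[a] = prod(p_concept(e[i] if i >= 0 else None,
                                    f[j] if j >= 0 else None) for i, j in a)
    z = sum(weights.values())
    return {(i, j): sum(w for a, w in weights.items() if (i, j) in a) / z
            for i in range(len(e)) for j in range(len(f))}

def toy_p_concept(we, wf):
    # Hypothetical values, for illustration only
    if we is None or wf is None:
        return 0.05
    return {("the", "le"): 0.2, ("cat", "chat"): 0.1}.get((we, wf), 0.01)
```

The enumeration is only usable on toy inputs, since the number of alignments explodes with sentence length; this is the computation that the BP procedure is meant to approximate.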

We propose to solve this problem by applying the BP algorithm to a Factor Graph representing the conditional distribution P(a|e,f). Given a sentence pair (e,f), we build this graph as follows.

We create a V-Node $V^e_i$ for every position i in the English sentence. This V-Node can take for value any position in the French sentence, or the special position −1 (meaning this position is not aligned, corresponding to a one-sided concept). We create symmetrically a V-Node $V^f_j$ for every position j in the French sentence.

We have to enforce a “reciprocal love” condition: if a V-Node at position i chooses a position j on the opposite side, the opposite V-Node at position j must choose the position i. This is done by adding an F-Node $F^{rec}_{i,j}$ between every pair of opposite nodes $V^e_i$ and $V^f_j$, associated with the function:

$$\varphi^{rec}_{i,j}(k,l) = \begin{cases} 1 & \text{if } (i = l \text{ and } j = k) \text{ or } (i \neq l \text{ and } j \neq k) \\ 0 & \text{else} \end{cases}$$

We then connect a “translation probability” F-Node $F^{tp.e}_i$ to every V-Node $V^e_i$, associated with the function:

$$\varphi^{tp.e}_i(j) = \begin{cases} \sqrt{P_{concept}(e_i,f_j)} & \text{if } j \neq -1 \\ P_{concept}(e_i,\emptyset) & \text{if } j = -1 \end{cases}$$

We add symmetrically on the French side F-Nodes $F^{tp.f}_j$ to the V-Nodes $V^f_j$.

It should be fairly easy to see that such a Factor Graph represents P(a|e,f). See figure 2 for an example.

Figure 2: A Factor Graph for the monolink model in the case of a 2-word English sentence and a 3-word French sentence ($F^{rec}_{i,j}$ nodes are noted Fri-j)

Using the sum-product BP, the beliefs of every V-Node $V^e_i$ to take the value j and of every node $V^f_j$ to take the value i should converge to the marginal expectation $E((i,j) \in a \mid e,f)$ (or rather, a hopefully good approximation of it).

We can also use max-product BP on the same graph to decode the most likely alignment. In the monolink case, decoding is actually an instance of the “assignment problem”, for which efficient algorithms are known. However, this will not be the case for the more complex model of the next section. Actually, (Bayati et al., 2005) has recently proved that max-product BP always gives the optimal solution to the assignment problem.
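The claim that the factors multiply out to (an unnormalized) P(a|e,f) can be checked on a toy configuration. The sketch below assumes that the translation factors use the square root of $P_{concept}$ for linked positions, so that a link counted once on each side contributes $P_{concept}(e_i,f_j)$ exactly once; the function names and all numeric values are hypothetical.

```python
from math import sqrt, prod

def phi_rec(i, j, k, l):
    """Reciprocity factor between V^e_i (taking value k) and V^f_j (taking value l)."""
    return 1.0 if (i == l and j == k) or (i != l and j != k) else 0.0

def factor_product(val_e, val_f, p_link, p_null_e, p_null_f):
    """Product of all factors for one assignment of the V-Nodes.
    val_e[i]: value of V^e_i (a French position or -1);
    val_f[j]: value of V^f_j (an English position or -1);
    p_link[i][j] stands for P_concept(e_i, f_j), p_null_e[i] for P_concept(e_i, empty),
    p_null_f[j] for P_concept(empty, f_j)."""
    n_e, n_f = len(val_e), len(val_f)
    rec = prod(phi_rec(i, j, val_e[i], val_f[j])
               for i in range(n_e) for j in range(n_f))
    tp_e = prod(p_null_e[i] if val_e[i] == -1 else sqrt(p_link[i][val_e[i]])
                for i in range(n_e))
    tp_f = prod(p_null_f[j] if val_f[j] == -1 else sqrt(p_link[val_f[j]][j])
                for j in range(n_f))
    return rec * tp_e * tp_f

# Hypothetical values for a 2-word English and 3-word French sentence
P_LINK = [[0.25, 0.16, 0.09],
          [0.04, 0.01, 0.36]]
P_NULL_E = [0.1, 0.2]
P_NULL_F = [0.3, 0.4, 0.5]
```

A consistent assignment such as `val_e = [1, -1]`, `val_f = [-1, 0, -1]` (English word 0 linked to French word 1, everything else unaligned) should give the product of the corresponding concept probabilities, while an assignment that breaks reciprocity gives 0.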

3.2 Efficient BP iterations

Applying the BP algorithm naively would lead us to a complexity of $O(|e|^2 \cdot |f|^2)$ per BP iteration. While this is not intractable, it could turn out to be a bit slow. Fortunately, we found it is possible to reduce this complexity to $O(|e| \cdot |f|)$ by making two useful observations.

Let us note $m^e_{ij}$ the resulting message from $V^e_i$ to $V^f_j$ (that is, the message sent by $F^{rec}_{i,j}$ to $V^f_j$ after it received its own message from $V^e_i$). $m^e_{ij}(x)$ has the same value for every x different from i:

$$m^e_{ij}(x \neq i) = \sum_{k \neq j} \frac{b^e_i(k)}{m^f_{ji}(k)}$$

We can divide all the messages $m^e_{ij}$ by $m^e_{ij}(x \neq i)$, so that $m^e_{ij}(x) = 1$ except if x = i; and the same can be done for the messages coming from the French side $m^f_{ji}$. It follows that, before this rescaling, $m^e_{ij}(x \neq i) = \sum_{k \neq j} b^e_i(k) = 1 - b^e_i(j)$ if the $b^e_i$ are kept normalized. Therefore, at every step, we only need to compute $m^e_{ij}(i)$, not $m^e_{ij}(x \neq i)$.

Hence the following algorithm ($m^e_{ij}(i)$ will be here abbreviated to $m^e_{ij}$ since it is the only value of the message we need to compute). We describe the process for computing the English-side messages and beliefs ($m^e_{ij}$ and $b^e_i$), but the process must also be done symmetrically for the French-side messages and beliefs ($m^f_{ji}$ and $b^f_j$) at every iteration.

0- Initialize all messages and beliefs with: $m^{e(0)}_{ij} = 1$ and $b^{e(0)}_i(j) = \varphi^{tp.e}_i(j)$

Until convergence (or for a set number of iterations):

1- Compute the messages $m^e_{ij}$: $m^{e(t+1)}_{ij} = b^{e(t)}_i(j) / ((1 - b^{e(t)}_i(j)) \cdot m^{f(t)}_{ji})$

2- Compute the beliefs $b^e_i(j)$: $b^{e(t+1)}_i(j) = \varphi^{tp.e}_i(j) \cdot m^{f(t+1)}_{ji}$

3- And then normalize the $b^{e(t+1)}_i(j)$ so that $\sum_j b^{e(t+1)}_i(j) = 1$.

A similar algorithm can be found for the max-product BP.
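The iteration described above (steps 0 to 3) can be sketched in pure Python as follows. This is an illustrative reading of the steps, not the authors' Python/C implementation: it assumes square-root translation factors for linked positions, normalizes the initial beliefs so that 1 − b stays positive, stores the null value (−1 in the text) in the last column, guards divisions with a small `eps`, and applies no damping. All names are ours.

```python
import math

def normalize(row):
    total = sum(row)
    return [v / total for v in row]

def monolink_bp(p_link, p_null_e, p_null_f, n_iter=10, eps=1e-12):
    """p_link[i][j] stands for P_concept(e_i, f_j); p_null_e[i] and p_null_f[j]
    for the one-sided concepts. Returns (b_e, b_f): b_e[i][j] is the belief that
    English position i links to French position j, b_e[i][-1] that it is unaligned."""
    n_e, n_f = len(p_null_e), len(p_null_f)
    # Translation factors; the last entry of each row is the null value
    phi_e = [[math.sqrt(p_link[i][j]) for j in range(n_f)] + [p_null_e[i]]
             for i in range(n_e)]
    phi_f = [[math.sqrt(p_link[i][j]) for i in range(n_e)] + [p_null_f[j]]
             for j in range(n_f)]
    # Step 0: messages at 1, beliefs at the (normalized) translation factors
    b_e = [normalize(r) for r in phi_e]
    b_f = [normalize(r) for r in phi_f]
    m_e = [[1.0] * n_f for _ in range(n_e)]   # m_e[i][j]
    m_f = [[1.0] * n_e for _ in range(n_f)]   # m_f[j][i]
    for _ in range(n_iter):
        # Step 1: one scalar per (i, j) pair on each side
        new_m_e = [[b_e[i][j] / (max(1.0 - b_e[i][j], eps) * max(m_f[j][i], eps))
                    for j in range(n_f)] for i in range(n_e)]
        new_m_f = [[b_f[j][i] / (max(1.0 - b_f[j][i], eps) * max(m_e[i][j], eps))
                    for i in range(n_e)] for j in range(n_f)]
        m_e, m_f = new_m_e, new_m_f
        # Steps 2 and 3: beliefs from the opposite side's new messages, then normalize
        b_e = [normalize([phi_e[i][j] * m_f[j][i] for j in range(n_f)] + [phi_e[i][n_f]])
               for i in range(n_e)]
        b_f = [normalize([phi_f[j][i] * m_e[i][j] for i in range(n_e)] + [phi_f[j][n_e]])
               for j in range(n_f)]
    return b_e, b_f
```

Each iteration touches every (i, j) pair a constant number of times, which is where the O(|e| · |f|) cost per iteration comes from. On a one-word sentence pair the graph is a tree, so the belief should equal the exact posterior P(e,f) / (P(e,f) + P(e,∅) · P(∅,f)); on longer sentences the graph has loops and the beliefs are only approximations.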

3.3 Experimental Results

We evaluated the monolink algorithm with two language pairs: French-English and Japanese-English.


For the English-French pair, we used 200,000 sentence pairs extracted from the Hansard corpus (Germann, 2001). Evaluation was done with the scripts and gold standard provided during the HLT-NAACL 2003 workshop¹ (Mihalcea and Pedersen, 2003). Null links are not considered for the evaluation.

For the English-Japanese evaluation, we used

100,000 sentence pairs extracted from a corpus of

English/Japanese news. We used 1000 sentence pairs extracted from pre-aligned data (Utiyama and Isahara, 2003) as a gold standard. We segmented

all the Japanese data with the automatic segmenter

Juman (Kurohashi and Nagao, 1994). There is

a caveat to this evaluation, though. The reason

is that the segmentation and alignment scheme

used in our gold standard is not very fine-grained:

mostly, big chunks of the Japanese sentence cover-

ing several words are aligned to big chunks of the

English sentence. For the evaluation, we had to

consider that when two chunks are aligned, there

is a link between every pair of words belonging to

each chunk. A consequence is that our gold standard will contain a lot more links than it should, some of them irrelevant. This means that the recall will be largely underestimated and the precision will be overestimated.

For the BP/EM training, we used 10 BP iterations for each sentence, and 5 global EM iterations. By using a damping scheme for the BP algorithm, we never observed a problem of non-convergence (such problems do commonly appear without damping). With our Python/C implementation, training took approximately 1 hour. But with a better implementation, it should be possible to reduce this time to something comparable to the model 1 training time with GIZA++.

For the decoding, although the max-product BP

should be the algorithm of choice, we found we

could obtain slightly better results (by between 1

and 2 AER points) by using the sum-product BP,

choosing links with high beliefs, and cutting-off

links with very small beliefs (the cut-off was cho-

sen roughly by manually looking at a few aligned

sentences not used in the evaluation, so as not to

create too much bias).

Due to space constraints, all of the results of this

section and the next one are summarized in two

tables (tables 1 and 2) at the end of this paper.

In order to compare the efficiency of the BP training procedure to a simpler one, we reimplemented the Competitive Link Algorithm (abbreviated as CLA from here on) that is used in (Melamed, 2000) to train an identical model. This algorithm starts with some relatively good estimates found by computing a correlation score (we used the G-test score) between words based on their number of co-occurrences. A greedy Viterbi training is then applied to improve this initial guess. In contrast, our BP/EM training does not need to compute correlation scores and starts the training with uniform parameters.

We only evaluated the CLA on the French/English pair. The first iteration of CLA did improve alignment quality, but subsequent ones decreased it. The reported score for CLA is therefore the one obtained during the best iteration. The BP/EM training demonstrates a clear superiority over the CLA here, since it produces almost 7 points of AER improvement over CLA.

In order to have a comparison with a well-known and state-of-the-art system, we also used the GIZA++ program (Och and Ney, 1999) to align the same data. We tried alignments in both directions and provide the results for the direction that gave the best results. The settings used were the ones used by the training scripts of the Moses system², which we assumed to be fairly optimal. We tried alignment with the default Moses settings (5 iterations of model 1, 5 of HMM, 3 of model 3, 3 of model 4) and also tried with an increased number of iterations for each model (up to 10 per model).

We are aware that the score we obtained for model 4 in English-French is slightly worse than what is usually reported for a similar size of training data. At the time of this paper, we did not have the time to investigate whether it is a problem of non-optimal settings in GIZA++, or whether the training data we used was “difficult to learn from” (it is common to extract sentences of moderate length for the training data but we didn’t, and some sentences of our training corpus do have more than 200 words; also, we did not use any kind of preprocessing). In any case, GIZA++ is compared here with an algorithm trained on the same data and with no possibilities for fine-tuning; therefore the comparison should be fair.

¹ http://www.cs.unt.edu/~rada/wpt/
² http://www.statmt.org/moses/

The comparison shows that, performance-wise, the monolink algorithm is between model 2 and model 3 for English/French. Considering