Proceedings of the 12th Conference of the European Chapter of the ACL, pages 166–174,
Athens, Greece, 30 March – 3 April 2009. ©2009 Association for Computational Linguistics
An Alignment Algorithm using Belief Propagation and a Structure-Based Distortion Model
Fabien Cromières
Graduate School of Informatics
Abstract

In this paper, we first demonstrate the interest of the Loopy Belief Propagation algorithm to train and use a simple alignment model where the expected marginal values needed for an efficient EM-training are not easily computable. We then improve this model with a distortion model based on structure conservation.
1 Introduction and Related Work
Automatic word alignment of parallel corpora is an important step for data-oriented Machine Translation (whether Statistical or Example-Based) as well as for automatic lexicon acquisition. Many algorithms have been proposed in the last twenty years to tackle this problem. One of the most successful alignment procedures so far seems to be the so-called "IBM model 4" described in (Brown et al., 1993). It involves a very complex distortion model (here and in subsequent usages, "distortion" will be a generic term for the reordering of the words occurring in the translation process) with many parameters that make it very complex to train.
By contrast, the first alignment model we are going to propose is fairly simple. But this simplicity will allow us to experiment with different ideas for making better use of the sentence structures in the alignment process. This model (and even more so its subsequent variations), although simple, does not have a computationally efficient procedure for an exact EM-based training. We will, however, provide empirical evidence that Loopy Belief Propagation can give us a good approximation procedure.
Although we do not have the space to review the many alignment systems that have already been proposed, we will briefly refer to works that share some similarities with our approach. In particular, the first alignment model we will present has already been described in (Melamed, 2000). We differ, however, in the training and decoding procedures we propose. The problem of making use of syntactic trees for alignment (and translation), which is the object of our second alignment model, has already received some attention, notably by (Yamada and Knight, 2001) and (Gildea, 2003).
2 Factor Graphs and Belief Propagation
In this paper, we will make use of Factor Graphs in several places. A Factor Graph is a graphical model, much like a Bayesian Network. The three most common types of graphical models (Factor Graphs, Bayesian Networks and Markov Networks) share the same purpose: intuitively, they allow us to represent the dependencies among random variables; mathematically, they represent a factorization of the joint probability of these variables.
Formally, a Factor Graph is a bipartite graph with two kinds of nodes. On one side, the Variable Nodes (abbreviated as V-Nodes from here on), and on the other side, the Factor Nodes (abbreviated as F-Nodes). If a Factor Graph represents a given joint distribution, there will be one V-Node for every random variable in this joint distribution. Each F-Node is associated with a function of the V-Nodes to which it is connected (more precisely, a function of the values of the random variables associated with the V-Nodes, but for brevity, we will frequently mix the notions of V-Nodes, random variables and their values). The joint distribution is then the product of these functions (and of a normalizing constant). Therefore, each F-Node actually represents a factor in the factorization of the joint distribution.
As a short example, let us consider a problem classically used to introduce Bayesian Networks. We want to model the joint probability of the Weather (W) being sunny or rainy, the Sprinkler (S) being on or off, and the Lawn (L) being wet or dry. Figure 1 shows the dependencies of the variables represented with a Factor Graph and with a Bayesian Network. Mathematically, the Bayesian Network implies that the joint probability has the following factorization: P(W,L,S) = P(W) · P(S|W) · P(L|W,S). The Factor Graph implies there exist two functions ϕ1 and ϕ2, as well as a normalization constant C, such that we have the factorization: P(W,L,S) = C · ϕ2(W,S) · ϕ1(L,W,S). If we set C = 1, ϕ2(W,S) = P(W) · P(S|W) and ϕ1(L,W,S) = P(L|W,S), the Factor Graph expresses exactly the same factorization as the Bayesian Network.

Figure 1: A classical example
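To make this factorization concrete, here is a small numeric sketch in Python. The probability tables are illustrative values assumed for the example (they are not given in the text); the final check confirms that the Factor Graph factors ϕ1 and ϕ2, defined as above, yield the same joint distribution as the Bayesian Network.

```python
# Illustrative (assumed) probability tables for the Weather/Sprinkler/Lawn example.
import itertools

# P(W): weather is sunny or rainy
p_w = {"sunny": 0.7, "rainy": 0.3}
# P(S|W): sprinkler on/off given the weather, keyed (s, w)
p_s_given_w = {("on", "sunny"): 0.4, ("off", "sunny"): 0.6,
               ("on", "rainy"): 0.1, ("off", "rainy"): 0.9}
# P(L|W,S): lawn wet/dry given weather and sprinkler, keyed (l, w, s)
p_l_given_ws = {("wet", "sunny", "on"): 0.8,  ("dry", "sunny", "on"): 0.2,
                ("wet", "sunny", "off"): 0.1, ("dry", "sunny", "off"): 0.9,
                ("wet", "rainy", "on"): 0.95, ("dry", "rainy", "on"): 0.05,
                ("wet", "rainy", "off"): 0.9, ("dry", "rainy", "off"): 0.1}

# Factor Graph factors, defined as in the text:
# C = 1, phi2(W,S) = P(W) * P(S|W), phi1(L,W,S) = P(L|W,S)
def phi2(w, s):
    return p_w[w] * p_s_given_w[(s, w)]

def phi1(l, w, s):
    return p_l_given_ws[(l, w, s)]

# Both factorizations define the same joint probability for every assignment.
for w, s, l in itertools.product(["sunny", "rainy"], ["on", "off"], ["wet", "dry"]):
    joint_bn = p_w[w] * p_s_given_w[(s, w)] * p_l_given_ws[(l, w, s)]
    joint_fg = phi2(w, s) * phi1(l, w, s)
    assert abs(joint_bn - joint_fg) < 1e-12
```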
A reason to use Graphical Models is that we can use with them an algorithm called Belief Propagation (abbreviated as BP from here on) (Pearl, 1988). The BP algorithm comes in two flavors: sum-product BP and max-product BP. They respectively solve two problems that arise often (and are often intractable) in the use of a probabilistic model: "what are the marginal probabilities of each individual variable?" and "what is the set of values with the highest probability?". More precisely, the BP algorithm will give the correct answer to these questions if the graph representing the distribution is a forest. If that is not the case, the BP algorithm is not even guaranteed to converge. It has been shown, however, that the BP algorithm does converge in many practical cases, and that the results it produces are often surprisingly good approximations (see, for example, (Murphy et al., 1999) or (Weiss and Freeman, 2001)).
(Yedidia et al., 2003) gives a very good presentation of the sum-product BP algorithm, as well as some theoretical justifications for its success. We will just give an outline of the algorithm. The BP algorithm is a message-passing algorithm. Messages are sent during several iterations until convergence. At each iteration, each V-Node sends to its neighboring F-Nodes a message representing an estimation of its own marginal values.
message sent by the V-Node Vito the F-Node Fj
estimating the marginal probability of Vito take
the value x is :
(N(Vi) represent the set of the neighbours of Vi)
Also, every F-Node send a message to its neigh-
boring V-Nodes that represent its estimates of the
marginal values of the V-Node:
V k∈N(Fj)\V i
At any point, the belief of a V-Node $V_i$ is given by:

$$b_i(x) = \prod_{F_k \in N(V_i)} m_{F_k \to V_i}(x)$$

$b_i$ being normalized so that $\sum_x b_i(x) = 1$. The belief $b_i(x)$ then gives the marginal probability (or an approximation of it) of $V_i$ taking the value $x$.
An interesting point to note is that each message can be "scaled" (that is, multiplied by a constant) at any point without changing the result of the algorithm. This is very useful both for preventing overflow and underflow during computation, and sometimes also for simplifying the algorithm (we will use this in section 3.2). Also, damping schemes such as the ones proposed in (Murphy et al., 1999) or (Heskes, 2003) are useful for reducing the cases of non-convergence. As for max-product BP, it is best explained as "sum-product BP where each sum is replaced by a maximization".
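The message-passing scheme outlined above can be sketched on a minimal tree-shaped factor graph: two binary variables A and B, a pairwise factor f(A,B), and unary factors g(A) and h(B). The factor values below are illustrative assumptions. Since this graph is a tree, the beliefs computed by sum-product BP are the exact marginals, which the sketch verifies against brute-force enumeration.

```python
# Sum-product BP on a tiny tree-shaped factor graph (illustrative factor values).
import itertools

g = {0: 1.0, 1: 2.0}                      # unary factor on A
h = {0: 3.0, 1: 1.0}                      # unary factor on B
f = {(0, 0): 2.0, (0, 1): 1.0,
     (1, 0): 1.0, (1, 1): 4.0}            # pairwise factor on (A, B)

# V-to-F message: product of messages from the node's other neighbors.
# Here m_{A -> f}(a) reduces to the unary message g(a) (and likewise for B).
m_A_to_f = {a: g[a] for a in (0, 1)}
m_B_to_f = {b: h[b] for b in (0, 1)}

# F-to-V message: sum out the other variable, weighted by its incoming message.
# m_{f -> A}(a) = sum_b f(a, b) * m_{B -> f}(b)
m_f_to_A = {a: sum(f[(a, b)] * m_B_to_f[b] for b in (0, 1)) for a in (0, 1)}

# Belief of A: product of all incoming F-to-V messages, then normalize.
belief_A = {a: g[a] * m_f_to_A[a] for a in (0, 1)}
Z = sum(belief_A.values())
belief_A = {a: v / Z for a, v in belief_A.items()}

# Brute-force marginal of A for comparison (exact on any graph).
joint = {(a, b): g[a] * h[b] * f[(a, b)]
         for a, b in itertools.product((0, 1), repeat=2)}
Zj = sum(joint.values())
marg_A = {a: sum(joint[(a, b)] for b in (0, 1)) / Zj for a in (0, 1)}

for a in (0, 1):
    assert abs(belief_A[a] - marg_A[a]) < 1e-12
```

On a loopy graph the same updates would be iterated until (hopefully) convergence, with the scaling and damping remarks above applying.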
3 The monolink model
We are now going to present a simple alignment model that will serve both to illustrate the efficiency of the BP algorithm and as a basis for further improvement. As previously mentioned, this model is mostly identical to the one already proposed in (Melamed, 2000). The training and decoding procedures we propose are, however, different. Following the usual convention, we will designate the two sides of a sentence pair as French and English. A sentence pair will be noted (e, f). e_i represents the word at position i in e.
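As a minimal illustration of this notation (the example sentences and the 1-based indexing convention are our own assumptions, chosen to match the usual alignment-literature style):

```python
# A sentence pair (e, f); e_i denotes the word at position i of e.
# Example sentences are illustrative, not taken from the paper's data.
e = ["the", "house", "is", "blue"]    # English side
f = ["la", "maison", "est", "bleue"]  # French side

def word_at(sentence, i):
    """Return the word at (1-based) position i, i.e. e_i or f_i."""
    return sentence[i - 1]

assert word_at(e, 2) == "house"
```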
References

M. Bayati, D. Shah, and M. Sharma. 2005. Maximum weight matching via max-product belief propagation. In Proceedings of the International Symposium on Information Theory (ISIT 2005), pages 1763–
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
J. Chauché. 1984. Un outil multidimensionnel de l'analyse du discours. In Coling84, Stanford University.
M. Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics.
Fabien Cromieres. 2006. Sub-sentential alignment using substring co-occurrence counts. In Proceedings of ACL. Association for Computational Linguistics.
D. Gildea. 2003. Loosely tree-based alignment for
machine translation. Proceedings of ACL, 3.
T. Heskes. 2003. Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference.
S. Kurohashi and M. Nagao. 1994. A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Computational Linguistics.
I. D. Melamed. 2000. Models of translational equivalence among words. Computational Linguistics.
I. Melamed. 2002. Empirical Methods for Exploiting
Parallel Texts. The MIT Press.
Rada Mihalcea and Ted Pedersen. 2003. An evaluation exercise for word alignment. In Rada Mihalcea and Ted Pedersen, editors, HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pages 1–10, Edmonton, Alberta, Canada, May 31. Association for Computational Linguistics.
Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of Uncertainty in AI, pages 467–475.
Franz Josef Och and Hermann Ney. 1999. Improved alignment models for statistical machine translation. University of Maryland, College Park, MD.
J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
M. Utiyama and H. Isahara. 2003. Reliable measures for aligning Japanese-English news articles and sentences. Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 72–79.
Y. Weiss and W. T. Freeman. 2001. On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs. IEEE Trans. on Information Theory, 47(2):736–744.
K. Yamada and K. Knight. 2001. A syntax-based sta-
tistical translation model. Proceedings of ACL.
Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. 2003. Understanding belief propagation and its generalizations, pages 239–269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.