A Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Statistical MT.
Andreas Zollmann∗ and Ashish Venugopal∗ and Franz Och and Jay Ponte
Google Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94303, USA
{zollmann,ashishv}@cs.cmu.edu
{och,ponte}@google.com
Abstract
Probabilistic synchronous context-free grammar (PSCFG) translation models
define weighted transduction rules that represent translation and
reordering operations via nonterminal symbols. In this work, we
investigate the source of the improvements in translation quality reported
when using two PSCFG translation models (hierarchical and
syntax-augmented), when extending a state-of-the-art phrase-based baseline
that serves as the lexical support for both PSCFG models. We isolate the
impact on translation quality for several important design decisions in
each model. We perform this comparison on three NIST language translation
tasks; Chinese-to-English, Arabic-to-English and Urdu-to-English, each
representing unique challenges.
1 Introduction
Probabilistic synchronous context-free grammar
(PSCFG) models define weighted transduction
rules that are automatically learned from parallel
training data. As in monolingual parsing, such
rules make use of nonterminal categories to gener-
alize beyond the lexical level. In the example be-
low, the French (source language) words “ne” and
“pas” are translated into the English (target lan-
guage) word “not”, performing reordering in the
context of a nonterminal of type “VB” (verb).
VP → ne VB pas, do not VB : w1
VB → veux, want : w2.
∗ Work done during internships at Google Inc.
© 2008. Licensed under the Creative Commons
Attribution-Noncommercial-Share Alike 3.0 Unported
license (http://creativecommons.org/licenses/by-nc-sa/3.0/).
Some rights reserved.
As with probabilistic context-free grammars, each
rule has a left-hand-side nonterminal (VP and VB
in the two rules above), which constrains the rule’s
usage in further composition, and is assigned a
weight w, estimating the quality of the rule based
on some underlying statistical model. Translation with a PSCFG is thus a
process of composing such rules to parse the source language while
synchronously generating target language output.
PSCFG approaches such as Chiang (2005) and Zollmann and Venugopal (2006)
typically begin with a phrase-based model as the foundation for the PSCFG
rules described above. Starting with bilingual phrase pairs extracted from
automatically aligned parallel text (Och and Ney, 2004; Koehn et al.,
2003), these PSCFG approaches augment each contiguous (in source and
target words) phrase pair with a left-hand-side symbol (like the VP in the
example above), and perform a generalization procedure to form rules that
include nonterminal symbols. We can thus view PSCFG methods as an attempt
to generalize beyond the purely lexical knowledge represented in
phrase-based models, allowing reordering decisions to be explicitly
encoded in each rule. It is important to note that while phrase-based
models cannot explicitly represent context-sensitive reordering effects
like those in the example above, in practice, phrase-based models often
have the potential to generate the same target translation output by
translating source phrases out of order, and allowing empty translations
for some source words. Apart from one or more language models scoring
these reordering alternatives, state-of-the-art phrase-based systems are
also equipped with a lexicalized distortion model accounting for
reordering behavior more directly. While previous work demonstrates
impressive improvements of PSCFG over phrase-based
approaches for large Chinese-to-English data sce-
narios (Chiang, 2005; Chiang, 2007; Marcu et al.,
2006; DeNeefe et al., 2007), these phrase-based
baseline systems were constrained to distortion
limits of four (Chiang, 2005) and seven (Chiang,
2007; Marcu et al., 2006; DeNeefe et al., 2007),
respectively, while the PSCFG systems were able
to operate within an implicit reordering window of
10 and higher.
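The rule composition described at the start of this section can be made concrete with a small sketch. It applies the two example rules (VP → ne VB pas, do not VB and VB → veux, want) to the source phrase "ne veux pas". The `Rule` class, the function names, the numeric weights, and the choice to multiply rule weights are all illustrative assumptions, and the matcher handles at most one nonterminal per rule; it is not the decoder used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    lhs: str        # left-hand-side nonterminal, e.g. "VP"
    source: tuple   # source-side symbols (words and nonterminals)
    target: tuple   # target-side symbols (words and nonterminals)
    weight: float   # rule weight; the values below are made up


# The two rules from the French-English example.
RULES = [
    Rule("VP", ("ne", "VB", "pas"), ("do", "not", "VB"), 0.5),
    Rule("VB", ("veux",), ("want",), 0.8),
]
NONTERMINALS = {rule.lhs for rule in RULES}


def translate(symbol, words):
    """Return (target_words, weight) for deriving `words` from `symbol`, or None."""
    for rule in RULES:
        if rule.lhs == symbol:
            result = apply_rule(rule, words)
            if result is not None:
                return result
    return None


def apply_rule(rule, words):
    """Match a rule containing at most one nonterminal against the source words."""
    nts = [s for s in rule.source if s in NONTERMINALS]
    if not nts:
        # Purely lexical rule: the source side must match exactly.
        return (list(rule.target), rule.weight) if tuple(words) == rule.source else None
    pos = rule.source.index(nts[0])
    prefix, suffix = rule.source[:pos], rule.source[pos + 1:]
    end = len(words) - len(suffix)
    if end <= len(prefix):
        return None  # no source words left to fill the nonterminal gap
    if tuple(words[:len(prefix)]) != prefix or tuple(words[end:]) != suffix:
        return None
    sub = translate(nts[0], words[len(prefix):end])
    if sub is None:
        return None
    sub_target, sub_weight = sub
    target = []
    for sym in rule.target:
        target.extend(sub_target if sym == nts[0] else [sym])
    # Assumption: weights of composed rules are multiplied.
    return target, rule.weight * sub_weight


result = translate("VP", ["ne", "veux", "pas"])
print(result)
```

The source side is parsed top-down while the target side is assembled in the order given by the rule, which is how the reordering around the verb is encoded in the VP rule itself.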
In this work, we evaluate the impact of the ex-
tensions suggested by the PSCFG methods above,
looking to answer the following questions. Do the
relative improvements of PSCFG methods persist
when the phrase-based approach is allowed comparable
long-distance reordering, and when the n-gram
language model is strong enough to effectively
select from these reordered alternatives? Do
these improvements persist across language pairs
that exhibit significantly different reordering effects,
and how does resource availability affect relative
performance? In order to answer these questions
we extend our PSCFG decoder to efficiently han-
dle the high order LMs typically applied in state-
of-the-art phrase based translation systems. We
evaluate the phrase-based system for a range of re-
ordering limits, up to those matching the PSCFG
approaches, isolating the impact of the nontermi-
nal based approach to reordering. Results are pre-
sented in multiple language pairs and data size
scenarios, highlighting the limited impact of the
PSCFG model in certain conditions.
2 Summary of approaches
Given a source language sentence f, statistical ma-
chine translation defines the translation task as se-
lecting the most likely target translation e under a
model P(e|f), i.e.:
$$\hat{e}(f) = \arg\max_{e} P(e \mid f) = \arg\max_{e} \sum_{i=1}^{m} h_i(e,f)\,\lambda_i$$
where the argmax operation denotes a search
through a structured space of translation outputs
in the target language, $h_i(e,f)$ are bilingual features
of e and f and monolingual features of e,
and weights $\lambda_i$ are trained discriminatively to maximize
translation quality (based on automatic metrics)
on held out data (Och, 2003).
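As a minimal sketch of the decision rule above, the following scores an explicit list of candidate translations with a weighted sum of feature values and picks the best one. The feature names, values, and weights are invented for illustration, the features are assumed to be log-domain scores, and a real decoder searches a structured space rather than enumerating candidates.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values h_i(e, f) with weights lambda_i."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Argmax over an explicit candidate list (a stand-in for decoder search)."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Hypothetical weights and feature values, for illustration only.
weights = {"tm": 1.0, "lm": 0.5, "word_penalty": -0.2}
candidates = [
    {"text": "do not want", "features": {"tm": -1.2, "lm": -2.0, "word_penalty": 3}},
    {"text": "not want", "features": {"tm": -1.0, "lm": -3.5, "word_penalty": 2}},
]

best = best_translation(candidates, weights)
print(best["text"])
```

In practice the weights would come from discriminative training on held out data, as in Och (2003), rather than being set by hand.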
Both phrase-based and PSCFG approaches
make independence assumptions to structure this
search space and thus most features hi(e,f) are
designed to be local to each phrase pair or rule.
A notable exception is the n-gram language model
(LM), which evaluates the likelihood of the sequential
target words output. Phrase-based systems
also typically allow source segments to be
translated out of order, and include distortion models
to evaluate such operations. These features
suggest the efficient dynamic programming algorithms
for phrase-based systems described in
Koehn et al. (2004).
We now discuss the translation models compared
in this work.
2.1 Phrase Based MT
Phrase-based methods identify contiguous bilingual phrase pairs based on
automatically generated word alignments (Och et al., 1999). Phrase pairs
are extracted up to a fixed maximum length, since very long phrases rarely
have a tangible impact during translation (Koehn et al., 2003). During
decoding, extracted phrase pairs are reordered to generate fluent target
output. Reordered translation output is evaluated under a distortion model
and corroborated by one or more n-gram language models. These models do
not have an explicit representation of how to reorder phrases. To avoid
search space explosion, most systems place a limit on the distance that
source segments can be moved within the source sentence. This limit, along
with the phrase length limit (where local reorderings are implicit in the
phrase), determines the scope of reordering represented in a phrase-based
system.
All experiments in this work limit phrase pairs to have source and target
length of at most 12, and either source length or target length of at most
6 (higher limits did not result in additional improvements). In our
experiments, phrases are extracted by the method described in Och and Ney
(2004), and reordering during decoding is modeled with the lexicalized
distortion model from Zens and Ney (2006). The reordering limit for the
phrase-based system (for each language pair) is increased until no
additional improvements result.
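The two limits described in this section can be sketched as simple predicates. The phrase length limits (12 and 6) are taken from the text; the function names are hypothetical, and the jump-distance definition used for the reordering limit is a common convention assumed here, since the text does not spell out how the distance is measured.

```python
MAX_LEN = 12    # limit on both source and target length (from the text)
MAX_SHORT = 6   # at least one side must be no longer than this (from the text)


def keep_phrase_pair(source_words, target_words):
    """True if a phrase pair satisfies the length limits used in the experiments."""
    s, t = len(source_words), len(target_words)
    return s <= MAX_LEN and t <= MAX_LEN and (s <= MAX_SHORT or t <= MAX_SHORT)


def within_reordering_limit(prev_end, next_start, limit):
    """Assumed definition: distance between the position just after the last
    translated source phrase (prev_end) and the first word of the next source
    phrase (next_start). Monotone translation gives distance 0."""
    return abs(next_start - prev_end) <= limit


print(keep_phrase_pair(["f"] * 12, ["e"] * 6))   # one side is short enough
print(keep_phrase_pair(["f"] * 7, ["e"] * 7))    # both sides longer than 6
print(within_reordering_limit(3, 8, limit=4))    # jump of 5 exceeds a limit of 4
```

Raising `limit` widens the set of source phrases the decoder may translate next, which is the "reo" setting varied in the experiments.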
2.2 Hierarchical MT
Building upon the success of phrase-based methods, Chiang (2005) presents
a PSCFG model of translation that uses the bilingual phrase pairs of
phrase-based MT as a starting point to learn hierarchical rules. For each
training sentence pair's set of extracted phrase pairs, the set of induced
PSCFG rules can be generated as follows: First, each phrase pair is
assigned a generic X-nonterminal as left-hand-side, making it an initial
rule. We can now recursively generalize each already obtained rule
(initial or including nonterminals)
$$N \to f_1 \ldots f_m \,/\, e_1 \ldots e_n$$
for which there is an initial rule
$$M \to f_i \ldots f_u \,/\, e_j \ldots e_v$$
where $1 \le i < u \le m$ and $1 \le j < v \le n$, to obtain a new rule
$$N \to f_1^{i-1} X_k f_{u+1}^{m} \,/\, e_1^{j-1} X_k e_{v+1}^{n}$$
where e.g. $f_1^{i-1}$ is short-hand for $f_1 \ldots f_{i-1}$, and where
k is an index for the nonterminal X that indicates the one-to-one
correspondence between the new X tokens on the two sides (it is not in the
space of word indices like i, j, u, v, m, n). The recursive form of this
generalization operation allows the generation of rules with multiple
nonterminal symbols.
Performing translation with PSCFG grammars amounts to straightforward
generalizations of chart parsing algorithms for PCFG grammars.
Adaptations to the algorithms in the presence of n-gram LMs are discussed
in (Chiang, 2007; Venugopal et al., 2007; Huang and Chiang, 2007).
Extracting hierarchical rules in this fashion can generate a large number
of rules and could introduce significant challenges for search. Chiang
(2005) places restrictions on the extracted rules which we adhere to as
well. We disallow rules with more than two nonterminal pairs, rules with
adjacent source-side nonterminals, and limit each rule's source side
length (i.e., number of source terminals and nonterminals) to 6. We
extract rules from initial phrases of maximal length 12 (exactly matching
the phrase-based system).¹ Higher length limits or allowing more than two
nonterminals per rule do not yield further improvements for systems
presented here.
During decoding, we allow application of all rules of the grammar for
chart items spanning up to 15 source words (for sentences up to length
20), or 12 source words (for longer sentences), respectively. When that
limit is reached, only a special glue rule allowing monotonic
concatenation of hypotheses is allowed. (The same holds for the Syntax
Augmented system.)
¹ Chiang (2005) uses source length limit 5 and initial phrase length
limit 10.
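The generalization step and the rule restrictions of this section can be sketched as follows. This is a simplified illustration: the function names are hypothetical, nonterminals are written as strings "X1", "X2", and the sub-phrase is located by plain sequence search, whereas actual rule extraction works from word-alignment spans.

```python
def find(seq, sub):
    """Index of the first occurrence of list `sub` inside list `seq`, or None."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return None


def generalize(rule_src, rule_tgt, sub_src, sub_tgt, k):
    """Replace an initial phrase pair inside a rule by the co-indexed nonterminal X_k."""
    i, j = find(rule_src, sub_src), find(rule_tgt, sub_tgt)
    if i is None or j is None:
        return None
    nt = f"X{k}"
    new_src = rule_src[:i] + [nt] + rule_src[i + len(sub_src):]
    new_tgt = rule_tgt[:j] + [nt] + rule_tgt[j + len(sub_tgt):]
    return new_src, new_tgt


def is_nonterminal(symbol):
    return symbol.startswith("X") and symbol[1:].isdigit()


def allowed(src, max_nonterminals=2, max_source_len=6):
    """Restrictions from the text: at most two nonterminals, no adjacent
    source-side nonterminals, source side of at most 6 symbols."""
    flags = [is_nonterminal(s) for s in src]
    if sum(flags) > max_nonterminals:
        return False
    if any(a and b for a, b in zip(flags, flags[1:])):
        return False
    return len(src) <= max_source_len


new_rule = generalize(["f1", "f2", "f3", "f4"], ["e1", "e2", "e3"],
                      ["f2", "f3"], ["e1", "e2"], k=1)
print(new_rule, allowed(new_rule[0]))
```

Applying `generalize` again to the resulting rule with a different index k yields rules with two nonterminals, which is the recursive case described above.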
2.3 Syntax Augmented MT
Syntax Augmented MT (SAMT) (Zollmann and Venugopal, 2006) extends Chiang
(2005) to include nonterminal symbols from target language phrase
structure parse trees. Each target sentence in the training corpus is
parsed with a stochastic parser (we use Charniak (2000)) to produce
constituent labels for target spans. Phrases (extracted from a particular
sentence pair) are assigned left-hand-side nonterminal symbols based on
the target side parse tree constituent spans. Phrases whose target side
corresponds to a constituent span are assigned that constituent's label as
their left-hand-side nonterminal. If the target span of the phrase does
not match a constituent in the parse tree, heuristics are used to assign
categories that correspond to partial rewriting of the tree. These
heuristics first consider concatenation operations, forming categories
such as “NP+V”, and then resort to CCG (Steedman, 1999) style “slash”
categories such as “NP/NN” or “DT\NP”. In the spirit of isolating the
additional benefit of syntactic categories, the SAMT system used here also
generates a purely hierarchical (single generic nonterminal symbol)
variant for each syntax-augmented rule. This allows the decoder to choose
between translation derivations that use syntactic labels and those that
do not. Additional features introduced in SAMT rules are: a relative
frequency estimated probability of the rule given its left-hand-side
nonterminal, and a binary feature for the purely hierarchical variants.
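The label assignment heuristics of this section can be sketched as a lookup over constituent spans. This is a simplified illustration under stated assumptions: spans are (start, end) pairs with exclusive end, "A/B" is read as an A missing a B to its right and "B\A" as an A missing a B to its left, and the fallback label "X" is a placeholder chosen here rather than something the text specifies.

```python
def assign_label(span, constituents, default="X"):
    """Assign a left-hand-side label to a target span.

    `constituents` maps (start, end) spans of the parse tree to labels.
    """
    start, end = span
    # 1. The span is exactly a constituent.
    if span in constituents:
        return constituents[span]
    # 2. Concatenation of two adjacent constituents: "A+B".
    for mid in range(start + 1, end):
        left = constituents.get((start, mid))
        right = constituents.get((mid, end))
        if left and right:
            return left + "+" + right
    # 3. "A/B": a constituent A starting here that lacks a B on its right.
    for (s, e), label in constituents.items():
        if s == start and e > end and (end, e) in constituents:
            return label + "/" + constituents[(end, e)]
    # 4. "B\A": a constituent A ending here that lacks a B on its left.
    for (s, e), label in constituents.items():
        if e == end and s < start and (s, start) in constituents:
            return constituents[(s, start)] + "\\" + label
    return default


# Toy parse of a four-word target sentence: [NP [DT w0] w1 [NN w2]] [V w3]
constituents = {(0, 1): "DT", (2, 3): "NN", (0, 3): "NP", (3, 4): "V"}
for span in [(0, 3), (0, 4), (0, 2), (1, 3), (1, 2)]:
    print(span, assign_label(span, constituents))
```

The order of the checks mirrors the text: exact constituents first, then concatenation, then slash categories.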
3 Large N-Gram LMs for PSCFG decoding
Brants et al. (2007) demonstrate the value of large
high-order LMs within a phrase-based system. Re-
cent results with PSCFG based methods have typ-
ically relied on significantly smaller LMs, as a
result of runtime complexity within the decoder.
In this work, we started with the publicly avail-
able PSCFG decoder described in Venugopal et al.
(2007) and extended it to efficiently use distributed
higher-order LMs under the Cube-Pruning decod-
ing method from Chiang (2007). These extensions
allow us to verify that the benefits of PSCFG mod-
els persist in the presence of large, powerful n-
gram LMs.
3.1 Asynchronous N-Gram LMs
As described in Brants et al. (2007), using large distributed LMs requires
the decoder to perform asynchronous LM requests. Scoring n-grams under
this distributed LM involves queuing a set of n-gram probability requests,
then distributing these requests in batches to dedicated LM servers, and
waiting for the resulting probabilities, before accessing them to score
chart items. In order to reduce the number of such roundtrip requests in
the chart parsing decoding algorithm used for PSCFGs, we batch all n-gram
requests for each cell.
This single batched request per cell paradigm requires some adaptation of
the Cube-Pruning algorithm. Cube-Pruning is an early pruning technique
used to limit the generation of low quality chart items during decoding.
The algorithm calls for the generation of N-Best chart items at each cell
(across all rules spanning that cell). The n-gram LM is used to score each
generated item, driving the N-Best search algorithm of Huang and Chiang
(2005) toward items that score well from a translation model and language
model perspective. In order to accommodate batched asynchronous LM
requests, we queue n-gram requests for the top N*K chart items, ranked
without the n-gram LM, where K=100. We then generate the top N chart items
with the n-gram LM once these probabilities are available. Chart items
that Cube-Pruning attempts to generate but that would require LM
probabilities of n-grams not in the queued set are discarded. While
discarding these items could lead to search errors, in practice they tend
to be poorly performing items that do not affect final translation
quality.
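The batching scheme of this section can be sketched as follows. This is a heavily simplified illustration, not the decoder described here: `BatchedLM` is a hypothetical in-memory stand-in for a remote LM server, candidates are fully enumerated rather than generated lazily as in Cube-Pruning, scores are assumed to be additive log-domain values, and a small K is used in place of K=100.

```python
class BatchedLM:
    """Hypothetical stand-in for a distributed LM; counts round trips."""

    def __init__(self, table, default=-10.0):
        self.table = table
        self.default = default
        self.round_trips = 0

    def fetch(self, ngrams):
        self.round_trips += 1
        return {g: self.table.get(g, self.default) for g in ngrams}


def ngrams_of(words, n):
    """All n-grams (shorter at the start) needed to score a word sequence."""
    return [tuple(words[max(0, i - n + 1):i + 1]) for i in range(len(words))]


def score_cell(candidates, lm, n, top_n, k):
    """candidates: list of (target_words, score_without_lm) for one chart cell."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Queue n-grams for the top N*K items ranked without the LM.
    queued = set()
    for words, _ in ranked[:top_n * k]:
        queued.update(ngrams_of(words, n))
    probs = lm.fetch(queued)  # one batched request for the whole cell
    scored = []
    for words, base in ranked:
        grams = ngrams_of(words, n)
        if any(g not in probs for g in grams):
            continue  # needs an n-gram that was never queued: discard
        scored.append((base + sum(probs[g] for g in grams), words))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


lm = BatchedLM({("do",): -1.0, ("do", "not"): -0.5,
                ("not", "want"): -0.7, ("not",): -2.0})
cell = [(["do", "not", "want"], -1.0), (["not", "want"], -1.5),
        (["want", "not"], -3.0)]
top = score_cell(cell, lm, n=2, top_n=1, k=2)
print(top, lm.round_trips)
```

The third candidate falls outside the top N*K items, so its n-grams are never requested and it is dropped, which is the source of the potential search errors mentioned above.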
3.2 PSCFG Minimal-State Recombination
To effectively compare PSCFG approaches to state-of-the-art phrase-based
systems, we must be able to use high order n-gram LMs during PSCFG
decoding, but as shown in Chiang (2007), the number of chart items
generated during decoding grows exponentially in the order of the n-gram
LM. Maintaining full n−1 word left and right histories for each chart item
(required to correctly select the argmax derivation when considering an
n-gram LM feature) is prohibitive for n > 3.
We note, however, that the full n − 1 left and right word histories are
unnecessary to safely compare two competing chart items. Rather, given the
sparsity of high order n-gram LMs, we only need to consider those
histories that can actually be found in the n-gram LM. This allows
significantly more chart items to be recombined during decoding, without
additional search error. The n-gram LM implementation described in Brants
et al. (2007) indicates when a particular n-gram is not found in the
model, and returns a shortened n-gram (or “state”) that represents this
shortened condition. We use this state to identify the left and right
chart item histories, thus reducing the number of equivalence classes per
cell.
Following Venugopal et al. (2007), we also calculate an estimate for the
quality of each chart item's left state based on the words represented
within the state (since we cannot know the target words that might precede
this item in the final translation). This estimate is only used during
Cube-Pruning to limit the number of chart items generated.
The extensions above allow us to experiment with the same order of n-gram
LMs used in state-of-the-art phrase-based systems. While experiments in
this work include up to 5-gram models, we have successfully run these
PSCFG systems with higher order n-gram LMs as well.
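The idea of shortening an item's history to what the LM can actually use can be sketched for the right-hand state. This is an illustration under assumptions: `known_contexts` is a hypothetical set of word sequences that occur as contexts in the LM, standing in for the shortened state the LM implementation returns, and only the right state is shown (the left state would be handled analogously).

```python
def right_state(words, known_contexts, n):
    """Longest suffix of the last n-1 words that the LM knows as a context."""
    history = tuple(words[-(n - 1):]) if n > 1 else ()
    while history and history not in known_contexts:
        history = history[1:]  # drop the oldest word and try a shorter history
    return history


# Hypothetical set of contexts present in a sparse 5-gram LM.
known_contexts = {("want",), ("not", "want")}

item_a = ["I", "do", "not", "want"]
item_b = ["we", "really", "do", "not", "want"]

state_a = right_state(item_a, known_contexts, n=5)
state_b = right_state(item_b, known_contexts, n=5)
print(state_a, state_b, state_a == state_b)
```

The two items differ in their full four-word histories but share the same shortened state, so under this scheme they would fall into the same equivalence class and could be recombined.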
4 Experiments
4.1 Chinese-English and Arabic-English
We report experiments on three data configurations. The first
configuration (Full) uses all the data (both bilingual and monolingual)
available for the NIST 2008 large track translation task. The parallel
training data comprises 9.1M sentence pairs (223M Arabic words, 236M
English words) for Arabic-English and 15.4M sentence pairs (295M Chinese
words, 336M English words) for Chinese-English. This configuration (for
both Chinese-English and Arabic-English) includes three 5-gram LMs trained
on the target side of the parallel data (549M tokens, 448M 1..5-grams),
the LDC Gigaword corpus (3.7B tokens, 2.9B 1..5-grams) and the Web 1T
5-Gram Corpus (1T tokens, 3.8B 1..5-grams). The second configuration
(TargetLM) uses a single language model trained only on the target side of
the parallel training text to compare approaches with a relatively weaker
n-gram LM. The third configuration is a simulation of a low data scenario
(10%TM), where only 10% of the bilingual training data is used, with the
language model from the TargetLM configuration. Translation quality is
automatically evaluated by the IBM-BLEU metric (Papineni et al., 2002)
(case-sensitive, using length of the closest reference translation) on the
following publicly available NIST test corpora: MT02, MT03, MT05, MT06,
MT08. We used the NIST MT04 corpus as development set to train the model
parameters λ. All of the systems were evaluated based on the argmax
decision rule. For the purposes of stable comparison across multiple test
sets, we additionally report a TstAvg score which is the average of all
test set scores.²

Ch.-En. System \ %BLEU   Dev (MT04)   MT02   MT03   MT05   MT06   MT08   TstAvg
FULL
Phraseb. reo=4           37.5         38.0   38.9   36.5   32.2   26.2   34.4
Phraseb. reo=7           40.2         40.3   41.1   38.5   34.6   27.7   36.5
Phraseb. reo=12          41.3*        41.0   41.8   39.4   35.2   27.9   37.0
Hier.                    41.6*        40.9   42.5   40.3   36.5   28.7   37.8
SAMT                     41.9*        41.0   43.0   40.6   36.5   29.2   38.1
TARGET-LM
Phraseb. reo=4           35.9*        36.0   36.0   33.5   30.2   24.6   32.1
Phraseb. reo=7           38.3*        38.3   38.6   35.8   31.8   25.8   34.1
Phraseb. reo=12          39.0*        38.7   38.9   36.4   33.1   25.9   34.6
Hier.                    38.1*        37.8   38.3   36.0   33.5   26.5   34.4
SAMT                     39.9*        39.8   40.1   36.6   34.0   26.9   35.5
TARGET-LM, 10%TM
Phraseb. reo=12          36.4*        35.8   35.3   33.5   29.9   22.9   31.5
Hier.                    36.4*        36.5   36.3   33.8   31.5   23.9   32.4
SAMT                     36.5*        36.1   35.8   33.7   31.2   23.8   32.1

Ar.-En. System \ %BLEU   Dev (MT04)   MT02   MT03   MT05   MT06   MT08   TstAvg
FULL
Phraseb. reo=4           51.7         64.3   54.5   57.8   45.9   44.2   53.3
Phraseb. reo=7           51.7*        64.5   54.3   58.2   45.9   44.0   53.4
Phraseb. reo=9           51.7         64.3   54.4   58.3   45.9   44.0   53.4
Hier.                    52.0*        64.4   53.5   57.5   45.5   44.1   53.0
SAMT                     52.5*        63.9   54.2   57.5   45.5   44.9   53.2
TARGET-LM
Phraseb. reo=4           49.3         61.3   51.4   53.0   42.6   40.2   49.7
Phraseb. reo=7           49.6*        61.5   51.9   53.2   42.8   40.1   49.9
Phraseb. reo=9           49.6         61.5   52.0   53.4   42.8   40.1   50.0
Hier.                    49.1*        60.5   51.0   53.5   42.0   40.0   49.4
SAMT                     48.3*        59.5   50.0   51.9   41.0   39.1   48.3
TARGET-LM, 10%TM
Phraseb. reo=7           47.7*        59.4   50.1   51.5   40.5   37.6   47.8
Hier.                    46.7*        58.2   48.8   50.6   39.5   37.4   46.9
SAMT                     45.9*        57.6   48.7   50.7   40.0   37.3   46.9

Table 1: Results (% case-sensitive IBM-BLEU) for Ch-En and Ar-En NIST-large. Dev. scores with * indicate that the parameters
of the decoder were MER-tuned for this configuration and also used in the corresponding non-marked configurations.
Table 1 shows results comparing phrase-based,
hierarchical and SAMT systems on the Chinese-English
and Arabic-English large-track NIST 2008
tasks. Our primary goal in Table 1 is to evaluate
the relative impact of the PSCFG methods above
the phrase-based approach, and to verify that these
improvements persist with the use of large n-gram
LMs. We also show the impact of larger
reordering capability under the phrase-based approach,
providing a fair comparison to the PSCFG
approaches.
² We prefer this over taking the average over the aggregate
test data to avoid artificially generous BLEU scores due to
length penalty effects resulting from e.g. being too brief in a
hard test set but compensating this by over-generating in an
easy test set.
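As a small worked example of the TstAvg score, the following averages the per-test-set BLEU scores of one row of Table 1 (Chinese-English, FULL, phrase-based with reordering limit 4). The function name is a hypothetical helper; the scores are those reported in the table.

```python
def tst_avg(scores):
    """Unweighted mean of per-test-set BLEU scores (one score per corpus)."""
    return sum(scores.values()) / len(scores)


# Ch.-En., FULL, Phraseb. reo=4 row of Table 1.
row = {"MT02": 38.0, "MT03": 38.9, "MT05": 36.5, "MT06": 32.2, "MT08": 26.2}
average = tst_avg(row)
print(round(average, 1))
```

Because each test set contributes one score regardless of its size, a system cannot offset overly short output on one test set with overly long output on another, which is the concern raised in the footnote.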
Chinese-to-English configurations: We see
consistent improvements moving from phrase-based
models to PSCFG models. This trend
holds in both LM configurations (Full and TargetLM)
as well as the 10%TM case, with the exception
of the hierarchical system for TargetLM,
which performs slightly worse than the
maximum-reordering phrase-based system.
We vary the reordering limit “reo” for the
phrase-based Full and TargetLM configurations
and see that Chinese-to-English translation requires
significant reordering to generate fluent
translations, as shown by the TstAvg difference between
phrase-based reordering limited to 4 words
(34.4) and 12 words (37.0). Increasing the reordering
limit beyond 12 did not yield further improvement.
Relative improvements over the most capable
phrase-based model demonstrate that PSCFG
models are able to model reordering effects more
effectively than our phrase-based approach, even
in the presence of strong n-gram LMs (to aid the
distortion models) and comparable reordering constraints.