Jonathan H. Clark’s research while affiliated with Microsoft and other places


Publications (13)


Figure 1: Top: A traditional learning procedure, assigning a set of weights to a fixed feature set. Bottom: Discretization, our feature induction technique, expands the feature set as part of learning, while still producing a linear model for inference, albeit with more features.
Figure 2: Left: A real-valued feature. Bold dots represent points where we could imagine bins being placed. However, since we may only adjust w_0, these "bins" are rigidly fixed along the feature function's value. Right: After discretizing the feature into 4 bins, we may adjust 4 weights independently to achieve a non-linear re-shaping of the function.
Figure 3: We perform discretization locally on each grammar rule or phrase pair, operating on the local feature vectors h. In this example, the original real-valued features are crossed out with a solid gray line and their discretized indicator features are written above. When forming a complete hypothesis from partial hypotheses, we sum the counts of these indicator features to obtain the complete feature vector H. In this example, H = {H_TM0.1: 2, H_TM0.2: 1, H_Count2: 1}.
Figure 7: Plots of weights learned for the discretized p_coherent(e|f) (top) and c(f) (bottom) for the Ar→En system with 4 bits and monotone neighbor regularization. Values for p(e|f) > 0.11 are omitted for exposition, as they were constant beyond this point. The gray line fits a log curve to the weights. The system learns a shape that deviates from the log curve in several regions; each non-monotonic segment represents the learner choosing to better fit the data while paying a strong regularization penalty.
Locally Non-Linear Learning for Statistical Machine Translation via Discretization and Structured Regularization
  • Article
  • Full-text available

December 2014 · 18 Reads · 3 Citations

Transactions of the Association for Computational Linguistics

Jonathan H. Clark · Chris Dyer

Linear models, which support efficient learning and inference, are the workhorses of statistical machine translation; however, linear decision rules are less attractive from a modeling perspective. In this work, we introduce a technique for learning arbitrary, rule-local, non-linear feature transforms that improve model expressivity, but do not sacrifice the efficient inference and learning associated with linear models. To demonstrate the value of our technique, we discard the customary log transform of lexical probabilities and drop the phrasal translation probability in favor of raw counts. We observe that our algorithm learns a variation of a log transform that leads to better translation quality compared to the explicit log transform. We conclude that non-linear responses play an important role in SMT, an observation that we hope will inform the efforts of feature engineers.
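
The discretization idea is simple to sketch in code: replace each real-valued feature with an indicator for the bin its value falls into, so a linear learner can fit one weight per bin. Below is a minimal illustrative sketch in Python; the function name and the quantile binning scheme are assumptions for exposition, and the paper's structured regularization over neighboring bins is omitted.

```python
import numpy as np

def discretize(values, n_bits=4):
    """Map real-valued feature values to bin indices (2**n_bits bins).

    Each bin becomes an indicator feature with its own weight, letting a
    linear model learn a non-linear response to the original feature.
    Sketch only: quantile binning is one of several possible schemes.
    """
    n_bins = 2 ** n_bits
    # Interior bin boundaries at empirical quantiles of observed values.
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    # searchsorted assigns each value the index of the bin it falls into.
    return np.searchsorted(edges, values)

# Example: discretize a lexical translation probability feature.
p_e_given_f = np.array([0.003, 0.05, 0.1, 0.42, 0.9])
print(discretize(p_e_given_f, n_bits=2))  # one bin index per rule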


Scalable Modified Kneser-Ney Language Model Estimation

August 2013 · 549 Reads · 535 Citations

We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
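
For context, the modified Kneser-Ney quantities that the streaming pipeline must compute are all closed-form functions of the count-of-counts n_k (the number of n-grams seen exactly k times). The discounts below follow the standard Chen and Goodman (1998) formulation and are stated here for reference, not quoted from the paper:

```latex
% Closed-form discount estimates (Chen & Goodman, 1998), with
% n_k = number of n-grams of count exactly k:
Y = \frac{n_1}{n_1 + 2 n_2}, \qquad
D_1 = 1 - 2Y\,\frac{n_2}{n_1}, \qquad
D_2 = 2 - 3Y\,\frac{n_3}{n_2}, \qquad
D_{3+} = 3 - 4Y\,\frac{n_4}{n_3}

% Interpolated modified Kneser-Ney estimate: discount the raw count and
% back off to the lower-order distribution with weight \lambda(h):
p(w \mid h) = \frac{\max\bigl\{c(hw) - D(c(hw)),\, 0\bigr\}}{\sum_{w'} c(hw')}
  + \lambda(h)\, p(w \mid h')
```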


The CMU-ARK German-English translation system

July 2011 · 21 Reads · 10 Citations

This paper describes the German-English translation system developed by the ARK research group at Carnegie Mellon University for the Sixth Workshop on Machine Translation (WMT11). We present the results of several modeling and training improvements to our core hierarchical phrase-based translation system, including: feature engineering to improve modeling of the derivation structure of translations; better handling of OOVs; and using development set translations into other languages to create additional pseudo-references for training.


Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability.

January 2011 · 102 Reads · 412 Citations

In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, they run an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability—an extraneous variable that is seldom controlled for—on experimental outcomes, and make recommendations for reporting results more accurately.
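
The paper's recommendations (multiple optimizer replications plus a resampling-based significance test) are implemented in the authors' MultEval tool. As a rough illustration of the statistical core, here is a paired bootstrap sketch over per-segment scores; this simplification assumes a segment-level metric, whereas corpus-level BLEU requires resampling sufficient statistics instead.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=10_000, seed=0):
    """Paired bootstrap test: how often does a resampled test set fail
    to show system B ahead of system A?

    scores_a, scores_b: per-segment metric scores on the same test set,
    ideally pooled over several optimizer replications, since the paper
    shows a single optimizer run is an unreliable basis for comparison.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    contradictions = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample segments
        diff = sum(scores_b[i] - scores_a[i] for i in idx) / n
        if diff <= 0:  # this resample does not favor system B
            contradictions += 1
    return contradictions / n_samples  # approximate p-value

# Usage: paired_bootstrap(baseline_scores, improved_scores) < 0.05
```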


Unsupervised Word Alignment with Arbitrary Features

January 2011 · 39 Reads · 38 Citations

We introduce a discriminatively trained, globally normalized, log-linear variant of the lexical translation models proposed by Brown et al. (1993). In our model, arbitrary, non-independent features may be freely incorporated, thereby overcoming the inherent limitation of generative models, which require that features be sensitive to the conditional independencies of the generative process. However, unlike previous work on discriminative modeling of word alignment (which also permits the use of arbitrary features), the parameters in our models are learned from unannotated parallel sentences, rather than from supervised word alignments. Using a variety of intrinsic and extrinsic measures, including translation performance, we show our model yields better alignments than generative baselines in a number of language pairs.
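
Schematically (a paraphrase in generic notation, not the paper's exact model), "globally normalized, log-linear, with latent alignments" means scoring whole configurations with arbitrary features h and marginalizing the unobserved alignment a during training:

```latex
% Globally normalized log-linear model with latent alignment a;
% h may contain arbitrary, overlapping features of (e, a, f).
p_\theta(e, a \mid f) = \frac{\exp\bigl(\theta \cdot h(e, a, f)\bigr)}{Z_\theta(f)},
\qquad
Z_\theta(f) = \sum_{e', a'} \exp\bigl(\theta \cdot h(e', a', f)\bigr)

% Training objective on unannotated sentence pairs: marginalize a.
\mathcal{L}(\theta) = \sum_{(e, f)} \log \sum_{a} p_\theta(e, a \mid f)
```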


Improved Features and Grammar Selection for Syntax-Based MT

August 2010 · 60 Reads

We present the Carnegie Mellon University Stat-XFER group submission to the WMT 2010 shared translation task. Updates to our syntax-based SMT system mainly fell in the areas of new feature formulations in the translation model and improved filtering of SCFG rules. Compared to our WMT 2009 submission, we report a gain of 1.73 BLEU by using the new features and decoding environment, and a gain of up to 0.52 BLEU from improved grammar selection.


The Machine Translation Toolpack for LoonyBin: Automated Management of Experimental Machine Translation HyperWorkflows

February 2010 · 27 Reads · 4 Citations

Prague Bulletin of Mathematical Linguistics

Jonathan H. Clark · Jonathan Weese · Byung Gyu Ahn · [...]

Construction of machine translation systems has evolved into a multi-stage workflow involving many complicated dependencies. Many decoder distributions have addressed this by including monolithic training scripts - train-factored-model.pl for Moses and mr_runmer.pl for SAMT. However, such scripts can be tricky to modify for novel experiments and typically have limited support for the variety of job schedulers found on academic and commercial computer clusters. Further complicating these systems are hyperparameters, which often cannot be directly optimized by conventional methods, requiring users to determine which combination of values is best via trial and error. The recently released LoonyBin open-source workflow management tool addresses these issues by providing: 1) a visual interface for the user to create and modify workflows; 2) a well-defined logging mechanism; 3) a script generator that compiles visual workflows into shell scripts; and 4) the concept of HyperWorkflows, which intuitively and succinctly encode small experimental variations within a larger workflow. In this paper, we describe the Machine Translation Toolpack for LoonyBin, which exposes state-of-the-art machine translation tools as drag-and-drop components within LoonyBin.
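
The HyperWorkflow concept can be pictured as a workflow DAG plus branch points whose value combinations expand into one concrete experiment apiece. A conceptual Python illustration follows (invented names and structure; not LoonyBin's actual format or API):

```python
from itertools import product

# A HyperWorkflow, conceptually: ordinary workflow steps plus branch
# points whose values multiply out into one concrete run per combination.
# (Invented structure for illustration only.)
branch_points = {
    "lm_order": [3, 5],
    "alignment": ["giza", "berkeley"],
}

def expand(branch_points):
    """Yield one concrete configuration per combination of branch values."""
    names = sorted(branch_points)
    for values in product(*(branch_points[n] for n in names)):
        yield dict(zip(names, values))

for config in expand(branch_points):
    print(config)  # e.g. {'alignment': 'giza', 'lm_order': 3}
```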



An Improved Statistical Transfer System for French-English Machine Translation

March 2009 · 228 Reads · 7 Citations

This paper presents the Carnegie Mellon University statistical transfer MT system submitted to the 2009 WMT shared task in French-to-English translation. We describe a syntax-based approach that incorporates both syntactic and non-syntactic phrase pairs in addition to a syntactic grammar. After reporting development test results, we conduct a preliminary analysis of the coverage and effectiveness of the system's components.


Inductive detection of language features via clustering minimal pairs

January 2008 · 18 Reads

Syntax-based Machine Translation systems have recently become a focus of research with much hope that they will outperform traditional Phrase-Based Statistical Machine Translation (PBSMT). Toward this goal, we present a method for analyzing the morphosyntactic content of language from an Elicitation Corpus such as the one included in the LDC's upcoming LCTL language packs. The presented method discovers a mapping between morphemes and linguistically relevant features. By providing this tool that can augment structure-based MT models with these rich features, we believe the discriminative power of current models can be improved. We conclude by outlining how the resulting output can then be used in inducing a morphosyntactically feature-rich grammar for AVENUE, a modern syntax-based MT system.


Citations (10)


... This is potentially problematic, as interactions between domain-specific features can be complex. It may be necessary to perform preprocessing steps over the feature space to produce a feature set that is less prone to non-linearities (Clark et al., 2014). However, methods tailored to such special treatment are quite sophisticated and not widely deployed in practice. ...

Reference:

A survey of domain adaptation for statistical machine translation
Locally Non-Linear Learning for Statistical Machine Translation via Discretization and Structured Regularization

Transactions of the Association for Computational Linguistics

... The baseline regression models contain as predictors word length in characters, index of word position within the sentence, unigram surprisal (all datasets), and whether the previous word was fixated (ET datasets only). Unigram surprisal was calculated using the KenLM toolkit (Heafield et al., 2013) with parameters estimated on the OpenWebText Corpus (Gokaslan and Cohen, 2019). On top of these baseline regression models, surprisal of the current word and the preceding word was included to capture spillover effects (Rayner et al., 1983). ...

Scalable Modified Kneser-Ney Language Model Estimation
  • Citing Conference Paper
  • August 2013

... While software evaluation initially requires human validation to ensure its correlation with human standards, it simplifies the analysis of a large number of translations. Papineni (2002) and Lavie (2011) acknowledge its limitations, particularly in error detection and consistency in single-phrase analysis, yet its utility in managing large-scale translation evaluations is undeniable. ...

Statistical MT with Syntax and Morphology: Challenges and Some Solutions

... The two sets of parsed trees obtained, English and Vietnamese, were further used as training data to extract transfer rules. In contrast, Hanneman et al. (2009) extracted unique grammar rules from an English-French parallel parsed corpus and selected high-frequency rules to reorder the positions of constituents. Hwa et al. (2005) describe an approach that focuses on syntax projection per se, but their approach also relies on word alignment in a parallel corpus. ...

An Improved Statistical Transfer System for French-English Machine Translation

... Experiment management tools (Koehn, 2010; Clark et al., 2010) abstract the internals of the decoder from the user to provide a uniform interface to the main training steps of the system. While these facilitate the coordination of large experimental setups, they must be configured using either a domain-specific language or a graphical interface that the user has to learn in order to manipulate the system. ...

The Machine Translation Toolpack for LoonyBin: Automated Management of Experimental Machine Translation HyperWorkflows
  • Citing Article
  • February 2010

Prague Bulletin of Mathematical Linguistics

... By processing the dataset using our methodology, we reduce the training time by 65% while raising the BLEU score by 1.6. A statistical significance test performed by using MultEval (Clark et al., 2011) to do bootstrap resampling shows that our improved system, trained on less data, is significantly better than the baseline, with p < 0.01. It is also noteworthy that our system is only 0.6 BLEU below that of IndicTrans, reported in Section 4.1, which is almost seven times larger in terms of parameters and trained on the whole Samanantar dataset. ...

Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability.
  • Citing Conference Paper
  • January 2011

... Following e.g. (Dyer et al., 2011; Lavergne et al., 2011), our implementation of the CRF model heavily relies on weighted finite-state models and operations (Allauzen et al., 2007), which we use to represent the spaces of all possible and reference labellings on a per-sentence basis, and to efficiently compute the expectations involved in the gradient, as well as to search for the optimal labellings and compute alignment and label posteriors. ...

Unsupervised Word Alignment with Arbitrary Features
  • Citing Conference Paper
  • January 2011

... They can cover a large variety of use-cases (Liew et al., 2016), but their use is hindered by strict requirements on the users' computing nodes. The toolkit that seems most similar to our approach is Ducttape, the successor of LoonyBin (Clark and Lavie, 2010). It is well designed and covers many useful points. ...

LoonyBin: Keeping Language Technologists Sane through Automated Management of Experimental (Hyper)Workflows.
  • Citing Conference Paper
  • January 2010

... In the former, datasets are developed for structured elicitation, where linguists ask native speakers to judge grammaticality of presented sentences or to answer questions about their language. The elicitation results are key to deriving properties of the language (Clark et al., 2008; Probst and Levin, 2002). In the latter, datasets are constructed for system evaluation to establish benchmarks to gauge progress on shared tasks. ...

Toward Active Learning in Data Selection: Automatic Discovery of Language Features During Elicitation.