
Christof Monz
- PhD in Computer Science
- University of Amsterdam
About
Publications: 149
Reads: 16,231
Citations: 5,272
Current institution: University of Amsterdam
Additional affiliations
October 2005 - January 2009
January 2004 - September 2005
Publications (149)
Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps. Translating Step-by-step...
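To make this kind of decomposition concrete, below is a minimal, hypothetical hand-crafted prompt in the spirit of step-by-step translation; it is illustrative only, not the prompt used in the paper, and the variable name is an assumption.

# Hypothetical decomposition prompt for LLM-based translation (illustrative only).
STEP_BY_STEP_PROMPT = (
    "Translate the following sentence into English.\n"
    "Step 1: Give a literal draft translation.\n"
    "Step 2: List ambiguous words or idioms and resolve them.\n"
    "Step 3: Output the final, fluent translation only.\n\n"
    "Source: {source}\n"
)

# Usage: STEP_BY_STEP_PROMPT.format(source="Das ist ein Beispiel.")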
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but...
Prior research diverges on language diversity in LLM fine-tuning: Some studies report benefits while others find no advantages. Through controlled fine-tuning experiments across 132 translation directions, we systematically resolve these disparities. We find that expanding language diversity during fine-tuning improves translation quality for both...
In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to test the applicability of a popular hypothesis about knowledge injection in the knowledge-intensive scenario of medical QA, and show that: 1) training on data from...
This paper introduces Unilogit, a novel self-distillation method for machine unlearning in Large Language Models. Unilogit addresses the challenge of selectively forgetting specific information while maintaining overall model utility, a critical task in compliance with data privacy regulations like GDPR. Unlike prior methods that rely on static hyp...
Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the model's distribution. However, recent evidence highlights the inadequacy of MAP decoding, which often results in low-quality or even pathological hypotheses -- the decoding objective is not aligned with real-worl...
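One commonly studied alternative to MAP decoding is minimum Bayes risk (MBR) selection over sampled candidates; the sketch below is a generic illustration of that contrast, not the method of this paper, and the utility function is a hypothetical stand-in for any sentence-level quality metric.

# Generic sketch: MAP selection vs. minimum Bayes risk (MBR) selection.
# candidates: list of dicts {"text": str, "logprob": float}; utility(hyp, ref) -> float.

def map_select(candidates):
    # MAP: return the hypothesis the model itself scores highest.
    return max(candidates, key=lambda c: c["logprob"])["text"]

def mbr_select(candidates, utility):
    # MBR: return the hypothesis with the highest average utility when the
    # other candidates are treated as pseudo-references.
    def expected_utility(hyp):
        refs = [c["text"] for c in candidates if c["text"] != hyp]
        return sum(utility(hyp, r) for r in refs) / max(len(refs), 1)
    return max((c["text"] for c in candidates), key=expected_utility)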
A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation...
As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this...
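As background for the performance drop at low bit widths, here is a minimal NumPy sketch of symmetric per-channel round-to-nearest weight quantization; it illustrates weight-only quantization in general and is not the method proposed in this paper.

import numpy as np

def quantize_dequantize(W, bits=4):
    # Symmetric per-output-channel round-to-nearest quantization: store
    # low-bit integers plus one float scale per row, then reconstruct
    # approximate float weights. Reconstruction error grows as the bit
    # width shrinks, which is the degradation discussed above.
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4 bits
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q * scale                                # dequantized weights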
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative deco...
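A rough sketch of the reward-guided idea as described here: accept cheap draft steps when a reward model rates them highly, and fall back to the larger target model otherwise. All function names and the threshold rule are hypothetical stand-ins, not the paper's exact algorithm.

def rsd_generate(prompt, draft_step, target_step, reward, threshold, max_steps=64):
    # draft_step/target_step: continue the text by one step (hypothetical APIs);
    # reward(context, candidate): scores a proposed step.
    output = prompt
    for _ in range(max_steps):
        candidate = draft_step(output)              # cheap draft proposal
        if reward(output, candidate) >= threshold:
            step = candidate                        # keep the high-reward draft
        else:
            step = target_step(output)              # fall back to the target model
        if not step:                                # empty step signals completion
            break
        output += step
    return output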
Extremely low-resource (XLR) languages lack substantial corpora for training NLP models, motivating the use of all available resources such as dictionaries and grammar books. Machine Translation from One Book (Tanzer et al., 2024) suggests that prompting long-context LLMs with one grammar book enables English-Kalamang translation, an unseen XLR language...
Parameter-efficient finetuning (PEFT) methods effectively adapt large language models (LLMs) to diverse downstream tasks, reducing storage and GPU memory demands. Despite these advantages, several applications pose new challenges to PEFT beyond mere parameter efficiency. One notable challenge involves the efficient deployment of LLMs equipped with...
This paper introduces two multilingual systems, IKUN and IKUN-C, developed for the general machine translation task in WMT24. IKUN and IKUN-C represent an open system and a constrained system, respectively, built on Llama-3-8b and Mistral-7B-v0.3. Both systems are designed to handle all 11 language directions using a single model. According to auto...
Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issu...
Fine-tuning large language models (LLMs) for machine translation has shown improvements in overall translation quality. However, it is unclear what impact fine-tuning has on desirable LLM behaviors that are not present in neural machine translation models, such as steerability, inherent document-level translation abilities, and the ability to...
Despite the tremendous success of Neural Machine Translation (NMT), its performance on low-resource language pairs still remains subpar, partly due to the limited ability to handle previously unseen inputs, i.e., generalization. In this paper, we propose a method called Joint Dropout that addresses the challenge of low-resource neural machine tran...
Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach, training only a small number of parameters without sacrificing performance and becoming the de facto learning paradigm as PLMs grow in size. However, existing PEFT methods are not memory-efficient, because they...
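For readers unfamiliar with PEFT, here is a minimal NumPy sketch of one widely used method, LoRA (low-rank adaptation), which trains only two small matrices per layer; it illustrates the paradigm in general, not the memory-efficient method proposed in this paper.

import numpy as np

class LoRALinear:
    # The pretrained weight W stays frozen; only the low-rank factors A and B
    # are trained, so the layer effectively computes x @ (W + scale * B @ A).T.
    def __init__(self, W, r=8, alpha=16):
        d_out, d_in = W.shape
        self.W = W                                  # frozen pretrained weight
        self.A = np.random.randn(r, d_in) * 0.01    # trainable down-projection
        self.B = np.zeros((d_out, r))               # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T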
Parameter-efficient fine-tuning (PEFT) of pre-trained language models has recently demonstrated remarkable achievements, effectively matching the performance of full fine-tuning while utilizing significantly fewer trainable parameters, and consequently addressing the storage and communication constraints. Nonetheless, various PEFT methods are limit...
Using a shared vocabulary is common practice in Multilingual Neural Machine Translation (MNMT). In addition to its simple design, shared tokens play an important role in positive knowledge transfer, which manifests naturally when the shared tokens refer to similar meanings across languages. However, natural flaws exist in such a design as well: 1)...
We argue that translation quality alone is not a sufficient metric for measuring knowledge transfer in multilingual neural machine translation. To support this claim, we introduce Representational Transfer Potential (RTP), which measures representational similarities between languages. We show that RTP can measure both positive and negative transfe...
The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and gather the contextualized information from unmasked tokens to restore the corrupted information. It raises the...
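The vanilla [MASK]-based corruption described here is easy to sketch (simplified: real MLM training also sometimes keeps or randomizes the selected tokens instead of masking all of them).

import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    # Replace a random subset of tokens with [MASK] placeholders; the model
    # is trained to restore the originals at exactly those positions.
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            corrupted.append(mask_token)
            targets.append(tok)                     # predicted
        else:
            corrupted.append(tok)
            targets.append(None)                    # not predicted
    return corrupted, targets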
Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with words that do not occur during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE), which splits words, including OOV words, into sub-w...
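The BPE algorithm itself can be sketched in a few lines: starting from characters, repeatedly merge the most frequent adjacent symbol pair, so unseen words can still be segmented into known subwords. A minimal (unoptimized) sketch:

from collections import Counter

def learn_bpe(words, num_merges):
    # Each word starts as a tuple of characters; merges are learned greedily.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges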
Language pairs with limited amounts of parallel data, also known as low-resource languages, remain a challenge for neural machine translation. While the Transformer model has achieved significant improvements for many language pairs and has become the de facto mainstream architecture, its capability under low-resource conditions has not been fully...
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task wa...
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2018. Participants were asked to build machine translation systems for any of 7 language pairs in both directions, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation qu...
The amount of digitally available information that is authored in languages other than English has been rapidly increasing over the last decade. This results in a distribution of information not only across different sources but also different languages. While various areas within natural language processing, such as machine translation, informatio...
CoSyne is a content synchronization system for assisting users and organizations involved in the maintenance of multilingual wikis. The system allows users to explore the diversity of multilingual content using a monolingual view. It provides suggestions for content modification based on additional or more specific information found in other langua...
This paper presents the results of the WMT12 shared tasks, which included a translation task, a task for machine translation evaluation metrics, and a task for run-time estimation of machine translation quality. We conducted a large-scale manual evaluation of 103 machine translation systems submitted by 34 teams. We used the ranking of these system...
This work proposes to adapt an existing general SMT model for the task of translating queries that are subsequently going to be used to retrieve information from a target language collection. In the scenario that we focus on, access to the document collection itself is not available and changes to the IR model are not possible. We propose two ways t...
Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised m...
This article describes a method that successfully exploits syntactic features for n-best translation candidate reranking using perceptrons. We motivate the utility of syntax by demonstrating the superior performance of parsers over n-gram language models in differentiating between Statistical Machine Translation output and human translations. Our a...
Wikis allow a large base of contributors easy access to shared content, and freedom in editing it. One of the side-effects of this freedom was the emergence of parallel and independently evolving versions in a variety of languages, reflecting the multilingual background of the pool of contributors. For the Wiki to properly represent the user-added...
Research on question answering dates back to the 1960s but has more recently been revisited as part of TREC's evaluation campaigns, where question answering is addressed as a subarea of information retrieval that focuses on specific answers to a user's information need. Whereas document retrieval systems aim to return the documents that are most re...
Part-of-speech language modeling is commonly used as a component in statistical machine translation systems, but there is mixed evidence that its usage leads to significant improvements. We argue that its limited effectiveness is due to the lack of lexicalization. We introduce a new approach that builds a separate local language model for each word...
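A hypothetical sketch of what such lexicalization could look like: one tiny local model per word, scoring the POS tags in that word's immediate context. The names and the add-one smoothing here are illustrative assumptions, not the paper's exact model.

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))      # word -> (left POS, right POS) -> count

def train(tagged_sentences):
    # tagged_sentences: lists of (word, pos) pairs.
    for sent in tagged_sentences:
        for i, (word, _) in enumerate(sent):
            left = sent[i - 1][1] if i > 0 else "<s>"
            right = sent[i + 1][1] if i + 1 < len(sent) else "</s>"
            counts[word][(left, right)] += 1

def score(word, left_pos, right_pos):
    # Add-one-smoothed score of seeing this POS context around this word.
    ctx = counts[word]
    return (ctx[(left_pos, right_pos)] + 1) / (sum(ctx.values()) + 1)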
This paper presents the results of the WMT11 shared tasks, which included a translation task, a system combination task, and a task for machine translation evaluation metrics. We conducted a large-scale manual evaluation of 148 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongl...
This paper presents the results of the WMT10 and MetricsMATR10 shared tasks, which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 104 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongly automatic...
This paper describes a method that successfully exploits simple syntactic features for n-best translation candidate reranking using perceptrons. Our approach uses discriminative language modelling to rerank the n-best translations generated by a statistical machine translation system. The performance is evaluated for Arabic-to-English translation u...
This paper describes a simple clustering approach to person name disambiguation of retrieved documents. The methods are based on standard IR concepts and do not require any task-specific features. We compare different term-weighting and indexing methods and evaluate their performance against the Web People Search task (WePS). Despite their sim...
The last few years have witnessed an increasing interest in hybridizing surface-based statistical approaches and rule-based symbolic approaches to machine translation (MT). Much of that work is focused on extending statistical MT systems with symbolic knowledge and components. In the brand of hybridization discussed here, we go in the opposite dire...
This paper addresses the problem of extracting the most important facts from a news article. Our approach uses syntactic, semantic, and general statistical features to identify the most important sentences in a document. The importance of the individual features is estimated using generalized iterative scaling methods trained on an annota...
In this paper we present an extension of a phrase-based decoder that dynamically chunks, reorders, and applies phrase translations in tandem. A maximum entropy classifier is trained based on the word alignments to find the best positions to chunk the source sentence. No language-specific or syntactic information is used to build the chunking cla...
In this paper we describe our participation in the Second Web People Search workshop (WePS2) and detail our approaches. For the clustering task, our focus was on replicating the lessons learned at WePS1 on the data set made available as part of WePS2 and on experimenting with a voting-based combination of clustering methods. We found that cluster...
This paper presents the results of the WMT09 shared tasks, which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 87 machine translation systems and 22 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate wi...
Despite increasing research into the use of syntax during statistical machine translation, the incorporation of syntax into language models has seen limited success. We present a study of the discriminative abilities of generative syntax-based language models, over and above standard n-gram models, with a focus on potential applications for Statist...
This paper analyzes the translation quality of machine translation systems for 10 language pairs translating between Czech, English, French, German, Hungarian, and Spanish. We report the translation quality of over 30 diverse translation systems based on a large-scale manual evaluation involving hundreds of hours of effort. We use the human jud...
It is becoming increasingly common in information retrieval to combine evidence from multiple resources to compute the retrieval status value of documents. Although this has led to considerable improvements in several retrieval tasks, one of the outstanding issues is estimation of the respective weights that should be associated with the different...
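A common baseline for such combination is a weighted sum of per-resource scores after normalization, with the weights tuned on held-out relevance judgments; the sketch below shows that baseline, not the estimation method studied in this paper.

def combine(scores_by_resource, weights):
    # scores_by_resource: {resource: {doc_id: score}}; weights: {resource: float}.
    combined = {}
    for resource, scores in scores_by_resource.items():
        lo, hi = min(scores.values()), max(scores.values())
        for doc, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0   # min-max normalize
            combined[doc] = combined.get(doc, 0.0) + weights[resource] * norm
    return sorted(combined, key=combined.get, reverse=True)   # ranked doc ids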
This article introduces a new task-based evaluation measure called Relevance Prediction that is a more intuitive measure of an individual’s performance on a real-world task than interannotator agreement. Relevance Prediction parallels what a user does in the real world task of browsing a set of documents using standard search tools, i.e., the user...
This paper evaluates the translation quality of machine translation systems for 8 language pairs: translating French, German, Spanish, and Czech to English and back. We carried out an extensive human evaluation which allowed us not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process. We measure...
Question answering systems rely on retrieval components to identify documents that contain an answer to a user's question. The formulation of queries that are used for retrieving those documents has a strong impact on the effectiveness of the retrieval component. Here, we focus on predicting the importance of terms from the original question. We u...
We evaluated machine translation performance for six European language pairs that participated in a shared task: translating French, German, Spanish texts to English and back. Evaluation was done automatically using the Bleu score and manually on fluency and adequacy.
The research context of this paper is developing hybrid machine translation (MT) systems that exploit the advantages of linguistic rule-based and statistical MT systems. Arabic, as a morphologically rich language, is especially challenging even without addressing the hybridization question. In this paper, we describe the challenges in building an...
In this year's CLEF submissions we focus on using a state-of-the-art statistical machine translation approach for ad-hoc cross-language retrieval. Our machine translation approach is phrase-based, as opposed to statistical word-based approaches that have been previously used for query translation in cross-language IR. The phrase translation probabi...
Finding a proper distribution of translation probabilities is one of the most important factors impacting the effectiveness of a cross-language information retrieval system. In this paper we present a new approach that computes translation probabilities for a given query by using only a bilingual dictionary and a monolingual corpus in the target la...
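One simple coherence heuristic in this spirit (illustrative only, not the paper's estimation procedure): weight each dictionary translation of a query term by how often it co-occurs in the target-language corpus with translations of the other query terms.

def translation_probs(query_terms, dictionary, cooc):
    # dictionary: source term -> list of target-language candidates;
    # cooc(a, b): co-occurrence count of two target terms in the monolingual corpus.
    probs = {}
    for term in query_terms:
        candidates = dictionary.get(term, [])
        others = [t for q in query_terms if q != term
                  for t in dictionary.get(q, [])]
        scores = {c: 1 + sum(cooc(c, o) for o in others) for c in candidates}
        total = sum(scores.values())
        probs[term] = {c: s / total for c, s in scores.items()} if total else {}
    return probs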
The ACL-2005 Workshop on Parallel Texts hosted a shared task on building statistical machine translation systems for four European language pairs: French-English, German-English, Spanish-English, and Finnish-English. Eleven groups participated in the event. This paper describes the goals, the task definition and resources, as well as results and so...
Hierarchical organization is a well-known property of language, and yet the notion of hierarchical structure has been largely absent from the best performing machine translation systems in recent community-wide evaluations. In this paper, we discuss a new hierarchical phrase-based statistical machine translation system (Chiang, 2005), prese...
This paper presents a novel approach to combining different word alignments. We view word alignment as a pattern classification problem, where alignment combination is treated as a classifier ensemble, and alignment links are adorned with linguistic features. A neural network model is used to learn word alignments from the individual alignmen...
We present a new word-alignment approach that learns errors made by existing word alignment systems and corrects them. By adapting transformation-based learning to the problem of word alignment, we project new alignment links from already existing links, using features such as POS tags. We show that our alignment link projection approach y...
We implemented an initial application of a sentence-trimming approach (Trimmer) to the problem of multi-document summarization in the MSE2005 and DUC2005 tasks. Sentence trimming was incorporated into a feature-based summarization system, called Multi-Document Trimmer (MDT), by using sentence trimming as both a pre-processing stage and a feature...
This paper demonstrates the usefulness of summaries in an extrinsic task of relevance judgment based on a new method for measuring agreement, Relevance-Prediction, which compares subjects' judgments on summaries with their own judgments on full text documents. We demonstrate that, because this measure is more reliable than previous gold-s...
In this short note we discuss several perspectives on the notion of Guarded Fragments (GFs) of first-order logic first introduced by Andreka, van Benthem and Nemeti. We focus on computational aspects, discussing some applications of GFs together with issues like the design of effective decision methods for specific reasoning tasks and the role of GF...
In this paper, we consider some of the problems that arise if automated reasoning methods are applied to natural language semantics.
This report provides an overview of the findings and software that have evolved from the "Use of Minimal Lexical Conceptual Structures for Single-Document Summarization" project over the last six months. We present the major goals that have been achieved and discuss some of the open issues that we intend to address in the near future. This report...
This report provides an overview of the findings and software that have evolved from the "Symbolic MT with Statistical NLP Components" project over the last year. We present the major goals that have been achieved and discuss some of the open issues that we intend to address in the near future. This report also contains some details on the usage o...
We describe our participation in the TREC 2003 Robust and Web tracks. For the Robust track, we experimented with the impact of stemming and feedback on the worst scoring topics. Our main finding is the effectiveness of stemming on poorly performing topics, which sheds new light on the role of morphological normalization in information retrieval. Fo...
We describe our participation in the TREC 2003 Question Answering track. We explain the ideas underlying our approaches to the task, report on our results, provide an error analysis, and give a summary of our findings so far.
The vast majority of research in information retrieval is done using English collections and topics. This raises questions about the effectiveness of retrieval strategies for other languages. To examine this issue, we focus on document retrieval in nine European languages. In particular, we investigate the effectiveness of language-dependent approach...
Recent years have witnessed considerable advances in information retrieval for European languages other than English. We give an overview of commonly used techniques and we analyze them with respect to their impact on retrieval effectiveness. The techniques considered range from linguistically motivated techniques, such as morphological normalizati...
We investigate the effectiveness of language-dependent approaches to document retrieval, such as stemming and decompounding, and contrast them with language-independent approaches, such as character n-gramming. In order to reap the benefits of more than one type of approach, we also consider the effectiveness of the combination of both types of app...
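Character n-gramming, the language-independent side of this comparison, is trivial to sketch:

def char_ngrams(text, n=4):
    # Overlapping character n-grams with boundary markers; needs no
    # language-specific resources such as stemmers or decompounders.
    text = f"_{text}_"
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# char_ngrams("retrieval") -> ['_ret', 'retr', 'etri', 'trie', 'riev', ...]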
Current question answering systems rely on document retrieval as a means of providing documents which are likely to contain an answer to a user's question. A question answering system heavily depends on the effectiveness of a retrieval system: If a retrieval system fails to find any relevant documents for a question, further processing steps to ext...