Nicola Bertoldi

Fondazione Bruno Kessler | FBK · Human Language Technologies (HLT)

PhD in Mathematics

About

79 Publications
9,257 Reads
6,346 Citations

Publications (79)
Article
Full-text available
Training models for the automatic correction of machine-translated text usually relies on data consisting of (source, MT, human post-edit) triplets providing, for each source sentence, examples of translation errors with the corresponding corrections made by a human post-editor. Ideally, a large amount of data of this kind should allow the model t...
Conference Paper
Full-text available
In this paper we report on FBK's participation in the English-to-German news translation task of the Second Conference on Machine Translation (WMT'17). The submitted system is based on Neural Machine Translation using byte-pair encoding segmentation on both source and target languages for open-vocabulary translations. Back-translations of news mono...
Conference Paper
Full-text available
Machine translation systems are conventionally trained on textual resources that do not model phenomena that occur in spoken language. While the evaluation of neural machine translation systems on textual inputs is actively researched in the literature, little has been discovered about the complexities of translating spoken language data with neur...
Article
We investigate adaptive machine translation (MT) as a way to reduce human workload and enhance user experience when professional translators operate in real-life conditions. A crucial aspect in our analysis is how to ensure a reliable assessment of MT technologies aimed to support human post-editing. We pay particular attention to two evaluation as...
Conference Paper
Full-text available
EU-BRIDGE is a European research project which is aimed at developing innovative speech translation technology. One of the collaborative efforts within EU-BRIDGE is to produce joint submissions of up to four different partners to the evaluation campaign at the 2014 International Workshop on Spoken Language Translation (IWSLT). We submitted com-b...
Article
Recent research has shown that accuracy and speed of human translators can benefit from post-editing output of machine translation systems, with larger benefits for higher quality output. We present an efficient online learning framework for adapting all modules of a phrase-based statistical machine translation system to post-edited translations. W...
Article
Full-text available
The effective integration of MT technology into computer-assisted translation tools is a challenging topic both for academic research and the translation industry. In particular, professional translators consider the ability of MT systems to adapt to the feedback provided by them to be crucial. In this paper, we propose an adaptation scheme to tune...
Article
Full-text available
A very hot issue for research and industry is how to effectively integrate machine translation (MT) within computer assisted translation (CAT) software. This paper focuses on this issue, and more generally how to dynamically adapt phrase-based statistical machine translation (SMT) by exploiting external knowledge, like the post-editions from profes...
Conference Paper
Full-text available
EU-BRIDGE is a European research project which is aimed at developing innovative speech translation technology. This paper describes one of the collaborative efforts within EU-BRIDGE to further advance the state of the art in machine translation between two European language pairs, English→French and German→English. Four research institutions...
Conference Paper
The new frontier of computer assisted translation technology is the effective integration of statistical MT within the translation workflow. In this respect, the SMT ability of incrementally learning from the translations produced by users plays a central role. A still open problem is the evaluation of SMT systems that evolve over time. In this pap...
Article
Full-text available
This paper describes efforts towards the development of an Arabic to Italian SMT system for the news domain. Since only very little parallel data are available for this language pair, we investigated both the exploitation of comparable corpora and pivot translation. Experimental evaluation was conducted on a new benchmark developed by extending tw...
Conference Paper
Full-text available
This paper investigates the impact of misspelled words in statistical machine translation and proposes an extension of the translation engine for handling misspellings. The enhanced system decodes a word-based confusion network representing spelling variations of the input text. We present extensive experimental results on two translation tasks of...
Article
Full-text available
This paper proposes a novel method for exploiting comparable documents to generate parallel data for machine translation. First, each source document is paired to each sentence of the corresponding target document; second, partial phrase alignments are computed within the paired texts; finally, fragment pairs across linked phrase-pairs are ex-tra...
Article
Domain adaptation has recently gained interest in statistical machine translation to cope with the performance drop observed when testing conditions deviate from training conditions. The basic idea is that in-domain training data can be exploited to adapt all components of an already developed system. Previous work showed small performance gains...
Article
We describe an open-source soware for minimum error rate training (MERT) for statistical machine translation (SMT). is was implemented within the Moses toolkit, although it is essentially standsalone, with the aim of replacing the existing implementation with a cleaner, more flexible design, in order to facilitate further research in weight optim...
Article
This paper describes advances in the use of confusion networks as interface between automatic speech recognition and machine translation. In particular, it presents a decoding algorithm for confusion networks which results as an extension of a state-of-the-art phrase-based text translation decoder. The confusion network decoder significantly improv...
Article
Full-text available
This paper describes an approach for computing a consensus translation from the outputs of multiple machine translation (MT) systems. The consensus translation is computed by weighted majority voting on a confusion network, similarly to the well-established ROVER approach of Fiscus for combining speech recognition hypotheses. To create the confusio...
Article
Full-text available
Translation with pivot languages has recently gained attention as a means to circumvent the data bottleneck of statistical machine translation (SMT). This paper tries to give a mathematically sound formulation of the various approaches presented in the literature and introduces new methods for training alignment models through pivot languages. We...
Article
Full-text available
This work extends phrase-based statistical MT (SMT) with shallow syntax dependencies. Two string-to-chunks translation models are proposed: a factored model, which augments phrase-based SMT with layered dependencies, and a joint model, that extends the phrase translation table with microtags, i.e. per-word projections of chunk labels. Both rely...
Conference Paper
Full-text available
This paper presents a full-fledged spoken language translation system developed at IRST during the TC-STAR project. The system integrates automatic speech recognition with machine translation through the use of confusion networks, which permit to represent a huge number of transcription hypotheses generated by the speech recognizer. Confusion netwo...
Conference Paper
Full-text available
We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools fo...
Conference Paper
This paper describes advances in the use of confusion networks as interface between automatic speech recognition and machine translation. In particular, it presents an implementation of a confusion network decoder which significantly improves both in efficiency and performance previous work along this direction. The confusion network decoder result...
Article
State of the art in statistical machine translation is currently represented by phrase-based models, which typically incorporate a large number of probabilities of phrase-pairs and word n-grams. In this work, we investigate data compression methods for efficiently encoding n-gram and phrase-pair probabilities, that are usually encoded in 32-bit...
Conference Paper
Full-text available
This paper describes a multi-lingual phrase-based Statistical Machine Translation system accessible by means of a Web page. The user can issue translation requests from Arabic, Chinese or Spanish into English. The same phrase-based statistical technology is employed to realize the three supported language-pairs. New language-pairs can be easi...
Article
Full-text available
This paper reports on the participation of ITC-irst in the 2006 Spoken Language Translation Evaluation Campaign organized by the TC-STAR project. ITC-irst submitted runs for all translation directions, namely Spanish-to-English, English-to-Spanish and Chinese-to-English, and types of input, that is final text edition, human verbatim transcriptions...
Article
This article addresses the development of statistical models for phrase-based machine translation (MT) which extend a popular word-alignment model proposed by IBM in the early 90s. A novel decoding algorithm is directly derived from the optimization criterion which defines the statistical MT approach. Efficiency in decoding is achieved by applying...
Conference Paper
A novel approach to spoken language translation is proposed, which more tightly integrates automatic speech recognition (ASR) and statistical machine translation (SMT). SMT is directly applied on an approximation of the word graph produced by the ASR system, namely a confusion network. The decoding algorithm extends a conventional phrase-based deco...
Article
This paper investigates the problem of updating over time the statistical language model (LM) of an Italian broadcast news transcription system. Statistical adaptation methods are proposed which try to cope with the complex dynamics of news by exploiting newswire texts daily available on the Internet. In particular, contemporary news reports are us...
Conference Paper
Full-text available
This paper summarizes the Cross-Language Spoken Document Retrieval (CL-SDR) track held at CLEF 2004. The CL-SDR task at CLEF 2004 was again based on the TREC-8 and TREC-9 SDR tasks. This year the CL-SDR task was extended to explore the unknown story boundaries condition introduced at TREC. The paper reports results from the participants showing tha...
Article
This paper reports on the participation of ITC-irst in the Italian monolingual retrieval track and in the bilingual English-Italian track of the Cross Language Evaluation Forum (CLEF) 2002. A cross-language information retrieval system is proposed which integrates retrieval and translation scores over the set of N-best translations of the source...
Article
This paper presents preliminary experiments on cross-language spoken document retrieval (SDR) carried out on a benchmark assembled at ITC-irst. The benchmark is based on resources used in the last two spoken document retrieval tracks at the TREC conference, which are available on the Internet. They include automatic transcripts of American English b...
Article
This work reviews information retrieval systems developed at ITC-irst which were evaluated through several tracks of CLEF, during the last three years. The presentation tries to follow the progress made over time in developing new statistical models first for monolingual information retrieval, then for cross-language information retrieval. Besides...
Conference Paper
This paper summarises the results of ITC-irst in the Cross-Language Spoken Document Retrieval track at the Cross Language Evaluation Forum 2003. The target collection consisted of automatic transcriptions of American Broadcast News manually segmented into stories. Topics consisted of 50 short queries in English, for which human-made translations in...
Conference Paper
This paper reports on the participation of ITC-irst in the Cross Language Evaluation Forum 2003; in particular, in the monolingual, bilingual, small multilingual, and spoken document retrieval tracks. Considered languages were English, French, German, Italian, and Spanish. With respect to our CLEF 2002 system, the statistical models for bilingual...
Conference Paper
Full-text available
The rapid growth of the Information Society is increasing the demand for technologies enabling access of multimedia data by content. An interesting example is given by the broadcasting companies and the content providers, which require today effective technologies to support the management and access of their huge audiovisual archives, both for int...
Article
This paper presents the development of a Named Entity (NE) recognition system for the Italian broadcast news domain. A statistical model is introduced based on a trigram language model defined on words and NE classes. The estimation of the NE model is carried out with a very little list of 2,360 manually tagged NEs and a large untagged newspa...
Article
This paper investigates the problem of dynamically updating the language model (LM) of a broadcast news speech recognition system, in order to cope with language and topic changes, typical of the news domain. Statistical adaptation methods are proposed that exploit written news sources which are daily available on the Internet, i.e. newswires and n...
Conference Paper
This paper reports on the participation of ITC-irst in the Cross Language Evaluation Forum (CLEF) of 2001. ITC-irst took part in two tracks: the monolingual retrieval task, and the bilingual retrieval task. In both cases, Italian was chosen as the query language, while English was chosen as the document language of the bilingual task. The empl...
Article
This paper reports on experiments of porting the ITC-irst Italian broadcast news recognition system to two spontaneous dialogue domains. Porting was investigated by applying state-of-the-art adaptation methods on acoustic and language models, and by evaluating the trade-off between performance and required amount of task specific annotated data. Th...
Conference Paper
This paper presents preliminary experiments on cross-language spoken document retrieval (SDR) carried out on a benchmark assembled at ITC-irst. The benchmark is based on resources used in the last two spoken document retrieval tracks at the TREC conference, which are available on the Internet. They include automatic transcripts of American English...
Conference Paper
This paper reports on the participation of ITC-irst in the Italian monolingual retrieval track and in the bilingual English-Italian track of the Cross Language Evaluation Forum (CLEF) 2002. A cross-language information retrieval system is proposed which integrates retrieval and translation scores over the set of N-best translations of the sourc...
Conference Paper
This paper presents a novel statistical model for cross-language information retrieval. Given a written query in the source language, documents in the target language are ranked by integrating probabilities computed by two statistical models: a query-translation model, which generates most probable term-by-term translations of the query, and a quer...
Article
Full-text available
This paper reports on experiments of porting the ITC-irst Italian broadcast news recognition system to two spontaneous dialogue domains. The trade-off between performance and the required amount of task specific data was investigated. Porting was experimented by applying supervised adaptation methods on acoustic and language models. By using two ho...
Article
This paper presents work on document retrieval for Italian carried out at ITC-irst. Two different approaches to information retrieval were investigated, one based on the Okapi weighting formula and one based on a statistical model. Development experiments were carried out using the Italian sample of the TREC-8 CLIR track. Performance evaluation was...
Conference Paper
This paper presents work on document retrieval for Italian carried out at ITC-irst. Two different approaches to information retrieval were investigated, one based on the Okapi weighting formula and one based on a statistical model. Development experiments were carried out using the Italian sample of the TREC-8 CLIR track. Performance evaluation was...
Article
Full-text available
This paper focuses on the problem of language model adaptation in the context of Chinese-English cross-lingual dialogs, as set-up by the challenge task of the IWSLT 2009 Evaluation Campaign. Mixtures of n-gram language models are investigated, which are obtained by clustering bilingual training data according to different available human annotat...
Article
This paper describes the SMT we built during the 2006 JHU Summer Workshop for the IWSLT 2006 evaluation. Our effort focuses on two parts of the speech translation problem: 1) efficient decoding of word lattices and 2) novel applications of factored translation models to IWSLT-specific problems. In this paper, we present results from the open-tra...
Article
Full-text available
This paper reports on the participation of FBK at the IWSLT 2009 Evaluation. This year we worked on the Arabic-English and Turkish-English BTEC tasks with a special effort on linguistic preprocessing techniques involving morphological segmentation. In addition, we investigated the adaptation problem in the development of systems for the Chinese-E...
Article
Full-text available
This paper presents a look inside the ITC-irst large-vocabulary SMT system developed for the NIST 2005 Chinese-to-English evaluation campaign. Experiments on official NIST test sets provide a thorough overview of the performance of the system, supplying information on how single components contribute to the global performance. The presented syste...
Article
Full-text available
Focus of this paper is the system for statistical machine translation developed at ITC-irst. It has been employed in the evaluation campaign of the International Workshop on Spoken Language Translation 2004 in all the three data set conditions of the Chinese-English track. Both the statistical model underlying the system and the system architect...
Article
Full-text available
This paper reports on the participation of ITC-irst in the evaluation campaign of the International Workshop on Spoken Language Translation 2006. Our two-pass system is the evolution of the one we employed for the 2005 campaign: in the first pass, an N-best list of translations is generated for each source sentence by means of a beam-search decod...
Article
1. SUMMARY Machine translation has always been considered one of the most fascinating challenges in computer science. After decades of mixed results, a marked improvement in performance has been achieved through the use of statistical methods. This success has opened up new application prospects, with a consequent increase in inves...
Article
Full-text available
This paper describes the statistical machine translation system developed at ITC-irst for the evaluation campaign of the International Workshop on Spoken Language Translation 2005. The system exploits two search passes: the first pass is performed by a beam-search decoder which generates an n-best list of translations, the second by a simple...
Article
Full-text available
In SMT, the instability of MERT, the commonly used optimizer, is an acknowledged problem. This paper presents two methods for smoothing the MERT instability. Both exploit a set of different realizations of the same system obtained by running the optimization stage multiple times. One method averages the sets of different optimal weights; the othe...