Cyril Goutte

National Research Council Canada | NRC · Digital Technologies

PhD

About

123 Publications
31,625 Reads
4,611 Citations
Introduction
Senior research officer at National Research Council Canada, working in the Multilingual Text Processing team.
Additional affiliations
May 2006 - present
National Research Council Canada
Position
  • Research Officer
October 2001 - March 2006
Xerox Research Centre Europe
Position
  • Senior Researcher

Publications (123)
Article
Research and development activities are regarded as one of the most influential factors in the future of a country. Large investments in research can yield a tremendous outcome in terms of a country’s overall wealth and strength. However, public financial resources of countries are often limited, which calls for a wise and targeted investment. Scien...
Book
This book constitutes the refereed proceedings of the 33rd Canadian Conference on Artificial Intelligence, Canadian AI 2020, which was planned to take place in Ottawa, ON, Canada. Due to the COVID-19 pandemic, however, it was held virtually during May 13–15, 2020. The 31 regular papers and 24 short papers presented together with 4 Graduate Student...
Chapter
Full-text available
Parallel corpora are the basic resource for many multilingual natural language processing models. Recent advances in, e.g. neural machine translation have shown that the quality of the alignment in the corpus has a crucial impact on the quality of the resulting model, renewing interest in filtering automatically aligned corpora to increase their qu...
Chapter
Learning curves are a crucial tool to accurately measure learners' skills and give meaningful feedback in intelligent tutoring systems. Here we discuss various ways of building learning curves from empirical data for the Additive Factor model (AFM) and highlight their limitations. We focus on the impact of student attrition, a.k.a. attrition bias. W...
Conference Paper
Full-text available
Competency-based education (CBE) is seen by many as a way to optimize learning on cost, efficiency and flexibility. However, defining the required competencies, assigning them to specific courses and building the assessments evaluating students' proficiency can be tedious. More precisely, making sure that the assessments evaluate what they are supp...
Technical Report
Full-text available
The Additive Factor Model (AFM), a cognitive diagnostic model, appeared roughly a decade ago; it was subsequently implemented by PSLC-Datashop and has been used successfully by researchers since then. While powerful, this model is not always simple for a novice user to grasp. This paper aims at addressing this concern, by sharing our understanding of th...
Article
Full-text available
We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate...
Conference Paper
Full-text available
We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate...
Conference Paper
Full-text available
In this paper, we present, discuss and summarize different research works we carried out toward the exploitation of the Web of data for learning and training purposes (Web of learning data). For several years now, we have conducted efforts to explore this main objective through two complementary directions. The first direction is the scalability and...
Conference Paper
Full-text available
Matrix factorization techniques are widely used to build collaborative filtering recommender systems. These recommenders aim at discovering latent variables or attributes that are supposed to explain and ultimately predict the interest of users. In cognitive modeling, skills and competencies are considered as key latent attributes to understand and...
Conference Paper
Full-text available
Recent years have seen significant advances in automatic identification of the Q-matrix necessary for cognitive diagnostic assessment. As data-driven approaches are introduced to identify latent knowledge components (KC) based on observed student performance, it becomes crucial to describe and interpret these latent KCs. We address the problem of n...
Conference Paper
Full-text available
A key aspect of cognitive diagnostic models is the specification of the Q-matrix associating the items and some underlying student attributes. In many data-driven approaches, test items are mapped to the underlying, latent knowledge components (KC) based on observed student performance, and with little or no input from human experts. As a result, t...
Article
In many applications, observations are available with different views. This is, for example, the case with image-text classification, multilingual document classification or document classification on the web. In addition, unlabeled multiview examples can be easily acquired, but assigning labels to these examples is usually a time consuming task. W...
Conference Paper
Full-text available
We describe the system built by the National Research Council Canada for the "Discriminating between similar languages" (DSL) shared task. Our system uses various statistical classifiers and makes predictions based on a two-stage process: we first predict the language group, then discriminate between languages or variants within the group. Langua...
Conference Paper
Full-text available
We describe the system entered by the National Research Council Canada in the SemEval-2014 L2 writing assistant task. Our system relies on a standard Phrase-Based Statistical Machine Translation trained on generic, publicly available data. Translations are produced by taking the already translated part of the sentence as fixed context. We show that...
Conference Paper
Full-text available
As larger and more diverse parallel texts become available, how can we leverage heterogeneous data to train robust machine translation systems that achieve good translation quality on various test domains? This challenge has been addressed so far by repurposing techniques developed for domain adaptation, such as linear mixture models which combine...
Patent
Full-text available
This application is related to a means and a method for facilitating the use of translation memories by aligning words of an input source language sentence with the corresponding translated words in the target language sentence. More specifically, this invention relates to such a means and method where there is an enhanced translation memory comprising...
Conference Paper
Full-text available
We describe the submissions made by the National Research Council Canada to the Native Language Identification (NLI) shared task. Our submissions rely on a Support Vector Machine classifier, various feature spaces using a variety of lexical, spelling, and syntactic features, and on a simple model combination strategy relying on a majority vote betwe...
Patent
The present document describes a method and a system for generating classifiers from multilingual corpora including subsets of content-equivalent documents written in different languages. When the documents are translations of each other, their classifications must be substantially the same. Embodiments of the invention utilize this similarity in o...
Article
The patent describes a method and a system for generating classifiers from multilingual corpora including subsets of content-equivalent documents written in different languages. When the documents are translations of each other, their classifications must be substantially the same. Embodiments of the invention utilize this similarity in order to en...
Conference Paper
Full-text available
When parallel or comparable corpora are harvested from the web, there is typically a trade-off between the size and quality of the data. In order to improve quality, corpus collection efforts often attempt to fix or remove misaligned sentence pairs. But, at the same time, Statistical Machine Translation (SMT) systems are widely assumed to be relati...
Article
Full-text available
Multiview learning has been shown to be a natural and efficient framework for supervised or semi-supervised learning of multilingual document categorizers. The state-of-the-art co-regularization approach relies on alternate minimizations of a combination of language-specific categorization errors and a disagreement between the outputs of the monoli...
Conference Paper
Translation is a key capability to access relevant information expressed in various languages on social media. Unfortunately, systematically translating all content far exceeds the capacity of most organizations. Computer-aided translation (CAT) tools can significantly increase the productivity of translators, but cannot ultimately cope with the o...
Conference Paper
Full-text available
We address the problem of learning to rank documents in a multilingual context, when reference ranking information is only partially available. We propose a multiview learning approach to this semi-supervised ranking task, where the translation of a document in a given language is considered as a view of the document. Although both multiview and se...
Article
In this paper, we address the problem of learning aspect models with partially labeled data for the task of document categorization. The motivation of this work is to take advantage of the amount of available unlabeled data together with the set of labeled examples to learn latent models whose structure and underlying hypotheses take more accuratel...
Patent
Full-text available
A probabilistic clustering system is defined at least in part by probabilistic model parameters indicative of word counts, ratios, or frequencies characterizing classes of the clustering system. An association of one or more documents in the probabilistic clustering system is changed from one or more source classes to one or more destination classe...
Conference Paper
Full-text available
In this paper, we address the problem of learning aspect models with partially labeled examples. We propose a method which benefits from both semi-supervised and active learning frameworks. In particular, we combine a semi-supervised extension of the PLSA algorithm [11] with two active learning techniques. We perform experiments over four different...
Article
Full-text available
We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multiview learning, co-regularization approach, in which we consider each language as a separate source, and minimize a joint loss that combines monolingual classification losses in each language while ensuring consistency of the categorizat...
Conference Paper
Full-text available
We propose a new multi-view clustering method which uses clustering results obtained on each view as a voting pattern in order to construct a new set of multi-view clusters. Our experiments on a multilingual corpus of documents show that performance increases significantly over simple concatenation and another multi-view clustering technique.
Conference Paper
Full-text available
We investigate the problem of learning document classifiers in a multilingual setting, from collections where labels are only partially available. We address this problem in the framework of multiview learning, where different languages correspond to different views of the same document, combined with semi-supervised learning in order to benefit fr...
Article
Full-text available
The well-known formal equivalence between non-negative matrix factorization and multinomial mixture models extends in a fairly straightforward manner to tensors. Among interesting practical implications of this equivalence, this suggests some principled ways to choose the number of factors in the decomposition. We discuss and illustrate two methods...
Article
Full-text available
We describe a new approach to SMT adaptation that weights out-of-domain phrase pairs according to their relevance to the target domain, determined by both how similar to it they appear to be, and whether they belong to general language or not. This extends previous work on discriminative weighting by using a finer granularity, focusing on the prope...
Patent
Full-text available
In categorizing an object respective to at least two categorization dimensions each defined by a plurality of categories, a probability value indicative of the object is determined for each category of each categorization dimension. A categorization label for the object is selected respective to each categorization dimension based on (i) the determ...
Conference Paper
Full-text available
Detection and tracking of temporal data is an important task in multiple applications. In this paper we study temporal text mining methods for music information retrieval. We compare two ways of detecting the temporal latent semantics of a corpus extracted from Wikipedia, using a stepwise probabilistic latent semantic analysis (PLSA) approach and a...
Conference Paper
Full-text available
We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections...
Article
Full-text available
We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections w...
Article
Full-text available
We investigate the possibility of automatically detecting whether a piece of text is an original or a translation. On a large parallel English-French corpus where reference information is available, we find that this is possible with around 90% accuracy. We further study the implication this has on Machine Translation performance. After separating...
Conference Paper
Full-text available
This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its labeling errors. These probabilities are then taken into account in the es...
Conference Paper
Full-text available
This paper presents a boosting-based algorithm for learning a bipartite ranking function (BRF) with partially labeled data. Until now, different attempts have been made to build a BRF in a transductive setting, in which the test points are given to the methods in advance as unlabeled data. The proposed approach is a semi-supervised inductive ranking a...
Conference Paper
Full-text available
We present some on-going research on phrase-based Statistical Machine Translation using flexible phrases that may contain gaps of variable lengths. This allows us to naturally handle various linguistic phenomena such as negations or separable particles. We integrate this within the standard Maximum Entropy model using some dedicated feature functi...
Article
Full-text available
Databases and data warehouses contain an overwhelming volume of information that users must wade through in order to extract valuable and actionable knowledge to support the decision-making process. This contribution addresses the problem of automatically analyzing large multidimensional tables to get a concise representation of data, identify patt...
Article
Full-text available
We describe NRC's submission to the Anomaly Detection/Text Mining competition organised at the Text Mining Workshop 2007. This submission relies on a straightforward implementation of the probabilistic categoriser described in (Gaussier et al., ECIR'02). This categoriser is adapted to handle multiple labelling and a piecewise-linear confidence esti...
Article
Full-text available
We propose to use a statistical phrase-based machine translation system in a post-editing task: the system takes as input raw machine translation output (from a commercial rule-based MT system), and produces post-edited target-language text. We report on experiments that were performed on data collected in precisely such a setting: pairs of raw MT...
Article
Full-text available
We describe the National Research Council's entry in the Anomaly Detection/Text Mining competition organised at the Text Mining Workshop 2007. This entry relies on a straightforward implementation of a probabilistic categorizer described earlier [4]. This categoriser is adapted to handle multiple labelling and a piecewise-linear confidence estimati...
Article
Full-text available
It is generally acknowledged that the performance of rule-based machine translation (RBMT) systems can be greatly improved through domain-specific system adaptation. To that end, RBMT users often choose to invest significant resources into the development of ad hoc MT dictionaries. In this paper, we demonstrate that comparable customization effects...
Conference Paper
Full-text available
Textual Entailment has recently been proposed as an application-independent task of recognising whether the meaning of one text may be inferred from another. This is potentially a key task in many NLP applications. In this contribution, we investigate the use of various lexical entailment models in Information Retrieval, using the language mode...
Conference Paper
Full-text available
We explore the situation in which documents have to be categorized into more than one category system, a situation we refer to as multiple-view categorization. More particularly, we address the case where two different categorizers have already been built based on non-necessarily identical training sets, each one labeled using one category sys-...
Article
Full-text available
Any scientific endeavour must be evaluated in order to assess its correctness. In many applied sciences it is necessary to check that the theory adequately matches actual observations. In Machine Translation (MT), evaluation serves two purposes: relative evaluation allows us to check whether one MT technique is better than another, while absolute e...
Conference Paper
Full-text available
Music genre classification has been investigated using many different methods, but most of them build on probabilistic models of feature vectors x_r which only represent the short time segment with index r of the song. Here, three different co-occurrence models are proposed which instead consider the whole song as an integrated part of th...
Conference Paper
Full-text available
Non-negative Matrix Factorization (NMF, [5]) and Probabilistic Latent Semantic Analysis (PLSA, [4]) have been successfully applied to a number of text analysis tasks such as document clustering. Despite their different inspirations, both methods are instances of multinomial PCA [1]. We further explore this relationship and first show that PLSA solv...
Conference Paper
Full-text available
We address the problems of 1/ assessing the confidence of the standard point estimates, precision, recall and F-score, and 2/ comparing the results, in terms of precision, recall and F-score, obtained using two different methods. To do so, we use a probabilistic setting which allows us to obtain posterior distributions on these performance indicato...
Article
Bio-medical knowledge bases are valuable resources for the research community. Original scientific publications are the main source used to annotate them. Medical annotation in Swiss-Prot is specifically targeted at finding and extracting data about human genetic diseases and polymorphisms. Curators have to scan through hundreds of publications to...
Conference Paper
Full-text available
This paper presents a phrase-based statistical machine translation method, based on non-contiguous phrases, i.e. phrases with gaps. A method for producing such phrases from a word-aligned corpus is proposed. A statistical translation model is also presented that deals with such phrases, as well as a training method based on the maximization of transl...
Article
Full-text available
This paper presents a phrase-based statistical machine translation method, based on non-contiguous phrases, i.e. phrases with gaps. A method for producing such phrases from a word-aligned corpus is proposed. A statistical translation model is also presented that deals with such phrases, as well as a training method based on the maximization of tra...
Article
Full-text available
Annotating biomedical text for Named Entity Recognition (NER) is usually a tedious and expensive process, while unannotated data is freely available in large quantities. It therefore seems relevant to address biomedical NER using Machine Learning techniques that learn from a combination of labelled and unlabelled data. We consider two approaches: o...
Article
Full-text available
This paper presents the experimental results of our attempts to reduce the size of the parameter space in a word alignment algorithm. We use IBM Model 4 as a baseline. In order to reduce the parameter space, we pre-processed the training corpus using a word lemmatizer and a bilingual term extraction algorithm. Using these additional components, we obt...
Article
Motivation: Searching for relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to rerank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents.
Article
Full-text available
We propose a new hierarchical generative model for textual data, where words may be generated by topic specific distributions at any level in the hierarchy. This model is naturally well-suited to clustering documents in preset or automatically generated hierarchies, as well as categorising new documents in an existing hierarchy. Furthermore, we pre...
Article
We propose a new hierarchical generative model for textual data, where words may be generated by topic specific distributions at any level in the hierarchy. This model is naturally well-suited to clustering documents in preset or automatically generated hierarchies, as well as categorising new documents in an existing hierarchy. Training algorithms...
Article
Full-text available
We address the problem of categorising documents using kernel-based methods such as Support Vector Machines.
Article
Full-text available
this report describing the feature