
Didier Schwab
- Ph.D.
- Full Professor at Grenoble Alpes University
About
138 Publications
43,182 Reads
1,088 Citations
Introduction
Didier Schwab currently works at Université Grenoble Alpes (France), Laboratory of Informatics of Grenoble, Study Group for Machine Translation and Automated Processing of Languages and Speech.
Didier does research in artificial intelligence and natural language processing. His current research focuses on automatic and interactive clarification of texts, plagiarism detection, machine translation, and communication for people with disabilities.
His work involves many techniques and tools, including deep learning, Moses, Python, PyTorch, Java, JavaFX, R, ...
Current institution
Additional affiliations
April 2006 - August 2007
September 2001 - December 2005
September 2007 - present
Publications (138)
Natural language production requires both a grammar and a lexicon. In this article, we deal only with the latter, trying to enhance an existing electronic resource to allow for search via navigation in a huge associative network. Our primary focus is on the structure of the lexicon (i.e. its indexing scheme). This issue has often been overlooked, y...
In this article, we investigate the effects on the quality of the disambiguation of exploiting multilingual features with a similarity-based WSD system based on an Ant Colony Algorithm. We considered features from one, two, three or four languages in order to quantify the improvement brought by using features from additional languages. Using Babel...
In this article, we present the notions of local and global algorithms, for the word sense disambiguation of texts. A local algorithm allows to calculate the semantic similarity between two lexical objects. Global algorithms propagate local measures at the upper level. We use this notion to compare an ant colony algorithm to other methods from the...
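The local/global distinction described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy sense inventory `SENSES`, the Lesk-style word-overlap measure, and the exhaustive search are all assumptions chosen for illustration. The local algorithm scores a pair of senses, and the global algorithm searches for the sense assignment that maximizes the sum of local scores over the text.

```python
from itertools import product

# Hypothetical toy sense inventory: word -> {sense id: gloss}.
SENSES = {
    "bank": {
        "bank#finance": "institution that accepts money deposits and grants loans",
        "bank#river": "sloping land beside a river or stream",
    },
    "deposit": {
        "deposit#money": "money placed in an institution account",
        "deposit#sediment": "matter laid down by flowing water",
    },
}


def local_similarity(gloss_a, gloss_b):
    """Local algorithm: Lesk-style overlap, the number of shared gloss words."""
    return len(set(gloss_a.split()) & set(gloss_b.split()))


def global_score(assignment):
    """Score a full sense assignment by summing local similarities over all pairs."""
    glosses = [SENSES[word][sense] for word, sense in assignment]
    return sum(
        local_similarity(glosses[i], glosses[j])
        for i in range(len(glosses))
        for j in range(i + 1, len(glosses))
    )


def exhaustive_global(words):
    """Global algorithm (brute force): evaluate every combination of senses."""
    choices = [[(w, s) for s in SENSES[w]] for w in words]
    return max(product(*choices), key=global_score)


best = exhaustive_global(["bank", "deposit"])
print(best, global_score(best))
```

The exhaustive search grows exponentially with the number of words, which is why stochastic global algorithms such as ant colony, genetic, or simulated annealing methods are used in its place on real texts.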
Objective
This study aims to investigate how cognitive impairment and social presence influence goal attainment in an ecological virtual environment. It also examines the role of interactive features in improving computer-assisted cognitive training for older adults, both with and without mild cognitive impairment (MCI).
Materials and Methods
A vir...
Background: The growing field of artificial intelligence has opened new opportunities in computerized cognitive-training design (CCT), closer to everyday situations, specifically in the prevention of neurocognitive disorders. The aim of our exploratory study was to characterize the importance of social and cognitive facilitators in avatar-mediated...
The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Different methods have been proposed for reducing this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show...
Self-supervised learning has opened up promising prospects in many fields such as computer vision, natural language processing, and speech processing. Models pre-trained on large amounts of unlabeled data can be fine-tuned on small, manually transcribed datasets. These...
Self-supervised learning has brought remarkable improvements in many fields such as computer vision and language and speech processing, by exploiting large amounts of unlabeled data. In the specific context of speech, however, and despite promising results, there is a lack...
Diabetes is characterized by an abnormally enhanced concentration of glucose in the blood serum. It has a damaging impact on several noble body systems. Today, the concept of unbalanced learning has developed considerably in the domain of medical diagnosis, which greatly reduces the generation of erroneous classification results. The paper takes a...
With the increase in computation power and the development of new state-of-the-art deep learning algorithms, appearance-based gaze estimation is becoming more and more popular. It is believed to work well with curated laboratory data sets, however it faces several challenges when deployed in real world scenario. One such challenge is to estimate th...
Adapter modules were recently introduced as an efficient alternative to fine-tuning in NLP. Adapter tuning consists in freezing pretrained parameters of a model and injecting lightweight modules between layers, resulting in the addition of only a small number of task-specific trainable parameters. While adapter tuning was investigated for multiling...
Recent studies on the analysis of the multilingual representations focus on identifying whether there is an emergence of language-independent representations, or whether a multilingual model partitions its weights among different languages. While most of such work has been conducted in a "black-box" manner, this paper aims to analyze individual com...
Human beatboxing is a vocal art making use of speech organs to produce vocal drum sounds and imitate musical instruments. Beatbox sound classification is a current challenge that can be used for automatic database annotation and music-information retrieval. In this study, a large-vocabulary human-beatbox sound recognition system was developed with...
Medical text categorization is a valuable area of text classification due to the massive growth in the amount of medical data, most of which is unstructured. Reading and understanding the information contained in millions of medical documents is a time-consuming process. Automatic text classification aims to automatically classify text documents in...
Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing. Recent works also investigated SSL from speech. They were notably successful to improve performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce depe...
Some users try to post false reviews to promote or to devalue other’s products and services. This action is known as deceptive opinions spam, where spammers try to gain or to profit from posting untruthful reviews. Therefore, we conducted this work to develop and to implement new semantic features to improve the Arabic deception detection. These fe...
We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in...
Document indexing is a key component for efficient information retrieval (IR). After preprocessing steps such as stemming and stop-word removal, document indexes usually store term-frequencies (tf). Along with tf (that only reflects the importance of a term in a document), traditional IR models use term discrimination values (TDVs) such as inverse...
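As a minimal sketch of the quantities this abstract mentions, the following computes term frequencies and one common term discrimination value, inverse document frequency. The abstract is truncated at that point, so reading it as inverse document frequency is an assumption, as are the toy documents and function names; preprocessing such as stemming and stop-word removal is omitted.

```python
import math
from collections import Counter

# Hypothetical toy collection; real indexes are built after stemming
# and stop-word removal, which this sketch skips.
docs = [
    "ant colony algorithm for word sense disambiguation",
    "neural ranking for information retrieval",
    "word embeddings for information retrieval",
]

# Document index: one term-frequency table per document.
index = [Counter(doc.split()) for doc in docs]
# Vocabulary of each document, used to count document frequencies.
vocabularies = [set(doc.split()) for doc in docs]


def inverse_document_frequency(term):
    """log(N / df): a term found in every document gets a value of 0."""
    df = sum(1 for vocab in vocabularies if term in vocab)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc_id):
    """Weight of a term in a document: tf (local) times idf (global)."""
    return index[doc_id][term] * inverse_document_frequency(term)


print(tf_idf("ant", 0), tf_idf("for", 0))
```

Here "for" occurs in every document and therefore has no discriminative value, while "ant" occurs in a single document and receives the highest weight.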
Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualiz...
Human beatboxing is a vocal art making use of speech organs to produce percussive sounds and imitate musical instruments. Beatbox sounds classification is a current challenge. We propose a beatbox sounds recognition system with an adaptation of the Kaldi toolbox, widely used for automatic speech recognition (ASR). Our corpus is composed of isolated...
Over the past years, deep learning methods allowed for new state-of-the-art results in ad-hoc information retrieval. However such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 anno...
Text mining is one of the main and typical tasks of machine learning (ML). Authorship identification (AI) is a standard research subject in text mining and natural language processing (NLP) that has undergone a remarkable evolution these last years. We need to identify/determine the actual author of anonymous texts given on the basis...
In this paper, we present our submission for the English to Czech Text Translation Task of IWSLT 2019. Our system aims to study how pre-trained language models, used as input embeddings, can improve a specialized machine translation system trained on few data. Therefore, we implemented a Transformer-based encoder-decoder neural system which is able...
In this article, we tackle the issue of the limited quantity of manually sense annotated corpora for the task of word sense disambiguation, by exploiting the semantic relationships between senses such as synonymy, hypernymy and hyponymy, in order to compress the sense vocabulary of Princeton WordNet, and thus reduce the number of different sense ta...
In Word Sense Disambiguation (WSD), the predominant approach generally involves a supervised system trained on sense annotated corpora. The limited quantity of such corpora however restricts the coverage and the performance of these systems. In this article, we propose a new method that solves these issues by taking advantage of the knowledge prese...
The goal of our research is to develop an automatic pictogram generation tool from speech to help the social circle of users of Alternative and Augmentative Communication to communicate among themselves. We describe here the issues raised by such a tool, then detail our development methodology, and finally describe our evaluation protocol.
In order to develop and enhance an augmentative and alternative communication (AAC), gaze is often considered as one of the most natural way and one of the easiest to set up in order to support individuals with multiple disabilities to interact with their environment. For children who start naturally from scratch, who have in addition such difficul...
Measuring the amount of shared information between two documents is a key to address a number of Natural Language Processing (NLP) challenges such as Information Retrieval (IR), Semantic Textual Similarity (STS), Sentiment Analysis (SA) and Plagiarism Detection (PD). In this paper, we report a plagiarism detection system based on two layers of asse...
Machine translation (MT) is the process of translating text written in a source language into text in a target language. In this article, we present our English-Arabic statistical machine translation system. First, we present the general process for setting up a statistical machine translation system, then we describe the tools as well as the diffe...
In this paper, we explore the usage of Word Embedding semantic resources for the Information Retrieval (IR) task. These embeddings, produced by a shallow neural network, have been shown to capture semantic similarities between words (Mikolov et al., 2013). Hence, our goal is to enhance IR Language Models by addressing the term mismatch problem. To do so, w...
Semantic Textual Similarity (STS) is an important component in many Natural Language Processing (NLP) applications, and plays an important role in diverse areas such as information retrieval, machine translation, information extraction and plagiarism detection. In this paper we propose two word embedding-based approaches devoted to measuring the se...
In this paper, an Algerian multilingual recommender system based on sentiment analysis is proposed for the goal of helping Algerian users to decide on products, restaurants, movies and other needs. The customer's review plays an important role in deciding as a customer prefers to get the opinion of other customers by observing their opinion through...
We present our submitted systems for Semantic Textual Similarity (STS) Track 4 at SemEval-2017. Given a pair of Spanish-English sentences, each system must estimate their semantic similarity by a score between 0 and 5. In our submission, we use syntax-based, dictionary-based, context-based, and MT-based methods. We also combine these methods in uns...
This paper is a deep investigation of cross-language plagiarism detection methods on a new recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs o...
This article describes our proposed system named LIM-LIG. This system is designed for SemEval 2017 Task 1: Semantic Textual Similarity (Track 1). LIM-LIG proposes an innovative enhancement to word embedding-based model devoted to measure the semantic similarity in Arabic sentences. The main idea is to exploit the word representations as vectors in a...
Measuring semantic similarity underlies many applications. It plays an important role in various fields such as information retrieval, machine translation, information extraction, and plagiarism detection. In this article, we propose a system based on word embedding. This system is...
This article compares four probabilistic algorithms (global algorithms) for Word Sense Disambiguation (WSD) in terms of the number of scorer calls (local algorithm) and the F1 score as determined by a gold-standard scorer. Two algorithms come from the state of the art, a Simulated Annealing Algorithm (SAA) and a Genetic Algorithm (GA) as well as...
This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verif...
Semantic textual similarity is the basis of countless applications and plays an important role in diverse areas, such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system devoted to calculate the semantic similarity in Arabic sentences. The ma...
ABSTRACT In this article, we present a method for improving the machine translation of an annotated corpus and porting its annotations from English to a target language. The aim is to improve the method of (Nasiruddin et al., 2015), which produced many untranslated segments, duplications, and disordered output. We propose a process of...
In this paper we describe our effort to create a dataset for the evaluation of cross-language textual similarity detection. We present pre-existing corpora and their limits and we explain the various gathered resources to overcome these limits and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignment fo...
The ability to identify the intended meanings of words in context is a central research topic in natural language. Many solutions exist for Word Sense Disambiguation (WSD) in different languages, such as English or French, but research on Arabic WSD remains limited. The main bottleneck is the lack of resources. In this paper, we show that it is pos...
We present a method for quickly building a word sense disambiguation (WSD) system for an under-resourced language L, provided that a statistical machine translation (SMT) system is available from a language rich in sense-annotated corpora (here English) into L. It is indeed easier to obtain the resources needed for the cre...
Michael Zock’s work has focussed these last years on finding the appropriate and most adequate word when writing or speaking. The semantic relatedness between words can play an important role in this context. Previous studies have pointed out three kinds of approaches for their evaluation: a theoretical examination of the desirability (or not) of c...
Abstract. In natural language processing, lexical-semantic resources have been included in a large number of applications. Creating such resources manually is time-consuming, and their limited coverage does not always meet the needs of applications. This problem is even more significant for the l...
This article presents an original approach for decoding graphs produced by an Automatic Speech Recognition (ASR) system using a constructive algorithm: ant colonies. Applying a higher-order language model to a graph requires extending the graph in order to build histories correspond...
The DBnary project aims at providing high quality Lexical Linked Data extracted from different Wiktionary language editions. Data from 10 different languages is currently extracted for a total of over 3.16M translation links that connect lexical entries from the 10 extracted languages, to entries in more than one thousand languages. In Wiktionary,...
Lexical resources (dictionaries, databases, thesauri, etc.) gather knowledge about words, their senses, and their usages. While for centuries they depended on printing and the textual format, today there is a wide variety of tools and resources accessible in electronic formats of vari...
This article presents the GETALP system for the participation to SemEval-2013 Task 12, based on an adaptation of the Lesk measure propagated through an Ant Colony Algorithm, that yielded good results on the corpus of Semeval 2007 Task 7 (WordNet 2.1) as well as the trial data for Task 12 SemEval 2013 (BabelNet 1.0). We approach the parameter est...
Word Sense Disambiguation (WSD) is a difficult problem for NLP. Algorithms that aim to solve the problem focus on the quality of the disambiguation alone and require considerable computational time. In this article we focus on the study of three unsupervised stochastic algorithms for WSD: a Genetic Algorithm (GA) and a Simulated Annealing algorithm...
Brute-force word sense disambiguation (WSD) algorithms based on semantic relatedness are really time consuming. We study how to perform WSD faster and better on the span of a text. Several stochastic algorithms can be used to perform Global WSD. We focus here on an Ant Colony Algorithm and compare it to two other methods (Genetic and Simulated Anne...
In this article we propose a method based on simulated annealing for the parameter estimation of probabilistic algorithms, where the solution provided by the algorithm can vary from execution to execution. Such algorithms are often very interesting to solve complex combinatorial problems, yet they involve many parameters that can be difficult to es...
Since September 2007, a large scale lexical network for French is under construction through methods based on some kind of popular consensus by means of games (JeuxDeMots project). Human intervention can be considered as marginal. It is limited to corrections, adjustments and validation of the senses of terms, which amounts to less than 0.5% of th...
all these stereotypes from another age. Thus we have heard about the difficulty, not to say the impossibility, of setting up a simple transdisciplinary thesis project. No one can ignore the lack of coherence between this transdisciplinarity that is taught to young people, in theory, without giving them the opportunity to put it into practice...
Questions
Question (1)
Let's consider that for a given problem, a configuration is a vector of several hundred dimensions in the problem space. A score can be assigned to each configuration. An optimal solution to the problem is a configuration with a score of 1. The search space is manifestly too large to explore exhaustively in search of an optimal solution.
In this particular instance, I am not trying to find a solution; I have at my disposal various heuristics (simulated annealing, ant colonies, etc.) that serve me well to that end. However, I want to understand and characterize the search space better. So far, I have attempted to generate some configurations randomly. How could I quantify how representative the sampling is? Do you think I am using the right approach? And finally, how would you suggest that I proceed further?
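One possible way to probe the question, sketched here under stated assumptions rather than offered as a definitive answer: draw random configurations, record their scores, and use a percentile bootstrap to see how stable a summary statistic of the score distribution is at the current sample size. The `score` function, the dimension `DIM`, and the uniform sampling scheme are hypothetical stand-ins for the real problem.

```python
import random
import statistics

DIM = 300  # hypothetical stand-in for "several hundred dimensions"


def score(config):
    """Hypothetical score in [0, 1]; replace with the real scorer."""
    return sum(config) / len(config)


def random_config(rng):
    """Draw one configuration uniformly at random (an assumption)."""
    return [rng.random() for _ in range(DIM)]


def bootstrap_interval(scores, stat, rng, rounds=200, alpha=0.05):
    """Percentile bootstrap interval for a statistic of the sampled scores."""
    estimates = sorted(
        stat(rng.choices(scores, k=len(scores))) for _ in range(rounds)
    )
    lower = estimates[int(rounds * alpha / 2)]
    upper = estimates[int(rounds * (1 - alpha / 2)) - 1]
    return lower, upper


rng = random.Random(0)  # fixed seed so the run is repeatable
scores = [score(random_config(rng)) for _ in range(500)]
low, high = bootstrap_interval(scores, statistics.mean, rng)
print(f"mean score in [{low:.4f}, {high:.4f}] from {len(scores)} samples")
```

A narrow interval only indicates that the estimated statistic is stable under resampling. It says nothing about rare high-scoring regions that uniform sampling may never reach, so it measures one aspect of representativeness, not coverage of the search space.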