About
53
Publications
9,428
Reads
532
Citations
Publications (53)
The use of synthetic speech as data augmentation is gaining increasing popularity in fields such as automatic speech recognition and speech classification tasks. Despite novel text-to-speech systems with voice-cloning capabilities, which allow the use of a larger number of voices based on short audio segments, it is known that these systems tend t...
Detey, S., Fontan, L. & Ferrané, I. (2022). From Verbo-Tonal Method teachers’ training to Computer-Assisted Pronunciation Training tools: insight from L3 pronunciation studies and automatic speech processing technology among Japanese learners of French. Speech Research 2022. Zagreb: Zagreb University (Croatia), 8-10/12/2022.
Abstract:
In the field...
De Fino, V., Fontan, L., Detey, S., Ferrané, I. & Pinquier, J. (2022). Corpus de parole non-native et prédiction automatique du niveau de performance en expression orale : application à CLIJAF. (I)PFC2022. Paris : CUI (France), 1-2/12/2022.
Meetings are a common activity in professional contexts, and it remains challenging to endow vocal assistants with advanced functionalities to facilitate meeting management. In this context, a task like active speaker detection can provide useful insights to model interaction between meeting participants. Detection of the active speaker can be perf...
De Fino, V., Fontan, L., Pinquier, J., Ferrané, I. & Detey, S. (2022). Prediction of L2 speech proficiency based on multi-level linguistic features. Proceedings of Interspeech 2022 (Incheon, Korea), 4043-4047 (21/09/2022).
This study investigates the possibility of using automatic, multi-level features for the prediction of L2 speech proficiency. Th...
De Fino, V., Fontan, L., Pinquier, J., Barcat, C., Ferrané, I. & Detey, S. (2022). Mesures automatiques de parole non-native : exploration pilote d’un corpus d’apprenants japonais de français et différentiation de niveaux. Actes des JEP 2022 (Ile de Noirmoutier, France), 693-702
This study focuses on the automatic evaluation of the speech of Jap...
Meetings are a common activity in professional contexts, and it remains challenging to endow vocal assistants with advanced functionalities to facilitate meeting management. In this context, a task like active speaker detection can provide useful insights to model interaction between meeting participants. Motivated by our application context relate...
Training robot manipulation policies is a challenging and open problem in robotics and artificial intelligence. In this paper we propose a novel and compact state representation based on the rewards predicted from an image-based task success classifier. Our experiments, using the Pepper robot in simulation with two deep reinforcement learning algor...
This paper presents the first results of the PIA "Grands Défis du Numérique" research project LinTO. The goal of this project is to develop a conversational assistant to help the company's employees, particularly during meetings. LinTO is an interactive device equipped with microphones, a screen and a 360° camera, which makes it possible to control...
The context of this work is to characterize the content and the structure of audiovisual documents by analysing the temporal relationships between basic events resulting from different segmentations of the same document. For this purpose, we need to represent and reason about time. We propose a parametric representation of temporal relation betwee...
Purpose
The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid...
In intelligent environments, activity detection is a necessary pre-processing step for adaptive energy management and interaction with humans. To characterize the interactions between individuals or between an individual and the infrastructure of a building, a re-identification process is required and using multimodal models improves its robustness...
This article presents a new method for analyzing Automatic Speech Recognition (ASR) results at the phonological feature level. To this end, the Levenshtein distance algorithm is refined to take into account the distinctive features opposing substituted phonemes. This method makes it possible to survey feature additions or deletions, providing microsc...
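The core idea of refining the Levenshtein alignment with distinctive features can be sketched as follows. This is a minimal illustration, not the paper's exact method: the toy feature inventory, the symmetric-difference substitution cost, and the insertion/deletion cost are all assumptions.

```python
# Hypothetical sketch: Levenshtein distance where the substitution cost
# between two phonemes is the number of distinctive features that differ,
# so that substitutions between similar phonemes (e.g. /p/-/b/) cost less
# than substitutions between dissimilar ones. Feature sets are illustrative.

FEATURES = {  # toy inventory: phoneme -> set of distinctive features
    "p": {"labial", "stop", "voiceless"},
    "b": {"labial", "stop", "voiced"},
    "t": {"coronal", "stop", "voiceless"},
    "d": {"coronal", "stop", "voiced"},
}

def feature_distance(a, b):
    """Number of features present in one phoneme but not the other."""
    return len(FEATURES[a] ^ FEATURES[b])

def feature_levenshtein(ref, hyp, indel_cost=3):
    """Dynamic-programming edit distance with feature-weighted substitutions."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = feature_distance(ref[i - 1], hyp[j - 1])
            dist[i][j] = min(dist[i - 1][j] + indel_cost,      # deletion
                             dist[i][j - 1] + indel_cost,      # insertion
                             dist[i - 1][j - 1] + sub)         # substitution
    return dist[n][m]
```

For instance, aligning the reference /p t/ with the hypothesis /b t/ costs only the voicing feature difference, which is what allows error surveys at the feature level rather than the phoneme level.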
In this paper, we present a multi-modal perception based framework to realize a non-intrusive domestic assistive robotic system. It is non-intrusive in that it only starts interaction with a user when it detects the user’s intention to do so. All the robot’s actions are based on multi-modal perceptions which include user detection based on RGB-D da...
This research work forms the first part of a long-term project designed to provide a framework for facilitating hearing aids tuning. The present study focuses on the setting up of automatic measures of speech intelligibility for the recognition of isolated words and sentences. Both materials were degraded in order to simulate presbycusis effects on...
Visual tracking is a dynamic optimization problem in which time and object state simultaneously influence the problem. In this paper, we show how we built a tracker from an evolutionary optimization approach, the Particle Swarm Optimization (PSO) algorithm. We demonstrate that an extension of the original algorithm, in which system dynamics is explicitl...
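For readers unfamiliar with the underlying optimizer, a generic PSO loop looks as follows. This is the textbook algorithm, not the paper's tracker extension; the inertia and acceleration parameters are conventional defaults, and the objective is a stand-in for an image-similarity score.

```python
# Minimal generic Particle Swarm Optimization: each particle keeps a
# position, a velocity and its personal best; the swarm shares a global
# best that attracts all particles.
import random

def pso(objective, dim, n_particles=20, iters=200,
        w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    pos = [[random.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In a tracking setting, the "position" of a particle would be a candidate object state (e.g. location and scale) and the objective a matching score against the current frame; the paper's contribution is precisely the extension that injects system dynamics into this loop.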
ABSTRACT. This article presents a comparative study between perceptual and automatic measures of speech intelligibility on speech degraded by a simulation of presbycusis. The objective is to answer the following question: can we approximate a human perceptual measure by using an automatic speech recogniti...
This study aims at comparing perceptive and automatic measures of speech intelligibility in the case of speech signals simulating the effects of age-related hearing loss (presbycusis). A new corpus especially designed for studying speech intelligibility and perception and the comparison of human speech recognition scores with Automatic Speech Recog...
The increasing amount of digital multimedia content available is inspiring potential new types of user interaction with video data. Users want to easily find the content by searching and browsing. For this reason, techniques are needed that allow automatic categorisation, searching the content and linking to related information. In this work, we pre...
Losing objects is a cause of conflict between frail elderly people and their caregivers. To our knowledge, the literature addressing delusions of theft does not provide information on the objects that are involved. In the RIDDLE project, we are using a companion robot to help the elderly find the objects they are looking for. Hence, we initiated a study...
In the field of automatic audiovisual content-based indexing and structuring, finding events like interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and such high-level event detection. In our work, we consider that detecting speaker roles like Anchor, Journalist and Other is a first s...
Assistance is currently a pivotal research area in robotics, with huge societal potential. Since assistant robots directly interact with people, finding natural and easy-to-use user interfaces is of fundamental importance. This paper describes a flexible multimodal interface based on speech and gesture modalities in order to control our mobile robo...
We propose a novel approach for video classification based on the analysis of the temporal relationships between the basic events in audiovisual documents. Starting from basic segmentation results, we define a new representation method called the Temporal Relation Matrix (TRM). Each document is then described by a set of TRMs, the analysis...
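The kind of pairwise temporal analysis described above can be sketched with Allen-style interval relations. This is an assumed illustration of the general idea, not the paper's actual TRM definition: the relation set, the tie-breaking order, and the counting scheme are all simplifications.

```python
# Hedged sketch: given two segmentations of the same document (lists of
# (start, end) intervals, e.g. speech turns vs. music segments), classify
# the Allen-style relation holding between every pair of segments and
# count the relations, yielding a simple pairwise temporal signature.
from collections import Counter

def allen_relation(a, b):
    """Classify a coarse Allen relation between intervals a=(s,e), b=(s,e)."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if (s1, e1) == (s2, e2):
        return "equals"
    if s1 >= s2 and e1 <= e2:
        return "during"
    if s2 >= s1 and e2 <= e1:
        return "contains"
    return "overlaps"

def temporal_relation_counts(seg_a, seg_b):
    """Histogram of pairwise relations between two segmentations."""
    return Counter(allen_relation(a, b) for a in seg_a for b in seg_b)
```

Comparing such histograms across documents is one plausible way a set of pairwise temporal relations could feed a classifier, which is the spirit of the TRM-based approach.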
In the audio indexing context, we present our recent contributions to the field of speaker role recognition, especially applied to conversational speech. We assume that there exist clues about roles like Anchor, Journalists or Others in temporal, acoustic and prosodic features extracted from the results of speaker segmentation and from audio file...
When listening to foreign radio or TV programs, we are able to pick up some information from the way people are interacting with each other and easily identify the most dominant speaker or the person being interviewed. Our work relies on the existence of clues about speaker roles in acoustic and prosodic low-level features extracted from audio fil...
In the field of automatic audiovisual content-based indexing and structuring, finding events like interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and such high-level event detection. In our work, we consider that detecting speaker roles to enrich interaction sequences between speaker...
With personal robotics and assistance to dependent people, robots are in continuous interaction with humans. To enable more natural communication, based on speech and gesture, robots must be endowed with auditory and visual perception capacities. This paper describes a modular multimodal interface based on speech and gestures in order to control an...
We designed an easy-to-use user interface based on speech and gesture modalities for controlling an interactive robot. This paper, after a brief description of this interface and the platform on which it is implemented, describes an embedded gesture recognition system which is part of this multimodal interface. We describe two methods, namely Hidden...
Among the cognitive abilities a robot companion must be endowed with, human perception and speech understanding are both fundamental in the context of multimodal human-robot interaction. First, we propose a multiple object visual tracker which is interactively distributed and dedicated to two-handed gestures and head location in 3D. An on-board spe...
Giving access to the semantically rich content of large amounts of digital audiovisual data using an automatic and generic method is still an important challenge. The aim of our work is to address this issue while focusing on temporal aspects. Our approach is based on a method previously developed for analyzing temporal relations from a data mining...
Among the cognitive abilities a robot companion must be endowed with, human perception and speech understanding are both fundamental in the context of multimodal human-robot interaction. In order to provide a mobile robot with the visual perception of its user and means to handle verbal and multimodal communication, we have developed and integrated...
Rackham is an interactive robot-guide that has been used in several places and exhibitions. This paper presents its design and reports on results that have been obtained after its deployment in a permanent exhibition. The project is conducted so as to incrementally enhance the robot functional and decisional capabilities based on the observation of...
The aim of our work is the automatic analysis of audiovisual documents to retrieve their structure by studying the temporal relations between the events occurring in each of them. Different elementary segmentations of the same document are necessary. Then, starting from a parametric representation of temporal relations, a temporal relation matrix (TR...
The aim of our work is to study temporal relations between events that occur in the same audiovisual document. Our underlying purpose is to propose a method to reveal more significant events about the document content, like conversation sequences. We start from elementary features such as those produced by basic segmentation tools. Based on the new p...
Relations among temporal intervals can be used to detect semantic events in audiovisual documents. The aim of our work is to study all the relations that can be observed between different segmentations of the same document. These segmentations are automatically provided by a set of tools. Each tool determines temporal units according to specific low...
Abstract: The aim of our work is to characterize intentional structures in multimedia documents, and particularly in videos. To do so, we must first obtain different segmentations of the document and then study all the relations that can be observed between these segmentations. As far as we are concerned, we...
In this paper, we describe lexical needs for spoken and written French surface processing, like automatic text correction, speech recognition and synthesis. We present statistical observations made on a vocabulary compiled from real texts like articles. These texts have been used for building a recorded speech database called BREF. Developed by the...
An abstract is not available.
The aim of our work is the automatic analysis of audiovisual documents to characterize their structure. Our approach is based on the study of the temporal relations between events occurring in a same document. For this purpose, we have proposed a parametric representation of temporal relations, from which Temporal Relation Matrix can be computed an...
The MediaEval 2012 Genre Tagging Task is a follow-up task of the MediaEval 2011 Genre Tagging Task and the MediaEval 2010 Wild Wild Web Tagging Task to test and evaluate retrieval techniques for video content as it occurs on the Internet, i.e., for semi-professional user-generated content that is associated with annotations existing on the Soci...