
Stéphane Dupont
- PhD Electrical Engineering
- University of Mons
About
- 196 Publications
- 78,168 Reads
- 4,503 Citations
Introduction
COST IC1307 - http://www.cost.eu/COST_Actions/ict/Actions/IC1307
IMOTION - http://www.chistera.eu/projects/imotion
JOKER - http://www.chistera.eu/projects/joker
iTreasures - http://www.i-treasures.eu/
ILHAIRE - http://www.ilhaire.eu/
LinkedTV - http://www.linkedtv.eu/
Publications (196)
Laughter is everywhere. So much so that we often do not even notice it. First, laughter has a strong connection with humour. Most of us seek out laughter and people who make us laugh, and it is what we do when we gather together as groups relaxing and having a good time. But laughter also plays an important role in making sure we interact with each...
In this paper, we focus on the modeling of coarticulation and pronunciation variation in Automatic Speech Recognition systems (ASR). Most ASR systems explicitly describe these production phenomena through context-dependent phoneme models and multiple pronunciation lexicons. Here, we explore the potential benefit of using feature spaces covering lon...
The paper proposes a solution that advances the genericity of ASR technology across tasks and languages. A non-linear discriminant model is built from multi-lingual, multi-task speech material in order to classify the acoustic signal into language-independent phonetic units. Instead of considering this model for direct HMM state...
Comprehending communication is dependent on analyzing the different modalities of conversation, including audio, visual, and others. This is a natural process for humans, but in digital libraries, where preservation and dissemination of digital information are crucial, it is a complex task. A rich conversational model, encompassing all modalities a...
Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query. Researchers have already proposed several well-performing solutions for this task, but most focus on enhancing embedding through different approaches such as triplet loss, quadruplet loss, add...
This paper proposes a neuro-symbolic approach to predict the power of marine cargo vessels. The neuro-symbolic approach combines two parts. The first is a neural networks part, and the second is a symbolic part that relies on physics-based formulae. The Shifts-power dataset was used for evaluation. The experimental results showed that a combination...
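A minimal sketch of the neuro-symbolic idea above, assuming a cubic speed-power law as the symbolic part and an MLP that learns the residual; the formula, the coefficient k, the features, and the toy data are illustrative assumptions, not the paper's actual components:

```python
# Hedged sketch: physics prior + neural residual for vessel power.
import numpy as np
from sklearn.neural_network import MLPRegressor

def physics_power(speed_knots, k=0.9):
    # Admiralty-style prior: power grows roughly with the cube of speed.
    # The coefficient k is a hypothetical constant, not from the paper.
    return k * speed_knots ** 3

# Toy data: features = [speed, draft], target = measured shaft power.
rng = np.random.default_rng(0)
X = rng.uniform([8.0, 6.0], [20.0, 12.0], size=(500, 2))
y = physics_power(X[:, 0]) + 50.0 * X[:, 1] + rng.normal(0, 100, 500)

# The neural part fits only what the physics prior cannot explain.
residual = y - physics_power(X[:, 0])
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
net.fit(X, residual)

def predict_power(X_new):
    return physics_power(X_new[:, 0]) + net.predict(X_new)
```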
This work aims at generating captions for soccer videos using deep learning. The paper introduces a novel dataset, model, and triple-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for 500 hours of SoccerNet videos. The model is divided into three parts: a transformer lea...
The prediction of future consumption is vital for building management in general and green buildings in particular. A certified green building should meet an evaluation criterion of resource efficiency throughout its life cycle. This paper presents a hybrid model that combines a profile averaging model and a machine learning model, namely random fo...
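A hedged sketch of such a hybrid: a (weekday, hour) profile average blended with a random forest; the column names, feature set, and blend weight alpha are assumptions for illustration:

```python
# Hedged sketch: profile averaging + random forest, convexly blended.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_hybrid(history: pd.DataFrame, feature_cols, alpha=0.5):
    # history holds past consumption with a "kwh" target column (assumed).
    profile = history.groupby(["weekday", "hour"])["kwh"].mean()
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(history[feature_cols], history["kwh"])

    def predict(row: pd.Series) -> float:
        base = profile.loc[(row["weekday"], row["hour"])]   # profile model
        ml = rf.predict(row[feature_cols].to_frame().T)[0]  # ML model
        return alpha * base + (1 - alpha) * ml              # simple blend

    return predict
```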
Nowadays, with the growing interest in green energy, further improvements in photovoltaic (PV) power systems are needed. In this regard, the main aim is to find an optimal method to predict the output power of PV systems to maintain a sustainable operation. Hence, this work proposes a hybrid Machine Learning (ML) method LASSO-RFR for an hourly PV p...
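One plausible reading of a LASSO-RFR hybrid, sketched here as a simple sklearn averaging ensemble; the paper's exact combination scheme and input features are not given in this excerpt, so everything below is assumed:

```python
# Hedged sketch: LASSO + random forest averaged for hourly PV power.
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Lasso

hybrid = VotingRegressor([
    ("lasso", Lasso(alpha=0.1)),   # sparse linear component
    ("rfr", RandomForestRegressor(n_estimators=300, random_state=0)),
])
# hybrid.fit(X_hourly_weather, y_pv_power)   # names assumed
# y_hat = hybrid.predict(X_next_hours)
```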
Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics and the spatial configuration of hand-drawn sketch queries. The universality of sketches extends the scope of possible applications and increases the demand for efficient SBIR solutions. In this paper, we study classic triplet-based SBIR s...
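A minimal triplet-based SBIR sketch in PyTorch, assuming two toy linear encoders and an illustrative margin; the actual networks studied are far deeper:

```python
# Hedged sketch: triplet training pulls matching sketch-photo pairs
# together in a shared embedding space and pushes mismatches apart.
import torch
import torch.nn as nn

embed_sketch = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 256))
embed_photo = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 256))
triplet = nn.TripletMarginLoss(margin=0.3)  # margin is an assumption

sketch = torch.randn(8, 1, 224, 224)   # anchor: sketch queries
pos = torch.randn(8, 1, 224, 224)      # matching photos
neg = torch.randn(8, 1, 224, 224)      # non-matching photos

loss = triplet(embed_sketch(sketch), embed_photo(pos), embed_photo(neg))
loss.backward()
```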
Smiling differences between men and women have been studied in psychology. Women smile more than men, although women are not universally more expressive across all facial actions. There are also body-movement differences between women and men. For example, more open-body postures were reported for men, but are there any body-movement diffe...
The development of virtual agents has enabled human-avatar interactions to become increasingly rich and varied. Moreover, an expressive virtual agent, i.e. one that mimics the natural expression of emotions, enhances social interaction between a user (human) and an agent (intelligent machine). The set of non-verbal behaviors of a virtual character is, t...
Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multimodal Attentive Fusion Network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition. I...
This work aims at generating captions for soccer videos using deep learning. In this context, this paper introduces a dataset, model, and triple-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of SoccerNet videos. The model is divided into three part...
In real-world datasets, specifically in TV recordings, videos are often multi-person and multi-angle, which poses significant challenges for gesture recognition and retrieval. In addition to being of interest to linguists, gesture retrieval is a novel and challenging application for multimedia retrieval. In this paper, we propose a novel method for...
Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition....
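A rough sketch of attention-based audio-visual fusion in the spirit described above: scalar weights over (modality, time) decide what the classifier sees. All dimensions and the class count are assumptions:

```python
# Hedged sketch: attentive fusion of audio and video feature sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D = 4, 10, 128                    # batch, time steps, feature dim
audio = torch.randn(B, T, D)
video = torch.randn(B, T, D)

score = nn.Linear(D, 1)                 # shared relevance scorer
x = torch.stack([audio, video], dim=1)  # (B, 2, T, D)

# One softmax over all (modality, time) slots: the model can emphasize
# the most relevant window and/or modality per clip.
w = F.softmax(score(x).squeeze(-1).flatten(1), dim=1)
fused = (w.view(B, 2, T, 1) * x).sum(dim=(1, 2))   # (B, D) clip vector
logits = nn.Linear(D, 28)(fused)        # e.g. 28 event classes (assumed)
```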
In this paper, we propose a study on multi-modal (audio and video) action spotting and classification in soccer videos. Action spotting and classification are the tasks of finding the temporal anchors of events in a video and determining which events they are. This is an important application of general activity understanding. Here, we pr...
This paper aims to bring a new lightweight yet powerful solution for the task of Emotion Recognition and Sentiment Analysis. Our motivation is to propose two architectures based on Transformers and modulation that combine the linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state-of-the-art in th...
We introduce the AVECL-UMons dataset for audio-visual event classification and localization in the context of office environments. The audio-visual dataset is composed of 11 event classes recorded at several realistic positions in two different rooms. Two types of sequences are recorded according to the number of events in the sequence. The dataset...
Understanding expressed sentiment and emotions are two crucial factors in human multimodal language. This paper describes a Transformer-based joint-encoding (TBJE) for the task of Emotion Recognition and Sentiment Analysis. In addition to using the Transformer architecture, our approach relies on a modular co-attention and a glimpse layer to jointly...
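A loose sketch of joint encoding with cross-modal attention, assuming PyTorch's stock transformer layers; the actual TBJE modular co-attention and glimpse layer are more elaborate:

```python
# Hedged sketch: encode each modality, then let text attend to audio.
import torch
import torch.nn as nn

d = 128
text = torch.randn(4, 20, d)    # token features (assumed)
audio = torch.randn(4, 50, d)   # acoustic frames (assumed)

self_attn = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
co_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

text = self_attn(text)          # shared encoder layer, for brevity
audio = self_attn(audio)
joint, _ = co_attn(query=text, key=audio, value=audio)  # cross-modal step
sentiment = nn.Linear(d, 1)(joint.mean(dim=1))          # pooled head
```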
In order to properly train an automatic speech recognition system, speech with its annotated transcriptions is most often required. The amount of real annotated data recorded in noisy and reverberant conditions is extremely limited, especially compared to the amount of data that can be simulated by adding noise to clean annotated speech. Thus, usin...
Recently, generative adversarial networks (GAN) have gathered a lot of interest. Their efficiency in generating unseen samples of high quality, especially images, has improved over the years. In the field of Natural Language Generation (NLG), the use of the adversarial setting to generate meaningful sentences has proven difficult for two reaso...
As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to perform visual feature extraction end-to-end during training. This small contribution aims to suggest new ideas to improve the visual processing of traditional convolutional networks for visual question answering (VQ...
Even with the growing interest in problems at the intersection of Computer Vision and Natural Language, grounding (i.e. identifying) the components of a structured description in an image remains a challenging task. This contribution aims to propose a model which learns grounding by reconstructing the visual features for the Multi-modal trans...
When searching for an object, humans navigate through a scene using semantic information and spatial relationships. We look for an object using our knowledge of its attributes and relationships with other objects to infer its probable location. In this paper, we propose to tackle the visual navigation problem using rich semantic representations of t...
This paper describes the UMONS solution for the Multimodal Machine Translation Task presented at the third conference on machine translation (WMT18). We explore a novel architecture, called deepGRU, based on recent findings in the related task of Neural Image Captioning (NIC). The models presented in the following sections lead to the best METEOR t...
Neural Image Captioning (NIC) or neural caption generation has attracted a lot of attention over the last few years. Describing an image with a natural language has been an emerging challenge in both fields of computer vision and language processing. Therefore a lot of research has focused on driving this task forward with new creative ideas. So fa...
In order to properly train an automatic speech recognition system, speech with its annotated transcriptions is required. The amount of real annotated data recorded in noisy and reverberant conditions is extremely limited, especially compared to the amount of data that can be simulated by adding noise to clean annotated speech. Thus, using both real...
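A small sketch of the simulation step mentioned above: scale a noise signal to a target SNR and add it to clean annotated speech, so the existing transcription still applies:

```python
# Hedged sketch of noisy-data simulation at a chosen SNR (numpy only).
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve SNR = 10*log10(p_clean / (gain**2 * p_noise)) for gain.
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise   # transcription of `clean` still holds

# e.g. noisy = mix_at_snr(clean_wav, babble_wav, snr_db=10)
```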
The 11th Summer Workshop on Multimodal Interfaces, eNTERFACE 2015, was hosted by the Numediart Institute of Creative Technologies of the University of Mons from August 10th to September 2015. During the four weeks, students and researchers from all over the world came together to work on eight sele...
Intangible Cultural Heritage (ICH) creations include, amongst others, music, dance, singing, theatre, human skills and craftsmanship. These cultural expressions are usually transmitted orally and/or using gestures and are modified over a period of time, through a process of collective recreation. As the world becomes more interconnected and many dif...
We propose a new and fully end-to-end approach for multimodal translation where the source text encoder modulates the entire visual input processing using conditional batch normalization, in order to compute the most informative image features for our task. Additionally, we propose a new attention mechanism derived from this original idea, where th...
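A minimal sketch of conditional batch normalization (FiLM-style), where a text encoding predicts per-channel scale and shift for the visual feature maps; all dimensions are illustrative assumptions:

```python
# Hedged sketch: the source-text encoding modulates visual processing.
import torch
import torch.nn as nn

class CondBatchNorm(nn.Module):
    def __init__(self, channels, text_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.to_gamma = nn.Linear(text_dim, channels)
        self.to_beta = nn.Linear(text_dim, channels)

    def forward(self, feats, text):
        g = self.to_gamma(text).unsqueeze(-1).unsqueeze(-1)  # per-channel scale
        b = self.to_beta(text).unsqueeze(-1).unsqueeze(-1)   # per-channel shift
        return (1 + g) * self.bn(feats) + b   # text conditions vision

cbn = CondBatchNorm(channels=64, text_dim=256)
out = cbn(torch.randn(2, 64, 14, 14), torch.randn(2, 256))
```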
This paper presents the IMOTION system in its third version. While still focusing on sketch-based retrieval, we improved upon the semantic retrieval capabilities introduced in the previous version by adding more detectors and improving the interface for semantic query specification. Compared to the previous year's system, we increase the role of fea...
Dealing with speech corrupted by noise and reverberation is still an issue for automatic speech recognition. To address this, a solution that can be combined with multi-style learning consists of using multi-task learning, where the acoustic model is trained to solve one main task and at least one auxiliary task simultaneously. In noisy and reverbe...
This paper presents a data collection carried out in the framework of the Joker Project. Interaction scenarios have been designed in order to study the effects of affect bursts in a human-robot interaction and to build a system capable of using multilevel affect bursts in a human-robot interaction. We use two main audio expression cues: verbal (synthe...
Freehand sketches are a simple and powerful tool for communication. They are easily recognized across cultures and suitable for various applications. In this paper, we use deep convolutional neural networks (ConvNets), state-of-the-art in the field of sketch recognition, to address several applications of automatic sketch processing: complete and p...
Dealing with noise deteriorating the speech is still a major problem for automatic speech recognition. An interesting approach to tackle this problem consists of using multi-task learning. In this case, an efficient auxiliary task is clean-speech generation. This auxiliary task is trained in addition to the main speech recognition task and its goal...
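A hedged sketch of this multi-task setup: a shared encoder feeding a main state-classification head and an auxiliary clean-speech regression head; the layer sizes and the 0.3 loss weight are assumptions:

```python
# Hedged sketch: ASR main task + clean-speech generation auxiliary task.
import torch
import torch.nn as nn

feat_dim, hidden, n_states = 40, 256, 2000
encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
asr_head = nn.Linear(hidden, n_states)      # main: HMM-state posteriors
denoise_head = nn.Linear(hidden, feat_dim)  # auxiliary: clean frames

noisy = torch.randn(32, feat_dim)
clean = torch.randn(32, feat_dim)           # aligned clean reference (assumed)
states = torch.randint(0, n_states, (32,))

h = encoder(noisy)
loss = nn.functional.cross_entropy(asr_head(h), states) \
     + 0.3 * nn.functional.mse_loss(denoise_head(h), clean)
loss.backward()   # the auxiliary gradient regularizes the shared encoder
```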
This paper presents the results achieved during our participation at the MediaEval 2017 Retrieving Diverse Social Images Task. The proposed unsupervised multimodal approach exploits visual and textual information in a fashion that prioritizes both relevance and diversification. As features, we used a modified version of the RMAC (Regional Maximum A...
In this paper we present the AmuS database of about three hours of amused speech, recorded from two male and one female subjects in two languages, French and English. We review previous work on smiled speech and speech-laughs. We describe acoustic analysis on part of our database, and a perception test compari...
In state-of-the-art Neural Machine Translation (NMT), an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mecha...
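For concreteness, an illustrative additive (Bahdanau-style) attention step of the kind described; all shapes are assumed:

```python
# Hedged sketch: one decoding step of additive attention over the source.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 128
src = torch.randn(1, 15, d)      # encoded source sentence (15 words)
dec_state = torch.randn(1, d)    # current decoder state

W_s, W_h = nn.Linear(d, d), nn.Linear(d, d)
v = nn.Linear(d, 1)

energy = v(torch.tanh(W_s(src) + W_h(dec_state).unsqueeze(1)))  # (1, 15, 1)
alpha = F.softmax(energy.squeeze(-1), dim=1)     # weight per source word
context = (alpha.unsqueeze(-1) * src).sum(dim=1) # (1, d) summary vector
# `context` feeds the decoder before it emits the next target word.
```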
In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and one source description in English. This is considered as the multimodal image caption translation task. The images are processed with a Convolutional Neural Network (CNN) to extract visual features explo...
In this paper, we present our work on analysis and classification of smiled vowels, chuckling (or shaking) vowels and laughter syllables. This work is part of a larger framework that aims at assessing the level of amusement in speech using the audio modality only. Indeed all of these three categories occur in amused speech and are considered to con...
Freehand sketches are a simple and powerful tool for communication. They are easily recognized across cultures and suitable for various applications. In this paper, we use deep convolutional neural networks (ConvNets) to address sketch-based image retrieval (SBIR). We first train our ConvNets on sketch and image object recognition in a large scale...
Freehand sketches are an intuitive tool for communication and suitable for various applications. In this paper, we present an effective approach that combines triplet networks and an attention mechanism for sketch-based image retrieval (SBIR). The study conducted in this work is based on features extracted using deep convolutional neural networks (...
Intangible cultural heritage (ICH) is a relatively recent term coined to represent living cultural expressions and practices, which are recognised by communities as distinct aspects of identity. The safeguarding of ICH has become a topic of international concern primarily through the work of United Nations Educational, Scientific and Cultural Organ...
In state-of-the-art Neural Machine Translation, an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism h...
We present a blind source separation algorithm named GCC-NMF that combines unsupervised dictionary learning via non-negative matrix factorization (NMF) with spatial localization via the generalized cross correlation (GCC) method. Dictionary learning is performed on the mixture signal, with separation subsequently achieved by grouping dictionary ato...
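A sketch of the GCC half of GCC-NMF: PHAT-weighted cross-correlation estimating the inter-channel delay used to localize the target source; the NMF dictionary-learning half is omitted here:

```python
# Hedged sketch: GCC-PHAT time-delay estimation between two channels.
import numpy as np

def gcc_phat(x_left, x_right, fs):
    n = len(x_left) + len(x_right)
    X = np.fft.rfft(x_left, n=n)
    Y = np.fft.rfft(x_right, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12   # PHAT whitening: keep phase only
    cc = np.fft.irfft(cross, n=n)
    # Reorder so negative lags precede positive ones, then peak-pick.
    shift = np.argmax(np.abs(np.concatenate((cc[-n // 2:], cc[: n // 2]))))
    return (shift - n // 2) / fs     # delay in seconds

# Dictionary atoms would then be grouped by whether their apparent
# delay matches the target's estimated delay.
```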
I-Vectors have been successfully applied in the speaker identification community in order to characterize the speaker and the acoustic environment. Recently, i-vectors have also shown their usefulness in automatic speech recognition, when concatenated to standard acoustic features. Instead of directly feeding the acoustic model with i-vectors, we...
Overfitting is a commonly met issue in automatic speech recognition and is especially harmful when the amount of training data is limited. In order to address this problem, this article investigates acoustic modeling through Multi-Task Learning, with two speaker-related auxiliary tasks. Multi-Task Learning is a regularization method which aims at...
In this work, we experiment with the use of smiling and laughter in order to help create more natural and efficient listening agents. We present preliminary results on a system which predicts smile and laughter sequences in one dialogue participant based on observations of the other participant's behavior. This system also predicts the level of int...
Freehand sketches are an interesting universal form of visual representation. Sketching has become easily accessible with many of the devices that we use on a daily basis. In this paper, we propose a system for real-time sketch recognition and similarity search. Our system is able to recognize partial sketches from 250 object categories. It is then...
In order to address the commonly met issue of overfitting in speech recognition, this article investigates Multi-Task Learning, when the auxiliary task focuses on speaker classification. Overfitting occurs when the amount of training data is limited, leading to an overly sensitive acoustic model. Multi-Task Learning is a method, among many other regul...
Affect bursts are short, isolated and non-verbal expressions of affect expressed vocally or facially. In this paper we present an attempt at synthesizing audio affect bursts on several levels of arousal. This work concerns 3 different types of affect bursts: disgust, startle and surprised expressions. Data are first gathered for each of these affec...
In this paper, we introduce the i-Treasures Intangible Cultural Heritage (ICH) dataset, a freely available collection of multimodal data captured from different forms of rare ICH. More specifically, the dataset contains video, audio, depth, motion capture data and other modalities, such as EEG or ultrasound data. It also includes (manual) annotatio...
It has been shown that adding expressivity and emotional expressions to an agent's communication systems would improve the interaction quality between this agent and a human user. In this paper we present a multimodal database of affect bursts, which are very short non-verbal expressions with facial, vocal, and gestural components that are highly s...
This paper provides a short summary of the importance of taking into account laughter and smile expressions in Human-Computer Interaction systems. Based on the literature, we mention some important characteristics of these expressions in our daily social interactions. We describe some of our own contributions and ongoing work to this field.
Generalization is a common issue for automatic speech recognition. A successful method used to improve recognition results consists of training a single system to solve multiple related tasks in parallel. This overview investigates which auxiliary tasks are helpful for speech recognition when multi-task learning is applied on a deep learning based...
The IMOTION system is a content-based video search engine that provides fast and intuitive known item search in large video collections. User interaction consists mainly of sketching, which the system recognizes in real-time and makes suggestions based on both visual appearance of the sketch (what does the sketch look like in terms of colors, edge...
This paper presents the second version of the IMOTION system, a sketch-based video retrieval engine supporting multiple query paradigms. From its first version, IMOTION has supported the search for video sequences on the basis of still images, user-provided sketches, or the specification of motion via flow fields. For the second version, the functionality and...
This paper introduces iAutoMotion, an autonomous video retrieval system that requires only minimal user input. It is based on the video retrieval engine IMOTION. iAutoMotion uses a camera to capture the input for both visual and textual queries and performs query composition, retrieval, and result submission autonomously. For the visual tasks, it u...
In this paper, we present our work on classification of smiled vowels, shaking vowels and laughter syllables. This work is part of a larger framework that aims at assessing the level of amusement in speech using only audio cues. Indeed all of these three categories occur in amused speech and are considered to express a different level of amusement....
Smile is not only a visual expression. When it occurs together with speech, it also alters its acoustic realization. Being able to synthesize speech altered by the expression of smile can hence be an important contributor for adding naturalness and expressiveness in interactive systems. In this work, we present a first attempt to develop a Hidden M...
In this paper, we present our work on speech-smile/shaking vowels classification. An efficient classification system would be a first step towards the estimation (from speech signals only) of amusement levels beyond smile, as indeed shaking vowels represent a transition from smile to laughter superimposed on speech. A database containing examples o...
In this work, we present a study dedicated to improving speech-laugh synthesis quality. The impact of two factors is evaluated. The first factor is the addition of breath intake sounds after laughter bursts in speech. The second is the repetition of the word interrupted by laughs in the speech-laugh sentences. Several configurations are evaluated...
The deliverable contains the updated and revised requirements of the i-Treasures platform.
In this paper, we present a system for sketch classification and similarity search. We used deep convolutional neural networks (ConvNets), state of the art in the field of image recognition. They enable both classification and medium/high-level feature extraction. We make use of ConvNets features as a basis for similarity search using k-Nearest Neigh...
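A hedged sketch of the similarity-search stage, with the ConvNet feature extractor stubbed out by random descriptors:

```python
# Hedged sketch: k-nearest-neighbour search over ConvNet descriptors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

gallery = np.random.rand(10_000, 512)   # stored ConvNet features (stub)
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(gallery)

query = np.random.rand(1, 512)          # descriptor of a new sketch (stub)
dist, idx = index.kneighbors(query)     # indices of the 5 most similar items
```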
This paper presents an HMM-based speech-smile synthesis system. In order to do that, databases of three speech styles were recorded. This system was used to study to what extent synthesized speech-smiles (defined as Duchenne smiles in our work) and spread-lips (speech modulated by spreading the lips) communicate amusement. Our evaluation results sh...
This paper presents an HMM-based synthesis approach for speech-laughs. The cornerstone of this project was the idea of the co-occurrence of smile and laughter bursts in varying proportions within amused speech utterances. A corpus with three complementary speaking styles was used to train the underlying HMM models: neutral speech, speech-smile,...
The main objective of the EU FP7 ICT i-Treasures project is to build a public and expandable platform to enable learning and transmission of rare know-how of intangible cultural heritage. A core part of this platform consists of game-like applications able to support teaching and learning processes in the ICH field. We have designed and developed f...
The paper presents an interactive game-like application to learn, perform and evaluate modern contemporary singing. The Human Beat Box (HBB) is used as a case study. The game consists of two main modules: a sensor module that consists of a portable helmet-based system containing an ultrasonic (US) transducer to capture tongue movements, a vid...
The main objective of the EU FP7 ICT i-Treasures project is to build a public and expandable platform to enable learning and transmission of rare know-how of intangible cultural heritage. A core part of this platform consists of game-like applications able to support teaching and learning processes in the ICH field. We have designed and developed f...
This paper introduces the IMOTION system, a sketch-based video retrieval engine supporting multiple query paradigms. For vector space retrieval, the IMOTION system exploits a large variety of low-level image and video features, as well as high-level spatial and temporal features that can all be jointly used in any combination. In addition, it suppo...
This study investigated which features of AVATAR laughter are perceived threatening for individuals with a fear of being laughed at (gelotophobia), and individuals with no gelotophobia. Laughter samples were systematically varied (e.g., intensity, laughter pitch, and energy for the voice, intensity of facial actions of the face) in three modalities...