About
28 Publications · 10,858 Reads · 360 Citations
Introduction
Puneet is a Postdoctoral Researcher at the University of Oulu, Finland, working on Multimodal Emotion Understanding. He received his Ph.D. from the Indian Institute of Technology Roorkee, his M.E. from the Thapar Institute of Engineering & Technology, and his B.E. from the Manipal Institute of Technology. He has worked at Osaka Prefecture University (Japan), Samsung R&D India, and Oracle India, and co-founded PaiByTwo Pvt. Ltd. He is passionate about understanding how humans and computers relate, and about using that understanding to improve people's well-being.
Additional affiliations
January 2014 - May 2016
Education
July 2018 - June 2022 · Ph.D., Indian Institute of Technology Roorkee
July 2016 - June 2018 · M.E., Thapar Institute of Engineering & Technology
July 2010 - May 2014 · B.E., Manipal Institute of Technology
Publications (28)
In this paper, a multimodal speech emotion recognition system has been developed, and a novel technique to explain its predictions has been proposed. The audio and textual features are extracted separately using attention-based Gated Recurrent Unit (GRU) and pre-trained Bidirectional Encoder Representations from Transformers (BERT), respectively. T...
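For a concrete picture of the pipeline this abstract describes, here is a minimal sketch: an attention-pooled GRU over frame-level audio features, BERT over the transcript, and a classifier over the concatenated embeddings. The module names, feature sizes, and the 'bert-base-uncased' checkpoint are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of attention-GRU (audio) + BERT (text) fusion; sizes assumed.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class AudioGRU(nn.Module):
    """Attention-pooled bidirectional GRU over frame-level audio features."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        h, _ = self.gru(x)                      # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over frames
        return (w * h).sum(dim=1)               # (batch, 2*hidden)

class SpeechTextEmotion(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.audio = AudioGRU()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(256 + 768, n_classes)

    def forward(self, audio_feats, input_ids, attention_mask):
        a = self.audio(audio_feats)
        t = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output
        return self.classifier(torch.cat([a, t], dim=-1))

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
toks = tokenizer(["I am so happy today"], return_tensors="pt")
model = SpeechTextEmotion()
logits = model(torch.randn(1, 120, 40), toks["input_ids"], toks["attention_mask"])
```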
In this paper, a deep learning based fusion approach has been proposed to classify the emotions portrayed by image and corresponding text into discrete emotion classes. The proposed method first implements intermediate fusion on image and text inputs and then applies late fusion on image, text, and intermediate fusion's output. We have also come up...
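A minimal sketch of the hybrid-fusion idea described above, assuming precomputed image and text feature vectors; averaging the three logit streams as the late-fusion step is an assumption, not necessarily the paper's rule.

```python
# Intermediate fusion of image+text features, then late fusion of three streams.
import torch
import torch.nn as nn

class HybridFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, n_classes=4):
        super().__init__()
        self.img_head = nn.Linear(img_dim, n_classes)   # image-only stream
        self.txt_head = nn.Linear(txt_dim, n_classes)   # text-only stream
        self.mid_head = nn.Sequential(                  # intermediate fusion
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, img_feat, txt_feat):
        mid = self.mid_head(torch.cat([img_feat, txt_feat], dim=-1))
        # late fusion: average the logits of the three streams
        return (self.img_head(img_feat) + self.txt_head(txt_feat) + mid) / 3

model = HybridFusion()
logits = model(torch.randn(2, 512), torch.randn(2, 768))  # shape (2, 4)
```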
Engagement analysis finds various applications in healthcare, education, advertising, and services. The Deep Neural Networks used for such analysis have complex architectures and need large amounts of input data, computational power, and inference time. These constraints make it challenging to embed such systems into devices for real-time use. To address these limitations...
Analysis of non-typical emotions, such as stress, depression and engagement is less common and more complex compared to that of frequently discussed emotions like happiness, sadness, fear, and anger. The importance of these non-typical emotions has been increasingly recognized due to their implications on mental health and well-being. Stress and de...
This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech and image features leading to the prediction of particular emotion classes....
This paper aims to demonstrate the importance and feasibility of fusing multimodal information for emotion recognition. It introduces a multimodal framework for emotion understanding by fusing the information from visual facial features and rPPG signals extracted from the input videos. An interpretability technique based on permutation feature impo...
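The interpretability technique mentioned is based on permutation feature importance; a generic sketch of that analysis looks like the following (the predict function and data are placeholders, not the paper's model).

```python
# Permutation feature importance: shuffle one feature column and measure the
# resulting accuracy drop; larger drops indicate more important features.
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    base = np.mean(predict(X) == y)           # baseline accuracy
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])             # destroy feature j's information
            drops.append(base - np.mean(predict(Xp) == y))
        importances[j] = np.mean(drops)
    return importances
```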
In this paper, we have defined a novel task of affective feedback synthesis that deals with generating feedback for an input text and corresponding image, in a similar way to how humans respond to multimodal data. A feedback synthesis system has been proposed and trained using ground-truth human comments along with the image-text input. We have also c...
Received the 'Best Ph.D. Thesis Award' at the 9th IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON'22)
This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes. T...
This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTA Net), to classify the emotions reflected by a multimodal input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has also been developed to identify the important vis...
In this paper, a novel method for analyzing the sentiments portrayed by Sanskrit text has been proposed. Sanskrit is one of the world’s most ancient languages; however, natural language processing tasks such as machine translation and sentiment analysis have not been explored to their full potential for it because of the unavailability of sufficient...
In this paper, we have defined a novel task of affective feedback synthesis that deals with generating feedback for an input text and corresponding image, in a similar way to how humans respond to multimodal data. A feedback synthesis system has been proposed and trained using ground-truth human comments along with the image-text input. We have also con...
In this paper, a novel dual-channel system for multi-class text emotion recognition has been proposed, and a novel technique to explain its training & predictions has been developed. The architecture of the proposed system contains the embedding module, dual-channel module, emotion classification module, and explainability module. The embedding mod...
This paper has proposed a novel approach to classify the subjects’ smoking behavior by extracting relevant regions from a given image using deep learning. After the classification, we have proposed a conditionally active detection module based on Yolo-v3, which improves the model’s performance and reduces its complexity. To the best of our knowledg...
In this paper, an interpretable deep-learning-based system has been proposed for facial emotion recognition. A novel approach to interpret the proposed system’s results, Divide & Conquer based Shapley additive explanations (DnCShap), has also been developed. The proposed approach computes ‘Shapley values’ that denote the contribution of each image...
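The exact DnCShap procedure is not reproduced here, but a simple occlusion-based stand-in conveys the divide-and-conquer flavor: split the image into quadrants, score each by the prediction drop when it is masked out, and recurse into the most influential quadrant. All names and the zero-masking choice are illustrative assumptions.

```python
# Occlusion-based region attribution (a stand-in, not the paper's DnCShap).
import torch

def occlusion_attribution(model, img, cls, depth=3, box=None):
    """img: (1, C, H, W); returns [((y0, y1, x0, x1), score), ...]."""
    _, _, H, W = img.shape
    y0, y1, x0, x1 = box or (0, H, 0, W)
    base = model(img)[0, cls].item()
    ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
    quads = [(y0, ym, x0, xm), (y0, ym, xm, x1),
             (ym, y1, x0, xm), (ym, y1, xm, x1)]
    scores = []
    for a, b, c, d in quads:
        masked = img.clone()
        masked[:, :, a:b, c:d] = 0            # occlude this region
        scores.append(base - model(masked)[0, cls].item())
    out = list(zip(quads, scores))
    if depth > 1:                             # recurse into the top quadrant
        best = max(range(4), key=scores.__getitem__)
        out += occlusion_attribution(model, img, cls, depth - 1, quads[best])
    return out
```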
The need to develop computational systems to recognize the emotions portrayed in various modalities such as image, text, and speech is increasing rapidly. This doctoral thesis aims to recognize intangibly expressed emotions through behavior observation. The proposed works intend to develop end-to-end systems that can recognize emotions portrayed th...
Images are powerful tools for affective content analysis. Image emotion recognition is useful for graphics, gaming, animation, entertainment, and cinematography. In this paper, a technique for recognizing the emotions in images containing facial, non-facial, and non-human components has been proposed. The emotion-labeled images are mapped to their...
This paper has proposed a novel approach to classify the subjects' smoking behavior by extracting relevant regions from a given image using deep learning. After the classification, we have proposed a conditional detection module based on Yolo-v3, which improves the model's performance and reduces its complexity. To the best of our knowledge, we are...
The performance of text-to-speech (TTS) systems heavily depends on spectrogram-to-waveform generation, also known as the speech reconstruction phase. The time required for this phase is known as the synthesis delay. In this paper, an approach to reduce speech synthesis delay has been proposed. It aims to enhance TTS systems for real-time applications...
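To make "synthesis delay" concrete, here is a minimal way to time the spectrogram-to-waveform step in isolation, using Griffin-Lim as a stand-in vocoder; the file name and STFT parameters are illustrative.

```python
# Time only the reconstruction (spectrogram -> waveform) phase.
import time
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=22050)            # any speech clip
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))    # magnitude spectrogram

t0 = time.perf_counter()
y_hat = librosa.griffinlim(S, hop_length=256)              # reconstruction phase
delay = time.perf_counter() - t0
print(f"synthesis delay: {delay:.3f}s for {len(y_hat)/sr:.2f}s of audio")
```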
In this paper, an end-to-end neural embedding system based on triplet loss and residual learning has been proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various l...
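A minimal sketch of triplet-loss embedding learning of this kind, assuming precomputed utterance feature vectors; the plain MLP (the paper uses residual learning) and all sizes are placeholders.

```python
# Learn emotion embeddings so same-class utterances sit closer than
# different-class ones, via the standard triplet margin loss.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
loss_fn = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(embed.parameters(), lr=1e-3)

# anchor/positive share an emotion class; negative comes from another class
anchor, positive, negative = (torch.randn(16, 128) for _ in range(3))
loss = loss_fn(embed(anchor), embed(positive), embed(negative))
opt.zero_grad()
loss.backward()
opt.step()
```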
In this paper, we propose a method to automatically compute a speech evaluation metric, the Virtual Mean Opinion Score (vMOS), for speech generated by Text-to-Speech (TTS) models, to analyse its human-ness. In contrast to the currently used manual speech evaluation techniques, the proposed method uses an end-to-end neural network to calculate vMOS wh...
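A minimal sketch of the idea: a small network that maps a mel spectrogram of synthesized speech to a scalar quality score. The architecture and the 1-5 score range are assumptions, not the paper's actual vMOS model.

```python
# Regress a scalar quality score from a mel spectrogram.
import torch
import torch.nn as nn

class MOSNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(32, 1)

    def forward(self, spec):                  # spec: (batch, 1, mels, frames)
        score = self.head(self.conv(spec).flatten(1))
        return 1 + 4 * torch.sigmoid(score)   # squash into a 1-5 MOS-like range

vmos = MOSNet()(torch.randn(1, 1, 80, 200))   # shape (1, 1)
```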
Questions (4)
I want to learn and understand the spectrograms associated with emotional speech, and I'm looking for sample code for this purpose. Something like this: https://github.com/AzamRabiee/Emotional-TTS (code is not provided here). Any leads or pointers to relevant study material/code will be appreciated a lot. Thank you.
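As a minimal starting point (not the linked repo's code), one can compute and plot a log-mel spectrogram of an emotional utterance with librosa; the file name and parameters below are illustrative, and varying n_mels, n_fft, and hop_length shows how they change the picture.

```python
# Compute and display a log-mel spectrogram of a speech clip.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("angry_sample.wav", sr=16000)   # any emotional clip
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80,
                                   n_fft=1024, hop_length=256)
S_db = librosa.power_to_db(S, ref=np.max)            # convert power to dB

librosa.display.specshow(S_db, sr=sr, hop_length=256,
                         x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Log-mel spectrogram")
plt.show()
```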
If we use cloud GPU services like Amazon AWS, FloydHub, Crestle, Vast.ai, etc. to train our deep learning networks, the code will be visible to the service providers, right? Is it okay to carry out unpublished research work on such services?
Hi, I've just started my PhD in Computer Science. My area of interest is Emotion AI (Affective Computing), which aims to understand emotions in image, video, speech, body-language, and brain-wave data. I've been trying to explore all of these data types, with the aim of framing my PhD research proposal around one of them.
To get started with EEG data, I'm planning to buy an EEG device to collect it. My budget is $100-200. I'd immensely appreciate any recommendations about the kind of device I could go for, or available EEG datasets I could explore. Thank you.
Conventional Gradient Descent is very slow for Deep Learning training. While investigating alternative methods to train Deep Neural Networks faster, I came across a few algorithms like Stochastic Gradient Descent, Contrastive Divergence, and various optimization heuristics. I am looking for resources to explore all such important methods for speeding up Deep Learning training and parameter optimization.
I'd appreciate any leads on such resources, and some clarity about the Contrastive Divergence algorithm: is it an approximation to Gradient Descent, or is it a different algorithm altogether? Thanks.
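For context: Contrastive Divergence is not a drop-in replacement for gradient descent; it is a cheap approximation to the log-likelihood gradient of energy-based models such as Restricted Boltzmann Machines. A minimal CD-1 update for a binary RBM might look as follows (biases omitted for brevity, sizes made up).

```python
# One CD-1 weight update for a binary RBM: contrast the data-driven
# statistics <v0 h0> with statistics from a one-step reconstruction <v1 h1>.
import torch

n_vis, n_hid, lr = 784, 128, 0.01
W = torch.randn(n_vis, n_hid) * 0.01

def cd1_step(W, v0):
    """v0: batch of binary visible vectors, shape (batch, n_vis)."""
    h0 = torch.bernoulli(torch.sigmoid(v0 @ W))     # sample hidden units
    v1 = torch.bernoulli(torch.sigmoid(h0 @ W.T))   # one reconstruction step
    h1 = torch.sigmoid(v1 @ W)                      # hidden probabilities
    grad = (v0.T @ h0 - v1.T @ h1) / v0.shape[0]    # positive - negative phase
    return W + lr * grad

W = cd1_step(W, torch.bernoulli(torch.rand(32, n_vis)))
```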