
Puneet Kumar, PhD
University of Oulu
About
23 Publications
6,679 Reads
158 Citations (since 2017)
Introduction
Passionate about understanding how the human mind works and relating it to a computational basis; researching Emotion AI, Deep Learning, and Cognitive Science; currently a Postdoctoral Researcher at the Center for Machine Vision and Signal Analysis (CMVS), University of Oulu, Finland.
My life goal is to derive simplified outputs through computational research that help people optimize their lives. Interests and hobbies include Travel, Photography, Audio-books, Meditation, Cosmology, Behav...
Additional affiliations
July 2018 - present
IIT Roorkee, PhD Student
Education
July 2018 - June 2022
July 2016 - June 2018
July 2010 - May 2014
Publications (23)
In this paper, a multimodal speech emotion recognition system has been developed, and a novel technique to explain its predictions has been proposed. The audio and textual features are extracted separately using an attention-based Gated Recurrent Unit (GRU) and pre-trained Bidirectional Encoder Representations from Transformers (BERT), respectively. T...
In this paper, a deep learning based fusion approach has been proposed to classify the emotions portrayed by image and corresponding text into discrete emotion classes. The proposed method first implements intermediate fusion on image and text inputs and then applies late fusion on image, text, and intermediate fusion's output. We have also come up...
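The intermediate-plus-late fusion idea described above can be sketched as follows. This is a minimal illustration with made-up feature dimensions, random weights, and four hypothetical emotion classes; it is not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical pre-extracted modality features (e.g., from an image CNN and a text encoder).
image_feat = rng.normal(size=128)
text_feat = rng.normal(size=128)

# Intermediate fusion: combine raw features before classification.
intermediate = np.concatenate([image_feat, text_feat])  # shape (256,)

# Illustrative (untrained) linear classifiers for 4 discrete emotion classes.
W_img = rng.normal(size=(4, 128))
W_txt = rng.normal(size=(4, 128))
W_mid = rng.normal(size=(4, 256))

p_img = softmax(W_img @ image_feat)
p_txt = softmax(W_txt @ text_feat)
p_mid = softmax(W_mid @ intermediate)

# Late fusion: average the per-branch class distributions.
p_final = (p_img + p_txt + p_mid) / 3
print(p_final.argmax())
```

Averaging softmax outputs keeps the fused result a valid probability distribution, which is one common way to realize late fusion.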
Received the 'Best Ph.D. Thesis Award' at the 9th IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON'22)
This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes. T...
This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTA Net), to classify the emotions reflected by a multimodal input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has also been developed to identify the important vis...
In this paper, a novel method for analyzing the sentiments portrayed by Sanskrit text has been proposed. Sanskrit is one of the world's most ancient languages; however, natural language processing tasks such as machine translation and sentiment analysis have not been explored for it to their full potential because of the unavailability of sufficient...
In this paper, we have defined a novel task of affective feedback synthesis that deals with generating feedback for an input text and corresponding image in a manner similar to how humans respond to multimodal data. A feedback synthesis system has been proposed and trained using ground-truth human comments along with image-text input. We have also con...
In this paper, a novel dual-channel system for multi-class text emotion recognition has been proposed, and a novel technique to explain its training & predictions has been developed. The architecture of the proposed system contains the embedding module, dual-channel module, emotion classification module, and explainability module. The embedding mod...
This paper has proposed a novel approach to classify the subjects’ smoking behavior by extracting relevant regions from a given image using deep learning. After the classification, we have proposed a conditionally active detection module based on Yolo-v3, which improves the model’s performance and reduces its complexity. To the best of our knowledg...
In this paper, an interpretable deep-learning-based system has been proposed for facial emotion recognition. A novel approach to interpret the proposed system’s results, Divide & Conquer based Shapley additive explanations (DnCShap), has also been developed. The proposed approach computes ‘Shapley values’ that denote the contribution of each image...
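The 'Shapley values' mentioned above have an exact closed form that can be computed by enumerating feature coalitions. The sketch below illustrates the standard Shapley value that SHAP-style explainers approximate; it is not the DnCShap algorithm itself, and the three facial "features" and the coalition scores are entirely made-up numbers:

```python
import itertools
import math

features = ["eyes", "mouth", "brow"]

# Hypothetical value function: model confidence for class "happy"
# given a subset of visible facial regions (illustrative values).
v = {
    frozenset(): 0.25,
    frozenset({"eyes"}): 0.35,
    frozenset({"mouth"}): 0.60,
    frozenset({"brow"}): 0.30,
    frozenset({"eyes", "mouth"}): 0.75,
    frozenset({"eyes", "brow"}): 0.45,
    frozenset({"mouth", "brow"}): 0.70,
    frozenset({"eyes", "mouth", "brow"}): 0.90,
}

def shapley(feature):
    """Exact Shapley value: weighted average of the feature's marginal
    contribution over every coalition of the other features."""
    n = len(features)
    others = [f for f in features if f != feature]
    total = 0.0
    for r in range(n):
        for subset in itertools.combinations(others, r):
            s = frozenset(subset)
            weight = (math.factorial(len(s)) * math.factorial(n - len(s) - 1)
                      / math.factorial(n))
            total += weight * (v[s | {feature}] - v[s])
    return total

phi = {f: shapley(f) for f in features}
print(phi)
```

A useful sanity check is the efficiency property: the attributions sum to v(all features) - v(empty set).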
The need to develop computational systems to recognize the emotions portrayed in various modalities such as image, text, and speech is increasing rapidly. This doctoral thesis aims to recognize intangibly expressed emotions through behavior observation. The proposed works intend to develop end-to-end systems that can recognize emotions portrayed th...
Images are powerful tools for affective content analysis. Image emotion recognition is useful for graphics, gaming, animation, entertainment, and cinematography. In this paper, a technique for recognizing the emotions in images containing facial, non-facial, and non-human components has been proposed. The emotion-labeled images are mapped to their...
This paper has proposed a novel approach to classify the subjects' smoking behavior by extracting relevant regions from a given image using deep learning. After the classification, we have proposed a conditional detection module based on Yolo-v3, which improves the model's performance and reduces its complexity. To the best of our knowledge, we are...
The performance of text-to-speech (TTS) systems heavily depends on spectrogram to waveform generation, also known as the speech reconstruction phase. The time required for the same is known as synthesis delay. In this paper, an approach to reduce speech synthesis delay has been proposed. It aims to enhance the TTS systems for real-time applications...
In this paper, an end-to-end neural embedding system based on triplet loss and residual learning has been proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various l...
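The triplet loss underlying the embedding system above has a simple hinge form: it pushes same-emotion embeddings closer to the anchor than different-emotion embeddings, by at least a margin. A minimal sketch with toy, hand-picked 2-D embeddings (not learned representations from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors:
    zero once the positive is closer than the negative by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to same-class sample
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to other-class sample
    return max(0.0, d_pos - d_neg + margin)

# Toy emotion embeddings (illustrative values).
happy_a = np.array([1.0, 0.0])
happy_b = np.array([0.9, 0.1])   # same class -> positive
angry   = np.array([-1.0, 0.2])  # different class -> negative

loss = triplet_loss(happy_a, happy_b, angry)
print(loss)  # 0.0: the triplet is already well separated
```

When the roles are swapped (positive far, negative near), the loss becomes large, which is the gradient signal that pulls same-emotion utterances together during training.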
In this paper, we propose a method to automatically compute a speech evaluation metric, Virtual Mean Opinion Score (vMOS) for the speech generated by Text-to-Speech (TTS) models to analyse its human-ness. In contrast to the currently used manual speech evaluation techniques, the proposed method uses an end-to-end neural network to calculate vMOS wh...
Questions (4)
I want to learn and understand the spectrograms associated with emotional speech. I'm looking for the sample code for this purpose. Something like this: https://github.com/AzamRabiee/Emotional-TTS (code is not provided here). Any leads or pointers to the relevant study material/code will be appreciated a lot. Thank you.
If we use cloud GPU services such as Amazon AWS, FloydHub, Crestle, Vast.ai, etc. to train our deep learning networks, the code will be visible to the service providers, right? Is it safe to carry out unpublished research work on such services?
Hi, I've just started my PhD in Computer Science. My area of interest is Emotion AI (Affective Computing), which aims to understand emotions in Image, Video, Speech, Body-language and Brain-wave data. I've been trying to explore all of these data-types, with the aim to frame my PhD research proposal with one of them.
To get started with EEG data, I'm planning to buy an EEG device to gather the same. My budget is $100-200. I'd immensely appreciate any recommendations about the kind of device I could go for, or available EEG data-sets I could explore. Thank you.
Conventional Gradient Descent is very slow for Deep Learning training. While investigating alternative methods to train Deep Neural Networks faster, I came across a few algorithms such as Stochastic Gradient Descent, Contrastive Divergence, and optimization heuristics. I am looking for resources covering such methods to speed up Deep Learning training and parameter optimization.
I'd appreciate any leads to such resources, as well as some clarity on the Contrastive Divergence algorithm: is it an approximation to Gradient Descent, or a different algorithm altogether? Thanks.
Projects (2)
To build a neural text-to-speech system that is capable of generating multi-speaker speech directly from raw text, including emotional affect.
This project is part of research funded by Samsung R&D Delhi, India, under the supervision of Dr. R. Balasubramanian, Associate Professor, IIT Roorkee.
This research aimed to use meta-heuristic search methods such as the Genetic Algorithm (GA) to find optimal hyper-parameters of Deep Neural Networks (DNNs). It was part of an M.E. CSE thesis work.
Results: Compared to traditional grid-search-based methods, it provided an average speed-up of 8x for Convolutional Neural Networks (CNN) and 6.5x for Recurrent Neural Networks (RNN).
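The GA-based hyper-parameter search loop can be sketched in a few lines: sample a population, score each candidate, keep the fittest as parents, and produce children via crossover and mutation. The search space (learning-rate exponent, hidden-layer width) and the fitness function below are toy stand-ins; a real run would train and validate a DNN per candidate:

```python
import random

random.seed(42)

# Hypothetical search space: log10 learning rate and hidden-layer width.
def random_individual():
    return {"lr_exp": random.uniform(-5, -1), "width": random.randint(16, 256)}

def fitness(ind):
    # Stand-in for validation accuracy; this toy objective peaks
    # near lr_exp = -3 and width = 128.
    return -((ind["lr_exp"] + 3) ** 2) - ((ind["width"] - 128) / 128) ** 2

def crossover(a, b):
    # Single-point crossover: take one gene from each parent.
    return {"lr_exp": a["lr_exp"], "width": b["width"]}

def mutate(ind):
    child = dict(ind)
    child["lr_exp"] += random.gauss(0, 0.3)
    child["width"] = max(16, min(256, child["width"] + random.randint(-32, 32)))
    return child

pop = [random_individual() for _ in range(20)]
for generation in range(30):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]  # elitist selection: the best survive unchanged
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(15)]
    pop = parents + children

best = max(pop, key=fitness)
print(best)
```

Because the elite individuals are carried over unchanged, the best fitness never decreases across generations, which is what makes such searches reliably converge faster than exhaustive grid search on large spaces.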