Puneet Kumar
University of Oulu · Center for Machine Vision and Signal Analysis (CMVS)

PhD

About

28 Publications
10,858 Reads
360 Citations
Introduction
Puneet is a Postdoctoral Researcher at the University of Oulu, Finland, working on Multimodal Emotion Understanding. He received his Ph.D., M.E., and B.E. from the Indian Institute of Technology Roorkee, Thapar Institute of Engineering & Technology, and Manipal Institute of Technology, respectively. He has worked at Osaka Prefecture University (Japan), Samsung R&D India, and Oracle India, and co-founded PaiByTwo Pvt. Ltd. He is passionate about understanding how humans and computers relate, and about using that understanding to improve people's well-being.
Additional affiliations
December 2019 - December 2019
Osaka Prefecture University
Position
  • Researcher
Description
  • Visiting Researcher at the Department of Computer Science and Intelligent Systems, OPU, Osaka, Japan, under the Japan Science and Technology Sakura Science Plan.
July 2018 - June 2019
Samsung
Position
  • Researcher
Description
  • Worked on the project 'End-to-End Emotional Speech Synthesis', funded by Samsung R&D Delhi, India, under the supervision of Dr. R. Balasubramanian, Associate Professor, IIT Roorkee.
January 2014 - May 2016
Oracle Corporation
Position
  • Software Engineer
Education
July 2018 - June 2022
Indian Institute of Technology Roorkee
Field of study
  • Computer Science
July 2016 - June 2018
Thapar University
Field of study
  • Computer Science
July 2010 - May 2014
Manipal Academy of Higher Education
Field of study
  • Computer Science

Publications (28)
Conference Paper
Full-text available
In this paper, a multimodal speech emotion recognition system has been developed, and a novel technique to explain its predictions has been proposed. The audio and textual features are extracted separately using an attention-based Gated Recurrent Unit (GRU) and pre-trained Bidirectional Encoder Representations from Transformers (BERT), respectively. T...
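As a rough sketch of the two feature extractors mentioned above (not the paper's exact architecture: the dimensions, attention pooling, and fusion by concatenation are illustrative assumptions), using PyTorch and Hugging Face Transformers:

```python
# Illustrative sketch: attention-pooled GRU over audio frames plus a
# pre-trained BERT encoder for the transcript. All sizes are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class AudioGRUEncoder(nn.Module):
    def __init__(self, n_mels=40, hidden=128):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scalar score per time step

    def forward(self, x):                      # x: (batch, time, n_mels)
        h, _ = self.gru(x)                     # (batch, time, 2 * hidden)
        w = torch.softmax(self.attn(h), dim=1)
        return (w * h).sum(dim=1)              # attention-weighted pooling

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

audio_feats = AudioGRUEncoder()(torch.randn(2, 300, 40))    # (2, 256)
tokens = tokenizer(["I am thrilled", "leave me alone"],
                   return_tensors="pt", padding=True)
text_feats = bert(**tokens).pooler_output                   # (2, 768)
fused = torch.cat([audio_feats, text_feats], dim=-1)        # for a classifier head
```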
Conference Paper
Full-text available
In this paper, a deep learning based fusion approach has been proposed to classify the emotions portrayed by image and corresponding text into discrete emotion classes. The proposed method first implements intermediate fusion on image and text inputs and then applies late fusion on image, text, and intermediate fusion's output. We have also come up...
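A minimal sketch of that two-stage idea (the feature extractors, dimensions, and classifier head are placeholder assumptions, not the paper's design):

```python
# Sketch of hybrid fusion: fuse image and text features into an intermediate
# representation, then late-fuse image, text, and that intermediate output.
import torch
import torch.nn as nn

class HybridFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, n_classes=4):
        super().__init__()
        self.intermediate = nn.Linear(img_dim + txt_dim, 256)
        self.classifier = nn.Linear(img_dim + txt_dim + 256, n_classes)

    def forward(self, img, txt):
        inter = torch.relu(self.intermediate(torch.cat([img, txt], -1)))
        return self.classifier(torch.cat([img, txt, inter], -1))

# Placeholder features standing in for CNN image and BERT text embeddings.
logits = HybridFusion()(torch.randn(8, 512), torch.randn(8, 768))
```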
Conference Paper
Full-text available
Engagement analysis finds various applications in healthcare, education, advertising, and services. The Deep Neural Networks used for such analysis have complex architectures and need large amounts of input data, computational power, and inference time, constraints that make it challenging to embed these systems into devices for real-time use. To address these limitations...
Article
Full-text available
Analysis of non-typical emotions, such as stress, depression, and engagement, is less common and more complex than that of frequently discussed emotions like happiness, sadness, fear, and anger. The importance of these non-typical emotions has been increasingly recognized due to their implications for mental health and well-being. Stress and de...
Article
Full-text available
This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech and image features leading to the prediction of particular emotion classes....
Preprint
Full-text available
This paper aims to demonstrate the importance and feasibility of fusing multimodal information for emotion recognition. It introduces a multimodal framework for emotion understanding by fusing the information from visual facial features and rPPG signals extracted from the input videos. An interpretability technique based on permutation feature impo...
Article
In this paper, we have defined a novel task of affective feedback synthesis: generating feedback for an input text and corresponding image in a way similar to how humans respond to multimodal data. A feedback synthesis system has been proposed and trained using ground-truth human comments along with image-text input. We have also c...
Preprint
Full-text available
Received the 'Best Ph.D. Thesis Award' in the 9th IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON'22)
Preprint
Full-text available
This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes. T...
Preprint
Full-text available
This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTA Net), to classify the emotions reflected by a multimodal input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has also been developed to identify the important vis...
Article
Full-text available
In this paper, a novel method for analyzing the sentiments portrayed by Sanskrit text has been proposed. Sanskrit is one of the world’s most ancient languages; however, natural language processing tasks such as machine translation and sentiment analysis have not been explored to their full potential for it because of the unavailability of sufficient...
Preprint
Full-text available
In this paper, we have defined a novel task of affective feedback synthesis: generating feedback for an input text and corresponding image in a way similar to how humans respond to multimodal data. A feedback synthesis system has been proposed and trained using ground-truth human comments along with image-text input. We have also con...
Article
In this paper, a novel dual-channel system for multi-class text emotion recognition has been proposed, and a novel technique to explain its training & predictions has been developed. The architecture of the proposed system contains the embedding module, dual-channel module, emotion classification module, and explainability module. The embedding mod...
Chapter
This paper has proposed a novel approach to classify the subjects’ smoking behavior by extracting relevant regions from a given image using deep learning. After the classification, we have proposed a conditionally active detection module based on Yolo-v3, which improves the model’s performance and reduces its complexity. To the best of our knowledg...
Conference Paper
Full-text available
In this paper, an interpretable deep-learning-based system has been proposed for facial emotion recognition. A novel approach to interpret the proposed system’s results, Divide & Conquer based Shapley additive explanations (DnCShap), has also been developed. The proposed approach computes ‘Shapley values’ that denote the contribution of each image...
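For readers unfamiliar with Shapley values, here is a self-contained toy computation of the exact definition: the average marginal contribution of each "player" (here, a hypothetical image region) over all coalitions. The regions and value function are made up for illustration and are unrelated to DnCShap itself:

```python
# Toy Shapley value computation: phi_i averages the marginal contribution
# of player i, value(S + {i}) - value(S), over all subsets S, weighted by
# |S|! * (n - |S| - 1)! / n!.
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):                  # all subsets not containing i
            for S in combinations(others, r):
                S = set(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value(S | {i}) - value(S))
        phi[i] = total
    return phi

def value(S):                               # pretend "model confidence" with regions S
    base = 0.5 * ("eyes" in S) + 0.3 * ("mouth" in S)
    return base + 0.1 * ("eyes" in S and "mouth" in S)

print(shapley_values(["eyes", "mouth", "background"], value))
# eyes and mouth split the 0.1 interaction term; background contributes 0.0
```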
Preprint
Full-text available
The need to develop computational systems to recognize the emotions portrayed in various modalities such as image, text, and speech is increasing rapidly. This doctoral thesis aims to recognize intangibly expressed emotions through behavior observation. The proposed works intend to develop end-to-end systems that can recognize emotions portrayed th...
Chapter
Full-text available
Images are powerful tools for affective content analysis. Image emotion recognition is useful for graphics, gaming, animation, entertainment, and cinematography. In this paper, a technique for recognizing the emotions in images containing facial, non-facial, and non-human components has been proposed. The emotion-labeled images are mapped to their...
Preprint
Full-text available
This paper has proposed a novel approach to classifying subjects' smoking behavior by extracting relevant regions from a given image using deep learning. After the classification, we have proposed a conditional detection module based on Yolo-v3, which improves the model's performance and reduces its complexity. To the best of our knowledge, we are...
Article
Full-text available
The performance of text-to-speech (TTS) systems heavily depends on spectrogram-to-waveform generation, also known as the speech reconstruction phase; the time this phase requires is known as synthesis delay. In this paper, an approach to reduce speech synthesis delay has been proposed. It aims to enhance TTS systems for real-time applications...
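To make "synthesis delay" concrete, a small timing sketch using Griffin-Lim reconstruction in librosa (the spectrogram here is random noise, and the shapes and iteration count are arbitrary assumptions, not the paper's setup):

```python
# Measure how long spectrogram-to-waveform reconstruction takes with
# Griffin-Lim; this wall-clock time is the "synthesis delay" in question.
import time
import numpy as np
import librosa

S = np.abs(np.random.randn(1025, 400))      # fake magnitude spectrogram
t0 = time.perf_counter()
y = librosa.griffinlim(S, n_iter=32)        # iterative phase reconstruction
elapsed = time.perf_counter() - t0
print(f"reconstruction took {elapsed:.2f}s "
      f"for {len(y) / 22050:.1f}s of audio at an assumed 22.05 kHz")
```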
Preprint
Full-text available
In this paper, an end-to-end neural embedding system based on triplet loss and residual learning has been proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various l...
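A minimal sketch of the triplet idea (the network, feature dimensions, and batch construction are assumptions, not the paper's system): an anchor utterance is pulled toward a same-emotion positive and pushed away from a different-emotion negative.

```python
# Learn emotion embeddings with a triplet margin loss in PyTorch.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 64))
loss_fn = nn.TripletMarginLoss(margin=1.0)

anchor   = embed(torch.randn(16, 40))   # e.g. "angry" utterance features
positive = embed(torch.randn(16, 40))   # another "angry" utterance
negative = embed(torch.randn(16, 40))   # e.g. a "neutral" utterance
loss = loss_fn(anchor, positive, negative)
loss.backward()                         # gradients flow through the encoder
```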
Conference Paper
Full-text available
In this paper, we propose a method to automatically compute a speech evaluation metric, the Virtual Mean Opinion Score (vMOS), for speech generated by Text-to-Speech (TTS) models in order to analyse its human-ness. In contrast to currently used manual speech evaluation techniques, the proposed method uses an end-to-end neural network to calculate vMOS wh...

Questions (4)
Question
I want to learn about and understand the spectrograms associated with emotional speech, and I'm looking for sample code for this purpose, something like this: https://github.com/AzamRabiee/Emotional-TTS (code is not provided there). Any leads or pointers to relevant study material/code would be appreciated a lot. Thank you.
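A minimal starting point (not the linked repository's code) for computing and plotting a log-mel spectrogram with librosa; "angry_sample.wav" is a placeholder for any emotional speech clip, e.g. a file from RAVDESS or IEMOCAP:

```python
# Compute and display a log-mel spectrogram of a speech clip.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("angry_sample.wav", sr=22050)       # placeholder file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)            # convert to dB

librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Log-mel spectrogram of an emotional utterance")
plt.tight_layout()
plt.show()
```

Comparing such plots across emotions (e.g. angry vs. neutral renditions of the same sentence) is a quick way to see pitch and energy differences before building a model.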
Question
If we use cloud GPU services like Amazon AWS, FloydHub, Crestle, Vast.ai, etc. to train our deep learning networks, the code will be visible to the service providers, right? Is it okay to carry out currently unpublished research work on such services?
Question
Hi, I've just started my PhD in Computer Science. My area of interest is Emotion AI (Affective Computing), which aims to understand emotions in image, video, speech, body-language, and brain-wave data. I've been exploring all of these data types with the aim of framing my PhD research proposal around one of them.
To get started with EEG data, I'm planning to buy an EEG device to collect it myself. My budget is $100-200. I'd immensely appreciate any recommendations about the kind of device I could go for, or available EEG datasets I could explore. Thank you.
Question
Conventional Gradient Descent is very slow for Deep Learning training. While investigating alternative methods to train Deep Neural Networks faster, I came across a few algorithms like Stochastic Gradient Descent, Contrastive Divergence, and optimization heuristics. I am looking for resources covering such methods for speeding up Deep Learning training and parameter optimization.
I'd also appreciate some clarity about the Contrastive Divergence algorithm: is it an approximation to Gradient Descent, or a different algorithm altogether? Thanks.
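For reference: Contrastive Divergence is not a drop-in replacement for gradient descent; it approximates the log-likelihood gradient of energy-based models such as RBMs using short Gibbs-sampling chains, and that approximate gradient is then used inside (stochastic) gradient descent. The toy sketch below illustrates the more general speed-up being asked about, mini-batch SGD on synthetic linear-regression data; all numbers are arbitrary:

```python
# Why SGD is faster per step than full-batch gradient descent: each update
# uses a small random batch instead of all N samples.
import numpy as np

rng = np.random.default_rng(0)
N, d = 100_000, 10
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w, lr, batch = np.zeros(d), 0.01, 64
for step in range(1000):
    idx = rng.integers(0, N, size=batch)             # random mini-batch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # noisy gradient estimate
    w -= lr * grad                                   # cheap update: O(batch), not O(N)

print(np.linalg.norm(w - w_true))                    # should be close to 0
```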
