
Neeraj Kumar
Indian Institute of Technology Delhi | IIT Delhi · Bharti School of Telecommunication Technology and Management
Bachelor of Technology
PhD researcher, IIT Delhi; B.Tech, IIT Kharagpur
About
13
Publications
2,260
Reads
51
Citations
Introduction
Education
September 2020 - August 2024
Publications (13)
Large pretrained models, such as BERT, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. It is difficult to obtain a large quantity of supervised data due to the limited availability of resources and time. In light of this, a significant amount of research...
Large pretrained models like BERT, GPT, and Wav2Vec have demonstrated their ability to learn transferable representations for various downstream tasks. However, obtaining a substantial amount of supervised data remains a challenge due to resource and time limitations. As a solution, researchers have turned their attention to using large pretrained...
Large pre-trained models, such as BERT, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. It is difficult to obtain a large quantity of supervised data due to the limited availability of resources and time. In light of this, a significant amount of research...
Text-to-speech (TTS) systems are designed to synthesize natural and expressive speech, adapt to an unseen voice, and capture the speaking style of an unseen speaker by converting text into speech. The introduction of an unseen speaker’s speaking style into a TTS system offers a wide range of application scenarios, including personal assistant, news...
In this paper, we propose a novel normalization framework, multi-modal normalization (MultiNorm), that learns multiple modalities through the affine transformations involved in the normalization architecture. We show its effectiveness in speech-driven facial video generation and video emotion detection, which are complex problems due to their mult...
In the last few years there has been remarkable progress in learning models that can recognise novel classes from only a few labeled examples. Few-shot learning (FSL) for action recognition is the challenging task of recognising novel action categories that are represented by few instances in the training data. We propose a novel variational i...
COVID-19 has made immersive experiences such as video conferencing and virtual reality/augmented reality the most important modes of exchanging information. Despite many advances in network bandwidth and codec techniques, current systems still suffer from glitches, lags and poor video quality, especially under unreliable network condit...
We consider the challenging problem of audio to animated video generation. We propose a novel method OneShotAu2AV to generate an animated video of arbitrary length using an audio clip and a single unseen image of a person as an input. The proposed method consists of two stages. In the first stage, OneShotAu2AV generates the talking-head video in th...
Speech-driven facial video generation is a complex problem because of its multi-modal nature, spanning the audio and video domains. The audio carries many underlying features such as expression, pitch, loudness and prosody (speaking style), and facial video varies widely in head movement, eye blinks, lip synchronization and movements...
Audio to Video generation is an interesting problem that has numerous applications across industry verticals including film making, multi-media, marketing, education and others. High-quality video generation with expressive facial movements is a challenging problem that involves complex learning steps for generative adversarial networks. Further, e...
The style of speech varies from person to person, and every person exhibits his or her own style of speaking, determined by language, geography, culture and other factors. Style is best captured by the prosody of a signal. High-quality multi-speaker speech synthesis that accounts for prosody in a few-shot manner is an area of active r...