Neeraj Kumar

Indian Institute of Technology Delhi | IIT Delhi · Bharti School of Telecommunication Technology and Management

Bachelor of Technology
PhD researcher at IIT Delhi; BTech from IIT Kharagpur

About

13
Publications
2,260
Reads
51
Citations
Education
September 2020 - August 2024
Indian Institute of Technology Delhi
Field of study
  • Machine Learning

Publications

Publications (13)
Chapter
Large pretrained models, such as Bert, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. It is difficult to obtain a large quantity of supervised data due to the limited availability of resources and time. In light of this, a significant amount of research...
Article
Large pretrained models like Bert, GPT, and Wav2Vec have demonstrated their ability to learn transferable representations for various downstream tasks. However, obtaining a substantial amount of supervised data remains a challenge due to resource and time limitations. As a solution, researchers have turned their attention to using large pretrained...
Preprint
Full-text available
Large pre-trained models, such as Bert, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. It is difficult to obtain a large quantity of supervised data due to the limited availability of resources and time. In light of this, a significant amount of research...
Article
Text-to-speech (TTS) systems are designed to synthesize natural and expressive speech, adapt to an unseen voice, and capture the speaking style of an unseen speaker by converting text into speech. The introduction of an unseen speaker’s speaking style into a TTS system offers a wide range of application scenarios, including personal assistant, news...
Chapter
In this paper, we propose a novel normalization framework, multi-modal normalization (MultiNorm), that learns the multiple modalities through affine transformations involved in the normalization architecture. We have shown its effectiveness in speech-driven facial video generation and video emotion detection, which are complex problems due to their mult...
Preprint
Full-text available
There has been remarkable progress in the last few years in learning models that can recognise novel classes from only a few labeled examples. Few-shot learning (FSL) for action recognition is the challenging task of recognising novel action categories that are represented by few instances in the training data. We propose a novel variational i...
Article
Full-text available
COVID-19 has made immersive experiences such as video conferencing and virtual reality/augmented reality the most important modes of exchanging information. Despite much advancement in network bandwidth and codec techniques, current systems still suffer from glitches, lags and poor video quality, especially under unreliable network condit...
Preprint
Full-text available
We consider the challenging problem of audio to animated video generation. We propose a novel method OneShotAu2AV to generate an animated video of arbitrary length using an audio clip and a single unseen image of a person as an input. The proposed method consists of two stages. In the first stage, OneShotAu2AV generates the talking-head video in th...
Preprint
Full-text available
Speech-driven facial video generation has been a complex problem due to its multi-modal aspects, namely the audio and video domains. The audio comprises many underlying features such as expression, pitch, loudness and prosody (speaking style), and facial video has a lot of variability in terms of head movement, eye blinks, lip synchronization and movements...
Preprint
Full-text available
Audio to Video generation is an interesting problem that has numerous applications across industry verticals including film making, multi-media, marketing, education and others. High-quality video generation with expressive facial movements is a challenging problem that involves complex learning steps for generative adversarial networks. Further, e...
Preprint
Full-text available
The style of speech varies from person to person, and every person exhibits his or her own style of speaking that is determined by language, geography, culture and other factors. Style is best captured by the prosody of a signal. High-quality multi-speaker speech synthesis that accounts for prosody in a few-shot manner is an area of active r...
