Rita Singh
  • Sharda University

About

261
Publications
40,367
Reads
4,491
Citations
Current institution
Sharda University

Publications (261)
Preprint
Full-text available
Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential a...
Preprint
Full-text available
Speech foundation models trained at a massive scale, both in terms of model and data size, result in robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet effectively measuring their performance remains a challenge. Traditional metrics like...
Preprint
Full-text available
Recent advances in speech foundation models are largely driven by scaling both model size and data, enabling them to perform a wide range of tasks, including speech recognition. Traditionally, ASR models are evaluated using metrics like Word Error Rate (WER) and Character Error Rate (CER), which depend on ground truth labels. As a result of limited...
Preprint
Full-text available
Understanding and explaining differences between audio recordings is crucial for fields like audio forensics, quality assessment, and audio generation. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners. This paper stands out as the first work to comprehensively st...
Article
Full-text available
Teachers' content-related humor is highly relevant for student outcomes in higher education (HE). Yet, teachers' use of different types of humor and frequency and other factors make generalizations about the effective use of humor on students' learning hard to establish. Specifically, little research attention has been paid to the impact of the use...
Preprint
Full-text available
Voice biometric tasks, such as age estimation, require modeling the often complex relationship between voice features and the biometric variable. While deep learning models can handle such complexity, they typically require large amounts of accurately labeled data to perform well. Such data are often scarce for biometric tasks such as voice-based ag...
Preprint
Full-text available
Understanding how speech foundation models capture non-verbal cues is crucial for improving their interpretability and adaptability across diverse tasks. In our work, we analyze several prominent models, such as Whisper, Seamless, Wav2Vec, HuBERT, and Qwen2-Audio, focusing on their learned representations in both paralinguistic and non-paralinguistic...
Preprint
The quality of human voice plays an important role across various fields like music, speech therapy, and communication, yet it lacks a universally accepted, objective definition. Instead, voice quality is referred to using subjective descriptors like "rough", "breathy" etc. Despite this subjectivity, extensive research across disciplines has linked...
Preprint
Full-text available
Speaker verification systems have seen significant advancements with the introduction of Multi-scale Feature Aggregation (MFA) architectures, such as MFA-Conformer and ECAPA-TDNN. These models leverage information from various network depths by concatenating intermediate feature maps before the pooling and projection layers, demonstrating that even...
Preprint
Full-text available
We introduce a novel, general-purpose audio generation framework specifically designed for anomaly detection and localization. Unlike existing datasets that predominantly focus on industrial and machine-related sounds, our framework focuses on a broader range of environments, particularly useful in real-world scenarios where only audio data are availa...
Preprint
Full-text available
Speaker verification systems are crucial for authenticating identity through voice. Traditionally, these systems focus on comparing feature vectors, overlooking the speech's content. However, this paper challenges this by highlighting the importance of phonetic dominance, a measure of the frequency or duration of phonemes, as a crucial cue in speak...
Article
A 40-year-old woman diagnosed with Mayer-Rokitansky-Küster-Hauser (MRKH) syndrome at 16 years of age presented with a large abdominal mass protruding to the right subcostal margin, equivalent to 30 weeks gestation. She didn’t have comorbidities of hypertension or diabetes. The vitals were normal with BMI 30. She was asymptomatic except for occasion...
Preprint
Full-text available
Reference summaries for abstractive speech summarization require human annotation, which can be performed by listening to an audio recording or by reading textual transcripts of the recording. In this paper, we examine whether summaries based on annotators listening to the recordings differ from those based on annotators reading transcripts. Using...
Article
Full-text available
The emergence of ChatGPT, a Generative AI program, has sparked discussions about its teaching and learning value, and concerns about academic integrity in higher education (HE). An extant review of the literature indicates that a scarcity of research exists on GenAI, specifically a synthesis of the official views, guidelines and articles of top-ran...
Preprint
Full-text available
Recent literature uses language to build foundation models for audio. These Audio-Language Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval, Captioning, and Question Answering. However, their ability to engage in more complex open-ended tasks, like Interactive...
Preprint
Full-text available
Speech Emotion Recognition (SER) has traditionally been formulated as a classification task. However, emotions are generally a spectrum whose distribution varies from situation to situation, leading to poor Out-of-Domain (OOD) performance. We take inspiration from the statistical formulation of Automatic Speech Recognition (ASR) and formulate the SER ta...
Preprint
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation. However, challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives...
Preprint
Full-text available
[Figure 1: Motivation illustration. Panels: (a) Referring Perception in Practice; (b) Perturbation in Referring Guidance.] Referring perception models (RP...
Article
Adequate nutrition status is imperative for overall health and well-being, although numerous challenges impede its attainment. Various sociodemographic determinants significantly influence an individual's nutrition status, particularly those of women. In this pilot study, the dietary patterns of 105 women were collected using the modified version o...
Article
Full-text available
Background: The aim of this study was to evaluate the impact of myoinositol and D-chiro inositol plus vitamin D supplementation on the prevention of gestational diabetes mellitus (GDM) in pregnant women. Methods: In this multi-centric, prospective, randomised, double-blind clinical trial, either vitamin D alone (group I) or myoinositol and D-chiro inositol p...
Preprint
This work unveils the enigmatic link between phonemes and facial features. Traditional studies on voice-face correlations typically involve using a long period of voice input, including generating face images from voices and reconstructing 3D face meshes from voices. However, in situations like voice-based crimes, the available voice evidence may b...
Preprint
Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic...
Preprint
Full-text available
End-to-end speech summarization has been shown to improve performance over cascade baselines. However, such models are difficult to train on very large inputs (dozens of minutes or hours) owing to compute restrictions and are hence trained with truncated model inputs. Truncation leads to poorer models, and a solution to this problem rests in block-...
Preprint
Full-text available
In this paper, we introduce the imprecise label learning (ILL) framework, a unified approach to handle various imprecise label configurations, which are commonplace challenges in machine learning tasks. ILL leverages an expectation-maximization (EM) algorithm for the maximum likelihood estimation (MLE) of the imprecise label information, treating t...
Preprint
Full-text available
In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks, while delivering state-of-the-art performance. However, current models inherently lack the capacity t...
Article
Background: The triple burden of malnutrition (TBM) presents a significant threat to the health of mothers and their future generations, particularly in low-and middle-income countries. Though having adequate macro-and micronutrients and maintaining a healthy weight are essential for all, pregnant women require special attention. However, their nut...
Preprint
Full-text available
Emotions lie on a broad continuum and treating emotions as a discrete number of classes limits the ability of a model to capture the nuances in the continuum. The challenge is how to describe the nuances of emotions and how to enable a model to learn the descriptions. In this work, we devise a method to automatically create a description (or prompt...
Preprint
Full-text available
Traditionally, in paralinguistic analysis for emotion detection from speech, emotions have been identified with discrete or dimensional (continuous-valued) labels. Accordingly, models that have been proposed for emotion detection use one or the other of these label types. However, psychologists like Russell and Plutchik have proposed theories and m...
Article
BACKGROUND Postpartum haemorrhage (PPH) and pre-eclampsia and eclampsia (PE/E) are the leading causes of pregnancy and childbirth-related complications and deaths, particularly in developing countries. FOGSI-Manyata skill transfer training is being implemented in private healthcare facilities in India, enabling ‘task-shifting’ to staff nurses by im...
Preprint
This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge ExVo-MultiTask track. The method of choice utilized a combination of spectro-temporal modulation and self-supervised features, followed by an encoder-decoder netwo...
Article
This study compares the coverage of coping strategies and emotions portrayed in news regarding COVID-19 by The New York Times in the U.S. and People’s Daily of China via social media. By employing corpus assisted discourse analysis to scrutinize the text corpora, our study uncovered prominent keywords and themes. Findings indicate that a comprehens...
Article
Full-text available
Processed and radiation sterilized allograft tissues that can be banked for use on demand are a precious therapeutic resource for the repair or reconstruction of damaged or injured tissues. Skin dressings or skin substitutes like allograft skin, amniotic membrane and bioengineered skin can be used for the treatment of thermal burns and radiation in...
Article
Communication is critical in a new health emergency because it motivates the public to take preventive actions. Prior research has shown that strategies including source credibility, information transparency and uncertainty reduction actions could enhance trust in health communication on social media. Yet research on how the government in China use...
Preprint
Full-text available
Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sound of cough and breath have all been used with varying degrees of success in automated voice-based COVID-19 dete...
Article
This paper addresses the deep face recognition problem under an open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. To this end, hyperspherical face recognition, as a promising line of research, has attracted increasing attenti...
Article
Full-text available
Background: Vaccines serve an integral role in containing pandemics, yet vaccine hesitancy is prevalent globally. One key reason for this is the pervasiveness of misinformation on social media. While considerable research attention has been drawn to how exposure to misinformation is closely associated with vaccine hesitancy, little scholarly atten...
Preprint
BACKGROUND Vaccines serve an integral role in containing pandemics, yet vaccine hesitancy is prevalent globally. One key reason for this hesitancy is the pervasiveness of misinformation on social media. Although considerable research attention has been drawn to how exposure to misinformation is closely associated with vaccine hesitancy, little scho...
Article
Public relations professionals from global corporations have increasingly communicated corporate social responsibility (CSR) practices on social media to engage publics. Yet the link between CSR communication of global corporations, particularly with regard to the dimensions of genuineness exhibited in their communication and public engagement on s...
Preprint
We present a conditional estimation (CEST) framework to learn 3D facial parameters from 2D single-view images by self-supervised training from videos. CEST is based on the process of analysis by synthesis, where the 3D facial parameters (shape, reflectance, viewpoint, and illumination) are estimated from the face image, and then recombined to recon...
Preprint
Full-text available
This paper reflects on the effect of several categories of medical conditions on human voice, focusing on those that may be hypothesized to have effects on voice, but for which the changes themselves may be subtle enough to have eluded observation in standard analytical examinations of the voice signal. It presents three categories of techniques th...
Preprint
Full-text available
This paper addresses the deep face recognition problem under an open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. To this end, hyperspherical face recognition, as a promising line of research, has attracted increasing attenti...
Preprint
Full-text available
State-of-the-art deep face recognition methods are mostly trained with a softmax-based multi-class classification framework. Despite being popular and effective, these methods still have a few shortcomings that limit empirical performance. In this paper, we first identify the discrepancy between training and evaluation in the existing multi-class c...
Preprint
Full-text available
Multiple studies in the past have shown that there is a strong correlation between human vocal characteristics and facial features. However, existing approaches generate faces simply from voice, without exploring the set of features that contribute to these observed correlations. A computational methodology to explore this can be devised by rephras...
Preprint
Full-text available
While multitask and transfer learning have been shown to improve the performance of neural networks in limited data settings, they require pretraining of the model on large datasets beforehand. In this paper, we focus on improving the performance of weakly supervised sound event detection in low data and noisy settings simultaneously without requiring an...
Article
Silver nanoparticles as antimicrobial agents have vast potential in agriculture for the protection of plants from diseases and increasing crop productivity. In this study, the bulbils of aerial yam (Dioscorea bulbifera L.) were investigated for their potential for biosynthesis of silver nanoparticles. Variation in the absorption spectra of silver na...
Conference Paper
Full-text available
Open-set speaker recognition can be regarded as a metric learning problem, which is to maximize inter-class variance and minimize intra-class variance. Supervised metric learning can be categorized into pair-based learning and proxy-based learning [1]. Most of the existing metric learning objectives belong to the former division, the performance of...
Article
Full-text available
When corporations are confronted with a crisis, well-crafted CEO apologies can serve to repair, restore, and rebuild a damaged corporate image. In prior research, the use of linguistic resources exhibited in CEO corporate apology discourse for different crisis response strategies has not been sufficiently examined. Drawing on the appraisal framewor...
Preprint
Full-text available
State-of-the-art methods for audio generation suffer from fingerprint artifacts and repeated inconsistencies across temporal and spectral domains. Such artifacts could be well captured by the frequency domain analysis over the spectrogram. Thus, we propose a novel use of long-range spectro-temporal modulation feature -- 2D DCT over log-Mel spectrog...
Chapter
Multiple studies in the past have shown that there is a strong correlation between human vocal characteristics and facial features. However, existing approaches generate faces simply from voice, without exploring the set of features that contribute to these observed correlations. A computational methodology to explore this can be devised by rephras...
Preprint
Full-text available
Automatic speaker verification (ASV) systems utilize the biometric information in human speech to verify the speaker's identity. The techniques used for performing speaker verification are often vulnerable to malicious attacks that attempt to induce the ASV system to return wrong results, allowing an impostor to bypass the system and gain access. A...
Preprint
In the pathogenesis of COVID-19, impairment of respiratory functions is often one of the key symptoms. Studies show that in these cases, voice production is also adversely affected -- vocal fold oscillations are asynchronous, asymmetrical and more restricted during phonation. This paper proposes a method that analyzes the differential dynamics of t...
Preprint
Phonation, or the vibration of the vocal folds, is the primary source of vocalization in the production of voiced sounds by humans. It is a complex bio-mechanical process that is highly sensitive to changes in the speaker's respiratory parameters. Since most symptomatic cases of COVID-19 present with moderate to severe impairment of respiratory fun...
Article
Full-text available
Social networking sites offer an important means for increasing the accessibility and enabling new forms of health communication between the public and medical social influencers (MSIs). MSIs have a social presence and are perceived as a credible source of health-related information. A research gap, however, exists in understanding the communicatio...
Conference Paper
Full-text available
Not all knowledge is created equal. A hierarchical architecture is a method to classify knowledge for use in the field of human cognition and computational creativity. This paper introduces an Insight-Knowledge Object (IKO) model as a framework for Artificial Creative Intelligence (ACI), a step forward in the pursuit of replicating general human in...
Poster
Full-text available
Based on the paper "Artificial Creative Intelligence: Breaking the Imitation Barrier", with a focus on a music use case.
Article
Full-text available
The COVID-19 global pandemic has placed undue stress on health systems all around the world. However, little is known about the impact of exposure to the virus on growing foetuses or the course of COVID-19 in pregnant women, who are often asymptomatic. To develop effective policies and recommendations, robust data for both asymptomatic and symptoma...
Preprint
Weakly labelled learning has garnered a lot of attention in recent years due to its potential to scale Sound Event Detection (SED). The paper proposes a Multi-Task Learning (MTL) framework for learning from weakly labelled audio data which encompasses the traditional Multiple Instance Learning (MIL) setup. The MTL framework uses two-step attention me...
Article
Corporate leader messages posted by senior management play a pivotal role in building relationships with stakeholders in the professional corporate communication context and such messages often explicitly or implicitly draw on prior texts to establish credibility. This mixed methods study seeks to analyse how intertextuality is manifested linguisti...
Article
BACKGROUND In this study, we determined the safety, tolerability, efficacy and compliance of ‘Positeve’ melt-in-mouth tablets, an oral iron preparation along with micronutrients crucial for healthy blood cells and women’s health. The tablets were administered to young girls and non-pregnant women as part of daily dietary supplementation in maintain...
Preprint
Full-text available
BACKGROUND COVID-19 has posed an unprecedented challenge to governments worldwide. Effective government communication of COVID-19 information with the public is of crucial importance. OBJECTIVE We investigated how the most-read government-owned newspaper in China, People’s Daily, utilized a social networking site, Sina Weibo, to communicate about...
Article
Full-text available
Background: COVID-19 has posed an unprecedented challenge to governments worldwide. Effective government communication of COVID-19 information with the public is of crucial importance. Objective: We investigated how the most-read state-owned newspaper in China, People's Daily, utilized an online social networking site, Sina Weibo, to communicate...
Article
Full-text available
Research article abstracts often convince readers that the article is worth reading. Therefore, they rely not only on the quality of arguments or novelty of findings to persuade readers but also linguistic markers in the form of metadiscourse to assert a position on an issue, increase readability of a text, engage readers, and avoid objection to th...
Article
BACKGROUND Meagre data exist as to whether health care professionals consider and advocate menstrual cups as a safe and feasible alternative to the generally used methods of menstruation management. The attitudes and practices among girls and women with respect to the use of menstrual cups are also not known. Therefore, we conducted a cross-sectional...
Article
Full-text available
Food irradiation provides an effective means for controlling the physiological processes causing spoilage and for eradication of microbes, insect pests and parasites. Irradiation has a multipurpose role in food processing and is applicable to a variety of food commodities such as fruits, vegetables, cereals, pulses, spices, meat, poultry and seafood...
