Sreyan Ghosh's scientific contributions

Publications (19)

Preprint
Full-text available
The expression of emotions is a crucial part of daily human communication. Modeling the conversational and sequential context has seen much success and plays a vital role in Emotion Recognition in Conversations (ERC). However, existing approaches either model only one of the two or employ naive late-fusion methodologies to obtain final utterance re...
Preprint
Full-text available
Self-supervised learning (SSL) to learn high-level speech representations has been a popular approach to building Automatic Speech Recognition (ASR) systems in low-resource settings. However, the common assumption made in literature is that a considerable amount of unlabeled data is available for the same domain or language that can be leveraged fo...
Preprint
Full-text available
Emotion Recognition (ER) aims to classify human utterances into different emotion categories. Based on early-fusion and self-attention-based multimodal interaction between text and acoustic modalities, in this paper, we propose a multimodal multitask learning approach for ER from individual utterances in isolation. Experiments on the IEMOCAP benchm...
Preprint
Full-text available
While self-supervised speech representation learning (SSL) models serve a variety of downstream tasks, these models have been observed to overfit to the domain from which the unlabelled data originates. To alleviate this issue, we propose PADA (Pruning Assisted Domain Adaptation) and zero out redundant weights from models pre-trained on large amoun...
Preprint
Full-text available
Existing approaches in disfluency detection focus on solving a token-level classification task for identifying and removing disfluencies in text. Moreover, most works focus on leveraging only contextual information captured by the linear sequences in text, thus ignoring the structured information in text which is efficiently captured by dependency...
Preprint
Full-text available
Inspired by the recent progress in self-supervised learning for computer vision, in this paper, through the DeLoRes learning framework, we introduce two new general-purpose audio representation learning approaches, the DeLoRes-S and DeLoRes-M. Our main objective is to make our network learn representations in a resource-constrained setting (both da...
Preprint
Full-text available
In the current era of the internet, where social media platforms are easily accessible for everyone, people often have to deal with threats, identity attacks, hate, and bullying due to their association with a caste, creed, gender, religion, or even acceptance or rejection of a notion. Existing works in hate speech detection primarily focus on indiv...
Preprint
Full-text available
We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations. Our system is based on clustering: it utilizes an offline clustering step to provide target labels that act as pseudo-labels for solving a prediction task. We develop on top of recent advances in self-supervised learning for computer visi...
Preprint
Full-text available
Toxic speech, also known as hate speech, is regarded as one of the crucial issues plaguing online social media today. Most recent work on toxic speech detection is constrained to the modality of text with no existing work on toxicity detection from spoken utterances. In this paper, we propose a new Spoken Language Processing task of detecting toxic...
Preprint
Full-text available
Social network platforms are generally used to share positive, constructive, and insightful content. However, in recent times, people often get exposed to objectionable content such as threats, identity attacks, hate speech, insults, obscene texts, offensive remarks, or bullying. Existing work on toxic speech detection focuses on binary classification o...
Preprint
Full-text available
This paper describes our proposed system for the AAAI-CAD21 shared task: Predicting Emphasis in Presentation Slides. In this specific task, given the contents of a slide we are asked to predict the degree of emphasis to be laid on each word in the slide. We propose 2 approaches to this problem including a BiLSTM-ELMo approach and a transformers bas...
Chapter
The number of Internet users grows every moment with the increasing population and accessibility of the Internet. As the number of users grows, so does the number of controversies, arguments, and abuses of all kinds. It becomes necessary for social media and other sites to identify toxic content amongst a large number of co...
Preprint
Named entity recognition (NER) from text has been a widely studied problem and usually extracts semantic information from text. Until now, NER from speech has mostly been studied in a two-step pipeline process that includes first applying an automatic speech recognition (ASR) system on an audio sample and then passing the predicted transcript to an NER ta...
Article
Full-text available
With the advancement in technology, we are offered new opportunities for long-term monitoring of health conditions. There are a tremendous number of opportunities in psychiatry, where the diagnosis relies on the historical data of the patient as well as the states of mood that increase the complexity of distinguishing between bipolar disorder or borderlin...
Chapter
With increasing population levels and poverty rates, it has become a major problem for non-profit organizations and agencies to ensure that the right people receive assistance. The world’s poorest typically cannot provide the necessary income and expense records to prove that they qualify for aid. In Latin America, one popular method to veri...

Citations

... Spoken named entity recognition has recently received significant attention [18,53,65]. It is often addressed in one of two ways: end-to-end (E2E) or pipeline (i.e., a combination of an ASR model and an NLP NER model). ...
... Hate speech detection has branched into several sub-tasks such as toxic span extraction [30,31], rationale identification [32], and hate target identification [20]. Recent advances in NLP, such as transformers [25] and graph neural networks [33,25,34], have pushed the limits of hate speech identification, and some work attempts to induce external knowledge by leveraging author profiling [25] or ideology [35]. Using the context of the conversation, however, is still a challenge, with very little work exploring this problem. ...
... Two RNN variants, long short-term memory (LSTM) [17] and gated recurrent unit (GRU) [18], are also popular choices due to their ability to overcome the exploding and vanishing gradient problems that exist in the vanilla RNN model. Bi-LSTM and Bi-GRU [42] are known for their potential to capture backward and forward contextual features. BERT [43], built on the Transformer model [37], adopts the multi-headed attention mechanism, which allows the model to learn how each word in a sentence is attended to by every other word to enrich contextual understanding. ...
... The model is validated using the Alan Turing Institute synthesized signature database within pattern psychiatric repository. The end outcome has an AUC of 0.95, which is higher than the previous result of 0.90 [18]. ...