
Kripabandhu Ghosh - Indian Statistical Institute
About
94 Publications
22,266 Reads
1,564 Citations
Additional affiliations
August 2009 - September 2016
Publications (94)
Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation's nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for...
We propose the task of legal question generation (QG) as an application in Legal NLP. Specifically, the task is to generate a question, given a context and an optional keyword. We create the first dataset for the QG task in the legal domain, called LegalQ, consisting of 2023 <context, question> pairs spanning the legal systems of multiple countries...
In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India a...
Identification of rhetorical roles like facts, arguments, and final judgments is central to understanding a legal case document and can lend power to other downstream tasks like legal case summarization and judgment prediction. However, there are several challenges to this task. Legal documents are often unstructured and contain a specialized vocab...
In this paper, we address the task of semantic segmentation of legal documents through rhetorical role classification, with a focus on Indian legal judgments. We introduce LegalSeg, the largest annotated dataset for this task, comprising over 7,000 documents and 1.4 million sentences, labeled with 7 rhetorical roles. To benchmark performance, we ev...
Large Language Models (LLMs) have significantly impacted nearly every domain of human knowledge. However, the explainability of these models, especially to laypersons, which is crucial for instilling trust, has been examined through various skeptical lenses. In this paper, we introduce a novel notion of LLM explainability to laypersons, termed ReQuestin...
The integration of artificial intelligence (AI) in legal judgment prediction (LJP) has the potential to transform the legal landscape, particularly in jurisdictions like India, where a significant backlog of cases burdens the legal system. This paper introduces NyayaAnumana, the largest and most diverse corpus of Indian legal cases compiled for LJP...
The escalating number of pending cases is a growing concern worldwide. Recent advancements in digitization have opened up possibilities for leveraging artificial intelligence (AI) tools in the processing of legal documents. Adopting a structured representation for legal documents, as opposed to a mere bag-of-words flat text representation, can sign...
Automatic summarization of legal case documents is an important and challenging problem, where algorithms attempt to generate summaries that match well with expert-generated summaries. This work takes the first step in analyzing expert-generated summaries and algorithmic summaries of legal case documents. We try to uncover how law experts write sum...
This paper tackles the challenge of building robust and generalizable bias mitigation models for language. Recognizing the limitations of existing datasets, we introduce ANUBIS, a novel dataset with 1507 carefully curated sentence pairs encompassing nine social bias categories. We evaluate state-of-the-art models like T5, utilizing Supervised Fine-...
Automatic summarization of legal case judgements, which are known to be long and complex, has traditionally been tried via extractive summarization models. In recent years, generative models including abstractive summarization models and Large language models (LLMs) have gained huge popularity. In this paper, we explore the applicability of such mo...
Large Language Models (LLMs) have demonstrated impressive performance across a wide range of NLP tasks, including summarization. Inherently, LLMs produce abstractive summaries, and the task of achieving extractive summaries through LLMs still remains largely unexplored. To bridge this gap, in this work, we propose a novel framework LaMSUM to generat...
In the era of Large Language Models (LLMs), predicting judicial outcomes poses significant challenges due to the complexity of legal proceedings and the scarcity of expert-annotated datasets. Addressing this, we introduce Prediction with Explanation (PredEx), the largest expert-annotated dataset for legal judgment predict...
Despite the availability of vast amounts of data, legal data is often unstructured, making it difficult even for law practitioners to ingest and comprehend the same. It is important to organise the legal information in a way that is useful for practitioners and downstream automation tasks. The word ontology was used by Greek philosophers to discuss...
"Artificial intelligence is growing up fast, as are robots whose facial expressions can elicit empathy and make your mirror neurons quiver." - Diane Ackerman, American poet. Summary: Despite the availability of vast amounts of data, legal data is often unstructured, making it difficult even for law practitioners to ingest and comprehend the same. It is...
Continual Learning (CL) involves training a machine learning model in a sequential manner to learn new information while retaining previously learned tasks without the presence of previous training data. Although there has been significant interest in CL, most recent CL approaches in computer vision have focused on convolutional architectures only....
Automatic summarization of legal case judgements has traditionally been attempted by using extractive summarization methods. However, in recent years, abstractive summarization models are gaining popularity since they can generate more natural and coherent summaries. Legal domain-specific pre-trained abstractive summarization models are now availab...
Artificial Intelligence (AI), Machine Learning (ML), Information Retrieval (IR) and Natural Language Processing (NLP) are transforming the way legal professionals and law firms approach their work. The significant potential for the application of AI to Law, for instance, by creating computational solutions for legal tasks, has intrigued researchers...
Summarization of legal case judgement documents is a practical and challenging problem, for which many summarization algorithms of different varieties have been tried. In this work, rather than developing yet another summarization algorithm, we investigate if intelligently ensembling (combining) the outputs of multiple (base) summarization algorith...
This report describes the 2nd edition of the Symposium on Artificial Intelligence and Law (SAIL), organized as a virtual event during June 6--9, 2022. The aim of SAIL is to bring together experts from industry and academia to discuss the scope and future of AI as applied to the legal domain. The symposium is also meant to foster collaborati...
Microblogging sites such as Twitter play an important role in dealing with various mass emergencies including natural disasters and pandemics. The FIRE 2022 track on Information Retrieval from Microblogs during Disasters (IRMiDis) focused on two important tasks -- (i) to detect the vaccine-related stance of tweets related to COVID-19 vaccines, and...
Estimating the similarity between two legal case documents is an important and challenging problem, having various downstream applications such as prior-case retrieval and citation recommendation. There are two broad approaches for the task — citation network-based and text-based. Prior citation network-based approaches consider citations only to p...
Summarization of legal case judgement documents is a challenging problem in Legal NLP. However, not much analyses exist on how different families of summarization models (e.g., extractive vs. abstractive) perform when applied to legal case documents. This question is particularly important since many recent transformer-based abstractive summarizati...
The task of rhetorical role labeling is to assign labels (such as Fact, Argument, Final Judgement, etc.) to sentences of a court case document. Rhetorical role labeling is an important problem in the field of Legal Analytics, since it can aid in various downstream tasks as well as enhances the readability of lengthy case documents. The task is chal...
In the domain of legal information retrieval, an important challenge is to compute similarity between two legal documents. Precedents (statements from prior cases) play an important role in the Common Law system, where lawyers need to frequently refer to relevant prior cases. Measuring document similarity is one of the most crucial aspects of any d...
In a Common Law system, legal practitioners need frequent access to prior case documents that discuss relevant legal issues. Case documents are generally very lengthy, containing complex sentence structures, and reading them fully is a strenuous task even for legal practitioners. Having a concise overview of these documents can relieve legal practi...
Automatic summarization of legal case documents is an important and practical challenge. Apart from many domain-independent text summarization algorithms that can be used for this purpose, several algorithms have been developed specifically for summarizing legal case documents. However, most of the existing algorithms do not systematically incorpor...
A large fraction of textual data available today contains various types of “noise,” such as OCR noise in digitized documents, noise due to informal writing style of users on microblogging sites, and so on. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e.,...
With the surge in user-generated textual information, there has been a recent increase in the use of summarization algorithms for providing an overview of the extensive content. Traditional metrics for evaluation of these algorithms (e.g. ROUGE scores) rely on matching algorithmic summaries to human-generated ones. However, it has been shown that w...
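For context, the traditional ROUGE-style evaluation this abstract refers to reduces to n-gram overlap against a human-written summary. A minimal sketch of ROUGE-1 recall on toy sentences (the function name and example texts are illustrative, not from the paper):

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """ROUGE-1 recall: fraction of reference unigrams covered by the
    candidate, with clipped counts as in the standard definition."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())

# 3 of the 5 reference unigram tokens are covered -> 0.6
print(rouge1_recall("the court allowed the appeal", "the appeal was allowed"))
```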
Most existing algorithms for cross-modal Information Retrieval are based on a supervised train-test setup, where a model learns to align the mode of the query (e.g., text) to the mode of the documents (e.g., images) from a given training set. Such a setup assumes that the training set contains an exhaustive representation of all possible classes of...
During a disaster event, two types of information that are especially useful for coordinating relief operations are needs and availabilities of resources (e.g., food, water, medicines) in the affected region. Information posted on microblogging sites is increasingly being used for assisting post-disaster relief operations. In this context, two prac...
Computing similarity between two legal case documents is an important and challenging task in Legal IR, for which text-based and network-based measures have been proposed in literature. All prior network-based similarity methods considered a precedent citation network among case documents only (PCNet). However, this approach misses an important sou...
Although a lot of research has been done on utilising Online Social Media during disasters, there exists no system for a specific task that is critical in a post-disaster scenario -- identifying resource-needs and resource-availabilities in the disaster-affected region, coupled with their subsequent matching. To this end, we present NARMADA, a semi...
Computing similarity between two legal documents is an important and challenging task in the domain of Legal Information Retrieval. Finding similar legal documents has many applications in downstream tasks, including prior-case retrieval, recommendation of legal articles, and so on. Prior works have proposed two broad ways of measuring similarity b...
Twitter is an active communication channel for the spreading of updated information in emergency situations. Retrieving specific information related to infrastructure damage offers situational views to the concerned authorities, who can take necessary action to disburse help. However, such usages of Twitter demand significant accuracy of the re...
In the last few years, microblogging sites like Twitter have evolved into a repository of critical situational information during various mass emergencies. However, messages posted on microblogging sites often contain non-actionable information such as sympathy and prayer for victims. Moreover, messages sometimes contain rumors and overstated facts....
Twitter provides important information for emergency responders in the rescue process during disasters. However, tweets containing relevant information are sparse and are usually hidden in a vast set of noisy contents. This leads to inherent challenges in generating suitable training data that are required for neural network models. In this paper,...
Automatically understanding the rhetorical roles of sentences in a legal case judgement is an important problem to solve, since it can help in several downstream tasks like summarization of legal judgments, legal search, and so on. The task is challenging since legal case documents are usually not well-structured, and these rhetorical roles may be...
As the amount of user-generated textual content grows rapidly, text summarization algorithms are increasingly being used to provide users a quick overview of the information content. Traditionally, summarization algorithms have been evaluated only based on how well they match human-written summaries (e.g. as measured by ROUGE scores). In this work,...
Microblogging sites like Twitter are the important sources of real-time information during disaster/emergency events. During such events, the critical situational information posted is immersed in a lot of conversational content; hence, reliable methodologies are needed for extracting the meaningful information. In this paper, we focus on a particu...
Summarization of legal case judgments is an important problem because the huge length and complexity of such documents make them difficult to read as a whole. Many summarization algorithms have been proposed till date, both for general text documents and a few specifically targeted to summarizing legal documents of various countries. However, to ou...
The Second Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) was held in conjunction with The Web Conference (WWW) 2018 at Lyon, France. A primary aim of the workshop was to promote multi-modal and multi-view information retrieval from the social media content in disaster situations. The workshop programme inclu...
In countries like the US, European countries, Australia and Japan, user-generated content from microblogging sites is extensively used for crowdsourcing actionable information during disasters. However, there has been limited work in this direction in India. Moreover, there has been a limited attempt to verify the credibility of the information ext...
As the amount of textual information grows rapidly, text summarization algorithms are increasingly being used to provide users a quick overview of the information content. Traditionally, summarization algorithms have been evaluated only based on how well they match human-written summaries (as measured by ROUGE scores). In this work, we propose to e...
Online Social Media, such as Twitter, Facebook and WhatsApp, are important sources of real-time information related to emergency events, including natural calamities, man-made disasters, epidemics, and so on. There has been a lot of recent work on designing information systems that would be useful for aiding post-disaster relief operations, as w...
We propose to evaluate extractive summarization algorithms from a completely new perspective. Considering that an extractive summarization algorithm selects a subset of the textual units in the input data for inclusion in the summary, we investigate whether this selection is fair. We use several summarization algorithms over datasets that have a se...
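The fairness question posed above can be made concrete with a simple representation check: compare each group's share of the selected summary units against its share of the input. A minimal sketch, assuming each textual unit carries a group label (data and function name are hypothetical):

```python
from collections import Counter

def representation_ratios(all_units, summary_units):
    """For each group, the ratio of its share in the summary to its share
    in the input. A ratio near 1.0 means proportional representation."""
    total = Counter(group for _, group in all_units)
    selected = Counter(group for _, group in summary_units)
    n_all, n_sum = len(all_units), len(summary_units)
    return {g: (selected.get(g, 0) / n_sum) / (total[g] / n_all)
            for g in total}

# Toy input: four (text, group) units; a summarizer selected two, both from A.
units = [("t1", "A"), ("t2", "A"), ("t3", "B"), ("t4", "B")]
summary = [("t1", "A"), ("t2", "A")]
print(representation_ratios(units, summary))  # {'A': 2.0, 'B': 0.0}
```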
User-generated content on online social media (OSM) platforms has become an important source of real-time information during emergency events. The SMERP workshop series aims to provide a forum for researchers working on utilizing OSM for emergency preparedness and aiding post-emergency relief operations. The workshop aims to bring together research...
During a disaster event, it is essential to know about needs and availabilities of different types of resources, for coordinating relief operations. Microblogging sites are frequently used for aiding post-disaster relief operations, and there have been prior attempts to identify tweets that inform about resource needs and availabilities (termed as...
The Web has several information sources on which an ongoing event is discussed. To get a complete picture of the event, it is important to retrieve information from multiple sources. We propose a novel neural network based model which integrates the embeddings from multiple sources, and thus retrieves information from them jointly...
Effective clustering of short documents, such as tweets, is difficult because of the lack of sufficient semantic context. Word embedding is a technique that is effective in addressing this lack of semantic context. However, the process of word vector embedding, in turn, relies on the availability of sufficient contexts to learn the word association...
Cross-lingual information retrieval is a challenging task in the absence of aligned parallel corpora. In this paper, we address this problem by considering topically aligned corpora designed for evaluating an IR setup. To emphasize, we neither use any sentence-aligned corpora or document-aligned corpora, nor do we use any language specific resource...
Computing the similarity between two legal documents is an important challenge in the Legal Information Retrieval domain. Efficient calculation of this similarity has useful applications in various tasks such as identifying relevant prior cases for a given case document. Prior works have proposed network-based and text-based methods for measuring s...
Automatically identifying catchphrases from legal court case documents is an important problem in Legal Information Retrieval, which has not been extensively studied. In this work, we propose an unsupervised approach for extraction and ranking of catchphrases from court case documents, by focusing on noun phrases. Using a dataset of gold standard c...
Stemming is a vital step employed to improve retrieval performance through efficient unification of morphological variants of a word. We propose an unsupervised, context-specific stemming algorithm for microblogs, based on both local and global word embeddings, which is capable of handling the informal, noisy vocabulary of microblogs. Experiments o...
The first international workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) was held in conjunction with the 2017 European Conference on Information Retrieval (ECIR) in Aberdeen, Scotland, UK. The aim of the workshop was to explore various technologies for extracting useful information from social media content in...
Microblogging sites like Twitter are increasingly being used for aiding post-disaster relief operations. In such situations, identifying needs and availabilities of various types of resources is critical for effective coordination of the relief operations. We focus on the problem of automatically identifying tweets that inform about needs and avail...
Microblogging sites like Twitter and Weibo have emerged as important sources of real-time information on ongoing events, including socio-political events, emergency events, and so on. For instance, during emergency events (such as earthquakes, floods, terror attacks), microblogging sites are very useful for gathering situational information in real-...
IR methods are increasingly being applied over microblogs to extract real-time information, such as during disaster events. In such sites, most of the user-generated content is written informally – the same word is often spelled differently by different users, and words are shortened arbitrarily due to the length limitations on microblogs. Stemming...
In relevance feedback, first-round search results are used to boost second-round search results. Two forms have been traditionally considered: exhaustively labelled feedback, where all first-round results to depth k are annotated for relevance by the user; and blind feedback, where the top-k results are all assumed to be relevant. In this paper, we...
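Of the two traditional forms named above, blind feedback can be sketched as pseudo-relevance query expansion: assume the top-k first-round results are relevant and grow the query from their vocabulary. This illustrates the baseline the abstract contrasts against, not the paper's proposed method; documents and parameters are hypothetical:

```python
from collections import Counter

def blind_feedback_expand(query_terms, ranked_docs, k=2, n_terms=2):
    """Blind feedback: treat the top-k first-round results as relevant and
    append their most frequent terms not already in the query."""
    pool = Counter()
    for doc in ranked_docs[:k]:
        pool.update(t for t in doc.split() if t not in query_terms)
    expansion = [term for term, _ in pool.most_common(n_terms)]
    return list(query_terms) + expansion

# Hypothetical first-round ranking for the one-term query ["flood"].
docs = ["flood rescue boats deployed", "flood relief boats arrive",
        "cricket match result"]
print(blind_feedback_expand(["flood"], docs))
```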
OCR errors hurt retrieval performance to a great extent. Research has been done on modelling and correction of OCR errors. However, most of the existing systems use language dependent resources or training texts for studying the nature of errors. Not much research has been reported on improving retrieval performance from erroneous text when no trai...
Researchers have shown that a weighted linear combination in data fusion can produce better results than an unweighted combination. Many techniques have been used to determine the linear combination weights. In this work, we have used the Genetic Algorithm (GA) for the same purpose. The GA is not new and it has been used earlier in several other ap...
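The idea above, a weighted linear combination of system scores with weights searched by a genetic algorithm, can be sketched end-to-end on toy data. The scores, the single-weight encoding, and the reciprocal-rank fitness are illustrative assumptions, not the paper's exact setup:

```python
import random

def fuse(w, s1, s2):
    """Weighted linear combination (CombSUM-style) of two systems' scores."""
    return {d: w * s1.get(d, 0.0) + (1 - w) * s2.get(d, 0.0)
            for d in set(s1) | set(s2)}

def fitness(w, s1, s2, relevant):
    """Reciprocal rank of the known-relevant document under the fused ranking."""
    fused = fuse(w, s1, s2)
    ranked = sorted(fused, key=fused.get, reverse=True)
    return 1.0 / (ranked.index(relevant) + 1)

def ga_weight(s1, s2, relevant, pop_size=10, gens=20, seed=0):
    """Toy genetic algorithm over one fusion weight w in [0, 1]:
    keep the fitter half each generation, refill by Gaussian mutation."""
    rng = random.Random(seed)
    pop = [rng.random() for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda w: fitness(w, s1, s2, relevant), reverse=True)
        parents = pop[: pop_size // 2]
        children = [min(1.0, max(0.0, rng.choice(parents) + rng.gauss(0, 0.1)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, s1, s2, relevant))

# Toy scores: system 2 ranks the relevant doc "d2" highest, system 1 does not,
# so the GA should learn to down-weight system 1.
s1 = {"d1": 0.9, "d2": 0.1}
s2 = {"d1": 0.2, "d2": 0.8}
w = ga_weight(s1, s2, relevant="d2")
print(round(w, 3), fitness(w, s1, s2, "d2"))
```

Real data-fusion setups optimize over many systems and many queries; a single weight over two score lists keeps the mechanics of selection and mutation visible.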
Standard test collections form the very basis of Information Retrieval research and evaluation. Important datasets have been created to promote empirical research and experimentation. In this paper, we describe our endeavour in creating a test collection from old, archived writings of IR stalwarts. The documents are created in text format from the...
In this paper, we present our work in the RISOT track of FIRE 2011. Here, we describe an error modeling technique for OCR errors in an Indic script. Based on the error model, we apply a two-fold error correction method on the OCRed corpus. First, we correct the corpus by correction with full confidence and correction without full confidence approac...
The extreme brevity of Microblog posts (such as 'tweets') exacerbates the well-known vocabulary mismatch problem when retrieving tweets in response to user queries. In this study, we explore various query expansion approaches as a way to address this problem. We use the Web as a source of query expansion terms. We also tried a variation of a standa...
Information Retrieval performance is hurt to a great extent by OCR errors. Much research has been reported on modelling and correction of OCR errors. However, all the existing systems make use of language dependent resources or training texts to study the nature of errors. No research has been reported on improving retrieval performance from errone...