
Arnab Bhattacharya
- PhD
- Professor at Indian Institute of Technology Kanpur
About
126 Publications
30,628 Reads
1,338 Citations
Introduction
I am very interested in working with Indian languages and, in particular, the computational aspects.
Current institution
Additional affiliations
December 2007 - June 2014
December 2007 - present
December 2007 - present
Education
September 2002 - July 2007
August 1997 - May 2001
Publications (126)
Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLMs use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level...
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages (Sanskrit, Ancient Greek and Latin) to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named...
In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India a...
Automating legal document drafting can significantly enhance efficiency, reduce manual effort, and streamline legal workflows. While prior research has explored tasks such as judgment prediction and case summarization, the structured generation of private legal documents in the Indian legal domain remains largely unaddressed. To bridge this gap, we...
In this paper, we address the task of semantic segmentation of legal documents through rhetorical role classification, with a focus on Indian legal judgments. We introduce LegalSeg, the largest annotated dataset for this task, comprising over 7,000 documents and 1.4 million sentences, labeled with 7 rhetorical roles. To benchmark performance, we ev...
The integration of artificial intelligence (AI) in legal judgment prediction (LJP) has the potential to transform the legal landscape, particularly in jurisdictions like India, where a significant backlog of cases burdens the legal system. This paper introduces NyayaAnumana, the largest and most diverse corpus of Indian legal cases compiled for LJP...
This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented...
Graph isomorphism, a classical algorithmic problem, determines whether two input graphs are structurally identical or not. Interestingly, it is one of the few problems that is not yet known to belong to either the P or NP-complete complexity classes. As such, intelligent search-space pruning based strategies were proposed for developing isomorphism...
In the era of Large Language Models (LLMs), predicting judicial outcomes poses significant challenges due to the complexity of legal proceedings and the scarcity of expert-annotated datasets. Addressing this, we introduce Prediction with Explanation (PredEx), the largest expert-annotated dataset for legal judgment predict...
Adversarial attacks on Graph Neural Networks (GNNs) reveal their security vulnerabilities, limiting their adoption in safety-critical applications. However, existing attack strategies rely on the knowledge of either the GNN model being used or the predictive task being attacked. Is this knowledge necessary? For example, a graph may be used for mult...
This report describes the 2nd edition of the Symposium on Artificial Intelligence and Law (SAIL) organized as a virtual event during June 6–9, 2022. The aim of SAIL is to bring together experts from the industry and the academia to discuss the scope and future of AI as applied to the legal domain. The symposium is also meant to foster collaborati...
We present Chandojñānam, a web-based Sanskrit meter (Chanda) identification and utilization system. In addition to the core functionality of identifying meters, it sports a friendly user interface to display the scansion, which is a graphical representation of the metrical pattern. The system supports identification of meters from uploaded imag...
Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of high-quality corpora that is needed to develop such data-driven systems. The problem gets even more pron...
In this paper, we develop a multivariate regression model and a neural network model to predict the Reynolds number (Re) and Nusselt number in turbulent thermal convection. We compare their predictions with those of earlier models of convection: Grossmann–Lohse [Phys. Rev. Lett. 86, 3316 (2001)], revised Grossmann–Lohse [Phys. Fluids 33, 015113 (20...
Knowledge bases (KB) are an important resource in a number of natural language processing (NLP) and information retrieval (IR) tasks, such as semantic search, automated question-answering etc. They are also useful for researchers trying to gain information from a text. Unfortunately, however, the state-of-the-art in Sanskrit NLP does not yet allow...
Graph neural networks (GNNs) have witnessed significant adoption in the industry owing to impressive performance on various predictive tasks. Performance alone, however, is not enough. Any widely deployed machine learning algorithm must be robust to adversarial attacks. In this work, we investigate this aspect for GNNs, identify vulnerabilities, an...
Legal documents are unstructured, use legal jargon, and have considerable length, making it difficult to process automatically via conventional text processing techniques. A legal document processing system would benefit substantially if the documents could be semantically segmented into coherent units of information. This paper proposes a Rhetoric...
Approximate subgraph matching, which is an important primitive for many applications like question answering, community detection, and motif discovery, often involves large labeled graphs such as knowledge graphs, social networks, and protein sequences. Effective methods for extracting matching subgraphs, in terms of label and structural similariti...
Knowledge graphs (KG) that model the relationships between entities as labeled edges (or facts) in a graph are mostly constructed using a suite of automated extractors, thereby inherently leading to uncertainty in the extracted facts. Modeling the uncertainty as probabilistic confidence scores results in a probabilistic knowledge graph. Graph queri...
The majority of existing graph neural networks (GNNs) learn node embeddings that encode their local neighborhoods but not their positions. Consequently, two nodes that are vastly distant but located in similar local neighborhoods map to similar embeddings in those networks. This limitation prevents accurate performance in predictive tasks that rely o...
In this work, we present a web-based annotation and querying tool Sangrahaka. It annotates entities and relationships from text corpora and constructs a knowledge graph (KG). The KG is queried using templatized natural language queries. The application is language and corpus agnostic, but can be tuned for special needs of a specific language or a c...
An automated system that could assist a judge in predicting the outcome of a case would help expedite the judicial process. For such a system to be practically useful, predictions by the system should be explainable. To promote research in developing such a system, we introduce ILDC (Indian Legal Documents Corpus). ILDC is a large corpus of 35k Ind...
Knowledge graphs (KGs), that have become the backbone of many critical knowledge-centric applications, are mostly automatically constructed based on an ensemble of extraction techniques applied over diverse data sources. It is, therefore, important to establish the provenance of results for a query to determine how these were computed. Provenance i...
Analyzing graphs by representing them in a low dimensional space using Graph Neural Networks (GNNs) is a promising research problem, with a lot of ongoing research. In this paper, we propose GraphReach, a position-aware GNN framework that captures the global positioning of nodes with respect to a set of fixed nodes, referred to as anchors. The mode...
Knowledge graphs (KGs) have increasingly become the backbone of many critical knowledge-centric applications. Most large-scale KGs used in practice are automatically constructed based on an ensemble of extraction techniques applied over diverse data sources. Therefore, it is important to establish the provenance of results for a query to determine...
Subgraph querying is one of the most important primitives in many applications. Although the field is well studied for deterministic graphs, in many situations, the graphs are probabilistic in nature. In this paper, we address the problem of subgraph querying in large probabilistic labeled graphs. We employ a novel algorithmic framework, called Chi...
Classification of text based datasets has many applications in the field of Computer Science. Some of the key application areas include scientific article recommendation, news article tagging, multimedia content search assistance, etc. We are interested in the problem of data placement of text based datasets in a distributed storage system. Distrib...
GRADES-NDA 2019 is the second joint meeting of the GRADES and NDA workshops, which were each independently organized at previous SIGMOD-PODS meetings, GRADES since 2013 and NDA since 2016. The focus of GRADES-NDA is the application areas, usage scenarios, and open challenges in managing large-scale graph-shaped data. To summarize, GRADES-NDA aims t...
The phenomenal growth of graph data from a wide variety of real-world applications has rendered graph querying to be a problem of paramount importance. Traditional techniques use structural as well as node similarities to find matches of a given query graph in a (large) target graph. However, almost all existing techniques have tacitly ignored the...
Optimal location (OL) queries traditionally leverage only static user information to identify the best locations for installing new facilities. However, many services such as fuel stations, ATMs, etc. mostly serve mobile users and, therefore, their placement must take into account the user mobility patterns or trajectories as well. Hence, to find t...
Searching is one of the fundamental tasks in Computer Science. An intuitive way to search is to do it linearly, that is, start at the beginning of the dataset and continue till the searched-for item is found or nothing is found. However, as the volume of data increases, the response time of linear search is no longer acceptable. Indexes are designe...
Nearest neighbor searching of large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. A flavor of approximation is, therefore, necessary to practically solve the problem of nearest neighbor search. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approxima...
Cross-lingual information retrieval is a challenging task in the absence of aligned parallel corpora. In this paper, we address this problem by considering topically aligned corpora designed for evaluating an IR setup. To emphasize, we neither use any sentence-aligned corpora or document-aligned corpora, nor do we use any language specific resource...
The phenomenal growth of graph data from a wide variety of real-world applications has rendered graph querying to be a problem of paramount importance. Traditional techniques use structural as well as node similarities to find matches of a given query graph in a (large) target graph. However, almost all previous research has tacitly ignored the pre...
Money laundering refers to activities pertaining to hiding the true income, evading taxes, or converting illegally earned money for normal use. These activities are often performed through shell companies that masquerade as real companies but whose actual purpose is to launder money. Shell companies are used in all three phases of money lau...
Crowdsourcing, where the power of human thinking is harnessed to answer queries that are otherwise difficult for computers to answer, has been successfully used in many applications. A particularly interesting application of crowdsourcing is crowd mining, where given a dataset, patterns are learned by asking questions to the crowd. Crowd mining...
Tracking the Impact of Fact Deletions on Knowledge Graph Queries using Provenance Polynomials
Stopword removal has traditionally been an integral step in information retrieval pre-processing. In this paper, we question the utility of this step in retrieving relevant documents for verbose queries on standard datasets. We show that stopword removal does not lead to a noticeable difference in retrieval performance as opposed to not removing them...
A key research interest in the area of Constraint Satisfaction Problems (CSP) is to identify tractable classes of constraints and develop efficient algorithms for solving them. In this paper, we propose an optimal algorithm for solving r-ary min-closed and max-closed constraints. Assuming r = O(1), our algorithm has an optimal running time of O(ct)...
Critical business applications in domains ranging from technical support to healthcare increasingly rely on large-scale, automatically constructed knowledge graphs. These applications use the results of complex queries over knowledge graphs in order to help users in taking crucial decisions such as which drug to administer, or whether certain actio...
Facility location problems aim to identify the best locations to set up new services. The majority of existing works typically assume that the users of the service are static. However, there exists a wide array of services such as fuel stations, ATMs, food joints, etc., that are widely accessed by mobile users, besides the static ones. Such traject...
Facility location problems aim to identify the best locations to set up new services. The majority of existing works typically assume that the users are static. However, there exists a wide array of services such as fuel stations, ATMs, food joints, etc., that are widely accessed by mobile users besides the static ones. Such trajectory-aware servic...
Several services today are annotated with points of interest (PoIs) such as "coffee shop", "park", etc. A region of interest (RoI) is a neighborhood that contains PoIs relevant to the user. In this paper, we study the scenario where a user wants to identify the best RoI in a city. The user expresses relevance through a set of keywords denoting PoIs...
We present GradeIT, a system that combines the dual objectives of automated grading and program repairing for introductory programming courses (CS1). Syntax errors pose a significant challenge for testcase-based grading as it is difficult to differentiate between a submission that is almost correct and has some minor syntax errors and another submi...
A combinatorial algorithm is presented to find a largest rectangle (LR) inside the inner isothetic cover that tightly inscribes a given digital object without holes. Its running time depends on n, g, and k, which denote the number of pixels on the contour of the digital object, the grid size, and the number of convex regions, respectively. Certain combinato...
Labeled graphs provide a natural way of representing entities, relationships and structures within real datasets such as knowledge graphs and protein interactions. Applications such as question answering, semantic search, and motif discovery entail efficient approaches for subgraph matching involving both label and structural similarities. Given th...
Optimal location queries identify the best locations to set up new facilities for providing service to its users. For several businesses such as fuel stations, cellphone base-stations, etc., placement queries require taking into account the mobility patterns (or trajectories) of the users. In this work, we formulate the TOPS (Trajectory-Aware Optim...
Skyline queries enable multi-criteria optimization by filtering objects that are worse in all the attributes of interest than another object. To handle the large answer set of skyline queries in high-dimensional datasets, the concept of k-dominance was proposed where an object is said to dominate another object if it is better (or equal) in at leas...
Optimal location queries identify the best locations to set up new facilities for providing service to the users. For several businesses such as gas stations, cellphone base-stations, etc., placement queries require taking into account the mobility patterns (or trajectories) of the users. In this work, we formulate the TOPS (Trajectory-Aware Optima...
Unraveling "interesting" subgraphs corresponding to disease/crime hotspots or characterizing habitation shift patterns is an important graph mining task. With the availability and growth of large-scale real-world graphs, mining for such subgraphs has become the need of the hour for graph miners as well as non-technical end-users. In this demo, we p...
In this paper we show how skylines can be used to improve the stable matching algorithm with asymmetric preference sets for men and women. The skyline set of men (or women) in a dataset comprises those who are not worse off in all the qualities in comparison to another man (or woman). We prove that if a man in the skyline set is matched with a w...
We present a combinatorial algorithm which runs in \(O(n \log n)\) time to find the largest rectangle (LR) inside a given digital object without holes, n being the number of pixels on the contour of the digital object. The object is imposed on a background isothetic grid and the inner isothetic cover is obtained for a particular grid size, g, which tightly inscr...
Skyline queries retrieve promising data objects that are not dominated in all the attributes of interest. However, in many cases, a user may not be interested in a skyline set computed over the entire dataset, but rather over a specified range of values for each attribute. For example, a user may look for hotels only within a specified budget and/o...
This work presents an algorithm to generate simple closed random triangular digital curves of finite length imposed on a background triangular grid. A novel timestamp-based combinatorial technique is incorporated to allow the curve to grow freely without intersecting itself. The algorithm runs in linear time as a fixed set of vertices are consulted...
The multi-criteria decision making, made possible by the advent of skyline queries, has been successfully applied in many areas. Though most of the earlier work is concerned with only a single relation, several real world applications require finding the skyline set over multiple relations. Consequently, the join operation over skylines where the p...
The infamous Wikileaks cables are a large-scale resource for analyzing international relationships. We use sentiment analysis on this dataset to extract opinion polarities in the international scenario. We use an unsupervised approach based on standard sentiment lexicon with modifiers to mine opinion polarities among the cables to and from embassie...
In this paper, we test the hypothesis that for a particular item recommendation, it matters more how a friend (i.e., another user who is socially connected) has rated than a random user. To test this, we propose a matrix factorization based collaborative filtering approach that utilizes the social connections as an additional term in the objective...
A fast linear-time algorithm to generate non-intersecting closed random orthogonal (4-connected) digital curves of finite length imposed on a background grid is proposed in this paper. A novel timestamp-based combinatorial technique is used so that the curve grows freely without intersecting itself. The combinatorial constraints are further modifi...
We design and evaluate algorithms for efficient user-mobility driven macro-cell planning in cellular networks. As cellular networks embrace heterogeneous technologies (including long range 3G/4G and short range WiFi, Femto-cells, etc.), most traffic generated by static users gets absorbed by the short-range technologies, thereby increasingly leavin...
Fundamentals of Database Indexing and Searching presents well-known database searching and indexing techniques. It focuses on similarity search queries, showing how to use distance functions to measure the notion of dissimilarity. After defining database queries and similarity search queries, the book organizes the most common and representative in...
The steady growth of graph data in various applications has resulted in wide-spread research in finding significant sub-structures in a graph. In this paper, we address the problem of finding statistically significant connected subgraphs where the nodes of the graph are labeled. The labels may be either discrete where they assume values from a pre-...
Optimizing XML queries is an intensively studied problem in the field of databases of late. The topic has a host of applications, viz., web-scale XML and keyword search. In this paper, we address the problem of efficient execution of XML path queries (commonly known as XPath queries), branch queries and wild-card queries. Our index structure assist...
Emotion recognition has been one of the cornerstones of human-computer interaction. Although decades of work has attacked the problem of automatic emotion recognition from either audio or video signals, the fusion of the two modalities is more recent. In this paper, we aim to tackle the problem when both audio and video data are available in a sync...
In this paper, we study the problem of effective route search in road networks. Given a pair of source and destination locations, the aim is to find a path from the source to the destination that visits k different types of sites in a particular order as prescribed by the user. The route planning problem has two objectives to optimize: minimize the...
In many applications of similarity searching in databases, a set of similar queries appear more frequently. Since it is rare that a query point with its associated parameters (range or number of nearest neighbors) will repeat exactly, intelligent caching mechanisms are required to efficiently answer such queries. In addition, the performance of non...
Active languages such as Bangla (or Bengali) evolve over time due to a variety of social, cultural, economic, and political issues. In this paper, we analyze the change in the written form of the modern phase of Bangla quantitatively in terms of character-level, syllable-level, morpheme-level and word-level features. We collect three different type...
Embodiments of the present disclosure set forth a method for selecting a preferred data set. The method includes generating a candidate data set based on a first data set having a first join attribute, and a first aggregate attribute and a second data set having a second join attribute compatible with the first join attribute, and a second aggregat...
One of the key research interests in the area of Constraint Satisfaction Problem (CSP) is to identify tractable classes of constraints and develop efficient solutions for them. In this paper, we introduce generalized staircase (GS) constraints which is an important generalization of one such tractable class found in the literature, namely, staircas...
This paper serves as a report for the participation of Special Interest Group In Data (SIGDATA), Indian Institute of Technology, Kanpur in the String Similarity Workshop, EDBT, 2013. We present a novel technique to efficiently process edit distance based string similarity queries. Our technique draws upon some previously conducted works in the fiel...
With the tremendous expansion of reservoirs of sequence data stored worldwide, efficient mining of large string databases in various domains including intrusion detection systems, player statistics, texts, and proteins, has emerged as a practical challenge. Searching for an unusual pattern within long strings of data is one of the foremost requirem...
Itemset mining has been an active area of research due to its successful application in various data mining scenarios including finding association rules. Though most of the past work has been on finding frequent itemsets, infrequent itemset mining has demonstrated its utility in web mining, bioinformatics and other fields. In this paper, we propos...
The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such a...
The problem of identification of statistically significant patterns in a sequence of data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, and text analysis. An observed pattern of events is deemed to be statistically significa...
The multi-criteria decision making, which is possible with the advent of skyline queries, has been applied in many areas. Though most of the existing research is concerned with only a single relation, several real world applications require finding the skyline set of records over multiple relations. Consequently, the join operation over skylines wh...
Multi-criteria decision making has been made possible with the advent of skyline queries. However, processing such queries for high dimensional datasets remains a time consuming task. Real-time applications are thus infeasible, especially for non-indexed skyline techniques where the datasets arrive online. In this paper, we propose a caching mechan...
In this paper, we address the problem of answering continuous route planning queries over a road network, in the presence of updates to the delay (cost) estimates of links. A simple approach to this problem would be to recompute the best path for all queries on arrival of every delay update. However, such a naive approach scales poorly when there a...
Many real-life graphs such as social networks and peer-to-peer networks capture the relationships among the nodes by using trust scores to label the edges. Important usage of such networks includes trust prediction, finding the most reliable or trusted node in a local subgraph, etc. For many of these applications, it is crucial to assess the presti...
Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged as a great challenge. Searching for an unusual pattern within long strings of data has emerged as a requirement for diverse applications. Given a string, the pro...
Spatial data are common in many scientific and commercial domains such as geographical information systems and gene/protein expression profiles. Querying for distribution patterns on such data can discover underlying spatial relationships and suggest avenues for further scientific exploration. Supporting such pattern retrieval requires not only the...
Given a spatio-temporal network (ST network) where edge properties vary with time, a time-sub-interval minimum spanning tree (TSMST) is a collection of minimum spanning trees of the ST network, where each tree is associated with a time interval. During this time interval, the total cost of tree is least among all the spanning trees. The TSMST probl...
Questions (2)
Crowd-sourcing is very important nowadays, but what are the research problems in it?
Far reinsertion seems to be more intuitive as it removes the entries that are farther, and therefore, have a greater chance of being inserted in some other node.