Arindam Banerjee

S P Jain School of Global Management · Department of Finance

About

135
Publications
89,163
Reads
18,214
Citations
Citations since 2016
25 Research Items
11,785 Citations


Publications (135)
Article
Full-text available
Simulations of the land surface carbon cycle typically compress functional diversity into a small set of plant functional types (PFT), with parameters defined by the average value of measurements of functional traits. In most earth system models, all wild plant life is represented by between five and 14 PFTs and a typical grid cell (≈100 × 100 km)...
Article
Full-text available
Significance: Tree diversity is fundamental for forest ecosystem stability and services. However, because of limited available data, estimates of tree diversity at large geographic domains still rely heavily on published lists of species descriptions that are geographically uneven in coverage. These limitations have precluded efforts to generate a g...
Article
Full-text available
We updated the routines used to estimate leaf maintenance respiration (MR) in the Energy Land Model (ELM) using a comprehensive global respiration database. The updated algorithm includes a temperature-acclimating base rate, an updated instantaneous temperature response, and new plant functional type specific parameters. The updated MR algorithm r...
Article
Full-text available
Assessing biodiversity status and trends in plant communities is critical for understanding, quantifying and predicting the effects of global change on ecosystems. Vegetation plots record the occurrence or abundance of all plant species co‐occurring within delimited local areas. This allows species absences to be inferred, information seldom provid...
Article
Full-text available
Small and Medium Enterprises (SMEs) are the backbone of any economy and play a vital role in creating employment and in providing much-needed support to large industries. All the ancillary and support services needed by large industries are provided by SMEs. The business support for spare parts or components or logistics or servic...
Article
Full-text available
The sudden onset of the COVID-19 pandemic took the entire world by surprise. There was very little time to react, as most countries went into lockdown to contain the spread of the virus. People were poorly prepared for the pandemic and its cascading effects. Businesses were mostly reactive and internally started creating pan...
Article
Full-text available
Plant traits—the morphological, anatomical, physiological, biochemical and phenological characteristics of plants—determine how plants respond to environmental factors, affect other trophic levels, and influence ecosystem properties and their benefits and detriments to people. Plant trait data thus represent the basis for a vast area of research sp...
Article
Plant trait databases often contain traits that are correlated, but for which direct (undirected statistical dependency) and indirect (mediated by other traits) connections may be confounded. The confounding of correlation and connection hinders our understanding of plant strategies, and how these vary among growth forms and climate zones. We identi...
Preprint
Full-text available
Stream deinterleaving is an important problem with various applications in the cybersecurity domain. In this paper, we consider the specific problem of deinterleaving DNS data streams using machine-learning techniques, with the objective of automating the extraction of malware domain sequences. We first develop a generative model for user request g...
Article
Full-text available
Significance: Currently, Earth system models (ESMs) represent variation in plant life through the presence of a small set of plant functional types (PFTs), each of which accounts for hundreds or thousands of species across thousands of vegetated grid cells on land. By expanding plant traits from a single mean value per PFT to a full distribution per...
Article
Multivariate time-series modeling and forecasting is an important problem with numerous applications. Traditional approaches such as VAR (vector auto-regressive) models and more recent approaches such as RNNs (recurrent neural networks) are indispensable tools in modeling time-series data. In many multivariate time series modeling problems, there i...
Conference Paper
In this work we consider the problem of anomaly detection in heterogeneous, multivariate, variable-length time series datasets. Our focus is on the aviation safety domain, where data objects are flights and time series are sensor readings and pilot switches. In this context the goal is to detect anomalous flight segments, due to mechanical, environ...
Article
We consider learning high-dimensional multi-response linear models with structured parameters. By exploiting the noise correlations among responses, we propose an alternating estimation (AltEst) procedure to estimate the model parameters based on the generalized Dantzig selector. Under suitable sample size and resampling assumptions, we show that t...
Article
The stochastic linear bandit problem proceeds in rounds where at each round the algorithm selects a vector from a decision set after which it receives a noisy linear loss parameterized by an unknown vector. The goal in such a problem is to minimize the (pseudo) regret which is the difference between the total expected loss of the algorithm and the...
Article
We consider the problem of estimating change in the dependency structure between two $p$-dimensional Ising models, based on respectively $n_1$ and $n_2$ samples drawn from the models. The change is assumed to be structured, e.g., sparse, block sparse, node-perturbed sparse, etc., such that it can be characterized by a suitable (atomic) norm. We pre...
Article
While the Matrix Generalized Inverse Gaussian ($\mathcal{MGIG}$) distribution arises naturally in some settings as a distribution over symmetric positive semi-definite matrices, certain key properties of the distribution and effective ways of sampling from the distribution have not been carefully studied. In this paper, we show that the $\mathcal{M...
Article
In recent years, structured matrix recovery problems have gained considerable attention for their real-world applications, such as recommender systems and computer vision. Much of the existing work has focused on matrices with low-rank structure, and limited progress has been made on matrices with other types of structure. In this paper we present non-a...
Article
Full-text available
In this work we consider the problem of anomaly detection in heterogeneous, multivariate, variable-length time series datasets. Our focus is on the aviation safety domain, where data objects are flights and time series are sensor readings and pilot switches. In this context the goal is to detect anomalous flight segments, due to mechanical, environ...
Article
In this paper, we present a unified analysis of matrix completion under general low-dimensional structural constraints induced by any norm regularization. We consider two estimators for the general problem of structured matrix completion, and provide unified upper bounds on the sample complexity and the estimation error. Our analysis relies o...
Article
Full-text available
While considerable advances have been made in estimating high-dimensional structured models from independent data using Lasso-type models, limited progress has been made for settings where the samples are dependent. We consider estimating structured VAR (vector auto-regressive) models, where the structure can be captured by any suitable norm, e.g.,...
Article
Anomalies correspond to the behavior of a system which does not conform to its expected or normal behavior. Identifying such anomalies from observed data, or the task of anomaly detection, is an important and often critical analysis task. This includes finding abnormalities in a medical image, fraudulent transactions in a credit card history, or st...
Article
We consider the problem of high-dimensional structured estimation with norm-regularized estimators, such as Lasso, when the design matrix and noise are drawn from sub-exponential distributions. Existing results only consider sub-Gaussian designs and noise, and both the sample complexity and non-asymptotic estimation error have been shown to depend...
Conference Paper
Data mining algorithms for computing solutions to online resource allocation (ORA) problems have focused on budgeting resources currently in possession, e.g., investing in the stock market with cash on hand or assigning current employees to projects. In several settings, one can leverage borrowed resources with which tasks can be accomplished more...
Conference Paper
Full-text available
While influence maximization in social networks has been studied extensively in the computer science community for the last decade, the focus has been on progressive influence models, such as the independent cascade (IC) and linear threshold (LT) models, which cannot capture the reversibility of choices. In this paper, we present the Heat Conduction (HC...
Article
Full-text available
Functional traits of organisms are key to understanding and predicting biodiversity and ecological change, which motivates continuous collection of traits and their integration into global databases. Such trait matrices are inherently sparse, severely limiting their usefulness for further analyses. On the other hand, traits are characterized by the...
Chapter
Link prediction is an important problem in online social and collaboration networks, for recommending friends and future collaborators. Most of the existing approaches for link prediction are focused on building unsupervised or supervised classification models based on the availability of accepts and rejects of the past recommendations. Several of...
Article
Analysis of non-asymptotic estimation error and structured statistical recovery based on norm regularized regression, such as Lasso, needs to consider four aspects: the norm, the loss function, the design matrix, and the noise model. This paper presents generalizations of such estimation error analysis on all four aspects compared to the existing l...
Conference Paper
Full-text available
An important problem in discrete graphical models is the maximum a posteriori (MAP) inference problem. Recent research has focused on the development of parallel MAP inference algorithms that scale to graphical models with millions of nodes. In this paper, we introduce a parallel implementation of the recently proposed Bethe-ADMM algorithm usi...
Article
Low-rank matrix completion methods have been successful in a variety of settings such as recommendation systems. However, most of the existing matrix completion methods only provide a point estimate of missing entries, and do not characterize uncertainties of the predictions. In this paper, we propose a Bayesian hierarchical probabilistic matrix fa...
Patent
A method, system and computer program product for inferring topic evolution and emergence in a set of documents. In one embodiment, the method comprises forming a group of matrices using text in the documents, and analyzing these matrices to identify a first group of topics as evolving topics and a second group of topics as emerging topics. The mat...
Conference Paper
Background/Purpose: In recent years the TRY initiative has combined an unprecedented number of plant trait measurements and has made these data available for trait-based approaches in ecology and biodiversity science and for the improvement of vegetation models (www.try-db.org). Main conclusion: We presented recent progress with resp...
Patent
Full-text available
A system, method and computer program product provides for multiple imputation of missing data elements in retail data sets used for modeling and decision-support applications based on the multi-dimensional, tensor structure of the data sets, and a fast, scalable scheme is implemented that is suitable for large data sets. The method generates multi...
Conference Paper
Background/Question/Methods: In recent years the TRY initiative has combined an unprecedented number of plant trait measurements and has made these data available for trait-based approaches in ecology and biodiversity science and for the improvement of vegetation models (www.try-db.org). Results/Conclusions: Here we will present recent pro...
Article
Full-text available
Hidden semi-Markov models (HSMMs) are latent variable models which allow latent state persistence and can be viewed as a generalization of the popular hidden Markov models (HMMs). In this paper, we introduce a novel spectral algorithm to perform inference in HSMMs. Unlike expectation maximization (EM), our approach correctly estimates the probabili...
Article
Two types of low cost-per-iteration gradient descent methods have been extensively studied in parallel. One is online or stochastic gradient descent (OGD/SGD), and the other is randomized coordinate descent (RBCD). In this paper, for the first time, we combine the two types of methods and propose online randomized block coordinate descent...
Article
We propose a Generalized Dantzig Selector (GDS) for linear models, in which any norm encoding the parameter structure can be leveraged for estimation. We investigate both computational and statistical aspects of the GDS. Based on conjugate proximal operator, a flexible inexact ADMM framework is designed for solving GDS, and non-asymptotic high-prob...
Article
We consider the problem of minimizing block-separable convex functions subject to linear constraints. While the Alternating Direction Method of Multipliers (ADMM) for two-block linear constraints has been intensively studied both theoretically and empirically, in spite of some preliminary work, effective generalizations of ADMM to multiple blocks i...
Article
We consider the problem of estimating sparse precision matrix of Gaussian copula distributions using samples with missing values in high dimensions. Existing approaches, primarily designed for Gaussian distributions, suggest using plugin estimators by disregarding the missing values. In this paper, we propose double plugin Gaussian (DoPinG) copula...
Conference Paper
Full-text available
In this paper, we study the problem of anomaly detection with application to aviation systems. We propose a framework for detecting precursors to aviation safety incidents due to human factors based on Hidden Semi-Markov Models (HSMMs). We investigate HSMMs due to their inherent ability to model durations in addition to modeling latent state transitio...
Article
We consider the problem of maximum a posteriori (MAP) inference in discrete graphical models. We present a parallel MAP inference algorithm called Bethe-ADMM based on two ideas: tree-decomposition of the graph and the alternating direction method of multipliers (ADMM). However, unlike the standard ADMM, we use an inexact ADMM augmented with a Bethe...
Conference Paper
Background/Question/Methods: Plant functional traits serve as tools to understand how plants as primary producers contribute to, and react to, changing environmental conditions. They also provide means to further our understanding of ecosystem functioning and the relationships between environmental conditions and vegetation patterns on different scal...
Article
Full-text available
Several important combinatorial optimization problems can be formulated as maximum a posteriori (MAP) inference in discrete graphical models. We adopt the recently proposed parallel MAP inference algorithm Bethe-ADMM and implement it using message passing interface (MPI) to fully utilize the computing power provided by the modern supercomputers wit...
Article
Online optimization has emerged as a powerful tool in large-scale optimization. In this paper, we introduce efficient online optimization algorithms based on the alternating direction method (ADM), which can solve online convex optimization under linear constraints where the objective could be non-smooth. We introduce new proof techniques for ADM i...
Article
The mirror descent algorithm (MDA) generalizes gradient descent by using a Bregman divergence to replace squared Euclidean distance as a proximal function. In this paper, we similarly generalize the alternating direction method of multipliers (ADMM) to Bregman ADMM (BADMM), which uses Bregman divergences as proximal functions in updates. BADMM...
Conference Paper
There are several Global Climate Models (GCM) reported by various countries to the Intergovernmental Panel on Climate Change (IPCC). Due to the varied nature of the GCM model assumptions, the future projections of the GCMs show high variability which makes it difficult to come up with confident projections into the future. Climate scientists combin...
Article
With the advent of remotely sensed data and coordinated efforts to create global databases, the ecological community has progressively become more data-intensive. However, in contrast to other disciplines, statistical ways of handling these large data sets, especially the gaps which are inherent to them, are lacking. Widely used theoretical approac...
Article
Full-text available
Covariance matrices have found success in several computer vision applications, including activity recognition, visual surveillance, and diffusion tensor imaging. This is because they provide an easy platform for fusing multiple features compactly. An important task in all of these applications is to compare two covariance matrices using a (dis)sim...
Conference Paper
Full-text available
Extracting sentiment from Twitter data is one of the fundamental problems in social media analytics. Twitter's length constraint renders determining the positive/negative sentiment of a tweet difficult, even for a human judge. In this work we present a general framework for per-tweet (in contrast with batches of tweets) sentiment analysis which con...
Article
Complex dynamical systems like precipitation extremes under climate variability or change are typically governed by multiple processes at multiple scales. The processes themselves may be manifested at multiple scales and would need to be captured through key indicator variables, which in turn may be better projected by physical models than the vari...
Article
Full-text available
Plant traits are a key to understanding and predicting the adaptation of ecosystems to environmental changes, which motivates the TRY project aiming at constructing a global database for plant traits and becoming a standard resource for the ecological community. Despite its unprecedented coverage, a large percentage of missing data substantially co...
Article
Full-text available
This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete/symbolic sequences. The objective is to provide a global understanding of the sequence anomaly detection problem and how existing techniques relate to each other. The key contribution of this survey is...
Article
Droughts are one of the most damaging climate-related hazards. The late 1960s Sahel drought in Africa and the North American Dust Bowl of the 1930s are two examples of severe droughts that have an impact on society and the environment. Due to the importance of understanding droughts, we consider the problem of their detection based on gridded dat...
Article
The design of statistical predictive models for climate data gives rise to some unique challenges due to the high dimensionality and spatio-temporal nature of the datasets, which dictate that models should exhibit parsimony in variable selection. Recently, a class of methods which promote structured sparsity in the model have been developed, which...
Article
We propose a new matrix completion algorithm, Kernelized Probabilistic Matrix Factorization (KPMF), which effectively incorporates external side information into the matrix factorization process. Compared with Probabilistic Matrix Factorization (PMF) [1], which imposes a Gaussian prior for each row of the data matrix, KPMF imposes a Gaussian Proc...
Article
We introduce Gaussian Process Topic Models (GPTMs), a new family of topic models which can leverage a kernel among documents while extracting correlated topics. GPTMs can be considered a systematic generalization of the Correlated Topic Models (CTMs) using ideas from Gaussian Process (GP) based embedding. Since GPTMs work with both a topic covarian...
Conference Paper
The large amount of reliable climate data available today has promoted the development of statistical predictive models for climate variables. In this paper we have applied Sparse Group Lasso to build a predictive model for land climate variables using ocean climate variables as covariates. We demonstrate that the sparse model provides better predi...
Conference Paper
The prominence and usage of probabilistic graphical models for data analysis have increased substantially over the past decade. Unlike traditional models in statistical machine learning, graphical models capture statistical dependencies between variables making them suitable for many problems. In this talk, I will discuss two applications of graphi...
Conference Paper
Full-text available
Covariance matrices provide compact, informative feature descriptors for use in several computer vision applications, such as people-appearance tracking, diffusion-tensor imaging, activity recognition, among others. A key task in many of these applications is to compare different covariance matrices using a (dis)similarity function. A natural choic...
Conference Paper
Streaming user-generated content in the form of blogs, microblogs, forums, and multimedia sharing sites provides a rich source of data from which invaluable information and insights may be gleaned. Given the vast volume of such social media data being continually generated, one of the challenges is to automatically tease apart the emerging topics o...
Conference Paper
We consider the problem of finding a suitable common low dimensional subspace for accurately representing a given set of covariance matrices. With one covariance matrix, this is principal component analysis (PCA). For multiple covariance matrices, we term the problem Common Component Analysis (CCA). While CCA can be posed as a tensor decomposition...
Conference Paper
Several data mining algorithms use iterative optimization methods for learning predictive models. It is not easy to determine upfront which optimization method will perform best or converge fast for such tasks. In this paper, we analyze Meta Algorithms (MAs) which work by adaptively combining iterates from a pool of base optimization algorithms. We...
Article
In recent years, mixture models have found widespread usage in discovering latent cluster structure from data. A popular special case of finite mixture models is the family of naive Bayes (NB) models, where the probability of a feature vector factorizes over the features for any given component of the mixture. Despite their popularity, naive Bayes...
Article
Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably cluster ensembles with missing values and row- or column-distributed cluster ensembles. Existing cluster ensemble algorith...
Conference Paper
We introduce Probabilistic Matrix Addition (PMA) for modeling real-valued data matrices by simultaneously capturing covariance structure among rows and among columns. PMA additively combines two latent matrices drawn from two Gaussian Processes respectively over rows and columns. The resulting joint distribution over the observed matrix does not fa...
Conference Paper
Full-text available
This paper presents a generalized version of the linear threshold model for simulating multiple cascades on a network while allowing nodes to switch between them. The proposed model is shown to be a rapidly mixing Markov chain and the corresponding steady state distribution is used to estimate highly likely states of the cascades' spread in the net...
Conference Paper
Probabilistic matrix factorization (PMF) methods have shown great promise in collaborative filtering. In this paper, we consider several variants and generalizations of the PMF framework inspired by three broad questions: Are the prior distributions used in existing PMF models suitable, or can one get better predictive performance with different priors...
Article
Family home visiting is a widely accepted strategy used with disadvantaged families to mitigate the effects of poverty. However, gaps persist in knowledge of effective intervention approaches for home visiting relative to specific client risks such as parenting and psychosocial problems. The purpose of this study was to inductively create clusters...
Article
Full-text available
Topic detection (TD) is a fundamental research issue in the Topic Detection and Tracking (TDT) community with practical implications; TD helps analysts to separate the wheat from the chaff among the thousands of incoming news streams. In this paper, we propose a simple and effective topic detection model called the temporal Discriminative Probabili...
Article
Full-text available
We present a graph-based semi-supervised approach for learning user-preferred travel schedules. Assuming a setting in which a user provides a small number of labeled travel schedules, we classify schedules into desirable and non-desirable. This task is non-trivial since only a small number of labeled points is available. It is further complicat...
Conference Paper
In recent years, matrix approximation for missing value prediction has emerged as an important problem in a variety of domains such as recommendation systems, e-commerce and online advertisement. While matrix factorization based algorithms typically have good approximation accuracy, such algorithms can be slow especially for large matrices. Further...
Conference Paper
The Aviation Safety Reporting System (ASRS) is used to collect voluntarily submitted aviation safety reports from pilots, controllers and others. As such it is particularly useful in researching aviation safety deficiencies. In this paper we address two challenges related to the analysis of ASRS data: (1) the unsupervised extraction of meaningful a...