Joydeep Ghosh

University of Texas at Austin | UT · Department of Electrical & Computer Engineering

About

380
Publications
65,793
Reads
23,577
Citations

Publications (380)
Preprint
We propose a novel approach to the problem of clustering hierarchically aggregated time-series data, which has remained an understudied problem though it has several commercial applications. We first group time series at each aggregated level, while simultaneously leveraging local and global information. The proposed method can cluster hierarchical...
Preprint
Full-text available
Federated Learning has become an important learning paradigm due to its privacy and computational benefits. As the field advances, two key challenges remain to be addressed: (1) system heterogeneity, that is, variability in the compute and/or data resources present on each client, and (2) a lack of labeled data in certain federated settings. Sev...
Chapter
Tensor factorization is a methodology that is applied in a variety of fields, ranging from climate modeling to medical informatics. A tensor is an n-way array that captures the relationship between n objects. These multiway arrays can be factored to study the underlying bases present in the data. Two challenges arising in tensor factorization are 1...
Article
Full-text available
Discriminative classification models often assume all classes are available at the training phase. As such models do not have a strategy to learn new concepts from available unlabeled instances, they usually work poorly when unknown classes emerge from future data to be classified. To address the appearance of new classes, some authors have develop...
Article
A computational phenotype is a set of clinically relevant and interesting characteristics that describe patients with a given condition. Various machine learning methods have been proposed to derive phenotypes in an automatic, high-throughput manner. Among these methods, computational phenotyping through tensor factorization has been shown to produ...
Preprint
Full-text available
It has been recently shown that sparse, nonnegative tensor factorization of multi-modal electronic health record data is a promising approach to high-throughput computational phenotyping. However, such approaches typically do not leverage available domain knowledge while extracting the phenotypes; hence, some of the suggested phenotypes may not map...
Article
Full-text available
Background: Researchers are developing methods to automatically extract clinically relevant and useful patient characteristics from raw healthcare datasets. These characteristics, often capturing essential properties of patients with common medical conditions, are called computational phenotypes. Being generated by automated or semiautomated, data-d...
Article
Oftentimes businesses face the challenge of requiring costly information to improve the accuracy of prediction tasks. One notable example is obtaining informative customer feedback (e.g., customer-product ratings via costly incentives) to improve the effectiveness of recommender systems. In this paper, we develop a novel active learning approach, w...
Conference Paper
We propose gamAID, an exploratory, supervised nonnegative tensor factorization method that iteratively extracts phenotypes from tensors constructed from medical count data. Using data from diabetic patients who later on get diagnosed with chronic kidney disorder (CKD) as well as diabetic patients who do not receive a CKD diagnosis, we demonstrate t...
Article
We propose a novel and efficient algorithm for the collaborative preference completion problem, which involves jointly estimating individualized rankings for a set of entities over a shared set of items, based on a limited number of observed affinity values. Our approach exploits the observation that while preferences are often recorded as numerica...
Article
Full-text available
The increased availability of electronic health records (EHRs) has spearheaded the initiative for precision medicine using data-driven approaches. Essential to this effort is the ability to identify patients with certain medical conditions of interest from simple queries on EHRs, or EHR-based phenotypes. Existing rule-based phenotyping approaches...
Article
Learning the true ordering between objects by aggregating a set of expert opinion rank order lists is an important and ubiquitous problem in many applications ranging from social choice theory to natural language processing and search aggregation. We study the problem of unsupervised rank aggregation where no ground truth ordering information is av...
Article
Databases in domains such as healthcare are routinely released to the public in aggregated form. Unfortunately, naive modeling with aggregated data may significantly diminish the accuracy of inferences at the individual level. This paper addresses the scenario where features are provided at the individual level, but the target variables are only av...
Conference Paper
Computational phenotyping is the process of converting heterogeneous electronic health records (EHRs) into meaningful clinical concepts. Unsupervised phenotyping methods have the potential to leverage a vast amount of labeled EHR data for phenotype discovery. However, existing unsupervised phenotyping methods do not incorporate current medical know...
Article
Full-text available
We investigate how to make a simpler version of an existing algorithm, named C^3E, from Consensus between Classification and Clustering Ensembles, more user-friendly by automatically tuning its main parameters with the use of metaheuristics. In particular, C^3E based on a Squared Loss function, C^3E-SL, assumes an optimization procedure that takes...
Article
The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical resea...
Article
Unsupervised models can provide supplementary soft constraints to help classify new "target" data because similar instances in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take pla...
Conference Paper
Electronic health records (EHRs) are becoming an increasingly important source of patient information. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims at mapping the EHR data to specific medical concepts...
Article
Full-text available
Unsupervised models can provide supplementary soft constraints to help classify new data since similar instances are more likely to share the same class label. In this context, this paper reports on a study on how to make an existing algorithm, named C3E (from Consensus between Classification and Clustering Ensembles), more convenient by automatic...
Article
Sepsis and septic shock are common and potentially fatal conditions that often occur in intensive care unit (ICU) patients. Early prediction of patients at risk for septic shock is therefore crucial to minimizing the effects of these complications. Potential indications for septic shock risk span a wide range of measurements, including physiologica...
Article
Transposable data represent interactions between two sets of entities and are typically represented as a matrix containing the known interaction values. Additional side information may consist of feature vectors specific to entities corresponding to the rows and/or columns of such a matrix. Further information may also be available in the form of i...
Article
Full-text available
In this paper we present EPIC, an efficient and effective predictor for IC manufacturing hotspots in deep sub-wavelength lithography. EPIC proposes a unified framework to combine different hotspot detection methods, such as machine learning and pattern matching, using mathematical programming/optimization. The EPIC algorithm has been tested on...
Article
We quantify the degradation in performance of a popular and effective face detector when human-perceived image quality is degraded by distortions due to additive white Gaussian noise, Gaussian blur or JPEG compression. It is observed that, within a certain range of perceived image quality, a modest increase in image quality can drastically improve...
Article
This paper introduces two kinds of decision tree ensembles for imbalanced classification problems, extensively utilizing properties of α-divergence. First, a novel splitting criterion based on α-divergence is shown to generalize several well-known splitting criteria such as those used in C4.5 and CART. When the α-divergence splitting criterion is a...
Article
We present a general framework for constructing prior distributions with structured variables. The prior is defined as the information projection of a base distribution onto distributions supported on the constraint set of interest. In cases where this projection is intractable, we propose a family of parameterized approximations indexed by subsets...
Article
Full-text available
We propose a categorical data synthesizer with a quantifiable disclosure risk. Our algorithm, named Perturbed Gibbs Sampler, can handle high-dimensional categorical data that are often intractable to represent as contingency tables. The algorithm extends a multiple imputation strategy for fully synthetic data by utilizing feature hashing and non-pa...
Conference Paper
We introduce a class of methods for Gaussian process regression with functional expectation constraints. We show that the solution can be found without the need for approximations when the constraint set satisfies a representation theorem. Further, the solution is unique when the constraint set is convex. Constrained Gaussian process regression is...
Conference Paper
This paper introduces retargeted matrix factorization (R-MF), a novel approach for learning the user-wise ranking of items in the context of collaborative filtering. R-MF learns to rank by "retargeting" the item ratings of each user, searching for a monotonic transformation of the ratings that results in a better fit while preserving the ranked ord...
Article
Full-text available
We present a novel approach for constrained Bayesian inference. Unlike current methods, our approach does not require convexity of the constraint set. We reduce the constrained variational inference to a parametric optimization over the feasible set of densities and propose a general recipe for such problems. We apply the proposed constrained Bayes...
Conference Paper
Full-text available
Unsupervised models can provide supplementary soft constraints to help classify new data since similar instances are more likely to share the same class label. In this context, we investigate how to make an existing algorithm, named C3E (from Combining Classifier and Cluster Ensembles), more user-friendly by automatically tuning its main parameter...
Article
In healthcare-related studies, individual patient or hospital data are not often publicly available due to privacy restrictions, legal issues, or reporting norms. However, such measures may be provided at a higher or more aggregated level, such as state-level, county-level summaries or averages over health zones, such as hospital referral regions (...
Article
We model the temporal symptomatic characteristics of 171 cardiac arrest patients in Intensive Care Units. The temporal and feature dependencies in the data are illustrated using a mixture of matrix normal distributions. We found that the cardiac arrest temporal signature is best summarized with six hours of data prior to cardiac arrest events, and its...
Article
Computational prediction of genes that play roles in human diseases remains an important but challenging task. In this work, we formulate candidate gene prediction as a bipartite ranking problem combining a task-wise ordered observation model with a latent multitask regression function using the matrix-variate Gaussian process (MV-GP). We then use...
Conference Paper
We present an experimental study of topic models applied to the analysis of functional magnetic resonance images. This study is motivated by the hypothesis that experimental task contrast images share a common set of mental concepts. We represent the images as documents and the mental concepts as topics, and evaluate the effectiveness of unsupervis...
Article
Full-text available
Multiple sclerosis (MS) is a chronic autoimmune disease that affects the central nervous system. The progression and severity of MS vary by individual, but it is generally a disabling disease. Although medications have been developed to slow the disease progression and help manage symptoms, MS research has yet to result in a cure. Early diagnosis...
Article
We propose a family of image quality assessment (IQA) models based on natural scene statistics (NSS) that can predict the subjective quality of a distorted image without reference to a corresponding distortionless image, and without any training results on human opinion scores of distorted images. These 'completely blind' models compete well with...
Article
We propose a novel hierarchical model for multitask bipartite ranking. The proposed approach combines a matrix-variate Gaussian process with a generative model for task-wise bipartite ranking. In addition, we employ a novel trace constrained variational inference approach to impose low rank structure on the posterior matrix-variate Gaussian process...
Article
Full-text available
Both supervised and semisupervised algorithms for hyperspectral data analysis typically assume that all unlabeled data belong to the same set of land-cover classes that is represented by labeled data. This is not true in general, however, since there may be new classes in the unexplored regions within an image or in areas that are geographically ne...
Article
Constrained clustering has been an active research topic over the last decade. Most studies focus on batch-mode algorithms. This brief introduces two algorithms for on-line constrained learning, named on-line linear constrained vector quantization error (O-LCVQE) and constrained rival penalized competitive learning (C-RPCL). The former is a varian...
Article
Full-text available
This paper tackles temporal resolution of documents, such as determining when a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous works, we rely solely on temporal cues implicit in the text. W...
Article
Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drif...
Article
Full-text available
This paper introduces a novel approach for learning to rank (LETOR) based on the notion of monotone retargeting. It involves minimizing a divergence between all monotonic increasing transformations of the training scores and a parameterized prediction function. The minimization is both over the transformations as well as over the parameters. It is...
Article
Probabilistic matrix factorization (PMF) and other popular approaches to collaborative filtering assume that the ratings given by users for products are genuine, and hence they give equal importance to all available ratings. However, this is not always true due to several reasons including the presence of opinion spam in product reviews. In this pa...
Article
Full-text available
This paper introduces a privacy-aware Bayesian approach that combines ensembles of classifiers and clusterers to perform semi-supervised and transductive learning. We consider scenarios where instances and their classification/clustering results are distributed across different data sites and have sharing restrictions. As a special case, the privac...
Article
Full-text available
Unsupervised models can provide supplementary soft constraints to help classify new, "target" data since similar instances in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take plac...
Article
Full-text available
We propose a highly unsupervised, training-free, no-reference image quality assessment (IQA) model that is based on the hypothesis that distorted images have certain latent characteristics that differ from those of “natural” or “pristine” images. These latent characteristics are uncovered by applying a “topic model” to visual words extracted from a...
Article
Full-text available
In healthcare-related studies, individual patient or hospital data are not often publicly available due to privacy restrictions, legal issues or reporting norms. However, such measures may be provided at a higher or more aggregated level, such as state-level, county-level summaries or averages over health zones (HRR or HSA). Such levels constitut...
Conference Paper
ICU patients are vulnerable to in-ICU morbidities and mortality, making accurate systems for identifying at-risk patients a necessity for improving clinical care. Here, we present an improved model for predicting in-hospital mortality using data collected from the first 48 hours of a patient's ICU stay. We generated predictive features for each pat...
Article
Full-text available
The Laplacian Eigenmap is a popular method for non-linear dimension reduction and data representation. This graph-based method uses a Graph Laplacian matrix that closely approximates the Laplace-Beltrami operator, which has properties that help to learn the structure of data lying on Riemannian manifolds. However, the Graph Laplacian used in this...
Conference Paper
Full-text available
Learning a model for data in a distributed source system has often been performed by collecting all data at a central location and performing the learning process on the global data set at the central location. Although a common global feature space is normally assumed, each local source may only sample a subset of features, producing a heterogeneo...
Article
Full-text available
Pairwise interaction networks capture inter-user dependencies (e.g., social networks) and inter-item dependencies (e.g., item categories) that provide insight into user and item behavior. It is often assumed that such interaction information is informative for preference prediction. This may not be the case, as some of the observed interactions ma...
Conference Paper
Full-text available
The combination of multiple classifiers to generate a single classifier has been shown to be very useful in practice. Similarly, several efforts have shown that cluster ensembles can improve the quality of results as compared to a single clustering solution. These observations suggest that ensembles containing both classifiers and clusterers are po...
Article
Full-text available
Many measures of healthcare delivery or quality are not publicly available at the individual patient or hospital level largely due to privacy restrictions, legal issues or reporting norms. Instead, such measures are provided at a higher or more aggregated level, such as state-level, county-level summaries or averages over health zones (HRRs and HS...
Article
Full-text available
This paper proposes a novel framework called Gaussian process maximum likelihood for spatially adaptive classification of hyperspectral data. In hyperspectral images, spectral responses of land covers vary over space, and conventional classification algorithms that result in spatially invariant solutions are fundamentally limited. In the proposed f...
Article
This paper presents a semi-supervised learning algorithm called Gaussian process expectation-maximization (GP-EM), for classification of landcover based on hyperspectral data analysis. Model parameters for each land cover class are first estimated by a supervised algorithm using Gaussian process regressions to find spatially adaptive parameters, an...
Article
Cluster ensembles combine multiple clusterings of a set of objects into a single consolidated clustering, often referred to as the consensus solution. Consensus clustering can be used to generate more robust and stable clustering results compared to a single clustering approach, perform distributed computing under privacy or sharing constraints, or...
Conference Paper
Full-text available
We propose a no-reference algorithm to assess the comfort associated with viewing stereo images and videos. The proposed measure of 3D quality of experience is shown to correlate well with human perception of quality on a publicly available dataset of 3D images/videos and human subjective scores. The proposed measure extracts statistical features f...
Conference Paper
Full-text available
This paper introduces a novel splitting criterion parametrized by a scalar 'α' to build a class-imbalance resistant ensemble of decision trees. The proposed splitting criterion generalizes information gain in C4.5, and its extended form encompasses Gini (CART) and DKM splitting criteria as well. Each decision tree in the ensemble is based on a diffe...
Article
Full-text available
For several evaluation metrics for classification problems, correctly classifying an additional point from one class will have a different effect on the value of the evaluation metric compared to correctly classifying an additional point from another class. In this paper, we describe a method for quantifying these effects based on "metric skew"....
Conference Paper
Full-text available
Terascale astronomical datasets have the potential to provide unprecedented insights into the origins of our universe. However, automated techniques for determining regions of interest are a must if domain experts are to cope with the intractable amounts of simulation data. This paper addresses the important problem of locating and tracking high de...
Chapter
In this chapter, we examine the relationship between cost-sensitive learning and resampling. We first introduce these concepts, including a new resampling method called “generative oversampling,” which creates new data points by learning parameters for an assumed probability distribution. We then examine theoretically and empirically the effects of...
Conference Paper
Full-text available
Many data mining applications involve predictive modeling of very large, complex datasets. Such applications present a need for innovative algorithms and associated implementations that are not only effective in terms of prediction accuracy, but can also be efficiently run on distributed computational systems to yield results in reasonable time. Th...
Conference Paper
Manifold models for nonlinear dimensionality reduction provide useful low-dimensional representations of high-dimensional data. Most manifold models are unsupervised algorithms and map the entire data onto a single manifold. Heterogeneous data with multiple classes are often better modeled by multiple manifolds rather than by a single global manifo...
Conference Paper
Full-text available
Several data mining applications such as recommender systems and online advertising involve the analysis of large, heterogeneous dyadic data, where the data consists of measurements on pairs of elements, each from a different set of entities. Independent variables (covariates) are additionally associated with the entities along the two modes and th...
Article
Full-text available
Clustering is a useful technique that divides data points into groups, also known as clusters, such that the data points of the same cluster exhibit similar properties. Typical clustering algorithms assign each data point to at least one cluster. However, in practical datasets such as microarray gene datasets, only a subset of the genes are highly corr...
Article
Full-text available
A key application of clustering data obtained from sources such as microarrays, protein mass spectroscopy, and phylogenetic profiles is the detection of functionally related genes. Typically, only a small number of functionally related genes cluster into one or more groups, and the rest need to be ignored. For such situations, we present Automated...
Article
Full-text available
For difficult classification or regression problems, practitioners often segment the data into relatively homogeneous groups and then build a predictive model for each group. This two-step procedure usually results in simpler, more interpretable and actionable models without any loss in accuracy. In this work, we consider problems such as predictin...
Conference Paper
Full-text available
A Gaussian process regression technique is proposed to predict ground-based aerosol optical depth measurements from satellite multispectral images, and to select the most informative ground-based sites by active learning. Satellite images provide spatial and temporal information in addition to the spectral features, and such heterogeneity of availa...
Conference Paper
Full-text available
Multispectral remote sensing images are widely used for automated land use and land cover classification tasks. Remotely sensed images usually cover large geographical areas, and the spectral characteristics of each class often vary over time and space. We apply a spatially adaptive classification scheme that models spatial variation with Gaussian pr...
Conference Paper
Full-text available
Uncertainty sampling is an effective method for performing active learning that is computationally efficient compared to other active learning methods such as loss-reduction methods. However, unlike loss-reduction methods, uncertainty sampling cannot minimize total misclassification costs when errors incur different costs. This paper introduces a m...