ABSTRACT: We consider the problem of dimensionality reduction and manifold learning when the domain of interest is a set of probability distributions rather than a set of Euclidean data vectors. One seeks a low-dimensional representation, called an embedding, that preserves properties such as the distance between measured distributions or the separation between classes of distributions. This article presents methods specifically designed for low-dimensional embedding of information-geometric data, and we illustrate these methods for visualization in flow cytometry and demography analysis.
IEEE Signal Processing Magazine 04/2011; · 3.37 Impact Factor
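A minimal sketch of this style of embedding, assuming discrete distributions: the Hellinger distance (a standard local approximation to the Fisher information distance) is computed between all pairs, and classical multidimensional scaling produces Euclidean coordinates. The article's methods are more elaborate; this only illustrates the pipeline.

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions; locally,
    # twice this quantity approximates the Fisher information distance
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def cmds(D, d=2):
    # classical multidimensional scaling: embed a distance matrix into R^d
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]            # top-d eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

rng = np.random.default_rng(0)
hists = rng.dirichlet(np.ones(20), size=5)   # five toy discrete distributions
D = np.array([[hellinger(p, q) for q in hists] for p in hists])
Y = cmds(D, d=2)                             # 2-D embedding coordinates
```

The resulting `Y` places distributions that are close in (approximate) Fisher distance close in the plane, which is the property the visualization applications rely on.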
ABSTRACT: In this paper, we present multiple novel applications of local intrinsic dimension estimation. Much work has been done on estimating the global dimension of a data set, typically for the purposes of dimensionality reduction. We show that by estimating dimension locally, we are able to extend dimension estimation to many applications that are not possible with global estimates alone. Additionally, we show that local dimension estimation can be used to obtain a better global dimension estimate, alleviating the negative bias that is common to all known dimension estimation algorithms. We illustrate the use of local dimension estimation in further applications, such as learning on statistical manifolds, network anomaly detection, clustering, and image segmentation.
IEEE Transactions on Signal Processing 03/2010; · 2.81 Impact Factor
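As a concrete example of a local estimator of the kind discussed here, the Levina-Bickel nearest-neighbor maximum-likelihood estimate can be evaluated at each query point (a sketch, not necessarily the paper's exact estimator):

```python
import numpy as np

def local_dim_mle(X, x, k=20):
    # Levina-Bickel maximum-likelihood estimate of the intrinsic
    # dimension at x, from distances to its k nearest neighbors in X
    d = np.sort(np.linalg.norm(X - x, axis=1))
    d = d[d > 0][:k]                  # drop x itself if it belongs to X
    return (k - 1) / np.sum(np.log(d[-1] / d[:-1]))

# toy check: a 2-D plane embedded in a 5-D ambient space
rng = np.random.default_rng(1)
Z = rng.uniform(size=(2000, 2))
X = np.hstack([Z, np.zeros((2000, 3))])
est = np.mean([local_dim_mle(X, X[i]) for i in range(50)])   # close to 2
```

Averaging the local estimates over many points is one simple way to obtain the improved global estimate mentioned in the abstract.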
ABSTRACT: Flow cytometry is often used to characterize the malignant cells in leukemia and lymphoma patients, traced to the level of the individual cell. Typically, flow-cytometric data analysis is performed through a series of 2-D projections onto the axes of the data set. Through the years, clinicians have determined combinations of different fluorescent markers that generate relatively well-known expression patterns for specific subtypes of leukemia and lymphoma, cancers of the hematopoietic system. By viewing only a series of 2-D projections, the high-dimensional nature of the data is rarely exploited. In this paper we present a means of determining a low-dimensional projection that maintains the high-dimensional relationships (i.e., information distance) between differing oncological data sets. By using machine learning techniques, we allow clinicians to visualize data in a low dimension defined by a linear combination of all of the available markers, rather than just two at a time. This provides an aid in diagnosing similar forms of cancer, as well as a means for variable selection in exploratory flow-cytometric research. We refer to our method as information preserving component analysis (IPCA).
IEEE Journal of Selected Topics in Signal Processing 03/2009; · 3.30 Impact Factor
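The published IPCA method optimizes a linear projection so that information divergences between data sets are preserved; the optimization itself is described in the paper. A hedged sketch of a plausible objective of this type, assuming each data set is summarized by a fitted Gaussian and divergences are symmetric Kullback-Leibler (the function names and modeling choices here are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def sym_kl_gauss(m1, S1, m2, S2):
    # symmetric KL divergence between two multivariate Gaussian models
    def kl(ma, Sa, mb, Sb):
        d = len(ma)
        Si = np.linalg.inv(Sb)
        dm = mb - ma
        return 0.5 * (np.trace(Si @ Sa) + dm @ Si @ dm - d
                      + np.log(np.linalg.det(Sb) / np.linalg.det(Sa)))
    return kl(m1, S1, m2, S2) + kl(m2, S2, m1, S1)

def ipca_objective(A, models):
    # squared deviation between full-space and projected-space divergences;
    # an IPCA-style method seeks the A minimizing this (only evaluated here)
    err = 0.0
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            m1, S1 = models[i]
            m2, S2 = models[j]
            full = sym_kl_gauss(m1, S1, m2, S2)
            proj = sym_kl_gauss(A @ m1, A @ S1 @ A.T, A @ m2, A @ S2 @ A.T)
            err += (full - proj) ** 2
    return err
```

A full-rank orthogonal `A` leaves the Gaussian divergences unchanged, so the objective is zero there; a good low-dimensional `A` keeps it small.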
ABSTRACT: Dimensionality reduction is required for "human in the loop" analysis of high-dimensional data. We present a method for dimensionality reduction that is tailored to tasks of data set discrimination. In contrast to Euclidean dimensionality reduction, which preserves Euclidean distances or angles in the lower-dimensional space, our method seeks to preserve information as measured by the Fisher information distance, or approximations thereof, on the data-associated probability density functions. We illustrate the approach for multi-class object discrimination problems.
Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop, 2009. DSP/SPE 2009. IEEE 13th; 02/2009
ABSTRACT: Like many biomedical applications, flow cytometry is a field in which dimensionality reduction is important for analysis and diagnosis. Through expression patterns of various fluorescent biomarkers, flow cytometry is often used to characterize the malignant cells in cancer patients, traced to the level of the individual cell. Typically, diagnosticians analyze cytometric data through a series of 2-dimensional histograms of the expression of various marker combinations, which does not exploit the high-dimensional nature of the data. In this paper we utilize a form of dimensionality reduction, which we refer to as Information Preserving Component Analysis (IPCA), that preserves the information distance between multi-dimensional data sets. As such, we offer a method for clinicians to visualize patient data in a low-dimensional projection space defined by a linear combination of all available markers. We illustrate these results on actual patient data.
Machine Learning for Signal Processing, 2008. MLSP 2008. IEEE Workshop on; 11/2008
ABSTRACT: The problem of document classification considers categorizing or grouping of various document types. Each document can be represented as a bag of words, which has no straightforward Euclidean representation. Relative word counts form the basis for similarity metrics among documents, but endowing the vector of term frequencies with a Euclidean metric has no obvious justification. A more appropriate and commonly used assumption is that the data lie on a statistical manifold, a manifold of probabilistic generative models. In this paper, we propose calculating a low-dimensional, information-based embedding of documents into Euclidean space. One component of our approach, motivated by information geometry, is the use of the Fisher information distance to define similarities between documents. The other component is the calculation of the Fisher metric over a lower-dimensional statistical manifold estimated nonparametrically from the data. We demonstrate that in the classification task, this information-driven embedding outperforms both a standard PCA embedding and other Euclidean embeddings of the term frequency vector.
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on; 05/2008 · 4.63 Impact Factor
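For multinomial models of term frequencies, the Fisher information distance has a closed form: the square-root map sends the probability simplex onto a sphere, where the geodesic distance is an arc length. A small sketch under that modeling assumption (toy corpus and naive whitespace tokenization, both purely illustrative):

```python
import numpy as np

def fisher_multinomial_dist(p, q):
    # exact Fisher information (geodesic) distance on the multinomial
    # simplex: 2 * arccos of the Bhattacharyya coefficient
    bc = np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0)
    return 2.0 * np.arccos(bc)

docs = ["the cat sat", "the dog sat", "stock prices fell"]   # toy corpus
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], float)
tf /= tf.sum(axis=1, keepdims=True)       # normalize to term frequencies
D = np.array([[fisher_multinomial_dist(a, b) for b in tf] for a in tf])
```

With these distances, the two overlapping sentences land closer to each other than to the unrelated one, and `D` can be fed to any distance-based embedding for the classification experiments the abstract describes.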