Article

Lower Dimensional Representation of Text Data Based on Centroids and Least Squares

University of Minnesota; Univ. of California, Santa Barbara; University of California
BIT (impact factor: 0.72). 05/2003; 43(2):427-448. DOI:10.1023/A:1026039313770

ABSTRACT Dimension reduction in today's vector space based information retrieval systems is essential for improving computational efficiency in handling massive amounts of data. A mathematical framework for lower dimensional representation of text data in vector space based information retrieval is proposed using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then two new methods for dimension reduction based on the centroids of data clusters are proposed and shown to be more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties, are discussed. Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
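
The abstract does not spell out the algorithms, so the following is only a minimal sketch of one plausible reading of "centroids and least squares": form a matrix C whose columns are the cluster centroids of a term-document matrix A, then take as the reduced representation the matrix Y that minimizes ||C Y - A|| in the Frobenius norm. A rank-k truncated SVD is included as the LSI/SVD-style baseline. The toy data, the cluster labels, and the function names `centroid_matrix`, `reduce_by_centroids`, and `reduce_by_svd` are all assumptions made for illustration and do not come from the article.

```python
import numpy as np

def centroid_matrix(A, labels):
    """Return the m x k matrix whose columns are the cluster centroids of A's columns."""
    classes = np.unique(labels)
    return np.column_stack([A[:, labels == c].mean(axis=1) for c in classes])

def reduce_by_centroids(A, labels):
    """Map each m-dimensional column of A to k dimensions (k = number of clusters)."""
    C = centroid_matrix(A, labels)              # m x k
    # Least squares: Y minimizes ||C Y - A||_F, solved for all columns of A at once.
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)   # k x n
    return C, Y

def reduce_by_svd(A, k):
    """Rank-k truncated SVD baseline: A is approximated by U_k (S_k V_k^T)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k, None] * Vt[:k, :]

# Hypothetical toy corpus: 6 documents over 6 terms, two known clusters.
docs = np.array([
    [2, 3, 1, 0, 0, 0],
    [3, 2, 2, 0, 1, 0],
    [2, 3, 2, 1, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 2],
    [1, 0, 0, 3, 2, 3],
], dtype=float)
A = docs.T                                # terms x documents
labels = np.array([0, 0, 0, 1, 1, 1])     # assumed a priori cluster structure

C, Y = reduce_by_centroids(A, labels)
U_k, Y_svd = reduce_by_svd(A, k=2)

err_centroid = np.linalg.norm(A - C @ Y)      # Frobenius norm of the residual
err_svd = np.linalg.norm(A - U_k @ Y_svd)

print("reduced shape:", Y.shape)
print("centroid residual:", err_centroid)
print("SVD residual:", err_svd)
```

In this sketch each document is represented by k coordinates, one per cluster, so the cost is a single small least squares solve rather than a full SVD. The truncated SVD always attains the smaller approximation error at the same rank, which is why any advantage of a centroid-based representation has to be judged on a downstream task such as classification rather than on reconstruction error alone.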

  • Source
    Article: Feature extraction and dimensionality reduction for mass spectrometry data.
    ABSTRACT: Mass spectrometry is being used to generate protein profiles from human serum, and proteomic data obtained from mass spectrometry have attracted great interest for the detection of early stage cancer. However, high dimensional mass spectrometry data cause considerable challenges. In this paper we propose a feature extraction algorithm based on wavelet analysis for high dimensional mass spectrometry data. A set of wavelet detail coefficients at different scales is used to detect the transient changes of mass spectrometry data. The experiments are performed on 2 datasets. A highly competitive accuracy, compared with the best performance of other kinds of classification models, is achieved. Experimental results show that the wavelet detail coefficients are an efficient way to characterize features of high dimensional mass spectra and to reduce their dimensionality.
    Computers in biology and medicine 10/2009; 39(9):818-23. · 1.27 Impact Factor
  • Source
    Article: Generalized linear discriminant analysis: a unified framework and efficient model selection.
    ABSTRACT: High-dimensional data are common in many domains, and dimensionality reduction is key to coping with the curse of dimensionality. Linear discriminant analysis (LDA) is a well-known method for supervised dimensionality reduction. When dealing with high-dimensional and low sample size data, classical LDA suffers from the singularity problem. Over the years, many algorithms have been developed to overcome this problem, and they have been applied successfully in various applications. However, there is a lack of a systematic study of the commonalities and differences of these algorithms, as well as their intrinsic relationships. In this paper, a unified framework for generalized LDA is proposed, which elucidates the properties of various algorithms and their relationships. Based on the proposed framework, we show that the matrix computations involved in LDA-based algorithms can be simplified so that the cross-validation procedure for model selection can be performed efficiently. We conduct extensive experiments using a collection of high-dimensional data sets, including text documents, face images, gene expression data, and gene expression pattern images, to evaluate the proposed theories and algorithms.
    IEEE Transactions on Neural Networks 11/2008; 19(10):1768-82. · 2.95 Impact Factor
  • Source
    Conference Proceeding: Identifying biomarkers for acupuncture treatment via an optimization model
    ABSTRACT: Identifying biomarkers for acupuncture treatment is crucial to understanding the mechanism of the acupuncture effect at the molecular level. In this study, we investigate the metabolic profiles of acupuncture treatment on several meridian points in humans. To identify the subsets of metabolites that best characterize the acupuncture effect for each meridian point, a linear programming based model is proposed to identify biomarkers from the high-dimensional metabolic data. Specifically, we use the nearest centroid as a prototype to simultaneously minimize the number of selected features and the leave-one-out cross validation error of the classifier. As a result, we reveal novel metabolite biomarkers for acupuncture treatment. Our result demonstrates that metabolic profiling might be a promising method for investigating the molecular mechanism of acupuncture. Comparison with other existing methods shows the efficiency and effectiveness of our new method. In addition, the method proposed in this paper is general and can be used in other high-dimensional applications, such as cancer genomics.
    Systems Biology (ISB), 2011 IEEE International Conference on; 10/2011

Keywords

certain classification problems
computational efficiency
data clusters
dimension reduction
dimension reduction algorithms
handling massive amounts
information retrieval system
lower dimensional representation
mathematical framework
matrix rank reduction formula
new methods
priori information
reduced dimensional space
reduced space
Singular Value Decomposition
successful lower dimensional representation
text data
today's vector space
used Latent Semantic Indexing
vector space