Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study.

Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, South Africa.
Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases (Impact Factor: 3.22). 09/2011; 12(5):913-21. DOI: 10.1016/j.meegid.2011.08.014
Source: PubMed

ABSTRACT Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious diseases or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls. Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples. Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore known marker genes associated with each cell type were assigned to the correct cell type more frequently for the guided versions. We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The altered composition of immune cells in peripheral blood has been reported to be associated with cancer patient survival. However, analysis of the composition of peripheral immune cells are often limited in retrospective survival studies employing banked blood specimens with long-term follow-up because the application of flow cytometry to such specimens is problematic. The aim of this study was to demonstrate the feasibility of deconvolving blood-based gene expression profiles (GEPs) to estimate the proportions of immune cells and determine their prognostic values for cancer patients.
    PLoS ONE 01/2014; 9(6):e100934. · 3.53 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Nonnegative Matrix Factorization (NMF) has been a popular representation method for pattern classification problem. It tries to decompose a nonnegative matrix of data samples as the product of a nonnegative basic matrix and a nonnegative coefficient matrix, and the coefficient matrix is used as the new representation. However, traditional NMF methods ignore the class labels of the data samples. In this paper, we proposed a supervised novel NMF algorithm to improve the discriminative ability of the new representation. Using the class labels, we separate all the data sample pairs into within-class pairs and between-class pairs. To improve the discriminate ability of the new NMF representations, we hope that the maximum distance of the within-class pairs in the new NMF space could be minimized, while the minimum distance of the between-class pairs pairs could be maximized. With this criterion, we construct an objective function and optimize it with regard to basic and coefficient matrices and slack variables alternatively, resulting in a iterative algorithm.
    Neural Networks. 12/2013;
  • [Show abstract] [Hide abstract]
    ABSTRACT: Solid tumor samples typically contain multiple distinct clonal populations of cancer cells, and also stromal and immune cell contamination. A majority of the cancer genomics and transcriptomics studies do not explicitly consider genetic heterogeneity and impurity, and draw inferences based on mixed populations of cells. Deconvolution of genomic data from heterogeneous samples provides a powerful tool to address this limitation. We discuss several computational tools, which enable deconvolution of genomic and transcriptomic data from heterogeneous samples. We also performed a systematic comparative assessment of these tools. If properly used, these tools have potentials to complement single-cell genomics and immunoFISH analyses, and provide novel insights into tumor heterogeneity.
    Briefings in Bioinformatics 02/2014; · 5.30 Impact Factor