
Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study.

Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, South Africa.
Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases (Impact Factor: 3.22). 09/2011; 12(5):913-21. DOI: 10.1016/j.meegid.2011.08.014
Source: PubMed

ABSTRACT: Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious disease or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls. Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics, where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples. Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore, known marker genes associated with each cell type were assigned to the correct cell type more frequently by the guided versions. We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.
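
To make the idea in the abstract concrete, the following is a minimal sketch of one way marker-gene information could be imposed on a standard NMF. It is not the authors' implementation: the abstract does not specify the algorithm, so the sketch assumes Lee-Seung multiplicative updates for the Frobenius loss and a hard constraint that each marker gene is expressed only in its own cell type. The names `marker_guided_nmf` and `simulate_mixtures`, the synthetic data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def marker_guided_nmf(X, markers, n_iter=500, eps=1e-9, seed=0):
    """Factor X (genes x samples) as W @ H with nonnegative W and H.

    markers maps a cell-type index to the row indices of its marker genes.
    ASSUMPTION (not from the source): a marker gene is forced to have zero
    signature in every cell type other than its own.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    k = len(markers)

    # mask[g, j] == 0 means gene g may not contribute to cell type j.
    mask = np.ones((n_genes, k))
    for cell_type, genes in markers.items():
        mask[genes, :] = 0.0
        mask[genes, cell_type] = 1.0

    W = (rng.random((n_genes, k)) + eps) * mask   # cell-type signatures
    H = rng.random((k, n_samples)) + eps          # unnormalized proportions

    for _ in range(n_iter):
        # Standard multiplicative updates for the squared-error objective.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
        W *= mask  # re-apply the marker constraint after each update
    return W, H


def simulate_mixtures(n_genes=60, n_types=3, n_markers=5, n_samples=20, seed=0):
    """Synthetic pure profiles mixed in known proportions (illustrative only)."""
    rng = np.random.default_rng(seed)
    W_true = rng.random((n_genes, n_types)) + 0.1
    markers = {}
    for k in range(n_types):
        genes = list(range(k * n_markers, (k + 1) * n_markers))
        W_true[genes, :] = 0.0                      # silent in other types
        W_true[genes, k] = 1.0 + rng.random(n_markers)
        markers[k] = genes
    # Each sample's proportions sum to one.
    H_true = rng.dirichlet(np.ones(n_types), size=n_samples).T
    return W_true @ H_true, W_true, H_true, markers


X, W_true, H_true, markers = simulate_mixtures()
W, H = marker_guided_nmf(X, markers)
# Pearson correlation between true and estimated proportions, per cell type.
correlations = [np.corrcoef(H_true[k], H[k])[0, 1] for k in markers]
print(correlations)
```

Because the markers tie each component to a named cell type, row k of `H` can be compared directly with the known proportions of cell type k, which is how the Pearson correlations quoted in the abstract would be computed. The rows of `H` are only determined up to a scale factor here, which does not affect the correlation.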

  • ABSTRACT: The behavior of epigenetic mechanisms in the brain is obscured by tissue heterogeneity and disease-related histological changes. Not accounting for these confounders leads to biased results. We develop a statistical methodology that estimates and adjusts for cell-type composition by decomposing neuronal and non-neuronal differential signal. This method provides a conceptual framework for deconvolving heterogeneous epigenetic data from postmortem brain studies. We apply it to find cell-specific differentially methylated regions between prefrontal cortex and hippocampus. We demonstrate the utility of the method on both Infinium 450k and CHARM data.
    Genome biology 08/2013; 14(8):R94. · 10.30 Impact Factor
  • ABSTRACT: One of the significant obstacles in the development of clinically relevant microarray-derived biomarkers and classifiers is tissue heterogeneity. Physical cell separation techniques, such as cell-sorting and laser-capture micro-dissection, can enrich samples for cell types of interest, but are costly, labor-intensive and can limit investigation of important interactions between different cell types. We developed a new computational approach, called Microarray Micro-dissection with Analysis of Differences (MMAD), which performs micro-dissection in silico. Notably, MMAD (1) allows for simultaneous estimation of cell fractions and gene expression profiles of contributing cell types, (2) adjusts for microarray normalization bias, (3) utilizes the corrected Akaike Information Criterion (AICc) during model optimization to minimize overfitting, and (4) provides mechanisms for comparing gene expression and cell fractions between samples in different classes. Computational micro-dissection of simulated and experimental tissue mixture datasets showed tight correlations between predicted and measured gene expression of pure tissues as well as tight correlations between reported and estimated cell fraction for each of the individual cell types. In simulation studies, MMAD showed superior ability to detect differentially expressed genes in mixed tissue samples when compared to standard metrics, including both Significance Analysis of Microarrays (SAM) and cell-type specific significance analysis of microarrays (csSAM). Conclusions: We have developed a new computational tool called MMAD, which is capable of performing robust tissue micro-dissection in silico, and which can improve the detection of differentially expressed genes. MMAD software as implemented in MATLAB is publicly available for download at http://sourceforge.net/projects/mmad/.
    Bioinformatics 10/2013; · 5.47 Impact Factor
  • ABSTRACT: Solid tumor samples typically contain multiple distinct clonal populations of cancer cells, as well as stromal and immune cell contamination. The majority of cancer genomics and transcriptomics studies do not explicitly consider genetic heterogeneity and impurity, and draw inferences based on mixed populations of cells. Deconvolution of genomic data from heterogeneous samples provides a powerful tool to address this limitation. We discuss several computational tools that enable deconvolution of genomic and transcriptomic data from heterogeneous samples. We also performed a systematic comparative assessment of these tools. If properly used, these tools have the potential to complement single-cell genomics and immunoFISH analyses, and to provide novel insights into tumor heterogeneity.
    Briefings in Bioinformatics 02/2014; · 5.30 Impact Factor