Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: A case study

Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, South Africa.
Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases (Impact Factor: 3.02). 09/2011; 12(5):913-21. DOI: 10.1016/j.meegid.2011.08.014
Source: PubMed

ABSTRACT Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious diseases or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls. Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics, where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples. Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore, known marker genes associated with each cell type were assigned to the correct cell type more frequently by the guided versions.
We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.
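
The marker-guided idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Frobenius-norm NMF with multiplicative updates, and it encodes the marker prior in one simple way, by forcing each marker gene to have zero expression in every cell type other than its own. The function names (`simulate_mixtures`, `marker_guided_nmf`, `per_type_pearson`), the simulated data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_mixtures(n_genes=200, n_types=3, n_samples=20, n_markers=10, seed=0):
    """Simulate mixed expression X = W @ H (genes x samples) with known markers.

    Illustrative data only: the first n_markers * n_types genes are markers,
    each expressed in exactly one cell type.
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(n_genes, n_types))  # cell type signatures
    markers = {}
    for j in range(n_types):
        rows = np.arange(j * n_markers, (j + 1) * n_markers)
        W[rows, :] = 0.0
        W[rows, j] = rng.uniform(5.0, 10.0, size=n_markers)
        markers[j] = rows
    # Each column of H holds the cell type proportions of one sample.
    H = rng.dirichlet(np.ones(n_types), size=n_samples).T
    X = W @ H + rng.uniform(0.0, 0.01, size=(n_genes, n_samples))  # small noise
    return X, W, H, markers


def marker_guided_nmf(X, markers, n_types, n_iter=500, seed=0, eps=1e-9):
    """Factor X ~ W @ H with W, H >= 0, keeping marker genes type-specific.

    `markers` maps a cell type index to the row indices of its marker genes.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.uniform(0.1, 1.0, size=(n_genes, n_types))
    H = rng.uniform(0.1, 1.0, size=(n_types, n_samples))

    # Mask is 0 where a marker gene must not be expressed, 1 elsewhere.
    mask = np.ones_like(W)
    for j, rows in markers.items():
        mask[rows, :] = 0.0
        mask[rows, j] = 1.0
    W *= mask

    for _ in range(n_iter):
        # Standard multiplicative updates for the Frobenius objective.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        W *= mask  # zeros already persist under these updates; kept explicit
    return W, H


def per_type_pearson(H_est, H_true):
    """Pearson correlation across samples, one value per cell type."""
    return np.array(
        [np.corrcoef(H_est[j], H_true[j])[0, 1] for j in range(H_true.shape[0])]
    )


X, W_true, H_true, markers = simulate_mixtures()
W_est, H_est = marker_guided_nmf(X, markers, n_types=3)
correlations = per_type_pearson(H_est, H_true)
```

Because the markers pin each component to a known cell type, the rows of `H_est` can be compared with the true proportions directly, without matching components after the fact. The rows of `H_est` are only determined up to a per-type scale factor, so turning them into proportions needs an extra rescaling step that this sketch leaves out; the Pearson correlation is unaffected by that scale.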

    • "Several supervised and semi-supervised computational deconvolution algorithms have also been proposed to tackle this problem [17-21]. However, they require prior knowledge of either the cell type frequencies within a given tissue [19,20], or the in vitro gene expression profiles of each component cell type [17,18]. "
    ABSTRACT: Background Cellular heterogeneity is present in almost all gene expression profiles. However, transcriptome analysis of tissue specimens often ignores the cellular heterogeneity present in these samples. Standard deconvolution algorithms require prior knowledge of the cell type frequencies within a tissue or their in vitro expression profiles. Furthermore, these algorithms tend to report biased estimations. Results Here, we describe a Digital Sorting Algorithm (DSA) for extracting cell-type specific gene expression profiles from mixed tissue samples that is unbiased and does not require prior knowledge of cell type frequencies. Conclusions The results suggest that DSA is a specific and sensitive algorithm in gene expression profile deconvolution and will be useful in studying individual cell types of complex tissues.
    BMC Bioinformatics 03/2013; 14(1):89. DOI:10.1186/1471-2105-14-89 · 2.58 Impact Factor
    • "We note that since we began this work, a small number of authors have published similar deconvolution algorithms using gene expression data [13-15]. The techniques are similar to the quadratic programming method we describe below in Methods for deconvolving a single sample, but none comprehensively addresses statistical properties or employs data from DNA methylation. "
    ABSTRACT: Background There has been a long-standing need in biomedical research for a method that quantifies the normally mixed composition of leukocytes beyond what is possible by simple histological or flow cytometric assessments. The latter is restricted by the labile nature of protein epitopes, requirements for cell processing, and timely cell analysis. In a diverse array of diseases and following numerous immune-toxic exposures, leukocyte composition will critically inform the underlying immuno-biology to most chronic medical conditions. Emerging research demonstrates that DNA methylation is responsible for cellular differentiation, and when measured in whole peripheral blood, serves to distinguish cancer cases from controls. Results Here we present a method, similar to regression calibration, for inferring changes in the distribution of white blood cells between different subpopulations (e.g. cases and controls) using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples. We validate the fundamental idea in a cell mixture reconstruction experiment, then demonstrate our method on DNA methylation data sets from several studies, including data from a Head and Neck Squamous Cell Carcinoma (HNSCC) study and an ovarian cancer study. Our method produces results consistent with prior biological findings, thereby validating the approach. Conclusions Our method, in combination with an appropriate external validation set, promises new opportunities for large-scale immunological studies of both disease states and noxious exposures.
    BMC Bioinformatics 05/2012; 13(1):86. DOI:10.1186/1471-2105-13-86 · 2.58 Impact Factor
    • "The weights are proportional to the relative contribution of these cell types in the mixture and are hence invariable among genes. Subsequent studies have demonstrated that the linearity assumption is valid under a wide variety of experimental conditions, especially when the cellular composition of the heterogeneous tissue was determined in the same object as where the RNA was obtained from [33] [34]. To deconvolve cell-specific gene expression, we applied a statistical methodology of csSAM which, given microarray data from two groups of biological samples and the relative cell-type frequencies of each sample, estimates the average gene expression for each cell-type at a group level, and uses these cellular gene expression levels to identify differentially expressed genes at a cell-type specific level between experimental conditions. "
    ABSTRACT: Atherosclerosis is intimately coupled to blood flow by the presence of predilection sites. The coupling is through mechanotransduction of endothelial cells and approximately 2000 genes are associated with this process. This paper describes a new platform to study and identify new signalling pathways in endothelial cells covering an atherosclerotic plaque. The identified networks are synthesized in primary cells to study their reaction to flow. This synthetic approach might lead to new insights and drug targets.
    FEBS letters 05/2012; 586(15):2164-70. DOI:10.1016/j.febslet.2012.04.031 · 3.17 Impact Factor
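
The linearity assumption quoted in the last excerpt above (mixed expression is a weighted sum of cell-type expression, with weights equal to the cell-type frequencies and constant across genes) can be sketched as an ordinary least-squares problem. This is only an illustration of the regression step that csSAM is described as performing; it is not the csSAM package, and it omits the significance testing that a real analysis would need. The function names and array layout are assumptions made for this example.

```python
import numpy as np


def estimate_cell_type_expression(X, F):
    """Least-squares estimate of cell-type expression under X ~ F @ G.

    X: samples x genes matrix of mixed expression.
    F: samples x cell types matrix of known cell-type frequencies.
    Returns G, a cell types x genes matrix of average expression per type.
    """
    G, *_ = np.linalg.lstsq(F, X, rcond=None)
    return G


def group_difference(X, F, is_case):
    """Fit each group separately and return case minus control expression,
    one value per (cell type, gene) pair."""
    is_case = np.asarray(is_case, dtype=bool)
    G_case = estimate_cell_type_expression(X[is_case], F[is_case])
    G_ctrl = estimate_cell_type_expression(X[~is_case], F[~is_case])
    return G_case - G_ctrl
```

Each group needs at least as many samples as cell types, and the frequencies must vary between samples, otherwise the per-type expression is not identifiable from the mixtures.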