Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: A case study

Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, South Africa.
Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases (Impact Factor: 3.26). 09/2011; 12(5):913-21. DOI: 10.1016/j.meegid.2011.08.014
Source: PubMed

ABSTRACT Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious diseases or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls. Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples. Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore, known marker genes associated with each cell type were assigned to the correct cell type more frequently by the guided versions.
We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.
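
The marker-guided idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Frobenius-norm NMF fitted with Lee-Seung multiplicative updates, and it encodes the marker knowledge by forcing each marker gene's signature to zero in every cell type other than its own. The function name `marker_guided_nmf`, the `markers` dictionary, and the synthetic mixture data are all hypothetical and serve only as illustration.

```python
import numpy as np


def marker_guided_nmf(X, k, markers, n_iter=500, seed=0, eps=1e-9):
    """Factor X (genes x samples) as W @ H with W, H >= 0.

    markers: dict mapping a cell-type index (0..k-1) to a list of gene
    (row) indices assumed to be expressed only in that cell type.
    This constraint scheme is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.random((n_genes, k)) + eps      # cell-type expression signatures
    H = rng.random((k, n_samples)) + eps    # unnormalized cell-type weights

    # mask[g, c] == 0 means "gene g may not contribute to cell type c".
    mask = np.ones((n_genes, k))
    for cell_type, genes in markers.items():
        mask[genes, :] = 0.0
        mask[genes, cell_type] = 1.0
    W *= mask

    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        W *= mask  # zeros already persist; re-applied to make the constraint explicit

    # Rescale each sample's weights to sum to one so they read as proportions.
    proportions = H / (H.sum(axis=0, keepdims=True) + eps)
    return W, H, proportions


# Synthetic check: three "pure" signatures mixed in known proportions.
rng = np.random.default_rng(42)
W_true = rng.random((50, 3))
W_true[0, [1, 2]] = 0.0   # gene 0 is a marker of cell type 0
W_true[1, [0, 2]] = 0.0   # gene 1 is a marker of cell type 1
W_true[2, [0, 1]] = 0.0   # gene 2 is a marker of cell type 2
H_true = rng.dirichlet(np.ones(3), size=20).T   # 3 x 20, columns sum to 1
X = W_true @ H_true

W_est, H_est, P_est = marker_guided_nmf(X, 3, {0: [0], 1: [1], 2: [2]})

# Pearson correlation between true and estimated weights, per cell type.
for c in range(3):
    r = np.corrcoef(H_true[c], H_est[c])[0, 1]
    print(f"cell type {c}: Pearson r = {r:.3f}")
```

Because the markers tie each component to a named cell type, the rows of the estimated weight matrix can be compared directly with the known proportions, which is the kind of Pearson-correlation evaluation the abstract reports. Without the mask, the components come back in arbitrary order and need not correspond to any cell type of interest.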

  • "The weights are proportional to the relative contribution of these cell types in the mixture and are hence invariable among genes. Subsequent studies have demonstrated that the linearity assumption is valid under a wide variety of experimental conditions, especially when the cellular composition of the heterogeneous tissue was determined in the same object as where the RNA was obtained from [33] [34]. To deconvolve cell-specific gene expression, we applied a statistical methodology of csSAM which, given microarray data from two groups of biological samples and the relative cell-type frequencies of each sample, estimates the average gene expression for each cell-type at a group level, and uses these cellular gene expression levels to identify differentially expressed genes at a cell-type specific level between experimental conditions."
    ABSTRACT: Atherosclerosis is intimately coupled to blood flow by the presence of predilection sites. The coupling is through mechanotransduction of endothelial cells, and approximately 2000 genes are associated with this process. This paper describes a new platform to study and identify new signalling pathways in endothelial cells covering an atherosclerotic plaque. The identified networks are synthesized in primary cells to study their reaction to flow. This synthetic approach might lead to new insights and drug targets.
    FEBS letters 05/2012; 586(15):2164-70. DOI:10.1016/j.febslet.2012.04.031 · 3.34 Impact Factor
  • ABSTRACT: Acute cardiac allograft rejection is a serious complication of heart transplantation. Investigating molecular processes in whole blood via microarrays is a promising avenue of research in transplantation, particularly due to the non-invasive nature of blood sampling. However, whole blood is a complex tissue and the consequent heterogeneity in composition amongst samples is ignored in traditional microarray analysis. This complicates the biological interpretation of microarray data. Here we have applied a statistical deconvolution approach, cell-specific significance analysis of microarrays (csSAM), to whole blood samples from subjects either undergoing acute heart allograft rejection (AR) or not (NR). We identified eight differentially expressed probe-sets significantly correlated to monocytes (mapping to 6 genes, all down-regulated in ARs versus NRs) at a false discovery rate (FDR) ≤ 15%. None of the genes identified are present in a biomarker panel of acute heart rejection previously published by our group and discovered in the same data.
    Bioinformatics and biology insights 04/2012; 6:49-61. DOI:10.4137/BBI.S9197
  • ABSTRACT: Inference of Transcriptional Regulatory Networks (TRNs) provides insight into the mechanisms driving biological systems, especially mammalian development and disease. Many techniques have been developed for TRN estimation from indirect biochemical measurements. Although successful when initially tested in model organisms, these regulatory models often fail when applied to data from multicellular organisms where multiple regulation and gene reuse increase dramatically. Non-negative matrix factorization techniques were initially introduced to find non-orthogonal patterns in data, making them ideal techniques for inference in cases of multiple regulation. We review these techniques and their application to TRN analysis.