ABSTRACT: False discovery rate (FDR) control is an important tool of statistical inference in feature selection. In mass spectrometry-based metabolomics data, features can be measured at different levels of reliability and false features are often detected in untargeted metabolite profiling as chemical and/or bioinformatics noise. The traditional false discovery rate methods treat all features equally, which can cause substantial loss of statistical power to detect differentially expressed features. We propose a reliability index for mass spectrometry-based metabolomics data with repeated measurements, which is quantified using a composite measure. We then present a new method to estimate the local false discovery rate (lfdr) that incorporates feature reliability. In simulations, our proposed method achieved better balance between sensitivity and controlling false discovery, as compared to traditional lfdr estimation. We applied our method to a real metabolomics dataset and were able to detect more differentially expressed metabolites that were biologically meaningful.
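For readers unfamiliar with the "traditional lfdr estimation" used as the baseline in the abstract above, here is a minimal sketch of the generic two-group local false discovery rate, lfdr(z) = pi0 * f0(z) / f(z). It is not the reliability-weighted method of the paper. It assumes a theoretical N(0, 1) null density f0, a Gaussian kernel estimate of the mixture density f, and a conservative null proportion pi0 = 1. The function names and the bandwidth rule are illustrative choices, not taken from the source.

```python
import math
import random


def normal_pdf(z, mu=0.0, sd=1.0):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    return math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(zs, pi0=1.0, bandwidth=None):
    """Generic two-group lfdr for a list of z-scores (illustrative sketch).

    Assumptions (not from the source): theoretical N(0, 1) null,
    Gaussian kernel density estimate for the mixture density,
    and a user-supplied null proportion pi0 (default 1.0, conservative).
    """
    n = len(zs)
    if bandwidth is None:
        mean = sum(zs) / n
        sd = math.sqrt(sum((z - mean) ** 2 for z in zs) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-0.2)  # normal-reference rule of thumb

    def mixture_density(z):
        # Average of Gaussian kernels centred on every observed z-score.
        return sum(normal_pdf(z, mu=zi, sd=bandwidth) for zi in zs) / n

    # Clip at 1 because the plug-in ratio can exceed 1 near the null centre.
    return [min(1.0, pi0 * normal_pdf(z) / mixture_density(z)) for z in zs]


if __name__ == "__main__":
    random.seed(0)
    # Simulated data: 450 null features and 50 shifted (non-null) features.
    zs = [random.gauss(0, 1) for _ in range(450)]
    zs += [random.gauss(4, 1) for _ in range(50)]
    lfdr = local_fdr(zs)
    selected = sum(1 for v in lfdr if v < 0.2)
    print("features with lfdr < 0.2:", selected)
```

Every feature receives the same treatment in this baseline. A reliability-aware variant would let the estimate depend on a per-feature reliability score as well as on the test statistic.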
ABSTRACT: With modern technologies such as microarray, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS), it is possible to measure the expression levels of thousands of genes/proteins simultaneously to unravel important biological processes. A first step towards elucidating hidden patterns and understanding the massive data is the application of clustering techniques. Nonlinear relations, which have been mostly unutilized in contrast to linear correlations, are prevalent in high-throughput data. In many cases, nonlinear relations can model the biological relationship more precisely and reflect critical patterns in the biological systems. Using the general dependency measure, Distance Based on Conditional Ordered List (DCOL) that we introduced before, we designed the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm. The method has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles. Results from extensive simulation studies showed that K-profiles clustering not only outperformed the traditional linear K-means algorithm, but also performed significantly better than our previous General Dependency Hierarchical Clustering (GDHC) algorithm. We further analyzed a gene expression dataset, on which K-profiles clustering generated biologically meaningful results.
ABSTRACT: Probabilistic association discovery aims at identifying the association between random vectors, regardless of the number of variables involved or linear/nonlinear functional forms. Recently, applications in high-dimensional data have generated rising interest in probabilistic association discovery. We developed a framework based on functions on the observation graph, named MeDiA (Mean Distance Association). We generalize its property to a group of functions on the observation graph. The group of functions encapsulates major existing methods in association discovery, e.g. mutual information and Brownian Covariance, and can be expanded to more complicated forms. We conducted a numerical comparison of the statistical power of related methods under multiple scenarios. We further demonstrated the application of MeDiA as a method of gene set analysis that captures a broader range of responses than traditional gene set analysis methods.
ABSTRACT: HIV-1 infection is characterized by varying degrees of chronic immune activation and disruption of T-cell homeostasis, which impact the rate of disease progression. A deeper understanding of the factors that influence HIV-1-induced immunopathology and subsequent CD4(+) T-cell decline is critical to strategies aimed at controlling or eliminating the virus. In an analysis of 127 acutely infected Zambians, we demonstrate a dramatic and early impact of viral replicative capacity (vRC) on HIV-1 immunopathogenesis that is independent of viral load (VL). Individuals infected with high-RC viruses exhibit a distinct inflammatory cytokine profile as well as significantly elevated T-cell activation, proliferation, and CD8(+) T-cell exhaustion, during the earliest months of infection. Moreover, the vRC of the transmitted virus is positively correlated with the magnitude of viral burden in naive and central memory CD4(+) T-cell populations, raising the possibility that transmitted viral phenotypes may influence the size of the initial latent viral reservoir. Taken together, these findings support an unprecedented role for the replicative fitness of the founder virus, independent of host protective genes and VL, in influencing multiple facets of HIV-1-related immunopathology, and suggest that a greater focus on this parameter could provide novel approaches to clinical interventions.
Full-text · Article · Feb 2015 · Proceedings of the National Academy of Sciences
ABSTRACT: High-throughput expression data, such as gene expression and metabolomics data, exhibit modular structures. Groups of features in each module follow a latent factor model, while between modules, the latent factors are quasi-independent. Recovering the latent factors can shed light on the hidden regulation patterns of the expression. The difficulty in detecting such modules and recovering the latent factors lies in the high dimensionality of the data, and the lack of knowledge of module membership.
Here we describe a method based on community detection in the co-expression network. It consists of inference-based network construction, module detection, and interacting latent factor detection from modules.
In simulations, the method outperformed projection-based modular latent factor discovery when the input signals were not Gaussian. We also demonstrate the method's value in real data analysis.
The new method nMLSA (network-based modular latent structure analysis) is effective in detecting latent structures, and is easy to extend to non-linear cases. The method is available as R code at http://web1.sph.emory.edu/users/tyu8/nMLSA/.
ABSTRACT: Background
Understanding the metabolites that are altered by donor red blood cell (RBC) storage and irradiation may provide insight into the metabolic pathways disrupted by the RBC storage lesion.
Study Design and Methods
Patterns of metabolites, representing more than 11,000 distinct mass-to-charge ratio (m/z) features, were compared between gamma-irradiated and nonirradiated CPDA-1–split RBCs from six human donors over 35 days of storage using multilevel sparse partial least squares discriminant analysis (msPLSDA), hierarchical clustering, pathway enrichment analysis, and network analysis.
Results
In msPLSDA analysis, RBC units stored 7 days or fewer (irradiated or nonirradiated) showed similar metabolomic profiles. By contrast, donor RBCs stored 10 days or more demonstrated distinct clustering as a function of storage time and irradiation. Irradiation shifted metabolic features to those seen in older units. Hierarchical clustering analysis identified at least two clusters of metabolites that differentiated between RBC units based on storage time and irradiation exposure, confirming results of the msPLSDA analysis. Pathway enrichment analysis, used to map the discriminatory biochemical features to specific metabolic pathways, identified four pathways significantly affected by irradiation and/or storage including arachidonic acid (p = 3.3 × 10^-33) and linoleic acid (p = 1.61 × 10^-11) metabolism.
Conclusion
RBC storage under blood bank conditions produces numerous metabolic alterations. Gamma irradiation accentuates these differences as the age of blood increases, indicating that at the biochemical level irradiation accelerates metabolic aging of stored RBCs. Metabolites involved in the cellular membrane are prominently affected and may be useful biomarkers of the RBC storage lesion.
ABSTRACT: Feature selection is a critical step in translational omics research. False discovery rate (FDR) is an integral tool of statistical inference in feature selection from high-throughput data. It is commonly used to screen features (SNPs, genes, proteins, or metabolites) for their relevance to the specific clinical outcome under study. Traditionally, all features are treated equally in the calculation of false discovery rate. In many applications, different features are measured with different levels of reliability. In such situations, treating all features equally will cause substantial loss of statistical power to detect significant features. Feature reliability can often be quantified in the measurements. Here we present a new method to estimate the local false discovery rate that incorporates feature reliability. We also propose a composite reliability index for metabolomics data. Combined with the new local false discovery rate method, it helps to detect more differentially expressed metabolites that are biologically meaningful in a real metabolomics dataset.
ABSTRACT: It is very challenging to select informative features from tens of thousands
of measured features in high-throughput data analysis. Recently, several
parametric/regression models have been developed utilizing the gene network
information to select genes or pathways strongly associated with a
clinical/biological outcome. Alternatively, in this paper, we propose a
nonparametric Bayesian model for gene selection incorporating network
information. In addition to identifying genes that have a strong association
with a clinical outcome, our model can select genes with particular
expressional behavior, in which case the regression models are not directly
applicable. We show that our proposed model is equivalent to an infinite
mixture model for which we develop a posterior computation algorithm based on
Markov chain Monte Carlo (MCMC) methods. We also propose two fast computing
algorithms that approximate the posterior simulation with good accuracy but
relatively low computational cost. We illustrate our methods on simulation
studies and the analysis of Spellman yeast cell cycle microarray data.
Full-text · Article · Jul 2014 · The Annals of Applied Statistics
ABSTRACT: Motivation:
Peak detection is a key step in the preprocessing of untargeted metabolomics data generated from high-resolution liquid chromatography-mass spectrometry (LC/MS). The common practice is to use filters with predetermined parameters to select peaks in the LC/MS profile. This rigid approach can cause suboptimal performance when the choice of peak model and parameters does not suit the data characteristics.
Here we present a method that learns directly from various data features of the extracted ion chromatograms (EICs) to differentiate true peak regions from noise regions in the LC/MS profile. It utilizes the knowledge of known metabolites, as well as robust machine learning approaches. Unlike currently available methods, this new approach does not assume a parametric peak shape model and allows maximum flexibility. We demonstrate the superiority of the new approach using real data. Because matching to known metabolites entails uncertainties and cannot be considered a gold standard, we also developed a probabilistic receiver-operating characteristic (pROC) approach that can incorporate uncertainties.
Availability and implementation:
The new peak detection approach is implemented as part of the apLCMS package available at http://web1.sph.emory.edu/apLCMS/
Contact: email@example.com
Supplementary data are available at Bioinformatics online.
ABSTRACT:
It remains a challenge to develop a successful human immunodeficiency virus (HIV) vaccine that is capable of preventing infection. Here, we utilized the benefits of CD40L, a costimulatory molecule that can stimulate both dendritic cells (DCs) and B cells, as an adjuvant for our simian immunodeficiency virus (SIV) DNA vaccine in rhesus macaques. We coexpressed the CD40L with our DNA/SIV vaccine such that the CD40L is anchored on the membrane of SIV virus-like particle (VLP). These CD40L containing SIV VLPs showed enhanced activation of DCs in vitro. We then tested the potential of DNA/SIV-CD40L vaccine to adjuvant the DNA prime of a DNA/modified vaccinia virus Ankara (MVA) vaccine in rhesus macaques. Our results demonstrated that the CD40L adjuvant enhanced the functional quality of anti-Env antibody response and breadth of anti-SIV CD8 and CD4 T cell responses, significantly delayed the acquisition of heterologous mucosal SIV infection, and improved viral control. Notably, the CD40L adjuvant enhanced the control of viral replication in the gut at the site of challenge that was associated with lower mucosal CD8 immune activation, one of the strong predictors of disease progression. Collectively, our results highlight the benefits of CD40L adjuvant for enhancing antiviral humoral and cellular immunity, leading to enhanced protection against a pathogenic SIV. A single adjuvant that enhances both humoral and cellular immunity is rare and thus underlines the importance and practicality of CD40L as an adjuvant for vaccines against infectious diseases, including HIV-1.
Despite many advances in the field of AIDS research, an effective AIDS vaccine that can prevent infection remains elusive. CD40L is a key stimulator of dendritic cells and B cells and can therefore enhance T cell and antibody responses, but its overly potent nature can lead to adverse effects unless used in small doses. In order to modulate local expression of CD40L at relatively lower levels, we expressed CD40L in a membrane-bound form, along with SIV antigens, in a nucleic acid (DNA) vector. We tested the immunogenicity and efficacy of the CD40L-adjuvanted vaccine in macaques using a heterologous mucosal SIV infection. The CD40L-adjuvanted vaccine enhanced the functional quality of anti-Env antibody response and breadth of anti-SIV T cell responses and improved protection. These results demonstrate that VLP-membrane-bound CD40L serves as a novel adjuvant for an HIV vaccine.
Preview · Article · Jun 2014 · Journal of Virology
ABSTRACT: Mining novel biomarkers from gene expression profiles for accurate disease classification is challenging due to small sample size and high noise in gene expression measurements. Several studies have proposed integrated analyses of microarray data and protein-protein interaction (PPI) networks to find diagnostic subnetwork markers. However, the neighborhood relationship among network member genes has not been fully considered by those methods, leaving many potential gene markers unidentified. The main idea of this study is to take full advantage of the biological observation that genes associated with the same or similar diseases commonly reside in the same neighborhood of molecular networks.
We present EgoNet, a novel method based on egocentric network-analysis techniques, to exhaustively search and prioritize disease subnetworks and gene markers from a large-scale biological network. When applied to a triple-negative breast cancer (TNBC) microarray dataset, the top selected modules contain both known gene markers in TNBC and novel candidates, such as RAD51 and DOK1, which play a central role in their respective ego-networks by connecting many differentially expressed genes.
Our results suggest that EgoNet, which is based on the ego network concept, allows the identification of novel biomarkers and provides a deeper understanding of their roles in complex diseases.
ABSTRACT: Short telomere length, a marker of biological aging, has been associated with age-related metabolic disorders. Telomere attrition induces profound metabolic dysfunction in animal models, but no study has examined the metabolome of telomeric aging in humans. Here we studied 423 apparently healthy American Indians participating in the Strong Family Heart Study. Leukocyte telomere length (LTL) was measured by qPCR. Metabolites in fasting plasma were detected by untargeted LC/MS. Associations of LTL with each metabolite and their combined effects were examined using generalized estimating equations adjusting for chronological age and other aging-related factors. Multiple testing was corrected using the q-value method (q<0.05). Of the 1,364 distinct m/z features detected, nineteen metabolites in the classes of glycerophosphoethanolamines, glycerophosphocholines, glycerolipids, bile acids, isoprenoids, fatty amides, or L-carnitine ester were significantly associated with LTL, independent of chronological age and other aging-related factors. Participants with longer (top tertile) and shorter (bottom tertile) LTL were clearly separated into distinct groups using a multi-marker score comprising all these metabolites, suggesting that these newly detected metabolites could be novel metabolic markers of biological aging. This is the first study to interrogate the human metabolome of telomeric aging. Our results provide initial evidence for a metabolic control of LTL and may reveal previously undescribed roles of various lipids in the aging process.
ABSTRACT: Metabolic profiling is the unbiased detection and quantification of low molecular-weight metabolites in a living system. It is rapidly developing in biological and translational research, contributing to disease mechanism elucidation, environmental chemical surveillance, biomarker detection, and health outcome prediction. Recent developments in experimental and computational technology allow more and more known metabolites to be detected and quantified from complex samples. As the coverage of the metabolic network improves, it has become feasible to examine metabolic profiling data from a systems perspective, i.e. interpreting the data and performing statistical inference in the context of pathways and genome-scale metabolic networks. Recently a number of methods have been developed in this area, and much improvement in algorithms and databases is still needed. In this review, we survey some methods for the analysis of metabolic profiling data based on metabolic networks.
ABSTRACT: High-throughput expression technologies, such as gene expression arrays and liquid chromatography-mass spectrometry (LC-MS), measure thousands of features, i.e., genes or metabolites, on a continuous scale. In such data, both linear and nonlinear relations exist between features. Nonlinear relations can reflect critical regulation patterns in the biological system. However, they are not identified and utilized by traditional clustering methods based on linear associations. Clustering based on general dependences, i.e., both linear and nonlinear relations, is hampered by the high dimensionality and high noise level of the data. We developed a sensitive nonparametric measure of general dependence between (groups of) random variables in high dimensions. Based on this dependence measure, we developed a hierarchical clustering method. In simulation studies, the method outperformed correlation- and mutual information (MI)-based hierarchical clustering methods in clustering features with nonlinear dependences. We applied the method to a microarray data set measuring the gene expression in cell-cycle time series to show it generates biologically relevant results. The R code is available at http://userwww.service.emory.edu/~tyu8/GDHC.
No preview · Article · Aug 2013 · IEEE/ACM Transactions on Computational Biology and Bioinformatics
ABSTRACT: Joint analyses of high-throughput datasets generate the need to assess the association between two long lists of p-values. In such p-value lists, the vast majority of the features are insignificant. Ideally contributions of features that are null in both tests should be minimized. However, by random chance their p-values are uniformly distributed between zero and one, and weak correlations of the p-values may exist due to inherent biases in the high-throughput technology used to generate the multiple datasets. Rank-based agreement tests may capture such unwanted effects. Testing contingency tables generated using hard cutoffs may be sensitive to arbitrary threshold choice. We develop a novel method based on feature-level concordance using local false discovery rate. The association score enjoys straightforward interpretation. The method shows higher statistical power to detect association between p-value lists in simulation. We demonstrate its utility using real data analysis. The R implementation of the method is available at http://userwww.service.emory.edu/~tyu8/AAPL/.
No preview · Article · Apr 2013 · Statistical Analysis and Data Mining
ABSTRACT: Feature detection is a critical step in the preprocessing of Liquid Chromatography - Mass Spectrometry (LC-MS) metabolomics data. Currently, the predominant approach is to detect features using noise filters and peak shape models based on the data at hand alone. Databases of known metabolites and historical data contain information that could help boost the sensitivity of feature detection, especially for low-concentration metabolites. However, utilizing such information in targeted feature detection may cause a large number of false positives because of the high levels of noise in LC-MS data. With high-resolution mass spectrometry such as Liquid Chromatography - Fourier Transform Mass Spectrometry (LC-FTMS), high-confidence matching of peaks to known features is feasible. Here we describe a computational approach that serves two purposes. First, it boosts feature detection sensitivity by using a hybrid procedure of both untargeted and targeted peak detection. New algorithms are designed to reduce the chance of false positives by non-parametric local peak detection and filtering. Second, it can accumulate information on the concentration variation of metabolites over a large number of samples, which can help find rare features and/or features with uncommon concentration in future studies. Information can be accumulated on features that are consistently found in real data even before their identities are found. We demonstrate the value of the approach in a proof-of-concept study. The method is implemented as part of the R package apLCMS at http://www.sph.emory.edu/apLCMS/.
No preview · Article · Jan 2013 · Journal of Proteome Research
ABSTRACT: Venn diagrams representing overlapping features between the default setting and variations in min.run at min.pres = 0.3. Only unique features at tolerance level of 10 ppm were used to generate Venn diagrams using BioVenn (