Hongzhe Li

University of Pennsylvania, Philadelphia, Pennsylvania, United States

Publications (77) · 294.69 Total impact

  • ABSTRACT: This paper studies the problem of detecting dependence between two mixture distributions, motivated by questions arising from statistical genomics. The fundamental limits of detecting weak positive dependence are derived and an oracle test statistic is proposed. It is shown that for mixture distributions whose components are stochastically ordered, the oracle test statistic is asymptotically optimal. Connections are drawn between dependency detection and signal detection, where the goal of the latter is to detect the presence of non-null components in a single mixture distribution. It is shown that the oracle test for dependency can also be used as a signal detection procedure in the two-sample setting, and there can achieve detection even when detection using each sample separately is provably impossible. A nonparametric data-adaptive test statistic is then proposed, and its closed-form asymptotic distribution under the null hypothesis of independence is established. Simulations show that the adaptive procedure performs as well as the oracle test statistic, and that both can be more powerful than existing methods. In an application to the analysis of the shared genetic basis of psychiatric disorders, the adaptive test is able to detect genetic relationships not detected by other procedures.
  • ABSTRACT: The consumption of an agrarian diet is associated with a reduced risk for many diseases associated with a 'Westernised' lifestyle. Studies suggest that diet affects the gut microbiota, which subsequently influences the metabolome, thereby connecting diet, microbiota and health. However, the degree to which diet influences the composition of the gut microbiota is controversial. Murine models and studies comparing the gut microbiota in humans residing in agrarian versus Western societies suggest that the influence is large. To separate global environmental influences from dietary influences, we characterised the gut microbiota and the host metabolome of individuals consuming an agrarian diet in Western society. Using 16S rRNA-tagged sequencing as well as plasma and urinary metabolomic platforms, we compared measures of dietary intake, gut microbiota composition and the plasma metabolome between healthy human vegans and omnivores, sampled in an urban USA environment. Plasma metabolome of vegans differed markedly from omnivores but the gut microbiota was surprisingly similar. Unlike prior studies of individuals living in agrarian societies, higher consumption of fermentable substrate in vegans was not associated with higher levels of faecal short chain fatty acids, a finding confirmed in a 10-day controlled feeding experiment. Similarly, the proportion of vegans capable of producing equol, a soy-based gut microbiota metabolite, was lower than that reported in Asian societies despite the high consumption of soy-based products. Evidently, residence in globally distinct societies helps determine the composition of the gut microbiota that, in turn, influences the production of diet-dependent gut microbial metabolites.
    Gut 11/2014; · 10.73 Impact Factor
  • Source
    ABSTRACT: In a genome-wide survey on somatic copy-number alterations (SCNAs) of long noncoding RNA (lncRNA) in 2,394 tumor specimens from 12 cancer types, we found that about 21.8% of lncRNA genes were located in regions with focal SCNAs. By integrating bioinformatics analyses of lncRNA SCNAs and expression with functional screening assays, we identified an oncogene, focally amplified lncRNA on chromosome 1 (FAL1), whose copy number and expression are correlated with outcomes in ovarian cancer. FAL1 associates with the epigenetic repressor BMI1 and regulates its stability in order to modulate the transcription of a number of genes including CDKN1A. The oncogenic activity of FAL1 is partially attributable to its repression of p21. FAL1-specific siRNAs significantly inhibit tumor growth in vivo.
    Cancer cell. 09/2014; 26(3):344-357.
  • Sihai D Zhao, T Tony Cai, Hongzhe Li
    ABSTRACT: Integrative genomics offers a promising approach to more powerful genetic association studies. The hope is that combining outcome and genotype data with other types of genomic information can lead to more powerful SNP detection. We present a new association test based on a statistical model that explicitly assumes that genetic variations affect the outcome through perturbing gene expression levels. It is shown analytically that the proposed approach can have more power to detect SNPs that are associated with the outcome through transcriptional regulation, compared to tests using the outcome and genotype data alone, and simulations show that our method is relatively robust to misspecification. We also provide a strategy for applying our approach to high-dimensional genomic data. We use this strategy to identify a potentially new association between a SNP and a yeast cell's response to the natural product tomatidine, which standard association analysis did not detect.
    Biometrics 06/2014; · 1.41 Impact Factor
  • Wei Wang, Zhi Wei, Hongzhe Li
    ABSTRACT: Next-generation RNA sequencing offers an opportunity to investigate the transcriptome at an unprecedented scale. Recent studies have revealed widespread alternative polyadenylation (APA) in eukaryotes, leading to various mRNA isoforms differing in their 3'UTR, through which the stability, localization and translation of mRNA can be regulated. However, very few, if any, methods and tools are available for directly analyzing this special alternative RNA processing event. Conventional methods rely on annotation of polyadenylation sites; yet, such knowledge remains incomplete, and identification of polyA sites is still challenging. The goal of this article is to develop methods for detecting 3'UTR switching without any prior knowledge of polyA annotations. We propose a change-point model based on a likelihood ratio test for detecting 3'UTR switching. We develop a directional testing procedure for identifying dramatic shortening or lengthening events in 3'UTR, while controlling mixed directional FDR at a nominal level. To our knowledge, this is the first approach to analyze 3'UTR switching directly without relying on any polyA annotations. Simulation studies and applications to two real datasets reveal that our proposed method is powerful, accurate and feasible for the analysis of next-generation RNA sequencing data. The proposed method will fill a void among alternative RNA processing analysis tools for transcriptome studies. It can help to obtain additional insights from RNA sequencing data by understanding gene regulation mechanisms through the analysis of 3'UTR switching. The software is implemented in Java and can be freely downloaded from http://utr.sourceforge.net/. Contact: zhiwei@njit.edu; hongzhe@mail.med.upenn.edu.
    Bioinformatics 04/2014; · 5.47 Impact Factor
  • ABSTRACT: Copy number variants (CNVs) constitute an important class of genetic variants in the human genome and are shown to be associated with complex diseases. Whole-genome sequencing provides an unbiased way of identifying all the CNVs that an individual carries. In this paper, we consider parametric modeling of the read depth (RD) data from whole-genome sequencing with the aim of identifying the CNVs, including both Poisson and negative-binomial modeling of such count data. We propose a unified approach of using a mean-matching variance stabilizing transformation to turn the relatively complicated problem of sparse segment identification for count data into a sparse segment identification problem for a sequence of Gaussian data. We apply the optimal sparse segment identification procedure to the transformed data in order to identify the CNV segments. This provides a computationally efficient approach for RD-based CNV identification. Simulation results show that this approach often results in a small number of false identifications of the CNVs and has similar or better performance in identifying the true CNVs when compared with other RD-based approaches. We demonstrate the methods using the trio data from the 1000 Genomes Project.
    Biostatistics 01/2014; · 2.43 Impact Factor
  • ABSTRACT: It is often of interest to understand how the structure of a genetic network differs between two conditions. In this paper, each condition-specific network is modelled using the precision matrix of a multivariate normal random vector, and a method is proposed to directly estimate the difference of the precision matrices. In contrast to other approaches, such as separate or joint estimation of the individual matrices, direct estimation does not require those matrices to be sparse, and thus can allow the individual networks to contain hub nodes. Under the assumption that the true differential network is sparse, the direct estimator is shown to be consistent in support recovery and estimation. It is also shown to outperform existing methods in simulations, and its properties are illustrated on gene expression data from late-stage ovarian cancer patients.
    Biometrika 01/2014; 2(2). · 1.65 Impact Factor
  • Source
    ABSTRACT: Correctly estimating isoform-specific gene expression is important for understanding complicated biological mechanisms and for mapping disease susceptibility genes. However, estimating isoform-specific gene expression is challenging because various biases present in RNA-Seq (RNA sequencing) data complicate the analysis, and if not appropriately corrected, can affect isoform expression estimation and downstream analysis. In this article, we present PennSeq, a statistical method that allows each isoform to have its own non-uniform read distribution. Instead of making parametric assumptions, we give adequate weight to the underlying data by the use of a non-parametric approach. Our rationale is that regardless of what factors lead to non-uniformity, whether it is due to hexamer priming bias, local sequence bias, positional bias, RNA degradation, mapping bias or other unknown reasons, the probability that a fragment is sampled from a particular region will be reflected in the aligned data. This empirical approach thus maximally reflects the true underlying non-uniform read distribution. We evaluate the performance of PennSeq using both simulated data with known ground truth, and using two real Illumina RNA-Seq data sets including one with quantitative real time polymerase chain reaction measurements. Our results indicate superior performance of PennSeq over existing methods, particularly for isoforms demonstrating severe non-uniformity. PennSeq is freely available for download at http://sourceforge.net/projects/pennseq.
    Nucleic Acids Research 12/2013; · 8.81 Impact Factor
  • ABSTRACT: Changes in the human microbiome are associated with many human diseases. Next generation sequencing technologies make it possible to quantify the microbial composition without the need for laboratory cultivation. One important problem of microbiome data analysis is to identify the environmental/biological covariates that are associated with different bacterial taxa. Taxa count data in microbiome studies are often over-dispersed and include many zeros. To account for such over-dispersion, we propose to use an additive logistic normal multinomial regression model to associate the covariates to bacterial composition. The model can naturally account for sampling variabilities and zero observations and also allow for a flexible covariance structure among the bacterial taxa. In order to select the relevant covariates and to estimate the corresponding regression coefficients, we propose a group ℓ1 penalized likelihood estimation method for variable selection and estimation. We develop a Monte Carlo expectation-maximization algorithm to implement the penalized likelihood estimation. Our simulation results show that the proposed method outperforms the group ℓ1 penalized multinomial logistic regression and the Dirichlet multinomial regression models in variable selection. We demonstrate the methods using a data set that links human gut microbiome to micro-nutrients in order to identify the nutrients that are associated with the human gut microbiome enterotype.
    Biometrics 10/2013; · 1.41 Impact Factor
  • Hongzhe Li
    ABSTRACT: Systems biology approaches to epidemiological studies of complex diseases include collection of genetic, genomic, epigenomic, and metagenomic data in large-scale epidemiological studies of complex phenotypes. Designs and analyses of such studies raise many statistical challenges. This article reviews some issues related to integrative analysis of such high-dimensional and inter-related datasets and outlines some possible solutions. I focus my review on integrative approaches for genome-wide genetic variants and gene expression data, methods for joint analysis of genetic and epigenetic variants, and methods for analysis of microbiome data. Statistical methods such as mediation analysis, high-dimensional instrumental variable regression, sparse signal recovery, and compositional data regression provide potential frameworks for integrative analysis of these high-dimensional genomic data.
    Wiley Interdisciplinary Reviews Systems Biology and Medicine 09/2013; · 3.68 Impact Factor
  • Source
    Wanlu Deng, Zhi Geng, Hongzhe Li
    ABSTRACT: Multivariate time series (MTS) data such as time course gene expression data in genomics are often collected to study the dynamic nature of the systems. These data provide important information about the causal dependency among a set of random variables. In this paper, we introduce a computationally efficient algorithm to learn directed acyclic graphs (DAGs) based on MTS data, focusing on learning the local structure of a given target variable. Our algorithm is based on learning all parents (P), all children (C) and some descendants (D) (PCD) iteratively, utilizing the time order of the variables to orient the edges. This time series PCD-PCD algorithm (tsPCD-PCD) extends the previous PCD-PCD algorithm to dependent observations and utilizes composite likelihood ratio tests (CLRTs) for testing the conditional independence. We present the asymptotic distribution of the CLRT statistic and show that the tsPCD-PCD is guaranteed to recover the true DAG structure when the faithfulness condition holds and the tests correctly reject the null hypotheses. Simulation studies show that the CLRTs are valid and perform well even when the sample sizes are small. In addition, the tsPCD-PCD algorithm outperforms the PCD-PCD algorithm in recovering the local graph structures. We illustrate the algorithm by analyzing time course gene expression data related to mouse T-cell activation.
    The Annals of Applied Statistics 09/2013; 7(3). · 2.24 Impact Factor
  • ABSTRACT: RNA-Seq has drastically changed our ways of studying transcriptomes in providing more precise estimates of gene expression, including isoform-specific expression. Most of the available methods for RNA-Seq data focus on one sample at a time. We present in this paper a Poisson-Gamma hierarchical model for multi-sample RNA-Seq data analysis in order to simultaneously estimate isoform-specific expression and to identify differentially expressed isoforms. Our model has the advantage of borrowing information across all samples in estimating expression levels, which can improve the estimates drastically, particularly for low abundance isoforms. Furthermore, our hierarchical model has the ability to account for overdispersion in the data and also can incorporate sample-specific covariates in the underlying model, which facilitates the isoform-specific differential expression analysis. Simulation studies demonstrated that this Bayesian multi-sample approach can lead to more precise estimates of isoform-specific expression and higher power to detect differential expression by borrowing information across all samples than single sample analysis, especially for isoforms of low abundance. We further illustrated our methods using the RNA-Seq data of 10 Yoruban and 10 Caucasian individuals.
    Statistics in Biosciences 05/2013; 5(1):119-137.
  • Source
    Wei Lin, Rui Feng, Hongzhe Li
    ABSTRACT: In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. In the representative case of $L_1$ regularization, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensions of covariates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data.
    Journal of the American Statistical Association 04/2013; · 1.83 Impact Factor
  • Source
    ABSTRACT: Intestinal microbiota metabolism of choline and phosphatidylcholine produces trimethylamine (TMA), which is further metabolized to a proatherogenic species, trimethylamine-N-oxide (TMAO). We demonstrate here that metabolism by intestinal microbiota of dietary l-carnitine, a trimethylamine abundant in red meat, also produces TMAO and accelerates atherosclerosis in mice. Omnivorous human subjects produced more TMAO than did vegans or vegetarians following ingestion of l-carnitine through a microbiota-dependent mechanism. The presence of specific bacterial taxa in human feces was associated with both plasma TMAO concentration and dietary status. Plasma l-carnitine levels in subjects undergoing cardiac evaluation (n = 2,595) predicted increased risks for both prevalent cardiovascular disease (CVD) and incident major adverse cardiac events (myocardial infarction, stroke or death), but only among subjects with concurrently high TMAO levels. Chronic dietary l-carnitine supplementation in mice altered cecal microbial composition, markedly enhanced synthesis of TMA and TMAO, and increased atherosclerosis, but this did not occur if intestinal microbiota was concurrently suppressed. In mice with an intact intestinal microbiota, dietary supplementation with TMAO or either carnitine or choline reduced in vivo reverse cholesterol transport. Intestinal microbiota may thus contribute to the well-established link between high levels of red meat consumption and CVD risk.
    Nature medicine 04/2013; · 27.14 Impact Factor
  • Source
    Jun Chen, Hongzhe Li
    ABSTRACT: With the development of next generation sequencing technology, researchers are now able to study the microbiome composition using direct sequencing, whose output is bacterial taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of the covariates is large, multiple testing can lead to loss of power. To deal with the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group ℓ1 penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results have clearly shown that the nutrient intake is strongly associated with the human gut microbiome.
    The Annals of Applied Statistics 03/2013; 7(1). · 2.24 Impact Factor
  • ABSTRACT: Motivated by analysis of genetical genomics data, we introduce a sparse high-dimensional multivariate regression model for studying conditional independence relationships among a set of genes adjusting for possible genetic effects. The precision matrix in the model specifies a covariate-adjusted Gaussian graph, which presents the conditional dependence structure of gene expressions after the confounding genetic effects on gene expression are taken into account. We present a covariate-adjusted precision matrix estimation method using a constrained ℓ1 minimization, which can be easily implemented by linear programming. Asymptotic convergence rates in various matrix norms and sign consistency are established for the estimators of the regression coefficients and the precision matrix, allowing both the number of genes and the number of the genetic variants to diverge. Simulation shows that the proposed method results in significant improvements in both precision matrix estimation and graphical structure selection when compared to the standard Gaussian graphical model assuming constant means. The proposed method is applied to yeast genetical genomics data for the identification of the gene network among a set of genes in the mitogen-activated protein kinase pathway.
    Biometrika 01/2013; 1(1). · 1.65 Impact Factor
  • Source
    ABSTRACT: Diet influences health as a source of nutrients and toxins, and by shaping the composition of resident microbial populations. Previous studies have begun to map out associations between diet and the bacteria and viruses of the human gut microbiome. Here we investigate associations of diet with fungal and archaeal populations, taking advantage of samples from 98 well-characterized individuals. Diet was quantified using inventories scoring both long-term and recent diet, and archaea and fungi were characterized by deep sequencing of marker genes in DNA purified from stool. For fungi, we found 66 genera, with generally mutually exclusive presence of either the phyla Ascomycota or Basidiomycota. For archaea, Methanobrevibacter was the most prevalent genus, present in 30% of samples. Several other archaeal genera were detected in lower abundance and frequency. Myriad associations were detected for fungi and archaea with diet, with each other, and with bacterial lineages. Methanobrevibacter and Candida were positively associated with diets high in carbohydrates, but negatively with diets high in amino acids, protein, and fatty acids. A previous study emphasized that bacterial population structure was associated primarily with long-term diet, but high Candida abundance was most strongly associated with the recent consumption of carbohydrates. Methanobrevibacter abundance was associated with both long-term and recent consumption of carbohydrates. These results confirm earlier targeted studies and provide a host of new associations to consider in modeling the effects of diet on the gut microbiome and human health.
    PLoS ONE 01/2013; 8(6):e66019. · 3.53 Impact Factor
  • Source
    ABSTRACT: Copy number variations (CNVs) are associated with many complex diseases. Next generation sequencing data enable one to identify precise CNV breakpoints to better understand the underlying molecular mechanisms and to design more efficient assays. Using the CIGAR strings of the reads, we develop a method that can identify the exact CNV breakpoints, and in cases when the breakpoints are in a repeated region, the method reports a range where the breakpoints can slide. Our method identifies the breakpoints of a CNV using both the positions and CIGAR strings of the reads that cover breakpoints of a CNV. A read with a long soft clipped part (denoted as S in CIGAR) at its 3' (right) end can be used to identify the 5' (left) side of the breakpoints, and a read with a long S part at the 5' end can be used to identify the breakpoint at the 3' side. To ensure both types of reads cover the same CNV, we require the overlapped common string to include both of the soft clipped parts. When a CNV starts and ends in the same repeated regions, its breakpoints are not unique, in which case our method reports the leftmost positions for the breakpoints and a range within which the breakpoints can be incremented without changing the variant sequence. We have implemented the methods in a C++ package intended for the current Illumina MiSeq and HiSeq platforms for both whole genome and exon-sequencing. Our simulation studies have shown that our method compares favorably with other similar methods in terms of true discovery rate, false positive rate and breakpoint accuracy. Our results from a real application have shown that the detected CNVs are consistent with zygosity and read depth information. The software package is available at http://statgene.med.upenn.edu/softprog.html.
    Frontiers in Genetics 01/2013; 4:157.
  • X Jessie Jeng, T Tony Cai, Hongzhe Li
    ABSTRACT: Copy number variant is an important type of genetic structural variation appearing in germline DNA, ranging from common to rare in a population. Both rare and common copy number variants have been reported to be associated with complex diseases, so it is therefore important to simultaneously identify both based on a large set of population samples. We develop a proportion adaptive segment selection procedure that automatically adjusts to the unknown proportions of the carriers of the segment variants. We characterize the detection boundary that separates the region where a segment variant is detectable by some method from the region where it cannot be detected. Although the detection boundaries are very different for the rare and common segment variants, it is shown that the proposed procedure can reliably identify both whenever they are detectable. Compared with methods for single sample analysis, this procedure gains power by pooling information from multiple samples. The method is applied to analyze neuroblastoma samples and identifies a large number of copy number variants that are missed by single-sample methods.
    Biometrika 01/2013; 100(1):157-172. · 1.65 Impact Factor
  • T Tony Cai, X Jessie Jeng, Hongzhe Li
    ABSTRACT: Copy number variants (CNVs) are alterations of the DNA of a genome that result in the cell having fewer or more than two copies of segments of the DNA. CNVs correspond to relatively large regions of the genome, ranging from about one kilobase to several megabases, that are deleted or duplicated. Motivated by CNV analysis based on next generation sequencing data, we consider the problem of detecting and identifying sparse short segments hidden in a long linear sequence of data with an unspecified noise distribution. We propose a computationally efficient method that provides a robust and near-optimal solution for segment identification over a wide range of noise distributions. We theoretically quantify the conditions for detecting the segment signals and show that the method near-optimally estimates the signal segments whenever it is possible to detect their existence. Simulation studies are carried out to demonstrate the efficiency of the method under different noise distributions. We present results from a CNV analysis of a HapMap Yoruban sample to further illustrate the theory and the methods.
    Journal of the Royal Statistical Society Series B (Statistical Methodology) 11/2012; 74(5):773-797. · 4.81 Impact Factor

Publication Stats

2k Citations
294.69 Total Impact Points


  • 2010–2014
    • University of Pennsylvania
      • Department of Biostatistics and Epidemiology
      • Center for Neurobiology and Behavior
      Philadelphia, Pennsylvania, United States
  • 2013
    • North Carolina State University
      • Department of Statistics
      Raleigh, NC, United States
    • The University of Hong Kong
      • Department of Statistics & Actuarial Science
      Hong Kong, Hong Kong
    • Universidade Federal de Goiás
      Goiânia, Goiás, Brazil
  • 2006–2013
    • Hospital of the University of Pennsylvania
      • Department of Biostatistics and Epidemiology
      Philadelphia, Pennsylvania, United States
  • 2011
    • Temple University
      • Department of Statistics
      Philadelphia, PA, United States
  • 2008
    • New Jersey Institute of Technology
      • Department of Computer Science
      Newark, NJ, United States
  • 2002–2005
    • University of California, Davis
      • Department of Statistics
      • School of Medicine
      Davis, CA, United States