Hongzhe Li

University of Pennsylvania, Philadelphia, Pennsylvania, United States

Publications (100) · 663.3 Total impact

  • ABSTRACT: Genome-wide association studies (GWASs) have identified hundreds of susceptibility genes, including shared associations across clinically distinct autoimmune diseases. We performed an inverse χ(2) meta-analysis across ten pediatric-age-of-onset autoimmune diseases (pAIDs) in a case-control study including more than 6,035 cases and 10,718 shared population-based controls. We identified 27 genome-wide significant loci associated with one or more pAIDs, mapping to in silico-replicated autoimmune-associated genes (including IL2RA) and new candidate loci with established immunoregulatory functions such as ADGRL2, TENM3, ANKRD30A, ADCY7 and CD40LG. The pAID-associated single-nucleotide polymorphisms (SNPs) were functionally enriched for deoxyribonuclease (DNase)-hypersensitivity sites, expression quantitative trait loci (eQTLs), microRNA (miRNA)-binding sites and coding variants. We also identified biologically correlated, pAID-associated candidate gene sets on the basis of immune cell expression profiling and found evidence of genetic sharing. Network and protein-interaction analyses demonstrated converging roles for the signaling pathways of type 1, 2 and 17 helper T cells (TH1, TH2 and TH17), JAK-STAT, interferon and interleukin in multiple autoimmune diseases.
    Nature medicine 08/2015; DOI:10.1038/nm.3933 · 28.05 Impact Factor
  • ABSTRACT: Therapeutic targets in pediatric Crohn's disease include symptoms, quality of life (QOL), and mucosal healing. Although partial enteral nutrition (PEN), exclusive enteral nutrition (EEN), and anti-tumor necrosis factor alpha (anti-TNF) therapy all improve symptoms, the comparative effectiveness of these approaches to improve QOL and achieve mucosal healing has not been assessed prospectively. In a prospective study of children initiating PEN, EEN, or anti-TNF therapy for Crohn's disease, we compared clinical outcomes using the Pediatric Crohn's Disease Activity Index (PCDAI), QOL (IMPACT score), and mucosal healing as estimated by fecal calprotectin (FCP). PCDAI, IMPACT, FCP, and diet (prompted 24-h recall) were measured at baseline and after 8 weeks of therapy. We enrolled 90 children with active Crohn's disease (PCDAI, 33.7 ± 13.7; and FCP, 976 ± 754), of whom 52 were treated with anti-TNF, 22 with EEN, and 16 with PEN plus ad lib diet. Clinical response (PCDAI reduction ≥15 or final PCDAI ≤10) was achieved by 64% on PEN, 88% EEN, and 84% anti-TNF (test for trend P = 0.08). FCP ≤250 μg/g was achieved with PEN in 14%, EEN 45%, and anti-TNF 62% (test for trend P = 0.001). Improvement in overall QOL was not statistically significantly different between the 3 groups (P = 0.86). However, QOL improvement was the greatest with EEN in the body image (P = 0.03) domain and with anti-TNF in the emotional domain (P = 0.04). Although PEN improved clinical symptoms, EEN and anti-TNF were more effective for decreasing mucosal inflammation and improving specific aspects of QOL.
    Inflammatory Bowel Diseases 05/2015; Publish Ahead of Print. DOI:10.1097/MIB.0000000000000426 · 5.48 Impact Factor
  • ABSTRACT: HSV is a large double-stranded DNA virus, capable of causing a variety of diseases from the common cold sore to devastating encephalitis. Although DNA within the HSV virion does not contain any histone protein, within 1 h of infecting a cell and entering its nucleus the viral genome acquires some histone protein (nucleosomes). During lytic infection, partial micrococcal nuclease (MNase) digestion does not give the classic ladder band pattern, seen on digestion of cell DNA or latent viral DNA. However, complete digestion does give a mono-nucleosome band, strongly suggesting that there are some nucleosomes present on the viral genome during the lytic infection, but that they are not evenly positioned, with a 200bp repeat pattern, like cell DNA. Where then are the nucleosomes positioned? Here we perform HSV-1 genome wide nucleosome mapping, at a time when viral replication is in full swing (6hr PI), using a microarray consisting of 50mer oligonucleotides, covering the whole viral genome (152kb). Arrays were probed with MNase-protected fragments of DNA from infected cells. Cells were not treated with crosslinking agents, thus we are only mapping tightly bound nucleosomes. The data show that nucleosome deposition is not random. The distribution of signal on the arrays suggests that nucleosomes are located at preferred positions on the genome, and that there are some positions that are not occupied (nucleosome-free regions, NFR, or nucleosome-depleted regions, NDR), or occupied at a frequency below our limit of detection in the population of genomes. Occupancy of only a fraction of the possible sites may explain the lack of a typical MNase partial digestion band ladder pattern for HSV DNA during lytic infection. On average, DNA encoding Immediate Early (IE), Early (E) and Late (L) genes appears to have a similar density of nucleosomes.
    PLoS ONE 02/2015; 10(2):e0117471. DOI:10.1371/journal.pone.0117471 · 3.23 Impact Factor
  • ABSTRACT: Purpose: Chemokines are implicated in T cell trafficking. We mapped the chemokine landscape in advanced stage ovarian cancer and characterized the expression of cognate receptors in autologous DC-vaccine primed T cells in the context of cell-based immunotherapy. Experimental design: The expression of all known human chemokines in patients with primary ovarian cancer was analyzed on two independent microarray datasets and validated on tissue microarray. Peripheral blood T cells from five HLA-A2 patients with recurrent ovarian cancer, who previously received autologous tumor DC vaccine, underwent CD3/CD28 costimulation and expansion ex vivo. Tumor-specific T cells were identified by HER2/neu pentamer staining and were evaluated for the expression and functionality of chemokine receptors important for homing to ovarian cancer. Results: The chemokine landscape of ovarian cancer is heterogeneous with high expression of known lymphocyte-recruiting chemokines (CCL2, CCL4 and CCL5) in tumors with intraepithelial T cells, whereas CXCL10, CXCL12 and CXCL16 are expressed quasi-universally, including in tumors lacking tumor infiltrating T cells. DC-vaccine primed T cells were found to express the cognate receptors for the above chemokines. Ex vivo CD3/CD28 costimulation and expansion of vaccine-primed T cells upregulated CXCR3 and CXCR4, and enhanced their migration toward universally expressed chemokines in ovarian cancer. Conclusions: DC-primed tumor specific T cells are armed with the appropriate receptors to migrate towards universal ovarian cancer chemokines, and these receptors are further upregulated by ex vivo CD3/CD28 costimulation, which renders T cells more fit for migrating towards these chemokines.
    Clinical Cancer Research 02/2015; 21(12). DOI:10.1158/1078-0432.CCR-14-2777 · 8.19 Impact Factor
  • Sihai Dave Zhao · T. Tony Cai · Hongzhe Li
    ABSTRACT: This paper studies the problem of detecting dependence between two mixture distributions, motivated by questions arising from statistical genomics. The fundamental limits of detecting weak positive dependence are derived and an oracle test statistic is proposed. It is shown that for mixture distributions whose components are stochastically ordered, the oracle test statistic is asymptotically optimal. Connections are drawn between dependency detection and signal detection, where the goal of the latter is to detect the presence of non-null components in a single mixture distribution. It is shown that the oracle test for dependency can also be used as a signal detection procedure in the two-sample setting, and there can achieve detection even when detection using each sample separately is provably impossible. A nonparametric data-adaptive test statistic is then proposed, and its closed-form asymptotic distribution under the null hypothesis of independence is established. Simulations show that the adaptive procedure performs as well as the oracle test statistic, and that both can be more powerful than existing methods. In an application to the analysis of the shared genetic basis of psychiatric disorders, the adaptive test is able to detect genetic relationships not detected by other procedures.
  • ABSTRACT: The consumption of an agrarian diet is associated with a reduced risk for many diseases associated with a 'Westernised' lifestyle. Studies suggest that diet affects the gut microbiota, which subsequently influences the metabolome, thereby connecting diet, microbiota and health. However, the degree to which diet influences the composition of the gut microbiota is controversial. Murine models and studies comparing the gut microbiota in humans residing in agrarian versus Western societies suggest that the influence is large. To separate global environmental influences from dietary influences, we characterised the gut microbiota and the host metabolome of individuals consuming an agrarian diet in Western society. Using 16S rRNA-tagged sequencing as well as plasma and urinary metabolomic platforms, we compared measures of dietary intake, gut microbiota composition and the plasma metabolome between healthy human vegans and omnivores, sampled in an urban USA environment. Plasma metabolome of vegans differed markedly from omnivores but the gut microbiota was surprisingly similar. Unlike prior studies of individuals living in agrarian societies, higher consumption of fermentable substrate in vegans was not associated with higher levels of faecal short chain fatty acids, a finding confirmed in a 10-day controlled feeding experiment. Similarly, the proportion of vegans capable of producing equol, a soy-based gut microbiota metabolite, was lower than that reported in Asian societies despite the high consumption of soy-based products. Evidently, residence in globally distinct societies helps determine the composition of the gut microbiota that, in turn, influences the production of diet-dependent gut microbial metabolites.
    Gut 11/2014; DOI:10.1136/gutjnl-2014-308209 · 13.32 Impact Factor
  • ABSTRACT: In a genome-wide survey on somatic copy-number alterations (SCNAs) of long noncoding RNA (lncRNA) in 2,394 tumor specimens from 12 cancer types, we found that about 21.8% of lncRNA genes were located in regions with focal SCNAs. By integrating bioinformatics analyses of lncRNA SCNAs and expression with functional screening assays, we identified an oncogene, focally amplified lncRNA on chromosome 1 (FAL1), whose copy number and expression are correlated with outcomes in ovarian cancer. FAL1 associates with the epigenetic repressor BMI1 and regulates its stability in order to modulate the transcription of a number of genes including CDKN1A. The oncogenic activity of FAL1 is partially attributable to its repression of p21. FAL1-specific siRNAs significantly inhibit tumor growth in vivo.
    Cancer Cell 09/2014; 26(3):344-357. DOI:10.1016/j.ccr.2014.07.009 · 23.89 Impact Factor
  • ABSTRACT: BACKGROUND & AIMS: The gut microbiota is a complex and densely populated community in a dynamic environment determined by host physiology. We investigated how intestinal oxygen levels affect the composition of the fecal and mucosally adherent microbiota. METHODS: We used the phosphorescence quenching method and a specially designed intraluminal oxygen probe to dynamically quantify gut luminal oxygen levels in mice. 16S ribosomal RNA gene sequencing was used to characterize the microbiota in intestines of mice exposed to hyperbaric oxygen, human rectal biopsy and mucosal swab samples, and paired human stool samples. RESULTS: Average PO2 values in the lumen of the cecum were extremely low (< 1 mm Hg). In altering oxygenation of mouse intestines, we observed that oxygen diffused from intestinal tissue and established a radial gradient that extended from the tissue interface into the lumen. Increasing tissue oxygenation with hyperbaric oxygen altered the composition of the gut microbiota in mice. In human beings, 16S ribosomal RNA gene analyses showed an increased proportion of oxygen-tolerant organisms of the Proteobacteria and Actinobacteria phyla associated with rectal mucosa, compared with feces. A consortium of asaccharolytic bacteria of the Firmicutes and Bacteroidetes phyla, which primarily metabolize peptones and amino acids, was associated primarily with mucus. This could be owing to the presence of proteinaceous substrates provided by mucus and the shedding of the intestinal epithelium. CONCLUSIONS: In an analysis of intestinal microbiota of mice and human beings, we observed a radial gradient of microbes linked to the distribution of oxygen and nutrients provided by host tissue.
    Gastroenterology 07/2014; 147(5). DOI:10.1053/j.gastro.2014.07.020 · 13.93 Impact Factor
  • Sihai D Zhao · T Tony Cai · Hongzhe Li
    ABSTRACT: Integrative genomics offers a promising approach to more powerful genetic association studies. The hope is that combining outcome and genotype data with other types of genomic information can lead to more powerful SNP detection. We present a new association test based on a statistical model that explicitly assumes that genetic variations affect the outcome through perturbing gene expression levels. It is shown analytically that the proposed approach can have more power to detect SNPs that are associated with the outcome through transcriptional regulation, compared to tests using the outcome and genotype data alone, and simulations show that our method is relatively robust to misspecification. We also provide a strategy for applying our approach to high-dimensional genomic data. We use this strategy to identify a potentially new association between a SNP and a yeast cell's response to the natural product tomatidine, which standard association analysis did not detect.
    Biometrics 06/2014; 70(4). DOI:10.1111/biom.12206 · 1.52 Impact Factor
  • Sihai Dave Zhao · T. Tony Cai · Hongzhe Li
    ABSTRACT: It is often of interest to understand how the structure of a genetic network differs between two conditions. In this paper, each condition-specific network is modelled using the precision matrix of a multivariate normal random vector, and a method is proposed to directly estimate the difference of the precision matrices. In contrast to other approaches, such as separate or joint estimation of the individual matrices, direct estimation does not require those matrices to be sparse, and thus can allow the individual networks to contain hub nodes. Under the assumption that the true differential network is sparse, the direct estimator is shown to be consistent in support recovery and estimation. It is also shown to outperform existing methods in simulations, and its properties are illustrated on gene expression data from late-stage ovarian cancer patients.
    Biometrika 06/2014; 2(2). DOI:10.1093/biomet/asu009 · 1.51 Impact Factor
  • Gastroenterology 05/2014; 146(5):S-347. DOI:10.1016/S0016-5085(14)61252-X · 13.93 Impact Factor
  • Wei Wang · Zhi Wei · Hongzhe Li
    ABSTRACT: Next-generation RNA sequencing offers an opportunity to investigate the transcriptome on an unprecedented scale. Recent studies have revealed widespread alternative polyadenylation (APA) in eukaryotes, leading to various mRNA isoforms differing in their 3'UTR, through which the stability, localization and translation of mRNA can be regulated. However, very few, if any, methods and tools are available for directly analyzing this special alternative RNA processing event. Conventional methods rely on annotation of polyadenylation sites; yet, such knowledge remains incomplete, and identification of polyA sites is still challenging. The goal of this article is to develop methods for detecting 3'UTR switching without any prior knowledge of polyA annotations. We propose a change-point model based on a likelihood ratio test for detecting 3'UTR switching. We develop a directional testing procedure for identifying dramatic shortening or lengthening events in 3'UTR, while controlling mixed directional FDR at a nominal level. To our knowledge, this is the first approach to analyze 3'UTR switching directly without relying on any polyA annotations. Simulation studies and applications to two real datasets reveal that our proposed method is powerful, accurate and feasible for the analysis of next-generation RNA sequencing data. The proposed method will fill a void among alternative RNA processing analysis tools for transcriptome studies. It can help to obtain additional insights from RNA sequencing data by understanding gene regulation mechanisms through the analysis of 3'UTR switching. The software is implemented in Java and can be freely downloaded from http://utr.sourceforge.net/.
    Bioinformatics 04/2014; 30(15). DOI:10.1093/bioinformatics/btu189 · 4.62 Impact Factor
  • ABSTRACT: Copy number variants (CNVs) constitute an important class of genetic variants in the human genome and are shown to be associated with complex diseases. Whole-genome sequencing provides an unbiased way of identifying all the CNVs that an individual carries. In this paper, we consider parametric modeling of the read depth (RD) data from whole-genome sequencing with the aim of identifying the CNVs, including both Poisson and negative-binomial modeling of such count data. We propose a unified approach of using a mean-matching variance stabilizing transformation to turn the relatively complicated problem of sparse segment identification for count data into a sparse segment identification problem for a sequence of Gaussian data. We apply the optimal sparse segment identification procedure to the transformed data in order to identify the CNV segments. This provides a computationally efficient approach for RD-based CNV identification. Simulation results show that this approach often results in a small number of false identifications of the CNVs and has similar or better performance in identifying the true CNVs when compared with other RD-based approaches. We demonstrate the methods using the trio data from the 1000 Genomes Project.
    Biostatistics 01/2014; 15(3). DOI:10.1093/biostatistics/kxt060 · 2.24 Impact Factor
  • Hongzhe Li
    ABSTRACT: The development of novel high-throughput DNA sequencing methods has provided a powerful method for both mapping and quantifying transcriptomes. This method, termed RNA-seq (RNA sequencing), has advantages over microarray-based approaches in terms of wide dynamic range of expressions, less reliance on existing knowledge about genome sequence, and low background noise. After aligning the reads to the reference genomes, the first step of RNA-seq analysis is to infer relative transcript abundances. This can be done at the whole transcript level, at the isoform-specific relative abundance level assuming a known set of isoforms, and at the level where transcripts are identified and their abundances are quantified. We review these methods briefly and add some recent developments in dealing with non-uniform read distribution within a transcript. We focus on methods for simultaneous transcript discovery and quantification.
    Statistical Analysis of Next Generation Sequencing Data, 01/2014: pages 247-259; ISBN: 978-3-319-07211-1
  • ABSTRACT: Correctly estimating isoform-specific gene expression is important for understanding complicated biological mechanisms and for mapping disease susceptibility genes. However, estimating isoform-specific gene expression is challenging because various biases present in RNA-Seq (RNA sequencing) data complicate the analysis, and if not appropriately corrected, can affect isoform expression estimation and downstream analysis. In this article, we present PennSeq, a statistical method that allows each isoform to have its own non-uniform read distribution. Instead of making parametric assumptions, we give adequate weight to the underlying data by the use of a non-parametric approach. Our rationale is that regardless of what factors lead to non-uniformity, whether it is due to hexamer priming bias, local sequence bias, positional bias, RNA degradation, mapping bias or other unknown reasons, the probability that a fragment is sampled from a particular region will be reflected in the aligned data. This empirical approach thus maximally reflects the true underlying non-uniform read distribution. We evaluate the performance of PennSeq using both simulated data with known ground truth, and using two real Illumina RNA-Seq data sets including one with quantitative real time polymerase chain reaction measurements. Our results indicate superior performance of PennSeq over existing methods, particularly for isoforms demonstrating severe non-uniformity. PennSeq is freely available for download at http://sourceforge.net/projects/pennseq.
    Nucleic Acids Research 12/2013; 42(3). DOI:10.1093/nar/gkt1304 · 9.11 Impact Factor
  • ABSTRACT: BACKGROUND: The gut microbiota exists within a dynamic environment determined by host physiology. Examples include the effects of diet on bile acid secretion, the production of mucus by the intestinal epithelium, and the delivery of urea to the colon. The response of the gut microbiota to these factors helps to shape its structure and function, which may show spatial segregation. Here, we characterize factors that distinguish the fecal from the mucosally-adherent microbiota. Additionally, we provide support for the importance of an oxygen gradient in the gut lumen in the development of the dysbiosis seen in inflammatory bowel disease (IBD). METHODS: Partial pressure of oxygen (pO2) was measured using phosphorescence quenching. Excitation of a phosphorescent probe by light produces phosphorescence decay, whose rate is proportional to the concentration of oxygen in the environment. A water-soluble probe, Oxyphor G4, was administered IV for tissue pO2 measurements. To quantify luminal pO2 we used Oxyphor Micro, a water-insoluble probe. To characterize the microbial oxygen signature in human samples, we examined the oxygen preference of bacterial taxa found in biopsy specimens versus stool samples from a study previously published by our group (1). We also examined the bacterial taxa found in paired stool and rectal swab samples from 7 pediatric subjects. Microbiota composition was determined by 454 pyrosequencing of 16S rRNA genes analyzed by UniFrac. RESULTS: IV injection of Oxyphor G4 led to its distribution throughout the vasculature. By directing excitation at the intestinal wall, pO2 in intestinal tissue was measured. In contrast, ingested Oxyphor Micro remained in the feces, providing selective measurement of luminal pO2. The pO2 was much lower in the feces than in the tissue. By characterizing the oxygen preference of the bacterial genera in the paired stool and biopsy samples in our previously published dataset (1), we found that aerotolerant organisms were greatly enriched in biopsy samples relative to stool (P < 5.4e-05). An analysis, which included the results of the rectal swab study, revealed that the microbiota obtained by swab and biopsy were more similar to each other than to stool (Fig. 1). Additionally, the mucosally-adherent microbiota could be distinguished from that in stool by the presence of aerotolerant taxa belonging to the Proteobacteria and Actinobacteria phyla as well as asaccharolytic taxa from the Firmicutes and Bacteroidetes phyla. CONCLUSIONS: Using phosphorescence quenching, we confirm the oxygen-poor environment of the gut lumen. Using 16S rRNA gene sequencing, we reveal an oxygen gradient by demonstrating enrichment of aerotolerant organisms associated with the mucosa relative to the feces. Since the observed phyla contain taxa that are consistently found in the dysbiotic signature associated with IBD, this supports the hypothesis that dysbiosis is the result of the oxidative nature of the inflammatory response. By comparing rectal biopsies and swabs to feces, we also show that a consortium of asaccharolytic bacteria that primarily metabolize amino acids are mucosally-adherent, a possible consequence of the protein content of mucus and the shedding of the intestinal epithelium. These findings reveal a previously unappreciated complexity of the gut microbiota. 1. Wu GD, et al. Science 2011;334:105-8.
    Inflammatory Bowel Diseases 12/2013; 19:S12. DOI:10.1097/01.MIB.0000438567.10388.f5 · 5.48 Impact Factor
  • Hongzhe Li
    ABSTRACT: Systems biology approaches to epidemiological studies of complex diseases include collection of genetic, genomic, epigenomic, and metagenomic data in large-scale epidemiological studies of complex phenotypes. Designs and analyses of such studies raise many statistical challenges. This article reviews some issues related to integrative analysis of such high dimensional and inter-related datasets and outlines some possible solutions. I focus my review on integrative approaches for genome-wide genetic variants and gene expression data, methods for joint analysis of genetic and epigenetic variants, and methods for analysis of microbiome data. Statistical methods such as mediation analysis, high-dimensional instrumental variable regression, sparse signal recovery, and compositional data regression provide potential frameworks for integrative analysis of these high-dimensional genomic data.
    Wiley Interdisciplinary Reviews Systems Biology and Medicine 11/2013; 5(6). DOI:10.1002/wsbm.1242 · 3.01 Impact Factor
  • Fan Xia · Jun Chen · Wing Kam Fung · Hongzhe Li
    ABSTRACT: Changes in human microbiome are associated with many human diseases. Next generation sequencing technologies make it possible to quantify the microbial composition without the need for laboratory cultivation. One important problem of microbiome data analysis is to identify the environmental/biological covariates that are associated with different bacterial taxa. Taxa count data in microbiome studies are often over-dispersed and include many zeros. To account for such an over-dispersion, we propose to use an additive logistic normal multinomial regression model to associate the covariates to bacterial composition. The model can naturally account for sampling variabilities and zero observations and also allow for a flexible covariance structure among the bacterial taxa. In order to select the relevant covariates and to estimate the corresponding regression coefficients, we propose a group ℓ1 penalized likelihood estimation method for variable selection and estimation. We develop a Monte Carlo expectation-maximization algorithm to implement the penalized likelihood estimation. Our simulation results show that the proposed method outperforms the group ℓ1 penalized multinomial logistic regression and the Dirichlet multinomial regression models in variable selection. We demonstrate the methods using a data set that links human gut microbiome to micro-nutrients in order to identify the nutrients that are associated with the human gut microbiome enterotype.
    Biometrics 10/2013; 69(4). DOI:10.1111/biom.12079 · 1.52 Impact Factor
  • Wanlu Deng · Zhi Geng · Hongzhe Li
    ABSTRACT: Multivariate time series (MTS) data such as time course gene expression data in genomics are often collected to study the dynamic nature of the systems. These data provide important information about the causal dependency among a set of random variables. In this paper, we introduce a computationally efficient algorithm to learn directed acyclic graphs (DAGs) based on MTS data, focusing on learning the local structure of a given target variable. Our algorithm is based on learning all parents (P), all children (C) and some descendants (D) (PCD) iteratively, utilizing the time order of the variables to orient the edges. This time series PCD-PCD algorithm (tsPCD-PCD) extends the previous PCD-PCD algorithm to dependent observations and utilizes composite likelihood ratio tests (CLRTs) for testing the conditional independence. We present the asymptotic distribution of the CLRT statistic and show that the tsPCD-PCD is guaranteed to recover the true DAG structure when the faithfulness condition holds and the tests correctly reject the null hypotheses. Simulation studies show that the CLRTs are valid and perform well even when the sample sizes are small. In addition, the tsPCD-PCD algorithm outperforms the PCD-PCD algorithm in recovering the local graph structures. We illustrate the algorithm by analyzing a time course gene expression data set related to mouse T-cell activation.
    The Annals of Applied Statistics 09/2013; 7(3). DOI:10.1214/13-AOAS635 · 1.69 Impact Factor
  • ABSTRACT: Copy number variations (CNVs) are associated with many complex diseases. Next generation sequencing data enable one to identify precise CNV breakpoints to better understand the underlying molecular mechanisms and to design more efficient assays. Using the CIGAR strings of the reads, we develop a method that can identify the exact CNV breakpoints, and in cases when the breakpoints are in a repeated region, the method reports a range where the breakpoints can slide. Our method identifies the breakpoints of a CNV using both the positions and CIGAR strings of the reads that cover breakpoints of a CNV. A read with a long soft clipped part (denoted as S in CIGAR) at its 3' (right) end can be used to identify the 5' (left) side of the breakpoints, and a read with a long S part at the 5' end can be used to identify the breakpoint at the 3' side. To ensure both types of reads cover the same CNV, we require the overlapped common string to include both of the soft clipped parts. When a CNV starts and ends in the same repeated regions, its breakpoints are not unique, in which case our method reports the leftmost positions for the breakpoints and a range within which the breakpoints can be incremented without changing the variant sequence. We have implemented the methods in a C++ package intended for the current Illumina MiSeq and HiSeq platforms for both whole genome and exon-sequencing. Our simulation studies have shown that our method compares favorably with other similar methods in terms of true discovery rate, false positive rate and breakpoint accuracy. Our results from a real application have shown that the detected CNVs are consistent with zygosity and read depth information. The software package is available at http://statgene.med.upenn.edu/softprog.html.
    Frontiers in Genetics 08/2013; 4:157. DOI:10.3389/fgene.2013.00157

Publication Stats

4k Citations
663.30 Total Impact Points

Institutions

  • 2008–2015
    • University of Pennsylvania
      • Department of Biostatistics and Epidemiology
      Philadelphia, Pennsylvania, United States
  • 2006–2014
    • William Penn University
      Philadelphia, Pennsylvania, United States
  • 2012–2013
    • New Jersey Institute of Technology
      • Department of Computer Science
      Newark, New Jersey, United States
  • 2011
    • Temple University
      • Department of Statistics
      Philadelphia, PA, United States
    • University of Rochester
      • Department of Biostatistics and Computational Biology
      Rochester, New York, United States
  • 2009
    • The Children's Hospital of Philadelphia
      Philadelphia, Pennsylvania, United States
  • 2000–2005
    • University of California, Davis
      • School of Medicine
      Davis, California, United States