Simon Tavaré

Cancer Research UK Cambridge Institute, Cambridge, England, United Kingdom

Publications (91) · 442.29 Total impact

  • Source
    ABSTRACT: Cancer genome sequencing studies have identified numerous driver genes, but the relative timing of mutations in carcinogenesis remains unclear. The gradual progression from premalignant Barrett's esophagus to esophageal adenocarcinoma (EAC) provides an ideal model to study the ordering of somatic mutations. We identified recurrently mutated genes and assessed clonal structure using whole-genome sequencing and amplicon resequencing of 112 EACs. We next screened a cohort of 109 biopsies from 2 key transition points in the development of malignancy: benign metaplastic never-dysplastic Barrett's esophagus (NDBE; n=66) and high-grade dysplasia (HGD; n=43). Unexpectedly, the majority of recurrently mutated genes in EAC were also mutated in NDBE. Only TP53 and SMAD4 mutations occurred in a stage-specific manner, confined to HGD and EAC, respectively. Finally, we applied this knowledge to identify high-risk Barrett's esophagus in a new non-endoscopic test. In conclusion, mutations in EAC driver genes generally occur exceptionally early in disease development with profound implications for diagnostic and therapeutic strategies.
    Full-text · Article · Jun 2014 · Nature Genetics
  • ABSTRACT: A series of clonal expansions are thought to underlie the progression of Barrett's oesophagus (BE) to oesophageal adenocarcinoma (OAC). Each expansion carries with it somatic driver mutation(s) fixing it within a larger population and therefore increasing the likelihood of acquiring a second mutation. However, the precise order in which somatic variants occur remains unknown.
    No preview · Article · Jun 2014 · Gut
  • ABSTRACT: In metabolomics the goal is to identify and measure the concentrations of different metabolites (small molecules) in a cell or a biological system. The metabolites form an important layer in the complex metabolic network, and the interactions between different metabolites are often of interest. It is crucial to perform proper normalization of metabolomics data, but current methods may not be applicable when estimating interactions in the form of correlations between metabolites. We propose a normalization approach based on a mixed model, with simultaneous estimation of a correlation matrix. We also investigate how the common use of a calibration standard in NMR experiments affects the estimation of correlations. We show with both real and simulated data that our proposed normalization method is robust and has good performance when discovering true correlations between metabolites. The standardization of NMR data is shown in simulation studies to affect our ability to discover true correlations to a small extent. However, comparing standardized and non-standardized real data does not result in any large differences in correlation estimates. Source code is freely available at CONTACT: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    No preview · Article · Apr 2014 · Bioinformatics
  • Source
    Ernest Turro · William J Astle · Simon Tavaré
    ABSTRACT: Most methods for estimating differential expression from RNA-seq are based on statistics that compare normalised read counts between treatment classes. Unfortunately, reads are in general too short to be mapped unambiguously to features of interest, such as genes, isoforms or haplotype-specific isoforms. There are methods for estimating expression levels that account for this source of ambiguity. However, the uncertainty is not generally accounted for in downstream analysis of gene expression experiments. Moreover, at the individual transcript level, it can sometimes be too large to allow useful comparisons between treatment groups. In this paper we make two proposals that improve the power, specificity and versatility of expression analysis using RNA-seq data. Firstly, we present a Bayesian method for model selection that accounts for read mapping ambiguities using random effects. This polytomous model selection approach can be used to identify many interesting patterns of gene expression and is not confined to detecting differential expression between two groups. For illustration, we use our method to detect imprinting, different types of regulatory divergence in cis and in trans, and differential isoform usage, but many other applications are possible. Secondly, we present a novel collapsing algorithm for grouping transcripts into inferential units that exploits the posterior correlation between transcript expression levels. The aggregate expression levels of these units can be estimated with useful levels of uncertainty. Our algorithm can improve the precision of expression estimates when uncertainty is large with only a small reduction in biological resolution. We have implemented our software in the mmdiff and mmcollapse multi-threaded C++ programs as part of the open-source MMSEQ package, available on
    Preview · Article · Nov 2013 · Bioinformatics
  • Source
    ABSTRACT: Although histopathological diagnosis is essential in decision of therapeutic strategy for gliomas, sometimes the tumors diagnosed in one histological entity show thoroughly different clinical courses. This phenomenon is believed to be due primarily to the presence of the genetic subgroup. In fact, relationship between treatment response and certain genetic characteristics is indicated (e.g. better chemosensitivity in glioma with losses of 1p/19q (−1p/19q)). It is highly likely that genetic classification of glioma is useful to select the adjuvant treatment. Additionally, gain of 7q (+7q) and −1p/19q are early events in 2 distinct tumor lineages, astrocytic tumors and oligodendroglial tumors, respectively, and these tumors obtain additional genetic aberration (−9p, 10q) with tumor progression. On the other hand, concerning the tumors without +7q or −1p/19q, little is known about clinically important genetic aberration. Therefore the study on such tumors could provide useful information for the prognosis prediction and the determination of treatment strategy. METHODS: We selected 39 cases of gliomas without +7q or −1p/19q from 200 adult supratentorial glioma cases surgically treated and analyzed chromosomal DNA copy number aberrations (CNAs) by comparative genomic hybridization (CGH) from 2005 to 2012. We correlated clinical features of these tumors with histological characteristics, CNAs and IDH1 status. RESULTS: The clinical course of gliomas without +7q or −1p/19q was not correlated with additional genetic aberration of -9p or 10q, which have been known as genetic markers for poor prognosis, and absence of +7q or −1p/19q was maintained at the time of recurrence. The tumors without +7q or −1p/19q showed relatively favorable prognosis although mutation of IDH1 was infrequent in these tumors (35.8 %). CONCLUSION: The gliomas without +7q or −1p/19q have clinical features distinct from the +7q and −1p/19q gliomas. 
    Prognostic markers for each subgroup could help establish therapeutic strategy against the tumor.
    Full-text · Article · Nov 2013 · Neuro-Oncology
  • Source
    ABSTRACT: Lineage-tracing approaches, widely used to characterize stem cell populations, rely on the specificity and stability of individual markers for accurate results. We present a method in which genetic labeling in the intestinal epithelium is acquired as a mutation-induced clonal mark during DNA replication. By determining the rate of mutation in vivo and combining this data with the known neutral-drift dynamics that describe intestinal stem cell replacement, we quantify the number of functional stem cells in crypts and adenomas. Contrary to previous reports, we find that significantly lower numbers of "working" stem cells are present in the intestinal epithelium (five to seven per crypt) and in adenomas (nine per gland), and that those stem cells are also replaced at a significantly lower rate. These findings suggest that the bulk of tumor stem cell divisions serve only to replace stem cell loss, with rare clonal victors driving gland repopulation and tumor growth.
    Preview · Article · Sep 2013 · Cell stem cell
  • Source
    ABSTRACT: Dynamic activity of signaling pathways, such as Notch, is vital to achieve correct development and homeostasis. However, most studies assess output many hours or days after initiation of signaling, once the outcome has been consolidated. Here we analyze genome-wide changes in transcript levels, binding of the Notch pathway transcription factor, CSL [Suppressor of Hairless, Su(H), in Drosophila], and RNA Polymerase II (Pol II) immediately following a short pulse of Notch stimulation. A total of 154 genes showed significant differential expression (DE) over time, and their expression profiles stratified into 14 clusters based on the timing, magnitude, and direction of DE. E(spl) genes were the most rapidly upregulated, with Su(H), Pol II, and transcript levels increasing within 5-10 minutes. Other genes had a more delayed response, the timing of which was largely unaffected by more prolonged Notch activation. Neither Su(H) binding nor poised Pol II could fully explain the differences between profiles. Instead, our data indicate that regulatory interactions, driven by the early-responding E(spl)bHLH genes, are required. Proposed cross-regulatory relationships were validated in vivo and in cell culture, supporting the view that feed-forward repression by E(spl)bHLH/Hes shapes the response of late-responding genes. Based on these data, we propose a model in which Hes genes are responsible for co-ordinating the Notch response of a wide spectrum of other targets, explaining the critical functions these key regulators play in many developmental and disease contexts.
    Full-text · Article · Jan 2013 · PLoS Genetics
  • Source

    Full-text · Dataset · Dec 2012
  • Source
    ABSTRACT: To identify novel dynamic patterns of gene expression, we develop a statistical method to cluster noisy measurements of gene expression collected from multiple replicates at multiple time points, with an unknown number of clusters. We propose a random-effects mixture model coupled with a Dirichlet-process prior for clustering. The mixture model formulation allows for probabilistic cluster assignments. The random-effects formulation allows for attributing the total variability in the data to the sources that are consistent with the experimental design, particularly when the noise level is high and the temporal dependence is not strong. The Dirichlet-process prior induces a prior distribution on partitions and helps to estimate the number of clusters (or mixture components) from the data. We further tackle two challenges associated with Dirichlet-process prior-based methods. One is efficient sampling. We develop a novel Metropolis-Hastings Markov Chain Monte Carlo (MCMC) procedure to sample the partitions. The other is efficient use of the MCMC samples in forming clusters. We propose a two-step procedure for posterior inference, which involves resampling and relabeling, to estimate the posterior allocation probability matrix. This matrix can be directly used in cluster assignments, while describing the uncertainty in clustering. We demonstrate the effectiveness of our model and sampling procedure through simulated data. Applying our method to a real data set collected from Drosophila adult muscle cells after five-minute Notch activation, we identify 14 clusters of different transcriptional responses among 163 differentially expressed genes, which provides several novel insights into underlying transcriptional mechanisms in the Notch signaling pathway. The algorithm developed here is implemented in the R package DIRECT.
    Full-text · Article · Oct 2012 · The Annals of Applied Statistics
  • ABSTRACT: Introduction: The Interferons (IFNs) are a family of pleiotropic cytokines that mediate anti-microbial, anti-proliferative, anti-tumour, immuno-modulatory and homoeostatic host defence functions. Differential regulation of Interferon Stimulated Genes (ISGs) by IFNs, via multiple signalling pathways and the combinatorial effects of IFN-induced transcription factors, co-regulators, ncRNA and epigenetic modifications underlie these diverse biological functions. Methods: We have built InterferonScape – a software platform for data-integration, mining and visualization to study IFN mediated ISG regulation at a systems level. This is a centralized systems-immunology resource that integrates ISG datasets into a data warehouse with a multidimensional data analysis capability. Statistical and computational data-mining methods combined with easy to use graphical user interfaces are included to provide custom-built data-visualization capability. This resource is provided as a web 2.0 rich internet application with novel embedded data-mining, search and semantic web technologies. Results: Using integrative data-mining approaches we previously identified approximately 1800 ISGs in the human genome [1]. Very few of these ISGs have been studied systematically, and information on ISG pathways is virtually absent from both public and commercial pathway repositories. We have utilised computational pathway modelling approaches, coupled to natural language text mining and expert manual curation to build a reliable collection of ISG pathways and networks. By integrating both in-house generated and publicly-available gene expression, regulatory genomic and epigenetic datasets in different cell and tissue types and in both normal and disease conditions, we have identified the regulatory elements and promoter modules involved in the regulation of ISG networks. Conclusion: Using integrative genomic and computational biology approaches we will demonstrate the evolutionary conservation, information flow, connectivity, network topology and regulatory interactions underlying these ISG networks. Understanding the propagation of ISG networks in different conditions will provide a greater insight into IFN biology in health and disease.
    No preview · Article · Sep 2012 · Cytokine
  • Source
    ABSTRACT: The elucidation of breast cancer subgroups and their molecular drivers requires integrated views of the genome and transcriptome from representative numbers of patients. We present an integrated analysis of copy number and gene expression in a discovery and validation set of 997 and 995 primary breast tumours, respectively, with long-term clinical follow-up. Inherited variants (copy number variants and single nucleotide polymorphisms) and acquired somatic copy number aberrations (CNAs) were associated with expression in ~40% of genes, with the landscape dominated by cis- and trans-acting CNAs. By delineating expression outlier genes driven in cis by CNAs, we identified putative cancer genes, including deletions in PPP2R2A, MTAP and MAP2K4. Unsupervised analysis of paired DNA–RNA profiles revealed novel subgroups with distinct clinical outcomes, which reproduced in the validation cohort. These include a high-risk, oestrogen-receptor-positive 11q13/14 cis-acting subgroup and a favourable prognosis subgroup devoid of CNAs. Trans-acting aberration hotspots were found to modulate subgroup-specific gene networks, including a TCR deletion-mediated adaptive immune response in the ‘CNA-devoid’ subgroup and a basal-specific chromosome 5 deletion-associated mitotic network. Our results provide a novel molecular stratification of the breast cancer population, derived from the impact of somatic CNAs on the transcriptome.
    Full-text · Article · Apr 2012 · Nature
  • Source
    Richard Wilkinson · Simon Tavaré
    ABSTRACT: ABC for ancestral inference
    Preview · Article · May 2011 · Nature Precedings
  • Source
    Andrea Sottoriva · Louis Vermeulen · Simon Tavaré
    ABSTRACT: The cancer stem cell (CSC) concept is a highly debated topic in cancer research. While experimental evidence in favor of the cancer stem cell theory is apparently abundant, the results are often criticized as being difficult to interpret. An important reason for this is that most experimental data that support this model rely on transplantation studies. In this study we use a novel cellular Potts model to elucidate the dynamics of established malignancies that are driven by a small subset of CSCs. Our results demonstrate that epigenetic mutations that occur during mitosis display highly altered dynamics in CSC-driven malignancies compared to a classical, non-hierarchical model of growth. In particular, the heterogeneity observed in CSC-driven tumors is considerably higher. We speculate that this feature could be used in combination with epigenetic (methylation) sequencing studies of human malignancies to prove or refute the CSC hypothesis in established tumors without the need for transplantation. Moreover our tumor growth simulations indicate that CSC-driven tumors display evolutionary features that can be considered beneficial during tumor progression. Besides an increased heterogeneity they also exhibit properties that allow the escape of clones from local fitness peaks. This leads to more aggressive phenotypes in the long run and makes the neoplasm more adaptable to stringent selective forces such as cancer treatment. Indeed when therapy is applied the clone landscape of the regrown tumor is more aggressive with respect to the primary tumor, whereas the classical model demonstrated similar patterns before and after therapy. Understanding these often counter-intuitive fundamental properties of (non-)hierarchically organized malignancies is a crucial step in validating the CSC concept as well as providing insight into the therapeutical consequences of this model.
    Full-text · Article · May 2011 · PLoS Computational Biology
  • Source
    Doug Speed · Simon Tavaré
    ABSTRACT: This paper presents Sparse Partitioning, a Bayesian method for identifying predictors that either individually or in combination with others affect a response variable. The method is designed for regression problems involving binary or tertiary predictors and allows the number of predictors to exceed the size of the sample, two properties which make it well suited for association studies. Sparse Partitioning differs from other regression methods by placing no restrictions on how the predictors may influence the response. To compensate for this generality, Sparse Partitioning implements a novel way of exploring the model space. It searches for high posterior probability partitions of the predictor set, where each partition defines groups of predictors that jointly influence the response. The result is a robust method that requires no prior knowledge of the true predictor--response relationship. Testing on simulated data suggests Sparse Partitioning will typically match the performance of an existing method on a data set which obeys the existing method's model assumptions. When these assumptions are violated, Sparse Partitioning will generally offer superior performance.
    Preview · Article · Jan 2011 · The Annals of Applied Statistics
  • Source
    ABSTRACT: Copy number abnormalities (CNAs) represent an important type of genetic mutation that can lead to abnormal cell growth and proliferation. New high-throughput sequencing technologies promise comprehensive characterization of CNAs. In contrast to microarrays, where probe design follows a carefully developed protocol, reads represent a random sample from a library and may be prone to representation biases due to GC content and other factors. The discrimination between true and false positive CNAs becomes an important issue. We present a novel approach, called CNAseg, to identify CNAs from second-generation sequencing data. It uses depth of coverage to estimate copy number states and flowcell-to-flowcell variability in cancer and normal samples to control the false positive rate. We tested the method using the COLO-829 melanoma cell line sequenced to 40-fold coverage. An extensive simulation scheme was developed to recreate different scenarios of copy number changes and depth of coverage by altering a real dataset with spiked-in CNAs. Comparison to alternative approaches using both real and simulated datasets showed that CNAseg achieves superior precision and improved sensitivity estimates. The CNAseg package and test data are available at
    Full-text · Article · Oct 2010 · Bioinformatics

  • No preview · Article · Jun 2010 · EJC Supplements
  • Source
    Sergii Ivakhno · Simon Tavaré
    ABSTRACT: The current generation of single nucleotide polymorphism (SNP) arrays allows measurement of copy number aberrations (CNAs) in cancer at more than one million locations in the genome in hundreds of tumour samples. Most research has focused on single-sample CNA discovery, the so-called segmentation problem. The availability of high-density, large sample-size SNP array datasets makes the identification of recurrent copy number changes in cancer, an important issue that can be addressed using the cross-sample information. We present a novel approach for finding regions of recurrent copy number aberrations, called CNAnova, from Affymetrix SNP 6.0 array data. The method derives its statistical properties from a control dataset composed of normal samples and, in contrast to previous methods, does not require segmentation and permutation steps. For rigorous testing of the algorithm and comparison to existing methods, we developed a simulation scheme that uses the noise distribution present in Affymetrix arrays. Application of the method to 128 acute lymphoblastic leukaemia samples shows that CNAnova achieves lower error rate than a popular alternative approach. We also describe an extension of the CNAnova framework to identify recurrent CNA regions with intra-tumour heterogeneity, present in either primary or relapsed samples from the same patients. The CNAnova package and synthetic datasets are available at
    Preview · Article · Jun 2010 · Bioinformatics
  • Source
    A. D. Barbour · Simon Tavaré
    ABSTRACT: The dynamics of tumour evolution are not well understood. In this paper we provide a statistical framework for evaluating the molecular variation observed in different parts of a colorectal tumour. A multi-sample version of the Ewens Sampling Formula forms the basis for our modelling of the data, and we provide a simulation procedure for use in obtaining reference distributions for the statistics of interest. We also describe the large-sample asymptotics of the joint distributions of the variation observed in different parts of the tumour. While actual data should be evaluated with reference to the simulation procedure, the asymptotics serve to provide theoretical guidelines, for instance with reference to the choice of possible statistics. Comment: 22 pages, 1 figure. Chapter 4 of "Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman" (Editors N.H. Bingham and C.M. Goldie), Cambridge University Press, 2010
    Preview · Article · Apr 2010
  • Source
    Andrea Sottoriva · Simon Tavaré

    Full-text · Conference Paper · Jan 2010
  • Source
    ABSTRACT: The amplification of millions of single molecules in parallel can be performed on microscopic magnetic beads that are contained in aqueous compartments of an oil-buffer emulsion. These bead-emulsion amplification (BEA) reactions result in beads that are covered by almost-identical copies derived from a single template. The post-amplification analysis is performed using different fluorophore-labeled probes. We have identified BEA reaction conditions that efficiently produce longer amplicons of up to 450 base pairs. These conditions include the use of a Titanium Taq amplification system. Second, we explored alternate fluorophores coupled to probes for post-PCR DNA analysis. We demonstrate that four different Alexa fluorophores can be used simultaneously with extremely low crosstalk. Finally, we developed an allele-specific extension chemistry that is based on Alexa dyes to query individual nucleotides of the amplified material that is both highly efficient and specific.
    Full-text · Article · Aug 2009 · Analytical Chemistry

Publication Stats

4k Citations
442.29 Total Impact Points


  • 2008-2014
    • Cancer Research UK Cambridge Institute
      Cambridge, England, United Kingdom
  • 2005-2014
    • University of Cambridge
      • Department of Applied Mathematics and Theoretical Physics
      • Department of Oncology
      Cambridge, England, United Kingdom
  • 1992-2006
    • University of Southern California
      • Department of Biological Sciences
      • Division of Molecular and Computational Biology
      • Department of Pathology
      • Department of Mathematics
      Los Angeles, California, United States
  • 1999
    • University of Oxford
      Oxford, England, United Kingdom
  • 1992-1999
    • University of California, Los Angeles
      • Department of Mathematics
      Los Angeles, California, United States
  • 1994
    • Monash University (Australia)
      Melbourne, Victoria, Australia
  • 1981-1989
    • University of Utah
      • Department of Mathematics
      Salt Lake City, Utah, United States
  • 1987
    • University College London
      • Department of Statistical Science
      London, England, United Kingdom
  • 1982-1984
    • Colorado State University
      • Department of Statistics
      Fort Collins, Colorado, United States
  • 1981-1983
    • Stanford University
      • Department of Mathematics
      Palo Alto, California, United States