Simon Tavaré

University of Cambridge, Cambridge, England, United Kingdom

Publications (181)

  • D.J. Andrews · A.G. Lynch · S. Tavaré
    ABSTRACT: Methylation patterns present in a cell population can inform us about the way the cells are organized and how the population is sustained. Methylation is inheritable through cell divisions but changes can occur as a result of methylation replication errors. Hence, variation in methylation patterns in a cell population at a given time captures information about the history of the cell population. It is important that the observed methylation patterns are representative of those in the cell population. However, bisulfite sequencing may introduce new patterns and degradation may eliminate rare patterns. We investigate how bisulfite degradation may be expected to affect the data, and how inference could be made in light of this. A model for the data generation process makes it possible to estimate the starting number of distinct methylation patterns more accurately than simply counting the number of distinct patterns observed.
    Chapter · Mar 2016
  • Source
    ABSTRACT: Cancer genome sequencing studies have identified numerous driver genes, but the relative timing of mutations in carcinogenesis remains unclear. The gradual progression from premalignant Barrett's esophagus to esophageal adenocarcinoma (EAC) provides an ideal model to study the ordering of somatic mutations. We identified recurrently mutated genes and assessed clonal structure using whole-genome sequencing and amplicon resequencing of 112 EACs. We next screened a cohort of 109 biopsies from 2 key transition points in the development of malignancy: benign metaplastic never-dysplastic Barrett's esophagus (NDBE; n=66) and high-grade dysplasia (HGD; n=43). Unexpectedly, the majority of recurrently mutated genes in EAC were also mutated in NDBE. Only TP53 and SMAD4 mutations occurred in a stage-specific manner, confined to HGD and EAC, respectively. Finally, we applied this knowledge to identify high-risk Barrett's esophagus in a new non-endoscopic test. In conclusion, mutations in EAC driver genes generally occur exceptionally early in disease development with profound implications for diagnostic and therapeutic strategies.
    Full-text Article · Jun 2014 · Nature Genetics
  • ABSTRACT: Introduction: A series of clonal expansions is thought to underlie the progression of Barrett’s oesophagus (BE) to oesophageal adenocarcinoma (OAC). Each expansion carries with it somatic driver mutation(s), fixing it within a larger population and therefore increasing the likelihood of acquiring a second mutation. However, the precise order in which somatic variants occur remains unknown. Methods: We performed whole genome sequencing in 25 cases of OAC and 3 matched cases of BE. Findings were validated in a larger cohort of OACs (n = 90), metaplastic never-dysplastic BE (NDBE, n = 66 with a median follow-up of 58 months) and high-grade dysplasia (HGD, n = 43) using amplicon resequencing. Mutational signatures and gene-centric somatic mutations were determined using an in-house pipeline incorporating standard statistical methods and the publicly available EMu pipeline. Results: There were 7 distinct mutational signatures present in both early (BE) and late disease (OAC). Fifteen genes were determined to be potential novel drivers of OAC development. Surprisingly, in 53% of NDBE tissue samples we identified clonal expansion of cells (>10% mutant fraction) harbouring mutations in one or more of 13/15 of these putative driver genes. No difference in the frequency of mutation of these genes was observed between any of the disease stages studied. TP53 mutations clearly delineate between HGD/OAC and benign NDBE (p < 0.001), whilst SMAD4 mutations are only observed in OAC (p < 0.001), demonstrating for the first time a clear genetic difference between the two. Conclusion: Mutagenic processes active in OAC are also active in the earliest stages of BE. Recurrent driver mutations identified in cancer may be acquired very early in the disease and may provide little or no progression advantage. Molecular diagnostic approaches must account for this. Disclosure of Interest: None Declared.
    Article · Jun 2014 · Gut
  • Alexandra Jauhiainen · Basetti Madhu · Masako Narita · [...] · Simon Tavaré
    ABSTRACT: In metabolomics the goal is to identify and measure the concentrations of different metabolites (small molecules) in a cell or a biological system. The metabolites form an important layer in the complex metabolic network, and the interactions between different metabolites are often of interest. It is crucial to perform proper normalization of metabolomics data, but current methods may not be applicable when estimating interactions in the form of correlations between metabolites. We propose a normalization approach based on a mixed model, with simultaneous estimation of a correlation matrix. We also investigate how the common use of a calibration standard in NMR experiments affects the estimation of correlations. We show with both real and simulated data that our proposed normalization method is robust and has good performance when discovering true correlations between metabolites. The standardization of NMR data is shown in simulation studies to affect our ability to discover true correlations to a small extent. However, comparing standardized and non-standardized real data does not result in any large differences in correlation estimates. Source code is freely available at CONTACT: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    Article · Apr 2014 · Bioinformatics
  • Ernest Turro · William J Astle · Simon Tavaré
    ABSTRACT: Most methods for estimating differential expression from RNA-seq are based on statistics that compare normalised read counts between treatment classes. Unfortunately, reads are in general too short to be mapped unambiguously to features of interest, such as genes, isoforms or haplotype-specific isoforms. There are methods for estimating expression levels that account for this source of ambiguity. However, the uncertainty is not generally accounted for in downstream analysis of gene expression experiments. Moreover, at the individual transcript level, it can sometimes be too large to allow useful comparisons between treatment groups. In this paper we make two proposals that improve the power, specificity and versatility of expression analysis using RNA-seq data. Firstly, we present a Bayesian method for model selection that accounts for read mapping ambiguities using random effects. This polytomous model selection approach can be used to identify many interesting patterns of gene expression and is not confined to detecting differential expression between two groups. For illustration, we use our method to detect imprinting, different types of regulatory divergence in cis and in trans, and differential isoform usage, but many other applications are possible. Secondly, we present a novel collapsing algorithm for grouping transcripts into inferential units that exploits the posterior correlation between transcript expression levels. The aggregate expression levels of these units can be estimated with useful levels of uncertainty. Our algorithm can improve the precision of expression estimates when uncertainty is large with only a small reduction in biological resolution. We have implemented our software in the mmdiff and mmcollapse multi-threaded C++ programs as part of the open-source MMSEQ package, available on
    Article · Nov 2013 · Bioinformatics
  • Source
    ABSTRACT: Although histopathological diagnosis is essential for deciding the therapeutic strategy for gliomas, tumors diagnosed as the same histological entity sometimes show thoroughly different clinical courses. This phenomenon is believed to be due primarily to the presence of genetic subgroups. Indeed, a relationship between treatment response and certain genetic characteristics has been indicated (e.g. better chemosensitivity in gliomas with losses of 1p/19q (−1p/19q)). Genetic classification of glioma is therefore highly likely to be useful for selecting the adjuvant treatment. Additionally, gain of 7q (+7q) and −1p/19q are early events in 2 distinct tumor lineages, astrocytic tumors and oligodendroglial tumors, respectively, and these tumors acquire additional genetic aberrations (−9p, 10q) with tumor progression. On the other hand, little is known about clinically important genetic aberrations in tumors without +7q or −1p/19q. Therefore, the study of such tumors could provide useful information for prognosis prediction and the determination of treatment strategy. METHODS: We selected 39 cases of glioma without +7q or −1p/19q from 200 adult supratentorial glioma cases that were surgically treated and analyzed for chromosomal DNA copy number aberrations (CNAs) by comparative genomic hybridization (CGH) from 2005 to 2012. We correlated clinical features of these tumors with histological characteristics, CNAs and IDH1 status. RESULTS: The clinical course of gliomas without +7q or −1p/19q was not correlated with the additional genetic aberrations −9p or 10q, which are known genetic markers of poor prognosis, and the absence of +7q or −1p/19q was maintained at the time of recurrence. The tumors without +7q or −1p/19q showed a relatively favorable prognosis, although mutation of IDH1 was infrequent in these tumors (35.8%). CONCLUSION: Gliomas without +7q or −1p/19q have clinical features distinct from those of the +7q and −1p/19q gliomas. Prognostic markers for each subgroup could help establish a therapeutic strategy against the tumor.
    Full-text Article · Nov 2013 · Neuro-Oncology
  • Source
    ABSTRACT: Lineage-tracing approaches, widely used to characterize stem cell populations, rely on the specificity and stability of individual markers for accurate results. We present a method in which genetic labeling in the intestinal epithelium is acquired as a mutation-induced clonal mark during DNA replication. By determining the rate of mutation in vivo and combining this data with the known neutral-drift dynamics that describe intestinal stem cell replacement, we quantify the number of functional stem cells in crypts and adenomas. Contrary to previous reports, we find that significantly lower numbers of "working" stem cells are present in the intestinal epithelium (five to seven per crypt) and in adenomas (nine per gland), and that those stem cells are also replaced at a significantly lower rate. These findings suggest that the bulk of tumor stem cell divisions serve only to replace stem cell loss, with rare clonal victors driving gland repopulation and tumor growth.
    Full-text Article · Sep 2013 · Cell Stem Cell
  • Source
    ABSTRACT: Dynamic activity of signaling pathways, such as Notch, is vital to achieve correct development and homeostasis. However, most studies assess output many hours or days after initiation of signaling, once the outcome has been consolidated. Here we analyze genome-wide changes in transcript levels, binding of the Notch pathway transcription factor, CSL [Suppressor of Hairless, Su(H), in Drosophila], and RNA Polymerase II (Pol II) immediately following a short pulse of Notch stimulation. A total of 154 genes showed significant differential expression (DE) over time, and their expression profiles stratified into 14 clusters based on the timing, magnitude, and direction of DE. E(spl) genes were the most rapidly upregulated, with Su(H), Pol II, and transcript levels increasing within 5-10 minutes. Other genes had a more delayed response, the timing of which was largely unaffected by more prolonged Notch activation. Neither Su(H) binding nor poised Pol II could fully explain the differences between profiles. Instead, our data indicate that regulatory interactions, driven by the early-responding E(spl)bHLH genes, are required. Proposed cross-regulatory relationships were validated in vivo and in cell culture, supporting the view that feed-forward repression by E(spl)bHLH/Hes shapes the response of late-responding genes. Based on these data, we propose a model in which Hes genes are responsible for co-ordinating the Notch response of a wide spectrum of other targets, explaining the critical functions these key regulators play in many developmental and disease contexts.
    Full-text Article · Jan 2013 · PLoS Genetics
  • Dataset: Figure S1
    ABSTRACT: Clustered expression profiles of DE genes. Graphs show log2 fold change in mRNA levels over time (min) for gene clusters. Black lines represent profiles of individual genes and coloured lines show the mean response of the cluster. (TIF)
    Dataset · Jan 2013
  • Dataset: Figure S4
    ABSTRACT: Effect of cycloheximide on Notch response profiles. Graphs show log2 fold change in mRNA levels over time (min) for the indicated genes in the presence (black line) or absence (grey line) of cycloheximide (CHX). Error bars indicate standard error of the mean from 3 biological replicates. (TIF)
    Dataset · Jan 2013
  • Dataset: Table S5
    ABSTRACT: Time-course expression data. Fold changes in expression for all expressed genes at the indicated time points (min); results are for individual replicates (rep1-rep4). (XLSX)
    Dataset · Jan 2013
  • Dataset: Table S3
    ABSTRACT: Su(H) and Hairy binding site analysis. Tabs indicate type of motif analysis. Columns as detailed in each sub-table. (XLSX)
    Dataset · Jan 2013
  • Dataset: Text S1
    ABSTRACT: Text file with additional details of methods and statistical analysis. (DOC)
    Dataset · Jan 2013
  • Dataset: Figure S2
    ABSTRACT: Activated Caspase 3 in treated and untreated cells and role of Su(H) motifs in W/hid enhancer. A. Images show staining for activated Caspase 3 (red) in DmD8 cells at indicated times following a 5 minute pulse of EDTA treatment (Nact/EDTA) or control treatment. The turquoise channel shows a phase contrast image of the field. B. Average number of cells containing activated Caspase 3 per field, quantified from a minimum of 5 fields per condition. Error bars indicate standard error of the mean. No significant differences were found between Notch activated and control conditions (30 min – p = 0.34, 60 min – p = 0.79). C. Response of the indicated enhancers to Nicd in transient transfection assays in Drosophila cells, expressed as fold-change (dark bars) relative to expression levels in the absence of Nicd (pale bars). Mutating Su(H) motifs in the W/hid enhancer (green bars) abolishes responsiveness to Nicd. Error bars indicate standard error of the mean from 3 biological replicates. E(spl)m3, control and un-mutated W/hid luciferase reporters were described previously [29]. Su(H) sites in the W/hid enhancer were mutated using oligonucleotides with 3-bp mismatch (introducing T at positions 3, 4 and 8) as described previously [29]. (TIF)
    Dataset · Jan 2013
  • Dataset: Table S4
    ABSTRACT: Details of oligonucleotides used for qPCR. Name, gene name and primer orientation; Sequence, Primer sequence. (XLSX)
    Dataset · Jan 2013
  • Dataset: Table S1
    ABSTRACT: Summary of differentially expressed genes. Column A, Oligo ID on expression arrays; Column B, FBgn number for each gene; Column C, Gene symbol; Column D, Transcript index; Column E, Transcript CG number; Column F, FBtr number for each transcript; Column G, Transcript symbol; Column H, Chromosome; Column I–J, Left and right limits of gene; Column K, Gene orientation: 1 = forward strand, −1 = reverse strand; Column L, Rank by probability of differential expression; Column M, p-value of differential expression; Column N, Q-value of differential expression; Column O, Cluster assignment; Column P, Primary cluster assignment; Column Q, Secondary cluster assignment; Column R, p-value for primary cluster assignment; Column S, p-value for secondary cluster assignment; Column T–Z, PolII class assignment at 0, 10, 20, 30, 40, 60 or 100 min; Column AA–AG, No. of Su(H) binding peaks within 10 kb at 0, 10, 20, 30, 40, 60 or 100 min; Column AH–AK, A values for 4 trials at t = 0 (blue shading); Column AL–BC, Median M values for each transcript at each timepoint (yellow shading). (XLS)
    Dataset · Jan 2013
  • Dataset: Figure S3
    ABSTRACT: Temporal changes in Pol II profiles at edl and argos. Enrichment for Pol II (red; 0–4.7 fold enrichment on a log2 scale) across the edl and argos genes at different time points (min) after Notch activation. (TIF)
    Dataset · Jan 2013
  • Dataset: Table S2
    ABSTRACT: Genes within 10 kb of Su(H) peaks. Tabs indicate time points. Column headings as follows: GeneFBgn, FlyBase gene identifier; GeneSymbol, Gene symbol; Chromosome, Chromosome; OligoID, Array oligo ID; TransIndex, Transcript number; TransCGNumber, Transcript CG number; TransFBtr, FlyBase transcript identifier; TransSymbol, Transcript symbol; TransLeft, Coordinate of left transcript limit; TransRight, Coordinate of right transcript limit; TransStrand, Transcript orientation (1 = 5′ to 3′; −1 = 3′ to 5′); SuHLeft, Coordinate of Su(H) peak left limit; SuHRight, Coordinate of Su(H) peak right limit; MinDist, Minimum distance between transcript and Su(H) peak (0 indicates that the peak overlaps with the transcript). (XLSX)
    Dataset · Jan 2013
  • Source
    Full-text Dataset · Dec 2012
  • Source
    ABSTRACT: To identify novel dynamic patterns of gene expression, we develop a statistical method to cluster noisy measurements of gene expression collected from multiple replicates at multiple time points, with an unknown number of clusters. We propose a random-effects mixture model coupled with a Dirichlet-process prior for clustering. The mixture model formulation allows for probabilistic cluster assignments. The random-effects formulation allows for attributing the total variability in the data to the sources that are consistent with the experimental design, particularly when the noise level is high and the temporal dependence is not strong. The Dirichlet-process prior induces a prior distribution on partitions and helps to estimate the number of clusters (or mixture components) from the data. We further tackle two challenges associated with Dirichlet-process prior-based methods. One is efficient sampling. We develop a novel Metropolis-Hastings Markov Chain Monte Carlo (MCMC) procedure to sample the partitions. The other is efficient use of the MCMC samples in forming clusters. We propose a two-step procedure for posterior inference, which involves resampling and relabeling, to estimate the posterior allocation probability matrix. This matrix can be directly used in cluster assignments, while describing the uncertainty in clustering. We demonstrate the effectiveness of our model and sampling procedure through simulated data. Applying our method to a real data set collected from Drosophila adult muscle cells after five-minute Notch activation, we identify 14 clusters of different transcriptional responses among 163 differentially expressed genes, which provides several novel insights into underlying transcriptional mechanisms in the Notch signaling pathway. The algorithm developed here is implemented in the R package DIRECT.
    Full-text Article · Oct 2012 · The Annals of Applied Statistics

Publication Stats

4k Citations


  • 2012
    • University of Cambridge
      • Department of Applied Mathematics and Theoretical Physics
      Cambridge, England, United Kingdom
  • 2008
    • Cancer Research UK Cambridge Institute
      Cambridge, England, United Kingdom
  • 2004
    • University of Southern California
      • Department of Biological Sciences
      Los Angeles, CA, United States
  • 1999
    • University of Oxford
      Oxford, England, United Kingdom
  • 1992-1996
    • University of California, Los Angeles
      • Department of Mathematics
      Los Angeles, California, United States
  • 1981-1986
    • University of Utah
      Salt Lake City, Utah, United States
  • 1982
    • Colorado State University
      • Department of Statistics
      Fort Collins, Colorado, United States