ABSTRACT: Cancer genome sequencing studies have identified numerous driver genes, but the relative timing of mutations in carcinogenesis remains unclear. The gradual progression from premalignant Barrett's esophagus to esophageal adenocarcinoma (EAC) provides an ideal model to study the ordering of somatic mutations. We identified recurrently mutated genes and assessed clonal structure using whole-genome sequencing and amplicon resequencing of 112 EACs. We next screened a cohort of 109 biopsies from 2 key transition points in the development of malignancy: benign metaplastic never-dysplastic Barrett's esophagus (NDBE; n=66) and high-grade dysplasia (HGD; n=43). Unexpectedly, the majority of recurrently mutated genes in EAC were also mutated in NDBE. Only TP53 and SMAD4 mutations occurred in a stage-specific manner, confined to HGD and EAC, respectively. Finally, we applied this knowledge to identify high-risk Barrett's esophagus in a new non-endoscopic test. In conclusion, mutations in EAC driver genes generally occur exceptionally early in disease development with profound implications for diagnostic and therapeutic strategies.
ABSTRACT: A series of clonal expansions is thought to underlie the progression of Barrett's oesophagus (BE) to oesophageal adenocarcinoma (OAC). Each expansion carries with it somatic driver mutation(s), fixing it within a larger population and therefore increasing the likelihood of acquiring a second mutation. However, the precise order in which somatic variants occur remains unknown.
Gut 06/2014; 63(Suppl 1):A105-A106. DOI:10.1136/gutjnl-2014-307263.227 · 14.66 Impact Factor
ABSTRACT: In metabolomics the goal is to identify and measure the concentrations of different metabolites (small molecules) in a cell or a biological system. The metabolites form an important layer in the complex metabolic network, and the interactions between different metabolites are often of interest. It is crucial to perform proper normalization of metabolomics data, but current methods may not be applicable when estimating interactions in the form of correlations between metabolites. We propose a normalization approach based on a mixed model, with simultaneous estimation of a correlation matrix. We also investigate how the common use of a calibration standard in NMR experiments affects the estimation of correlations.
We show with both real and simulated data that our proposed normalization method is robust and has good performance when discovering true correlations between metabolites. The standardization of NMR data is shown in simulation studies to affect our ability to discover true correlations to a small extent. However, comparing standardized and non-standardized real data does not result in any large differences in correlation estimates.
Source code is freely available at https://sourceforge.net/projects/metabnorm/
CONTACT: firstname.lastname@example.org
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT: Most methods for estimating differential expression from RNA-seq are based on statistics that compare normalised read counts between treatment classes. Unfortunately, reads are in general too short to be mapped unambiguously to features of interest, such as genes, isoforms or haplotype-specific isoforms. There are methods for estimating expression levels that account for this source of ambiguity. However, the uncertainty is not generally accounted for in downstream analysis of gene expression experiments. Moreover, at the individual transcript level, it can sometimes be too large to allow useful comparisons between treatment groups.
In this paper we make two proposals that improve the power, specificity and versatility of expression analysis using RNA-seq data. Firstly, we present a Bayesian method for model selection that accounts for read mapping ambiguities using random effects. This polytomous model selection approach can be used to identify many interesting patterns of gene expression and is not confined to detecting differential expression between two groups. For illustration, we use our method to detect imprinting, different types of regulatory divergence in cis and in trans, and differential isoform usage, but many other applications are possible. Secondly, we present a novel collapsing algorithm for grouping transcripts into inferential units that exploits the posterior correlation between transcript expression levels. The aggregate expression levels of these units can be estimated with useful levels of uncertainty. Our algorithm can improve the precision of expression estimates when uncertainty is large with only a small reduction in biological resolution.
We have implemented our software in the mmdiff and mmcollapse multi-threaded C++ programs as part of the open-source MMSEQ package, available on https://github.com/eturro/mmseq.
ABSTRACT: Lineage-tracing approaches, widely used to characterize stem cell populations, rely on the specificity and stability of individual markers for accurate results. We present a method in which genetic labeling in the intestinal epithelium is acquired as a mutation-induced clonal mark during DNA replication. By determining the rate of mutation in vivo and combining these data with the known neutral-drift dynamics that describe intestinal stem cell replacement, we quantify the number of functional stem cells in crypts and adenomas. Contrary to previous reports, we find that significantly lower numbers of "working" stem cells are present in the intestinal epithelium (five to seven per crypt) and in adenomas (nine per gland), and that those stem cells are also replaced at a significantly lower rate. These findings suggest that the bulk of tumor stem cell divisions serve only to replace stem cell loss, with rare clonal victors driving gland repopulation and tumor growth.
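The neutral-drift dynamics mentioned in the abstract above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' model: it treats a single marked clone among `n_stem_cells` equipotent stem cells as a symmetric random walk that ends when the clone is lost or takes over the crypt. The function name `simulate_clone_drift` and all parameters are hypothetical.

```python
import random


def simulate_clone_drift(n_stem_cells, seed=None, max_steps=1_000_000):
    """Follow one marked clone until it is lost (size 0) or fixed (size N).

    Assumption: each replacement event at the clone boundary grows or
    shrinks the clone by one cell with equal probability (neutral drift).
    Returns the final clone size and the number of replacement events.
    """
    rng = random.Random(seed)
    clone_size = 1  # a single newly marked stem cell
    steps = 0
    while 0 < clone_size < n_stem_cells and steps < max_steps:
        clone_size += rng.choice((-1, 1))
        steps += 1
    return clone_size, steps


if __name__ == "__main__":
    n = 6  # illustrative value only
    runs = 10_000
    fixed = sum(simulate_clone_drift(n, seed=s)[0] == n for s in range(runs))
    # For a symmetric walk started at size 1, the fixation probability is 1/N.
    print(f"fraction of clones that fixed: {fixed / runs:.3f} (theory: {1 / n:.3f})")
```

Under this simplification most marked clones are lost and only a fraction of about 1/N reach monoclonality, which is the kind of relationship that lets clone-size data constrain the number of functional stem cells.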
ABSTRACT: Dynamic activity of signaling pathways, such as Notch, is vital to achieve correct development and homeostasis. However, most studies assess output many hours or days after initiation of signaling, once the outcome has been consolidated. Here we analyze genome-wide changes in transcript levels, binding of the Notch pathway transcription factor, CSL [Suppressor of Hairless, Su(H), in Drosophila], and RNA Polymerase II (Pol II) immediately following a short pulse of Notch stimulation. A total of 154 genes showed significant differential expression (DE) over time, and their expression profiles stratified into 14 clusters based on the timing, magnitude, and direction of DE. E(spl) genes were the most rapidly upregulated, with Su(H), Pol II, and transcript levels increasing within 5-10 minutes. Other genes had a more delayed response, the timing of which was largely unaffected by more prolonged Notch activation. Neither Su(H) binding nor poised Pol II could fully explain the differences between profiles. Instead, our data indicate that regulatory interactions, driven by the early-responding E(spl)bHLH genes, are required. Proposed cross-regulatory relationships were validated in vivo and in cell culture, supporting the view that feed-forward repression by E(spl)bHLH/Hes shapes the response of late-responding genes. Based on these data, we propose a model in which Hes genes are responsible for co-ordinating the Notch response of a wide spectrum of other targets, explaining the critical functions these key regulators play in many developmental and disease contexts.
ABSTRACT: To identify novel dynamic patterns of gene expression, we develop a
statistical method to cluster noisy measurements of gene expression collected
from multiple replicates at multiple time points, with an unknown number of
clusters. We propose a random-effects mixture model coupled with a
Dirichlet-process prior for clustering. The mixture model formulation allows
for probabilistic cluster assignments. The random-effects formulation allows
for attributing the total variability in the data to the sources that are
consistent with the experimental design, particularly when the noise level is
high and the temporal dependence is not strong. The Dirichlet-process prior
induces a prior distribution on partitions and helps to estimate the number of
clusters (or mixture components) from the data. We further tackle two
challenges associated with Dirichlet-process prior-based methods. One is
efficient sampling. We develop a novel Metropolis-Hastings Markov Chain Monte
Carlo (MCMC) procedure to sample the partitions. The other is efficient use of
the MCMC samples in forming clusters. We propose a two-step procedure for
posterior inference, which involves resampling and relabeling, to estimate the
posterior allocation probability matrix. This matrix can be directly used in
cluster assignments, while describing the uncertainty in clustering. We
demonstrate the effectiveness of our model and sampling procedure through
simulated data. Applying our method to a real data set collected from
Drosophila adult muscle cells after five-minute Notch activation, we identify
14 clusters of different transcriptional responses among 163 differentially
expressed genes, which provides several novel insights into underlying
transcriptional mechanisms in the Notch signaling pathway. The algorithm
developed here is implemented in the R package DIRECT.
The Annals of Applied Statistics 10/2012; 7(3). DOI:10.1214/13-AOAS650 · 1.46 Impact Factor
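The abstract above notes that a Dirichlet-process prior induces a prior distribution on partitions without fixing the number of clusters in advance. One standard way to see this is the Chinese restaurant process, sketched below. This is an illustration of the prior only, not the Metropolis-Hastings sampler or the DIRECT package described in the abstract; the function name `sample_crp_partition` and the concentration parameter `alpha` are assumptions made for this example.

```python
import random


def sample_crp_partition(n_items, alpha, seed=None):
    """Draw one partition of n_items from a Chinese restaurant process.

    Item i joins an existing cluster with probability proportional to that
    cluster's current size, or opens a new cluster with probability
    proportional to alpha (the concentration parameter).
    Returns per-item cluster labels and the list of cluster sizes.
    """
    rng = random.Random(seed)
    assignments = []
    cluster_sizes = []
    for _ in range(n_items):
        weights = cluster_sizes + [alpha]  # last slot means "new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments, cluster_sizes


if __name__ == "__main__":
    labels, sizes = sample_crp_partition(n_items=20, alpha=1.0, seed=1)
    print("labels:", labels)
    print("cluster sizes:", sizes)
```

Larger values of `alpha` favour more clusters a priori; in a full model the data likelihood then reweights these partitions, which is how the number of clusters is estimated from the data.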
ABSTRACT: Introduction
The Interferons (IFNs) are a family of pleiotropic cytokines that mediate anti-microbial, anti-proliferative, anti-tumour, immuno-modulatory and homoeostatic host defence functions. Differential regulation of Interferon Stimulated Genes (ISGs) by IFNs, via multiple signalling pathways and the combinatorial effects of IFN-induced transcription factors, co-regulators, ncRNA and epigenetic modifications underlie these diverse biological functions.
We have built InterferonScape – a software platform for data-integration, mining and visualization to study IFN mediated ISG regulation at a systems level. This is a centralized systems-immunology resource that integrates ISG datasets into a data warehouse with a multidimensional data analysis capability. Statistical and computational data-mining methods combined with easy to use graphical user interfaces are included to provide custom-built data-visualization capability. This resource is provided as a web 2.0 rich internet application with novel embedded data-mining, search and semantic web technologies.
Using integrative data-mining approaches we previously identified approximately 1800 ISGs in the human genome. Very few of these ISGs have been studied systematically, and information on ISG pathways is virtually absent from both public and commercial pathway repositories. We have utilised computational pathway modelling approaches, coupled to natural language text mining and expert manual curation to build a reliable collection of ISG pathways and networks. By integrating both in-house generated and publicly-available gene expression, regulatory genomic and epigenetic datasets in different cell and tissue types and in both normal and disease conditions, we have identified the regulatory elements and promoter modules involved in the regulation of ISG networks.
Using integrative genomic and computational biology approaches we will demonstrate the evolutionary conservation, information flow, connectivity, network topology and regulatory interactions underlying these ISG networks. Understanding the propagation of ISG networks in different conditions will provide a greater insight into IFN biology in health and disease.
ABSTRACT: The elucidation of breast cancer subgroups and their molecular drivers requires integrated views of the genome and transcriptome from representative numbers of patients. We present an integrated analysis of copy number and gene expression in a discovery and validation set of 997 and 995 primary breast tumours, respectively, with long-term clinical follow-up. Inherited variants (copy number variants and single nucleotide polymorphisms) and acquired somatic copy number aberrations (CNAs) were associated with expression in ~40% of genes, with the landscape dominated by cis- and trans-acting CNAs. By delineating expression outlier genes driven in cis by CNAs, we identified putative cancer genes, including deletions in PPP2R2A, MTAP and MAP2K4. Unsupervised analysis of paired DNA–RNA profiles revealed novel subgroups with distinct clinical outcomes, which reproduced in the validation cohort. These include a high-risk, oestrogen-receptor-positive 11q13/14 cis-acting subgroup and a favourable prognosis subgroup devoid of CNAs. Trans-acting aberration hotspots were found to modulate subgroup-specific gene networks, including a TCR deletion-mediated adaptive immune response in the ‘CNA-devoid’ subgroup and a basal-specific chromosome 5 deletion-associated mitotic network. Our results provide a novel molecular stratification of the breast cancer population, derived from the impact of somatic CNAs on the transcriptome.
ABSTRACT: Phylogeographic methods have attracted a lot of attention in recent years, stressing the need to provide a solid statistical framework for many existing methodologies so as to draw statistically reliable inferences. Here, we take a flexible fully Bayesian approach by reducing the problem to a clustering framework, whereby the population distribution can be explained by a set of migrations, forming geographically stable population clusters. These clusters are such that they are consistent with a fixed number of migrations on the corresponding (unknown) subdivided coalescent tree. Our methods rely upon a clustered population distribution, and allow for inclusion of various covariates (such as phenotype or climate information) at little additional computational cost. We illustrate our methods with an example from weevil mitochondrial DNA sequences from the Iberian peninsula.
Interface focus: a theme supplement of Journal of the Royal Society interface 12/2011; 1(6):909-21. DOI:10.1098/rsfs.2011.0054 · 2.63 Impact Factor
ABSTRACT: The cancer stem cell (CSC) concept is a highly debated topic in cancer research. While experimental evidence in favor of the cancer stem cell theory is apparently abundant, the results are often criticized as being difficult to interpret. An important reason for this is that most experimental data that support this model rely on transplantation studies. In this study we use a novel cellular Potts model to elucidate the dynamics of established malignancies that are driven by a small subset of CSCs. Our results demonstrate that epigenetic mutations that occur during mitosis display highly altered dynamics in CSC-driven malignancies compared to a classical, non-hierarchical model of growth. In particular, the heterogeneity observed in CSC-driven tumors is considerably higher. We speculate that this feature could be used in combination with epigenetic (methylation) sequencing studies of human malignancies to prove or refute the CSC hypothesis in established tumors without the need for transplantation. Moreover, our tumor growth simulations indicate that CSC-driven tumors display evolutionary features that can be considered beneficial during tumor progression. Besides an increased heterogeneity, they also exhibit properties that allow the escape of clones from local fitness peaks. This leads to more aggressive phenotypes in the long run and makes the neoplasm more adaptable to stringent selective forces such as cancer treatment. Indeed, when therapy is applied, the clone landscape of the regrown tumor is more aggressive with respect to the primary tumor, whereas the classical model demonstrated similar patterns before and after therapy. Understanding these often counter-intuitive fundamental properties of (non-)hierarchically organized malignancies is a crucial step in validating the CSC concept as well as providing insight into the therapeutic consequences of this model.
ABSTRACT: This paper presents Sparse Partitioning, a Bayesian method for identifying
predictors that either individually or in combination with others affect a
response variable. The method is designed for regression problems involving
binary or tertiary predictors and allows the number of predictors to exceed the
size of the sample, two properties which make it well suited for association
studies. Sparse Partitioning differs from other regression methods by placing
no restrictions on how the predictors may influence the response. To compensate
for this generality, Sparse Partitioning implements a novel way of exploring
the model space. It searches for high posterior probability partitions of the
predictor set, where each partition defines groups of predictors that jointly
influence the response. The result is a robust method that requires no prior
knowledge of the true predictor--response relationship. Testing on simulated
data suggests Sparse Partitioning will typically match the performance of an
existing method on a data set which obeys the existing method's model
assumptions. When these assumptions are violated, Sparse Partitioning will
generally offer superior performance.
The Annals of Applied Statistics 01/2011; 5(2011). DOI:10.1214/10-AOAS411 · 1.46 Impact Factor
ABSTRACT: Copy number abnormalities (CNAs) represent an important type of genetic mutation that can lead to abnormal cell growth and proliferation. New high-throughput sequencing technologies promise comprehensive characterization of CNAs. In contrast to microarrays, where probe design follows a carefully developed protocol, reads represent a random sample from a library and may be prone to representation biases due to GC content and other factors. The discrimination between true and false positive CNAs becomes an important issue.
We present a novel approach, called CNAseg, to identify CNAs from second-generation sequencing data. It uses depth of coverage to estimate copy number states and flowcell-to-flowcell variability in cancer and normal samples to control the false positive rate. We tested the method using the COLO-829 melanoma cell line sequenced to 40-fold coverage. An extensive simulation scheme was developed to recreate different scenarios of copy number changes and depth of coverage by altering a real dataset with spiked-in CNAs. Comparison to alternative approaches using both real and simulated datasets showed that CNAseg achieves superior precision and improved sensitivity estimates.
The CNAseg package and test data are available at http://www.compbio.group.cam.ac.uk/software.html.
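The CNAseg abstract above describes using depth of coverage to estimate copy number states. The basic idea behind depth-of-coverage analysis can be sketched as a per-window log2 ratio of tumour to normal read counts after library-size normalisation. This is a generic illustration under stated assumptions and not the CNAseg algorithm, which additionally models flowcell-to-flowcell variability; the function name `coverage_log2_ratios` and the `pseudocount` parameter are hypothetical.

```python
import math


def coverage_log2_ratios(tumour_counts, normal_counts, pseudocount=0.5):
    """Per-window log2 ratio of tumour to normal read depth.

    Counts are scaled by each library's total so that differences in
    sequencing depth cancel out. A small pseudocount (an assumption of
    this sketch) avoids taking the logarithm of zero in empty windows.
    """
    if len(tumour_counts) != len(normal_counts):
        raise ValueError("tumour and normal must cover the same windows")
    tumour_total = sum(tumour_counts)
    normal_total = sum(normal_counts)
    ratios = []
    for t, n in zip(tumour_counts, normal_counts):
        t_norm = (t + pseudocount) / tumour_total
        n_norm = (n + pseudocount) / normal_total
        ratios.append(math.log2(t_norm / n_norm))
    return ratios


if __name__ == "__main__":
    # Toy counts: the third window has doubled depth in the tumour sample.
    tumour = [100, 100, 200, 100]
    normal = [100, 100, 100, 100]
    for i, r in enumerate(coverage_log2_ratios(tumour, normal)):
        print(f"window {i}: log2 ratio = {r:+.2f}")
```

Windows with ratios well above or below the sample baseline are candidate gains or losses. Deciding which of them are real is the false-positive control problem the abstract addresses.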
ABSTRACT: The current generation of single nucleotide polymorphism (SNP) arrays allows measurement of copy number aberrations (CNAs) in cancer at more than one million locations in the genome in hundreds of tumour samples. Most research has focused on single-sample CNA discovery, the so-called segmentation problem. The availability of high-density, large sample-size SNP array datasets makes the identification of recurrent copy number changes in cancer an important issue that can be addressed using cross-sample information.
We present a novel approach for finding regions of recurrent copy number aberrations, called CNAnova, from Affymetrix SNP 6.0 array data. The method derives its statistical properties from a control dataset composed of normal samples and, in contrast to previous methods, does not require segmentation and permutation steps. For rigorous testing of the algorithm and comparison to existing methods, we developed a simulation scheme that uses the noise distribution present in Affymetrix arrays. Application of the method to 128 acute lymphoblastic leukaemia samples shows that CNAnova achieves a lower error rate than a popular alternative approach. We also describe an extension of the CNAnova framework to identify recurrent CNA regions with intra-tumour heterogeneity, present in either primary or relapsed samples from the same patients.
The CNAnova package and synthetic datasets are available at http://www.compbio.group.cam.ac.uk/software.html.
ABSTRACT: The dynamics of tumour evolution are not well understood. In this paper we provide a statistical framework for evaluating the molecular variation observed in different parts of a colorectal tumour. A multi-sample version of the Ewens Sampling Formula forms the basis for our modelling of the data, and we provide a simulation procedure for use in obtaining reference distributions for the statistics of interest. We also describe the large-sample asymptotics of the joint distributions of the variation observed in different parts of the tumour. While actual data should be evaluated with reference to the simulation procedure, the asymptotics serve to provide theoretical guidelines, for instance with reference to the choice of possible statistics. Comment: 22 pages, 1 figure. Chapter 4 of "Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman" (Editors N.H. Bingham and C.M. Goldie), Cambridge University Press, 2010
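For readers unfamiliar with the Ewens Sampling Formula mentioned in the abstract above, the sketch below evaluates the standard single-sample form. It gives the probability that a sample of n items contains a_j types represented exactly j times, for a mutation parameter theta: n! / (theta (theta+1) ... (theta+n-1)) multiplied by the product over j of theta^(a_j) / (j^(a_j) a_j!). This is not the multi-sample version developed in the chapter, and the function name `ewens_probability` is an assumption of this example.

```python
from math import factorial


def ewens_probability(type_counts, theta):
    """Single-sample Ewens Sampling Formula.

    type_counts[j - 1] is a_j, the number of types seen exactly j times.
    The sample size n is implied by sum(j * a_j).
    """
    n = sum(j * a for j, a in enumerate(type_counts, start=1))
    rising = 1.0  # theta * (theta + 1) * ... * (theta + n - 1)
    for i in range(n):
        rising *= theta + i
    prob = factorial(n) / rising
    for j, a in enumerate(type_counts, start=1):
        prob *= theta ** a / (j ** a * factorial(a))
    return prob


if __name__ == "__main__":
    theta = 1.5
    # All configurations for a sample of size 3.
    configs = {
        "three singletons": [3, 0, 0],
        "one singleton, one pair": [1, 1, 0],
        "one triple": [0, 0, 1],
    }
    for name, a in configs.items():
        print(f"{name}: {ewens_probability(a, theta):.4f}")
```

The probabilities over all configurations of a given sample size sum to one, which is a convenient sanity check when building simulation-based reference distributions on top of the formula.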
ABSTRACT: The amplification of millions of single molecules in parallel can be performed on microscopic magnetic beads that are contained in aqueous compartments of an oil-buffer emulsion. These bead-emulsion amplification (BEA) reactions result in beads that are covered by almost-identical copies derived from a single template. The post-amplification analysis is performed using different fluorophore-labeled probes. First, we identified BEA reaction conditions that efficiently produce longer amplicons of up to 450 base pairs. These conditions include the use of a Titanium Taq amplification system. Second, we explored alternate fluorophores coupled to probes for post-PCR DNA analysis. We demonstrate that four different Alexa fluorophores can be used simultaneously with extremely low crosstalk. Finally, we developed an allele-specific extension chemistry, based on Alexa dyes, that queries individual nucleotides of the amplified material and is both highly efficient and specific.