Antti Honkela

Helsinki Institute for Information Technology HIIT, Helsinki, Uusimaa, Finland

Publications (78) · 99.33 Total impact

  • Source
    ABSTRACT: Genes with similar transcriptional activation kinetics can display very different temporal mRNA profiles due to differences in transcription time, degradation rate and RNA processing kinetics. Recent studies have shown that a splicing-associated RNA processing delay can be significant. We introduce a joint model of transcriptional activation and mRNA accumulation which can be used for inference of transcription rate, RNA processing delay and degradation rate given genome-wide data from high-throughput sequencing time course experiments. We combine a mechanistic differential equation model with a non-parametric statistical modelling approach which allows us to capture a broad range of activation kinetics, and use Bayesian parameter estimation to quantify the uncertainty in the estimates of the kinetic parameters. We apply the model to data from estrogen receptor (ER-α) activation in the MCF-7 breast cancer cell line. We use RNA polymerase II (pol-II) ChIP-Seq time course data to characterise transcriptional activation and mRNA-Seq time course data to quantify mature transcripts. We find that 11% of genes with a good signal in the data display a delay of more than 20 minutes between completing transcription and mature mRNA production. The genes displaying these long delays are significantly more likely to be short. We also find a statistical association between high delay and late intron retention in pre-mRNA data, indicating significant splicing-associated processing delays in many genes.
  • Source
    ABSTRACT: Motivation: Assigning RNA-seq reads to their transcript of origin is a fundamental task in transcript expression estimation. Where ambiguities in assignments exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem can be solved through probabilistic inference. Bayesian methods have been shown to provide accurate transcript abundance estimates compared to competing methods. However, exact Bayesian inference is intractable and approximate methods such as Markov chain Monte Carlo (MCMC) and Variational Bayes (VB) are typically used. While providing a high degree of accuracy and modelling flexibility, standard implementations can be prohibitively slow for large datasets and complex transcriptome annotations. Results: We propose a novel approximate inference scheme based on VB and apply it to an existing model of transcript expression inference from RNA-seq data. Recent advances in VB algorithmics are used to improve the convergence of the algorithm beyond the standard Variational Bayes Expectation Maximisation (VBEM) algorithm. We apply our algorithm to simulated and biological datasets, demonstrating a significant increase in speed with only very small loss in accuracy of expression level estimation. We carry out a comparative study against six popular alternative methods and demonstrate that our new algorithm provides better accuracy and inter-replicate consistency while remaining competitive in computation time. Availability: The methods were implemented in R and C++, and are available as part of the BitSeq project at https://github.com/BitSeq. The method is also available through the BitSeq Bioconductor package.
  • Source
    ABSTRACT: Predicting the best treatment strategy from genomic information is a core goal of precision medicine. Here we focus on predicting drug response based on a cohort of genomic, epigenomic and proteomic profiling data sets measured in human breast cancer cell lines. Through a collaborative effort between the National Cancer Institute (NCI) and the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we analyzed a total of 44 drug sensitivity prediction algorithms. The top-performing approaches modeled nonlinear relationships and incorporated biological pathway information. We found that gene expression microarrays consistently provided the best predictive power of the individual profiling data sets; however, performance was increased by including multiple, independent data sets. We discuss the innovations underlying the top-performing methodology, Bayesian multitask MKL, and we provide detailed descriptions of all methods. This study establishes benchmarks for drug sensitivity prediction and identifies approaches that can be leveraged for the development of new methods.
    Nature Biotechnology 06/2014; DOI:10.1038/nbt.2877 · 39.08 Impact Factor
  • Source
    ABSTRACT: Gene transcription mediated by RNA polymerase II (pol-II) is a key step in gene expression. The dynamics of pol-II moving along the transcribed region influence the rate and timing of gene expression. In this work, we present a probabilistic model of transcription dynamics which is fitted to pol-II occupancy time course data measured using ChIP-Seq. The model can be used to estimate transcription speed and to infer the temporal pol-II activity profile at the gene promoter. Model parameters are estimated using either maximum likelihood estimation or via Bayesian inference using Markov chain Monte Carlo sampling. The Bayesian approach provides confidence intervals for parameter estimates and allows the use of priors that capture domain knowledge, e.g. the expected range of transcription speeds, based on previous experiments. The model describes the movement of pol-II down the gene body and can be used to identify the time of induction for transcriptionally engaged genes. By clustering the inferred promoter activity time profiles, we are able to determine which genes respond quickly to stimuli and group genes that share activity profiles and may therefore be co-regulated. We apply our methodology to biological data obtained using ChIP-seq to measure pol-II occupancy genome-wide when MCF-7 human breast cancer cells are treated with estradiol (E2). The transcription speeds we obtain agree with those obtained previously for smaller numbers of genes with the advantage that our approach can be applied genome-wide. We validate the biological significance of the pol-II promoter activity clusters by investigating cluster-specific transcription factor binding patterns and determining canonical pathway enrichment. We find that rapidly induced genes are enriched for both estrogen receptor alpha (ERα) and FOXA1 binding in their proximal promoter regions.
    PLoS Computational Biology 05/2014; 10(5):e1003598. DOI:10.1371/journal.pcbi.1003598 · 4.83 Impact Factor
  • Source
    ABSTRACT: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments not only use HTS to measure genomic features at one time point but to monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analysing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth. We present the beta-binomial Gaussian process (BBGP) model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present simulations exploring different experimental design choices and results on real data from a Drosophila experimental evolution experiment in temperature adaptation. Availability: R software implementing the test is available at https://github.com/handetopa/BBGP. Contact: hande.topa@aalto.fi, agnes.jonas@vetmeduni.ac.at, carolin.kosiol@vetmeduni.ac.at, antti.honkela@hiit.fi.
    Bioinformatics 03/2014; 31(11). DOI:10.1093/bioinformatics/btv014 · 4.62 Impact Factor
  • Source
    Sohan Seth · Niko Välimäki · Samuel Kaski · Antti Honkela
    ABSTRACT: Motivation: Over the recent years, the field of whole-metagenome shotgun sequencing has witnessed significant growth owing to the high-throughput sequencing technologies that allow sequencing genomic samples cheaper, faster and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. Results: In this article, we develop a content-based exploration and retrieval method for whole-metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome datasets as well as human microbiome project metagenomic samples. We observe significant enrichment for diseased gut samples in results of queries with another diseased sample and high accuracy in discriminating between different body sites even though the method is unsupervised. Availability and implementation: A software implementation of the DSM framework is available at https://github.com/HIITMetagenomics/dsm-framework. Contact: sohan.seth@hiit.fi or antti.honkela@hiit.fi Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 08/2013; 30(17). DOI:10.1093/bioinformatics/btu340 · 4.62 Impact Factor
  • Source
    James Hensman · Peter Glaus · Antti Honkela · Magnus Rattray
    ABSTRACT: Motivation: The mapping of RNA-seq reads to their transcripts of origin is a fundamental task in transcript expression estimation and differential expression scoring. Where ambiguities in mapping exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem becomes an instance of non-trivial probabilistic inference. Bayesian inference in such a problem is intractable and approximate methods must be used such as Markov chain Monte Carlo (MCMC) and Variational Bayes. Standard implementations of these methods can be prohibitively slow for large datasets and complex gene models. Results: We propose an approximate inference scheme based on Variational Bayes applied to an existing model of transcript expression inference from RNA-seq data. We apply recent advances in Variational Bayes algorithmics to improve the convergence of the algorithm beyond the standard variational expectation-maximisation approach. We apply our algorithm to simulated and biological datasets, demonstrating that the increase in speed requires only a small trade-off in accuracy of expression level estimation. Availability: The methods were implemented in R and C++, and are available as part of the BitSeq project at https://code.google.com/p/bitseq/. The methods will be made available through the BitSeq Bioconductor package at the next stable release.
  • Source
    Karolis Uziela · Antti Honkela
    ABSTRACT: Rapidly growing public gene expression databases contain a wealth of data for building an unprecedentedly detailed picture of human biology and disease. This data comes from many diverse measurement platforms that make integrating it all difficult. In this paper, we propose a new method for processing RNA-sequencing data that yields gene expression estimates that are much more similar to corresponding estimates from microarray data, hence greatly improving cross-platform comparability. The method we call PREBS is based on estimating the expression only from microarray probe regions, and processing these estimates with microarray summarisation algorithm RMA. This allows new ways of using RNA-sequencing data, such as expression estimation for microarray probe sets. Gene signatures defined based on PREBS expression measures of RNA-sequencing data are much more accurate for retrieval of similar microarray samples from a database.
    PLoS ONE 04/2013; 10(5). DOI:10.1371/journal.pone.0126545 · 3.23 Impact Factor
  • Source
    ABSTRACT: Gene transcription mediated by RNA polymerase II (pol-II) is a key step in gene expression. The dynamics of pol-II moving along the transcribed region influences the rate and timing of gene expression. In this work we present a probabilistic model of transcription dynamics which is fitted to pol-II occupancy time course data measured using ChIP-Seq. The model can be used to estimate transcription speed and to infer the temporal pol-II activity profile at the gene promoter. Model parameters are determined using either maximum likelihood estimation or via Bayesian inference using Markov chain Monte Carlo sampling. The Bayesian approach provides confidence intervals for parameter estimates and allows the use of priors that capture domain knowledge, e.g. the expected range of transcription speeds, based on previous experiments. The model describes the movement of pol-II down the gene body and can be used to identify the time of induction for transcriptionally engaged genes. By clustering the inferred promoter activity time profiles, we are able to determine which genes respond quickly to stimuli and group genes that share activity profiles and may therefore be co-regulated. We apply our methodology to biological data obtained using ChIP-seq to measure pol-II occupancy genome-wide when MCF-7 human breast cancer cells are treated with estradiol (E2). The transcription speeds we obtain agree with those obtained previously for smaller numbers of genes with the advantage that our approach can be applied genome-wide. We validate the biological significance of the pol-II promoter activity clusters by investigating cluster-specific transcription factor binding patterns and determining canonical pathway enrichment.
  • Antti Honkela · Magnus Rattray · Neil D Lawrence
    ABSTRACT: Reverse engineering the gene regulatory network is challenging because the amount of available data is very limited compared to the complexity of the underlying network. We present a technique addressing this problem through focussing on a more limited problem: inferring direct targets of a transcription factor from short expression time series. The method is based on combining Gaussian process priors and ordinary differential equation models allowing inference on limited potentially unevenly sampled data. The method is implemented as an R/Bioconductor package, and it is demonstrated by ranking candidate targets of the p53 tumour suppressor.
    Methods in molecular biology (Clifton, N.J.) 01/2013; 939:59-67. DOI:10.1007/978-1-62703-107-3_6 · 1.29 Impact Factor
  • Source
    ABSTRACT: RNA-Seq technology allows for studying the transcriptional state of the cell at an unprecedented level of detail. Beyond quantification of whole-gene expression, it is now possible to disentangle the abundance of individual alternatively spliced transcript isoforms of a gene. A central question is to understand the regulatory processes that lead to differences in relative abundance variation due to external and genetic factors. Here, we present a mixed model approach that allows for (i) joint analysis and genetic mapping of multiple transcript isoforms and (ii) mapping of isoform-specific effects. Central to our approach is to comprehensively model the causes of variation and correlation between transcript isoforms, including the genomic background and technical quantification uncertainty. As a result, our method allows accurate testing for shared as well as transcript-specific genetic regulation of transcript isoforms and achieves substantially improved calibration of these statistical tests. Experiments on genotype and RNA-Seq data from 126 human HapMap individuals demonstrate that our model can help to obtain a more fine-grained picture of the genetic basis of gene expression variation.
  • Source
    Hande Topa · Antti Honkela
    ABSTRACT: We present techniques for effective Gaussian process (GP) modelling of multiple short time series. These problems are common when applying GP models independently to each gene in a gene expression time series data set. Such sets typically contain very few time points. Naive application of common GP modelling techniques can lead to severe over-fitting or under-fitting in a significant fraction of the fitted models, depending on the details of the data set. We propose avoiding over-fitting by constraining the GP length-scale to values that focus most of the energy spectrum to frequencies below the Nyquist frequency corresponding to the sampling frequency in the data set. Under-fitting can be avoided by more informative priors on observation noise. Combining these methods allows applying GP methods reliably automatically to large numbers of independent instances of short time series. This is illustrated with experiments with both synthetic data and real gene expression data.
  • Source
    ABSTRACT: A software library for constructing and learning probabilistic models is presented. The library offers a set of building blocks from which a large variety of static and dynamic models can be built. These include hierarchical models for variances of other variables and many nonlinear models. The underlying variational Bayesian machinery, providing for fast and robust estimation but being mathematically rather involved, is almost completely hidden from the user thus making it very easy to use the library. The building blocks include Gaussian, rectified Gaussian and mixture-of-Gaussians variables and computational nodes which can be combined rather freely.
  • Source
    ABSTRACT: Background: Complete transcriptional regulatory network inference is a huge challenge because of the complexity of the network and sparsity of available data. One approach to make it more manageable is to focus on the inference of context-specific networks involving a few interacting transcription factors (TFs) and all of their target genes. Results: We present a computational framework for Bayesian statistical inference of target genes of multiple interacting TFs from high-throughput gene expression time-series data. We use ordinary differential equation models that describe transcription of target genes taking into account combinatorial regulation. The method consists of a training and a prediction phase. During the training phase we infer the unobserved TF protein concentrations on a subnetwork of approximately known regulatory structure. During the prediction phase we apply Bayesian model selection on a genome-wide scale and score all alternative regulatory structures for each target gene. We use our methodology to identify targets of five TFs regulating Drosophila melanogaster mesoderm development. We find that confident predicted links between TFs and targets are significantly enriched for supporting ChIP-chip binding events and annotated TF-gene interactions. Our method statistically significantly outperforms existing alternatives. Conclusions: Our results show that it is possible to infer regulatory links between multiple interacting TFs and their target genes even from a single relatively short time series and in presence of unmodelled confounders and unreliable prior knowledge on training network connectivity. Introducing data from several different experimental perturbations significantly increases the accuracy.
    BMC Systems Biology 05/2012; 6(1):53. DOI:10.1186/1752-0509-6-53 · 2.85 Impact Factor
  • Source
    Peter Glaus · Antti Honkela · Magnus Rattray
    ABSTRACT: High-throughput sequencing enables expression analysis at the level of individual transcripts. The analysis of transcriptome expression levels and differential expression (DE) estimation requires a probabilistic approach to properly account for ambiguity caused by shared exons and finite read sampling as well as the intrinsic biological variance of transcript expression. We present Bayesian inference of transcripts from sequencing data (BitSeq), a Bayesian approach for estimation of transcript expression level from RNA-seq experiments. Inferred relative expression is represented by Markov chain Monte Carlo samples from the posterior probability distribution of a generative model of the read data. We propose a novel method for DE analysis across replicates which propagates uncertainty from the sample-level model while modelling biological variance using an expression-level-dependent prior. We demonstrate the advantages of our method using simulated data as well as an RNA-seq dataset with technical and biological replication for both studied conditions. The implementation of the transcriptome expression estimation and differential expression analysis, BitSeq, has been written in C++ and Python. The software is available online from http://code.google.com/p/bitseq/, version 0.4 was used for generating results presented in this article.
    Bioinformatics 05/2012; 28(13):1721-8. DOI:10.1093/bioinformatics/bts260 · 4.62 Impact Factor
  • Source
    Janne Nikkila · Antti Honkela · Samuel Kaski
    ABSTRACT: We study the discovery of gene regulatory modules based on transcription factor (TF) binding data and expression data from gene knockouts. We invoke the natural assumption that regulatory modules predominantly operate independently, which makes it possible to apply a new method for extracting them: the Independent Variable Group Analysis. We demonstrate that i) the independence assumption helps in discovering the regulatory modules from TF data, and ii) the independent gene modules discovered from TF-data can be found also in expression data from gene knockouts. This demonstrates that the regulatory effects by transcription factors are observable in knockout experiments. It additionally suggests that the difficult interpretation of the knockout experiments could be eased by taking into account the independent regulatory modules.

Publication Stats

558 Citations
99.33 Total Impact Points

Institutions

  • 2012–2015
    • Helsinki Institute for Information Technology HIIT
      Helsinki, Uusimaa, Finland
  • 2001–2014
    • University of Helsinki
      • Department of Computer Science
      • Helsinki Institute for Information Technology HIIT
      • Institute for Molecular Medicine Finland (FIMM)
      Helsinki, Uusimaa, Finland
  • 2008–2010
    • Aalto University
      • Department of Information and Computer Science
      Helsinki, Province of Southern Finland, Finland
  • 2007
    • University of Turku
      Turku, Province of Western Finland, Finland
  • 2000
    • City of Espoo
      Esbo, Uusimaa, Finland