Antti Honkela

Helsinki Institute for Information Technology HIIT, Helsinki, Southern Finland Province, Finland

Publications (73) · 117.79 Total impact

  • Source
    ABSTRACT: Predicting the best treatment strategy from genomic information is a core goal of precision medicine. Here we focus on predicting drug response based on a cohort of genomic, epigenomic and proteomic profiling data sets measured in human breast cancer cell lines. Through a collaborative effort between the National Cancer Institute (NCI) and the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we analyzed a total of 44 drug sensitivity prediction algorithms. The top-performing approaches modeled nonlinear relationships and incorporated biological pathway information. We found that gene expression microarrays consistently provided the best predictive power of the individual profiling data sets; however, performance was increased by including multiple, independent data sets. We discuss the innovations underlying the top-performing methodology, Bayesian multitask MKL, and we provide detailed descriptions of all methods. This study establishes benchmarks for drug sensitivity prediction and identifies approaches that can be leveraged for the development of new methods.
    Nature Biotechnology 06/2014; · 32.44 Impact Factor
  • Source
    ABSTRACT: Gene transcription mediated by RNA polymerase II (pol-II) is a key step in gene expression. The dynamics of pol-II moving along the transcribed region influence the rate and timing of gene expression. In this work, we present a probabilistic model of transcription dynamics which is fitted to pol-II occupancy time course data measured using ChIP-Seq. The model can be used to estimate transcription speed and to infer the temporal pol-II activity profile at the gene promoter. Model parameters are estimated using either maximum likelihood estimation or via Bayesian inference using Markov chain Monte Carlo sampling. The Bayesian approach provides confidence intervals for parameter estimates and allows the use of priors that capture domain knowledge, e.g. the expected range of transcription speeds, based on previous experiments. The model describes the movement of pol-II down the gene body and can be used to identify the time of induction for transcriptionally engaged genes. By clustering the inferred promoter activity time profiles, we are able to determine which genes respond quickly to stimuli and group genes that share activity profiles and may therefore be co-regulated. We apply our methodology to biological data obtained using ChIP-seq to measure pol-II occupancy genome-wide when MCF-7 human breast cancer cells are treated with estradiol (E2). The transcription speeds we obtain agree with those obtained previously for smaller numbers of genes with the advantage that our approach can be applied genome-wide. We validate the biological significance of the pol-II promoter activity clusters by investigating cluster-specific transcription factor binding patterns and determining canonical pathway enrichment. We find that rapidly induced genes are enriched for both estrogen receptor alpha (ERα) and FOXA1 binding in their proximal promoter regions.
    PLoS Computational Biology 05/2014; 10(5):e1003598. · 4.87 Impact Factor
  • Source
    ABSTRACT: Motivation: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments use HTS not only to measure genomic features at one time point but also to monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analysing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth. Results: We present the beta-binomial Gaussian process (BBGP) model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present results on real data from a Drosophila experimental evolution experiment in temperature adaptation. Availability: R software implementing the test is available at https://github.com/handetopa/BBGP
    03/2014;
  • Source
    ABSTRACT: Over the recent years, the field of whole metagenome shotgun sequencing has witnessed significant growth due to the next generation sequencing technologies that allow sequencing genomic samples cheaper, faster, and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. In this paper, we develop a content-based retrieval method for whole metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples, and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome data sets and observe significant enrichment for diseased samples in results of queries with another diseased sample.
    Bioinformatics 08/2013; 30(17). · 5.32 Impact Factor
  • Source
    ABSTRACT: Motivation: The mapping of RNA-seq reads to their transcripts of origin is a fundamental task in transcript expression estimation and differential expression scoring. Where ambiguities in mapping exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem becomes an instance of non-trivial probabilistic inference. Bayesian inference in such a problem is intractable and approximate methods must be used such as Markov chain Monte Carlo (MCMC) and Variational Bayes. Standard implementations of these methods can be prohibitively slow for large datasets and complex gene models. Results: We propose an approximate inference scheme based on Variational Bayes applied to an existing model of transcript expression inference from RNA-seq data. We apply recent advances in Variational Bayes algorithmics to improve the convergence of the algorithm beyond the standard variational expectation-maximisation approach. We apply our algorithm to simulated and biological datasets, demonstrating that the increase in speed requires only a small trade-off in accuracy of expression level estimation. Availability: The methods were implemented in R and C++, and are available as part of the BitSeq project at https://code.google.com/p/bitseq/. The methods will be made available through the BitSeq Bioconductor package at the next stable release.
    08/2013;
  • Source
    ABSTRACT: Gene transcription mediated by RNA polymerase II (pol-II) is a key step in gene expression. The dynamics of pol-II moving along the transcribed region influences the rate and timing of gene expression. In this work we present a probabilistic model of transcription dynamics which is fitted to pol-II occupancy time course data measured using ChIP-Seq. The model can be used to estimate transcription speed and to infer the temporal pol-II activity profile at the gene promoter. Model parameters are determined using either maximum likelihood estimation or via Bayesian inference using Markov chain Monte Carlo sampling. The Bayesian approach provides confidence intervals for parameter estimates and allows the use of priors that capture domain knowledge, e.g. the expected range of transcription speeds, based on previous experiments. The model describes the movement of pol-II down the gene body and can be used to identify the time of induction for transcriptionally engaged genes. By clustering the inferred promoter activity time profiles, we are able to determine which genes respond quickly to stimuli and group genes that share activity profiles and may therefore be co-regulated. We apply our methodology to biological data obtained using ChIP-seq to measure pol-II occupancy genome-wide when MCF-7 human breast cancer cells are treated with estradiol (E2). The transcription speeds we obtain agree with those obtained previously for smaller numbers of genes with the advantage that our approach can be applied genome-wide. We validate the biological significance of the pol-II promoter activity clusters by investigating cluster-specific transcription factor binding patterns and determining canonical pathway enrichment.
    03/2013;
  • ABSTRACT: Reverse engineering the gene regulatory network is challenging because the amount of available data is very limited compared to the complexity of the underlying network. We present a technique addressing this problem by focussing on a more limited problem: inferring direct targets of a transcription factor from short expression time series. The method is based on combining Gaussian process priors and ordinary differential equation models, allowing inference on limited, potentially unevenly sampled data. The method is implemented as an R/Bioconductor package, and it is demonstrated by ranking candidate targets of the p53 tumour suppressor.
    Methods in molecular biology (Clifton, N.J.) 01/2013; 939:59-67. · 1.29 Impact Factor
  • Source
    ABSTRACT: RNA-Seq technology allows for studying the transcriptional state of the cell at an unprecedented level of detail. Beyond quantification of whole-gene expression, it is now possible to disentangle the abundance of individual alternatively spliced transcript isoforms of a gene. A central question is to understand the regulatory processes that lead to differences in relative abundance variation due to external and genetic factors. Here, we present a mixed model approach that allows for (i) joint analysis and genetic mapping of multiple transcript isoforms and (ii) mapping of isoform-specific effects. Central to our approach is to comprehensively model the causes of variation and correlation between transcript isoforms, including the genomic background and technical quantification uncertainty. As a result, our method allows to accurately test for shared as well as transcript-specific genetic regulation of transcript isoforms and achieves substantially improved calibration of these statistical tests. Experiments on genotype and RNA-Seq data from 126 human HapMap individuals demonstrate that our model can help to obtain a more fine-grained picture of the genetic basis of gene expression variation.
    10/2012;
  • Source
    Hande Topa, Antti Honkela
    ABSTRACT: We present techniques for effective Gaussian process (GP) modelling of multiple short time series. These problems are common when applying GP models independently to each gene in a gene expression time series data set. Such sets typically contain very few time points. Naive application of common GP modelling techniques can lead to severe over-fitting or under-fitting in a significant fraction of the fitted models, depending on the details of the data set. We propose avoiding over-fitting by constraining the GP length-scale to values that focus most of the energy spectrum to frequencies below the Nyquist frequency corresponding to the sampling frequency in the data set. Under-fitting can be avoided by more informative priors on observation noise. Combining these methods allows applying GP methods reliably automatically to large numbers of independent instances of short time series. This is illustrated with experiments with both synthetic data and real gene expression data.
    10/2012;
  • Source
    ABSTRACT: A software library for constructing and learning probabilistic models is presented. The library offers a set of building blocks from which a large variety of static and dynamic models can be built. These include hierarchical models for variances of other variables and many nonlinear models. The underlying variational Bayesian machinery, providing for fast and robust estimation but being mathematically rather involved, is almost completely hidden from the user thus making it very easy to use the library. The building blocks include Gaussian, rectified Gaussian and mixture-of-Gaussians variables and computational nodes which can be combined rather freely.
    07/2012;
  • Source
    ABSTRACT: BACKGROUND: Complete transcriptional regulatory network inference is a huge challenge because of the complexity of the network and sparsity of available data. One approach to make it more manageable is to focus on the inference of context-specific networks involving a few interacting transcription factors (TFs) and all of their target genes. RESULTS: We present a computational framework for Bayesian statistical inference of target genes of multiple interacting TFs from high-throughput gene expression time-series data. We use ordinary differential equation models that describe transcription of target genes taking into account combinatorial regulation. The method consists of a training and a prediction phase. During the training phase we infer the unobserved TF protein concentrations on a subnetwork of approximately known regulatory structure. During the prediction phase we apply Bayesian model selection on a genome-wide scale and score all alternative regulatory structures for each target gene. We use our methodology to identify targets of five TFs regulating Drosophila melanogaster mesoderm development. We find that confident predicted links between TFs and targets are significantly enriched for supporting ChIP-chip binding events and annotated TF-gene interactions. Our method statistically significantly outperforms existing alternatives. CONCLUSIONS: Our results show that it is possible to infer regulatory links between multiple interacting TFs and their target genes even from a single relatively short time series and in presence of unmodelled confounders and unreliable prior knowledge on training network connectivity. Introducing data from several different experimental perturbations significantly increases the accuracy.
    BMC Systems Biology 05/2012; 6(1):53. · 2.98 Impact Factor
  • Source
    Peter Glaus, Antti Honkela, Magnus Rattray
    ABSTRACT: High-throughput sequencing enables expression analysis at the level of individual transcripts. The analysis of transcriptome expression levels and differential expression (DE) estimation requires a probabilistic approach to properly account for ambiguity caused by shared exons and finite read sampling as well as the intrinsic biological variance of transcript expression. We present Bayesian inference of transcripts from sequencing data (BitSeq), a Bayesian approach for estimation of transcript expression level from RNA-seq experiments. Inferred relative expression is represented by Markov chain Monte Carlo samples from the posterior probability distribution of a generative model of the read data. We propose a novel method for DE analysis across replicates which propagates uncertainty from the sample-level model while modelling biological variance using an expression-level-dependent prior. We demonstrate the advantages of our method using simulated data as well as an RNA-seq dataset with technical and biological replication for both studied conditions. The implementation of the transcriptome expression estimation and differential expression analysis, BitSeq, has been written in C++ and Python. The software is available online from http://code.google.com/p/bitseq/, version 0.4 was used for generating results presented in this article.
    Bioinformatics 05/2012; 28(13):1721-8. · 5.47 Impact Factor
  • Handbook of Statistical Systems Biology, 09/2011: pages 376-394; ISBN: 9781119970606
  • Peter Glaus, Antti Honkela, Magnus Rattray
    ABSTRACT: Background / Purpose: High-throughput sequencing enables expression analysis at the level of individual transcripts. The analysis of transcriptome expression levels and differential expression estimation requires a probabilistic approach to properly account for the ambiguity caused by shared exons and finite read sampling as well as the intrinsic biological variance of transcript expression. Another important factor is the biological sources of variance which, as we show in our analysis, can be substantial and may be dependent on the transcript expression level. To avoid false positive differential expression calls, one has to anticipate the intrinsic variance of the transcript expression levels using empirical prior knowledge and information from replicates where they exist. We present a Bayesian approach to estimate transcript expression levels from RNA-seq experiments. Inferred relative expression is in the form of a probability distribution represented by samples of the distribution obtained from a Markov chain Monte Carlo inference method applied to a generative model of the read data. Additionally, by implementing the regular Gibbs sampling algorithm, we provide a comparison with Collapsed Gibbs sampling in which some of the parameters are marginalized in order to obtain faster convergence. Main conclusion: We propose a novel method for differential expression analysis across replicates which propagates uncertainty from the sample-level model while modeling biological variance using an expression-level dependent prior. We demonstrate the advantages of our method using an RNA-seq dataset (Xu G. et al. 2010) with technical and biological replication for both studied conditions.
    ISMB/ECCB 2011; 09/2011
  • ABSTRACT: tigre is an R/Bioconductor package for inference of transcription factor activity and ranking candidate target genes from gene expression time series. The underlying methodology is based on Gaussian process inference on a differential equation model that allows the use of short, unevenly sampled, time series. The method has been designed with efficient parallel implementation in mind, and the package supports parallel operation even without additional software. AVAILABILITY: The tigre package has been included in Bioconductor since release 2.6 for R 2.11. The package and a user's guide are available at http://www.bioconductor.org.
    Bioinformatics 02/2011; 27(7):1026-7. · 5.47 Impact Factor
  • ABSTRACT: Missing-feature reconstruction can improve speech recognition performance in unknown noisy environments. In this work, we examine using a nonlinear state-space model (NSSM) for missing-feature reconstruction and propose estimation with observed bounds to improve the NSSM performance. Evaluated in a large-vocabulary continuous speech recognition task with babble and impulsive noise, using observed bounds in NSSM state estimation significantly improved the method's performance.
    IEEE Signal Processing Letters 01/2011; 18:563-566. · 1.67 Impact Factor

Publication Stats

419 Citations
117.79 Total Impact Points

Institutions

  • 2012–2014
    • Helsinki Institute for Information Technology HIIT
      Helsinki, Southern Finland Province, Finland
  • 2008–2012
    • The University of Manchester
      • School of Computer Science
      Manchester, ENG, United Kingdom
    • University of Turku
      • Department of Information Technology
      Turku, Western Finland, Finland
  • 2002–2011
    • University of Helsinki
      • Helsinki Institute for Information Technology HIIT
      Helsinki, Province of Southern Finland, Finland
  • 2010
    • Aalto University
      • Department of Information and Computer Science
      Helsinki, Province of Southern Finland, Finland