Julien Gagneur
Technische Universität München | TUM · Chair of Bioinformatics

PhD

269 Publications
30,763 Reads
9,619 Citations
Additional affiliations
July 2012 - December 2015
Ludwig-Maximilians-University of Munich
Position
  • Group Leader

Publications (269)
Preprint
Solve-RD is a pan-European rare disease (RD) research program that aims to identify disease-causing genetic variants in previously undiagnosed RD families. We utilized 10-fold coverage HiFi long-read sequencing (LRS) for detecting causative structural variants (SVs), single nucleotide variants (SNVs), insertion-deletions (InDels), and short tandem...
Article
Full-text available
Background: The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. Results: Here, we introduce species-aware DNA language m...
Article
Codon optimality is a major determinant of mRNA translation and degradation rates. However, whether and through which mechanisms its effects are regulated remains poorly understood. Here we show that codon optimality associates with up to 2-fold change in mRNA stability variations between human tissues, and that its effect is attenuated in tissues...
Article
Full-text available
The SARS-CoV-2 pandemic has highlighted the need to better define in-hospital transmissions, a need that extends to all other common infectious diseases encountered in clinical settings. To evaluate how whole viral genome sequencing can contribute to deciphering nosocomial SARS-CoV-2 transmission, 926 SARS-CoV-2 viral genomes from 622 staff members...
Article
Full-text available
Background: The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from g...
Article
Full-text available
Unlike for DNA and RNA, accurate and high-throughput sequencing methods for proteins are lacking, hindering the utility of proteomics in applications where the sequences are unknown including variant calling, neoepitope identification, and metaproteomics. We introduce Spectralis, a de novo peptide sequencing method for tandem mass spectrometry. Spe...
Article
Full-text available
Single-cell ATAC sequencing coverage in regulatory regions is typically binarized as an indicator of open chromatin. Here we show that binarization is an unnecessary step that neither improves goodness of fit, clustering, cell type identification nor batch integration. Fragment counts, but not read counts, should instead be modeled, which preserves...
Preprint
Full-text available
Despite the frequent implication of aberrant gene expression in diseases, algorithms predicting aberrantly expressed genes of an individual are lacking. To address this need, we compiled an aberrant expression prediction benchmark covering 8.2 million rare variants from 633 individuals across 48 tissues. While not geared toward aberrant expression,...
Article
Full-text available
Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms on various data types including quantitative omics measurements, protein–protein interaction network...
Article
Detection of aberrantly spliced genes is an important step in RNA-seq-based rare-disease diagnostics. We recently developed FRASER, a denoising autoencoder-based method that outperformed alternative methods of detecting aberrant splicing. However, because FRASER’s three splice metrics are partially redundant and tend to be sensitive to sequencing d...
Preprint
Full-text available
Rare genetic diseases often pose significant challenges for diagnosis. Over the past years, RNA sequencing and other omics modalities have emerged as complementary strategies to DNA sequencing to enhance diagnostic success. In the 6th round of the Critical Assessment of Genome Interpretation (CAGI), the SickKids clinical genomes and transcriptomes...
Preprint
Full-text available
The high heritability of ALS and similar rare diseases contrasts with their low molecular diagnosis rate post-genetic testing, pointing to potential undiscovered genetic factors. DNA accessibility assays quantify the activity of functional elements genome-wide, offering invaluable insights into dysregulated regions. In this research, we introduced...
Preprint
Full-text available
Background: Rare oncogenic driver events, particularly affecting the expression or splicing of driver genes, are suspected to substantially contribute to the large heterogeneity of hematologic malignancies. However, their identification remains challenging. Methods: To address this issue, we generated the largest dataset to date of matched whole gen...
Preprint
Full-text available
Background: The SARS-CoV-2 pandemic has highlighted the need to better define in-hospital transmissions, a need that extends to all other common infectious diseases encountered in clinical settings. Objectives: To evaluate how whole viral genome sequencing can contribute to deciphering nosocomial SARS-CoV-2 transmission. Methods: 926 SARS-CoV-2 vira...
Article
Full-text available
We present RBPNet, a novel deep learning method, which predicts CLIP-seq crosslink count distribution from RNA sequence at single-nucleotide resolution. By training on up to a million regions, RBPNet achieves high generalization on eCLIP, iCLIP and miCLIP assays, outperforming state-of-the-art classifiers. RBPNet performs bias correction by modelin...
Preprint
Full-text available
Protein homeostasis is disrupted in aging and neurodegenerative diseases, yet, the specific impact of aging on brain proteostasis remains poorly understood. Here, we measured and integrated the effects of aging on the transcriptome, translatome, and multiple layers of the proteome in the brain of a short-lived killifish. We find that aging causes a...
Preprint
Full-text available
Genome-wide association studies have unearthed a wealth of genetic associations across many complex diseases. However, translating these associations into biological mechanisms contributing to disease etiology and heterogeneity has been challenging. Here, we hypothesize that the effects of disease-associated genetic variants converge onto distinct...
Article
Full-text available
Aberrant splicing is a major cause of genetic disorders but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models can prioritize rare variants for affecting splicing, their performance in predicting tissue-specific aberrant splicing remains unassessed....
Preprint
Full-text available
Codon optimality is a major determinant of mRNA translation and degradation rates. However, whether and through which mechanisms its effects are regulated remains poorly understood. Here we show that codon optimality associates with up to 2-fold change in mRNA stability variations between human tissues, and that its effect is attenuated in tissues...
Preprint
Full-text available
Detection of aberrantly spliced genes is an important step in RNA-seq-based rare disease diagnostics. We recently developed FRASER, a denoising autoencoder-based method for aberrant splicing detection that outperformed alternative approaches. However, as FRASER's three splice metrics are partially redundant and tend to be sensitive to sequencing de...
Article
Full-text available
Background: The largest sequence-based models of transcription control to date are obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative, as those models are exposed during training solely to the sequence variation between human genes that arose through evolution, questioning the...
Preprint
Full-text available
Motivation: Predicting gene expression from DNA is an open field of research. As in many areas, labeled data is dwarfed by unlabelled data, i.e. species with a sequenced genome but no gene expression assay data. Pretraining on unlabelled data using masked language modeling has proven highly successful in overcoming data constraints in natural langu...
Article
Full-text available
Motivation: Identifying regulatory regions in the genome is of great interest for understanding the epigenomic landscape in cells. One fundamental challenge in this context is to find the target genes whose expression is affected by the regulatory regions. A recent successful method is the Activity-By-Contact (ABC) model (Fulco et al., 2019) which...
Article
Full-text available
Transposon screens are powerful in vivo assays used to identify loci driving carcinogenesis. These loci are identified as Common Insertion Sites (CISs), i.e. regions with more transposon insertions than expected by chance. However, the identification of CISs is affected by biases in the insertion behaviour of transposon systems. Here, we introduce...
Preprint
Full-text available
Unlike for DNA and RNA, accurate and high-throughput sequencing methods for proteins are lacking, hindering the utility of proteomics in applications where the sequences are unknown including variant calling, neoepitope identification, and metaproteomics. We introduce Spectralis, a new de novo peptide sequencing method for tandem mass spectrometry....
Article
Full-text available
RNA sequencing (RNA-seq) is emerging in genetic diagnoses as it provides functional support for the interpretation of variants of uncertain significance. However, the use of amniotic fluid (AF) cells for RNA-seq has not yet been explored. Here, we examined the expression of clinically relevant genes in AF cells (n = 48) compared with whole blood a...
Preprint
Cardiac resident macrophages (crMPs) were recently shown to exert pivotal functions in cardiac homeostasis and disease, but the underlying molecular mechanisms are largely unclear. Long non-coding RNAs (lncRNAs) are increasingly recognized as important regulatory molecules in a number of cell types, but neither the identity nor the molecular mechan...
Article
Full-text available
Mitochondrial translation defects are a continuously growing group of disorders showing a large variety of clinical symptoms including a wide range of neurological abnormalities. To date, mutations in PTCD3, encoding a component of the mitochondrial ribosome, have only been reported in a single individual with clinical evidence of Leigh syndrome. H...
Article
Full-text available
Peroxisomal biogenesis disorders (PBDs) are a heterogeneous group of genetic diseases. Multiple peroxisomal pathways are impaired, and very long chain fatty acids (VLCFA) are the first line biomarkers for the diagnosis. The clinical presentation of PBDs may range from severe, lethal multisystemic disorders to milder, late-onset disease. The vast ma...
Preprint
Unraveling sequence determinants which drive protein-RNA interaction is crucial for studying binding mechanisms and the impact of genomic variants. While CLIP-seq allows for transcriptome-wide profiling of in vivo protein-RNA interactions, it is limited to expressed transcripts, requiring computational imputation of missing binding inform...
Preprint
Full-text available
Background: The largest sequence-based models of transcription control to date have been obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative, as those models are exposed during training solely to the sequence variation between human genes that arose through evolution, question...
Preprint
Aberrant splicing is a major cause of genetic disorders but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models allow prioritizing rare variants for affecting splicing, their performance on predicting tissue-specific aberrant splicing remains unasses...
Article
Full-text available
The accuracy of methods for assembling transcripts from short-read RNA sequencing data is limited by the lack of long-range information. Here we introduce Ladder-seq, an approach that separates transcripts according to their lengths before sequencing and uses the additional information to improve the quantification and assembly of transcripts. Usin...
Article
Full-text available
The majority of risk loci identified by genome-wide association studies (GWAS) are in non-coding regions, hampering their functional interpretation. Instead, transcriptome-wide association studies (TWAS) identify gene-trait associations, which can be used to prioritize candidate genes in disease-relevant tissue(s). Here, we aimed to systematically...
Preprint
Full-text available
Transcriptome-wide association studies (TWAS) explore genetic variants affecting gene expression for association with a trait. Here we studied coronary artery disease (CAD) using this approach by first determining genotype-regulated expression levels in nine CAD relevant tissues by EpiXcan in two genetics-of-gene-expression panels, the Stockholm-Ta...
Preprint
Full-text available
Background: The spectrum of mitochondrial disease is genetically and phenotypically diverse, resulting from pathogenic variants in over 400 genes, with aerobic energy metabolism defects as a common denominator. Such heterogeneity poses a significant challenge in making an accurate diagnosis, critical for precision medicine. Methods: In an internati...
Article
Full-text available
The 5’ untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5’UTR regulatory sequences shall provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict me...
Article
Full-text available
The SARS-CoV-2 virus is the causative agent of the global COVID-19 infectious disease outbreak, which can lead to acute respiratory distress syndrome (ARDS). However, it is still unclear how the virus interferes with immune cell and metabolic functions in the human body. In this study, we investigated the immune response in acute or convalescent CO...
Article
Full-text available
An amendment to this paper has been published and can be accessed via the original article.
Preprint
Full-text available
Background: Lack of functional evidence hampers variant interpretation, leaving a large proportion of cases with a suspected Mendelian disorder without genetic diagnosis after genome or whole exome sequencing (WES). Research studies advocate to further sequence transcriptomes to directly and systematically probe gene expression defects. However, col...
Article
Full-text available
We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice...
Preprint
Full-text available
For lack of functional evidence, genome-based diagnostic rates cap at approximately 50% across diverse Mendelian diseases. Here we demonstrate the effectiveness of combining genomics, transcriptomics, and, for the first time, proteomics and phenotypic descriptors, in a systematic diagnostic approach to discover the genetic cause of mitoch...
Article
Full-text available
The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)–nexus binding profiles of pluripotency TFs. We develop interpretation tools to le...
Article
Full-text available
Aberrant splicing is a major cause of rare diseases. However, its prediction from genome sequence alone remains in most cases inconclusive. Recently, RNA sequencing has proven to be an effective complementary avenue to detect aberrant splicing. Here, we develop FRASER, an algorithm to detect aberrant splicing from RNA sequencing data. Unlike existi...
Article
RNA sequencing (RNA-seq) has emerged as a powerful approach to discover disease-causing gene regulatory defects in individuals affected by genetically undiagnosed rare disorders. Pioneering studies have shown that RNA-seq could increase the diagnosis rates over DNA sequencing alone by 8–36%, depending on the disease entity and tissue probed. To acc...
Article
Background: Transcriptome sequencing (RNA-seq) improves diagnostic rates in individuals with suspected Mendelian conditions to varying degrees, primarily by directing the prioritization of candidate DNA variants identified on exome or genome sequencing (ES/GS). Here we implemented an RNA-seq guided method to diagnose individuals across a wide rang...
Preprint
Full-text available
The 5' untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5'UTR regulatory sequences shall provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict me...
Preprint
Full-text available
Tissue-specific splicing of exons plays an important role in determining tissue identity. However, computational tools predicting tissue-specific effects of variants on splicing are lacking. To address this issue, we developed MTSplice (Multi-tissue Splicing), a neural network which quantitatively predicts effects of human genetic variants on splic...
Preprint
Full-text available
Aberrant splicing is a major cause of rare diseases, yet its prediction from genome sequence remains in most cases inconclusive. Recently, RNA sequencing has proven to be an effective complementary avenue to detect aberrant splicing. Here, we developed FRASER, an algorithm to detect aberrant splicing from RNA sequencing data. Unlike existing method...
Preprint
RNA sequencing (RNA-seq) has emerged as a powerful approach to discover disease-causing gene regulatory defects for individuals affected with a genetically undiagnosed rare disorder. Pioneer studies have shown that RNA-seq could increase diagnostic rates over DNA sequencing alone by 8% to 36%, depending on disease entities and probed tissues. To ac...
Article
Gene transcription in eukaryotes is regulated through dynamic interactions of a variety of different proteins with DNA in the context of chromatin. Here, we used mass spectrometry for absolute quantification of the nuclear proteome and methyl marks on selected lysine residues in histone H3 during two stages of Drosophila embryogenesis. These analys...
Preprint
Full-text available
The arrangement of transcription factor (TF) binding motifs (syntax) is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution ChIP-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representat...
Article
Precision medicine and sequence‐based clinical diagnostics seek to predict disease risk or to identify causative variants from sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype–phenotype prediction challenges; participants build models, undergo assessment, and share key finding...
Preprint
In the accompanying chapter (Gressel, Lidschreiber, Cramer), we describe the detailed experimental protocol for transient transcriptome sequencing (TT-seq). TT-seq detects metabolically labeled, newly synthesized RNA fragments genome-wide in living cells. TT-seq can monitor gene activity and the dynamics of enhancer landscapes with great sensitivit...
Article
Pathogenic genetic variants often act primarily by affecting splicing. However, it remains difficult to quantitatively predict whether and how genetic variants affect splicing. In 2018, the fifth edition of the Critical Assessment of Genome Interpretation (CAGI 5) proposed two splicing prediction challenges based on experimental perturbation assays: V...
Article
Full-text available
RNA splicing is an essential part of eukaryotic gene expression. Although the mechanism of splicing has been extensively studied in vitro, in vivo kinetics for the two-step splicing reaction remain poorly understood. Here we combine transient transcriptome sequencing (TT-seq) and mathematical modeling to quantify RNA metabolic rates at donor and ac...
Article
As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning...
Article
Full-text available
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct la...
Article
Full-text available
Genome‐, transcriptome‐ and proteome‐wide measurements provide insights into how biological systems are regulated. However, fundamental aspects relating to which human proteins exist, where they are expressed and in which quantities are not fully understood. Therefore, we generated a quantitative proteome and transcriptome abundance atlas of 29 pai...
Article
Full-text available
Despite their importance in determining protein abundance, a comprehensive catalogue of sequence features controlling protein‐to‐mRNA (PTR) ratios and a quantification of their effects are still lacking. Here, we quantified PTR ratios for 11,575 proteins across 29 human tissues using matched transcriptomes and proteomes. We estimated by regression...
Article
Full-text available
In hyper-IgE syndromes (HIES), a group of primary immunodeficiencies clinically overlapping with atopic dermatitis, early diagnosis is crucial to initiate appropriate therapy and prevent irreversible complications. Identification of underlying gene defects such as in DOCK8 and STAT3 and corresponding molecular testing has improved diagnosis. Yet, i...
Article
RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack ass...
Preprint
Full-text available
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI 2018 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinc...
Preprint
Full-text available
Advanced machine learning models applied to large-scale genomics datasets hold the promise to be major drivers for genome science. Once trained, such models can serve as a tool to probe the relationships between data modalities, including the effect of genetic variants on phenotype. However, lack of standardization and limited accessibility of trai...
Article
Full-text available
The accurate quantification of cellular and mitochondrial bioenergetic activity is of great interest in medicine and biology. Mitochondrial stress tests performed with Seahorse Bioscience XF Analyzers allow the estimation of different bioenergetic measures by monitoring the oxygen consumption rates (OCR) of living cells in multi-well plates. Howeve...
Data
Coefficient of variation within and between plates. Coefficient of variation computed as standard deviation/mean of OCR within and between plates using the controls NHDF only, in each time interval. (TXT)
Data
OCR raw data from all experiments. OCR and cell number raw data for each of the 126 plates and 203 samples across all the 12 time points and 4 treatments. (TXT)
