Figure - available from: Genome Biology
Individual modules of MMSplice and their combination to predict the effect of genetic variants on various splicing quantities. (a) MMSplice consists of six modules scoring sequences from donor, acceptor, exon, and intron sites. Modules were trained with rich genomics datasets probing the corresponding regulatory regions. (b) Modules from (a) are combined with a linear model to score variant effects on exon skipping (ΔΨ), alternative donor (ΔΨ3), alternative acceptor site (ΔΨ5), or splicing efficiency, and they are combined with a logistic regression model to predict variant pathogenicity. La and Ld stand for the length of intron sequence taken from the acceptor and donor side, respectively.
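
The combination step described in the caption can be sketched roughly as follows. This is a minimal illustration, not the MMSplice implementation: the module names, the weights, and both function names are hypothetical placeholders, and the fitted coefficients of the real linear and logistic models are not given on this page.

```python
import math

# Hypothetical module names and weights, for illustration only.
# They are NOT the fitted MMSplice coefficients.
WEIGHTS = {
    "donor": 1.0,
    "acceptor": 1.0,
    "exon_5p": 0.8,
    "exon_3p": 0.8,
    "intron_donor_side": 0.5,
    "intron_acceptor_side": 0.5,
}


def combine_modules(ref_scores, alt_scores, weights=WEIGHTS, intercept=0.0):
    """Linear model over per-module score differences (alternative minus reference)."""
    return intercept + sum(
        w * (alt_scores[name] - ref_scores[name]) for name, w in weights.items()
    )


def pathogenicity_probability(features, coefs, intercept=0.0):
    """Logistic regression: sigmoid of a weighted sum of features."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))


# A variant that only lowers the (hypothetical) donor module score.
ref = {name: 0.0 for name in WEIGHTS}
alt = dict(ref, donor=-2.0)
score = combine_modules(ref, alt)
print(score, pathogenicity_probability([score], [1.0]))
```

Under these assumptions, a variant that changes only one module score shifts the combined score by that module's weight times the score difference, which is what makes a linear combination of this kind easy to inspect.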

Source publication
Article
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct la...

Citations

... For AS, cis-acting factors (sequences or structures within the pre-mRNA) can regulate splicing diversity. The logic of cis regulation has been analyzed to identify features that give rise to AS [13,14,15,16,17]. But for differential AS, features within the nascent transcript are not sufficient, since the transcript is the same in all cell types. ...
... Previous work has found a role of sequence features in distinguishing AS events from constitutive splicing [13,14,15,16,17]. Do sequence features also affect differential AS? ...
Preprint
Alternative splicing is a key mechanism that shapes neuronal transcriptomes, helping to define neuronal identity and modulate function. Here, we present an atlas of alternative splicing across the nervous system of Caenorhabditis elegans. Our analysis identifies novel alternative splicing in key neuronal genes such as unc-40/DCC and sax-3/ROBO. Globally, we delineate patterns of differential alternative splicing in almost 2,000 genes, and estimate that a quarter of neuronal genes undergo differential splicing. We introduce a web interface for examination of splicing patterns across neuron types. We explore the relationship between neuron type and splicing patterns, and between splicing patterns and differential gene expression. We identify RNA features that correlate with differential alternative splicing, and describe the enrichment of microexons. Finally, we compute a splicing regulatory network that can be used to generate hypotheses on the regulation and targets of alternative splicing in neurons.
... SS motifs have also been explored experimentally, including a recent large-scale screen that measured the activity of all possible 5′SS sequences in three different exonic contexts (20). Various computational models focused on determining SS strength and classifying short sequences as 5′ or 3′SSs were among the earliest models related to splicing (21-23). ...
Article
Pre-mRNA splicing is a fundamental step in gene expression, conserved across eukaryotes, in which the spliceosome recognizes motifs at the 3′ and 5′ splice sites (SSs), excises introns, and ligates exons. SS recognition and pairing is often influenced by protein splicing factors (SFs) that bind to splicing regulatory elements (SREs). Here, we describe SMsplice, a fully interpretable model of pre-mRNA splicing that combines models of core SS motifs, SREs, and exonic and intronic length preferences. We learn models that predict SS locations with 83 to 86% accuracy in fish, insects, and plants and about 70% in mammals. Learned SRE motifs include both known SF binding motifs and unfamiliar motifs, and both motif classes are supported by genetic analyses. Our comparisons across species highlight similarities between non-mammals, increased reliance on intronic SREs in plant splicing, and a greater reliance on SREs in mammalian splicing.
... Splicing-based analysis of variants (SPANR) [3] is a Bayesian network model for predicting the percent spliced in (PSI or Ψ) of alternatively spliced exons. More recently, deep learning models like MMSplice [9], SpliceAI [10] and Pangolin [11] employed deep convolutional neural networks to predict alternative splicing events, splice sites or splice site usage. These methods achieved superior performance as compared to earlier studies and have been widely utilized to analyze aberrant RNA splicing events caused by genetic variants. ...
Article
Language models pretrained by self-supervised learning (SSL) have been widely utilized to study protein sequences, while few models were developed for genomic sequences and were limited to single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlighted the importance of pretraining genomic language models on a diverse range of species and suggested that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.
... We retrieved eCLIP data for RBP binding sites in different cell lines (K562 and HepG2) from the ENCODE project to assess the number of RBPs. In addition, we curated 12 features (MMSplice scores) from the available MMSplice implementation (Cheng et al. 2019). These features included the percent spliced-in PSI (ψ), the splicing efficiency, which was proposed to quantify the amount of precursor RNA that underwent splicing, the predicted impacts on recognition of the acceptors and donors, and the predicted impacts on the splicing of the intron and exon regions. ...
Article
Genetic diseases are mostly associated with genetic variants, including missense, synonymous, nonsense, and copy number variants. Previous studies indicate that these different kinds of variants affect phenotypes in various ways. It remains essential but challenging to understand the functional consequences of these genetic variants, especially the noncoding ones, due to the lack of corresponding annotations. While many computational methods have been proposed to identify risk variants, most of them have only curated DNA-level and protein-level annotations to predict the pathogenicity of the variants, and others have been restricted to missense variants exclusively. In this study, we have curated DNA-, RNA-, and protein-level features to discriminate disease-causing variants in both coding and noncoding regions, where the features of protein sequences and protein structures have been shown to be essential for analyzing missense variants in coding regions, while the features related to RNA splicing and RBP binding are significant for variants in noncoding regions and synonymous variants in coding regions. Through the integration of these features, we have formulated the Multi-level feature Genomic Variants Predictor (ML-GVP) using the gradient boosting tree. The method has been trained on more than 400,000 variants in the Sherloc-training set from the 6th Critical Assessment of Genome Interpretation with superior performance. The method is one of the two best-performing predictors on the blind test in the Sherloc assessment, and is further confirmed by another independent test dataset of de novo variants.
... Many other splicing-related pathogenicity predictors have been published. These tools typically leverage machine learning strategies, train directly on pathogenicity classifications (often from ClinVar, information Oncosplice infers without training) to predict consequences based on a priori knowledge of pathogenicity, provide limited to no mechanistic insight, and are often constrained to specific mutation types (synonymous SNVs, missense SNVs) and regions (intronic, splice site, splice region) [35-41]. A tabular description of these tools is available in Table 2. ...
Article
Cancer research has long relied on non-silent mutations. Yet, it has become overwhelmingly clear that silent mutations can affect gene expression and cancer cell fitness. One fundamental mechanism that apparently silent mutations can severely disrupt is alternative splicing. Here we introduce Oncosplice, a tool that scores mutations based on models of proteomes generated using aberrant splicing predictions. Oncosplice leverages a highly accurate neural network that predicts splice sites within arbitrary mRNA sequences, a greedy transcript constructor that considers alternate arrangements of splicing blueprints, and an algorithm that grades the functional divergence between proteins based on evolutionary conservation. By applying this tool to 12M somatic mutations we identify 8K deleterious variants that are significantly depleted within the healthy population; we demonstrate the tool’s ability to identify clinically validated pathogenic variants with a positive predictive value of 94%; we show strong enrichment of predicted deleterious mutations across pan-cancer drivers. We also achieve improved patient survival estimation using a proposed set of novel cancer-involved genes. Ultimately, this pipeline enables accelerated insight-gathering of sequence-specific consequences for a class of understudied mutations and provides an efficient way of filtering through massive variant datasets – functionalities with immediate experimental and clinical applications.
... Both c.1311C>T and c.1365-13T>C were predicted to have no splicing or deleterious effect on the G6PD gene [50][51][52]. While c.1365-13T>C was found to be benign, the c.1311C>T showed conflicting results and the clinical interpretations of pathogenicity were still uncertain on the ClinVar database (https:// www. ...
... U/gHb). This variant was neutral based on splicing and functional predictions [50][51][52]. While the c.486-34delT variant was described as benign by many clinical testing groups it was found to be associated with G6PD deficiency in unrelated hemizygotes on the ClinVar database. ...
Article
Background: It was hypothesized that glucose-6-phosphate dehydrogenase (G6PD) deficiency confers a protective effect against malaria infection; however, safety concerns have been raised regarding haemolytic toxicity caused by radical cure with 8-aminoquinolines in G6PD-deficient individuals. Malaria elimination and control are also complicated by the high prevalence of G6PD deficiency in malaria-endemic areas. Hence, accurate identification of G6PD deficiency is required to identify those who are eligible for malaria treatment using 8-aminoquinolines. Methods: The prevalence of G6PD deficiency among 408 Thai participants diagnosed with malaria by microscopy (71), and malaria-negative controls (337), was assessed using a phenotypic test based on water-soluble tetrazolium salts. High-resolution melting (HRM) curve analysis was developed from a previous study to enable the detection of 15 common missense, synonymous and intronic G6PD mutations in Asian populations. The identified mutations were subjected to biochemical and structural characterisation to understand the molecular mechanisms underlying enzyme deficiency. Results: Based on phenotypic testing, the prevalence of G6PD deficiency (< 30% activity) was 6.13% (25/408) and intermediate deficiency (30–70% activity) was found in 15.20% (62/408) of participants. Several G6PD genotypes with newly discovered double missense variants were identified by HRM assays, including G6PD Gaohe + Viangchan, G6PD Valladolid + Viangchan and G6PD Canton + Viangchan. A significantly high frequency of synonymous (c.1311C>T) and intronic (c.1365-13T>C and c.486-34delT) mutations was detected with intermediate to normal enzyme activity. The double missense mutations were less catalytically active than their corresponding single missense mutations, resulting in severe enzyme deficiency. While the mutations had a minor effect on binding affinity, structural instability was a key contributor to the enzyme deficiency observed in G6PD-deficient individuals. Conclusions: With varying degrees of enzyme deficiency, G6PD genotyping can be used as a complement to phenotypic screening to identify those who are eligible for 8-aminoquinolines. The information gained from this study could be useful for management and treatment of malaria, as well as for the prevention of unanticipated reactions to certain medications and foods in the studied population.
... These variants are predicted to result in the creation of donor (19 SNVs) or acceptor (7 SNVs) splicing sites with a SpliceAI delta score range of 0.32-0.98. Notably, all 25 deep intronic variants were detected through RNA analysis with subsequent sequencing of the target intronic region of the patient's DNA. ...
... The effects of SNVs on pre-mRNA splicing were predicted using the SpliceAI (Illumina, USA) (23), MaxEntScan (MIT, USA) (96), and Human Splicing Finder (Aix Marseille Université, France) (97) tools. Distinct variants were also evaluated using the MMSplice (98) and SPiP (99) tools. Variants were annotated using the wANNOVAR tool (Wang Genomics Lab, USA) (100). ...
Preprint
Background: Pathogenic variants in the dystrophin (DMD) gene lead to X-linked recessive Duchenne muscular dystrophy (DMD) and Becker muscular dystrophy (BMD). Nucleotide variants that affect splicing are a known cause of hereditary diseases. However, their representation in the public genomic variation databases is limited due to the low accuracy of their interpretation, especially if they are located within exons. The analysis of splicing variants in the DMD gene is essential both for understanding the underlying molecular mechanisms of the dystrophinopathies' pathogenesis and selecting suitable therapies for patients. Results: Using deep in silico mutagenesis of the entire DMD gene sequence and subsequent SpliceAI splicing predictions, we identified 7,948 DMD single nucleotide variants that could potentially affect splicing, 863 of them were located in exons. Next, we analyzed over 1,300 disease-associated DMD SNVs previously reported in the literature (373 exonic and 956 intronic) and intersected them with SpliceAI predictions. We predicted that ~95% of the intronic and ~10% of the exonic reported variants could actually affect splicing. Interestingly, the majority (75%) of patient-derived intronic variants were located in the AG-GT terminal dinucleotides of the introns, while these positions accounted for only 13% of all intronic variants predicted in silico. Of the 97 potentially spliceogenic exonic variants previously reported in patients with dystrophinopathy, we selected 38 for experimental validation. For this, we developed and tested a minigene expression system encompassing 27 DMD exons. The results showed that 35 (19 missense, 9 synonymous, and 7 nonsense) of the 38 DMD exonic variants tested actually disrupted splicing. We compared the observed consequences of splicing changes between variants leading to severe Duchenne and milder Becker muscular dystrophy and showed a significant difference in their distribution. 
This finding provides extended insights into relations between molecular consequences of splicing variants and the clinical features. Conclusions: Our comprehensive bioinformatics analysis, combined with experimental validation, improves the interpretation of splicing variants in the DMD gene. The new insights into the molecular mechanisms of pathogenicity of exonic single nucleotide variants contribute to a better understanding of the clinical features observed in patients with Duchenne and Becker muscular dystrophy.
... More recently, some attention has returned to the question of predicting splicing from raw sequence using neural network methods that have been highly successful at predictive tasks in many fields [16][17][18]. Rather than curating features by hand, these machine learning methods have feature selection built into their training process, albeit in a manner that makes extracting and interpreting these features extremely difficult. For instance, SpliceAI, a deep convolutional neural network (CNN), can produce extremely high end-to-end accuracy in predicting splice sites in the human genome using up to 10,000 bases of sequence context. ...
Article
Sequence-specific RNA-binding proteins (RBPs) play central roles in splicing decisions. Here, we describe a modular splicing architecture that leverages in vitro-derived RNA affinity models for 79 human RBPs and the annotated human genome to produce improved models of RBP binding and activity. Binding and activity are modeled by separate Motif and Aggregator components that can be mixed and matched, enforcing sparsity to improve interpretability. Training a new Adjusted Motif (AM) architecture on the splicing task not only yields better splicing predictions but also improves prediction of RBP-binding sites in vivo and of splicing activity, assessed using independent data. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-03162-x.
... On the other hand, some mutations are pathogenic at the RNA level through splicing alterations, and these mutations are often located in the splicing donor, acceptor, and intronic regions. Consequently, splicing alterations have also been considered for the pathogenicity evaluation of mutations [18,19]. However, in clinical practice of genetic diagnosis using WES, different types of mutations and mechanisms should be considered simultaneously to identify the pathogenic mutation. ...
Article
Identifying pathogenic variants from the vast majority of nucleotide variation remains a challenge. We present a method named Multimodal Annotation Generated Pathogenic Impact Evaluator (MAGPIE) that predicts the pathogenicity of multi-type variants. MAGPIE uses the ClinVar dataset for training and demonstrates superior performance in both the independent test set and multiple orthogonal validation datasets, accurately predicting variant pathogenicity. Notably, MAGPIE performs best in predicting the pathogenicity of rare variants and highly imbalanced datasets. Overall, results underline the robustness of MAGPIE as a valuable tool for predicting pathogenicity in various types of human genome variations. MAGPIE is available at https://github.com/shenlab-genomics/magpie.
... SS motifs have also been explored experimentally, including a recent large-scale screen that measured the activity of all possible 5'SS sequences in three different exonic contexts (Wong et al., 2018). Various computational models focused on determining SS strength and classifying short sequences as 5' or 3' SS have also been developed, and were among the earliest models related to splicing (Cheng et al., 2019; Shapiro & Senapathy, 1987). ...
... For example, CASS scores might find use in design or troubleshooting of splicing in minigenes, or in synthetic biology applications. The SMsplice structure, which normally uses our new MaxEnt SS scores, could potentially also be applied to SS scores derived in other ways (Cheng et al., 2019) because of its modular design. ...
Preprint
Pre-mRNA splicing is a fundamental step in gene expression, conserved across eukaryotes, in which the spliceosome recognizes motifs at the 3' and 5' splice sites (SS), excises introns and ligates exons. SS recognition and pairing is often influenced by splicing regulatory factors (SRFs) that bind to splicing regulatory elements (SREs). Several families of sequence-specific SRFs are known to be similarly ancient. Here, we describe SMsplice, a fully interpretable model of pre-mRNA splicing that combines new models of core SS motifs, SREs, and exonic and intronic length preferences. We learn models that predict SS locations with 83-86% accuracy in fish, insects and plants, and about 70% in mammals. Learned SRE motifs include both known SRF binding motifs and novel motifs, and both classes are supported by genetic analyses. Our comparisons across species highlight similarities between non-mammals, a greater reliance on SREs in mammalian splicing, and increased reliance on intronic SREs in plant splicing.