Article

Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context


Abstract

Alternative splicing is a major contributor to cellular diversity in mammalian tissues and relates to many human diseases. An important goal in understanding this phenomenon is to infer a 'splicing code' that predicts how splicing is regulated in different cell types by features derived from RNA, DNA and epigenetic modifiers. We formulate the assembly of a splicing code as a problem of statistical inference and introduce a Bayesian method that uses an adaptively selected number of hidden variables to combine subgroups of features into a network, allows different tissues to share feature subgroups and uses a Gibbs sampler to hedge predictions and ascertain the statistical significance of identified features. Using data for 3665 cassette exons, 1014 RNA features and 4 tissue types derived from 27 mouse tissues (http://genes.toronto.edu/wasp), we benchmarked several methods. Our method outperforms all others, and achieves relative improvements of 52% in splicing code quality and up to 22% in classification error, compared with the state of the art. Novel combinations of regulatory features and novel combinations of tissues that share feature subgroups were identified using our method. Contact: frey@psi.toronto.edu. Supplementary data are available at Bioinformatics online.
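The abstract's idea of hedging predictions by averaging over posterior samples of the parameters (here drawn by a Gibbs sampler) can be illustrated with a minimal sketch. The function name and the posterior-weighted form are illustrative assumptions, not the paper's implementation; with equally weighted MCMC samples the formula reduces to a simple mean over samples.

```python
import math

def posterior_averaged_prediction(sample_predictions, log_posteriors):
    """Average predictions from posterior samples of the model parameters,
    weighting each sample by its normalized posterior probability.
    With equally weighted samples (as from a Gibbs sampler) this
    reduces to a plain mean over the samples."""
    m = max(log_posteriors)
    weights = [math.exp(lp - m) for lp in log_posteriors]  # stabilized
    total = sum(weights)
    return sum(w / total * p for w, p in zip(weights, sample_predictions))

# Three posterior samples predicting an exon's inclusion probability,
# drawn with equal weight -> simple mean of 0.9, 0.7, 0.8
avg = posterior_averaged_prediction([0.9, 0.7, 0.8], [0.0, 0.0, 0.0])
```

Averaging over many sampled parameter configurations, rather than committing to a single point estimate, is what lets the method hedge its predictions.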


... If computing power were unlimited, the most effective way to "regularize" a model with a fixed number of parameters would be to average the predictions of all possible parameter configurations, weighting each by its posterior probability given the training data. For simple or small models this can be estimated very effectively (Xiong et al., 2011; Salakhutdinov and Mnih, 2008) [92], [93]; however, it is desirable to approach the performance of this Bayesian gold standard at far lower computational cost. Our goal is to accomplish this by approximating the geometric mean of the predictions made by an exponentially large number of learning models that share parameters. ...
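The "geometric mean of exponentially many parameter-sharing models" in the excerpt above is what dropout's weight-scaling rule approximates. For a single sigmoid unit the approximation is in fact exact, which a tiny sketch can verify by brute-force enumeration of all dropout masks (illustrative names; drop probability fixed at 0.5, matching the uniform enumeration):

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def exact_geometric_mean(weights, x):
    """Renormalized geometric mean of a single sigmoid unit's predictions
    over all 2^n dropout masks (drop probability 0.5) -- feasible only
    because the model is tiny."""
    n = len(weights)
    log_p = log_q = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        z = sum(m * w * xi for m, w, xi in zip(mask, weights, x))
        log_p += math.log(sigmoid(z))        # accumulate log of "on" prob
        log_q += math.log(1.0 - sigmoid(z))  # and of "off" prob
    p = math.exp(log_p / 2 ** n)
    q = math.exp(log_q / 2 ** n)
    return p / (p + q)                       # renormalize the two means

def weight_scaling(weights, x, keep_prob=0.5):
    """Dropout's cheap approximation: scale the weights by keep_prob."""
    return sigmoid(sum(keep_prob * w * xi for w, xi in zip(weights, x)))

w = [1.0, -2.0, 0.5]
x = [0.3, 0.8, -1.2]
```

For deeper networks the equality no longer holds exactly, but weight scaling remains the standard cheap stand-in for the exponential ensemble average.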
Thesis
Full-text available
Deep learning technologies have developed at an exponential rate over the years. From Convolutional Neural Networks (CNNs) to Involutional Neural Networks (INNs), there are several neural network (NN) architectures today, including Vision Transformers (ViT), Graph Neural Networks (GNNs), Recurrent Neural Networks (RNNs), etc. However, these architectures cannot represent uncertainty, which poses a significant difficulty for decision-making, since capturing the uncertainty of these state-of-the-art NN structures would aid in making specific judgments. Dropout is one technique that may be implemented within Deep Learning (DL) networks to assess uncertainty: applied at the inference phase, it measures the uncertainty of a neural network model. This approach, commonly known as Monte Carlo Dropout (MCD), works well as a low-complexity estimator of uncertainty. MCD is a widely used approach to measuring uncertainty in DL models, but the majority of earlier works focus on a single application, and many state-of-the-art (SOTA) NNs remain unexplored with regard to uncertainty evaluation. An up-to-date roadmap and benchmark are therefore required in this field of study. Our study provides a comprehensive analysis of the MCD approach for assessing model uncertainty in neural network models across a variety of datasets. In addition, we include SOTA NNs to explore previously unexamined models with respect to uncertainty, and we demonstrate how a model may perform better, with less uncertainty, when its topology is modified, which also reveals the causes of a model's uncertainty. Using the results of our experiments and subsequent enhancements, we also discuss the advantages and costs of using MCD in these NN designs. In the course of developing reliable and robust models, we propose two novel architectures that provide outstanding performance in medical image diagnosis.
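The Monte Carlo Dropout procedure described in the abstract above can be sketched in a few lines: keep dropout active at inference, run many stochastic forward passes, and report the spread of the outputs as the uncertainty estimate. A single sigmoid unit stands in for a full network; all names are illustrative.

```python
import math
import random

def mc_dropout_predict(weights, x, drop_prob=0.5, passes=200, seed=0):
    """Monte Carlo Dropout: dropout stays active at inference; the mean
    of many stochastic forward passes is the prediction and the standard
    deviation is a low-cost uncertainty estimate."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(passes):
        # independently drop each weight with probability drop_prob
        z = sum(w * xi for w, xi in zip(weights, x)
                if rng.random() >= drop_prob)
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    mean = sum(outputs) / passes
    std = (sum((o - mean) ** 2 for o in outputs) / passes) ** 0.5
    return mean, std

mean, std = mc_dropout_predict([2.0, -1.0], [1.0, 1.0])
```

A larger `std` flags inputs the model is unsure about, which is the quantity the thesis benchmarks across architectures.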
... The prediction of splicing-perturbing variants has a long history of over 20 years' work [2][3][4][5][6][7][8][9]26,[37][38][39][40][41][42][43][44] . This includes tissue-specific models for mouse 43,44 and more recently human 9,41 . Those models showed successes in various splicing prediction tasks, such as quantitative change of percent spliced-in, splice site usage or splicing efficiency. ...
Article
Full-text available
Aberrant splicing is a major cause of genetic disorders but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models can prioritize rare variants for affecting splicing, their performance in predicting tissue-specific aberrant splicing remains unassessed. Here we generated an aberrant splicing benchmark dataset, spanning over 8.8 million rare variants in 49 human tissues from the Genotype-Tissue Expression (GTEx) dataset. At 20% recall, state-of-the-art DNA-based models achieve maximum 12% precision. By mapping and quantifying tissue-specific splice site usage transcriptome-wide and modeling isoform competition, we increased precision by threefold at the same recall. Integrating RNA-sequencing data of clinically accessible tissues into our model, AbSplice, brought precision to 60%. These results, replicated in two independent cohorts, substantially contribute to noncoding loss-of-function variant identification and to genetic diagnostics design and analytics.
... Persistent efforts have revealed that whether or not an exon is included in mature mRNA is mainly determined by how efficiently the spliceosome recognizes the splice site [19]. At the same time, RNA motifs recognized by RNA-binding proteins (RBPs) and other particular RNA features/structures constitute the blueprint of the so-called "splicing code" [20], which dictates splicing in different cell types and conditions [21][22][23]. Consequently, a number of in silico tools [24][25][26][27] and experimental assays [28][29][30] have been designed to predict AS changes at a genome-wide scale, and these have shed important light on the regulation and function of the splicing code, especially in evaluating the pathogenic roles of splicing-related variants. However, these studies are all aimed at predicting the Ψ values of simple AS events; models to predict the splicing complexity of alternative exons are lacking. ...
... Meanwhile, for brain tissue, the proportion of AS events exhibiting apparent complexity change among non-Dev-events shows a monotonic increase with the phylogenetic breadth of alternative exons (Fig. 6D). Among human-specific non-Dev-events, 22.9% (liver) to 33.7% (heart) of alternative exons have maximum changes of splicing entropy larger than 0.5 during organ development (Figs. 6E and S29B). ...
Article
Full-text available
Background As a significant process of post-transcriptional gene expression regulation in eukaryotic cells, alternative splicing (AS) of exons greatly contributes to the complexity of the transcriptome and indirectly enriches the protein repertoire. A large number of studies have focused on the splicing inclusion of alternative exons and have revealed the roles of AS in organ development and maturation. Notably, AS takes place through a change in the relative abundance of the transcript isoforms produced by a single gene, meaning that exons can have complex splicing patterns. However, the commonly used percent spliced-in (Ψ) values only define the usage rate of exons and lose information about the complexity of exons' linkage patterns. To date, the extent and functional consequences of the splicing complexity of alternative exons in development and evolution are poorly understood. Results By comparing the splicing complexity of exons in six tissues (brain, cerebellum, heart, liver, kidney, and testis) from six mammalian species (human, chimpanzee, gorilla, macaque, mouse, opossum) and an outgroup species (chicken), we revealed that exons with high splicing complexity are prevalent in mammals and are closely related to features of genes. Using traditional machine learning and deep learning methods, we found that the splicing complexity of exons can be moderately predicted with features derived from exons, among which the length of flanking exons and the splicing strength of downstream/upstream splice sites are top predictors. Comparative analysis among human, chimpanzee, gorilla, macaque, and mouse revealed that alternative exons tend to evolve toward an increased level of splicing complexity and higher tissue specificity in splicing complexity. During organ development, not only developmentally regulated exons but also 10–15% of non-developmentally regulated exons show dynamic splicing complexity.
Conclusions Our analysis revealed that splicing complexity is an important metric to characterize the splicing dynamics of alternative exons during the development and evolution of mammals.
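The splicing-complexity metric discussed above rests on the entropy of an exon's linkage patterns. A minimal sketch of the idea, assuming complexity is measured as the Shannon entropy of the relative abundances of the isoforms involving an alternative exon (the paper's exact estimator may differ):

```python
import math

def splicing_entropy(isoform_fractions):
    """Shannon entropy (bits) of the relative abundances of the transcript
    isoforms linked to an alternative exon. 0 means a single dominant
    linkage pattern; higher values mean a more complex splicing pattern."""
    total = sum(isoform_fractions)
    entropy = 0.0
    for f in isoform_fractions:
        if f > 0:
            p = f / total          # normalize to a probability
            entropy -= p * math.log2(p)
    return entropy

# A simple cassette event with two equally used isoforms gives 1 bit;
# four equally used linkage patterns give 2 bits.
```

Unlike Ψ, which only records how often the exon is included, this quantity also reflects how many distinct linkage patterns the exon participates in.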
... In addition, a variant may activate a cryptic site in a manner that induces alternative use of a previously categorized constitutive splice site. Yuan et al. [93] formulated the assembly of a splicing code as a statistical inference problem and proposed a Bayesian method to predict tissue-regulated splicing using RNA sequences and cellular contexts. Subsequently, they developed a DNN-based model using dropout to learn and predict alternative splicing [94]. ...
... CNNs inherently learn a hierarchy of increasingly complex features and are capable of operating directly on patches of images centered on the abnormal tissue. Applications of CNNs in medical imaging include the classification of interstitial lung diseases based on CT images, classification of tuberculosis manifestations based on X-ray images, classification of neural progenitor cells based on somatic cell sources [110], the detection of hemorrhages in color base images [111], and organ- or body-part-specific anatomical classification of CT images [93]. A body-part recognition system was also presented by Yan et al. [112]. ...
Article
Full-text available
With the rapid growth of biological information, biological science technology has greatly enriched biology and medicine data resources. The latest advances in deep learning have achieved state-of-the-art performance on high-dimensional, non-structural and less explanatory biological data. The aim of this paper is to provide an overview of deep learning techniques and some of their state-of-the-art applications in the fields of biology and medicine. Specifically, we introduce the fundamentals of deep learning methods and then review their successes in bioinformatics, biomedical imaging, biomedicine and drug discovery. We also discuss the challenges, limitations and avenues for further improvement in this area.
... The analysis is also extended towards predicting how swiftly transcription can occur. Gene data analysis also has a significant role to play in determining patterns [23][24][25][26][27] for splicing the DNA sequence such that introns are removed and exons are retained in the pre-mRNA. Gene data analytics likewise has a significant role to play in RNA structuring [28][29][30], which involves analyzing the interactions happening across the three layers of gene expression transformation. ...
... Hence, 200 human pre-miRNA sequences from miRBase-8.2 were selected randomly. Non-human species, non-coding RNA and messenger RNA data collected from GenBank [26] were also used for the training process. The hidden layer of the CNN is trained with the standard Back Propagation (BP) algorithm, with weight values adjusted by the stochastic gradient descent method (23), given as ...
Article
Full-text available
Bioinformatics is one of the emerging and rapidly developing research areas that is predominantly used for genetic data analysis and processing. Bioinformatics is characterized by huge volumes of data that keep growing, which in turn complicates data analysis. In most cases, Bioinformatics data analysis and processing involve big data analytics due to the complex nature of the data. Previous research works handled data analytics using traditional tools and conventional big data analytical methods. However, machine learning algorithms and approaches can be effectively deployed to perform parallel, distributed and incremental processing of complex big data analytics, especially in the case of gene big data analytics, to enhance the efficiency of processing this large chunk of Bioinformatics-based gene big data. This paper provides a Machine Learning algorithm-based Convolution Neural Network (ML-CNN) approach for identifying potential target genes, predicting miRNAs, visualizing unique miRNA patterns, and validating genomes. The proposed approach was evaluated in MATLAB using the deep learning toolbox on the pre-miRNA dataset. Experimental results indicate that machine learning algorithms certainly increase the efficiency of Bioinformatics-based methods of processing gene data in terms of prediction accuracy and reduced processing time. The mean performance of ML-CNN is 7% higher than that of the existing system.
... These models use a variety of features to make their predictions, for instance, the hexamer additive linear (HAL) model predicts the change in PSI for a cassette exon following mutation based on the hexamer compositions of the wild type and variant sequences (8). There have also been efforts to define a comprehensive "splicing code" of relevant cis-acting elements, with more than 1000 features, including exon lengths and binding motifs for known SFs (24)(25)(26). These features have then been used in Bayesian neural network models to predict splicing disruption caused by genetic variants and relative PSI values for cassette alternative exons in different tissues. ...
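The hexamer features underlying HAL-style models are straightforward to compute: count overlapping 6-mers in the wild-type and variant sequences, then score the count differences against learned weights. Below is a sketch of the feature-extraction step only (function names are illustrative; no learned weights are included):

```python
def hexamer_counts(seq):
    """Counts of overlapping 6-mers in a nucleotide sequence."""
    counts = {}
    for i in range(len(seq) - 5):
        h = seq[i:i + 6]
        counts[h] = counts.get(h, 0) + 1
    return counts

def hexamer_delta(wild_type, variant):
    """Per-hexamer count differences (variant minus wild type); a
    HAL-style model would take a weighted sum of these differences
    to predict the change in PSI caused by the mutation."""
    wt, var = hexamer_counts(wild_type), hexamer_counts(variant)
    return {h: var.get(h, 0) - wt.get(h, 0)
            for h in set(wt) | set(var)
            if var.get(h, 0) != wt.get(h, 0)}

# A single substitution shifts counts only for the hexamers overlapping it
delta = hexamer_delta("AAAAAAA", "AAAAAAT")
```

Because each point mutation touches at most six overlapping hexamers, the resulting feature change is sparse, which keeps such additive models cheap to evaluate.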
Article
Pre-mRNA splicing is a fundamental step in gene expression, conserved across eukaryotes, in which the spliceosome recognizes motifs at the 3′ and 5′ splice sites (SSs), excises introns, and ligates exons. SS recognition and pairing is often influenced by protein splicing factors (SFs) that bind to splicing regulatory elements (SREs). Here, we describe SMsplice, a fully interpretable model of pre-mRNA splicing that combines models of core SS motifs, SREs, and exonic and intronic length preferences. We learn models that predict SS locations with 83 to 86% accuracy in fish, insects, and plants and about 70% in mammals. Learned SRE motifs include both known SF binding motifs and unfamiliar motifs, and both motif classes are supported by genetic analyses. Our comparisons across species highlight similarities between non-mammals, increased reliance on intronic SREs in plant splicing, and a greater reliance on SREs in mammalian splicing.
... Incorporating known exonic SREs into a simple splicing model was found to substantially improve its predictions [7]. Related problems have since been tackled, such as predicting the splicing phenotypes of mutations or predicting aspects of alternative splicing such as the percent spliced in (PSI) values of exons or the direction of changes in exon inclusion between different tissues [8][9][10][11][12][13]. Most models emphasized features known to be recognized in splicing, but to improve accuracy, some models also considered features not available to the spliceosome, such as evolutionary conservation [14,15]. ...
Article
Full-text available
Sequence-specific RNA-binding proteins (RBPs) play central roles in splicing decisions. Here, we describe a modular splicing architecture that leverages in vitro-derived RNA affinity models for 79 human RBPs and the annotated human genome to produce improved models of RBP binding and activity. Binding and activity are modeled by separate Motif and Aggregator components that can be mixed and matched, enforcing sparsity to improve interpretability. Training a new Adjusted Motif (AM) architecture on the splicing task not only yields better splicing predictions but also improves prediction of RBP-binding sites in vivo and of splicing activity, assessed using independent data. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-03162-x.
... These models use a variety of features to make their predictions; for instance, HAL predicts the change in PSI for a cassette exon following mutation based on the hexameric compositions of the wild-type and variant sequences (Rosenberg et al., 2015). There have also been efforts to define a comprehensive 'splicing code' of relevant cis-acting elements, with over one thousand features, including exon lengths and binding motifs for known SFs (Barash et al., 2010; Xiong et al., 2011, 2015). These features have then been used in Bayesian neural network models to predict splicing disruption caused by genetic variants and relative PSI values for cassette alternative exons in different tissues. ...
Preprint
Full-text available
Pre-mRNA splicing is a fundamental step in gene expression, conserved across eukaryotes, in which the spliceosome recognizes motifs at the 3' and 5' splice sites (SS), excises introns and ligates exons. SS recognition and pairing is often influenced by splicing regulatory factors (SRFs) that bind to splicing regulatory elements (SREs). Several families of sequence-specific SRFs are known to be similarly ancient. Here, we describe SMsplice, a fully interpretable model of pre-mRNA splicing that combines new models of core SS motifs, SREs, and exonic and intronic length preferences. We learn models that predict SS locations with 83-86% accuracy in fish, insects and plants, and about 70% in mammals. Learned SRE motifs include both known SRF binding motifs and novel motifs, and both classes are supported by genetic analyses. Our comparisons across species highlight similarities between non-mammals, a greater reliance on SREs in mammalian splicing, and an increased reliance on intronic SREs in plant splicing.
... Related problems have since been tackled, such as predicting the splicing phenotypes of mutations, or predicting aspects of alternative splicing such as the percent spliced in (PSI) values of exons, or the direction of changes in exon inclusion between different tissues [8][9][10][11][12][13] . Most models emphasized features known to be recognized in splicing, but to improve accuracy some models also considered features not available to the spliceosome, such as evolutionary conservation 14,15 . ...
Preprint
Full-text available
Sequence-specific RNA-binding proteins (RBPs) play central roles in splicing decisions, but their exact binding locations and activities are difficult to predict. Here, we describe a modular splicing architecture that leverages in vitro -derived RNA affinity models for 79 human RBPs and the annotated human genome to produce improved models of RBP binding and activity. Binding and activity are modeled by separate Motif and Aggregator components that can be mixed and matched, enforcing sparsity to improve interpretability. Standard affinity models yielded reasonable predictions, but substantial improvements resulted from using a new Adjusted Motif (AM) architecture. While maintaining accurate modeling of in vitro binding, training these AMs on the splicing task yielded improved predictions of binding sites in vivo and of splicing activity, using independent crosslinking and massively parallel splicing reporter assay data. The modular structure of our model enables improved generalizability to other species (insects, plants) and to exons of different evolutionary ages.
... Amyloid precursor protein (APP) synthesis and expansion is another mechanism implicated in AD (108)(109)(110)(111). It is possible that curcumin plays a role in modulating γ-secretase by reducing the synthesis of Aβ and presenilin-1 gene expression. Additionally, curcumin's impact on APP maturation may also contribute to the decreased APP synthesis (112,113). The fact that curcumin and certain of its metabolites have demonstrated treatment potential for AD is interesting. Tetrahydrocurcumin, hexahydrocurcumin, and octahydrocurcumin are the most widely studied metabolites of curcumin; they are reductive in nature (114,115). ...
Article
Full-text available
Alzheimer's disease (AD) is a neurological illness with a progressive course that is the most common cause of dementia in the global population over the age of 65 (50-70% of all dementia cases). This chronic and progressive disease causes deficiencies in a variety of brain functions (mostly at the cortical and hippocampal levels), including memory, reasoning, orientation, understanding, computation, learning ability, language, and judgement. The changes that contribute to cognitive impairment are accompanied by losses in emotional regulation and social behaviour. According to research by Ferri et al. [3], about 23.4 million individuals have dementia today, with new cases identified at a rate of one every 7 seconds. These rates are expected to increase every 20 years, with 81.1 million people suffering from dementia by 2040. Patients rarely have symptoms before the age of 50, but the disease's prevalence rises with age. This steady rise has caused medical, social, and economic concerns, particularly in countries with accelerated population ageing. As the world's population ages, Alzheimer's disease (AD) and other kinds of dementia are becoming a growing public health concern among the elderly in developing countries. According to estimates, emerging countries will house nearly 70% of the world's population aged 60 and older by 2020, with India accounting for 14.2% of that figure. Dementia is predicted to affect 7.4% of people aged 60 and up in India, and there are around 8.8 million Indians over 60 with dementia. Research continues into new treatments and therapeutic procedures in an attempt to slow the progression of the disease. Above all, given the neuropathological complexity of the illness, these measures are designed for multiple targets and intended for use in the early stages of AD.
If these future treatments are to be effective, new diagnostic procedures must be developed that allow doctors to diagnose AD in its preclinical period (before symptoms occur) or perhaps forecast AD before it develops. AD prevention is a reasonable objective, but in order to attain it, we must first gain a better understanding of the aetiology of the disease and of how environmental and lifestyle variables influence the chance of developing it.
... The best approach to overcome this problem would be to employ combinations of all possible parameters and verify whether they are suitable when applied to the testing data. Nevertheless, such an approach is infeasible for DL architectures due to their vast number of parameters and exponential complexity [10]. ...
Preprint
Full-text available
Machine Learning algorithms have been extensively researched throughout the last decade, leading to unprecedented advances in a broad range of applications, such as image classification and reconstruction, object recognition, and text categorization. Nonetheless, most Machine Learning algorithms are trained via derivative-based optimizers, such as the Stochastic Gradient Descent, leading to possible local optimum entrapments and inhibiting them from achieving proper performances. A bio-inspired alternative to traditional optimization techniques, denoted as meta-heuristic, has received significant attention due to its simplicity and ability to avoid local optimums imprisonment. In this work, we propose to use meta-heuristic techniques to fine-tune pre-trained weights, exploring additional regions of the search space, and improving their effectiveness. The experimental evaluation comprises two classification tasks (image and text) and is assessed under four literature datasets. Experimental results show nature-inspired algorithms' capacity in exploring the neighborhood of pre-trained weights, achieving superior results than their counterpart pre-trained architectures. Additionally, a thorough analysis of distinct architectures, such as Multi-Layer Perceptron and Recurrent Neural Networks, attempts to visualize and provide more precise insights into the most critical weights to be fine-tuned in the learning process.
... Computational perspective. For DL models with a large number of trainable parameters, a major achievement in reducing overfitting was made by stochastically dropping out randomly selected neurons during training [225] or by Bayesian approaches [228,229]. To assess the out-of-sample generalizability of ML models on independent data, train/validation/test data-splitting strategies have been introduced based on scaffold/random/temporal features, etc. [11]. ...
Article
Full-text available
A large number of chemical compounds are available in databases such as PubChem and ZINC. However, currently known compounds, though numerous, represent only a fraction of possible compounds, a space known as chemical space. Many of the compounds in these databases are annotated with properties and assay data that can be used for drug discovery efforts. Toward this goal, a number of machine learning algorithms have been developed, and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in chemical compound databases. We first compile the kinds of tasks that machine learning methods attempt to accomplish. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug-related tasks. Next, we survey deep learning techniques that address the scarcity of annotated data for more effective navigation of chemical space. Because chemical compound information alone may not be powerful enough for drug-related tasks, we also survey what other information, such as assay and gene expression data, can be used to improve the prediction power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.
... Alternatively, the best regularization would be to average the predictions of all possible parameter configurations, weighting each possibility and checking which would perform better. Nevertheless, such a methodology demands a cumbersome computational effort that is feasible only for small or simple models [23]. Some years ago, Srivastava et al. [16] proposed a regularization technique known as Dropout, in which neurons are randomly dropped from neural networks during the training phase. ...
Chapter
Deep Learning architectures have been extensively studied in recent years, mainly due to their discriminative power in Computer Vision. However, one problem related to such models concerns their number of parameters and hyperparameters, which can easily reach hundreds of thousands. Additional drawbacks are their need for extensive training datasets and their high probability of overfitting. Recently, the naïve idea of disconnecting neurons from a network, known as Dropout, has shown to be a promising solution, though it requires an adequate hyperparameter setting. Therefore, this work addresses finding suitable Dropout ratios through meta-heuristic optimization in the task of image reconstruction. Several energy-based Deep Learning architectures, such as Restricted Boltzmann Machines and Deep Belief Networks, and several meta-heuristic techniques, such as Particle Swarm Optimization, Bat Algorithm, Firefly Algorithm, and Cuckoo Search, were employed in this context. The experimental results describe the feasibility of using meta-heuristic optimization to find suitable Dropout parameters in three literature datasets and reinforce bio-inspired optimization as an alternative to empirically choosing regularization-based hyperparameters.
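The search for suitable Dropout ratios described above can be illustrated with a deliberately simplified stand-in for the chapter's meta-heuristics (PSO, Bat Algorithm, Firefly Algorithm, Cuckoo Search): random search over the ratio against a user-supplied validation objective. The `evaluate` callback and the toy objective are hypothetical placeholders, not the chapter's setup.

```python
import random

def random_search_dropout(evaluate, trials=20, seed=0):
    """Sample candidate dropout ratios and keep the one with the best
    validation score returned by `evaluate` (higher is better). Real
    meta-heuristics explore the same space, but with guided moves
    instead of blind sampling."""
    rng = random.Random(seed)
    best_p, best_score = None, float("-inf")
    for _ in range(trials):
        p = rng.uniform(0.0, 0.9)   # candidate dropout ratio
        score = evaluate(p)
        if score > best_score:
            best_p, best_score = p, score
    return best_p, best_score

# Toy validation objective that peaks at a dropout ratio of 0.5
best_p, best_score = random_search_dropout(lambda p: -(p - 0.5) ** 2)
```

Swapping the sampling loop for a population-based update rule is what turns this baseline into the bio-inspired optimizers the chapter evaluates.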
... The payment mode depends on individual preference and comfort level, e.g., online or card. For card payment systems, RFID cards generate a special 12-byte code when the card is swiped at the machine (Xiong, Barash, & Frey 2011). A wireless transducer, an antenna and encapsulating material are the major components of the card-reading process. ...
... In another example, on a genetics dataset, the task is to predict the occurrence probability of three alternative-splicing-related events based on RNA features. The 'Code Quality' performance (a measure of the KL divergence between the target and the predicted probability distributions) can be improved from 440 with a standard network to 623 with a BNN [59]. ...
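The KL divergence at the heart of the 'Code Quality' measure mentioned above compares the target and predicted distributions over the three splicing outcomes (increased, decreased, unchanged inclusion). The sketch below shows only the core divergence, not the exact published aggregate metric; the smoothing constant is an illustrative choice.

```python
import math

def kl_divergence(target, predicted, eps=1e-12):
    """KL divergence D(target || predicted) between two discrete
    distributions, e.g. over the three splicing outcomes. Zero when the
    prediction matches the target exactly; larger means a worse match.
    `eps` guards against log(0) for zero-probability outcomes."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)
```

A code-quality score built on this quantity rewards models whose predicted outcome distributions sit close to the empirically observed ones across many exons.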
Preprint
Bayesian neural network (BNN) allows for uncertainty quantification in prediction, offering an advantage over regular neural networks that has not been explored in the differential privacy (DP) framework. We fill this important gap by leveraging recent development in Bayesian deep learning and privacy accounting to offer a more precise analysis of the trade-off between privacy and accuracy in BNN. We propose three DP-BNNs that characterize the weight uncertainty for the same network architecture in distinct ways, namely DP-SGLD (via the noisy gradient method), DP-BBP (via changing the parameters of interest) and DP-MC Dropout (via the model architecture). Interestingly, we show a new equivalence between DP-SGD and DP-SGLD, implying that some non-Bayesian DP training naturally allows for uncertainty quantification. However, the hyperparameters such as learning rate and batch size, can have different or even opposite effects in DP-SGD and DP-SGLD. Extensive experiments are conducted to compare DP-BNNs, in terms of privacy guarantee, prediction accuracy, uncertainty quantification, calibration, computation speed, and generalizability to network architecture. As a result, we observe a new tradeoff between the privacy and the reliability. When compared to non-DP and non-Bayesian approaches, DP-SGLD is remarkably accurate under strong privacy guarantee, demonstrating the great potential of DP-BNN in real-world tasks.
... The process of turning pre-messenger RNA into mature messenger RNA (mRNA), which may be translated into a protein, is known as splicing. Jha et al. [13] used previously established BNN (Xiong et al.) [14] and DNN (Leung et al.) [15] models to construct integrated deep learning models for alternative splicing. Their algorithms can detect splicing regulators and their potential targets, as well as infer regulatory rules from the genomic sequence. ...
Research
Full-text available
Abstract: Genomic data has the potential to improve healthcare strategy in a variety of ways, including illness prevention, improved diagnosis, and better treatment. While Machine Learning may have revolutionized many fields, its implementation in the field of Genomics is new. Currently, Machine Learning is being applied and tested in many genomic processes, but not all of these have been clinically validated. Hence, we are far from providing deployable Machine Learning or Deep Learning models for -omics data. This paper aims to explore, in a very uncomplicated manner, what exactly genomics is, where high-performance computing and machine learning come into the picture, the current applications of machine learning in genomics, and the potential future scope of machine learning in genomics.
... The model integrates regulatory sequence elements to qualitatively predict whether the inclusion of a cassette exon increases, decreases, or remains at a similar level from one tissue to another tissue. This model was further improved to predict directional changes between tissues along with discretized categories (low, medium, and high) within a tissue by using a Bayesian neural network with hidden variables [35]. In a subsequent study, a similar Bayesian neural network (SPANR) was trained on human data [29]. ...
Article
Full-text available
We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.
... Alternatively, the best regularization would be to average the predictions of all possible parameter configurations, weighing the possibilities and checking out which would perform better. Nevertheless, such a methodology demands a cumbersome computational effort, only feasible for pitiful or non-complex models [23]. Some years ago, Srivastava et al. [16] proposed a regularization technique known as Dropout, where neurons are randomly dropped from neural networks during the training phase. ...
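The excerpt above describes the idea that dropout approximates averaging over an exponentially large family of weight-sharing sub-networks. The sketch below illustrates this for a single logistic unit with hypothetical weights: a Monte Carlo average over sampled dropout masks is compared with the standard weight-scaling rule, which approximates the geometric mean of the sub-network predictions in one forward pass. The weights, input, and keep probability are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single logistic unit; each dropout mask selects one member of an
# exponentially large family of sub-networks sharing these weights.
w = np.array([1.5, -2.0, 0.5])     # hypothetical weights
x = np.array([1.0, 0.8, -0.3])     # hypothetical input
p_keep = 0.5

# Monte Carlo average over sampled sub-networks.
masks = rng.random((10000, w.size)) < p_keep
mc_mean = sigmoid((masks * x) @ w).mean()

# Weight-scaling rule: a single forward pass with inputs scaled by
# p_keep, which approximates the (geometric-mean) ensemble prediction.
approx = sigmoid((p_keep * x) @ w)
```

For this unit the two numbers land close together, which is the practical justification for replacing the intractable average over all masks with one scaled forward pass at test time.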
Conference Paper
Deep Learning architectures have been extensively studied in the last years, mainly due to their discriminative power in Computer Vision. However, one problem related to such models concerns their number of parameters and hyperparameters, which can easily reach hundreds of thousands. Additional drawbacks consist of their need for extensive training datasets and their high probability of overfitting. Recently, a naïve idea of disconnecting neurons from a network, known as Dropout, has shown to be a promising solution though it requires an adequate hyperparameter setting. Therefore, this work addresses finding suitable Dropout ratios through meta-heuristic optimization in the task of image reconstruction. Several energy-based Deep Learning architectures, such as Restricted Boltzmann Machines, Deep Belief Networks, and several meta-heuristic techniques, such as Particle Swarm Optimization, Bat Algorithm, Firefly Algorithm, Cuckoo Search, were employed in such a context. The experimental results describe the feasibility of using meta-heuristic optimization to find suitable Dropout parameters in three literature datasets and reinforce bio-inspired optimization as an alternative to empirically choosing regularization-based hyperparameters.
... pieces of data for modelling gene coding in proteins (Xiong et al., 2011). ...
... The first group framed the prediction as a classification task: whether an alternative splicing event would occur given an input or a change in input. The earliest examples involved using Bayesian regression [1] and a Bayesian neural network [40] to predict whether an exon would be skipped or included in a transcript. [23] used a neural network with dense layers to predict the type of AS event. ...
Preprint
Full-text available
A single gene can encode for different protein versions through a process called alternative splicing. Since proteins play major roles in cellular functions, aberrant splicing profiles can result in a variety of diseases, including cancers. Alternative splicing is determined by the gene's primary sequence and other regulatory factors such as RNA-binding protein levels. With these as input, we formulate the prediction of RNA splicing as a regression task and build a new training dataset (CAPD) to benchmark learned models. We propose discrete compositional energy network (DCEN) which leverages the hierarchical relationships between splice sites, junctions and transcripts to approach this task. In the case of alternative splicing prediction, DCEN models mRNA transcript probabilities through its constituent splice junctions' energy values. These transcript probabilities are subsequently mapped to relative abundance values of key nucleotides and trained with ground-truth experimental measurements. Through our experiments on CAPD, we show that DCEN outperforms baselines and ablation variants.
... Alternatively, the best way to employ a regularization method would be to average the predictions of all possible parameter configurations, weighing the possibilities and checking out which would perform better. Nevertheless, such a methodology demands a cumbersome computational effort, only feasible for pitiful or non-complex models [8]. ...
Preprint
Full-text available
Deep learning architectures have been widely fostered throughout the last years, being used in a wide range of applications, such as object recognition, image reconstruction, and signal processing. Nevertheless, such models suffer from a common problem known as overfitting, which limits the network from predicting unseen data effectively. Regularization approaches arise in an attempt to address such a shortcoming. Among them, one can refer to the well-known Dropout, which tackles the problem by randomly shutting down a set of neurons and their connections according to a certain probability. Therefore, this approach does not consider any additional knowledge to decide which units should be disconnected. In this paper, we propose an energy-based Dropout (E-Dropout) that makes conscious decisions whether a neuron should be dropped or not. Specifically, we design this regularization method by correlating neurons and the model's energy as an importance level for further applying it to energy-based models, such as Restricted Boltzmann Machines (RBMs). The experimental results over several benchmark datasets revealed the proposed approach's suitability compared to the traditional Dropout and the standard RBMs.
... Alternatively, the best way to employ a regularization method would be to average the predictions of all possible parameter configurations, weighing the possibilities and checking out which would perform better. Nevertheless, such a methodology demands a cumbersome computational effort, only feasible for pitiful or non-complex models [8]. ...
Article
Deep learning architectures have been widely fostered throughout the last years, being used in a wide range of applications, such as object recognition, image reconstruction, and signal processing. Nevertheless, such models suffer from a common problem known as overfitting, which limits the network from predicting unseen data effectively. Regularization approaches arise in an attempt to address such a shortcoming. Among them, one can refer to the well-known Dropout, which tackles the problem by randomly shutting down a set of neurons and their connections according to a certain probability. Therefore, this approach does not consider any additional knowledge to decide which units should be disconnected. In this paper, we propose an energy-based Dropout (E-Dropout) that makes conscious decisions whether a neuron should be dropped or not. Specifically, we design this regularization method by correlating neurons and the model's energy as an importance level for further applying it to energy-based models, such as Restricted Boltzmann Machines (RBMs). The experimental results over several benchmark datasets revealed the proposed approach's suitability compared to the traditional Dropout and the standard RBMs.
... There have been a number of different studies that have aimed to address this issue. One approach has been the pursuit of deciphering the "splicing code" using computational techniques such as deep learning [5][6][7]. ...
Preprint
Full-text available
Understanding the functional impact of genomic variants is a major goal of modern genetics and personalized medicine. Although many synonymous and non-coding variants act through altering the efficiency of pre-mRNA splicing, it is difficult to predict how these variants impact pre-mRNA splicing. Here, we describe a massively parallel approach we used to test the impact of 2,059 human genetic variants spanning 110 alternative exons on pre-mRNA splicing. This method yields data that reinforces known mechanisms of pre-mRNA splicing, can rapidly identify genomic variants that impact pre-mRNA splicing, and will be useful for increasing our understanding of genome function.
... The first practical splicing code was introduced by Barash et al. (2010) and predicted tissue differences of cassette splicing events in mouse. Subsequent versions of the splicing code introduced a Bayesian neural network and predicted absolute splicing levels (Xiong et al., 2011). Since then, Bayesian neural networks (Xiong et al., 2015) and deep neural networks (Leung et al., 2014) have further improved on the state of the art in predicting exon skipping. ...
Preprint
Motivation: Alternative splice site selection is inherently competitive, and the probability that a given splice site is used also depends strongly on the strength of neighboring sites. Here we present a new model named Competitive Splice Site Model (COSSMO), which explicitly models these competitive effects and predicts the PSI distribution over any number of putative splice sites. We model an alternative splicing event as the choice of a 3' acceptor site conditional on a fixed upstream 5' donor site, or the choice of a 5' donor site conditional on a fixed 3' acceptor site. We build four different architectures that use convolutional layers, communication layers, LSTMs, and residual networks, respectively, to learn relevant motifs from sequence alone. We also construct a new dataset from genome annotations and RNA-Seq read data that we use to train our model. Results: COSSMO predicts the most frequently used splice site with an accuracy of 70% on unseen test data and achieves an R² of 60% in modeling the PSI distribution. We visualize the motifs that COSSMO learns from sequence and show that COSSMO recognizes the consensus splice site sequences as well as many known splicing factors with high specificity. Availability: Our dataset is available from http://cossmo.deepgenomics.com. Contact: frey@deepgenomics.com. Supplementary information: Supplementary data are available at Bioinformatics online.
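The core modeling idea in the COSSMO abstract, a PSI distribution over competing candidate splice sites, amounts to normalizing per-site scores into probabilities. A minimal sketch of that normalization step is below; the scores are hypothetical placeholders for what a learned network would output, not COSSMO's actual architecture.

```python
import math

def psi_from_scores(scores):
    """Softmax: turn unnormalized splice-site scores into a PSI
    distribution over the competing candidate sites."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate 3' acceptor sites competing
# for the same fixed 5' donor site.
psi = psi_from_scores([2.0, 0.5, -1.0])
```

Because the distribution is normalized jointly, strengthening one site necessarily lowers the predicted usage of its competitors, which is exactly the competitive effect the model is meant to capture.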
... X_Aij and X_Bij are respectively the counts of mRNA molecules from isoforms g_A and g_B in cell i. Notice that X_Aij is a random binomial sample from the total number of expressed pre-mRNA molecules of g in i with probability Ψ_ij, as has been modeled before (Waks et al., 2011; Xiong et al., 2011; Faigenbloom et al., 2015; Shen et al., 2014). Ψ_Tij is the true isoform ratio that includes cassette exon j in cell i, obtained as the proportion of molecules of gene g that include the cassette exon j. ...
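The binomial sampling model in the excerpt above can be illustrated with a toy simulation. The snippet below is not the authors' simulator: the molecule count, capture efficiency, and true Ψ are assumed values chosen only to show how low expression plus low capture can make intermediate splicing look binary.

```python
import random

random.seed(1)

def observed_psi(true_psi, n_molecules, capture_eff):
    """Simulate one cell: each pre-mRNA molecule includes the cassette
    exon with probability true_psi; each molecule is then captured
    independently with probability capture_eff."""
    included = sum(random.random() < true_psi for _ in range(n_molecules))
    obs_inc = sum(random.random() < capture_eff for _ in range(included))
    obs_exc = sum(random.random() < capture_eff for _ in range(n_molecules - included))
    total = obs_inc + obs_exc
    return None if total == 0 else obs_inc / total

# True PSI of 0.5 in every cell, but low expression (20 molecules) and
# 10% capture: many cells appear to express exactly one isoform.
cells = [observed_psi(0.5, 20, 0.1) for _ in range(2000)]
cells = [c for c in cells if c is not None]
binary_fraction = sum(c in (0.0, 1.0) for c in cells) / len(cells)
```

Even though every simulated cell truly splices at Ψ = 0.5, a large share of cells yield an observed PSI of exactly 0 or 1, reproducing the apparent bimodality that the cited study attributes to technical limitations.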
Article
Full-text available
Single-cell RNA sequencing provides powerful insight into the factors that determine each cell’s unique identity. Previous studies led to the surprising observation that alternative splicing among single cells is highly variable and follows a bimodal pattern: a given cell consistently produces either one or the other isoform for a particular splicing choice, with few cells producing both isoforms. Here, we show that this pattern arises almost entirely from technical limitations. We analyze alternative splicing in human and mouse single-cell RNA-seq datasets, and model them with a probabilistic simulator. Our simulations show that low gene expression and low capture efficiency distort the observed distribution of isoforms. This gives the appearance of binary splicing outcomes, even when the underlying reality is consistent with more than one isoform per cell. We show that accounting for the true amount of information recovered can produce biologically meaningful measurements of splicing in single cells.
... Gaussian process preference learning (GPPL) [11], a Thurstone-Mosteller-based model that accounts for the features of the instances when inferring their scores, can make predictions for unlabelled instances and copes better with sparse pairwise labels. GPPL uses Bayesian inference, which has been shown to cope better with sparse and noisy data [39,38,4,24], including disagreements between multiple annotators [12,36,14,22]. Through the random utility model, GPPL is able to handle disagreements between annotators as noise, since no label has a probability of one of being selected. ...
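The random utility model referenced in the excerpt above can be sketched in a few lines. This is the classic Thurstone-Mosteller comparison step, not GPPL itself (which additionally places a Gaussian process prior over the utilities); the utility values are hypothetical.

```python
import math

def pref_prob(f_i, f_j):
    """Thurstone-Mosteller model: probability that item i is preferred
    over item j given latent utilities f_i and f_j. Each comparison adds
    unit-variance Gaussian noise to both items, so the noisy difference
    has variance 2, and P(i > j) = Phi((f_i - f_j) / sqrt(2))."""
    diff = f_i - f_j
    return 0.5 * (1 + math.erf(diff / 2))  # Phi(diff / sqrt(2)) via erf

# Hypothetical latent scores: even the stronger item loses sometimes,
# so annotator disagreement is absorbed as noise rather than error.
p = pref_prob(1.2, 0.3)
```

Because `pref_prob` never returns exactly 0 or 1 for finite utilities, conflicting pairwise labels from different annotators remain consistent with a single underlying ranking.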
Preprint
Full-text available
Most humour processing systems to date make at best discrete, coarse-grained distinctions between the comical and the conventional, yet such notions are better conceptualized as a broad spectrum. In this paper, we present a probabilistic approach, a variant of Gaussian process preference learning (GPPL), that learns to rank and rate the humorousness of short texts by exploiting human preference judgments and automatically sourced linguistic annotations. We apply our system, which had previously shown good performance on English-language one-liners annotated with pairwise humorousness annotations, to the Spanish-language data set of the HAHA@IberLEF2019 evaluation campaign. We report system performance for the campaign's two subtasks, humour detection and funniness score prediction, and discuss some issues arising from the conversion between the numeric scores used in the HAHA@IberLEF2019 data and the pairwise judgment annotations required for our method.
... Previous computational methods on splicing have largely focused on discovering novel splice junctions based on RNA sequencing (RNA-seq) alignments [25,26], utilizing machine learning approaches [27,28] including deep neural networks [29]. Only a limited set of tools can model splicing regulation based on genomic sequences and select RNA features [30][31][32]. Moreover, studies on splicing regulation have focused heavily on identifying mutations that land within splice sites (SSs), cis-acting splicing regulatory elements, and trans-acting splicing factors [30,33]. ...
Article
Full-text available
Alternative RNA splicing provides an important means to expand metazoan transcriptome diversity. Contrary to what was accepted previously, splicing is now thought to predominantly take place during transcription. Motivated by emerging data showing the physical proximity of the spliceosome to Pol II, we surveyed the effect of epigenetic context on co-transcriptional splicing. In particular, we observed that splicing factors were not necessarily enriched at exon junctions and that most epigenetic signatures had a distinctly asymmetric profile around known splice sites. Given this, we tried to build an interpretable model that mimics the physical layout of splicing regulation where the chromatin context progressively changes as the Pol II moves along the guide DNA. We used a recurrent-neural-network architecture to predict the inclusion of a spliced exon based on adjacent epigenetic signals, and we showed that distinct spatio-temporal features of these signals were key determinants of model outcome, in addition to the actual nucleotide sequence of the guide DNA strand. After the model had been trained and tested (with >80% precision-recall curve metric), we explored the derived weights of the latent factors, finding they highlight the importance of the asymmetric time-direction of chromatin context during transcription.
... First, the features are an identifiable set representing prior biological knowledge about putative regulatory elements such as known sequence motifs and RNA conservation scores. Moreover, a myriad of models have already been applied to this task, including a mixture of decision trees, Bayesian neural networks, naive Bayes, and logistic regression [18,19]. Second, the splicing code model includes embedding in a lower dimension space, a common component in genomic models, allowing us to test the usage of feature embedding for prediction attribution. ...
Article
Full-text available
Despite the success and fast adaptation of deep learning models in biomedical domains, their lack of interpretability remains an issue. Here, we introduce Enhanced Integrated Gradients (EIG), a method to identify significant features associated with a specific prediction task. Using RNA splicing prediction as well as digit classification as case studies, we demonstrate that EIG improves upon the original Integrated Gradients method and produces sets of informative features. We then apply EIG to identify A1CF as a key regulator of liver-specific alternative splicing, supporting this finding with subsequent analysis of relevant A1CF functional (RNA-seq) and binding data (PAR-CLIP).
... The model integrates regulatory sequence elements to qualitatively predict whether the inclusion of a cassette exon increases, decreases, or remains at a similar level from one tissue to another. This model was further improved to predict directional changes between tissues along with discretized Ψ categories (Low, Medium, and High) within a tissue by using a Bayesian neural network with hidden variables [34]. In a subsequent study, a similar Bayesian neural network (SPANR) was trained on human data [28]. ...
Preprint
Full-text available
Tissue-specific splicing of exons plays an important role in determining tissue identity. However, computational tools predicting tissue-specific effects of variants on splicing are lacking. To address this issue, we developed MTSplice (Multi-tissue Splicing), a neural network which quantitatively predicts effects of human genetic variants on splicing of cassette exons in 56 tissues. MTSplice combines the state-of-the-art predictor MMSplice, which models constitutive regulatory sequences, with a new neural network which models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting effects associated with naturally occurring genetic variants in most tissues of the GTEx dataset. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. MTSplice is provided free of use and open source at the model repository Kipoi. We foresee MTSplice to be useful for functional prediction and prioritization of variants associated with tissue-specific disorders.
... In combination with our previous studies showing that Fbxw7 plays an essential role in the maintenance of quiescence and stemness in cancer stem cells [52-54] ... Prediction of splicing sites has been a major goal over the past decade. Early studies adopted a naive Bayesian model [55], but the advent of deep learning allowed the development of more complex models that have provided better predictive accuracy. SpliceAI, a 32-layer deep CNN, predicts splicing from a given pre-mRNA sequence. ...
Article
Full-text available
Artificial intelligence (AI) has contributed substantially to the resolution of a variety of biomedical problems, including cancer, over the last decade. Deep learning, a subfield of AI that is highly flexible and supports automatic feature extraction, is increasingly being applied in various areas of both basic and clinical cancer research. In this review, we describe numerous recent examples of the application of AI in oncology-including cases in which deep learning has efficiently solved problems that were previously thought to be unsolvable-and we address obstacles that must be overcome before such application can become more widespread. We also highlight resources and data sets that can help harness the power of AI for cancer research. The development of innovative approaches to and applications of AI will yield important insights in oncology in the coming decade.
... X_Aij and X_Bij are respectively the counts of mRNA molecules from isoforms g_A and g_B in cell i. Notice that X_Aij is a random binomial sample from the total number of expressed pre-mRNA molecules of g in i with probability Ψ_ij, as has been modeled before [10,28,34,35]. Ψ_Tij is the true isoform ratio that includes cassette exon j in cell i, obtained as the proportion of molecules of gene g that include the cassette exon j. ...
Preprint
Full-text available
Single cell RNA sequencing provides powerful insight into the factors that determine each cell's unique identity, including variation in transcription and RNA splicing among diverse cell types. Previous studies led to the surprising observation that alternative splicing outcomes among single cells are highly variable and follow a bimodal pattern: a given cell consistently produces either one or the other isoform for a particular splicing choice, with few cells producing both isoforms. Here we show that this pattern arises almost entirely from technical limitations. We analyzed single cell alternative splicing in human and mouse single cell RNA-seq datasets, and modeled them with a probabilistic simulator. Our simulations show that low gene expression and low capture efficiency distort the observed distribution of isoforms in single cells. This gives the appearance of a binary isoform distribution, even when the underlying reality is consistent with more than one isoform per cell. We show that accounting for the true amount of information recovered can produce biologically meaningful measurements of splicing in single cells.
... When learning from noisy or small datasets, commonly-used methods based on maximum likelihood estimation may produce over-confident predictions (Xiong et al., 2011;Srivastava et al., 2014). In contrast, Bayesian inference accounts for model uncertainty when making predictions. ...
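The contrast drawn in the excerpt above, over-confident maximum-likelihood estimates on small data versus hedged Bayesian predictions, can be shown with a minimal Beta-Binomial example. The counts and the uniform prior are assumed for illustration.

```python
# Estimating an inclusion probability from a tiny sample. The maximum-
# likelihood estimate can be over-confident (here, exactly 1.0), while
# the Bayesian posterior predictive hedges toward the prior.
successes, trials = 3, 3          # e.g. 3 of 3 reads support inclusion

mle = successes / trials          # point estimate: 1.0, zero uncertainty

# Beta(1, 1) uniform prior -> posterior Beta(1 + s, 1 + n - s); the
# posterior predictive probability is the posterior mean (Laplace rule).
alpha, beta = 1 + successes, 1 + trials - successes
bayes = alpha / (alpha + beta)    # 0.8, hedged away from certainty
```

With only three observations, the MLE asserts the event is certain, whereas the posterior predictive keeps probability mass on the alternative outcome; the gap shrinks as the sample grows.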
... This enables BANN to achieve a more generalized mapping of the input-output relationship and better predictive capability. BANN has also been found more efficient than other ML methods at developing models from limited data, as in the present study (Xiong et al., 2011). A number of studies have reported better predictive performance for BANN than for other ML methods (Srivastava et al., 2014; Chen et al., 2019). ...
Article
Reliable prediction of rainfall extremes is vital for disaster management, particularly in the context of increasing rainfall extremes due to global climate change. Physical-empirical models were developed in this study using three widely used Machine Learning (ML) methods, namely Support Vector Machines (SVM), Random Forests (RF), and Bayesian Artificial Neural Networks (BANN), for the prediction of rainfall and rainfall-related extremes during the Northeast Monsoon (NEM) in Peninsular Malaysia from synoptic predictors. The gridded daily rainfall data of Asian Precipitation—Highly Resolved Observational Data Integration Towards Evaluation of Water Resources (APHRODITE) was used to estimate four rainfall indices, namely rainfall amount, average rainfall intensity, days having >95th-percentile rainfall, and total number of dry days, in Peninsular Malaysia during the NEM for the period 1951–2015. The National Centers for Environmental Prediction (NCEP) reanalysis sea level pressure (SLP) data was used for the prediction of rainfall indices with different lead periods. The recursive feature elimination (RFE) method was used to select the SLP at the NCEP grid points found to be significantly correlated with NEM rainfall indices. The results showed superior performance of BANN among the ML models, with normalised root mean square error of 0.04–0.14, Nash-Sutcliffe efficiency of 0.98–1.0, modified agreement index of 0.97–0.99, and Kling-Gupta efficiency index of 0.65–0.96 for one-month-lead prediction. The 95% confidence interval (CI) band for BANN was found to be narrower than for the other ML models. Almost all the values forecasted by BANN also fell within the 95% CI, and therefore the p-factor and the r-factor for BANN in predicting rainfall indices were in the ranges 0.95–1.0 and 0.25–0.49, respectively. Application of BANN to prediction of rainfall indices with longer lead times was also found excellent.
The synoptic pattern revealed that SLP over the north of the South China Sea is the major driver of NEM rainfall and rainfall extremes in Peninsular Malaysia.
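The abstract above reports skill scores such as normalised RMSE and Nash-Sutcliffe efficiency. These are standard formulas; the sketch below implements them on hypothetical rainfall values (the data shown is invented, not from the study).

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance around the observed
    mean. 1.0 is a perfect fit; 0 is no better than predicting the mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - sse / sst

def nrmse(obs, sim):
    """Root-mean-square error normalised by the observed mean."""
    mse = sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs)
    mean_obs = sum(obs) / len(obs)
    return math.sqrt(mse) / mean_obs

# Hypothetical monthly rainfall totals (mm) and model predictions.
obs = [120.0, 340.0, 95.0, 210.0, 400.0]
sim = [130.0, 320.0, 100.0, 200.0, 390.0]
```

On this toy data the fit is close, so `nse(obs, sim)` sits near 1 and `nrmse(obs, sim)` stays small, the same regime as the BANN scores quoted in the abstract.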
Article
The expression of RNA-binding proteins and their interaction with the spliced pre-mRNA are the key factors in determining the final isoform profile. Transmembrane protein CD44 is involved in differentiation, invasion, motility, growth and survival of tumor cells, and is also a commonly accepted marker of cancer stem cells and epithelial-mesenchymal transition. However, the functions of the isoforms of this protein differ significantly. In this paper, we developed a method based on the boosted beta regression algorithm for identification of the significant RNA-binding proteins in the splicing process by modeling the isoform ratio. The application of this method to the analysis of CD44 splicing in colorectal cancer cells revealed 20 significant RNA-binding proteins. Many of them were previously shown as EMT regulators, but for the first time presented as potential CD44 splicing factors.
Article
Full-text available
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Chapter
Bayesian neural network (BNN) allows for uncertainty quantification in prediction, offering an advantage over regular neural networks that has not been explored in the differential privacy (DP) framework. We fill this important gap by leveraging recent development in Bayesian deep learning and privacy accounting to offer a more precise analysis of the trade-off between privacy and accuracy in BNN. We propose three DP-BNNs that characterize the weight uncertainty for the same network architecture in distinct ways, namely DP-SGLD (via the noisy gradient method), DP-BBP (via changing the parameters of interest) and DP-MC Dropout (via the model architecture). Interestingly, we show a new equivalence between DP-SGD and DP-SGLD, implying that some non-Bayesian DP training naturally allows for uncertainty quantification. However, the hyperparameters such as learning rate and batch size, can have different or even opposite effects in DP-SGD and DP-SGLD. Extensive experiments are conducted to compare DP-BNNs, in terms of privacy guarantee, prediction accuracy, uncertainty quantification, calibration, computation speed, and generalizability to network architecture. As a result, we observe a new tradeoff between the privacy and the reliability. When compared to non-DP and non-Bayesian approaches, DP-SGLD is remarkably accurate under strong privacy guarantee, demonstrating the great potential of DP-BNN in real-world tasks. Keywords: Deep learning; Bayesian neural network; Differential privacy; Uncertainty quantification; Optimization; Calibration
Article
Full-text available
The closing of the gated ion channel in the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) can be categorized as nonpermissive to reopening, which involves the unbinding of ADP or ATP, or permissive, which does not. Identifying the type of closing is of interest, as interactions with nucleotides can be affected in mutants or by introducing agonists. However, all closings are electrically silent and difficult to differentiate. For single-channel patch clamp traces, we show that the type of the closing can be accurately determined by an inference algorithm implemented on a factor graph, which we demonstrate using both simulated and lab-obtained patch clamp traces.
Article
Gesture can be used as an important way for human–robot interaction, since it can give accurate and intuitive instructions to robots. Various sensors can be used to capture gestures; we apply three different sensors that provide different modalities for recognizing human gestures. The resulting data also has statistical properties relevant to transfer learning: the datasets share the same labels, but the source and validation datasets each have their own statistical distributions. To tackle the transfer learning problem across different sensors with such datasets, we propose a weighting method that adjusts the probability distributions of the data, which results in faster convergence. We further apply this method in a broad learning system, which has proven efficient to train and offers incremental learning capability. The results show that although these three sensors measure different parts of the body using different technologies, transfer learning is able to find the weighting correlation among the datasets. This also suggests that the proposed transfer learning can adjust data with different distributions, which may mirror the physical correlation between different parts of the body in the context of giving gestures.
Article
Full-text available
In this work we use variational inference to quantify the degree of uncertainty in deep learning model predictions of radio galaxy classification. We show that the level of model posterior variance for individual test samples is correlated with human uncertainty when labelling radio galaxies. We explore the model performance and uncertainty calibration for different weight priors and suggest that a sparse prior produces more well-calibrated uncertainty estimates. Using the posterior distributions for individual weights, we demonstrate that we can prune 30 per cent of the fully-connected layer weights without significant loss of performance by removing the weights with the lowest signal-to-noise ratio. A larger degree of pruning can be achieved using a Fisher information based ranking, but both pruning methods affect the uncertainty calibration for Fanaroff-Riley type I and type II radio galaxies differently. Like other work in this field, we experience a cold posterior effect, whereby the posterior must be down-weighted to achieve good predictive performance. We examine whether adapting the cost function to accommodate model misspecification can compensate for this effect, but find that it does not make a significant difference. We also examine the effect of principled data augmentation and find that this improves upon the baseline but also does not compensate for the observed effect. We interpret this as the cold posterior effect being due to the overly effective curation of our training sample leading to likelihood misspecification, and raise this as a potential issue for Bayesian deep learning approaches to radio galaxy classification in future.
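The signal-to-noise pruning criterion described above can be sketched directly from a mean-field variational posterior: rank weights by |mean|/std and zero out the lowest-ranked fraction. A minimal illustration; the toy means and standard deviations below are invented for the example.

```python
import numpy as np

def prune_by_snr(mu, sigma, frac):
    """Zero out the fraction `frac` of weights with the lowest
    posterior signal-to-noise ratio |mu|/sigma, keeping the rest
    at their posterior means."""
    snr = np.abs(mu) / sigma
    k = int(frac * mu.size)
    idx = np.argsort(snr)[:k]          # indices of the lowest-SNR weights
    pruned = mu.copy()
    pruned[idx] = 0.0
    return pruned

# Hypothetical posterior means/stds for five weights.
mu = np.array([2.0, 0.1, -1.5, 0.05, 3.0])
sigma = np.array([0.5, 1.0, 0.5, 0.1, 1.0])
pruned = prune_by_snr(mu, sigma, frac=0.4)
```

Weights with small means but large uncertainty are removed first, which is why this criterion tends to preserve predictive performance better than magnitude-only pruning.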
Chapter
In this chapter, the study evaluates the performance of an approach that integrates spatially explicit hyperparameter optimization.
Preprint
Full-text available
Background RNA binding protein-RNA interactions mediate a variety of processes including pre-mRNA splicing, translation, decay, polyadenylation and many others. Previous high-throughput studies have characterized general sequence features associated with increased and decreased splicing of certain exons, but these studies are limited by not knowing the mechanisms, and in particular the mediating RNA binding proteins, underlying these associations. Results Here we utilize ENCODE data from diverse data modalities to identify functional splicing regulatory elements and their associated RNA binding proteins. We identify features which make splicing events more sensitive to depletion of RNA binding proteins, as well as which RNA binding proteins act as splicing regulators sensitive to depletion. To analyze the sequence determinants underlying RBP-RNA interactions impacting splicing, we assay tens of thousands of sequence variants in a high-throughput splicing reporter called Vex-seq and confirm a small subset in their endogenous loci using CRISPR base editors. Finally, we leverage other large transcriptomic datasets to confirm the importance of the RNA binding proteins we designed experiments around and to identify additional RBPs which may act as splicing regulators of the exons studied. Conclusions This study identifies sequence and other features underlying splicing regulation mediated by specific RNA binding proteins, and validates and identifies other potentially important regulators of splicing in other large transcriptomic datasets.
Preprint
Bayesian learning is a powerful framework that combines external information about the data (background knowledge) with internal information (training data) in a logically consistent way for inference and prediction. By Bayes' rule, the external information (prior distribution) and the internal information (training-data likelihood) are combined coherently, and the posterior distribution and the posterior predictive (marginal) distribution obtained by Bayes' rule summarize the total information needed for inference and prediction, respectively. In this paper, we study the Bayesian framework of the tensor network from two perspectives. First, we introduce a prior distribution over the weights of the tensor network and predict the labels of new observations via the posterior predictive (marginal) distribution. Because the parameter integral in the normalization constant is intractable, we approximate the posterior predictive distribution by the Laplace approximation and obtain an outer-product approximation of the Hessian matrix of the posterior distribution of the tensor network model. Second, to estimate the parameters of the stationary mode, we propose a stable initialization trick to accelerate inference, by which the tensor network converges to the stationary path more efficiently and stably under gradient descent. We verify our work on the MNIST, Phishing Website and Breast Cancer datasets. We study the Bayesian properties of the Bayesian tensor network by visualizing the parameters of the model and the decision boundaries on a two-dimensional synthetic dataset. For practical purposes, our approach can reduce overfitting and improve the performance of the standard tensor network model.
Book
Full-text available
Allergy to cow's milk and dairy products is one of the most widespread food allergies in recent times, especially among infants and young children. The main allergens in cow's milk are casein, β-lactoglobulin, α-lactalbumin, bovine serum albumin and IgG. Of these, β-lactoglobulin is the most important allergen, since it is not present in human milk. To reduce the allergenic effect of cow's milk, enzymatic hydrolysis is applied, but the resulting hypoallergenic form of the milk has an unsatisfactory taste caused by the presence of bitter peptides and amino acids. Moreover, some studies have reported allergy to the milk hydrolysate itself, which confirms the need to search for new approaches to eliminating the allergenic effect of cow's milk. Gamma irradiation is known to have a positive effect on a number of food products, improving their safety and stability without compromising their nutritional and sensory qualities. With an appropriate dose of gamma radiation, the composition of products intended for treating allergies can be modulated. Irradiation with γ-rays leads to various changes in food components, including proteins. The chemical changes resulting from irradiation include fragmentation, cross-linking, aggregation and oxidation by oxygen radicals generated by the radiolysis of water. These changes can affect the antigenic properties of irradiated foods. Recent studies show that low-dose ionizing radiation applied to dairy products destroys epitopes in the protein molecules, which could reduce their allergenic effect without the negative impact on the sensory and taste characteristics of the product associated with conventional enzymatic hydrolysis. To date, no study has evaluated the effect of γ-rays on the proteins in milk as a food product.
Since the effect of irradiation depends on the process conditions, it is necessary to study the antigenicity and allergenicity of the milk proteins. The scientific focus of the present study follows current trends toward generating new scientific knowledge and technology transfer. Using an innovative technological approach, irradiation of milk proteins with γ-rays at various doses, the antigenic properties of the main allergenic source in milk, the whey protein β-lactoglobulin, are influenced. Based on the experimental data obtained, an artificial neural network model was implemented in a suitable software environment to interpret the changes in the fractions of the protein spectrum of dairy products and to predict the optimal irradiation doses. To demonstrate the effect of the study, a biological experiment was carried out with suitable lines of experimental animals (white Balb/c mice), sensitized and evaluated by physiological indicators for animals receiving the investigated food products versus control groups receiving their usual food, with assessment of symptoms, determination of IgE, tracking of weight changes, mean lifespan and hematological indicators. On the basis of this comprehensive investigation, a scientific assessment was made of the possibilities for reducing the allergenic and antigenic effect of milk proteins through alternative technological approaches, fully in line with the priority sectors of health and quality of life, biotechnology, and preservation of natural and human resources.
Article
Full-text available
Gamma irradiation is a well-known method for sterilizing different foodstuffs, including fresh cow milk. Many studies attest that low-dose irradiation of milk and milk products affects the fractions of the milk protein, thus reducing its allergenic effect and making it potentially appropriate for people with milk allergy. The purpose of this study is to evaluate the relationship between the gamma radiation dose and the size of the protein fractions, as a potential approach to decreasing the allergenic effect of the milk. In this paper, an approach is developed for predicting the dose in gamma-irradiated products by using a Bayesian regularized neural network, as a means to save the resources required for expensive electrophoretic experiments. The efficiency of the proposed neural network model is demonstrated on data for two dairy products, lyophilized cow milk and curd.
Article
Full-text available
Single cell RNA sequencing provides powerful insight into the factors that determine each cell's unique identity. Previous studies led to the surprising observation that alternative splicing among single cells is highly variable and follows a bimodal pattern: a given cell consistently produces either one or the other isoform for a particular splicing choice, with few cells producing both isoforms. Here we show that this pattern arises almost entirely from technical limitations. We analyze alternative splicing in human and mouse single cell RNA-seq datasets, and model them with a probabilistic simulator. Our simulations show that low gene expression and low capture efficiency distort the observed distribution of isoforms. This gives the appearance of binary splicing outcomes, even when the underlying reality is consistent with more than one isoform per cell. We show that accounting for the true amount of information recovered can produce biologically meaningful measurements of splicing in single cells.
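The effect described above, low capture efficiency making intermediate splicing look binary, can be reproduced with a few lines of binomial sampling. A sketch in the spirit of the paper's probabilistic simulator; all parameter values are chosen purely for illustration.

```python
import numpy as np

def observed_psi(true_psi, n_transcripts, capture_eff, n_cells, rng):
    """Each cell holds n_transcripts molecules with true isoform-1
    fraction true_psi; every molecule is captured independently with
    probability capture_eff. Returns the per-cell observed isoform
    fraction (cells with zero captured molecules are dropped)."""
    iso1 = rng.binomial(n_transcripts, true_psi, size=n_cells)
    cap1 = rng.binomial(iso1, capture_eff)
    cap2 = rng.binomial(n_transcripts - iso1, capture_eff)
    total = cap1 + cap2
    keep = total > 0
    return cap1[keep] / total[keep]

rng = np.random.default_rng(1)
psi = observed_psi(true_psi=0.5, n_transcripts=10,
                   capture_eff=0.1, n_cells=2000, rng=rng)
# With roughly one molecule captured per cell, most observed values are
# exactly 0 or 1, even though every cell truly expresses both isoforms 50/50.
binary_fraction = np.mean((psi == 0.0) | (psi == 1.0))
```

The simulation makes the paper's point concrete: the apparent bimodality is a sampling artifact of low molecule counts, not evidence of binary splicing in single cells.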
Article
Full-text available
The control of RNA alternative splicing is critical for generating biological diversity. Despite emerging genome-wide technologies to study RNA complexity, reliable and comprehensive RNA-regulatory networks have not been defined. Here, we used Bayesian networks to probabilistically model diverse data sets and predict the target networks of specific regulators. We applied this strategy to identify ~700 alternative splicing events directly regulated by the neuron-specific factor Nova in the mouse brain, integrating RNA-binding data, splicing microarray data, Nova-binding motifs, and evolutionary signatures. The resulting integrative network revealed combinatorial regulation by Nova and the neuronal splicing factor Fox, interplay between phosphorylation and splicing, and potential links to neurologic disease. Thus, we have developed a general approach to understanding mammalian RNA regulation at the systems level.
Article
Full-text available
Transcripts from approximately 95% of human multi-exon genes are subject to alternative splicing (AS). The growing interest in AS is propelled by its prominent contribution to transcriptome and proteome complexity and the role of aberrant AS in numerous diseases. Recent technological advances enable thousands of exons to be simultaneously profiled across diverse cell types and cellular conditions, but accurate identification of condition-specific splicing changes is required to elucidate the underlying regulatory programs or link the splicing changes to specific diseases. We present a probabilistic model tailored for high-throughput AS data, where observed isoform levels are explained as combinations of condition-specific AS signals. According to our formulation, given an AS dataset our tasks are to detect common signals in the data and identify the exons relevant to each signal. Our model can incorporate prior knowledge about underlying AS signals, measurement quality and gene expression level effects. Using a large-scale multi-tissue AS dataset, we demonstrate the advantage of our method over standard alternative approaches. In addition, we describe newly found tissue-specific AS signals which were verified experimentally, and discuss associated regulatory features. Supplementary data are available at Bioinformatics online.
Article
Full-text available
Alternative splicing has a crucial role in the generation of biological complexity, and its misregulation is often involved in human disease. Here we describe the assembly of a 'splicing code', which uses combinations of hundreds of RNA features to predict tissue-dependent changes in alternative splicing for thousands of exons. The code determines new classes of splicing patterns, identifies distinct regulatory programs in different tissues, and identifies mutation-verified regulatory sequences. Widespread regulatory strategies are revealed, including the use of unexpectedly large combinations of features, the establishment of low exon inclusion levels that are overcome by features in specific tissues, the appearance of features deeper into introns than previously appreciated, and the modulation of splice variant levels by transcript structure characteristics. The code detected a class of exons whose inclusion silences expression in adult tissues by activating nonsense-mediated messenger RNA decay, but whose exclusion promotes expression during embryogenesis. The code facilitates the discovery and detailed characterization of regulated alternative splicing events on a genome-wide scale.
Article
Full-text available
Metazoan genomes encode hundreds of RNA-binding proteins (RBPs) but RNA-binding preferences for relatively few RBPs have been well defined. Current techniques for determining RNA targets, including in vitro selection and RNA co-immunoprecipitation, require significant time and labor investment. Here we introduce RNAcompete, a method for the systematic analysis of RNA binding specificities that uses a single binding reaction to determine the relative preferences of RBPs for short RNAs that contain a complete range of k-mers in structured and unstructured RNA contexts. We tested RNAcompete by analyzing nine diverse RBPs (HuR, Vts1, FUSIP1, PTB, U1A, SF2/ASF, SLM2, RBM4 and YB1). RNAcompete identified expected and previously unknown RNA binding preferences. Using in vitro and in vivo binding data, we demonstrate that preferences for individual 7-mers identified by RNAcompete are a more accurate representation of binding activity than are conventional motif models. We anticipate that RNAcompete will be a valuable tool for the study of RNA-protein interactions.
Article
Full-text available
We carried out the first analysis of alternative splicing complexity in human tissues using mRNA-Seq data. New splice junctions were detected in approximately 20% of multiexon genes, many of which are tissue specific. By combining mRNA-Seq and EST-cDNA sequence data, we estimate that transcripts from approximately 95% of multiexon genes undergo alternative splicing and that there are approximately 100,000 intermediate- to high-abundance alternative splicing events in major human tissues. From a comparison with quantitative alternative splicing microarray profiling data, we also show that mRNA-Seq data provide reliable measurements for exon inclusion levels.
Article
Full-text available
Protein-RNA interactions have critical roles in all aspects of gene expression. However, applying biochemical methods to understand such interactions in living tissues has been challenging. Here we develop a genome-wide means of mapping protein-RNA binding sites in vivo, by high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP). HITS-CLIP analysis of the neuron-specific splicing factor Nova revealed extremely reproducible RNA-binding maps in multiple mouse brains. These maps provide genome-wide in vivo biochemical footprints confirming the previous prediction that the position of Nova binding determines the outcome of alternative splicing; moreover, they are sufficiently powerful to predict Nova action de novo. HITS-CLIP revealed a large number of Nova-RNA interactions in 3' untranslated regions, leading to the discovery that Nova regulates alternative polyadenylation in the brain. HITS-CLIP, therefore, provides a robust, unbiased means to identify functional protein-RNA interactions in vivo.
Article
Full-text available
Through alternative processing of pre-messenger RNAs, individual mammalian genes often produce multiple mRNA and protein isoforms that may have related, distinct or even opposing functions. Here we report an in-depth analysis of 15 diverse human tissue and cell line transcriptomes on the basis of deep sequencing of complementary DNA fragments, yielding a digital inventory of gene and mRNA isoform expression. Analyses in which sequence reads are mapped to exon-exon junctions indicated that 92-94% of human genes undergo alternative splicing, 86% with a minor isoform frequency of 15% or more. Differences in isoform-specific read densities indicated that most alternative splicing and alternative cleavage and polyadenylation events vary between tissues, whereas variation between individuals was approximately twofold to threefold less common. Extreme or 'switch-like' regulation of splicing between tissues was associated with increased sequence conservation in regulatory regions and with generation of full-length open reading frames. Patterns of alternative splicing and alternative cleavage and polyadenylation were strongly correlated across tissues, suggesting coordinated regulation of these processes, and sequence conservation of a subset of known regulatory motifs in both alternative introns and 3' untranslated regions suggested common involvement of specific factors in tissue-level regulation of both splicing and polyadenylation.
Article
Full-text available
The neural cell-specific N1 exon of the c-src pre-mRNA is both negatively regulated in nonneural cells and positively regulated in neurons. We previously identified conserved intronic elements flanking N1 that direct the repression of N1 splicing in a nonneural HeLa cell extract. The upstream repressor elements are located within the polypyrimidine tract of the N1 exon 3' splice site. A short RNA containing this 3' splice site sequence can sequester trans-acting factors in the HeLa extract to allow splicing of N1. We now show that these upstream repressor elements specifically interact with the polypyrimidine tract binding protein (PTB). Mutations in the polypyrimidine tract reduce both PTB binding and the ability of the competitor RNA to derepress splicing. Moreover, purified PTB protein restores the repression of N1 splicing in an extract derepressed by a competitor RNA. In this system, the PTB protein is acting across the N1 exon to regulate the splicing of N1 to the downstream exon 4. This mechanism is in contrast to other cases of splicing regulation by PTB, in which the protein represses the splice site to which it binds.
Article
Full-text available
Knowledge of the functional cis-regulatory elements that regulate constitutive and alternative pre-mRNA splicing is fundamental for biology and medicine. Here we undertook a genome-wide comparative genomics approach using available mammalian genomes to identify conserved intronic splicing regulatory elements (ISREs). Our approach yielded 314 ISREs, and insertions of ~70 ISREs between competing splice sites demonstrated that 84% of ISREs altered 5' and 94% altered 3' splice site choice in human cells. Consistent with our experiments, comparisons of ISREs to known splicing regulatory elements revealed that 40%-45% of ISREs might have dual roles as exonic splicing silencers. Supporting a role for ISREs in alternative splicing, we found that 30%-50% of ISREs were enriched near alternatively spliced (AS) exons, and included almost all known binding sites of tissue-specific alternative splicing factors. Further, we observed that genes harboring ISRE-proximal exons have biases for tissue expression and molecular functions that are ISRE-specific. Finally, we discovered that for Nova1, neuronal PTB, hnRNP C, and FOX1, the most frequently occurring ISRE proximal to an alternative conserved exon in the splicing factor strongly resembled its own known RNA binding site, suggesting a novel application of ISRE density and the propensity for splicing factors to auto-regulate to associate RNA binding sites to splicing factors. Our results demonstrate that ISREs are crucial building blocks in understanding general and tissue-specific AS regulation and the biological pathways and functions regulated by these AS events.
Article
Full-text available
Alternative splicing (AS) functions to expand proteomic complexity and plays numerous important roles in gene regulation. However, the extent to which AS coordinates functions in a cell and tissue type specific manner is not known. Moreover, the sequence code that underlies cell and tissue type specific regulation of AS is poorly understood. Using quantitative AS microarray profiling, we have identified a large number of widely expressed mouse genes that contain single or coordinated pairs of alternative exons that are spliced in a tissue regulated fashion. The majority of these AS events display differential regulation in central nervous system (CNS) tissues. Approximately half of the corresponding genes have neural specific functions and operate in common processes and interconnected pathways. Differential regulation of AS in the CNS tissues correlates strongly with a set of mostly new motifs that are predominantly located in the intron and constitutive exon sequences neighboring CNS-regulated alternative exons. Different subsets of these motifs are correlated with either increased inclusion or increased exclusion of alternative exons in CNS tissues, relative to the other profiled tissues. Our findings provide new evidence that specific cellular processes in the mammalian CNS are coordinated at the level of AS, and that a complex splicing code underlies CNS specific AS regulation. This code appears to comprise many new motifs, some of which are located in the constitutive exons neighboring regulated alternative exons. These data provide a basis for understanding the molecular mechanisms by which the tissue specific functions of widely expressed genes are coordinated at the level of AS.
Article
Full-text available
Alternative splicing of pre-mRNAs is a major contributor to both proteomic diversity and control of gene expression levels. Splicing is tightly regulated in different tissues and developmental stages, and its disruption can lead to a wide range of human diseases. An important long-term goal in the splicing field is to determine a set of rules or "code" for splicing that will enable prediction of the splicing pattern of any primary transcript from its sequence. Outside of the core splice site motifs, the bulk of the information required for splicing is thought to be contained in exonic and intronic cis-regulatory elements that function by recruitment of sequence-specific RNA-binding protein factors that either activate or repress the use of adjacent splice sites. Here, we summarize the current state of knowledge of splicing cis-regulatory elements and their context-dependent effects on splicing, emphasizing recent global/genome-wide studies and open questions.
Article
A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible (1) objective comparisons between solutions using alternative network architectures, (2) objective stopping rules for network pruning or growing procedures, (3) objective choice of magnitude and type of weight decay terms or additive regularizers (for penalizing large weights, etc.), (4) a measure of the effective number of well-determined parameters in a model, (5) quantified estimates of the error bars on network parameters and on network output, and (6) objective comparisons with alternative learning and interpolation models such as splines and radial basis functions. The Bayesian "evidence" automatically embodies "Occam's razor," penalizing overflexible and overcomplex models. The Bayesian approach helps detect poor underlying assumptions in learning models. For learning models well matched to a problem, a good correlation between generalization ability and the Bayesian evidence is obtained.
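The "evidence" framework lends itself to a compact numerical illustration for linear models, where the marginal likelihood is available in closed form. A sketch, not MacKay's setup: the prior precision alpha, noise precision beta and the synthetic data are assumptions made for the example.

```python
import numpy as np

def log_evidence(X, y, alpha=1.0, beta=25.0):
    """Closed-form log marginal likelihood ('evidence') for linear
    regression y = X w + noise, with prior w ~ N(0, (1/alpha) I) and
    Gaussian noise of precision beta. The log-determinant term is the
    automatic Occam penalty: extra columns in X lower the evidence
    unless they genuinely improve the fit."""
    n, d = X.shape
    A = alpha * np.eye(d) + beta * X.T @ X        # posterior precision
    m = beta * np.linalg.solve(A, X.T @ y)        # posterior mean
    resid = y - X @ m
    return (0.5 * d * np.log(alpha) + 0.5 * n * np.log(beta)
            - 0.5 * beta * resid @ resid - 0.5 * alpha * m @ m
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = 1.0 + 2.0 * x + 0.2 * rng.normal(size=40)     # truly linear data
# Compare polynomial models of degree 1, 2 and 5 by their evidence.
scores = {deg: log_evidence(np.vander(x, deg + 1, increasing=True), y)
          for deg in (1, 2, 5)}
```

For linear data, the degree-5 polynomial fits slightly better but pays a larger Occam penalty, so its evidence is lower; this is the sense in which the evidence "automatically embodies Occam's razor".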
Book
From the Publisher: Artificial "neural networks" are now widely used as flexible models for regression classification applications, but questions remain regarding what these models mean, and how they can safely be used when training data is limited. Bayesian Learning for Neural Networks shows that Bayesian methods allow complex neural network models to be used without fear of the "overfitting" that can occur with traditional neural network learning methods. Insight into the nature of these complex Bayesian models is provided by a theoretical investigation of the priors over functions that underlie them. Use of these models in practice is made possible using Markov chain Monte Carlo techniques. Both the theoretical and computational aspects of this work are of wider statistical interest, as they contribute to a better understanding of how Bayesian methods can be applied to complex problems. Presupposing only the basic knowledge of probability and statistics, this book should be of interest to many researchers in statistics, engineering, and artificial intelligence. Software for Unix systems that implements the methods described is freely available over the Internet.
Article
We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
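The weight-adjustment rule can be written out explicitly for a tiny two-layer network. A minimal sketch (the network sizes, tanh hidden units and squared-error loss are illustrative choices, not the paper's setup), with the hand-derived gradients checked against a finite difference.

```python
import numpy as np

def forward(params, x):
    """Two-layer network with tanh hidden units (forward pass)."""
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2, h

def backprop(params, x, t):
    """Back-propagation for E = 0.5*||y - t||^2: the output error is
    propagated backwards, with the hidden 'delta' picking up the
    tanh derivative (1 - h^2)."""
    W1, b1, W2, b2 = params
    y, h = forward(params, x)
    dy = y - t                            # output-layer error signal
    dW2, db2 = np.outer(dy, h), dy
    dh = (W2.T @ dy) * (1.0 - h ** 2)     # hidden-layer delta
    dW1, db1 = np.outer(dh, x), dh
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(1, 3)), rng.normal(size=1)]
x, t = np.array([0.5, -1.0]), np.array([0.3])
grads = backprop(params, x, t)

# Finite-difference check of one gradient entry.
def loss(p):
    y, _ = forward(p, x)
    return 0.5 * np.sum((y - t) ** 2)

eps = 1e-6
perturbed = [p.copy() for p in params]
perturbed[0][0, 0] += eps
numeric = (loss(perturbed) - loss(params)) / eps
```

Repeatedly subtracting a small multiple of these gradients from the weights is the learning procedure the abstract describes; the hidden deltas are what allow internal units to come to represent useful features.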
Article
Alternative splicing plays critical roles in differentiation, development, and disease and is a major source for protein diversity in higher eukaryotes. Analysis of alternative splicing regulation has traditionally focused on RNA sequence elements and their associated splicing factors, but recent provocative studies point to a key function of chromatin structure and histone modifications in alternative splicing regulation. These insights suggest that epigenetic regulation determines not only what parts of the genome are expressed but also how they are spliced.
Article
Alternative splicing of messenger RNA (mRNA) precursors affects the majority of human genes, has a considerable impact on eukaryotic gene function and offers distinct opportunities for regulation. Alterations in alternative splicing can cause or modify the progression of a significant number of pathologies. Recent high-throughput technologies have uncovered a wealth of transcript diversity generated by alternative splicing, as well as examples for how this diversity can be established and become misregulated. A variety of mechanisms modulate splice site choice coordinately with other cellular processes, from transcription and mRNA editing or decay to miRNA-based regulation and telomerase function. Alternative splicing studies can contribute to our understanding of multiple biological processes, including genetic diversity, speciation, cell/stem cell differentiation, nervous system function, neuromuscular disorders and tumour progression.
Article
The fibronectin EIIIB exon is alternatively spliced in a cell-type-specific manner, and TGCATG repeats in the intron downstream of EIIIB have been implicated in this regulation. Analysis of the intron sequence from several vertebrates shows that the pattern of repeats in the 3′ half of the intron is evolutionarily conserved. Point mutations in certain highly conserved repeats greatly reduce EIIIB inclusion, suggesting that a multicomponent complex may recognize the repeats. Expression of the SR protein SRp40, SRp20, or ASF/SF2 stimulates EIIIB inclusion. Studies of the interplay between mutations in the repeats and SRp40-stimulated inclusion suggest that the repeats are recognized in many, if not all, cell types, and that EIIIB inclusion may be regulated by quantitative changes in multiple factors.
Article
Recent analyses of sequence and microarray data have suggested that alternative splicing plays a major role in the generation of proteomic and functional diversity in metazoan organisms. Efforts are now being directed at establishing the full repertoire of functionally relevant transcript variants generated by alternative splicing, the specific roles of such variants in normal and disease physiology, and how alternative splicing is coordinated on a global level to achieve cell- and tissue-specific functions. Recent progress in these areas is summarized in this review.
Article
Human genes contain a dense array of diverse cis-acting elements that make up a code required for the expression of correctly spliced mRNAs. Alternative splicing generates a highly dynamic human proteome through networks of coordinated splicing events. Cis- and trans-acting mutations that disrupt the splicing code or the machinery required for splicing and its regulation have roles in various diseases, and recent studies have provided new insights into the mechanisms by which these effects occur. An unexpectedly large fraction of exonic mutations exhibit a primary pathogenic effect on splicing. Furthermore, normal genetic variation significantly contributes to disease severity and susceptibility by affecting splicing efficiency.
Article
DNA microarrays can provide insight into genetic changes that characterize different stages of a disease process. Accurate identification of these changes has significant therapeutic and diagnostic implications. Statistical analysis for multistage (multigroup) data is challenging, however. ANOVA-based extensions of two-sample Z-tests, a popular method for detecting differentially expressed genes in two groups, do not work well in multigroup settings. False detection rates are high because of variability of the ordinary least squares estimators and because of regression to the mean induced by correlated parameter estimates. We develop a Bayesian rescaled spike and slab hierarchical model specifically designed for the multigroup gene detection problem. Data preprocessing steps are introduced to deal with unique features of microarray data and to enhance selection performance. We show theoretically that spike and slab models naturally encourage sparse solutions through a process called "selective shrinkage". This translates into oracle-like gene selection risk performance compared with ordinary least squares estimates. The methodology is illustrated on a large microarray repository of samples from different clinical stages of metastatic colon cancer. Through a functional analysis of selected genes, we show that spike and slab models identify important biological signals while minimizing biologically implausible false detections.
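The "selective shrinkage" behaviour of spike and slab priors can be sketched in a minimal, hedged form. The example below is not the paper's full rescaled hierarchical model; it assumes a single observed effect per gene, `y_j ~ N(theta_j, s2)`, where `theta_j` is exactly zero with probability `1 - w` (the spike) or drawn from a wide Gaussian slab `N(0, t2)` with probability `w`. Both mixture components can then be integrated in closed form.

```python
import numpy as np

def posterior_inclusion(y, w=0.1, s2=1.0, t2=25.0):
    """Posterior probability that y comes from the slab (a real effect)."""
    def norm_pdf(v, var):
        return np.exp(-0.5 * v**2 / var) / np.sqrt(2 * np.pi * var)
    slab = w * norm_pdf(y, s2 + t2)        # theta integrated out of the slab
    spike = (1 - w) * norm_pdf(y, s2)      # theta = 0 under the spike
    return slab / (slab + spike)

def posterior_mean(y, w=0.1, s2=1.0, t2=25.0):
    # E[theta | y] = P(slab | y) * (t2 / (s2 + t2)) * y:
    # small observed effects are shrunk hard toward 0,
    # large effects are retained nearly untouched.
    return posterior_inclusion(y, w, s2, t2) * (t2 / (s2 + t2)) * y

print(posterior_mean(0.5))  # near 0: a small effect is shrunk away
print(posterior_mean(8.0))  # near 8: a large effect survives selection
```

This is the sense in which such priors "naturally encourage sparse solutions": shrinkage is applied selectively, in contrast to ridge-style penalties that shrink every estimate by the same factor.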
Chan,R. and Black,D. (1997) The polypyrimidine tract binding protein binds upstream of the neural cell-specific c-src exon N1 to repress the splicing of the intron downstream. Mol. Cell. Biol., 17, 4667.
Fagnani,M. et al. (2007) Functional coordination of alternative splicing in the mammalian central nervous system. Genome Biol., 8, R108.
Hartmann,B. and Valcárcel,J. (2009) Decrypting the genome’s alternative messages.
Neal,R. (1996) Bayesian Learning for Neural Networks, Vol. 118. Springer, NY.
Pan,Q. et al. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet., 40, 1413–1415.
Yeo,G. et al. (2007) Discovery and analysis of evolutionarily conserved intronic splicing regulatory elements. PLoS Genet., 3, e85.
Zhang,C. et al. (2010) Integrative modeling defines the Nova splicing-regulatory network and its combinatorial controls. Science, 329, 439.
Barash,Y. et al. (2010b) Model-based detection of alternative splicing signals. Bioinformatics, 26, i325.
Bishop,C.M. (2006) Pattern Recognition and Machine Learning. Springer, NY.
Blencowe,B. (2006) Alternative splicing: new insights from global analyses. Cell, 126, 37–47.
and by feeding the feature vectors and target splicing patterns for all species into the learning algorithm. It was found that using large numbers of features improved code quality, pointing to the importance of further exploring new feature types, such as those derived using in vivo (Licatalosi et al., 2008) or in vitro (Ray et al., 2009) RNA binding data, DNA, chromatin structure and histone modifications (Luco et al., 2011).
Wang,G. and Cooper,T. (2007) Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet., 8, 749–761.