SOFTWARE Open Access
AVISPA: a web tool for the prediction and
analysis of alternative splicing
Yoseph Barash^1,2,4,5*, Jorge Vaquero-Garcia^1,2, Juan González-Vallinas^1,3, Hui Yuan Xiong^4, Weijun Gao^4, Leo J Lee^4 and Brendan J Frey^4,5
Abstract
Transcriptome complexity and its relation to numerous diseases underpins the need to predict in silico splice
variants and the regulatory elements that affect them. Building upon our recently described splicing code, we
developed AVISPA, a Galaxy-based web tool for splicing prediction and analysis. Given an exon and its proximal
sequence, the tool predicts whether the exon is alternatively spliced, displays tissue-dependent splicing patterns,
and whether it has associated regulatory elements. We assess AVISPA's accuracy on an independent dataset of
tissue-dependent exons, and illustrate how the tool can be applied to analyze a gene of interest. AVISPA is
available at http://avispa.biociphers.org.
Alternative splicing (AS) is estimated to affect tran-
scripts from over 95% of human multi-exon genes [1,2],
with the most common class of AS involving cassette
exons. Thousands of alternative cassette exons have
been found to be differentially spliced between mamma-
lian tissues, with tissues such as the brain displaying the
most complex patterns [1,2]. These observations and the
association of many splicing defects with diseases [3]
motivated the recent derivation of a splicing code. The
code, comprising a model with a set of rules that can
predict splicing outcomes given genomic sequence and
cellular context [4,5], used over 1,000 regulatory fea-
tures. Trained using inclusion measurements for 3,700
cassette exons across 27 mouse tissues, the code's model
was shown to predict differential AS in four tissue
groups: the central nervous system (CNS), muscle, di-
gestive, and embryo versus adult tissues.
The derivation of a predictive splicing code served as
proof-of-concept and enabled insights into RNA biogen-
esis [5,6], but was limited in scope. Specifically, it was
only applied to a subset of alternative exons in specific
studies. However, given the importance of splicing in the
study of gene regulation, development and disease, it
became important to translate the splicing code models
into a tool that would be accessible for researchers
in a wide range of fields. Here, we present AVISPA
(Advanced Visualization of Splicing Prediction and Ana-
lysis), a web tool that enables both prediction and spli-
cing analysis of alternative and tissue-dependent exons
in any gene of interest. Given an exon, the tool predicts
whether it is alternative and whether its inclusion is ex-
pected to change in different tissues. It reports whether
the exon is known to be alternative based on an internal
transcripts database, and performs in silico splicing analysis,
identifying putative regulatory elements and mapping
those as tracks in the genome browser.
AVISPA's pipeline is illustrated in Figure 1. Users submit a query by specifying the sequence or genomic coordinates of either a single exon, or a triplet of exons that includes the immediate up- and downstream exons of the query exon. In the pre-processing step, the query is matched against an internal database of exon triplets mined from known transcripts and mapped to the reference genome. The result of the pre-processing is reported in AVISPA's output and indicates existing evidence for whether the exon is alternatively spliced based on, for example, alignments of cDNA and EST data. After the query has been successfully matched, RNA features are extracted from the query exon and flanking regions [5]. At the first prediction stage, the extracted features are used to predict whether the query exon is alternatively or constitutively spliced. If the query is predicted to be an alternative cassette exon, a second prediction step assesses whether the exon is differentially included in specific tissues.
* Correspondence: yosephb@upenn.edu
Equal contributors
1 Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
Full list of author information is available at the end of the article
© 2013 Barash et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Barash et al. Genome Biology 2013, 14:R114
http://genomebiology.com/2013/14/10/R114
The new web tool offers marked improvements over available software. First, it offers 'genome-wide' tissue-dependent splicing predictions, where any exon can be submitted as a query. By contrast, the original work only allowed analysis on a previously mined set of approximately 12,000 cassette exons, while other tools focus on quantifying experimental data or general splice site and motif analysis [7-9]. Second, AVISPA offers a new in silico analysis of regulatory features and the mapping of putative regulatory sequence motifs in the genome. As part of this analysis, motifs found to be robustly included in the Bayesian ensemble of models and present in the query are removed in silico to determine their effect on splicing prediction. The relative effect of these feature removals is reported as a bar chart of the normalized feature effect (NFE). The putative regulatory motifs are also mapped to the genome using the UCSC genome browser, where they can be combined with other tracks, such as known single nucleotide polymorphisms and binding measurements of known splicing factors [10]. Additionally, the enrichment of the query's features is compared to reference groups such as alternative or constitutively spliced exons in AVISPA's database. Feature enrichment is reported using a standard heat map ranging from blue, for relatively low values, to red, for relatively high values. For example, a relatively strong 3′ splice site will appear red, indicating a high score, while a weak splice site will be marked blue.
The new tool also includes several other improve-
ments. First, the prediction technique is now based on a
Bayesian neural network, which provides improved pre-
diction accuracy compared to a battery of other methods
[11]. Second, the original dataset of 3,700 cassette exons
has been expanded to approximately 30,000 exons using
data from 33 experiments in 11 mouse tissues [12].
Third, AVISPA uses an extended set of features that in-
clude computationally predicted nucleosome occupancy
[13] together with primary sequence motifs implicated
in general splicing regulation.
Assessing splicing prediction accuracy
The new two-stage prediction paradigm, combined with the expanded dataset, yields a significant improvement in detecting alternative cassette exons (Figure 2a). For example, using only tissue-dependent splicing predictors achieves an area under the curve (AUC) of 64% for distinguishing between alternative and constitutive exons, compared to 86% by the first stage classifier. The improved accuracy of 94% AUC achieved for detecting tissue-dependent exons is to be expected, as many regulatory features and higher intronic conservation are associated with such exons. Notably, AVISPA's sequence-based predictions offer a significant improvement compared to a similar classifier that directly uses normalized exon expression measurements from 33 experiments [12]. The latter achieves an overall lower accuracy of 71% AUC, with a 2.5-fold lower sensitivity (21%, versus 54% for AVISPA) for high-confidence events at a false positive rate of 2%. These results illustrate the usefulness of the new tool, which generalizes over experimental conditions and is not limited by technical factors such as microarray noise or read coverage. We note that these accuracy estimates can be considered as lower bounds, as some of the events labeled as constitutive in our database may be alternative.
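The two accuracy measures quoted above, AUC and sensitivity at a fixed false positive rate, can be illustrated with a minimal sketch. The function names and the toy scores below are hypothetical and are not part of AVISPA; the sketch only assumes that each exon receives a real-valued score and that higher scores indicate the positive class.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity_at_fpr(pos_scores, neg_scores, max_fpr):
    """Best sensitivity over all score thresholds whose FPR is <= max_fpr."""
    best = 0.0
    for threshold in sorted(set(pos_scores) | set(neg_scores)):
        fpr = sum(n >= threshold for n in neg_scores) / len(neg_scores)
        if fpr <= max_fpr:
            tpr = sum(p >= threshold for p in pos_scores) / len(pos_scores)
            best = max(best, tpr)
    return best


# Toy scores (hypothetical): alternative exons as positives,
# constitutive exons as negatives.
alternative = [0.9, 0.8, 0.7, 0.4]
constitutive = [0.6, 0.5, 0.3, 0.2, 0.1]
auc = roc_auc(alternative, constitutive)
sens = sensitivity_at_fpr(alternative, constitutive, 0.0)
```

The pairwise form of the AUC is quadratic in the number of exons; it is used here only because it makes the definition explicit.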
Figure 1 AVISPA's analysis pipeline. The analysis is composed of the following steps. (1) Query submission: users submit a query composed of either a single exon of interest or an exon triplet that also specifies the up- and downstream exons. (2) Query matching: the submitted query is first matched against internal databases (DB) of known transcripts and alternative exons. If no match is found the query is searched against the reference genome. If the query cannot be matched (red cross) an error is reported. (3) Splicing prediction: a successfully matched query (light blue rectangle) is scored as an alternative cassette exon, followed by scoring for differential splicing in four tissue groups. (4) Splicing analysis: if the query's predictions pass a user-defined significance threshold a splicing analysis is performed. Analysis includes feature enrichment, effect of in silico motif removal on splicing predictions, and mapping putative regulatory motifs to the genome. A visual summary of both predictions and splicing analysis is produced (right).
The new tool also achieves significant improvement in
detecting tissue-dependent exons (Figure 2b). The overall
accuracy in discriminating between tissue-dependent
and non-tissue-dependent exons is 89% AUC, but varies
considerably between tissues and between differential in-
clusion and exclusion in the same tissue type. For ex-
ample, the highest accuracy was achieved for detecting
increased inclusion of exons in CNS (94% AUC) and
muscle tissues (91% AUC), while the lowest accuracy was
for detecting increased exclusion in CNS (85% AUC) and
increased inclusion in embryonic tissues (82% AUC).
In order to test AVISPA on an independent dataset,
we computed predictions for a set of cassette exons re-
cently shown to be regulated by the Muscleblind-like
proteins Mbnl1/2 in mouse brain, muscle, and heart
[14]. Figure 2c shows AVISPA easily distinguished these
exons from constitutive exons (97% AUC), similar to its
performance in detecting tissue-dependent alternative
exons in the original test set. In discriminating the
Mbnl1/2-regulated exons from non-CNS- and non-
muscle-dependent exons, AVISPA achieves an AUC of
93% and 94%, respectively, while in silico removal of
Mbnl1/2 caused, on average, an almost two-fold larger
effect for Mbnl1/2-regulated exons compared to the ef-
fect for non-muscle- and non-heart-dependent exons.
The improved accuracy in detecting Mbnl1/2-regulated
exons compared to the detection of tissue-dependent
exons in the original test data is likely due to a lower
false detection rate from the RNA-Seq and CLIP-Seq ex-
periments in [14].
Finally, we also tested whether the regulatory features
added in the web tool were useful for splicing prediction.
As expected, many of the sequence motifs implicated in
general splicing regulation were included in the code,
especially for differentiating between alternative and
constitutive exons. By contrast, the relation between nu-
cleosome occupancy and alternative splicing is less well
understood, and has garnered much research attention
[15,16]. We found that the model selected features
representing nucleosome occupancy around the alterna-
tive exon, but training the model without these features
resulted in similar prediction accuracy (data not shown).
This result indicates that other features in our model,
such as di- and tri-nucleotide frequencies, already cap-
tured the 'predictive power' of computationally derived
nucleosome position features.
Vegfa in silico splicing analysis
Previous work demonstrated how the splicing code model could be used to identify new regulatory elements, detect novel tissue-dependent splicing events, and study the evolution of splicing across vertebrates [6]. Here, we illustrate how the new tool can be used to analyze a well-studied gene of major interest. We applied AVISPA to the vascular endothelial growth factor A (Vegfa) gene. Vegfa has a complex and highly conserved pattern of alternative splicing that changes across tissues and developmental stages [17,18]. Its role in angiogenesis, which is controlled in part by alternative splicing, has made it an attractive target of several anticancer therapies. Accordingly, there is considerable interest in identifying the factors that regulate the splicing of Vegfa transcripts [18,19]. Analyzing all Vegfa exon triplets revealed that only exons 6 and 7 were predicted to be cassette exons, with scores corresponding to false positive rates of 0.009 and 0.017, respectively. For comparison, the other exons' scores corresponded to a false positive rate of 0.22 or higher (data not shown). These predictions are in line with annotated transcripts, many of which skip exon 6, one of which skips exon 7 (ENSMUST00000113519), and several of which skip both. Exons 6 and 7 were also both predicted, with a false positive rate of less than 0.025, to exhibit differential splicing in all four major tissue groups modeled. While confidence in differential splicing was high, the predictions were not conclusive as to whether a relative increase or decrease of exon inclusion would occur in the tissues. These results reflect the conserved and complex splicing pattern of Vegfa, with RT-PCR experiments showing exon 6 to have a complex bi-phasic increase of inclusion in developing mouse and chicken heart [18]. Prediction of other splice variations of Vegfa, such as the 3′ splice site variation in exon 8, is currently not supported by the tool.
[Figure 2 plots: ROC curves, sensitivity (y-axis) versus 1 − specificity (x-axis). Legend AUC values: (a) Alternative exons: Pr[Ts] 64%; 33 exon arrays 71%; Pr[AS] 86%; Pr[AS] Tissue dep. 91%. (b) Tissue-dependent exons: CNS.Inc 94%; CNS.Exc 85%; Muscle.Inc 91%; Muscle.Exc 86%; EM.Inc 82%; EM.Exc 89%; Digestive.Inc 87%; Digestive.Exc 88%; All Tiss. Dep. 89%. (c) Mbnl1/2-dependent exons: Pr[AS] Mbnl dep. 97%; CNS Mbnl dep. 94%; Muscle Mbnl dep. 95%.]
Figure 2 Prediction accuracy. (a) Differentiating alternative (n = 11,773) from constitutive (n = 9,638) exons. Detecting which exons are alternative (green) is significantly improved compared to a classifier that uses exon expression measurements from 33 experiments (cyan), and compared to the original classifier trained to detect only tissue-dependent cassette exons (red). Detection of exons that exhibit tissue-dependent splicing changes (blue, n = 659) is much more accurate. Numbers within each legend represent the area under the curve (AUC). (b) Identifying tissue-dependent splicing. Detecting tissue-dependent splicing changes (n = 865) from a random set of non-tissue-dependent exons (n = 4,000) achieves an overall accuracy of 89% AUC (black). Accuracy varies considerably between tissues and for detecting increased inclusion (solid line) or exclusion (dashed) in a tissue. (c) Detection accuracy for an independent set of Mbnl1/2-dependent exons [14] (n = 461). Differentiating between Mbnl1/2-dependent exons and constitutive exons achieves 97% AUC. Accuracy in detecting Mbnl1/2-dependent exons from a random set of non-tissue-dependent exons (n = 2,000) is approximately 94% AUC for both brain (blue) and muscle (red).
Figure 3 shows the regulatory feature analysis for differential inclusion of Vegfa exon 6 in muscle. The enrichment analysis in Figure 3a highlights that the alternative exon is depleted of non-tissue-specific exonic splicing enhancers and is highly enriched with exonic splicing silencers. Other highlighted features are enriched secondary structure-free regions in the upstream intron, a distant first AG nucleotide upstream and a particularly short preceding exon 5. The preceding exon, for example, is 32 bp long, and the enrichment analysis indicates that only 0.127% of the tool's reference set of alternative exons has a shorter preceding exon. The most dominant effect of in silico motif removal (Figure 3b) is for CU-rich elements known to bind Ptb1/2, followed by an ACUAAY motif known to bind Quaking (Qk). These splice factors have not been previously reported to regulate Vegfa, but a recent study estimates that 39% of regulated exons during myogenesis are under the control of one or both of these splicing factors [20]. A smaller effect on splicing prediction in muscle is associated with intronic motifs known to bind Cugbp1/2 and Muscleblind-like protein (Mbnl1/2). Both Cugbp1/2 and Mbnl1/2 have been shown to play an important role in regulating splicing in developing hearts. Overexpressing Cugbp1 or knockdown of Mbnl1 in the adult mouse heart did not alter exon 6 inclusion levels significantly [18], but recent results point to possible compensatory effects between Mbnl1 and Mbnl2 [14]. Other elements implicated in Vegfa splicing regulation include the short YCAY motifs known to bind Nova proteins [21] and a UGCAUG motif, known to bind the brain- and muscle-specific splicing factor Fox-1 (A2bp1) and its paralog Fox-2 (Rbm9) [22]. While the Fox-1/2 binding site is highly conserved, it resides over 1 kb downstream of exon 6 and Fox-1/2 have not been previously reported to regulate Vegfa. However, recent results indicate that Fox-2 knockdown in mice clearly alters the Vegfa splicing pattern during heart development (Xiang-Dong Fu, personal communication). Smaller effects associated with non-tissue-specific regulation include G-rich elements, known to bind hnRNP-F/H, and U-rich elements that are known to bind hnRNP-C and Tiar/Tia1 [23]. Notably, Tia1 was previously reported to regulate Vegfa isoform expression [24]. Overall, our exploratory analysis of Vegfa splicing is consistent with previous results and offers new insights into the mechanism of Vegfa regulation that are supported by recent experiments.
Figure 3 Analysis of Vegfa exon 6 muscle-dependent inclusion. A subset of the summary page produced by AVISPA is shown. (a) Feature enrichment analysis: the values of the features listed on the left are computed for Vegfa exon 6 and compared against matching feature values in a set of labeled exons. The four sets of exons compared against here are alternative exons ('AS', third column from the left), constitutive exons ('Const', third column from the right), exons differentially included in muscle ('Muscle Inc', second column from the right), and exons differentially excluded in muscle ('Muscle Exc', rightmost column). Relative enrichment or depletion of features is indicated using the heat map on the right. Only features with significantly low (blue) and high (red) values are shown here. The genomic region of each feature is indicated in the second column from the left using the notation and colors in the top figure. (b) Stacked bar chart (left) of the normalized feature effect (NFE, y-axis) on splicing prediction. Only the top motifs are shown. Motif regions are annotated using the color scheme depicted below. Mapping of the motifs onto the UCSC genome browser is shown on the right. Tracks combining all motifs used by the code model (red), the unbiased motif search [5] (grey scaled), and conservation (blue) are added at the bottom.
In summary, we presented a new tool, AVISPA, for in silico prediction and analysis of alternative splicing. The tool is not limited by technical constraints such as sequencing depth, and its predictions for alternatively spliced exons generalize over unmeasured conditions. Beyond the splicing outcome, it offers researchers the ability to identify putative regulatory elements and map those to the genome. These capabilities were recently used in an independent study to identify TIA1 as a regulator of an alternative exon coding miR-412 [25]. Here, we used a recent genome-wide study to demonstrate the tool's accuracy for predicting muscle, heart, and brain regulated exons and performed detailed in silico splicing analysis for the vascular endothelial growth factor A.
Several important elements remain as on-going and future enhancements of the tool. These include predictions for species other than mouse, predictions for additional forms of alternative splicing (for example, alternative 3′ and 5′ splice sites), and higher resolution of tissue specificity. Currently, AVISPA's predictions reflect confidence in alternative splicing or in relative, tissue-dependent, inclusion changes. Thus, users may infer an exon is likely to be alternative or to be differentially included in brain versus other tissues, but predictions for absolute inclusion levels (for example, 20% inclusion in brain, 40% inclusion in liver) are currently not supported. The tool has some technical limitations as well. Users can only submit a single cassette exon as a query, due to the computational burden involved in processing a query. Queries must be based on annotated exons and cannot contain exons shorter than 10 bases, and non-canonical splicing by the minor spliceosome is not supported. Nonetheless, the ability to perform splicing prediction irrespective of experimental limitations, coupled with the new regulatory elements analysis, should serve researchers studying gene regulation, RNA biogenesis, and development. Moreover, AVISPA is built as a flexible platform that can be repeatedly updated as more data and improved models become available. The new computational analysis offered by AVISPA should facilitate the discovery of novel splicing variants, regulatory elements, and genomic variations affecting phenotypic variability or disease.
Materials and methods
Query matching against sequence database
The web tool's internal database includes three components.
The first is a database of 11,773 cassette exons that
we previously mined from sequence libraries [5]. The sec-
ond is a set of 9,638 exon triplets derived from Refseq [26]
and other sequence libraries as described in [5], where
every three constitutive exons in a transcript define a trip-
let. These triplets were also scanned against exon expres-
sion measurements in 11 mouse tissues [12] and triplets
suspected to contain an alternative cassette exon were removed.
A query's sequence is matched against the two
transcript databases using BLAT with parameters set to
tileSize = 8, minMatch = 2, minIdentity = 88. The third
database component is the mouse assembly mm10 from
the UCSC Genome Browser [27]. Matching a query to the
reference genome is executed only if no match in the two
transcript-based databases is found, and only when gen-
omic coordinates for all three exons are specified.
Extended regulatory feature set
We extended the set of putative regulatory features to
include the occurrences of 350 new binding motifs in
the seven regions around a cassette exon as defined in
[5]. The motifs correspond to general splicing related
RNA binding proteins (RBPs), SR and SR-related pro-
teins (SC35, SRp20, 9G8, ASF/SF2, SRp30c, SRp38,
SRp40, SRp55, SRp75, Tra2α/β), and hnRNP proteins
(hnRNPA1, hnRNPA2/B1, hnRNPF/H, hnRNPG).
We also added features encoding computationally pre-
dicted nucleosome occupancy around the alternative
exon [13]. Features were defined as the average and
maximal occupancy scores in the first 100 nucleotides in
each intron and the first or last 50 nucleotides of the
alternative exon.
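The occupancy features described above reduce to an average and a maximum over a window of per-nucleotide scores. A minimal sketch follows, assuming the predicted occupancy is available as a list of per-nucleotide values; the function names and the toy track are hypothetical.

```python
def occupancy_features(scores):
    """Return (average, maximum) of per-nucleotide occupancy scores."""
    if not scores:
        raise ValueError("window must contain at least one score")
    return sum(scores) / len(scores), max(scores)


def window_features(track, start, length):
    """Summarize `length` nucleotides of `track` beginning at index `start`."""
    return occupancy_features(track[start:start + length])


# Toy track (hypothetical values); in the setting described above the window
# would be 100 intronic nucleotides or 50 exonic nucleotides.
track = [0.25, 0.5, 0.75, 0.5]
avg, peak = window_features(track, 0, 4)
```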
Extended training set for tissue-specific alternative
splicing
A total of 33 data tracks for normalized expression mea-
surements using Affymetrix exon arrays were down-
loaded from the UCSC Genome Browser. The tracks are
composed of measurements in 11 mouse tissues (brain,
embryo, heart, kidney, liver, lung, muscle, ovary, spleen,
testis, thymus) with three replicates for each tissue [12].
The expression of each exon and the relative inclusion
of a putative cassette exon compared to its flanking
exons were used as input features to train an ensemble
of Bayesian neural networks [11]. The networks used
these input features to identify differential inclusion and
exclusion of alternative exons in the four tissue groups
previously identified (CNS, muscle, digestive, embryo).
Training was based on a subset of 3,770 cassette exons
for which three probabilities, for increased inclusion
($q^{inc}$), increased exclusion ($q^{exc}$) and no change ($q^{nc}$), in
each of the four tissue groups were previously computed
[5]. This training step allowed the calibration of differen-
tial splicing estimation obtained from the new set of 33
experiments to the estimates used to train the original
splicing model [5]. The model ensemble was then used
to estimate differential splicing ($q^{inc}$, $q^{exc}$, $q^{nc}$) for the
remaining exons. The differential splicing estimates for
the original set of 3,770 exons were averaged between
the two datasets and care was taken to make sure pre-
dictions were based on non-overlapping training sets.
Predicting alternative cassette exons using expression
data and a single stage tissue-specific classifier
The 33 expression data tracks described above were also used to train a Bayesian deep neural network classifier [11], denoted '33 exon arrays' in Figure 2a. Any exon triplets from the set of 11,773 cassette exons and 9,638 putative constitutive exons that had missing data were removed, maintaining a total of 8,986 for training and test purposes. The prediction of alternative exons using a single stage tissue classifier, denoted Pr[Ts] in Figure 2a, used a max function over the chance of differential splicing ($1 - p^{nc}$) in each tissue.
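The Pr[Ts] score described above can be sketched in a few lines. The dictionary layout and the numbers are hypothetical; the only element taken from the text is the maximum over tissues of one minus the no-change probability.

```python
def single_stage_alternative_score(p_nc_by_tissue):
    """Score an exon as alternative using only tissue-dependent predictions.

    `p_nc_by_tissue` maps a tissue group to the predicted probability of
    no change in that tissue; the score is the largest chance of change.
    """
    return max(1.0 - p_nc for p_nc in p_nc_by_tissue.values())


# Hypothetical no-change probabilities for the four tissue groups.
example = {"CNS": 0.5, "muscle": 0.75, "digestive": 0.875, "embryo": 0.25}
score = single_stage_alternative_score(example)
```

An exon with strong evidence of change in even one tissue group therefore receives a high score, which is why this baseline cannot detect alternative exons that are not tissue dependent.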
Training a splicing code model for alternative exons and
for tissue-dependent splicing
For the purpose of inferring a regulatory model, we used
a Bayesian neural network that worked better for this
task than support vector machines, boosted decision
trees, and other leading machine learning techniques
[11]. To discriminate between alternative and constitutive
exons the network was set to have 10 hidden units and a
sparsity prior of 0.9 for connections between features and
hidden units. For predicting tissue-dependent splicing the
network was set to have 20 units and a sparsity prior of
0.95. Varying the sparsity prior between 0.85 and 0.95 and
adding up to 10 more hidden units did not have a signifi-
cant effect on the results (data not shown). An ensemble
of 5,000 models generated by Markov chain Monte Carlo
simulations was used to estimate differential splicing
($q^{inc}$, $q^{exc}$, $q^{nc}$), as was previously described [11].
Scoring tissue-dependent splicing
Under the new framework the probability that any given triplet of exons contains a tissue-dependent cassette exon can be expressed as:

$$P(O_t = ch \mid r_e) = P(AS \mid r_e)\, P(O_t = ch \mid r_e, AS),$$

where $P(O_t = ch \mid r_e)$ denotes the probability of observing a change in the exon's inclusion level in tissue $t$ given the exon's feature vector $r_e$, $P(AS \mid r_e)$ is the probability that the exon is alternative, and $P(O_t = ch \mid r_e, AS)$ is the probability of observing differential splicing given that the exon is alternative. The first term on the right is computed by the first stage predictor, while the second term is computed by the second stage predictor.
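The factorization can be read as an application of the chain rule, under the assumption that an exon can only show a tissue-dependent inclusion change if it is alternative. The worked numbers below are hypothetical and serve only to show how the outputs of the two stages combine.

```latex
\begin{align*}
P(O_t = ch \mid r_e)
  &= P(O_t = ch,\, AS \mid r_e)
     && \text{(a change implies the exon is alternative)} \\
  &= P(AS \mid r_e)\, P(O_t = ch \mid r_e, AS)
     && \text{(chain rule)} \\
  &= 0.9 \times 0.8 = 0.72
     && \text{(hypothetical stage outputs)}
\end{align*}
```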
ROC performance evaluation
Receiver operating characteristic (ROC) performance
was evaluated using repeated five-fold cross-validation
and care was taken to make sure predictions were based
on non-redundant training sets, as was previously de-
scribed [5]. Evaluation of discriminating between alter-
native and constitutive exons was based on a set of
11,773 cassette exons and 9,638 putative constitutive
exons derived from EST/cDNA sequences [5]. In order
to assess the accuracy of detecting cassette exons that
exhibit a tissue-dependent splicing pattern (for example,
differential inclusion in muscle) we compared the scores
of such exons to those of a random set of exon triplets
that do not exhibit this splicing pattern. The random set
was selected using the following procedure. First, we
used the 33 genome-wide exon expression measure-
ments described above to quantify the inclusion level of
all exon triplets from all Refseq transcripts. Next, we dis-
carded triplets with missing data and required the rela-
tive expression of the upstream and downstream exons
to be no more than 1.5-fold apart in all experiments. In
order to avoid probe sets with little signal, we required
the up- and downstream exons to have a normalized ab-
solute value of at least 0.1 in at least 15 experiments.
Additionally, we required in at least three experiments
of the tissue group of interest (for example, digestive)
that the up- and downstream exons are not in the
bottom 20 percentile. Finally, the relative expression of
each middle exon compared to its flanking exons was used
to estimate the chance it is differentially included in each
tissue group [28]. Any triplet that had a P-value of 0.7 or
higher was deemed non-tissue-dependent and a set of ap-
proximately 2,000 exons was then selected for each tissue
as a non-tissue-dependent exon set. Exons were selected
randomly from the respective genes and then randomly
from the relative order within the gene. We then verified
that these are not biased in terms of relative location
within the gene or gene length compared to a random
sample of triplets from the genome (data not shown).
While small variations in the parameters of the above
process did not have a notable effect on the results, we did
detect an apparent selection bias in this procedure. Specif-
ically, using expression measurements to select exons
based on high confidence in non-tissue-dependent splicing
may favor constitutive exons. Notably, the 'true' label
of any given exon as alternative or constitutive is unavailable.
However, since our prediction algorithm has proved
accurate in distinguishing alternative from constitutive
exons (Figure 2a), we applied it to the set of 2,000 non-
tissue-dependent exons selected for each tissue group.
Compared to a random set of 1,000 exon triplets, these
exons were biased towards constitutive exon scores
(Additional file 1). To correct for this apparent bias we
subsampled 1,000 exons for each tissue group so that their
scores as alternative match those in the random set
(Additional file 1, green and red lines). This corrected set
of a total of 4,000 predictions was then used for subsequent
analysis (Figure 2b,c). We note that without this correction
the initial set of non-tissue-dependent exons results in im-
proved performance compared to that shown in Figure 2.
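One way to carry out such a correction is to subsample the candidate exons bin by bin so that their score histogram follows that of the random reference set. The text does not state how the subsampling was done, so the binned quota scheme below is an assumption, as are the function name, the number of bins, and scores lying in [0, 1].

```python
import random
from collections import defaultdict


def subsample_to_match(candidate_scores, reference_scores,
                       n_target, n_bins=10, seed=0):
    """Return indices of candidates whose binned score distribution
    approximates that of the reference set (hypothetical scheme)."""
    rng = random.Random(seed)

    def bin_of(score):
        # Scores are assumed to lie in [0, 1]; 1.0 goes into the last bin.
        return min(int(score * n_bins), n_bins - 1)

    # Histogram of the reference (random) set.
    ref_counts = [0] * n_bins
    for s in reference_scores:
        ref_counts[bin_of(s)] += 1

    # Group candidate indices by score bin.
    by_bin = defaultdict(list)
    for i, s in enumerate(candidate_scores):
        by_bin[bin_of(s)].append(i)

    # Draw from each bin in proportion to the reference histogram.
    chosen = []
    for b in range(n_bins):
        quota = round(n_target * ref_counts[b] / len(reference_scores))
        pool = by_bin[b]
        chosen.extend(rng.sample(pool, min(quota, len(pool))))
    return sorted(chosen)
```

If a bin holds fewer candidates than its quota, the sketch returns fewer than `n_target` indices rather than sampling with replacement. The fit can then be checked by comparing cumulative distributions, as in Additional file 1.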
In silico feature removal and normalized feature effect
In order to evaluate the relative effect of a putative regulatory
sequence motif (for example, the occurrence of a [U]GCAUG motif,
known to bind Fox1/2, upstream of the alternative exon), the feature
is first set to zero. The splicing predictions with the mutated
feature, denoted $(p^{inc}_{\Delta f}, p^{exc}_{\Delta f})$, are then
computed, with the total effect on differential splicing defined as

$$FE_f = \left| p^{inc} - p^{inc}_{\Delta f} \right| + \left| p^{exc} - p^{exc}_{\Delta f} \right|.$$

This definition aims to capture the effect of features that not only
change the confidence in a splicing change $(p^{nc}, p^{nc}_{\Delta f})$,
but also change the relative confidence in either differential
inclusion or exclusion. Finally, the normalized feature effect (NFE)
is defined as:

$$NFE_f = \frac{FE_f}{\sum_{j \in J} FE_j}$$

where $J$ is the set of robust features. By itself, the NFE has
no statistical significance measure associated with it. The
NFE serves mainly as a quantitative tool to guide researchers
interested in knowing which of the identified regulatory features
have a higher effect on the model's prediction confidence.
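The feature effect and its normalization can be illustrated with a short sketch. It assumes the predictions are already available as dictionaries with keys `"inc"` and `"exc"`, and that `removed_by_feature` holds the predictions recomputed with each robust feature set to zero. These names and the numbers in the usage note are hypothetical, and the prediction model itself is not shown.

```python
def feature_effect(p, p_removed):
    """FE_f: sum of absolute changes in the inclusion and exclusion
    predictions when feature f is set to zero."""
    return (abs(p["inc"] - p_removed["inc"])
            + abs(p["exc"] - p_removed["exc"]))


def normalized_feature_effects(p, removed_by_feature):
    """NFE_f = FE_f / sum of FE_j over the features in removed_by_feature
    (standing in for the set J of robust features)."""
    fe = {f: feature_effect(p, q) for f, q in removed_by_feature.items()}
    total = sum(fe.values())
    if total == 0:
        # No feature changes the prediction; report zero effect for all.
        return {f: 0.0 for f in fe}
    return {f: v / total for f, v in fe.items()}
```

For example, with an original prediction of `{"inc": 0.6, "exc": 0.1}`, a feature whose removal gives `{"inc": 0.3, "exc": 0.2}` has FE = 0.3 + 0.1 = 0.4. If the only other robust feature has FE = 0.1, the two NFE values are 0.8 and 0.2.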
Additional file
Additional file 1: Figure S1. Correcting constitutive exons selection
bias in non-tissue-dependent exons. Exon scores for being alternative
versus constitutive (x-axis) are plotted as a cumulative distribution func-
tion (CDF, y-axis). The initial set of selected non-tissue-dependent exons
(blue) was biased towards constitutive exons compared to a random
sample of 1,000 exon triplets from the genome (red). Subsampling the
original set of 2,000 exons per tissue to fit the score distribution of a
random set gave a good fit (green). Both green and red line plots are
accumulated over all exons in all tissues as no significant difference was
observed between the different tissues.
Abbreviations
AS: Alternative splicing; AUC: Area under the curve; AVISPA: Advanced
visualization of splicing prediction and analysis; CNS: Central nervous system;
EST: Expressed sequence tag; NFE: Normalized feature effect.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YB and BJF conceived of the project. YB developed the combined prediction
framework and in silico feature analysis. YB, JVG and WG developed the
analysis pipeline with input from all authors. YB, JVG, WG and LJL created
the sequence databases. JGV, WG and JVG developed the web tool. HYX, YB
and BJF developed the prediction algorithms. YB and JVG performed the
data analysis. YB wrote the paper with input from BJF. All authors read and
approved the final manuscript.
Acknowledgements
We thank Ben Blencowe for his support and advice throughout this project. We
thank Kristen Lynch for helpful feedback and discussions; Xiang-Dong Fu for
sharing experimental results; members of the Barash, Lynch, Blencowe and Frey
labs for providing helpful comments on the manuscript and suggestions for the
web tool. JGV was funded through an FPI grant from the Spanish Ministry of
Science; HYX, WG and LJL were funded by NSERC Steacie and CIHR grants to
BJF. While at the University of Toronto, YB was funded by an OGI Spark grant
and a Genome Canada grant to BJF and others.
Author details
1 Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.
2 Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA.
3 Universitat Pompeu Fabra, Barcelona 08003, Spain.
4 Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada.
5 Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5G 1L6, Canada.
Received: 9 July 2013 Accepted: 11 October 2013
Published: 24 October 2013
References
1. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative
splicing complexity in the human transcriptome by high-throughput
sequencing. Nat Genet 2008, 40:1413–1415.
2. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF,
Schroth GP, Burge CB: Alternative isoform regulation in human tissue
transcriptomes. Nature 2008, 456:470–476.
3. Wang GS, Cooper TA: Splicing in disease: disruption of the splicing code
and the decoding machinery. Nat Rev Genet 2007, 8:749–761.
4. Wang Z, Burge CB: Splicing regulation: from a parts list of regulatory
elements to an integrated splicing code. RNA 2008, 14:802–813.
5. Barash Y, Calarco JA, Gao W, Pan Q, Wang X, Shai O, Blencowe BJ, Frey BJ:
Deciphering the splicing code. Nature 2010, 465:53–59.
6. Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY, Gueroussov S, Lee LJ,
Slobodeniuc V, Kutter C, Watt S, Çolak R, Kim T, Misquitta-Ali CM, Wilson
MD, Kim PM, Odom DT, Frey BJ, Blencowe BJ: The evolutionary landscape
of alternative splicing in vertebrate species. Science 2012, 338:1587–1593.
7. Yeo G, Burge CB: Maximum entropy modeling of short sequence motifs
with applications to RNA splicing signals. J Comput Biol 2004, 11:377–394.
8. Dogan RI, Getoor L, Wilbur WJ, Mount SM: SplicePort – an interactive
splice-site analysis tool. Nucleic Acids Res 2007, 35:W285–W291.
9. Cartegni L, Wang J, Zhu Z, Zhang MQ, Krainer AR: ESEfinder: a web
resource to identify exonic splicing enhancers. Nucleic Acids Res 2003,
31:3568–3571.
10. Yeo GW, Coufal NG, Liang TY, Peng GE, Fu X-D, Gage FH: An RNA code for
the FOX2 splicing regulator revealed by mapping RNA-protein
interactions in stem cells. Nat Struct Mol Biol 2009, 16:130–137.
11. Xiong HY, Barash Y, Frey BJ: Bayesian prediction of tissue-regulated
splicing using RNA sequence and cellular context. Bioinformatics 2011,
27:2554–2562.
12. Pohl AA, Sugnet CW, Clark TA, Smith K, Fujita PA, Cline MS: Affy exon
tissues: exon levels in normal tissues in human, mouse and rat.
Bioinformatics 2009, 25:2442–2443.
13. Xi L, Fondufe-Mittendorf Y, Xia L, Flatow J, Widom J, Wang JP: Predicting
nucleosome positioning using a duration Hidden Markov Model.
BMC Bioinformatics 2010, 11:346.
14. Wang ET, Cody NAL, Jog S, Biancolella M, Wang TT, Treacy DJ, Luo S,
Schroth GP, Housman DE, Reddy S, Lécuyer E, Burge CB:
Transcriptome-wide regulation of pre-mRNA splicing and mRNA localization
by muscleblind proteins. Cell 2012, 150:710–724.
15. Schwartz S, Meshorer E, Ast G: Chromatin organization marks exon-intron
structure. Nat Struct Mol Biol 2009, 16:990–995.
16. Tilgner H, Nikolaou C, Althammer S, Sammeth M, Beato M, Valcárcel J,
Guigó R: Nucleosome positioning as a determinant of exon recognition.
Nat Struct Mol Biol 2009, 16:996–1001.
17. Harper SJ, Bates DO: VEGF-A splicing: the key to anti-angiogenic
therapeutics? Nat Rev Cancer 2008, 8:880–887.
18. Kalsotra A, Xiao X, Ward AJ, Castle JC, Johnson JM, Burge CB, Cooper TA:
A postnatal switch of CELF and MBNL proteins reprograms alternative
splicing in the developing heart. Proc Natl Acad Sci USA 2008,
105:20333–20338.
19. Nowak DG, Amin EM, Rennel ES, Hoareau-Aveilla C, Gammons M,
Damodoran G, Hagiwara M, Harper SJ, Woolard J, Ladomery MR, Bates DO:
Regulation of Vascular Endothelial Growth Factor (VEGF) Splicing from
Pro-angiogenic to Anti-angiogenic Isoforms: a Novel Therapeutic
Strategy for Angiogenesis. J Biol Chem 2009, 285:5532–5540.
20. Hall MP, Nagel RJ, Fagg WS, Shiue L, Cline MS, Perriman RJ, Donohue JP,
Ares M: Quaking and PTB control overlapping splicing regulatory
networks during muscle cell differentiation. RNA 2013, 19:627–638.
21. Ule J, Stefani G, Mele A, Ruggiu M, Wang X, Taneri B, Gaasterland T,
Blencowe BJ, Darnell RB: An RNA map predicting Nova-dependent
splicing regulation. Nature 2006, 444:580–586.
22. Kawamoto S: Neuron-specific alternative splicing of nonmuscle myosin II
heavy chain-B pre-mRNA requires a cis-acting intron sequence. J Biol
Chem 1996, 271:17613–17616.
23. Aznarez I, Barash Y, Shai O, He D, Zielenski J, Tsui LC, Parkinson J, Frey BJ,
Rommens JM, Blencowe BJ: A systematic analysis of intronic sequences
downstream of 5' splice sites reveals a widespread role for U-rich motifs
and TIA1/TIAL1 proteins in alternative splicing regulation. Genome Res
2008, 18:1247–1258.
24. Chen M, Manley JL: Mechanisms of alternative splicing regulation:
insights from molecular and genomics approaches. Nat Rev Mol Cell Biol
2009, 10:741–754.
25. Melamed Z, Levy A, Ashwal-Fluss R, Lev-Maor G, Mekahel K, Atias N, Gilad S,
Sharan R, Levy C, Kadener S, Ast G: Alternative Splicing Regulates
Biogenesis of miRNAs Located across Exon-Intron Junctions. Mol Cell 2013,
50:869–881.
26. Pruitt KD, Tatusova T, Brown GR, Maglott DR: NCBI Reference Sequences
(RefSeq): current status, new features and genome annotation policy.
Nucleic Acids Res 2012, 40:D130–D135.
27. Dreszer TR, Karolchik D, Zweig AS, Hinrichs AS, Raney BJ, Kuhn RM, Meyer
LR, Wong M, Sloan CA, Rosenbloom KR: The UCSC Genome Browser
database: extensions and updates 2011. Nucleic Acids Res 2012,
40:D918–D923.
28. Ben-Dor A, Friedman N, Yakhini Z: Scoring Genes for Relevance. Agilent; 2000.
doi:10.1186/gb-2013-14-10-r114
Cite this article as: Barash et al.: AVISPA: a web tool for the prediction
and analysis of alternative splicing. Genome Biology 2013 14:R114.
... These tools interrogate exon flanking sequences to identify features--sequence elements--that are compatible with exon skipping or other forms of alternative splicing. Tools like AVISPA (8), SpliceAI (9), Splice-Port (10), MaxEntScan (11) and ESEFinder (12) look for regulatory elements that direct splicing which allow for al-ternative isoforms. This approach is limited by our knowledge of the factors that mediate splicing and the complexity of this process. ...
... Deep neural networks are a good choice when trying to infer complex interactions in sequence data (8,9,(33)(34)(35)(36). Accordingly, we have built a model, called Exon ByPASS, using deep neural networks based on a CNN-LSTM (convolution neural network--long-short-term memory) ( Figure 1A) (37,38). ...
... Exon ByPASS predicts the likelihood that an exon of interest can be skipped or is constitutive based on the resulting protein sequence. Previous methods to predict exon criticality have been based on nucleotide sequence (8)(9)(10)(11)(12); however, Exon ByPASS considers the possibility that exon skipping constraints can be imposed at the protein level. ...
Article
Full-text available
Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.
... Tumor Mechanism of antigen escape in occurs in exon 2 of CD19 and is regulated by SRSF3 splice factor [41] sgRNA sequencing Tumor Death receptor signature can predict response In another strategy to understand the molecular state of CART cell therapy, T cell receptor beta (TCRB) sequencing, lentiviral integration sites analysis, and RNA sequencing on patient samples were simultaneously performed [24] . The investigators compared the infused CART cells to the expanded CART cells at 1-2 weeks and 26-30 days post infusion. ...
... This phenomenon was seen in the first published phase I clinical trial of the CD19-directed 4-1BB-ζ CART cell therapy at the Children's Hospital of Philadelphia [40] . The pre-treatment leukemic cells were analyzed and compared to the CD19negative relapsed cells by performing whole genome sequencing and RNA sequencing [41] . These studies revealed that the primary mechanism of antigen escape occurs in exon 2 of the CD19 gene [41] . ...
... The pre-treatment leukemic cells were analyzed and compared to the CD19negative relapsed cells by performing whole genome sequencing and RNA sequencing [41] . These studies revealed that the primary mechanism of antigen escape occurs in exon 2 of the CD19 gene [41] . While frameshift mutations in exon 2 were observed in some of these cases, it did not appear to be the primary mechanism of the mutation [41] . ...
Article
Full-text available
Chimeric antigen receptor T (CART) cell therapy has revolutionized the treatment of relapsed/refractory B cell malignancies in recent years. Despite high initial response rates, durable response rates are low, and CART cell efficacy in solid tumors is very modest. Additionally, the overall success of CART cell therapy is limited by toxicities such as cytokine release syndrome and neurotoxicity. Decades of advancement in genome sequencing technology and bioinformatics have given us a better understanding of how cancer develops and evolves following treatments. This has resulted in a better understanding of patient response to cancer treatment on a molecular level. Resistance to CART cell therapy can be mediated by the cancer cells, the tumor microenvironment, or the patient’s T cells. In this review, we will outline lessons learned from multi-omics studies (1) to identify biomarkers of response or toxicity to CART cell therapy or (2) to develop biomarker-guided therapeutic interventions to overcome these limitations.
... We next utilized splicing code models (Barash et al. 2010) through the AVISPA tool (Barash et al. 2013) to predict the relevance of the RBFOX motifs in determining splicing outcome. ...
... doi: bioRxiv preprint Consistent with the pentamer enrichment described above, AVISPA found the RBFOX motif occurred significantly more often downstream of CELF2 repressed exons (36.6%) compared to unresponsive exons (25.6%, Fisher's exact two-tailed p < 3x10 -4 ) or CELF2 enhanced exons (24.1%, p < 1 x 10 -3 , Fig. 1E inset). Moreover, the presence of the RBFOX motif downstream CELF2 repressed exons is predicted by AVISPA to have significantly more impact on splicing (higher normalized feature effect (Barash et al. 2013), see Supplemental Methods) when compared to either unresponsive (Kolmogorov-Smirnov two-sample p < 6.3x10 -11 ) or CELF2 enhanced exons (p < 2x10 -4 , Fig. 1E). The RBFOX motifs downstream of the CELF2 repressed exons are also more highly conserved than those around CELF2-unresponsive exons (Fig 1F), which is an additional hallmark of functional relevance (Lambert et al. 2014;Taliaferro et al. 2016). ...
... For motif enrichment we first defined splicing relevant regions for all responsive and unresponsive cassette exons. There regions were defined as: the upstream constitutive exon (C1); the alternative exon (A); the downstream constitutive exon (C2); and all intron sequence between C1 and A or A and C2 and within 300 nt of an exon (Barash et al. 2010;Barash et al. 2013). Counts for all 1024 possible pentamers were obtained for each of these regions and a hypergeometric test was applied to obtain significant differences in the presence or absence of a pentamer in a specific region in a regulated versus unresponsive set of exons. ...
Preprint
Full-text available
Over 95% of human multi-exon genes undergo alternative splicing, a process important in normal development and often dysregulated in disease. We sought to analyze the global splicing regulatory network of CELF2 in human T cells, a well-studied splicing regulator critical to T cell development and function. By integrating high-throughput sequencing data for binding and splicing quantification with sequence features and probabilistic splicing code models, we find evidence of splicing antagonism between CELF2 and the RBFOX family of splicing factors. We validate this functional antagonism through knockdown and overexpression experiments in human cells and find CELF2 represses RBFOX2 mRNA and protein levels. Because both families of proteins have been implicated in the development and maintenance of neuronal, muscle, and heart tissues, we analyzed publicly available data in these systems. Our analysis suggests global, antagonistic co-regulation of splicing by the CELF and RBFOX proteins in mouse muscle and heart in several physiologically relevant targets including proteins involved in calcium signaling and members of the MEF2 family of transcription factors. Importantly, a number of these co-regulated events are aberrantly spliced in mouse models and human patients with diseases that affect these tissues including heart failure, diabetes, or myotonic dystrophy. Finally, analysis of exons regulated by ancient CELF family homologs in chicken, and Drosophila suggests this antagonism is conserved through evolution.
... In this work, we use RNA-seq experiments processed by [16] from six mouse tissues (hippocampus, heart, liver, lung, spleen, thymus) with average read coverage of 60 million reads. We generated 1357 genomic features from 14,596 exon skipping events and 74,156 constitutive exon triplets using AVISPA [39]. and quantification for the exon skipping events were generated using MAJIQ [17]. ...
... , attr(g m )], R = [ attr(r 1 ), attr(r 2 ), . . . , attr(r l )], [18,39]. To compute the attribution for a meta-feature, we sum the attributions of all its features. ...
... As described in the main text, the set of splicing code features used in this work has been previously curated from the literature and is therefore highly enriched in informative splicing regulatory features in general, and for regulation of splicing in the muscle and brain in particular [18,39]. To define a high-quality set of known biological metafeatures from these for our specific task of interest, we therefore ran a hypergeometric test to compute p values for enrichment or depletion of a feature in the differentially included splicing events in the brain compared to a negative set of constitutive splicing events. ...
Article
Full-text available
Despite the success and fast adaptation of deep learning models in biomedical domains, their lack of interpretability remains an issue. Here, we introduce Enhanced Integrated Gradients (EIG), a method to identify significant features associated with a specific prediction task. Using RNA splicing prediction as well as digit classification as case studies, we demonstrate that EIG improves upon the original Integrated Gradients method and produces sets of informative features. We then apply EIG to identify A1CF as a key regulator of liver-specific alternative splicing, supporting this finding with subsequent analysis of relevant A1CF functional (RNA-seq) and binding data (PAR-CLIP).
... Transcriptomic studies describe dynamic alternative splicing networks in a range of tissues from adult organs such as brain, heart and skeletal muscle, to embryonic stem and precursor cells, particularly during differentiation or reprogramming of various cell lineages as well as epithelial-mesenchymal transitions. Alternative splicing contributes to cell differentiation and lineage determination, tissue identity acquisition and maintenance and organ development [23][24][25]. RBPs are specific for each differentiation status and contribute to splicing coordination. In fact, genes regulated by alternative splicing are not usually modulated at their overall expression levels (they are typically up-and down-regulated) [22,26]. ...
Article
Full-text available
Myotonic dystrophy type I (DM1) is the most common form of adult muscular dystrophy, caused by expansion of a CTG triplet repeat in the 3' untranslated region (3'UTR) of the myotonic dystrophy protein kinase (DMPK) gene. The pathological CTG repeats result in protein trapping by expanded transcripts, a decreased DMPK translation and the disruption of the chromatin structure, affecting neighboring genes expression. The muscleblind-like (MBNL) and CUG-BP and ETR-3-like factors (CELF) are two families of tissue-specific regulators of developmentally programmed alternative splicing that act as antagonist regulators of several pre-mRNA targets, including troponin 2 (TNNT2), insulin receptor (INSR), chloride channel 1 (CLCN1) and MBNL2. Sequestration of MBNL proteins and up-regulation of CELF1 are key to DM1 pathology, inducing a spliceopathy that leads to a developmental remodelling of the transcriptome due to an adult-to-foetal splicing switch, which results in the loss of cell function and viability. Moreover, recent studies indicate that additional pathogenic mechanisms may also contribute to disease pathology, including a misregulation of cellular mRNA translation, localization and stability. This review focuses on the cause and effects of MBNL and CELF1 deregulation in DM1, describing the molecular mechanisms underlying alternative splicing misregulation for a deeper understanding of DM1 complexity. To contribute to this analysis, we have prepared a comprehensive list of transcript alterations involved in DM1 pathogenesis, as well as other deregulated mRNA processing pathways implications.
... Previous computational methods on splicing have largely focused on discovering novel splice junctions based on RNA sequencing (RNA-seq) alignments [25,26], utilizing machine learning approaches [27,28] including deep neural networks [29]. Only a limited set of tools can model splicing regulation based on genomic sequences and select RNA features [30][31][32]. Moreover, studies on splicing regulation have focused heavily on identifying mutations that land within splice sites (SSs), cis-acting splicing regulatory elements, and trans-acting splicing factors [30,33]. ...
Article
Full-text available
Alternative RNA splicing provides an important means to expand metazoan transcriptome diversity. Contrary to what was accepted previously, splicing is now thought to predominantly take place during transcription. Motivated by emerging data showing the physical proximity of the spliceosome to Pol II, we surveyed the effect of epigenetic context on co-transcriptional splicing. In particular, we observed that splicing factors were not necessarily enriched at exon junctions and that most epigenetic signatures had a distinctly asymmetric profile around known splice sites. Given this, we tried to build an interpretable model that mimics the physical layout of splicing regulation where the chromatin context progressively changes as the Pol II moves along the guide DNA. We used a recurrent-neural-network architecture to predict the inclusion of a spliced exon based on adjacent epigenetic signals, and we showed that distinct spatio-temporal features of these signals were key determinants of model outcome, in addition to the actual nucleotide sequence of the guide DNA strand. After the model had been trained and tested (with >80% precision-recall curve metric), we explored the derived weights of the latent factors, finding they highlight the importance of the asymmetric time-direction of chromatin context during transcription.
Preprint
Full-text available
Alternative splicing contributes to molecular diversity across brain cell types. RNA-binding proteins (RBPs) regulate splicing, but the genome-wide mechanisms remain poorly understood. Here, we used RBP binding sites and/or the genomic sequence to predict exon inclusion in neurons and glia as measured by long-read single-cell data in human hippocampus and frontal cortex. We found that alternative splicing is harder to predict in neurons compared to glia in both brain regions. Comparing neurons and glia, the position of RBP binding sites in alternatively spliced exons in neurons differ more from non-variable exons indicating distinct splicing mechanisms. Model interpretation pinpointed RBPs, including QKI, potentially regulating alternative splicing between neurons and glia. Finally, using our models, we accurately predict and prioritize the effect of splicing QTLs. Taken together, our models provide new insights into the mechanisms regulating cell-type-specific alternative splicing and can accurately predict the effect of genetic variants on splicing.
Chapter
The development of new drugs is expensive, time-consuming, and often results in failure. These problems can partially be solved through the use of AI to identify drug targets, search for molecules capable of interacting with these targets, and then model the interactions of the drug and its target while modelling the physiochemical properties of this drug. Alternative splicing is commonly altered in cancer and as such has become a target for the designing of new drugs. While many drugs have been designed to target either the new isoforms that favour cancer development or proteins involved in the splicing pathway, AI can improve this by helping screen proteome and transcriptome databases to identify new splice variants. AI can also model the three-dimensional structure of new isoforms in order to screen for compounds that can bind exclusively to these isoforms.
Article
Although wireless sensor networks (WSNs) are widely used in many fields, such as industrial production, medical studies, and environmental monitoring, they are vulnerable to various security problems. This study proposes a WSN node access authentication protocol based on trusted connection architecture to prevent easy node capture and various malicious attacks as well as to address the limited energy and computing power and different levels of node credibility in WSNs. First, each node of a WSN is configured using a trusted platform module to ensure complete key generation and safe storage, and thus provides security for the access protocol. Second, an alarm mechanism is introduced to avoid cluster node issues, such as not forwarding data, forwarding part of the data, and forwarding wrong data. This mechanism enhances the troubleshooting capability. Finally, during node access, bidirectional node identity authentication, platform identity authentication, and platform integrity verification are performed to achieve trusted node access. Our protocol is formally verified using Syverson-Van Oorschot (SVO) logic. The security features are applied to analyze the protocol, and back-end analysis modules such as On-the-fly Model-Checker (OFMC) and Constraint Logic based Attack Searcher (CL-AtSe) of the Automated Validation of Internet Security Protocols and Applications (AVISPA) tool are used to test the protocol. The theoretical analysis and test results show that the established security target of the protocol can resist network attacks in real application scenarios. In addition, the implementation efficiency of the protocol is sufficiently analyzed and evaluated. The results show that the protocol has high execution efficiency. In particular, the protocol is suitable for WSNs with high security requirements and limited computing power.
Article
Full-text available
Neurological disorders significantly outnumber diseases in other therapeutic areas. However, developing drugs for central nervous system (CNS) disorders remains the most challenging area in drug discovery, accompanied with the long timelines and high attrition rates. With the rapid growth of biomedical data enabled by advanced experimental technologies, artificial intelligence (AI) and machine learning (ML) have emerged as an indispensable tool to draw meaningful insights and improve decision making in drug discovery. Thanks to the advancements in AI and ML algorithms, now the AI/ML‐driven solutions have an unprecedented potential to accelerate the process of CNS drug discovery with better success rate. In this review, we comprehensively summarize AI/ML‐powered pharmaceutical discovery efforts and their implementations in the CNS area. After introducing the AI/ML models as well as the conceptualization and data preparation, we outline the applications of AI/ML technologies to several key procedures in drug discovery, including target identification, compound screening, hit/lead generation and optimization, drug response and synergy prediction, de novo drug design, and drug repurposing. We review the current state‐of‐the‐art of AI/ML‐guided CNS drug discovery, focusing on blood–brain barrier permeability prediction and implementation into therapeutic discovery for neurological diseases. Finally, we discuss the major challenges and limitations of current approaches and possible future directions that may provide resolutions to these difficulties.
Article
Full-text available
Alternative splicing contributes to muscle development, but a complete set of muscle-splicing factors and their combinatorial interactions are unknown. Previous work identified ACUAA ("STAR" motif) as an enriched intron sequence near muscle-specific alternative exons such as Capzb exon 9. Mass spectrometry of myoblast proteins selected by the Capzb exon 9 intron via RNA affinity chromatography identifies Quaking (QK), a protein known to regulate mRNA function through ACUAA motifs in 3' UTRs. We find that QK promotes inclusion of Capzb exon 9 in opposition to repression by polypyrimidine tract-binding protein (PTB). QK depletion alters inclusion of 406 cassette exons whose adjacent intron sequences are also enriched in ACUAA motifs. During differentiation of myoblasts to myotubes, QK levels increase two- to threefold, suggesting a mechanism for QK-responsive exon regulation. Combined analysis of the PTB- and QK-splicing regulatory networks during myogenesis suggests that 39% of regulated exons are under the control of one or both of these splicing factors. This work provides the first evidence that QK is a global regulator of splicing during muscle development in vertebrates and shows how overlapping splicing regulatory networks contribute to gene expression programs during differentiation.
Article
Full-text available
Whence Species Variation? Vertebrates have widely varying phenotypes that are at odds with their much more limited proteincoding genotypes and conserved messenger RNA expression patterns. Genes with multiple exons and introns can undergo alternative splicing, potentially resulting in multiple protein isoforms (see the Perspective by Papasaikas and Valcárcel ). Barbosa-Morais et al. (p. 1587 ) and Merkin et al. (p. 1593 ) analyzed alternative splicing across the genomes of a variety of vertebrates, including human, primates, rodents, opossum, platypus, chicken, lizard, and frog. The findings suggest that the evolution of alternative splicing has for the most part been very rapid and that alternative splicing patterns of most organs more strongly reflect the identity of the species rather than the organ type. Species-classifying alternative splicing can affect key regulators, often in disordered regions of proteins that may influence protein-protein interactions, or in regions involved in protein phosphorylation.
The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a wide variety of organisms. The Browser is an integrated tool set for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic datasets. As of September 2012, genomic sequence and a basic set of annotation ‘tracks’ are provided for 63 organisms, including 26 mammals, 13 non-mammal vertebrates, 3 invertebrate deuterostomes, 13 insects, 6 worms, yeast and sea hare. In the past year 19 new genome assemblies have been added, and we anticipate releasing another 28 in early 2013. Further, a large number of annotation tracks have been either added, updated by contributors or remapped to the latest human reference genome. Among these are an updated UCSC Genes track for human and mouse assemblies. We have also introduced several features to improve usability, including new navigation menus. This article provides an update to the UCSC Genome Browser database, which has been previously featured in the Database issue of this journal.
Recent molecular level studies that compare different classes of disease conditions produce labeled gene expression data. We examine scoring methods that are useful in mining such gene expression data for genes that have biological relevance to the condition studied. Relevance information is useful in identifying genes driving the biological process, in selecting small subsets of genes with diagnostic potential, and in better understanding the condition studied and its relationship to known or hypothesized biochemical pathways. We present the scoring methods; describe a process for computing the corresponding p-values; and finally, present results from application to actual cancer gene expression data. These include applying classification techniques employing varying relevance-based selected sets of genes.
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16 000 organisms, 2.4 × 10^6 genomic records, 13 × 10^6 proteins and 2 × 10^6 RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).
The University of California Santa Cruz Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a wide variety of organisms. The Browser is an integrated tool set for visualizing, comparing, analyzing and sharing both publicly available and user-generated genomic data sets. In the past year, the local database has been updated with four new species assemblies, and we anticipate another four will be released by the end of 2011. Further, a large number of annotation tracks have been either added, updated by contributors, or remapped to the latest human reference genome. Among these are new phenotype and disease annotations, UCSC genes, and a major dbSNP update, which required new visualization methods. Growing beyond the local database, this year we have introduced ‘track data hubs’, which allow the Genome Browser to provide access to remotely located sets of annotations. This feature is designed to significantly extend the number and variety of annotation tracks that are publicly available for visualization and analysis from within our site. We have also introduced several usability features including track search and a context-sensitive menu of options available with a right-click anywhere on the Browser's image.
The initial step in microRNA (miRNA) biogenesis requires processing of the precursor miRNA (pre-miRNA) from a longer primary transcript. Many pre-miRNAs originate from introns, and both a mature miRNA and a spliced RNA can be generated from the same transcription unit. We have identified a mechanism in which RNA splicing negatively regulates the processing of pre-miRNAs that overlap exon-intron junctions. Computational analysis identified dozens of such pre-miRNAs, and experimental validation demonstrated competitive interaction between the Microprocessor complex and the splicing machinery. Tissue-specific alternative splicing regulates maturation of one such miRNA, miR-412, resulting in effects on its targets that code a protein network involved in neuronal cell death processes. This mode of regulation specifically controls maturation of splice-site-overlapping pre-miRNAs but not pre-miRNAs located completely within introns or exons of the same transcript. Our data present a biological role of alternative splicing in regulation of miRNA biogenesis.
Vascular endothelial growth factor (VEGF) is produced either as a pro-angiogenic or anti-angiogenic protein depending upon splice site choice in the terminal, eighth exon. Proximal splice site selection (PSS) in exon 8 generates pro-angiogenic isoforms such as VEGF165, and distal splice site selection (DSS) results in anti-angiogenic isoforms such as VEGF165b. Cellular decisions on splice site selection depend upon the activity of RNA-binding splice factors, such as ASF/SF2, which have previously been shown to regulate VEGF splice site choice. To determine the mechanism by which the pro-angiogenic splice site choice is mediated, we investigated the effect of inhibition of ASF/SF2 phosphorylation by SR protein kinases (SRPK1/2) on splice site choice in epithelial cells and in in vivo angiogenesis models. Epithelial cells treated with insulin-like growth factor-1 (IGF-1) increased PSS and produced more VEGF165 and less VEGF165b. This down-regulation of DSS and increased PSS was blocked by protein kinase C inhibition and SRPK1/2 inhibition. IGF-1 treatment resulted in nuclear localization of ASF/SF2, which was blocked by SRPK1/2 inhibition. Pull-down assay and RNA immunoprecipitation using VEGF mRNA sequences identified an 11-nucleotide sequence required for ASF/SF2 binding. Injection of an SRPK1/2 inhibitor reduced angiogenesis in a mouse model of retinal neovascularization, suggesting that regulation of alternative splicing could be a potential therapeutic strategy in angiogenic pathologies.
The muscleblind-like (Mbnl) family of RNA-binding proteins plays important roles in muscle and eye development and in myotonic dystrophy (DM), in which expanded CUG or CCUG repeats functionally deplete Mbnl proteins. We identified transcriptome-wide functional and biophysical targets of Mbnl proteins in brain, heart, muscle, and myoblasts by using RNA-seq and CLIP-seq approaches. This analysis identified several hundred splicing events whose regulation depended on Mbnl function in a pattern indicating functional interchangeability between Mbnl1 and Mbnl2. A nucleotide resolution RNA map associated repression or activation of exon splicing with Mbnl binding near either 3' splice site or near the downstream 5' splice site, respectively. Transcriptomic analysis of subcellular compartments uncovered a global role for Mbnls in regulating localization of mRNAs in both mouse and Drosophila cells, and Mbnl-dependent translation and protein secretion were observed for a subset of mRNAs with Mbnl-dependent localization. These findings hold several new implications for DM pathogenesis.
Alternative splicing is a major contributor to cellular diversity in mammalian tissues and relates to many human diseases. An important goal in understanding this phenomenon is to infer a 'splicing code' that predicts how splicing is regulated in different cell types by features derived from RNA, DNA and epigenetic modifiers. We formulate the assembly of a splicing code as a problem of statistical inference and introduce a Bayesian method that uses an adaptively selected number of hidden variables to combine subgroups of features into a network, allows different tissues to share feature subgroups and uses a Gibbs sampler to hedge predictions and ascertain the statistical significance of identified features. Using data for 3665 cassette exons, 1014 RNA features and 4 tissue types derived from 27 mouse tissues (http://genes.toronto.edu/wasp), we benchmarked several methods. Our method outperforms all others, and achieves relative improvements of 52% in splicing code quality and up to 22% in classification error, compared with the state of the art. Novel combinations of regulatory features and novel combinations of tissues that share feature subgroups were identified using our method. Contact: frey@psi.toronto.edu. Supplementary data are available at Bioinformatics online.