MATS: a Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data

Shihao Shen1, Juw Won Park2, Jian Huang3, Kimberly A. Dittmar4, Zhixiang Lu2,
Qing Zhou5, Russ P. Carstens4,6 and Yi Xing2,7,*
1 Department of Biostatistics, 2 Department of Internal Medicine, 3 Department of Statistics and Actuarial Science,
University of Iowa, Iowa City, IA 52242, 4 Renal Division, Department of Medicine, University of Pennsylvania,
School of Medicine, Philadelphia, PA 19104, 5 Department of Statistics, University of California, Los Angeles, CA
90095, USA, 6 Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
and 7 Department of Biomedical Engineering, University of Iowa, Iowa City, IA 52242
Received August 6, 2011; Revised December 10, 2011; Accepted December 15, 2011
ABSTRACT
Ultra-deep RNA sequencing has become a powerful approach for genome-wide analysis of
pre-mRNA alternative splicing. We develop MATS (multivariate analysis of transcript
splicing), a Bayesian statistical framework for flexible hypothesis testing of
differential alternative splicing patterns on RNA-Seq data. MATS uses a multivariate
uniform prior to model the between-sample correlation in exon splicing patterns, and a
Markov chain Monte Carlo (MCMC) method coupled with a simulation-based adaptive
sampling procedure to calculate the P-value and false discovery rate (FDR) of
differential alternative splicing. Importantly, the MATS approach is applicable to
almost any type of null hypotheses of interest, providing the flexibility to identify
differential alternative splicing events that match a given user-defined pattern. We
evaluated the performance of MATS using simulated and real RNA-Seq data sets. In the
RNA-Seq analysis of alternative splicing events regulated by the epithelial-specific
splicing factor ESRP1, we obtained a high RT–PCR validation rate of 86% for
differential exon skipping events with a MATS FDR of <10%. Additionally, over the full
list of RT–PCR tested exons, the MATS FDR estimates matched well with the experimental
validation rate. Our results demonstrate that MATS is an effective and flexible
approach for detecting differential alternative splicing from RNA-Seq data.
INTRODUCTION
Alternative splicing plays critical roles in regulating gene function and activity in
higher eukaryotes (1). By alternative selection of exons and splice sites during
splicing, a single gene locus can produce multiple mRNA and protein isoforms with
divergent structural and functional properties (2). Over 90% of multi-exon human genes
are alternatively spliced and many genes undergo changes in alternative splicing during
development, cell differentiation and disease (3–6). Changes in alternative splicing
patterns can be manifested as all-or-none switches between distinct mRNA isoforms, but
more frequently as shifts in the relative abundance of multiple mRNA isoforms of a
gene. In some disease genes, a slight shift (by as few as several percent) in the
relative isoform proportions can trigger disease pathogenesis (7,8). Due to the
prevalent role of alternative splicing in gene regulation and disease, there is a
pressing need for genomic tools that can accurately quantify isoform ratios and
reliably detect changes in isoform ratios (i.e. differential alternative splicing)
among different biological conditions.
Direct sequencing of full-length mRNAs and mRNA fragments has been a popular and
effective approach for discovery and quantification of alternative splicing events
(3,4,9). Since mRNAs represent the end products of splicing, by aligning mRNA sequences
to the genome one can delineate exon–intron structures and identify alternative
splicing events. If sequencing reaches the sufficient depth, such that the relative
abundance of distinct mRNA isoforms can be confidently estimated by the numbers of RNA
sequences matching to specific isoforms, we can compare the mRNA sequence counts across
biological conditions to identify differential alternative
*To whom correspondence should be addressed. Tel: +1 319 384 3099; Fax: +1 319 384 3150; Email: yixing@uiowa.edu
Published online 20 January 2012. Nucleic Acids Research, 2012, Vol. 40, No. 8, e61
doi:10.1093/nar/gkr1291
© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0),
which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
splicing events. This approach was first adopted for the discovery of tissue-specific
exons from expressed sequence tags (ESTs) (10). Xu et al. categorized the ESTs of human
genes according to their tissue origins. The exon inclusion level of an alternatively
spliced cassette exon in any given tissue was estimated from the counts of ESTs mapped
uniquely to the exon inclusion or skipping exon-exon junctions (10) (for a formal
definition, see ‘Materials and Methods’ section). Using a Bayesian approach, Xu et al.
compared the EST counts across different tissues to detect exons with tissue-specific
shifts in exon inclusion levels. Specifically, the prior distribution of an exon’s exon
inclusion levels in two tissues was modeled as two independent uniform (0, 1)
distributions. The EST read count of the exon inclusion isoform in each tissue was
assumed to follow a binomial distribution. A Bayesian posterior probability was
calculated to assess whether the exon inclusion levels of an exon differed between two
tissues.
Similar approaches were later used to identify cancer-specific alternative splicing
events (11,12). These studies pioneered the use of RNA sequence count data for
quantitative splicing analysis. However, due to the low throughput of EST sequencing,
EST-based analysis of differential alternative splicing has limited capacity and high
noise (13).
Recently, with the advent of the high-throughput RNA sequencing technology (RNA-Seq),
it has become feasible to generate hundreds of millions of short RNA-Seq reads on any
RNA sample of interest (14). This technology enables genome-wide quantitative analyses
of RNA alternative splicing (3,4). Pan and colleagues demonstrated that if the RNA-Seq
coverage of an alternatively spliced cassette exon reaches at least 20 reads for one of
its exon–exon junctions, the exon inclusion levels estimated by RNA-Seq strongly
correlate with splicing microarray measurements (4). Other studies comparing RNA-Seq
data to RT–PCR data reached similar conclusions (15–17). Thus, by analyzing and
comparing deep RNA-Seq data from different biological conditions, one can identify
exons with changes in exon inclusion levels on a genome scale. From the RNA sequence
count data, differential alternative splicing events are commonly identified by testing
the equality of transcript isoform ratios between samples (11,15,16,18–20). Various
methods have been used to assess the statistical significance of such differential
alternative splicing events, including Fisher exact test of isoform-specific read
counts (15,19,20), and Bayesian approaches that model read counts as sampled from a
probabilistic mixture of distinct isoforms (11,16,21).
In this article, we introduce MATS (multivariate analysis of transcript splicing), a
Bayesian statistical framework for flexible hypothesis testing of differential
alternative splicing patterns on RNA-Seq data. Compared to previous computational
methods for detecting differential alternative splicing events from RNA sequence count
data, MATS has several novel features. First and most importantly, MATS offers the
flexibility to identify differential alternative splicing events that match a given
user-defined pattern. For example, MATS can calculate the statistical significance that
the absolute difference in the exon inclusion levels of an exon between
two conditions exceeds a given threshold (e.g. 10%). This allows biologists to identify
alternative splicing changes that reach any specified magnitude. MATS can also be used
to detect exons with the extreme ‘switch-like’ differential alternative splicing
pattern, i.e. exons predominantly included in the transcripts in one condition but
predominantly skipped in another condition. This switch-like pattern is of considerable
biological interest, because it is a strong indicator of functional alternative
splicing events (3,22). Second, MATS uses a multivariate uniform distribution as the
joint prior for the exon inclusion levels in two conditions. Compared to the
independent uniform priors commonly used by previous methods, the multivariate uniform
prior is more general and better captures the genome-wide similarity in exon splicing
patterns between biological conditions. Of note, this prior distribution is motivated
by the observation that between any two conditions there is generally an overall
positive correlation in exon splicing patterns, and only a small percentage of
alternatively spliced exons undergo differential splicing (see ‘Results’ section).
Finally, MATS employs a Markov chain Monte Carlo (MCMC) method coupled with a
simulation-based adaptive sampling procedure to calculate the P-value and false
discovery rate (FDR) of differential alternative splicing, by comparing the posterior
probability of the observed data to a set of posterior probabilities simulated from the
null hypothesis. This approach is applicable to almost any type of null hypotheses of
interest. To evaluate the performance of MATS, we analyzed a deep RNA-Seq data set with
256 million reads on a human breast cancer cell line (MDA-MB-231) with ectopic
expression of the epithelial-specific splicing factor ESRP1 or an empty vector (EV)
control. In this experimental system, the splicing factor ESRP1 induces large-scale
changes in alternative splicing towards an epithelial-specific splicing signature (23).
Based on the MATS result, we selected 164 exons that covered a broad range of FDR
values for RT–PCR validation. For exons with a MATS FDR of <10%, we obtained a high
overall RT–PCR validation rate of 86%, demonstrating that MATS can reliably detect
differential alternative splicing events. Additionally, over the full list of RT–PCR
tested exons, we observed a progressive decrease in the RT–PCR validation rate with
increasing FDR values, suggesting MATS can yield experimentally meaningful FDR
estimates to help biologists interpret RNA-Seq predictions and design follow-up
experiments.
MATERIALS AND METHODS
Overview of MATS
MATS is a Bayesian statistical framework to detect differential alternative splicing
events from RNA-Seq data. For every alternatively spliced cassette exon, MATS assesses
the statistical significance of differential alternative splicing based on a
user-defined hypothesis. The default analysis of MATS is to test whether the difference
in the exon inclusion levels between two samples exceeds a given user-defined threshold
(e.g. 10%). Compared to existing RNA-Seq analysis methods that test the equality
of exon inclusion levels between samples, the MATS test of splicing difference has
three advantages. First, it provides a rigorous statistical framework for biologists to
identify alternative splicing changes that reach any specified magnitude. Second, it
improves robustness against random sampling noise in the RNA-Seq data, which could
cause a minor shift in the estimated isoform ratio between samples. For exons with high
RNA-Seq read counts such as those from highly expressed genes, such random noise might
introduce false positives to a test of equality on exon inclusion levels. Third, the
flexible hypothesis formulation also allows testing of other types of differential
alternative splicing behavior such as the ‘switch-like’ pattern, in which an exon is
predominantly included in the transcripts in one condition but predominantly skipped in
another condition.
The major steps of MATS are illustrated schematically in Figure 1. First, for each exon
MATS uses the counts of RNA-Seq reads mapped to the exon-exon junctions of its
inclusion or skipping isoform to estimate the exon inclusion levels in two samples
(Figure 1A). Second, the exon inclusion levels of all alternatively spliced cassette
exons are used to construct a multivariate uniform prior that models the overall
similarity in alternative splicing profiles between the two samples (Figure 1B). Third,
based on the multivariate uniform prior and a binomial likelihood model for the RNA-Seq
read counts of the exon inclusion/skipping isoforms, MATS uses an MCMC method to
calculate the Bayesian posterior probability for splicing difference. Under the default
setting, MATS calculates the posterior probability that the change in the exon
inclusion level of a given exon exceeds a given user-defined threshold (e.g. 10%;
Figure 1C). Finally, MATS calculates a P-value for each exon by comparing the observed
posterior probability (from Step C) with a set of simulated posterior probabilities
from the null hypothesis (Figure 1D). These P-values are then transformed to FDR values
by the Benjamini–Hochberg FDR method (24). Below we describe the details of the MATS
algorithm.
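The final step of this overview, converting per-exon P-values to FDR values, can be illustrated with a minimal Python sketch of the standard Benjamini–Hochberg step-up adjustment. The function name `benjamini_hochberg` is hypothetical and this is not the MATS source code; it only shows the adjustment itself.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (FDR), in input order."""
    m = len(p_values)
    # Indices of the P-values, sorted from smallest to largest P-value.
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

An exon would then be reported at, for example, FDR < 10% when its adjusted value is below 0.1.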
Estimating exon inclusion levels
We define the exon inclusion level ($\psi$) of an alternatively spliced cassette exon
as the percentage of ‘exon inclusion’ transcripts that splice from its upstream exon
into the cassette exon then into its downstream exon among all such ‘exon inclusion’
transcripts plus ‘exon skipping’ transcripts that splice from its upstream exon
directly into its downstream exon (16,17,25). In an RNA-Seq study, for each exon in a
given sample we count the number of RNA-Seq reads uniquely mapped to its upstream,
downstream or skipping exon–exon junctions (Figure 1A). The upstream junction count
(UJC) and the downstream junction count (DJC) reflect the abundance of the exon
inclusion isoform, while the skipping junction count (SJC) reflects the abundance of
the exon skipping isoform. Let $I$ and $S$ represent the counts of exon inclusion and
skipping isoforms respectively. Assuming that the read counts follow a binomial
distribution, the maximum likelihood estimate (MLE) of the exon inclusion level
($\psi$) of an exon in a given sample can be calculated as:
$$\hat{\psi} = \frac{I}{I + S} = \frac{(\mathrm{UJC} + \mathrm{DJC})/2}{(\mathrm{UJC} + \mathrm{DJC})/2 + \mathrm{SJC}}$$
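As a worked illustration of this estimate, the following Python sketch evaluates the formula for one exon in one sample. The function name `inclusion_level` and the choice to return `None` when no junction reads are observed are assumptions made here for illustration.

```python
def inclusion_level(ujc, djc, sjc):
    """MLE of the exon inclusion level from junction read counts.

    ujc, djc: reads on the upstream and downstream inclusion junctions.
    sjc: reads on the skipping junction.
    """
    inclusion = (ujc + djc) / 2  # I: average of the two inclusion junctions
    skipping = sjc               # S
    total = inclusion + skipping
    if total == 0:
        return None  # no junction reads, so the estimate is undefined
    return inclusion / total
```

For example, UJC = 30, DJC = 50 and SJC = 10 give I = 40 and S = 10, hence an estimated inclusion level of 40/50 = 0.8.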
Calculating the Bayesian posterior probability for
differential alternative splicing
To compare alternative splicing patterns between two samples, for each exon we define
$\psi_1$ and $\psi_2$ as its exon inclusion levels in sample 1 and 2. Under the default
setting, MATS tests the hypothesis that the difference in the exon inclusion levels of
a given exon between sample 1 and 2 is above a user-defined cutoff $c$, i.e.
$|\psi_1 - \psi_2| > c$. The cutoff $c$ is a user-defined parameter that represents the
extent of splicing change one wishes to identify. For example, if a researcher is
interested in identifying exons with at least 10% change in exon inclusion levels, the
cutoff $c$ should be set as 10%. The values of $\psi_1$ and $\psi_2$ under the null
hypothesis ($H_0$) and the alternative hypothesis ($H_1$) of this test are illustrated
graphically in
Figure 1. Basic steps of MATS. (A) For each exon MATS uses the counts of RNA-Seq reads
mapped to the exon–exon junctions of its inclusion or skipping isoform to estimate the
exon inclusion levels in two samples. (B) The exon inclusion levels of all
alternatively spliced cassette exons are used to construct a multivariate uniform prior
that models the overall similarity in alternative splicing profiles between the two
samples. (C) Based on the multivariate uniform prior and a binomial likelihood model
for the RNA-Seq read counts of the exon inclusion/skipping isoforms, MATS uses a Markov
chain Monte Carlo (MCMC) method to calculate the Bayesian posterior probability for
splicing difference. (D) MATS calculates a P-value for each exon by comparing the
observed posterior probability with a set of simulated posterior probabilities from the
null hypothesis, followed by adjustment for multiple testing to obtain the FDR value.
Figure 2A. In this section, we describe how MATS calculates the posterior probability
of $|\psi_1 - \psi_2| > c$ from the RNA-Seq counts, i.e.
$P(|\psi_1 - \psi_2| > c \mid \mathrm{Data})$. The P-value and FDR calculation is
described in the next section.
To calculate the posterior probability of $|\psi_1 - \psi_2| > c$, we need to define
the prior probability and the likelihood model. In MATS, the joint prior distribution
of $\psi_1$ and $\psi_2$ is modeled as a multivariate uniform distribution (Figure 1B),
with marginal distributions as uniform (0, 1) and correlation
$\rho \sim \mathrm{Uniform}(0, 1)$:
$$\text{Prior}: \; (\psi_1, \psi_2) \sim \mathrm{MultiVarUniform}\left(0, 1, \mathrm{cor} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right), \qquad \rho \sim \mathrm{Uniform}(0, 1)$$
The multivariate uniform distribution was obtained by applying cumulative standard
normal distribution functions to a random vector that follows a multivariate normal
distribution. Specifically, $(\psi_1, \psi_2) = (\Phi(Z_1), \Phi(Z_2))$, where $Z_1$
and $Z_2$ are standard normal random variables $N(0, 1)$ with correlation $\rho$ and
$\Phi$ is the cumulative distribution function of the standard normal distribution.
The obtained multivariate uniform distribution is equivalent to a bivariate
distribution with uniform marginals.
We note that our choice of prior distribution in MATS differs from previous methods
which model the priors of $\psi_1$ and $\psi_2$ as two independent uniform
distributions (11,16). This multivariate uniform prior distribution of $\psi_1$ and
$\psi_2$ is motivated by the observation that between any two biological conditions,
there is generally an overall similarity in the genome-wide alternative splicing
profiles, and only a small percentage of alternatively spliced exons undergo
differential splicing. Indeed, our analysis of several RNA-Seq data sets suggests that
this multivariate uniform prior provides a good fit with empirical data (see ‘Results’
section). In contrast, the commonly used independent uniform prior distributions assume
that the splicing activities of the same exon in two different samples are independent,
even if these two samples have the identical cell type and tissue origin. This lacks
biological justification and fits empirical data poorly.
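The construction of the prior described above (a bivariate normal vector pushed through the standard normal CDF) can be sketched in a few lines of Python using only the standard library. The function names are hypothetical, and the correlation is passed in as a fixed number here, whereas MATS treats it as a parameter to be estimated.

```python
import math
import random


def standard_normal_cdf(z):
    """Phi(z), the CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_correlated_uniform_pair(rho, rng):
    """Draw (psi1, psi2) with Uniform(0, 1) marginals from correlated normals."""
    z1 = rng.gauss(0.0, 1.0)
    # z2 is standard normal and has correlation rho with z1.
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return standard_normal_cdf(z1), standard_normal_cdf(z2)


rng = random.Random(0)
prior_draws = [sample_correlated_uniform_pair(0.8, rng) for _ in range(5)]
```

With a correlation close to 1 the draws concentrate near the diagonal where the two inclusion levels are equal, and with a correlation of 0 the construction reduces to two independent uniform priors.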
In each sample, the exon inclusion isoform count $I$ follows a binomial distribution
with $n = I + S$ and $p = \psi$, where $S$ is the skipping isoform count.
Assume we have a total of $N$ alternatively spliced cassette exons. For each exon
$i = 1, \ldots, N$, we denote:
$\psi_{i1}, \psi_{i2}$: exon inclusion levels of exon $i$ in sample 1 and 2;
$I_{i1}, I_{i2}$: counts of the exon inclusion isoform of exon $i$ in sample 1 and 2;
$S_{i1}, S_{i2}$: counts of the exon skipping isoform of exon $i$ in sample 1 and 2.
$$\text{Likelihood}: \; I_{i1} \mid \psi_{i1} \sim \mathrm{Binomial}(n_{i1} = I_{i1} + S_{i1}, \; p_{i1} = \psi_{i1})$$
$$I_{i2} \mid \psi_{i2} \sim \mathrm{Binomial}(n_{i2} = I_{i2} + S_{i2}, \; p_{i2} = \psi_{i2})$$
The posterior probability of differential alternative splicing for exon $i$ can be
calculated as
$P_i = P(|\psi_{i1} - \psi_{i2}| > c \mid I_{i1}, I_{i2}, S_{i1}, S_{i2}, I_{[-i]1}, I_{[-i]2}, S_{[-i]1}, S_{[-i]2})$,
where the counts of the exon inclusion/skipping isoforms of exon $i$ and all other
alternatively spliced cassette exons (indexed by $[-i]$) are used to infer the
parameter $\rho$ in the multivariate uniform prior as well as $\psi_{i1}$ and
$\psi_{i2}$, and $c$ represents the user-defined threshold for splicing change.
As this posterior probability cannot be calculated with an analytic solution in
closed-form, we adopt a numeric solution based on the MCMC method. Specifically, the
posterior probability is calculated by the JAGS program (Just Another Gibbs Sampler;
http://sourceforge.net/projects/mcmc-jags/). This program estimates the posterior
probability
$P(|\psi_{i1} - \psi_{i2}| > c \mid I_{i1}, I_{i2}, S_{i1}, S_{i2}, I_{[-i]1}, I_{[-i]2}, S_{[-i]1}, S_{[-i]2})$
for all exons and the parameter $\rho$ of the multivariate uniform prior. The parameter
$\rho$ is a global parameter for all exons, which describes the overall correlation of
the exon inclusion levels of all alternatively spliced exons between two samples.
Therefore, for $N$ exons there are a total of $2N + 1$ parameters, including $2N$
parameters for exon inclusion levels in two samples and the
Figure 2. Null hypotheses in MATS. (A) Under the default setting of MATS, the $H_1$ alternative hypothesis is that the difference in the exon
inclusion levels between two samples is above the user-defined cutoff $c$ (the white area). The $H_0$ null hypothesis is that the difference is below the
user-defined cutoff $c$ (the gray area). (B) MATS can test the extreme ‘switch-like’ differential alternative splicing pattern with a different hypothesis.
The $H_1$ alternative hypothesis is that the exon inclusion level is below a user-defined threshold $s$ in sample 1 and above $1 - s$ in sample 2, or vice versa
(the white area). The $H_0$ null hypothesis is outside the alternative hypothesis region (the gray area).
global parameter $\rho$. To estimate the global parameter $\rho$ along with all the
$\psi$ values, the data of all $N$ exons are used as the input for the MCMC sampler. We
burn in the MCMC sampler for 2000 iterations, followed by another 10000 iterations to
calculate the posterior probabilities. The posterior probability of a given exon $i$
(denoted as $P_i^{obs}$) is estimated by the fraction of iterations with
$|\psi_{i1} - \psi_{i2}| > c$ among all 10000 iterations (Figure 1C).
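To make the quantity being estimated concrete, the following Python sketch approximates the posterior probability of a splicing difference for a single exon. It is a deliberately simplified stand-in, not the MATS procedure: the correlation `rho` is held fixed rather than estimated jointly across all exons, and self-normalized importance sampling from the prior replaces the JAGS MCMC sampler. All function names and default values are hypothetical.

```python
import math
import random


def _phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def _log_binom_kernel(inc, skip, psi):
    """Binomial log-likelihood in psi, without the constant binomial coefficient."""
    eps = 1e-12
    psi = min(max(psi, eps), 1.0 - eps)  # guard against log(0)
    return inc * math.log(psi) + skip * math.log(1.0 - psi)


def posterior_prob_diff(i1, s1, i2, s2, c, rho, n_draws=20000, seed=0):
    """Approximate P(|psi1 - psi2| > c | counts) for one exon with rho fixed."""
    rng = random.Random(seed)
    log_weights, exceeds = [], []
    for _ in range(n_draws):
        # Draw (psi1, psi2) from the correlated uniform prior.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        psi1, psi2 = _phi(z1), _phi(z2)
        # Weight each prior draw by the binomial likelihood of the counts.
        log_weights.append(_log_binom_kernel(i1, s1, psi1)
                           + _log_binom_kernel(i2, s2, psi2))
        exceeds.append(abs(psi1 - psi2) > c)
    top = max(log_weights)  # subtract the maximum to avoid underflow
    weights = [math.exp(lw - top) for lw in log_weights]
    return sum(w for w, e in zip(weights, exceeds) if e) / sum(weights)
```

With counts such as I = 90, S = 10 in one sample and I = 10, S = 90 in the other, the estimate is close to 1 for c = 0.1, while identical counts in both samples give a small value. Importance sampling from the prior becomes inefficient at high read depth, which is one reason a dedicated sampler is preferable in practice.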
Calculating the P-value and FDR for differential
alternative splicing
We use a simulation-based adaptive sampling procedure to calculate the P-value and FDR
for differential alternative splicing. In theory, P-value comes from the comparison of
the observed test statistics with statistics from the null hypothesis. In MATS, when we
test $|\psi_1 - \psi_2| > c$, we consider the null hypothesis that
$|\psi_1 - \psi_2| \le c$. We calculate the P-value of each exon by comparing the
posterior probability of the observed data ($P_i^{obs}$) to a set of posterior
probabilities simulated from the null hypothesis. For each exon, we find the maximum
likelihood estimate (MLE) of the constrained $\psi_1$ and $\psi_2$ (denoted as
$\hat{\psi}_1^c$ and $\hat{\psi}_2^c$) from the binomial distributions with the counts
$I_1$, $I_2$, $S_1$ and $S_2$ in two conditions, subject to the constraint that
$|\psi_1 - \psi_2| \le c$. Specifically,
$$(\hat{\psi}_1^c, \hat{\psi}_2^c) = \underset{|\psi_1 - \psi_2| \le c}{\arg\max} \left( I_1 \log \psi_1 + S_1 \log(1 - \psi_1) + I_2 \log \psi_2 + S_2 \log(1 - \psi_2) \right)$$
The limited-memory Broyden–Fletcher–Goldfarb–Shanno box-constraints (L-BFGS-B)
algorithm is used to search for the constrained MLE (26,27). Then we simulate RNA-Seq
count data from the constrained MLE of $\hat{\psi}_1^c$ and $\hat{\psi}_2^c$, and
calculate the posterior probability of $|\psi_1 - \psi_2| > c$ given the simulated
data. The P-value of each exon is calculated by comparing the posterior probability of
differential splicing based on the observed data to a set of simulated posterior
probabilities. The details of this calculation for a given exon $i$ are summarized
below:
(1) Retrieve the estimated global parameter $\hat{\rho}$ from the MCMC calculation of
posterior probabilities of all alternatively spliced exons. The value of $\hat{\rho}$
is fixed in the P-value calculation.
(2) For exon $i$, retrieve the observed posterior probability from the MCMC calculation
$P_i^{obs} = P(|\psi_{i1} - \psi_{i2}| > c \mid I_{i1}, I_{i2}, S_{i1}, S_{i2}, I_{[-i]1}, I_{[-i]2}, S_{[-i]1}, S_{[-i]2})$.
(3) For exon $i$, simulate $M$ posterior probabilities from the constrained MLE of
$\hat{\psi}_{i1}^c$ and $\hat{\psi}_{i2}^c$. For $j = 1, \ldots, M$:
i) Simulate data
$I_{i1j} \mid \hat{\psi}_{i1}^c \sim \mathrm{Binomial}(n_{i1} = I_{i1} + S_{i1}, \; p_{i1} = \hat{\psi}_{i1}^c)$; $S_{i1j} = n_{i1} - I_{i1j}$
$I_{i2j} \mid \hat{\psi}_{i2}^c \sim \mathrm{Binomial}(n_{i2} = I_{i2} + S_{i2}, \; p_{i2} = \hat{\psi}_{i2}^c)$; $S_{i2j} = n_{i2} - I_{i2j}$
ii) Calculate the posterior probability from the simulated data
$I_{i1j}, I_{i2j}, S_{i1j}, S_{i2j}$ using the MCMC method as
$P_{ij}^{sim} = P(|\psi_{i1j} - \psi_{i2j}| > c \mid I_{i1j}, I_{i2j}, S_{i1j}, S_{i2j}, \hat{\rho})$.
iii) Calculate the P-value for exon $i$ by comparing $P_i^{obs}$ with the simulated
$\{P_{ij}^{sim}\}$ as $\frac{1}{M} \sum_{j=1}^{M} I(P_i^{obs} \le P_{ij}^{sim})$.
For each exon, the number of M is determined by an adaptive sampling procedure (see
below).
The number of simulations ($M$) in calculating the simulated posterior probabilities is
determined by an adaptive sampling procedure. Initially, we aim to reach a P-value
precision of 0.01 by setting $M = 100$. One hundred simulated posterior probabilities
are calculated for each exon, and the exon’s P-value is generated by comparing the
observed posterior probability to the simulated ones. A zero or close-to-zero P-value
for any exon indicates that the number of simulations is insufficient for generating a
reliable P-value estimate. For all exons with a P-value of smaller than three times the
precision, the number of simulations is increased by 10-fold in a new round of
simulation, which increases the precision of the P-value estimate from 0.01 to 0.001.
This adaptive sampling procedure is repeated for multiple rounds. The default setting
of MATS is to repeat this procedure for at most six rounds to reach the highest P-value
precision of $10^{-7}$, but this parameter can be adjusted by users. This adaptive
sampling procedure enables us to selectively increase the precision and running time
for exons with significant (i.e. small) P-values, thus reducing the overall running
time needed for all exons.
After we obtain the P-values of all exons, we apply the classic Benjamini–Hochberg
method (24) on these P-values to obtain the FDR values.
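The P-value procedure above can be sketched in Python as follows. Several choices here are assumptions made for illustration and are not taken from MATS: the constrained MLE is found with a one-dimensional ternary search along the constraint boundary instead of L-BFGS-B (valid because the log-likelihood is concave), each adaptive round draws a fresh set of simulations, both samples are assumed to have at least one read, and the posterior-probability calculation is abstracted as a caller-supplied function `simulate_one`. All names are hypothetical.

```python
import math
import random


def _loglik(i1, s1, i2, s2, psi1, psi2):
    """Two-sample binomial log-likelihood (constant terms dropped)."""
    total = 0.0
    for count, p in ((i1, psi1), (s1, 1 - psi1), (i2, psi2), (s2, 1 - psi2)):
        if count > 0:
            total += count * math.log(max(p, 1e-12))
    return total


def constrained_mle(i1, s1, i2, s2, c):
    """Maximize the log-likelihood subject to |psi1 - psi2| <= c."""
    psi1_hat, psi2_hat = i1 / (i1 + s1), i2 / (i2 + s2)
    if abs(psi1_hat - psi2_hat) <= c:
        return psi1_hat, psi2_hat  # unconstrained MLE already satisfies H0
    # Otherwise the optimum lies on the boundary psi1 = psi2 + d.
    d = c if psi1_hat > psi2_hat else -c
    lo, hi = max(0.0, -d), min(1.0, 1.0 - d)  # keeps psi1 and psi2 in [0, 1]
    for _ in range(200):  # ternary search on a concave 1-D function
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if _loglik(i1, s1, i2, s2, m1 + d, m1) < _loglik(i1, s1, i2, s2, m2 + d, m2):
            lo = m1
        else:
            hi = m2
    psi2 = (lo + hi) / 2.0
    return psi2 + d, psi2


def simulate_null_counts(i1, s1, i2, s2, c, rng):
    """Draw one simulated data set (I1, S1, I2, S2) from the constrained MLE."""
    psi1_c, psi2_c = constrained_mle(i1, s1, i2, s2, c)
    n1, n2 = i1 + s1, i2 + s2
    sim_i1 = sum(rng.random() < psi1_c for _ in range(n1))
    sim_i2 = sum(rng.random() < psi2_c for _ in range(n2))
    return sim_i1, n1 - sim_i1, sim_i2, n2 - sim_i2


def empirical_p_value(p_obs, simulated):
    """Fraction of simulated posterior probabilities that are >= the observed one."""
    return sum(1 for p_sim in simulated if p_obs <= p_sim) / len(simulated)


def adaptive_p_value(p_obs, simulate_one, initial_m=100, max_rounds=6):
    """Increase M 10-fold while the P-value is below three times the precision 1/M."""
    m = initial_m
    for round_index in range(max_rounds):
        simulated = [simulate_one() for _ in range(m)]
        p_value = empirical_p_value(p_obs, simulated)
        if p_value >= 3.0 / m or round_index == max_rounds - 1:
            return p_value, m
        m *= 10
```

In a full pipeline, `simulate_one` would call `simulate_null_counts` and then compute the posterior probability of a splicing difference for the simulated counts. Starting from M = 100, six rounds reach M = 10^7, which matches the stated precision limit of 10^-7.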
Detecting switch-like differential alternative splicing
MATS offers the flexibility for testing different types of hypotheses on the
differential alternative splicing pattern. An analysis of considerable biological
interest is to identify exons with the extreme ‘switch-like’ differential alternative
splicing pattern, i.e. ($\psi_1 < s$ and $\psi_2 > 1 - s$) or ($\psi_1 > 1 - s$ and
$\psi_2 < s$) where $s$ is a user-defined parameter between 0 and 1/2. For example, if
we set $s$ as 1/3, we test if the exon inclusion level of an exon switches from less
than 1/3 in one sample to more than 2/3 in the other sample. Such an extreme switch of
exon inclusion levels between conditions is a strong indicator of functional
alternative splicing events (3,22). In the ‘switch-like’ test, MATS considers the null
hypothesis that the $\psi_1$ and $\psi_2$ are outside of the region defined by
($\psi_1 < s$ and $\psi_2 > 1 - s$) or ($\psi_1 > 1 - s$ and $\psi_2 < s$). The values
of $\psi_1$ and $\psi_2$ under the null hypothesis ($H_0$) and the alternative
hypothesis ($H_1$) for the ‘switch-like’ test are illustrated graphically in Figure 2B.
The same MCMC and simulation procedures for testing $|\psi_1 - \psi_2| > c$ can be used
to calculate the Bayesian posterior probability, P-value, and FDR for ‘switch-like’
differential alternative splicing.
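The switch-like region can be written as a simple predicate. The Python sketch below, with hypothetical function names, also shows how a posterior probability for this hypothesis could be estimated as the fraction of posterior draws of the two inclusion levels that fall inside the region, by analogy with the default test.

```python
def is_switch_like(psi1, psi2, s):
    """True if one inclusion level is below s and the other is above 1 - s."""
    return (psi1 < s and psi2 > 1 - s) or (psi1 > 1 - s and psi2 < s)


def switch_like_fraction(draws, s):
    """Fraction of (psi1, psi2) draws that fall in the switch-like region."""
    return sum(is_switch_like(p1, p2, s) for p1, p2 in draws) / len(draws)
```

For s = 1/3, the pair (0.2, 0.9) is switch-like, whereas (0.4, 0.9) is not because the lower inclusion level is not below 1/3.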
Exon–exon junction database of human genes
We constructed a database of exon–exon junctions in human genes using the Ensembl
transcript annotations (release 57) (28). The database includes all known exon–exon
junctions observed in Ensembl transcripts, as well as
hypothetical exon-exon junctions obtained by all possible pairwise fusions of exons
within genes. In total, the database contains ~3.5 million exon–exon junctions. Each
exon–exon junction sequence is 84 bp long with 42 bp from the 3′ end of the upstream
exon and 42 bp from the 5′ end of the downstream exon. This exon–exon junction database
is available for download from the MATS website
http://intron.healthcare.uiowa.edu/MATS/.
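The junction construction described above can be sketched in Python. The function names are hypothetical, the exons of a gene are assumed to be supplied as sequences ordered from 5′ to 3′, and the handling of exons shorter than 42 bp (here, the whole exon is used) is an assumption, since the text does not specify it.

```python
from itertools import combinations


def junction_sequence(upstream_exon, downstream_exon, flank=42):
    """Last `flank` bases of the upstream exon joined to the first `flank`
    bases of the downstream exon."""
    return upstream_exon[-flank:] + downstream_exon[:flank]


def all_pairwise_junctions(exon_seqs, flank=42):
    """Junctions for every ordered pair (a, b) with a upstream of b."""
    return {
        (a, b): junction_sequence(exon_seqs[a], exon_seqs[b], flank)
        for a, b in combinations(range(len(exon_seqs)), 2)
    }
```

A gene with four exons of at least 42 bp each yields six junction sequences of 84 bp under this scheme.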
RNASeq analysis of ESRP1 regulated differential
alternative splicing events in the MDAMB231 breast
cancer cell line
MDA-MB-231 cells were maintained and retrovirally
transduced by a cDNA encoding the epithelial-specific
splicing factor ESRP1 or the empty vector (EV) control
as described previously (23,29). Sequencing libraries were
prepared using the mRNA-Seq Sample Prep Kits
(Illumina) according to the manufacturer’s instructions.
Ten micrograms of total RNA was used to prepare
polyA RNA for fragmentation followed by cDNA synthesis
with random hexamers and ligation to Illumina
adaptor sequences. The samples underwent an RNA
quality assurance check and were quantified using an
Agilent 2100 Bioanalyzer, loaded onto flowcells for cluster
generation, and sequenced on an Illumina Genome
Analyzer II using a single-end protocol to generate 76-bp
reads (Illumina). The resulting RNA-Seq dataset consisted
of 256 million single-end reads, including 136 million
reads for the ESRP1 sample and 120 million reads for
the EV sample.
During the quality assessment of our 76-bp single-end
RNA-Seq data, we found that the first two 25-bp segments
of these reads had a high mapping rate to the human
genome, while the third 25-bp segment had a much lower
mapping rate (data not shown). This is likely due to the
increased sequencing error rate near the 3′ end of
the Illumina RNA-Seq reads. Thus, we decided to use
the first 50 bp of each read for mapping and subsequent
analysis. We mapped RNA-Seq reads to the human
genome (hg19) and the exon–exon junction database,
using the software Bowtie (30) allowing up to 3 bp
mismatches. Each mapped exon–exon junction read
was required to have at least 8 bp on each side of the exon–exon
junction. We further removed exon–exon junction reads
that mapped to either the human genome (hg19) or
multiple exon–exon junctions. For all identified alternatively
spliced cassette exons, the exon–exon junction
counts were used as the input for MATS.
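The junction read filters described above can be sketched as follows. This is an illustrative sketch only: the tuple layout of an alignment hit and the function names are assumptions, and the code does not reproduce the actual MATS accessory scripts.

```python
JUNCTION_POS = 42  # the junction sits after base 42 of the 84-bp sequence
READ_LEN = 50      # only the first 50 bp of each read were mapped
MIN_ANCHOR = 8     # minimum bases required on each side of the junction


def spans_junction(start, read_len=READ_LEN, junction_pos=JUNCTION_POS,
                   min_anchor=MIN_ANCHOR):
    """start: 0-based alignment start of the read on the junction sequence."""
    left = junction_pos - start               # bases on the upstream exon
    right = start + read_len - junction_pos   # bases on the downstream exon
    return left >= min_anchor and right >= min_anchor


def keep_junction_read(hits):
    """hits: list of (target_type, target_id, start) for one read, where
    target_type is "genome" or "junction" (hypothetical representation).

    A read is kept only if it maps to exactly one junction, does not map
    to the genome, and has enough sequence on both sides of the junction.
    """
    junction_hits = [h for h in hits if h[0] == "junction"]
    genome_hits = [h for h in hits if h[0] == "genome"]
    if genome_hits or len(junction_hits) != 1:
        return False
    return spans_junction(junction_hits[0][2])
```

With 42-bp flanks and 50-bp reads, any read lying entirely within an 84-bp junction sequence (start positions 0 to 34) has at least 8 bp on each side, so the anchor requirement follows from the database design in this setting.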
Illumina Human Body Map 2.0 data on 16 human tissues
We obtained a paired-end RNA-Seq data set from
Illumina, with ~80 million read pairs per tissue for 16
human tissues (adipose, adrenal, brain, breast, colon,
heart, kidney, liver, lung, lymph node, ovary, prostate,
skeletal muscle, testes, thyroid and white blood cells).
This data set was referred to as the ‘Human Body Map
2.0’ by Illumina. For each paired-end read (50 bp × 2), we
mapped each end to the human genome (hg19) and the
exon–exon junction database, using the software Bowtie
(30) allowing up to 3 bp mismatches. The final mapping
location of the paired-end read was determined by
requiring that the two ends be mapped in the opposite
orientation to the same genomic region, with a
maximum genomic distance of 50 kb between the two
ends (to allow introns between the two mapped ends).
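The pairing rule can be expressed as a small predicate. This is a sketch under stated assumptions: "same genomic region" is interpreted here as the same chromosome, and the `EndAlignment` type and its fields are hypothetical names introduced for the example.

```python
from typing import NamedTuple

MAX_DISTANCE = 50_000  # 50 kb, to allow introns between the two ends


class EndAlignment(NamedTuple):
    chrom: str
    pos: int       # genomic coordinate of the mapped end
    strand: str    # "+" or "-"


def concordant_pair(end1, end2, max_distance=MAX_DISTANCE):
    """True if the two ends map to the same chromosome, in opposite
    orientation, and no more than max_distance apart."""
    return (
        end1.chrom == end2.chrom
        and end1.strand != end2.strand
        and abs(end1.pos - end2.pos) <= max_distance
    )
```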
RT–PCR validation
Quantitation of alternative splicing was performed using
standard RT–PCR incorporating radiolabeled dCTP or
fluorescently labeled primers, or high-throughput RT–PCR
at the Université de Sherbrooke as described
(23,31). Since we used MATS to test $|\psi_1 - \psi_2| > 0.1$, we
defined a differential alternative splicing event as validated
if the RT–PCR-based exon inclusion levels differed by at
least 10% between the two samples with the direction of
change matching the RNA-Seq prediction.
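The validation criterion can be written as a short predicate. The function name and argument layout are assumptions for illustration; inclusion levels are expressed as fractions between 0 and 1.

```python
def is_validated(rnaseq_psi1, rnaseq_psi2, pcr_psi1, pcr_psi2, min_change=0.10):
    """True if the RT-PCR inclusion levels differ by at least min_change
    and the direction of change matches the RNA-Seq prediction."""
    predicted = rnaseq_psi1 - rnaseq_psi2
    observed = pcr_psi1 - pcr_psi2
    same_direction = predicted * observed > 0
    return same_direction and abs(observed) >= min_change
```

For example, with hypothetical RT–PCR values, `is_validated(0.77, 0.27, 0.80, 0.30)` holds, whereas a change in the opposite direction or a change of only 5% does not.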
RESULTS
Multivariate uniform prior in MATS
MATS uses a multivariate uniform distribution to model
the joint prior of exon inclusion levels of alternatively
spliced cassette exons in two samples. This is different
from and more general than the independent uniform
priors used by previous methods (11,16). Note that the
multivariate uniform prior includes the independent
uniform prior model as a special case ($\rho = 0$). These two
types of prior distributions have distinct underlying assumptions
about the alternative splicing patterns of different
biological conditions. Intuitively, the multivariate uniform
prior allows the modeling of similarity in alternative
splicing patterns between two samples (using the correlation
parameter $\rho$). In contrast, the independent uniform
priors assume that the global splicing pattern of one
sample is independent of the other sample. To determine
whether the multivariate uniform prior is appropriate and
able to capture the correlation pattern in the data, we
analyzed two RNA-Seq data sets covering diverse tissues
and cell types.
We first compared the alternative splicing profiles of a
single cell line subject to two different treatments. The
data set came from our deep single-end RNA sequencing
of a human breast cancer cell line (MDA-MB-231) with
ectopic expression of the epithelial-specific splicing factor
ESRP1 or an empty vector (EV) control (see ‘Materials
and Methods’ section). ESRP1 encodes a master cell-type-specific
regulator of alternative splicing that controls a
global epithelial-specific splicing network (23,29). In the
MDA-MB-231 cell line, the ectopic expression of ESRP1
drives coordinated switches of ESRP1-regulated exons
towards an epithelial splicing signature (23). This
provides an excellent system for testing algorithms of
alternative splicing analysis. We generated 136 million
single-end reads on the ESRP1 sample and 120 million
single-end reads on the EV sample. We identified a total
of 18859 alternatively spliced cassette exons in this data
set. Pan and colleagues previously demonstrated that
RNA-Seq can reliably estimate the exon inclusion levels
of alternatively spliced exons, when the sequencing
coverage reaches at least 20 reads for one of the three
exon–exon junctions (4). Therefore, to assess the global
correlation in alternative splicing patterns between these
two samples (ESRP1 and EV), we restricted our analysis
to 12890 alternatively spliced cassette exons with at least
20 reads mapped to one of the three exon–exon junctions
in both samples. We observed a high correlation in
the exon inclusion levels of these exons between the
ESRP1 and EV samples (Pearson correlation r = 0.95,
P < 2.2e-16; Figure 3A). In contrast, the exon inclusion
levels simulated from two independent uniform priors
had no correlation between the two samples (Pearson
correlation r = 0; Figure 3B), contradicting the
real data.
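The contrast between independent and correlated inclusion levels can be reproduced with a small simulation. This sketch uses synthetic values only: the "correlated" sample is a hypothetical perturbation of sample 1 chosen for illustration, not the ESRP1 data or the MATS prior.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n_exons = 12890  # same number of exons as in the analysis above

psi_sample1 = [rng.random() for _ in range(n_exons)]

# Independent uniform priors: sample 2 is drawn without reference to sample 1.
psi_independent = [rng.random() for _ in range(n_exons)]

# Hypothetical correlated case: sample 2 is sample 1 plus a small change,
# clipped to the [0, 1] range.
psi_correlated = [min(1.0, max(0.0, p + rng.uniform(-0.1, 0.1)))
                  for p in psi_sample1]

r_independent = pearson(psi_sample1, psi_independent)
r_correlated = pearson(psi_sample1, psi_correlated)
```

Under these assumptions `r_independent` is close to zero while `r_correlated` is close to one, which mirrors the qualitative difference between Figure 3B and Figure 3A.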
To assess if the multivariate uniform prior is capable of
modeling the correlation pattern in the data, we used our
MCMC procedure to obtain the estimate of the correlation
parameter $\rho$ on the real ESRP data set. As the
control, we analyzed 10000 simulated data sets using
exon inclusion levels simulated from two independent
uniform priors, in which no correlation existed between
the two samples. For the ESRP data, our MCMC procedure
obtained an estimate of $\rho$ of 0.93, consistent with the
high overall positive correlation observed in the data. In
contrast, the estimates of $\rho$ on the 10000 simulated data
sets were close to 0, indicating no correlation between the
two samples (Figure 3C). These results suggest that the
multivariate uniform prior model is flexible enough for
both situations, and that the MCMC procedure is
capable of obtaining an estimate of the parameter $\rho$ that
reflects the degree of correlation in the data.
To assess if the pattern observed in the MDA-MB-231
cell line holds true when we compare more distantly
related samples of different tissue origins, we analyzed
the Illumina Human Body Map 2.0 data on 16 human
tissues (see ‘Materials and Methods’ section). We performed
pairwise comparisons of alternative splicing
profiles of all possible tissue pairs. Between any two
tissues, we observed a high overall correlation in the
estimated exon inclusion levels of alternatively spliced
cassette exons, with the Pearson correlation coefficient
ranging from 0.89 to 0.97 (Supplementary Figure S1).
These data further justify the use of the multivariate
uniform prior, even for the comparison between more
divergent samples representing different tissue types.
Simulation study of MATS
We evaluated the performance of MATS with a simulation
study. Specifically, we generated a simulated
RNA-Seq data set with a mixture of data points representing
non-differentially spliced exons and differentially
spliced exons. The threshold of the exon inclusion level
difference between two samples was set as 10% in our
simulation study (i.e. $|\psi_1 - \psi_2| > 0.1$). We generated
data for non-differentially spliced exons by sampling the
exon inclusion level of an exon in sample 1 from a uniform
(0, 1) distribution, and randomly added or subtracted a
small variation drawn from a uniform (0, 0.1) distribution
to obtain the exon inclusion level in sample 2. We
generated data for differentially spliced exons by
sampling the exon inclusion level of an exon in sample 1
from a uniform (0, 1) distribution, and randomly added or
subtracted a large variation drawn from a uniform (0.1, 1)
distribution to obtain the exon inclusion level in sample 2.
For all simulated exons, if the variation added to or subtracted
from the exon inclusion level in sample 1 caused
the exon inclusion level in sample 2 to be above 1 or below
0, the sampling step for the variation was repeated until
the exon inclusion level in sample 2 was within the [0,1]
range. In the simulated data, we generated a total of 5000
data points in which 95% represented non-differentially
spliced exons and 5% represented differentially spliced
exons (Figure 4A). After the exon inclusion levels were
simulated for 5000 exons, we generated 5 simulated data
sets, by setting the total inclusion+skipping isoform
junction counts per exon and sample as 100, 200, 500,
Figure 3. The multivariate uniform prior can model the between-sample correlation pattern in the RNA-Seq data. (A) The scatter plot of the
estimated exon inclusion levels of 12890 alternatively spliced cassette exons in the ESRP1 and EV samples. Only exons with at least 20 reads mapped
to one of the three exon–exon junctions in both samples are included in the plot. (B) The scatter plot of the exon inclusion levels in two samples
simulated from two independent uniform priors. In (A and B), the two red lines define the area where $|\psi_1 - \psi_2| \le 0.1$. (C) The MCMC estimate of
the correlation parameter $\rho$ can capture the correlation pattern in the data. For the ESRP data, $\hat{\rho}$ is 0.93 (the red vertical line). For the 10000
simulated data sets from independent uniform priors, $\hat{\rho}$ is distributed close to zero.
1000 and 2000 respectively. The inclusion isoform count
of an exon in a sample was then calculated as the product
of its simulated exon inclusion level and the total junction
count. After the simulation data sets were generated, we
used MATS to calculate the P-value and FDR of each
exon, and compared the estimated FDRs to the true
FDRs. As shown in Figure 4B–F, although the estimated
FDRs were generally more conservative (i.e. with higher
values) than the true FDRs, the overall curve of the
estimated FDR followed the trend of the true FDR
curve, especially for exons ranked by MATS among the
top 250 (i.e. the number of true positives in our simulated
Figure 4. Simulation study of MATS. (A) Simulated exon inclusion levels of 5000 exons in two samples. A total of 95% of the data points are
simulated from the null hypothesis ($|\psi_1 - \psi_2| \le 0.1$) and 5% are simulated from the alternative hypothesis ($|\psi_1 - \psi_2| > 0.1$). (B–F) MATS FDR
estimates on simulated data with the exon inclusion levels from (A) and the total junction count per exon and sample as 100 (B), 200 (C), 500 (D),
1000 (E) and 2000 (F). In each panel, exons are rank sorted by MATS FDR estimates in ascending order. The zoomed-in figure shows the FDR
estimates of the top 250 exons by MATS.
data set). We observed a sharp increase in the estimated
FDR when the MATS rank of differential alternative
splicing approached 250, consistent with the total
number of true positives in the simulated data.
Moreover, the number of exons with MATS FDR value
of close to 0 increased progressively with increasing
simulated junction counts (see the zoomed-in figures of
Figure 4B–F), reflecting the influence of RNA-Seq depth
on the sensitivity of detecting differential alternative
splicing events. We note that non-differentially spliced
exons constitute 95% of the simulated data points. For
such exons, the correct FDR estimates should be high
FDR values.
To further evaluate the performance of the MATS algorithm,
especially the benefit of the correlation parameter $\rho$,
we conducted another simulation study to compare
MATS with a simplified MATS Bayesian model in
which $\rho$ is fixed at 0 (i.e. independent prior), as well as
the Fisher exact test. As in the previous simulation study,
we generated 5000 data points in which 95% represented
non-differentially spliced exons and 5% represented differentially
spliced exons. To mimic the overall distribution
of junction counts in real data sets, for each simulated
exon its total junction counts (i.e. inclusion+skipping
isoform junction counts) in sample 1 and 2 respectively
were randomly sampled from the MDA-MB-231 ESRP1
data set by taking the counts of a randomly selected alternatively
spliced cassette exon in the ESRP1 and EV
samples. For each of the three methods tested, we
calculated the true positive rate and false positive rate
under sliding P-value cutoffs from 0 to 1. We then
generated the receiver operating characteristic (ROC)
curve for each method as the true positive rate versus
false positive rate plot, and calculated the area under
curve (AUC) for each method (Figure 5). MATS had
the highest AUC of 0.96, significantly better than the
simplified MATS Bayesian model with $\rho = 0$ (AUC = 0.88,
DeLong’s Test P = 6.8e-8). MATS also significantly
outperformed the Fisher exact test (AUC = 0.90,
DeLong’s Test P < 2.2e-16). Additionally, we note that
even the simplified MATS Bayesian model (with $\rho = 0$)
outperformed the Fisher exact test in the most critical
area of the ROC curve where the false positive rate was
low (Figure 5). This indicates that by testing for difference
(with a threshold) instead of equality, the test statistic is
better at separating true positives from false positives.
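The generation of simulated exon inclusion levels used in both simulation studies can be sketched as follows. This is a minimal sketch of the procedure as described in the text, not the authors' simulation code: the function names, the record layout, the rounding of counts to integers, and the choice to redraw both the sign and the size of the variation on rejection are assumptions.

```python
import random


def simulate_exon(rng, differential):
    """Return (psi1, psi2) for one exon.

    Non-differential exons get a variation from uniform(0, 0.1);
    differential exons get a variation from uniform(0.1, 1).
    The variation is redrawn until psi2 falls within [0, 1].
    """
    low, high = (0.1, 1.0) if differential else (0.0, 0.1)
    psi1 = rng.random()
    while True:
        delta = rng.uniform(low, high)
        sign = rng.choice((-1.0, 1.0))  # randomly add or subtract
        psi2 = psi1 + sign * delta
        if 0.0 <= psi2 <= 1.0:
            return psi1, psi2


def simulate_dataset(n_exons=5000, frac_differential=0.05,
                     total_count=100, seed=0):
    """Simulate inclusion levels and inclusion junction counts for n_exons."""
    rng = random.Random(seed)
    n_diff = round(n_exons * frac_differential)
    exons = []
    for i in range(n_exons):
        differential = i < n_diff
        psi1, psi2 = simulate_exon(rng, differential)
        exons.append({
            "differential": differential,
            "psi": (psi1, psi2),
            # inclusion count = inclusion level x total junction count
            "inclusion": (round(psi1 * total_count), round(psi2 * total_count)),
            "total": total_count,
        })
    return exons
```

Calling `simulate_dataset(total_count=c)` for c in 100, 200, 500, 1000 and 2000 with the same seed gives five data sets that share inclusion levels and differ only in junction counts, as in the first simulation study.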
MATS analysis of ESRP1-regulated differential
alternative splicing
To evaluate the performance of MATS on a real data set,
we used MATS to detect ESRP1-regulated differential alternative
splicing events using our RNA-Seq data on the
MDA-MB-231 cell line with ectopic expression of ESRP1
or an empty vector (EV) control. For each of the 18859
alternatively spliced cassette exons (defined as exons with
at least one inclusion read and one skipping read in these
two samples), we calculated the Bayesian posterior probability,
P-value and FDR for $|\psi_1 - \psi_2| > 0.1$. Among 240
exons with MATS FDR of <10%, all (100%) had posterior
probability of >0.85, including 239 (99.6%) and 234
(97.5%) with posterior probability of >0.9 and >0.95, respectively.
Figure 6 illustrates a differentially spliced exon
(exon 7 of SPNS1) identified by MATS. Based on the
RNA-Seq read counts we estimated an exon inclusion
level of 77% in the EV sample and 27% in the ESRP1
sample, with an FDR value of 4.6e-4 (Figure 6A). These
predictions matched RT–PCR results of SPNS1 exon 7
splicing in these two samples (Figure 6B).
To assess the overall accuracy of our FDR estimates, we
selected 164 exons covering a broad range of MATS FDR
values (Supplementary Table S1) and tested their splicing
patterns by RT–PCR. Of all the exons tested by RT–PCR,
111 exons had at least 10% difference in the exon inclusion
levels between the two samples with the direction of
change matching the RNA-Seq predictions. This yielded
an overall validation rate of 68%. To assess whether the
validation rate correlated with MATS FDR estimates,
we divided the full list of 164 exons into four cohorts according
to the estimated FDR values, and calculated the
RT–PCR validation rate for each cohort. We observed a
progressive decrease in the RT–PCR validation rate for
cohorts with increasing FDR values (Figure 7). The first
cohort had 92 exons with FDR estimates between 0 and
10%. In this cohort, 79 exons were validated by
RT–PCR as differentially spliced, yielding a high validation
rate of 86%. The second, third and fourth cohorts
corresponded to exons with FDR estimates between 10%
and 30%, between 30% and 60%, and between 60% and
100%. These three cohorts had RT–PCR validation rates
of 73%, 55% and 36%, respectively (Figure 7). These
results indicate that MATS can generate experimentally
meaningful FDR estimates to help biologists with the interpretation
of RNA-Seq predictions and the design of
follow-up experiments. There was a sharp increase in the
estimated FDR value after the initial list of top 240–406
exons (Figure 7), with ~98% of the exons having an FDR
Figure 5. Simulation study to compare MATS, a simplified MATS
Bayesian model in which $\rho$ is fixed at 0 (i.e. independent prior), and
the Fisher exact test. MATS significantly outperforms the other two
methods based on the AUC of the ROC curve (i.e. the true positive
rate versus false positive rate plot).
of ≥90%. This was similar to the shape of the FDR distribution
in the simulation study (Figure 4), probably reflecting
the number of ESRP1-regulated exons in the
human genome as well as the percentage of these that
can be detected at the current RNA-Seq depth. Of note,
among the 164 exons tested by RT–PCR, 17 had a MATS
FDR of 100%. Only 1 of the 17 exons had more than 10%
change in the RT–PCR estimated exon inclusion levels
with the direction of change matching the RNA-Seq prediction,
yielding a low validation rate of only 6% that
closely matched the estimated FDR. This indicates that
MATS can provide reliable FDR estimates for the full
range of possible FDR values.
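The per-cohort validation rates can be computed as sketched below. The function names and the `(fdr, validated)` record format are assumptions for illustration, as is the treatment of boundary values (an FDR exactly equal to an upper edge is assigned to the lower cohort); the text does not state how boundaries were handled.

```python
COHORT_EDGES = (0.10, 0.30, 0.60, 1.00)  # upper FDR bound of each cohort


def cohort_index(fdr, edges=COHORT_EDGES):
    """Return the index of the first cohort whose upper bound covers fdr."""
    for i, upper in enumerate(edges):
        if fdr <= upper:
            return i
    raise ValueError(f"FDR {fdr} is outside the range covered by the cohorts")


def validation_rates(records, edges=COHORT_EDGES):
    """records: iterable of (fdr, validated) pairs, one per RT-PCR tested exon.

    Returns the validated fraction per cohort, or None for empty cohorts.
    """
    tested = [0] * len(edges)
    validated = [0] * len(edges)
    for fdr, ok in records:
        i = cohort_index(fdr, edges)
        tested[i] += 1
        validated[i] += bool(ok)
    return [v / t if t else None for v, t in zip(validated, tested)]
```

As a check against the numbers above, 79 validated exons out of 92 in the first cohort gives 79/92 ≈ 0.86, and 111 out of 164 overall gives ≈ 0.68.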
Since our exon–exon junction database includes all
known exon–exon junctions observed in Ensembl transcripts,
as well as hypothetical exon–exon junctions
obtained by all possible pairwise fusions of exons within
genes, we can detect and analyze novel exon skipping
events not supported by existing transcript annotations.
This is important considering the prevalence of tissue-
and condition-specific alternative splicing (32). Of all
18859 alternatively spliced cassette exons, 5373 represent
known events and 13486 represent novel events. Of the
240 significant events with MATS FDR < 10%, 140 represent
known events while 100 represent novel events. The
percentage of events called as differentially spliced is significantly
higher for known events (2.6%; 140/5373) than
for novel events (0.7%; 100/13486). Interestingly, of the 92
RT–PCR tested cassette exons with FDR < 10%, the RT–PCR
validation rate of differential alternative splicing is
higher for novel events (98%; 43/44) than for known
events (75%; 36/48).
The MATS algorithm can be naturally extended to
other types of alternative splicing events such as alternative
5′ or 3′ splice sites and mutually exclusive exon usage.
In the analysis of alternative 5′ or 3′ splice sites, the counts
of junction reads that uniquely support the two competing
splice sites can serve as the input for MATS. These counts
can be used to estimate the ‘inclusion level’ of any specific
splice site. All subsequent analysis steps are identical to
the analysis of differential exon skipping. To illustrate
this, we applied MATS to 1571 alternative 5′ splice site
events and 2383 alternative 3′ splice site events in the
ESRP1 data set. With an FDR cutoff of 5%, 13 events (9
alternative 5′ splice sites and 4 alternative 3′ splice sites)
were identified to undergo significant differential splicing
(>10% change) between the ESRP1 and EV samples. The
small number of detected differential alternative 5′ or 3′
splice sites is consistent with an early study using the
Affymetrix exon 1.0 array (33). One possible explanation
is that ESRP1 regulates a small number of such events in
the transcriptome. Nonetheless, we tested five events by
RT–PCR, of which three were validated to have at least
10% change in splice site inclusion level (Supplementary
Table S2). Supplementary Figure S2 shows the example of
a validated differential alternative 5′ splice site event in
exon 4 of HNRNPH3.
MATS analysis of switch-like alternative splicing between
brain and 15 other tissues
MATS has the flexibility to detect exons with the extreme
‘switch-like’ differential alternative splicing pattern
Figure 6. RNA-Seq and RT–PCR analysis of SPNS1 exon 7 splicing. (A) RNA-Seq junction counts and MATS result of SPNS1 exon 7 in the EV
and ESRP1 samples. (B) RT–PCR result of SPNS1 exon 7 in the EV and ESRP1 samples.
Figure 7. RT–PCR validation of 164 exons covering a broad range of
MATS FDR values. All exons analyzed by MATS are rank sorted by
FDR estimates (y-axis) in ascending order. The 164 exons tested by
RT–PCR are divided into four non-overlapping cohorts according to
the FDR estimates. The validation rate for each cohort is shown.
(see ‘Materials and Methods’ section). To illustrate this
function, we used MATS to detect switch-like differential
alternative splicing events between the brain and each of
the 15 other tissues in the Illumina Human Body Map 2.0
data set. For each pairwise comparison, we tested if the
exon inclusion level of an exon switches from less than 1/3
in one tissue to more than 2/3 in the other tissue, i.e.
$(\psi_1 < 1/3 \text{ and } \psi_2 > 2/3)$ or $(\psi_1 > 2/3 \text{ and } \psi_2 < 1/3)$.
With an FDR cutoff of <50%, a total of 229 exons were
identified to have the switch-like differential alternative
splicing pattern between the brain and at least one other
tissue. Prior studies have revealed sequence features of
such ‘tissue-switched’ cassette exons characteristic of functional
alternative splicing events (3,22). A unique feature
of tissue-switched exons is that they are much more likely
to be exact multiples of 3 nt in length, thus alternative
splicing adds or removes a modular peptide segment of
the final protein product while preserving the downstream
open reading frame (i.e. ‘frame-preserving’, as opposed
to ‘frame-switching’ for exons not exact multiples of 3 nt
in length). Consistent with these findings, of the 229
switch-like exons detected by MATS between brain and
15 other tissues, 70% are frame-preserving, compared to
42% for other alternatively spliced cassette exons without
switch-like differential alternative splicing (P = 0; see
Supplementary Figure S3).
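The switch-like region and the frame-preserving property can each be written as a one-line predicate. The function names are assumptions for illustration. Note that MATS evaluates the posterior probability that the inclusion levels fall in this region; the predicate below only tests a single pair of inclusion levels.

```python
def is_switch_like(psi1, psi2, s=1 / 3):
    """True if inclusion switches from below s in one sample
    to above 1 - s in the other."""
    return (psi1 < s and psi2 > 1 - s) or (psi1 > 1 - s and psi2 < s)


def is_frame_preserving(exon_length_nt):
    """True if skipping or including the exon keeps the downstream
    reading frame, i.e. the exon length is an exact multiple of 3 nt."""
    return exon_length_nt % 3 == 0
```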
DISCUSSION
We present MATS, a new method to detect differential
alternative splicing events from RNA-Seq data. A major
advantage of MATS over existing methods is that it allows
flexible hypothesis testing of differential alternative
splicing patterns. Most of the published work attempted
to identify differential alternative splicing events by testing
the equality of the exon inclusion levels between samples
(11,15,16,18–20). Some also applied a secondary filter
(without statistical testing) on the change in the estimated
exon inclusion levels (20). MATS provides a Bayesian
statistical framework to directly test the hypothesis and
evaluate the statistical significance that the absolute difference
in exon inclusion levels between two samples
exceeds any user-defined threshold. This allows researchers
to select the magnitude of splicing changes suitable
for specific research goals in a rigorous statistical setting.
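The idea of testing a thresholded difference directly, rather than equality, can be illustrated with a deliberately simplified model. The sketch below is not the MATS model: it assumes independent uniform priors (no correlation parameter), treats inclusion and skipping counts as a plain binomial sample of the inclusion level, and estimates the posterior probability by Monte Carlo sampling from the resulting Beta posteriors. The function name and the example counts are hypothetical.

```python
import random


def posterior_prob_diff(inc1, skip1, inc2, skip2, c=0.1, draws=20000, seed=0):
    """Monte Carlo estimate of P(|psi1 - psi2| > c | counts).

    Simplifying assumption: with a uniform prior and binomial counts,
    the posterior of each inclusion level is Beta(inc + 1, skip + 1),
    and the two samples are treated as independent.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        psi1 = rng.betavariate(inc1 + 1, skip1 + 1)
        psi2 = rng.betavariate(inc2 + 1, skip2 + 1)
        if abs(psi1 - psi2) > c:
            hits += 1
    return hits / draws
```

With hypothetical counts of 77 inclusion and 23 skipping reads in one sample versus 27 and 73 in the other, the estimated probability is close to 1, whereas identical counts in both samples give a low probability. Changing `c` changes the hypothesis being tested without changing the procedure.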
We assessed the performance of MATS using simulated
data and real RNA-Seq data sets. In the RNA-Seq
analysis of ESRP1-regulated alternative splicing, we
obtained a high RT–PCR validation rate of 86% for candidate
exons with MATS FDR < 10%. Additionally, over
the full list of RT–PCR tested exons, the MATS FDR
estimates matched well with the experimental validation
rate (Figure 7). The MATS framework is also applicable
to other null hypotheses of interest. For example,
MATS can be used to test the hypothesis that an exon
exhibits the extreme ‘switch-like’ differential alternative
splicing pattern.
A novel feature of MATS is the multivariate uniform
prior that models the between-sample correlation in exon
splicing patterns. In both the ESRP data and the Human
Body Map 2.0 data, the degree of correlation between
samples is high, resulting in a high estimated value of
the correlation parameter $\rho$ for the multivariate prior
model. The high between-sample correlation observed in
real RNA-Seq data is consistent with the mechanism of
splicing regulation in eukaryotic cells. Splicing is a
complex process mediated by extensive interactions
among cis regulatory elements and trans acting regulators
(34). Most splicing regulators may control the splicing of
up to several hundred exons via sequence-specific protein–RNA
interactions (35). Perturbing a specific component of
the splicing regulatory pathway usually changes the
splicing activity of a small subset of alternatively spliced
exons, while the majority of alternatively spliced exons
remain unaffected. We also note that when no correlation
exists in the data (see Figure 3 for the simulated data), the
estimate of $\rho$ by the MCMC procedure is close to zero.
Taken together, our analysis indicates that the multivariate
uniform prior model is flexible enough to accommodate
different degrees of between-sample correlation
in the RNA-Seq data. Moreover, our simulation study
(Figure 5) indicates that incorporating the correlation parameter
$\rho$ in the MATS model improves the performance
of the algorithm.
The MATS software and documentation as well as
the raw ESRP1 RNA-Seq data are freely available
at http://intron.healthcare.uiowa.edu/MATS/. The scripts
provided online include the core MATS program to calculate
the posterior probability, P-value and FDR of differential
alternative splicing from the input junction
counts, as well as accessory scripts to generate junction
counts and detect alternative splicing events from raw
RNA-Seq data. To facilitate data analysis, we also
provide databases of pre-compiled exon–exon junctions
based on the Ensembl (28) or UCSC Known Genes (36)
annotations. We suggest that MATS can be used either as
stand-alone software for differential alternative splicing
analysis of RNA-Seq data, or as part of an existing
RNA-Seq analysis pipeline to calculate the statistical significance
of a user-defined differential splicing pattern
using junction counts generated by other mapping procedures
or transcript annotation databases. We note that
a number of recent studies have reported biases in
RNA-Seq data such as the non-uniform distribution of
RNA-Seq reads along mRNA transcripts, and have
proposed methods to adjust raw RNA-Seq read counts
by correcting for such biases (18,37–41). Additionally, it
has been demonstrated that in paired-end RNA-Seq, the
distribution of insert size between the two ends can be
utilized to improve the assignment of reads to specific
transcript isoforms (16). Thus, it is possible to use an appropriate
method to adjust raw RNA-Seq read counts and
refine the counts of isoform-specific junctions, prior to the
hypothesis testing of differential alternative splicing by
MATS. It should also be noted that although the
analysis in this manuscript is mostly focused on exon
skipping events (i.e. differential inclusion/skipping of an
entire exon), exon skipping is only one subtype of alternative
splicing events. The MATS algorithm can be
readily applied to junction counts generated for other
types of alternative splicing events, as illustrated for
alternative 5′ or 3′ splice sites on the ESRP1 data set
(Supplementary Figure S2).
MATS currently performs two-group comparison of
two samples, with one sample per group without
within-group replicates. This is the typical experimental
setup in most published RNA-Seq studies of alternative
splicing including studies of splicing regulators (16,20,42),
largely due to the high cost of RNA-Seq to achieve sufficient
depth for splicing analysis. However, as the cost of
high-throughput sequencing continues to decline, we anticipate
that it will soon become feasible and common for
researchers to incorporate biological replicates in
RNA-Seq studies of alternative splicing (43). Medical researchers
may soon be able to generate RNA-Seq data
across a large number of healthy and diseased specimens,
with the depth sufficient for quantifying splicing in each
individual sample. Thus, an important future direction is
to extend the statistical framework to incorporate the use
of RNA-Seq replicates in detecting differential alternative
splicing.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online:
Supplementary Figures S1–S3 and Supplementary Tables
S1–S2.
ACKNOWLEDGEMENTS
We wish to thank Keyan Zhao for comments, and Collin
Tokheim for technical assistance. We thank Gary Schroth
for early access to the Illumina Human Body Map 2.0
data. We thank David Eichmann, Ben Rogers and the
University of Iowa Institute for Clinical and
Translational Science (NIH grant UL1RR024979) for
computer support.
FUNDING
National Institutes of Health (R01GM088342 and
P30DK054759 to Y.X., R01GM088809 to R.P.C.,
R01CA120988 to J.H.); National Science Foundation
(DMS0805491, Career award DMS1055286 to Q.Z.); a
junior faculty grant from the Edward Mallinckrodt Jr
Foundation (to Y.X.); National Institutes of Health T32
postdoctoral fellow training grant (T32HL007638 to
J.W.P.). Funding for open access charge: NIH
R01GM088342.
Conflict of interest statement. None declared.
REFERENCES
1. Keren,H., Lev-Maor,G. and Ast,G. (2010) Alternative splicing
and evolution: diversification, exon definition and function.
Nat. Rev. Genet., 11, 345–355.
2. Graveley,B.R. (2001) Alternative splicing: increasing diversity in
the proteomic world. Trends Genet., 17, 100–107.
3. Wang,E.T., Sandberg,R., Luo,S., Khrebtukova,I., Zhang,L.,
Mayr,C., Kingsmore,S.F., Schroth,G.P. and Burge,C.B. (2008)
Alternative isoform regulation in human tissue transcriptomes.
Nature, 456, 470–476.
4. Pan,Q., Shai,O., Lee,L.J., Frey,B.J. and Blencowe,B.J. (2008)
Deep surveying of alternative splicing complexity in the human
transcriptome by high-throughput sequencing. Nat. Genet., 40,
1413–1415.
5. Cooper,T.A., Wan,L. and Dreyfuss,G. (2009) RNA and disease.
Cell, 136, 777–793.
6. Heyd,F. and Lynch,K.W. (2011) DEGRADE, MOVE,
REGROUP: signaling control of splicing proteins. Trends
Biochem. Sci., 36, 397–404.
7. Buchner,D.A., Trudeau,M. and Meisler,M.H. (2003) SCNM1,
a putative RNA splicing factor that modifies disease severity in
mice. Science, 301, 967–969.
8. Ingram,E.M. and Spillantini,M.G. (2002) Tau gene mutations:
dissecting the pathogenesis of FTDP-17. Trends Mol. Med., 8,
555–562.
9. Modrek,B. and Lee,C. (2002) A genomic view of alternative
splicing. Nat. Genet., 30, 13–19.
10. Xu,Q., Modrek,B. and Lee,C. (2002) Genome-wide detection of
tissue-specific alternative splicing in the human transcriptome.
Nucleic Acids Res., 30, 3754–3766.
11. Xu,Q. and Lee,C. (2003) Discovery of novel splice forms and
functional analysis of cancer-specific alternative splicing in human
expressed sequences. Nucleic Acids Res., 31, 5635–5643.
12. Wang,Z., Lo,H.S., Yang,H., Gere,S., Hu,Y., Buetow,K.H. and
Lee,M.P. (2003) Computational analysis and experimental
validation of tumor-associated alternative RNA splicing in human
cancer. Cancer Res., 63, 655–657.
13. Gupta,S., Zink,D., Korn,B., Vingron,M. and Haas,S.A. (2004)
Strengths and weaknesses of EST-based prediction of
tissue-specific alternative splicing. BMC Genomics, 5, 72.
14. Wang,Z., Gerstein,M. and Snyder,M. (2009) RNA-Seq: a
revolutionary tool for transcriptomics. Nat. Rev. Genet., 10,
57–63.
15. Griffith,M., Griffith,O.L., Mwenifumbo,J., Goya,R.,
Morrissy,A.S., Morin,R.D., Corbett,R., Tang,M.J., Hou,Y.C.,
Pugh,T.J. et al. (2010) Alternative expression analysis by RNA
sequencing. Nat. Methods, 7, 843–847.
16. Katz,Y., Wang,E.T., Airoldi,E.M. and Burge,C.B. (2010)
Analysis and design of RNA sequencing experiments for
identifying isoform regulation. Nat. Methods, 7, 1009–1015.
17. Shen,S., Lin,L., Cai,J.J., Jiang,P., Kenkel,E.J., Stroik,M.R.,
Sato,S., Davidson,B.L. and Xing,Y. (2011) Widespread
establishment and regulatory impact of Alu exons in human
genes. Proc. Natl Acad. Sci. USA, 108, 2837–2842.
18. Srivastava,S. and Chen,L. (2010) A two-parameter generalized
Poisson model to improve the analysis of RNA-seq data.
Nucleic Acids Res., 38, e170.
19. Lalonde,E., Ha,K.C., Wang,Z., Bemmo,A., Kleinman,C.L.,
Kwan,T., Pastinen,T. and Majewski,J. (2011) RNA sequencing
reveals the role of splicing polymorphisms in regulating human
gene expression. Genome Res., 21, 545–554.
20. Brooks,A.N., Yang,L., Duff,M.O., Hansen,K.D., Park,J.W.,
Dudoit,S., Brenner,S.E. and Graveley,B.R. (2011) Conservation
of an RNA regulatory map between Drosophila and mammals.
Genome Res., 21, 193–202.
21. Xing,Y., Yu,T., Wu,Y.N., Roy,M., Kim,J. and Lee,C. (2006)
An expectation-maximization algorithm for probabilistic
reconstructions of full-length isoforms from splice graphs.
Nucleic Acids Res., 34, 3150–3160.
22. Xing,Y. and Lee,C.J. (2005) Protein modularity of alternatively
spliced exons is associated with tissue-specific regulation of
alternative splicing. PLoS Genet., 1, e34.
23. Warzecha,C.C., Jiang,P., Amirikian,K., Dittmar,K.A., Lu,H.,
Shen,S., Guo,W., Xing,Y. and Carstens,R.P. (2010) An
ESRP-regulated splicing programme is abrogated during the
epithelial-mesenchymal transition. EMBO J., 29, 3286–3300.
24. Benjamini,Y. and Hochberg,Y. (1995) Controlling the false
discovery rate: a practical and powerful approach to multiple
testing. J. R. Stat. Soc. Ser. B Methodol., 57, 289–300.
25. Modrek,B. and Lee,C. (2003) Alternative splicing in the human,
mouse and rat genomes is associated with an increased rate of
exon creation/loss. Nat. Genet., 34, 177–180.
26. Zhu,C.Y., Byrd,R.H., Lu,P.H. and Nocedal,J. (1997) Algorithm
778: L-BFGS-B: Fortran subroutines for large-scale
bound-constrained optimization. ACM Trans. Math. Software, 23,
550–560.
27. Byrd,R.H., Lu,P.H., Nocedal,J. and Zhu,C.Y. (1995) A limited
memory algorithm for bound constrained optimization. SIAM J.
Sci. Comput., 16, 1190–1208.
28. Flicek,P., Amode,M.R., Barrell,D., Beal,K., Brent,S., Chen,Y.,
Clapham,P., Coates,G., Fairley,S., Fitzgerald,S. et al. (2011)
Ensembl 2011. Nucleic Acids Res., 39, D800–D806.
29. Warzecha,C.C., Sato,T.K., Nabet,B., Hogenesch,J.B. and
Carstens,R.P. (2009) ESRP1 and ESRP2 are epithelial cell-type-
specific regulators of FGFR2 splicing. Mol. Cell, 33, 591–601.
30. Langmead,B., Trapnell,C., Pop,M. and Salzberg,S.L. (2009)
Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biol., 10, R25.
31. Lu,Z.X., Jiang,P., Cai,J.J. and Xing,Y. (2011) Context-dependent
robustness to 5′ splice site polymorphisms in human populations.
Hum. Mol. Genet., 20, 1084–1096.
32. Kalsotra,A. and Cooper,T.A. (2011) Functional consequences of
developmentally regulated alternative splicing. Nat. Rev. Genet.,
12, 715–729.
33. Warzecha,C.C., Shen,S., Xing,Y. and Carstens,R.P. (2009) The
epithelial splicing factors ESRP1 and ESRP2 positively and
negatively regulate diverse types of alternative splicing events.
RNA Biol., 6, 546–562.
34. Wang,Z. and Burge,C.B. (2008) Splicing regulation: from a parts
list of regulatory elements to an integrated splicing code. RNA,
14, 802–813.
35. Chen,M. and Manley,J.L. (2009) Mechanisms of alternative
splicing regulation: insights from molecular and genomics
approaches. Nat. Rev. Mol. Cell Biol., 10, 741–754.
36. Hsu,F., Kent,W.J., Clawson,H., Kuhn,R.M., Diekhans,M. and
Haussler,D. (2006) The UCSC known genes. Bioinformatics, 22,
1036–1046.
37. Hansen,K.D., Brenner,S.E. and Dudoit,S. (2010) Biases in
Illumina transcriptome sequencing caused by random hexamer
priming. Nucleic Acids Res., 38, e131.
38. Schwartz,S., Oren,R. and Ast,G. (2011) Detection and removal of
biases in the analysis of next-generation sequencing reads. PLoS
ONE, 6, e16685.
39. Li,J., Jiang,H. and Wong,W.H. (2010) Modeling non-uniformity
in short-read rates in RNA-Seq data. Genome Biol., 11, R50.
40. Roberts,A., Trapnell,C., Donaghey,J., Rinn,J.L. and Pachter,L.
(2011) Improving RNA-Seq expression estimates by correcting for
fragment bias. Genome Biol., 12, R22.
41. Bohnert,R. and Rätsch,G. (2010) rQuant.web: a tool for
RNA-Seq-based transcript quantitation. Nucleic Acids Res., 38,
W348–W351.
42. Luco,R.F., Pan,Q., Tominaga,K., Blencowe,B.J., Pereira-
Smith,O.M. and Misteli,T. (2010) Regulation of
alternative splicing by histone modifications. Science, 327,
996–1000.
43. Hansen,K.D., Wu,Z., Irizarry,R.A. and Leek,J.T. (2011)
Sequencing technology does not eliminate biological variability.
Nat. Biotechnol., 29, 572–573.