Large-scale identification and characterization of alternative splicing variants of human gene transcripts using 56 419 completely sequenced and manually annotated full-length cDNAs
Jun-ichi Takeda1,2, Yutaka Suzuki3, Mitsuteru Nakao4,5, Roberto A. Barrero6,
Kanako O. Koyanagi7, Lihua Jin6, Chie Motono4, Hiroko Hata3, Takao Isogai8,9,
Keiichi Nagai9,10, Tetsuji Otsuki9, Vladimir Kuryshev11, Masafumi Shionyu12,
Kei Yura13,14, Mitiko Go11,15, Jean Thierry-Mieg16,17, Danielle Thierry-Mieg16,17,
Stefan Wiemann11, Nobuo Nomura2, Sumio Sugano3, Takashi Gojobori2,6
and Tadashi Imanishi2,7,*
1Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics Consortium,
AIST Bio-IT Research Building, Aomi 2-42, Koto-ku, Tokyo 135-0064, Japan,2Biological Information Research Center,
National Institute of Advanced Industrial Science and Technology, AIST Bio-IT Research Building, Aomi 2-42,
Koto-ku, Tokyo 135-0064, Japan,3Department of Medical Genome Sciences, Graduate School of Frontier Sciences,
the University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562, Japan,4Computational Biology Research
Center, National Institute of Advanced Science and Technology, AIST Bio-IT Research Building, Aomi 2-42,
Koto-ku, Tokyo 135-0064, Japan,5Kazusa DNA Research Institute, 2-6-7 Kazusa-Kamatari, Kisarazu, Chiba
292-0818, Japan,6Center for Information Biology and DDBJ, National Institute of Genetics, 1111 Yata, Mishima,
Shizuoka 411-8540, Japan,7Graduate School of Information Science and Technology, Hokkaido University, North 14,
West 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan,8Reverse Proteomics Research Institute,
2-6-7 Kazusa-Kamatari, Kisarazu, Chiba 292-0818, Japan,9Helix Research Institute, Inc. 1532-3, Yana, Kisarazu,
Chiba 292-0812, Japan,10Central Research Laboratory, Hitachi, Ltd, 1-280, Higashi-koigakubo, Kokubunji-shi,
Tokyo 185-8601, Japan,11Division of Molecular Genome Analysis, German Cancer Research Center, Im
Neuenheimer Feld 580, D-69120 Heidelberg, Germany,12Faculty of Bio-Science, Nagahama Institute of Bio-Science
and Technology, 1266 Tamura-cho, Nagahama, Shiga 526-0829, Japan,13Quantum Bioinformatics Team, Center for
Computational Science and Engineering, Japan Atomic Energy Agency, 8-1 Umemidai, Kizu, Souraku, Kyoto
619-0215, Japan,14Core Research for Evolution Science and Technology, Japan Science and Technology Agency,
Japan,15Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan,16National Center for
Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA and
17Centre National de la Recherche Scientifique, Laboratoire de Physique Mathematique, Montpellier, France
Received April 22, 2006; Revised and Accepted July 3, 2006
ABSTRACT
We report the first genome-wide identification and
characterization of alternative splicing in human
gene transcripts based on analysis of the full-length
cDNAs. Applying both manual and computational
analyses for 56419 completely sequenced and
precisely annotated full-length cDNAs selected for
the H-Invitational human transcriptome annotation
meetings, we identified 6877 alternative splicing
genes with 18 297 different alternative splicing
variants. A total of 37 670 exons were involved in
these alternative splicing events. The encoded
protein sequences were affected in 6005 of the
6877 genes. Notably, alternative splicing affected
protein motifs in 3015 genes, subcellular localiza-
tions in 2982 genes and transmembrane domains in
1348 genes. We also identified interesting patterns
of alternative splicing, in which two distinct genes
seemed to be bridged, nested or having overlapping
protein coding sequences (CDSs) of different reading
frames (multiple CDS). In these cases, completely
unrelated proteins are encoded by a single locus.
Genome-wide annotations of alternative splicing,
relying on full-length cDNAs, should lay firm
groundwork for exploring in detail the diversification
of protein function, which is mediated by the
fast expanding universe of alternative splicing
variants.
*To whom correspondence should be addressed. Tel: +81 3 3599 8800; Fax: +81 3 3599 8801; Email: imanishi@jbirc.aist.go.jp
© 2006 The Author(s).
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Published online 12 August 2006 Nucleic Acids Research, 2006, Vol. 34, No. 14 3917–3928
doi:10.1093/nar/gkl507
INTRODUCTION
Alternative splicing is a phenomenon in which different com-
binations of exons are spliced to produce distinct transcripts
(1). Especially in higher eukaryotes, alternative splicing is
frequently used as a versatile means of producing diverse
transcripts from a single gene locus. The alterations of the exons
may cause changes in the encoded amino acid sequences and,
at least in some cases, produce functionally divergent
proteins by modifying, for example, the binding site of a growth
factor receptor or an activation site of a transcription factor
(2,3). The most striking example of this is the Drosophila
DSCAM gene, which is an axon guidance receptor gene.
This gene consists of 17 exons, with 12, 48, 44 and 2
mutually exclusive alternative splicing exons for exons 4, 6,
9 and 17, respectively. Thus, even if not all combinations of
these exons are allowed, this gene can encode thousands of
protein products, which should enable functional diversifica-
tion of the protein that could in turn assure precise axonal
trajectory (4,5). Since the initial draft sequence of the
human genome revealed that there seem to be an unexpec-
tedly small number of genes embedded in the human genome
(6), it has been hypothesized that alternative splicing is one of
the most significant processes giving rise to the functional
complexity of the human genome and that it might be indis-
pensable for generating highly complex organisms such as
humans (7).
However, despite the growing interest in the impact of
alternative splicing on various aspects of biological
processes, our understanding of alternative splicing is still
very primitive and its mechanisms of control are mostly
unknown (8). In order to advance our understanding of the
biological significance of alternative splicing in humans, it
is essential to identify and characterize the genes that are sub-
ject to alternative splicing and which splicing patterns are
used in what context in a genome-wide manner.
For this reason, large-scale attempts to identify alternative
splicing have been initiated by several groups [e.g. (9)],
mainly using bioinformatics analysis of partially sequenced
cDNAs (ESTs). The EST sequences are clustered and compared
to evaluate whether the differences in their sequences might
have resulted from alternative splicing. So far, millions
of human ESTs have been analyzed, and the newly identified
alternative splicing variants are presented in several
databases, such as AceView (130576 alternative splicing
variants from 19557 genes; http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/;
Y. Kohara et al., in preparation),
ASAP [30793 alternative splicing variants from 7991
genes; http://www.bioinformatics.ucla.edu/ASAP/; (10)] and
ASD [73340 alternative splicing variants from 16236
genes; http://www.ebi.ac.uk/asd/; (11)]. However, these
recent high-throughput approaches have limitations. The
first is the skewed coverage of the ESTs over the mRNAs.
Since the ESTs are generally scarce around the 5′ ends of
an mRNA, previous data could have been biased towards
the 3′ ends. Secondly, some combinations of the alternative
splicing exons may not be allowed. In those cases, analysis
of only partially sequenced cDNAs would have no chance
to identify mutual dependence of the combinations in altern-
ative splicing.
Full-length cDNAs provide an ideal solution to all of these
limitations. Moreover, full-length cDNAs not only provide
complete cDNA sequence information, but are also useful as
physical reagents indispensable for experimental analysis
to examine the functional consequences of the identified
alternative splicing. In this
study, we used 56419 cDNA sequences of human genes
selected for the H-Invitational human transcriptome annota-
tion meetings (12). These cDNAs were enriched for full-
length cDNAs using various methods (13–16). In addition,
they were fully sequenced with a sequence reliability higher
than 99% [Phred values greater than 30; (17)]. Also, poten-
tially problematic sequences, such as vectors and poly(A)
tails were precisely trimmed. Thus, this cDNA collection
constitutes an outstanding resource for comprehensive studies
of alternative splicing. Out of the 56419 cDNAs, 55036 were
successfully mapped onto the human genomic sequence
(UCSC hg16; http://hgdownload.cse.ucsc.edu/downloads.
html#human) and clustered into 24425 loci. Of these,
10127 loci contained two or more cDNAs and both manual
and computational inspection allowed us to identify 18297
alternative splicing variants encoded in 6877 loci [Table 1;
(12)]. General statistics and part of the related information
have been published as part of the H-Invitational paper (12).
In this paper, we describe further detailed features of altern-
ative splicing. Here, we report the large-scale genome-wide
identification and analysis of human alternative splicing
based on fully sequenced and precisely annotated full-length
cDNAs.
MATERIALS AND METHODS
Dataset description
In the present study, the set of 56 419 cDNAs selected for
the H-Invitational human transcriptome annotation meetings
was used. Contributors and attributes of each of the cDNA
data subsets are described in Supplementary Table 1, the
reference (12) and further references therein.
Computational procedures to identify and characterize
the alternative splicing variants
We mapped the full-length cDNAs to the human genome
(UCSC hg16; http://hgdownload.cse.ucsc.edu/downloads.
html#human). Alignments were generated using EST2GEN-
OME (http://emboss.sourceforge.net/apps/est2genome.html);
alignments having at least 95% identity and 90% coverage
were selected. The cDNAs mapped on the same genomic
region (at least one base overlap) were clustered and regarded
as a putative ‘locus’. For further details of the mapping and
clustering, see Ref. (12).
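The selection and clustering steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `Alignment` record, its field names and the choice to ignore strand are assumptions made for the example; only the 95% identity and 90% coverage thresholds and the one-base overlap rule come from the text.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one cDNA-to-genome alignment."""
    cdna_id: str
    chrom: str
    start: int       # 1-based, inclusive genomic start
    end: int         # 1-based, inclusive genomic end
    identity: float  # fraction of identical aligned bases
    coverage: float  # fraction of the cDNA covered by the alignment


def filter_alignments(alignments, min_identity=0.95, min_coverage=0.90):
    """Keep alignments meeting the identity and coverage thresholds."""
    return [a for a in alignments
            if a.identity >= min_identity and a.coverage >= min_coverage]


def cluster_into_loci(alignments):
    """Group alignments that overlap by at least one base on the same chromosome.

    Assumption: strand is ignored because the text does not specify it.
    """
    loci = []
    ordered = sorted(alignments, key=lambda a: (a.chrom, a.start, a.end))
    for _, group in groupby(ordered, key=lambda a: a.chrom):
        current, current_end = [], None
        for aln in group:
            if current and aln.start <= current_end:
                # Overlaps the running cluster by one base or more.
                current.append(aln)
                current_end = max(current_end, aln.end)
            else:
                if current:
                    loci.append(current)
                current, current_end = [aln], aln.end
        if current:
            loci.append(current)
    return loci
```

Each returned list corresponds to one putative locus; loci containing a single cDNA would be set aside before looking for alternative splicing.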
Since this study focuses on complete transcript variants, we
conducted a sequence inspection of the mapped cDNAs to
identify and discard 5′ end-truncated cDNAs from the dataset.
To this end, we excluded cDNAs whose 5′ ends were
located inside the second or later exons of any other
cDNAs with compatible exon structure in the same locus.
We accepted the cDNAs whose 5′ ends were located inside
the first exons and considered them as variations in the exact
transcriptional start sites. We also assumed that those
cDNAs whose 5′ ends were located outside of the exonic
regions of any other clones could not be truncated forms of
any known types of transcripts. This assumption is
based on the fact that a combination of multiple errors,
for example, truncation followed by erroneous oligo-capping
that occurred on an immature form, would be required to
erroneously generate such cDNAs [for further detailed
discussion of this subject, see Ref. (18)]. Using similar concepts,
we examined the completeness of the 3′ ends using the same
procedure and similarly removed all possible 3′ end-truncated
cDNAs. The 3′ ends located inside the last exons were
allowed and considered as alternative polyadenylation sites.
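The 5′ end rule above can be illustrated with a short sketch. It is a simplification under stated assumptions: transcripts are taken to lie on the plus strand, exons are given as inclusive `(start, end)` pairs in 5′ to 3′ order, and the check for "compatible exon structure" mentioned in the text is omitted. The function name and the status labels are hypothetical.

```python
def five_prime_status(query, others):
    """Classify the 5' end of `query` against the other cDNAs of a locus.

    query:  list of (start, end) exons, 5'->3', plus strand (assumption).
    others: list of such exon lists for the other cDNAs in the locus.
    Note: the exon-structure compatibility check from the text is not modelled.
    """
    five_prime = query[0][0]
    status = "outside_exons"  # cannot be a truncated form of a known transcript
    for exons in others:
        for index, (start, end) in enumerate(exons):
            if start <= five_prime <= end:
                if index >= 1:
                    # Starts inside a second or later exon: treated as truncated.
                    return "truncated"
                # Starts inside a first exon: transcription start site variation.
                status = "within_first_exon"
    return status
```

The same logic, mirrored to the last exon and the 3′ coordinate, would cover the 3′ end check.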
Computational identification of the alternative splicing
variants was then performed using the resulting filtered set
of full-length cDNAs as follows: (i) the genomic position
of each exon–intron boundary was compared with those of
the other transcripts belonging to the same locus, with a
10 bp allowance for the comparison; (ii) if a cDNA
had a part of the exonic sequence of its first/last exon inside
confirmed intronic regions of the other cDNAs, it was
regarded as a ‘5′/3′ end’ alternative splicing variant;
(iii) if a cDNA had a part of an internal exonic sequence
inside a confirmed intronic region of other cDNAs, it was
recognized as an ‘internal’ alternative splicing variant
(for a schematic representation, see Figure 1).
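Rules (i) to (iii) can be sketched as a pairwise comparison of exon structures. The code below is one possible reading, not the authors' implementation: it assumes plus-strand transcripts with inclusive coordinates and interprets the 10 bp allowance as "exonic sequence must extend more than 10 bp into another cDNA's intron". All function names are hypothetical.

```python
TOLERANCE = 10  # bp allowance for boundary comparison, taken from the text


def introns(exons):
    """Introns implied by an ordered exon list, as inclusive (start, end) pairs."""
    return [(exons[i][1] + 1, exons[i + 1][0] - 1) for i in range(len(exons) - 1)]


def overlap(a, b):
    """Number of bases shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def classify_variant(query, other, tolerance=TOLERANCE):
    """Return the alternative splicing categories of `query` relative to `other`.

    An exon counts as alternative when more than `tolerance` of its bases fall
    inside an intron of `other` (an assumed interpretation of the allowance).
    """
    labels = set()
    last = len(query) - 1
    other_introns = introns(other)
    for index, exon in enumerate(query):
        if any(overlap(exon, intron) > tolerance for intron in other_introns):
            if index == 0:
                labels.add("5' end")
            elif index == last:
                labels.add("3' end")
            else:
                labels.add("internal")
    return labels
```

In a full analysis this comparison would be run for every pair of cDNAs within a locus, and an empty result would mean that the two cDNAs are not alternative splicing variants of each other under these rules.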
In order to evaluate and characterize the outcomes of the
identified alternative splicing events on the encoded protein
sequences, we used the information of the ORFs (i.e. posi-
tions and reading frames) annotated during the H-Invitational
meetings. Differences in the length of the ORFs were evalu-
ated in a pair-wise manner between all alternative splicing
variants within the locus and the average ORF length differ-
ence was calculated for each locus. Possible targets for
nonsense-mediated decay were selected as variants in which
the stop codon mapped more than 50 bp upstream of the last
exon junctions. For the detection of Alu-like elements,
RepeatMasker was run with default settings and for the detec-
tion of exonic splice enhancers (ESEs) (19), the RESCUE-
ESE program was run as described previously (20).
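The ORF-length comparison and the nonsense-mediated decay criterion lend themselves to a small worked sketch. The functions below are illustrative only: positions are assumed to be 1-based transcript coordinates, `stop_end` is assumed to be the last base of the stop codon, and the function names are hypothetical. The 50 bp threshold is the one stated in the text.

```python
from itertools import combinations
from statistics import mean


def mean_orf_length_difference(orf_lengths):
    """Average absolute ORF length difference over all variant pairs of a locus."""
    pairs = list(combinations(orf_lengths, 2))
    return mean(abs(a - b) for a, b in pairs) if pairs else 0.0


def is_nmd_candidate(exon_lengths, stop_end, min_distance=50):
    """Flag a variant whose stop codon lies more than 50 bp upstream of the
    last exon junction.

    exon_lengths: exon lengths in 5'->3' order.
    stop_end:     transcript coordinate of the last base of the stop codon
                  (an assumption about how the position is measured).
    """
    if len(exon_lengths) < 2:
        return False  # no exon junction, so the rule cannot apply
    last_junction = sum(exon_lengths[:-1])  # last base before the final exon
    return last_junction - stop_end > min_distance
```

For example, with exons of 200, 150 and 300 bases the last junction follows base 350, so a stop codon ending at base 290 is flagged while one ending at base 300 is not.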
Based on the deduced amino acid sequences for alternative
splicing variants, protein motifs and Gene Ontology (GO)
terms were predicted using InterProScan (http://www.ebi.ac.
uk/interpro/) with default parameters. GO terms were auto-
matically added to each of the variants when a protein
motif(s) was recognized and could be associated with func-
tional annotation. Enrichments of the motifs and GO terms
were statistically evaluated using a hypergeometric distribution
by using the following equation:
$$P = \sum_{x=k}^{\min(M,\,n)} \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$
Here, N = 24 425 (number of loci containing successfully
mapped cDNAs), n = 6877 (number of loci containing alternative
splicing variants), M = 12 764 (number of motif-containing
loci), k = 5523 (number of motif-containing loci
with alternative splicing variants) in the case of Table 2. Similar
calculations were done to evaluate the statistical enrichments
for particular motifs and GO terms in Tables 3 and 4.
For the subcellular localization signal predictions, PSORT
II (http://psort.ims.u-tokyo.ac.jp/) was run as indicated previously
(21) and for the predictions of the transmembrane
domains, TMHMM (http://www.cbs.dtu.dk/services/TMHMM/)
was run with the cut-off value of 0.8. In cases where
a protein motif(s) and/or a predicted localization signal(s),
including transmembrane domains, were altered between
two alternative splicing variants belonging to the same
locus, the corresponding locus was defined as a ‘Motif/GO/
Subcellular localization/Transmembrane domain-changed’
locus.
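The hypergeometric enrichment test used for the motif and GO analyses can be reproduced with exact integer arithmetic. The sketch below assumes the upper-tail form of the test (probability of observing k or more motif-containing loci among the n alternative splicing loci) and returns the base-10 logarithm of the P-value, because the value itself is far too small for a floating-point number. The function name is hypothetical.

```python
from math import comb, log10


def log10_hypergeom_upper_tail(N, M, n, k):
    """log10 of P(X >= k) for a hypergeometric draw.

    N: total loci, M: motif-containing loci,
    n: alternative splicing loci, k: loci that are both.
    Requires k <= min(M, n); otherwise the tail is empty and log10 is undefined.
    """
    upper = min(M, n)
    numerator = sum(comb(M, x) * comb(N - M, n - x) for x in range(k, upper + 1))
    # Python integers are arbitrary precision, so the sum is exact.
    return log10(numerator) - log10(comb(N, n))


# Values quoted for Table 2 in the text.
log_p = log10_hypergeom_upper_tail(N=24425, M=12764, n=6877, k=5523)
print(f"log10(P) = {log_p:.1f}")
```

Working in log space avoids underflow; the result for the Table 2 counts lies far below the reported bound of 10^-16.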
Manual procedures to inspect the identified alternative
splicing variants
The results of the computational identification and annotation
of the alternative splicing were visually inspected by
the members of the alternative splicing annotation team of
H-Invitational meetings by using the G-integra human genome
browser [(12); Supplementary Figure 1] to verify the
accuracy of the detected alternative splicing. Several items
were manually evaluated, including correct discrimination
of truncated cDNAs, proper identification of the alternative
splicing types, and the corresponding functional annotations
related to them. Further details of the annotation items and
the records of the manual inspection are available at our
web site (see below). The results of the manual and computational
annotations were compared with each other, and in
cases where the results were consistent between the two
approaches, the alternative splicing and the related annotations
were defined as ‘validated’. The results obtained for
the annotations for each of the loci were made public and
freely available at our web site (http://jbirc.jbic.or.jp/h-inv2_as/).
Table 1. Statistics of the data processing and of the alternative splicing variants and exons identified
                                  #Locus   #cDNA    #Total exon   #Alternative exon   #Constitutive exon
H-Invitational total              25 585   56 419   389 895a      44 727              345 168
Successfully mapped               24 425   55 036   389 895       44 727              345 168
≥2 cDNAs per locus                10 127   35 030   331 924       44 727              287 197
Identified alternative splicing   6877     18 297   176 505       37 670              138 835
5′ end alternative splicing       4568     7494     18 297        7494                10 803
Internal alternative splicing     5565     11 156   139 911       25 236              114 675
3′ end alternative splicing       2933     4940     18 297        4940                13 357
5′-UTR alternative splicing       3216     4750     18 262        6398                11 864
CDS alternative splicing          6005     13 409   148 242       28 728              119 514
3′-UTR alternative splicing       797      1034     5877          1401                4476
aUnmapped cDNAs’ exons could not be counted.
Identification of uncommon patterns of alternative
splicing
Three ‘uncommon’ patterns of alternative variations were
defined as follows: (i) ‘bridged’: a locus in which two altern-
ative splicing variants are arrayed tandemly without sharing
any exons and another transcript ‘bridged’ these two variants,
sharing at least some of its exons with both of them; (ii) ‘nes-
ted’: a locus in which protein coding sequence (CDS) of one
alternative splicing variant is not shared with another variant
and (iii) ‘multiple CDS’: a locus in which different ORFs
>200 amino acids in length are annotated independently for
different alternative splicing variants having overlapping
CDSs of different reading frames.
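As an illustration of the first definition, a ‘bridged’ locus can be searched for by testing every pair of transcripts for a third transcript that shares exons with both. This sketch makes several assumptions that the text does not state: exons are compared by exact coordinate identity, "arrayed tandemly" is taken to mean that the genomic spans of the two transcripts do not overlap, and exon lists are sorted by position. The function names are hypothetical.

```python
from itertools import combinations


def shares_exon(a, b):
    """True if two transcripts have at least one identical exon (assumed criterion)."""
    return bool(set(a) & set(b))


def spans_overlap(a, b):
    """True if the genomic spans of two position-sorted exon lists overlap."""
    return a[0][0] <= b[-1][1] and b[0][0] <= a[-1][1]


def find_bridged(transcripts):
    """Return (variant_a, variant_b, bridging_transcript) name triples.

    transcripts: dict mapping a transcript name to its exon list of
    inclusive (start, end) tuples.
    """
    hits = []
    for (name_a, a), (name_b, b) in combinations(transcripts.items(), 2):
        if shares_exon(a, b) or spans_overlap(a, b):
            continue  # the two variants must be tandem and share no exons
        for name_c, c in transcripts.items():
            if name_c in (name_a, name_b):
                continue
            if shares_exon(c, a) and shares_exon(c, b):
                hits.append((name_a, name_b, name_c))
    return hits
```

Candidates found this way would still need the manual inspection described above, since shared exon coordinates alone do not confirm that the bridging cDNA is a genuine transcript.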
RT–PCR of the ‘bridged’ transcript
RT–PCR was performed using the primers:
primer A 5′-CGTGAGCTCGCCCGCCAGAAG-3′;
primer B 5′-TCCAACTCCAGCTCCACATC-3′;
primer C 5′-CGAGATGACGGGCTTTCTGC-3′;
primer D 5′-GGAATGCCATCGGTGCTGG-3′;
primer E 5′-CCGACTATGCAGAGGAGAAG-3′;
primer F 5′-GCGTTCTGCTGCTGCTCGAG-3′;
primer (GAPDH fw) 5′-TCGGAGTCAACGGATTTGGT-3′;
primer (GAPDH rv) 5′-TGACGGTGCCATGGAATTTG-3′,
using ABI Prism 7900 Real Time PCR (ABI) with standard reaction conditions.
The template RNAs (50 ng for each PCR) used were an RNA
panel (BD Biosciences). For a negative control, 50 ng of
human genomic DNA (Promega) was used as a template.
RESULTS
Identification of alternative splicing variants by manual
and computational methods
A total of 56 419 human full-length cDNAs (Supplementary
Table 1) were mapped onto the human genome and 55 036
cDNAs were unambiguously mapped. The mapped cDNAs
were clustered into 24 425 loci, resulting in 2.3 cDNAs per
locus on average. Single exon transcripts and the sole
transcript in the locus were then removed. As a result,
10 127 loci contained at least two cDNAs (for further details
on the mapping and clustering procedures, see Materials
and Methods). These 10 127 loci, containing 35 030 cDNAs
in total, were subjected to computational and manual inspection
schemes to determine whether alternative splicing variants (defined as
‘complete forms of the transcripts’, with fully sequenced
cDNAs) were included. We used this strategy since we considered
that both the manual and computational methods
have advantages and disadvantages. In manual annotation,
human errors are inevitable, whereas in computational
methods, detection of spurious alternative splicing
due to various errors/ambiguities inherent to automated
analyses is problematic. In order to maximize the accuracy of the
analysis, the results of the manual and computational
analyses should be compared with each other. When the results
from computational analyses were ‘approved’ by manual
inspections, the results were regarded as ‘validated’ (for the
criteria of the annotation, see Figure 1).
Figure 1. Schematic representation of the identification of the alternative
splicing. Essentially, the illustrated patterns of the exon pairs were searched
for and selected as alternative splicing exons in both computational and
manual annotations.
Table 2. Relation between alternative splicing genes and motifs
                                 #Motif-related locus          #NOT motif-related locus   Total
Alternative splicing locus       5523a (3015; motif-changed)   1354                       6877
NOT alternative splicing locus   7241                          10 307                     17 548
Total                            12 764                        11 661                     24 425
aP-value < 10^-16.
Out of 35 030 cDNAs, the cDNAs annotated to be derived
from ‘5′/3′ end-truncated’ or ‘immature’ transcripts either by
manual or computational annotation were also excluded. In
total, 5308 (15%), 706 (2%) and 787 (2%) cDNAs were
defined as ‘5′ end’, ‘3′ end’ and ‘5′/3′-both side’ truncated,
respectively (for further confirmation of the 5′ end completeness
of each of the transcripts, see Supplementary Figure 2).
A total of 1297 (4%) cDNAs were defined as ‘immature’.
Also note that for 3913 (11%) cDNA sequences, some
of the annotators reported concerns that they might contain
sequence problems due to cloning errors. The largest group
among them consisted of suspected spontaneous deletions of a part of
the cDNA insert in bacteria. The cDNAs supported by no
ESTs were also identified. However, we did not remove
these indecisive cDNAs, because it was not always straightforward
to distinguish them from non-canonical, rarely occurring
but biologically interesting alternative splicing
(also see below). We provide independent statistics using
the dataset which did not include them in Supplementary
Table 2. Also, caveats for possible errors were precisely
annotated for each of them in our web site (http://jbirc.jbic.or.jp/h-inv2_as/).
As a result, a representative ‘validated’ dataset of 6877 loci
(68%), in which 18 297 ‘unique’ alternative splicing variants
(2.7 variants per locus) were made of 37 670 alternative spli-
cing exons (2.1 exons per variant; also see Supplementary
Figure 3), was obtained and used for the subsequent analyses
(Table 1). Also, for each of the analyses described
below, we chose the same strategy: to perform computational
calculations first and then manually check the results. Therefore,
each of the subsequent statistical analyses was based on
the data which had passed both computational and manual
inspections. A schematic representation of the computational
calculations and the web interfaces used for the manual
checks are shown in Supplementary Figure 1. The final res-
ults of each of the annotations were made public and freely
available from our web site.
Interestingly, a surprisingly large population of the alternative
splicing variants identified in the present study (as full-length
forms) did not match any transcript in Ensembl (http://www.ensembl.org).
When H-Invitational transcripts were defined as
identical to Ensembl transcripts if all of the exon–intron
boundaries corresponded to those of Ensembl transcripts
within a 10 bp allowance, we found 11 704 out of 18 297 (64%)
of the H-Invitational transcripts were represented in Ensembl.
When this was counted at the locus level, 6284 out of
6877 (91%) of the alternative splicing loci presented here
contained novel alternative splicing relationships. This low
level of overlap might reflect the fact that Ensembl is
mainly based on conservative analyses of EST
sequences and does not put much stress on full-length
cDNAs.
Table 3. Most frequently observed motifs which were affected by alternative splicing variants
InterPro ID   Motifs in alternative splicing locus   Motifs in all locus   Ratio   Significance of enrichment (P-value)   Definition
003598        417                                    495                   0.84    <10^-16                                Immunoglobulin C-2 type
000005        237                                    245                   0.97    <10^-16                                Helix–turn–helix, AraC type
000867        73                                     79                    0.92    <10^-16                                Insulin-like growth factor-binding protein (IGFBP)
000345        114                                    211                   0.54    10^-15                                 Cytochrome c heme-binding site
003962        55                                     78                    0.71    10^-15                                 Fibronectin, type III subdomain
002017        56                                     88                    0.64    10^-12                                 Spectrin repeat
000379        62                                     103                   0.6     10^-11                                 Esterase/lipase/thioesterase
002035        42                                     60                    0.7     10^-11                                 von Willebrand factor, type A
000595        22                                     25                    0.89    10^-10                                 Cyclic nucleotide-binding domain
003034        31                                     42                    0.74    10^-9                                  DNA-binding SAP
Table 4. Most frequently observed GO terms which were affected by alternative splicing variants
GO ID     GOs in alternative splicing locus   GOs in all locus   Ratio   Significance of enrichment (P-value)   GO term
0003676   451                                 1112               0.41    <10^-16                                Nucleic acid binding
0003700   327                                 518                0.63    <10^-16                                Transcription factor activity
0003677   276                                 603                0.46    <10^-16                                DNA-binding
0004713   164                                 318                0.52    <10^-16                                Protein tyrosine kinase activity
0005215   164                                 299                0.55    <10^-16                                Transporter activity
0008270   148                                 276                0.54    <10^-16                                Zinc ion binding
0005520   73                                  79                 0.92    <10^-16                                Insulin-like growth factor-binding
0005524   379                                 967                0.39    10^-14                                 ATP binding
0003824   190                                 429                0.44    10^-13                                 Catalytic activity
0016491   116                                 237                0.49    10^-11                                 Oxidoreductase activity
Patterns of the identified alternative splicing variants
Using the obtained dataset of the 18 297 alternative splicing
variants from 6877 loci, we first examined the genome-wide
features of alternative splicing in terms of their complete
variants. Although similar analyses have been carried out
by several groups (22–24), most of them were based
either on partial EST sequences or analyses using smaller
datasets or smaller numbers of cDNAs subjected to limited
analyses (25).
The alternative splicing variants were first classified with
regard to the splicing types, as often employed in previous
studies (8). As shown in Figure 2, ‘cassette’ type alternative
splicing variants were most frequently observed in our data-
set, which is consistent with a previous result for the
alternative splicing genes located on chromosome 22 (26).
This type of alternative splicing variants may be preferred,
because diversification of the transcripts can be achieved
more flexibly than with the other categories of splice
variations. ‘Retained intron’ type alternative splicing variants
were observed in 1970 loci. However, it became a concern to
us that cDNAs derived from unspliced, immature forms of
transcripts might be classified as ‘retained intron’ types. To
address this concern, we further checked how many of the
alternative splicing variants of this category might be subject
to nonsense-mediated decay [NMD; see Materials and
Methods for the procedure to identify possible NMD-target
variants (27)]. In 402 cases, one of the alternative splicing
variants was annotated as a possible NMD-target and thus
might not have biological relevance. However, in the
remaining cases, the alternative splicing variants with a
‘retained intron’ seemed to encode proteins with no explicit
defects. This may indicate that most variants with ‘retained
introns’ were beneficial for the diversification of the human
proteome.
We then examined the sequences of the alternative splicing
exons and found that remnants of Alu-like elements were
detected in a significant population (12%), which is consider-
ably higher than the frequency in constitutive exons
(2%; Table 5). This is consistent with previous results demon-
strating that integration and subsequent alteration of the Alu-
like elements play a significant role in the birth of alternative
splicing exons (28). We also examined exon–intron junctions
and detected canonical GT-AG (22) splice junctions in 156 181
(98.7%) out of 158 208 sites in total. Non-canonical
AT-AC sites, which are processed by a splicing
machinery with different and specific components (29),
were also found in 89 cases. Another non-canonical splice
site, GC-AG, was found in 57 cases. Taken together, we
found that non-canonical junctions were enriched in alternative
splicing exons (3.4%) compared to constitutive exons
(0.8%). It is also intriguing that ESEs (20), which are suggested
to play a role in efficient splicing, were less
that the alternatively spliced junctions may be less determin-
istic than constitutive junctions, thus allowing versatile
patterns of splicing events.
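The junction classes named above can be illustrated with a minimal sketch that labels an intron by its terminal dinucleotides and recomputes the non-canonical proportions from the counts in Table 5. The function name, the label strings and the example intron sequences are assumptions made for illustration only; they are not the procedure or data used in this study.

```python
from collections import Counter

# Terminal dinucleotide pairs (donor, acceptor) named in the text.
JUNCTION_TYPES = {
    ("GT", "AG"): "canonical GT-AG",
    ("GC", "AG"): "non-canonical GC-AG",
    ("AT", "AC"): "non-canonical AT-AC",
}


def classify_junction(intron: str) -> str:
    """Classify an intron (sense strand) by its first and last two bases."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to carry both splice sites")
    return JUNCTION_TYPES.get((seq[:2], seq[-2:]), "non-canonical other")


# Hypothetical intron sequences, for illustration only.
introns = ["GTAAGTCCTTTCAG", "GCAAGTTTTTCCAG", "ATATCCTTTTCCAC", "GGAAGTTTTTCCAG"]
counts = Counter(classify_junction(seq) for seq in introns)

# Non-canonical proportions recomputed from the counts given in Table 5.
alt_noncanonical = 954 / 27842       # alternative splicing exons
const_noncanonical = 1073 / 130366   # constitutive exons
print(f"{alt_noncanonical:.1%} vs {const_noncanonical:.1%}")
```

A lookup table keyed on the dinucleotide pair keeps the three named classes explicit and sends every other combination to a catch-all label.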
Positions of the identified alternative splicing exons
In our dataset of the 6877 alternative splicing genes, alternative
splicing events located at the 5′ end, internal and 3′ end
exons were observed in 4568, 5565 and 2933 genes, and
consisted of 7494, 11 156 and 4940 ‘5′ end’, ‘internal’ and
‘3′ end’ alternative splicing variants, respectively (Table 1).
Considering that the total numbers of the exons examined
were 18 297, 139 911 and 18 297, the average frequencies of
the alternative splicing exons were 0.41, 0.08 and 0.27 for
each of the positions, respectively. It was intriguing that the
‘5′ end’ alternative splicing variants formed the most frequent
category. We further examined these 5′ end alternative splicing
exons and found that, among the 7494 ‘5′ end alternative
splicing genes’, 3495 genes (47%) contained 5′ end alternative
splicing variants that were separated by more than 500 bp
from each other. It is likely that these alternative splicing
exons were produced as a consequence of the use of alternative
promoters (30). Therefore, the biological significance of
those 5′ end alternative splicing variants should be
comprehensively analyzed together with the results of recent
studies demonstrating that alternative use of promoters is
prevalent in human genes (18,31,32). It is possible that the
5′ end alternative splicing exons were most abundant because
of the requirements for the diversification of the
transcriptional modulation and vice versa.
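The average frequencies quoted above follow directly from the counts in the text; the short sketch below reproduces the arithmetic. The dictionary keys and variable names are illustrative choices, not identifiers from the study.

```python
# Counts taken from the text: alternative splicing variants per exon position
# and the total numbers of exons examined at each position.
as_variants = {"5' end": 7494, "internal": 11156, "3' end": 4940}
exons_examined = {"5' end": 18297, "internal": 139911, "3' end": 18297}

# Average frequency of alternative splicing exons for each position.
frequency = {pos: as_variants[pos] / exons_examined[pos] for pos in as_variants}
for pos, freq in frequency.items():
    print(f"{pos}: {freq:.2f}")

# Share of the 5' end cases whose variants lie more than 500 bp apart.
far_apart_share = 3495 / 7494
print(f"separated by >500 bp: {far_apart_share:.0%}")
```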
In order to characterize the biological consequences of the
identified alternative splicing, we examined the relative
position of the alternative splicing variants compared to their
CDSs in 6555 genes in which two or more alternative
splicing variants are annotated as protein coding. We
found that alternative splicing variants were located in the
5′-untranslated regions (5′-UTRs), CDSs and 3′-UTRs in
4750, 13 409 and 1034 alternative splicing variants in 3216,
6005 and 797 genes, respectively. In the majority of the
alternative splicing genes, alterations in their CDSs were
observed. Note that 3′ end alternative splicing was
counter-selected in our analyses by the in silico filtration of
NMD targets. Although in 80% of the alternative splicing
variants the alteration of the polypeptide length was less than
200 amino acids, 3% of the alternative splicing variants
resulted in an alteration of the polypeptide length of more
than 500 amino acids (Figure 3). For this population,
alternative splicing should have the most significant impact on
the functions of the encoded proteins. It is even possible that
these loci encode two functionally different proteins
simultaneously, in which the fraction of the transcripts that is
shared and their biological functional similarities may vary
widely. Extreme examples of these cases will be discussed later.
Figure 2. Patterns of the identified alternative splicing.
Table 5. Characteristics of the identified alternative splicing exons
                            Exon–intron junction type           Containing         Containing
                            Canonical         Non-canonical     Alu-like element   ESE          Total
Alternative splicing exon   26 888 (96.6%)    954 (3.4%)        12%                8%           27 842
Constitutive exon           129 293 (99.2%)   1073 (0.8%)       2%                 10%          130 366
Total                       156 181 (98.7%)   2027 (1.3%)       4%                 9%           158 208
3922 Nucleic Acids Research, 2006, Vol. 34, No. 14
Possible biological relevance of the identified
alternative splicing variants to the diversification of
protein functions
The influence of the identified alternative splicing variants on
the encoded protein sequences and their possible biologi-
cal functions was further evaluated from the following
viewpoints. The complete records of each of the analyses
are available for each of the entries from our web site
(http://jbirc.jbic.or.jp/h-inv2_as/).
Motif. Protein motifs were frequently affected by the alternative
splicing located within the CDSs. Of the 6005 genes with
alternative splicing within their CDSs, 3015 had alternative
splicing located in protein motifs (Table 6). As shown in
Figure 4, the alternative splicing variants influenced a wide
variety of protein motifs. For example, we identified a novel
alternative splicing variant, AK093798, in the IκB kinase
epsilon gene (IKKε; NM_014002). This novel variant specifically
lacks the kinase domain, while the rest of the CDS remains
intact (Figure 4A). The IKK complex plays a pivotal role in
immune and inflammatory responses by transmitting a variety
of signals to a transcription factor, NF-κB (33). It has been
reported that a kinase-deficient dominant-negative mutant of
IKKε blocks the induction of NF-κB invoked by the T cell
receptor but has no effect on its activation invoked by
TNFα (34). The natural kinase-deficient variant of IKKε
represented by AK093798 could serve as a modulator between
these two signaling pathways, and thus provide cells with
the opportunity to regulate the relative amounts of the signals
that the two pathways receive at a single point.
In total, among 6877 genes with alternative splicing,
alterations of the motifs due to the alternative splicing were
observed in 3015 genes (44%; Table 2). In 2508 genes
(36%), alternative splicing did not change the annotated
motifs. The remaining 1354 genes (20%) contained no annotated
motif. Overall, motifs were more frequent in the alternative
splicing genes (80%) than in the 17 548 non-alternative
splicing genes, of which only 7241 (41%) contained annotated
motifs (P < 10^-16; Table 2). We also found that alternative
splicing exons were enriched in motifs. On average, one motif
was contained per 1.6 alternative splicing exons, while the
average frequency of motifs was only one per 3.0 constitutive
exons. A similar tendency was also suggested by a recent
EST-based study (35). Motifs thus appear to be actively
associated with alternative splicing, which in many cases
enables direct switching of the motifs.
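The proportions in this paragraph can be rechecked from the gene counts alone. The sketch below does so and, as an assumption, compares the two groups with a pooled two-proportion z-test. The text does not state which statistical test produced the reported P value, so this is only one plausible way to see that the difference is far beyond chance; the function name is hypothetical.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z statistic and two-sided normal P value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Alternative splicing genes with annotated motifs: motif-changed + unchanged.
as_with_motif = 3015 + 2508
as_total = 6877
non_as_with_motif, non_as_total = 7241, 17548

print(f"AS genes with motifs:     {as_with_motif / as_total:.0%}")
print(f"non-AS genes with motifs: {non_as_with_motif / non_as_total:.0%}")

z, p = two_proportion_z(as_with_motif, as_total, non_as_with_motif, non_as_total)
print(f"z = {z:.1f}, P < 1e-16: {p < 1e-16}")
```

With groups of this size the P value underflows double precision, which is consistent with the bound reported in the text.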
Subcellular localization. We also examined whether the
predicted protein motifs determining the subcellular
localization of the proteins, such as secretion signal peptides,
mitochondria targeting signals and transmembrane domains,
were affected by the alternative splicing (Table 6). For the
subcellular localization signal predictions, PSORT II was
used. The subcellular localization signal was predicted for
each of the alternative splicing variants, and in cases where
the alternative splicing variants were predicted to localize
in different subcellular compartments, the alternative splicing
was categorized as ‘subcellular localization-changed’
(shown in Figure 4B). Similarly, ‘transmembrane
domain-changed’ alternative splicing was identified using
TMHMM. In total, subcellular localizations were affected in
2982 genes and transmembrane domains in 1348 genes (Table 6).
Figure 4C shows a case in which the transmembrane domain was
altered. The most frequently observed switching of the
predicted subcellular localization signal was between ‘nuclear’
and ‘cytoplasm’ (2455 cases). Switching between ‘secretory’
and ‘plasma membrane’ was detected in 1145 cases. These
findings suggest that the proteins produced from the same
loci as a result of the alternative splicing are actively utilized
in a multi-faceted manner at different locations or
compartments in the cells. A similar tendency for the use of
several localization signals in the protein isoforms was also
indicated by a recent bioinformatics study (36).
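The categorization rule described above (a locus is ‘localization-changed’ when its variants are predicted to localize in different compartments) can be sketched as follows. The locus names, variant names and predicted labels are hypothetical placeholders, and the predictions themselves are assumed to have been produced beforehand by a tool such as PSORT II; nothing here calls that program.

```python
def changed_loci(predictions: dict[str, dict[str, str]]) -> list[str]:
    """Return loci whose variants do not all share one predicted label."""
    return sorted(
        locus
        for locus, by_variant in predictions.items()
        if len(set(by_variant.values())) > 1
    )


# Hypothetical per-variant predictions, for illustration only.
localization = {
    "locus_A": {"variant_1": "nuclear", "variant_2": "cytoplasm"},
    "locus_B": {"variant_1": "secretory", "variant_2": "secretory"},
    "locus_C": {
        "variant_1": "secretory",
        "variant_2": "plasma membrane",
        "variant_3": "secretory",
    },
}

flagged = changed_loci(localization)
print(flagged)
```

The same function applies unchanged to transmembrane domain predictions if each variant's label encodes, for example, the number of predicted membrane-spanning segments.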
GO. In a large population of the genes (1779 genes; 27%;
Table 6), the GO terms attached to the transcripts were
altered between the alternative splicing variants even within
the same locus because the protein motifs and subcellular
localizations were affected by the alternative splicing
variants, as mentioned above. For example, in the case shown
in Figure 4A, the GO term ‘kinase activity’ (and related
terms) was assigned to the kinase domain-containing variants,
but not to the kinase-negative variants. Some of the motifs
and GO terms were significantly enriched in alternative
splicing genes (Table 4). It is especially noteworthy that
motifs and GO terms associated with signal transduction
and transcriptional regulation were most frequently
influenced by alternative splicing (Tables 3 and 4). Fine
adjustment of the resultant protein functions might be one of
the most important functions of alternative splicing in higher
organisms, such as humans.
Figure 3. Distribution of the length difference between the alternative
splicing variants. The percentages show the populations belonging to the
corresponding groups.
Uncommon patterns of alternative splicing
During manual annotation, we noticed that a number of
transcripts undergo novel patterns of alternative
splicing. Their definitions are described in the legend for
Figure 5.
Bridged. As shown in the upper panel of Figure 5A, a cDNA,
AK000438, was identified as an alternative splicing variant of
AK000479 and AF161485. However, it was more likely that
AK000438 was overlaid on two adjacent loci rather than being ‘a
variant of either of the loci’, representing a ‘bridging’
transcript from two genes, the SERF2 gene (NM_005770) and the
HYPK gene (NM_016400). It is noteworthy that both genes
are associated with neuromuscular diseases. These genes
were originally identified as a candidate modifying gene for
spinal muscular atrophy (37) and a Huntingtin interacting
protein (38), respectively. We excluded the possibility that
AK000438 was derived from artifacts, such as chimeric
transcripts produced during the cDNA cloning process, because:
(i) genes immediately adjacent to each other are
bridged; (ii) the transcript was connected exactly at
exon–intron junctions satisfying the GT-AG rule; (iii) there
are supporting dbEST sequences which correspond to the
‘bridging’ transcript and (iv) RT–PCR using primer sets at
the SERF2 and HYPK transcripts produced direct evidence
that the ‘bridging’ transcript exists in vivo (Figure 5A, lower
panel). A previous study demonstrated that this kind of
‘bridging’ transcript is produced between the GALT gene and the
IL-11Ra gene, and results in an mRNA encoding a protein
retaining the functions of both proteins (39). This kind of
‘bridging’ transcript has been reported to be especially
frequent in Caenorhabditis elegans (179 out of 3829 alternative
splicing genes), as are complex genes generating two
completely distinct proteins in addition to a fusion of both.
Recent reports suggested that such transcripts seem abundant
in humans as well [(40); also see
http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/index.html?human].
Although further experimental characterization is required to
clarify their biological consequences, it is possible that such
bridged transcripts play some roles by combining the respective
gene functions in general.
Figure 4. Examples of the alternative splicing variants detected as ‘motif-changed’ (A), ‘subcellular localization-changed’ (B) and ‘transmembrane
domain-changed’ (C). Exons and introns are represented by green boxes and lines, respectively. The violet boxes are protein coding regions and yellow boxes are
alternative splicing exons. The positions of the detected motifs and transmembrane domains are shown beneath the transcripts. In the uppermost panel, GO terms
attached to the transcript indicated by the lower line are shown.
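The ‘bridged’ pattern, as defined in the legend for Figure 5 (two tandemly arrayed variants that share no exons, plus another transcript that shares exons with both), can be expressed as a small interval check. This is a minimal sketch under stated assumptions: transcripts are lists of (start, end) exon coordinates on one strand, ‘sharing an exon’ is approximated by any overlap between exon intervals, and all coordinates and names are hypothetical. The strict selection criteria actually applied are given in Materials and Methods and are not reproduced here.

```python
Exon = tuple[int, int]


def exons_overlap(a: Exon, b: Exon) -> bool:
    """True if two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def shares_exon(t1: list[Exon], t2: list[Exon]) -> bool:
    """True if any exon of t1 overlaps any exon of t2."""
    return any(exons_overlap(e1, e2) for e1 in t1 for e2 in t2)


def is_bridging(candidate: list[Exon], upstream: list[Exon], downstream: list[Exon]) -> bool:
    """Check the 'bridged' definition for one candidate transcript."""
    # Tandem arrangement: upstream ends before downstream begins, which
    # also implies that the two variants share no exons.
    tandem = max(end for _, end in upstream) < min(start for start, _ in downstream)
    return tandem and shares_exon(candidate, upstream) and shares_exon(candidate, downstream)


# Hypothetical exon coordinates, for illustration only.
gene_a = [(100, 200), (300, 400)]
gene_b = [(1000, 1100), (1300, 1400)]
bridge = [(100, 200), (300, 380), (1000, 1100), (1300, 1400)]

print(is_bridging(bridge, gene_a, gene_b))
print(is_bridging(gene_a, gene_a, gene_b))
```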
Nested. Figure 5B shows another uncommon type of alternative
splicing (‘nested’). In this case, all exons of AB508739
except for a shared 3′ end exon are embedded in the last
intron of AK130874 (Figure 5B, upper panel). Neither a
mapping error of the cDNAs nor an assembly error of the genome
sequence is likely to account for this observation, since the
sequence identity/coverage between the genome and cDNA
sequences is almost 100% for each of the exons and the
genomic sequences are ‘finished’ in this region. In this
case, these transcripts may share some modulatory elements,
such as binding sites of regulatory proteins or non-coding
RNAs, but their biological functions should be completely
different. A similar specific case of alternative splicing was
also observed with regard to the 5′ end exons, in which
promoters seemed to be shared (Figure 5B, lower panel).
In a very recent paper, two otherwise independent genes
were shown to be co-expressed from one promoter, giving
rise to tissue-specific expression of an alternative form of
an otherwise B-cell specific gene (41).
Multiple CDS. Alternative splicing even seemed to allow a
particular genomic sequence to encode two distinct amino
acid sequences in different reading frames. The last exon of
AK097244 overlapped with the second exon of AK000272
(Figure 5C, upper panel). Both exons were coding exons, but
their reading frames were different. Similarly, the reading
frames of the second exon of AK096258 and the fifth exon of
BC029781 were different (Figure 5C, lower panel).
Interestingly, in this gene, another alternative splicing
variant, BC043484, used the latter-type reading frame for the
N-terminal half of the protein and the former-type reading
frame for the C-terminal half. In this variant, the sixth exon
seemed to serve as a ‘frame switching’ exon. Although the
gene function of this locus remains unknown, it is possible
that the novel proteins discovered in the present study
play roles that are distinct from, but at the same time
correlated with, each other. Consistent with this
possibility, emerging evidence also suggests that cases
in which mRNAs transcribed from a single locus are used
as templates for two independent proteins are not extremely
rare (42). Further experimental validation should clarify
which transcripts are translated to produce functional proteins
and which have regulatory roles.
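Whether two overlapping coding exons are read in different frames can be checked from genomic coordinates alone. The sketch below assumes plus-strand exons described by the genomic start and end of their coding portion and by a phase value in the GFF sense (the number of bases to skip at the start of the exon to reach the first complete codon). The field names and coordinates are hypothetical, and this is not the annotation procedure used in the study.

```python
def codon_offset(cds_start: int, phase: int) -> int:
    """Genomic position modulo 3 at which codons begin in this coding exon."""
    return (cds_start + phase) % 3


def overlap(a_start: int, a_end: int, b_start: int, b_end: int):
    """Return the shared closed interval of two exons, or None if disjoint."""
    start, end = max(a_start, b_start), min(a_end, b_end)
    return (start, end) if start <= end else None


def frames_differ(exon_a: dict, exon_b: dict) -> bool:
    """True if two plus-strand coding exons overlap but use different frames."""
    shared = overlap(exon_a["start"], exon_a["end"], exon_b["start"], exon_b["end"])
    if shared is None:
        return False
    return codon_offset(exon_a["start"], exon_a["phase"]) != codon_offset(
        exon_b["start"], exon_b["phase"]
    )


# Hypothetical coding exons, for illustration only.
exon_a = {"start": 1000, "end": 1200, "phase": 0}
exon_b = {"start": 1100, "end": 1300, "phase": 1}
exon_c = {"start": 1102, "end": 1250, "phase": 0}

print(frames_differ(exon_a, exon_b))  # overlapping, different frames
print(frames_differ(exon_a, exon_c))  # overlapping, same frame
```

Comparing the codon start positions modulo 3 avoids translating either sequence: two exons on the same strand share a reading frame exactly when their codon boundaries fall on the same genomic positions.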
The numbers of cases identified for each of the
above-mentioned ‘uncommon’ patterns of alternative splicing
are shown in Table 6. In any case, it would be more
appropriate to regard these cases as two ‘genes’ merged into a
single ‘locus’ rather than as mutually ‘alternative’ variants
arising from the same locus, since it is unlikely that these
variants share the same protein function.
If we consider the evolutionary origins of these loci, they
could originally have been two neighboring genes that then
evolved to share some of their exons as a result of mutational
changes. The extent to which multiple loci are merged varies
from case to case. In this study, we applied strict criteria to
select them (see Materials and Methods). Therefore, the numbers
shown in Table 6 should be minimum values, suggesting
that it is not extremely rare to observe such ‘uncommon’
alternative splicing events in cells. Such mechanisms may
allow for further diversification of the transcriptome of
human genes. Careful evaluation (e.g. of whether the cDNAs
originated from cancerous cells) is necessary before concluding
whether the particular examples described here have biological
relevance. If they do, such mechanisms would enable the
multi-faceted use of the biological information encoded in a
given genomic region.
It should also be noted that without a combination of manual
and computational analyses, this kind of alternative splicing
variant would have been overlooked. By performing
detailed manual and computational analyses, we could
precisely identify these novel patterns of alternative splicing.
Also, if only partial information on the alternative
splicing exons had been available, these findings would
have been impossible. Judging from these results,
alternative splicing is likely to be used in a more versatile
manner to enable the diversification of the functions
of human genes than was previously thought.
DISCUSSION
In this paper we described in detail the genome-wide
identification and characterization of alternative splicing
variants of human gene transcripts based on completely
sequenced human full-length cDNAs. Starting with 56 419
cDNAs, we identified 18 297 complete alternative splicing
variants at 6877 loci of human genes and precisely annotated
each of them.
Table 6. Numbers of the genes in which alternative splicing variants should
influence the possible protein functions
                                                                  #Locus   #cDNA
Alternative splicing affecting functional annotation (total)     4481     12 542
  Motif-changed                                                   3015     8727
  Subcellular localization-changed                                2982     8624
  GO-changed                                                      1779     5179
  Transmembrane domain-changed                                    1348     3933
Uncommon alternative splicing pattern (total)                     316      1033
  Bridged                                                         129      604
  Nested                                                          172      390
  Multiple CDS                                                    27       56
Lander et al. (6) proposed that there should be five
alternative splicing variants per locus in human genes. Although
our dataset is the largest one from a full-length
cDNA collection, it is much smaller than that from dbEST,
which includes more than 7 million ESTs. Therefore, some
of the alternative splicing variants that actually exist in
human genes might not be represented in our dataset (we
identified alternative splicing variants from 68% of the loci
examined, with 2.7 variants per locus). However, our dataset
has two major advantages that largely compensate
for the shortage of coverage. First, our dataset had been
validated by heuristic annotations, and therefore various
computational errors which could be easily discriminated by the
human eye were excluded. Secondly, each of the identified
alternative splicing variants was supported by full-length
cDNAs whose sequences had been completely determined.
This enabled us to ensure that each of the alternative splicing
variants identified in this study corresponds to a complete
form of a particular transcription unit. This feature was
especially important when the relevance of the alternative
splicing variants to the protein motifs or various subcellular
localizations was evaluated. If comprehensive information
on the cDNA sequences had not been available, those
analyses would have been less reliable. Sometimes protein motifs
are embedded over a wide region of the protein sequences,
and not all combinations of the alternative splicing exons
may be allowed. Besides, for certain types of subcellular
targeting signals, such as signal peptides, the position within
the protein sequence is critical.
Figure 5. Examples of the ‘uncommon’ patterns of alternative splicing; ‘bridged’ (A), ‘nested’ (B) and ‘multiple CDS’ (C). These ‘uncommon’ patterns of
alternative variations were defined as follows: (i) ‘bridged’: a locus in which two alternative splicing variants were arrayed tandemly without sharing any exons
and another transcript ‘bridged’ these two variants, sharing at least some of its exons with both of them; (ii) ‘nested’: a locus in which the CDS region of one
alternative splicing variant was not shared with another variant and (iii) ‘multiple CDS’: a locus in which different ORFs >200 amino acids in length were
annotated independently for different alternative splicing variants having overlapping CDSs of different reading frames. In the lower panel of (A), the results of
RT–PCR are shown. Each photograph shows the amplicons of the indicated RT–PCR using the indicated primers. Tissue origins of the template RNAs are shown
in the margin. The asterisk indicates a non-specific band. The coloring of the figures is the same as in Figure 4.
Recently, it was reported that many of the alternative
splicing variants seemed not to be evolutionarily conserved, and
thus the biological significance of their existence might be
questionable (43,44). The same argument may be used in
the other direction as well: alternative splicing could be the
easiest road to diversification of the species. Future analyses
of our full-length alternative splicing data should be useful to
clarify which alternative splicing events are evolutionarily
unstable and which provide a molecular basis for various
species-specific biological features. Indeed, it will be
important in future studies to discriminate which of the
identified transcript variants have a general raison d’être and
which play species-specific roles, because lack of such
knowledge would severely restrict the potential power of the
comparative genomic approaches utilizing the genomic
sequences of many organisms which will be determined in
the next few years. The information described here, together
with the availability of the accompanying physical full-length
cDNA clone resources, should lay firm groundwork for
exploring how alternative splicing generates the functional
diversification of the human transcriptome and proteome.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR online.
ACKNOWLEDGEMENTS
The authors thank Y. Fujii, Y. Sato, H. Sakai, T. Habara,
C. Yamasaki and M. Tanino for genome mapping and ORF
prediction of the H-Invitational full-length cDNA dataset. The
authors thank F. Todokoro, H. Kawashima, E. Sekimori and
H. Wakaguri for technical support of computational analysis.
The authors thank K. Abe for experimental validation of
the alternative splicing. The authors are also grateful to
E. Nakajima for critical reading of the manuscript. This
research was financially supported by the Ministry of
Economy, Trade and Industry of Japan (METI), the
Ministry of Education, Culture, Sports, Science and
Technology of Japan (MEXT), the Bundesministerium für
Bildung und Forschung (BMBF, Grant NGFN-01GR0420),
and the Japan Biological Informatics Consortium (JBIC).
Funding to pay the Open Access publication charges for this
article was provided by JBIC.
Conflict of interest statement. None declared.
REFERENCES
1. Gilbert,W. (1978) Why genes in pieces? Nature, 271, 501.
2. Lopez,A.J. (1998) Alternative splicing of pre-mRNA: developmental
consequences and mechanisms of regulation. Annu. Rev. Genet., 32,
279–305.
3. Smith,C.W. and Valcarcel,J. (2000) Alternative pre-mRNA splicing:
the logic of combinatorial control. Trends Biochem. Sci., 25, 381–388.
4. Schmucker,D. and Flanagan,J.G. (2004) Generation of
recognition diversity in the nervous system. Neuron, 44,
219–222.
5. Wojtowicz,W.M., Flanagan,J.J., Millard,S.S., Zipursky,S.L. and
Clemens,J.C. (2004) Alternative splicing of Drosophila Dscam
generates axon guidance receptors that exhibit isoform-specific
homophilic binding. Cell, 118, 619–633.
6. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C.,
Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001)
Initial sequencing and analysis of the human genome. Nature,
409, 860–921.
7. Ewing,B. and Green,P. (2000) Analysis of expressed sequence tags
indicates 35 000 human genes. Nature Genet., 25, 232–234.
8. Ladd,A.N. and Cooper,T.A. (2002) Finding signals that regulate
alternative splicing in the post-genomic era. Genome Biol., 3,
reviews0008.1–0008.16.
9. Modrek,B. and Lee,C. (2002) A genomic view of alternative splicing.
Nature Genet., 30, 13–19.
10. Lee,C., Atanelov,L., Modrek,B. and Xing,Y. (2003) ASAP: the
Alternative Splicing Annotation Project. Nucleic Acids Res.,
31, 101–105.
11. Stamm,S., Riethoven,J.J., Le Texier,V., Gopalakrishnan,C.,
Kumanduri,V., Tang,Y., Barbosa-Morais,N.L. and Thanaraj,T.A.
(2006) ASD: a bioinformatics resource on alternative splicing. Nucleic
Acids Res., 34, D46–D55.
12. Imanishi,T., Itoh,T., Suzuki,Y., O’Donovan,C., Fukuchi,S.,
Koyanagi,K.O., Barrero,R.A., Tamura,T., Yamaguchi-Kabata,Y.,
Tanino,M. et al. (2004) Integrative annotation of 21 037 human genes
validated by full-length cDNA clones. PLoS Biol., 2, e162.
13. Zhang,Q.H., Ye,M., Wu,X.Y., Ren,S.X., Zhao,M., Zhao,C.J., Fu,G.,
Shen,Y., Fan,H.Y., Lu,G. et al. (2000) Cloning and functional analysis
of cDNAs with open reading frames for 300 previously undefined
genes expressed in CD34+ hematopoietic stem/progenitor cells.
Genome Res., 10, 1546–1560.
14. Wiemann,S., Weil,B., Wellenreuther,R., Gassenhuber,J., Glassl,S.,
Ansorge,W., Bocher,M., Blocker,H., Bauersachs,S., Blum,H. et al.
(2001) Toward a catalog of human genes and proteins: sequencing and
analysis of 500 novel complete protein coding human cDNAs. Genome
Res., 11, 422–435.
15. Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
Altschul,S.F. et al. (2002) Generation and initial analysis of more than
15 000 full-length human and mouse cDNA sequences. Proc. Natl
Acad. Sci. USA, 99, 16899–16903.
16. Ota,T., Suzuki,Y., Nishikawa,T., Otsuki,T., Sugiyama,T., Irie,R.,
Wakamatsu,A., Hayashi,K., Sato,H., Nagai,K. et al. (2004) Complete
sequencing and characterization of 21 243 full-length human cDNAs.
Nature Genet., 36, 40–45.
17. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling of
automated sequencer traces using phred. I. Accuracy assessment.
Genome Res., 8, 175–185.
18. Kimura,K., Wakamatsu,A., Suzuki,Y., Ota,T., Nishikawa,T.,
Yamashita,R., Yamamoto,J., Sekine,M., Tsuritani,K., Wakaguri,H.
et al. (2006) Diversification of transcriptional modulation: large-scale
identification and characterization of putative alternative promoters of
human genes. Genome Res., 16, 55–65.
19. Fairbrother,W.G., Yeh,R.F., Sharp,P.A. and Burge,C.B. (2002)
Predictive identification of exonic splicing enhancers in human genes.
Science, 297, 1007–1013.
20. Fairbrother,W.G., Yeo,G.W., Yeh,R., Goldstein,P., Mawson,M.,
Sharp,P.A. and Burge,C.B. (2004) RESCUE-ESE identifies candidate
exonic splicing enhancers in vertebrate exons. Nucleic Acids Res., 32,
W187–W190.
21. Nakao,M. and Nakai,K. (2002) Improvement of PSORT II protein
sorting prediction for mammalian proteins. Genome Informatics, 13,
441–442.
22. Burset,M., Seledtsov,I.A. and Solovyev,V.V. (2000) Analysis of
canonical and non-canonical splice sites in mammalian genomes.
Nucleic Acids Res., 28, 4364–4375.
23. Croft,L., Schandorff,S., Clark,F., Burrage,K., Arctander,P. and
Mattick,J.S. (2000) ISIS, the intron information system, reveals the
high frequency of alternative splicing in the human genome. Nature
Genet., 24, 340–341.
24. Modrek,B., Resch,A., Grasso,C. and Lee,C. (2001) Genome-wide
detection of alternative splicing in expressed sequences of human
genes. Nucleic Acids Res., 29, 2850–2859.
25. Kochiwa,H., Suzuki,R., Washio,T., Saito,R., Bono,H., Carninci,P.,
Okazaki,Y., Miki,R., Hayashizaki,Y. and Tomita,M. (2002) Inferring
alternative splicing patterns in mouse from a full-length cDNA library
and microarray data. Genome Res., 12, 1286–1293.
26. Hide,W.A., Babenko,V.N., van Heusden,P.A., Seoighe,C. and
Kelso,J.F. (2001) The contribution of exon-skipping events on
chromosome 22 to protein coding diversity. Genome Res., 11,
1848–1853.
27. Lejeune,F. and Maquat,L.E. (2005) Mechanistic links between
nonsense-mediated mRNA decay and pre-mRNA splicing in
mammalian cells. Curr. Opin. Cell Biol., 17, 309–315.
28. Lev-Maor,G., Sorek,R., Shomron,N. and Ast,G. (2003) The birth of an
alternatively spliced exon: 3′ splice-site selection in Alu exons.
Science, 300, 1288–1291.
29. Will,C.L. and Luhrmann,R. (2005) Splicing of a rare class of
introns by the U12-dependent spliceosome. Biol. Chem., 386,
713–724.
30. Landry,J.R., Mager,D.L. and Wilhelm,B.T. (2003) Complex controls:
the role of alternative promoters in mammalian genomes. Trends
Genet., 19, 640–648.
31. Carninci,P., Kasukawa,T., Katayama,S., Gough,J., Frith,M.C.,
Maeda,N., Oyama,R., Ravasi,T., Lenhard,B., Wells,C. et al. (2005) The
transcriptional landscape of the mammalian genome. Science, 309,
1559–1563.
32. Kim,T.H., Barrera,L.O., Zheng,M., Qu,C., Singer,M.A.,
Richmond,T.A., Wu,Y., Green,R.D. and Ren,B. (2005) A
high-resolution map of active promoters in the human genome. Nature,
436, 876–880.
33. Karin,M. (1999) How NF-kappaB is activated: the role of the IkappaB
kinase (IKK) complex. Oncogene, 18, 6867–6874.
34. Peters,R.T., Liao,S.M. and Maniatis,T. (2000) IKKepsilon is part of a
novel PMA-inducible IkappaB kinase complex. Mol. Cell, 5, 513–522.
35. Xing,Y., Resch,A. and Lee,C. (2004) The multiassembly problem:
reconstructing multiple transcript isoforms from EST fragment
mixtures. Genome Res., 14, 426–441.
36. Nakao,M., Barrero,R.A., Mukai,Y., Motono,C., Suwa,M. and Nakai,K.
(2005) Large-scale analysis of human alternative protein isoforms:
pattern classification and correlation with subcellular localization
signals. Nucleic Acids Res., 33, 2355–2363.
37. Scharf,J.M., Endrizzi,M.G., Wetter,A., Huang,S., Thompson,T.G.,
Zerres,K., Dietrich,W.F., Wirth,B. and Kunkel,L.M. (1998)
Identification of a candidate modifying gene for spinal muscular
atrophy by comparative genomics. Nature Genet., 20, 83–86.
38. Faber,P.W., Barnes,G.T., Srinidhi,J., Chen,J., Gusella,J.F. and
MacDonald,M.E. (1998) Huntingtin interacts with a family of WW
domain proteins. Hum. Mol. Genet., 7, 1463–1474.
39. Magrangeas,F., Pitiot,G., Dubois,S., Bragado-Nilsson,E., Cherel,M.,
Jobert,S., Lebeau,B., Boisteau,O., Lethe,B., Mallet,J. et al. (1998)
Cotranscription and intergenic splicing of human
galactose-1-phosphate uridylyltransferase and interleukin-11 receptor
alpha-chain genes generate a fusion mRNA in normal cells. Implication
for the production of multidomain proteins during evolution. J. Biol.
Chem., 273, 16005–16010.
40. Akiva,P., Toporik,A., Edelheit,S., Peretz,Y., Diber,A., Shemesh,R.,
Novik,A. and Sorek,R. (2006) Transcription-mediated gene fusion in
the human genome. Genome Res., 16, 30–36.
41. Wiemann,S., Kokocinski,A.K. and Poustka,A. (2005) Alternative
pre-mRNA processing regulates cell-type specific expression of the
IL4l1 and NUP62 genes. BMC Biol., 3, 16.
42. Oyama,M., Itagaki,C., Hata,H., Suzuki,Y., Izumi,T., Natsume,T.,
Isobe,T. and Sugano,S. (2004) Analysis of small human proteins
reveals the translation of upstream open reading frames of mRNAs.
Genome Res., 14, 2048–2052.
43. Modrek,B. and Lee,C.J. (2003) Alternative splicing in the human,
mouse and rat genomes is associated with an increased frequency of
exon creation and/or loss. Nature Genet., 34, 177–180.
44. Yeo,G.W., Van Nostrand,E., Holste,D., Poggio,T. and Burge,C.B.
(2005) Identification and analysis of alternative splicing events
conserved in human and mouse. Proc. Natl Acad. Sci. USA, 102,
2850–2855.