H-DBAS: human-transcriptome database for alternative splicing: update 2010.
ABSTRACT H-DBAS (http://h-invitational.jp/h-dbas/) is a specialized database for human alternative splicing (AS) based on H-Invitational full-length cDNAs. In this update, for better annotations of AS events, we correlated RNA-Seq tag information to the AS exons and splice junctions. We generated a total of 148,376,598 RNA-Seq tags from RNAs extracted from cytoplasmic, nuclear and polysome fractions. Analysis of the RNA-Seq tags allowed us to identify 90,900 exons that are very likely to be used for protein synthesis. On the other hand, 254 AS junctions of human RefSeq transcripts are unique to nuclear RNA and may not have any translational consequences. We also present a new comparative genomics viewer so that users can empirically understand the evolutionary turnover of AS. With the unique experimental data closely connected with intensively curated cDNA information, H-DBAS provides a unique platform for the analysis of complex AS.
Jun-ichi Takeda1,2, Yutaka Suzuki2, Ryuichi Sakate1, Yoshiharu Sato1,
Takashi Gojobori1,3, Tadashi Imanishi1,* and Sumio Sugano2
1Integrated Database and Systems Biology Team, Biomedicinal Information Research Center, National
Institute of Advanced Industrial Science and Technology, AIST Bio-IT Research Bldg. Aomi 2-4-7, Koto-ku,
Tokyo 135-0064, 2Department of Medical Genome Sciences, Graduate School of Frontier Sciences, the
University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562 and 3Center for Information Biology
and DDBJ, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan
Received September 15, 2009; Revised October 13, 2009; Accepted October 15, 2009
Alternative splicing (AS) is a phenomenon in which a
single gene produces various functional protein isoforms.
AS is especially frequent in higher eukaryotes: at least
50% of human genes are reported to be subject to AS.
However, the biological significance of this high level
of AS, and its regulation, remain largely elusive (1,2).
To better understand AS in humans, we constructed a
human-transcriptome database for alternative splicing
(H-DBAS) in 2006, which collects information on human AS
variants from the viewpoint of the protein functions
affected by AS. H-DBAS is based on the
manually inspected and well-annotated cDNA informa-
tion collected by the H-Invitational cDNA Annotation
Project. By utilizing the annotation information and
cDNA sequence information, it was possible to identify
AS events that invoke changes in protein-coding regions,
thereby influencing protein functions (3–5). Based on the
result of intensive annotations of AS events, H-DBAS
presents thousands of AS events that may increase the
functional diversification of the human genome.
We also examined the evolutionary conservation of the
identified AS events and found that a large number of these
annotated AS events may not be evolutionarily conserved
between humans and mice. Similar results were also reported
by other groups (6). Our concern was that such events could
simply represent intrinsic transcriptional noise in the
human genome, without biological relevance. Therefore,
further extensive annotation of which AS events are likely
to be translated into proteins, and of whether such AS
events are evolutionarily conserved, is essential. Such
information will be extremely useful for prioritizing
targets for future functional characterization of AS events
and for determining the direction of validation experiments.
The latest generation of sequencers has greatly improved
the cost and speed of cDNA sequencing (7). A recent paper
reported the use of a new-generation sequencer for in-depth
identification and characterization of human AS events. The
authors generated tens of millions of shotgun RNA sequence
tags by so-called RNA-Seq analysis and analyzed the
collected tags (RNA-Seq tags) to detect the position and
usage frequency of every splice junction (8,9). In that
study, polyA+ RNA was used for RNA-Seq analysis. Several
methodological improvements have since made it possible to
apply a similar approach to RNAs from any population. In a
very recent study, we generated a total of about 150 million
RNA-Seq tag sequences
*To whom correspondence should be addressed. Tel: +81 3 3599 8800; Fax: +81 3 3599 8801; Email: firstname.lastname@example.org
Nucleic Acids Research, 2010, Vol. 38, Database issue. Published online 7 December 2009
© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
using RNAs that were separately extracted from cytoplasmic,
nuclear and polysome (translating ribosome) subcellular
fractions of DLD-1 cells, a colon cancer cell line. In this
update of H-DBAS, we incorporated these RNA-Seq data,
enabling a clear representation of which RNAs and AS
variants are identified in which subcellular fractions.
Observation of a particular AS variant in the polysome
fraction is especially important because it provides direct
evidence that the variant is translated. Also, to determine
whether an AS event is evolutionarily conserved, we
implemented a comparative genomics viewer. In this viewer,
AS events are categorized according to whether they are
transcribed from conserved genomic regions and whether the
corresponding transcripts are also identified in mice. The
updated H-DBAS, including these two expanded features,
should provide a unique and important resource for
exploring the complex world of AS.
Statistics of the new RNA-Seq datasets
By RNA-Seq analysis using an Illumina GA, we generated
46,354,139, 47,120,831 and 54,901,628 single-end 36-bp
RNA-Seq tags from the cytoplasmic, nuclear and polysome
fractions of RNAs from DLD-1 cells, respectively.
Separation of the respective subcellular fractions was
confirmed using glyceraldehyde-3-phosphate dehydrogenase, a
cytoplasmic protein, and lamin A/C, a nuclear protein, as
markers, as well as by real-time RT-PCR analysis of
sno/scaRNAs, which are nuclear RNAs (see the RNA-Seq
analysis page linked from the top page of H-DBAS for the
related experimental data; details of the experimental
procedures are also described there). The RNA-Seq tags
obtained were mapped to the reference human genome of the
UCSC genome browser (hg18) (10). To identify tags that span
splice junctions, we used Eland RNA and TopHat (version
1.0.9) (11,12) with the default options, which consider
only junctions following the ‘GT–AG’ rule and allow up to
two base mismatches. We further selected the splice
junctions that were supported by two or more RNA-Seq tags.
As a result, 201,280, 236,764 and 319,577 junctions were
represented in the RNA-Seq datasets derived from the
cytoplasmic, nuclear and polysome subcellular fractions,
respectively.
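The junction selection described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (which relied on Eland RNA and TopHat); the input layout, the function name `supported_junctions` and the example coordinates are assumptions made purely for illustration. The sketch also checks that the three per-fraction tag counts add up to the reported total.

```python
from collections import Counter

# Tag counts per subcellular fraction, as reported in the text.
TAGS_PER_FRACTION = {
    "cytoplasmic": 46_354_139,
    "nuclear": 47_120_831,
    "polysome": 54_901_628,
}


def supported_junctions(tag_hits, min_tags=2):
    """Count junction-spanning tags and keep well-supported GT-AG junctions.

    tag_hits: iterable of (chrom, intron_start, intron_end, donor, acceptor),
    one tuple per junction-spanning tag. This layout is hypothetical.
    """
    counts = Counter()
    for chrom, start, end, donor, acceptor in tag_hits:
        # Keep only introns that follow the canonical GT-AG rule.
        if donor.upper() == "GT" and acceptor.upper() == "AG":
            counts[(chrom, start, end)] += 1
    # Require at least `min_tags` supporting tags per junction.
    return {junction: n for junction, n in counts.items() if n >= min_tags}


if __name__ == "__main__":
    # Sum of the three fractions; the abstract reports 148,376,598 tags.
    print(sum(TAGS_PER_FRACTION.values()))

    # Made-up tag hits, for illustration only.
    hits = [
        ("chr11", 1000, 2000, "GT", "AG"),
        ("chr11", 1000, 2000, "GT", "AG"),
        ("chr11", 3000, 4000, "GT", "AG"),  # only one tag: dropped
        ("chr11", 5000, 6000, "GC", "AG"),  # not GT-AG: dropped
        ("chr11", 5000, 6000, "GC", "AG"),
    ]
    print(supported_junctions(hits))
```

Counting first and filtering afterwards keeps the support threshold a single parameter, so a stricter cut-off can be tried without re-reading the tags.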
The RNA-Seq tag information obtained was further correlated
with transcript information. To analyze the subset of human
AS variants, we used RefSeq transcripts (release 23) (13).
Among the total of 26,814 human RefSeq transcripts, 10,923
were annotated as mutual AS variants according to H-InvDB
(release 6.0) [see ref. (4) for further details]. In total,
81,547, 85,923 and 90,900 exons were represented by RNA-Seq
tags derived from the cytoplasmic, nuclear and polysome
fractions, respectively. In addition, 47,615, 47,260 and
51,041 splice junctions were represented by the RNA-Seq
tags in the respective fractions. Of these, 1,067, 1,021
and 1,114 junctions corresponded to mutual AS junctions,
directly suggesting that these AS events are expressed and
located in the respective subcellular locations.
Statistical analysis of tag enrichment also showed that
some AS variants were enriched in a given subcellular
location: 260, 254 and 299 AS variants were selectively
observed in the cytoplasmic, nuclear and polysome
fractions, respectively. Notably, for 178 AS variant pairs,
both variants appeared to be translated into proteins
simultaneously in DLD-1 cells. All of the above annotations
on the biological relevance of each AS event are presented
in a graphical interface, as described below.
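As a rough illustration of how fraction-selective AS junctions and co-translated variant pairs could be tabulated, the following Python sketch uses simple presence/absence logic. This is an assumption: the text refers to a statistical analysis of tag enrichment whose details are not given here, so the sketch is a simplification rather than the method used. The function names and junction identifiers are hypothetical.

```python
def fraction_specific(junctions_by_fraction):
    """Return, for each fraction, the junctions seen in that fraction only."""
    specific = {}
    for name, junctions in junctions_by_fraction.items():
        # Collect everything observed in the other fractions.
        others = set()
        for other_name, other_junctions in junctions_by_fraction.items():
            if other_name != name:
                others |= set(other_junctions)
        specific[name] = set(junctions) - others
    return specific


def pairs_both_observed(variant_pairs, observed):
    """Keep AS variant pairs whose two junctions are both in `observed`."""
    return [(a, b) for a, b in variant_pairs if a in observed and b in observed]


if __name__ == "__main__":
    # Hypothetical junction identifiers, for illustration only.
    observed = {
        "cytoplasmic": {"J1", "J2", "J3"},
        "nuclear": {"J2", "J4"},
        "polysome": {"J1", "J2", "J5"},
    }
    print(fraction_specific(observed))

    # Pairs of mutually alternative junctions (also hypothetical).
    pairs = [("J1", "J5"), ("J1", "J4"), ("J2", "J3")]
    print(pairs_both_observed(pairs, observed["polysome"]))
```

A junction found only in the nuclear set corresponds to the nucleus-restricted case discussed in the text, while a pair kept by `pairs_both_observed` for the polysome set corresponds to two variants that both appear to be translated.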
The RNA-Seq viewer can be accessed from the RNA-Seq
analysis page linked from the H-DBAS top page. On the
RNA-Seq analysis page, RNA-Seq and AS annotation
information is presented in a table, which shows the number
of corresponding RNA-Seq tags and the presumed subcellular
locations of AS events. Following a link from the table
displays details of the RNA-Seq tags supporting the
junction in the RNA-Seq viewer. In this viewer, RefSeq
transcripts and the tags located at the splice junctions
are shown. RNA-Seq tag information is further categorized
so that users can examine the tag distribution in each
subcellular location. Figure 1 illustrates RNA-Seq tag
analysis for caspase 4, an apoptosis-related cysteine
peptidase gene. In this gene, the AS junction (indicated by
a red line) was identified exclusively in the nuclear
fraction. Figure 1 also shows 35 RNA-Seq tags mapped to the
corresponding splice junctions. These results suggest that
the AS variant using the most upstream exon (using the
splice junctions marked in red) is retained in the nucleus
and is not used for protein translation in DLD-1 cells.
Table 1. Statistics of human RefSeq junctions expressed in each cellular fraction using RNA-Seq

Figure 1. Screenshot of the RNA-Seq viewer. Genomic regions of caspase 4, an apoptosis-related cysteine peptidase (CASP4), and the estimated junctions with two or more supporting RNA-Seq tags mapped to the corresponding genomic regions. (A) AS variants of RefSeq are represented. Annotated protein-coding regions and untranslated regions are indicated by green and yellow boxes, respectively. (B) Junctions estimated from the mapped RNA-Seq tags derived from cytoplasmic, nuclear and polysome cellular fractions are shown in cyan, navy and brown, respectively. If the AS junction of a RefSeq transcript is expressed in a single subcellular fraction (nuclear in this figure), it is shown in red. The gray boxes indicate the assembled exonic regions of RefSeq transcripts. (C) RNA-Seq tags that support the junction are represented. Two or more RNA-Seq tags mapped on the splice sites are shown for each subcellular fraction. The colors are the same as in (B).

Comparative genomics viewer
To distinguish AS events of clear biological significance,
it is informative to consider whether an AS event is
evolutionarily conserved. For this purpose, we newly
implemented a comparative genomics viewer that represents
the degree of evolutionary conservation of any AS event. In
this viewer, each AS variant can be examined on the
following points: (i) whether its surrounding genomic
sequence is conserved between humans and mice and
(ii) whether the corresponding AS event is also observed
in mice. Genomic sequences and alignment information were
obtained from the UCSC genome browser (hg18 and mm9 for
humans and mice, respectively) (10). For full-length cDNA
information, we used 65,158 human full-length cDNAs and
mouse full-length cDNAs from H-InvDB (5), FANTOM (14) and
the Mammalian Gene Collection (15). In total, 20,803
representative AS variants (RASVs) of human full-length
cDNAs are represented. Among the 207,399 exons of the
20,803 human RASVs, 27,567 exons were mapped to genomic
regions that had no aligned mouse genomic regions. On the
other hand, 22,396 exons were mapped to aligned genomic
regions (coverage ≥70% and identity ≥60%), but the
corresponding transcripts were not identified in mice. The
remaining 157,436 exons were mapped to conserved genomic
regions, and the corresponding transcripts were identified
in mouse full-length cDNAs. Among the 7,875 conserved RASVs
thus identified, 5,494 were equally spliced variants (ESVs)
with mouse full-length cDNAs; these are conserved between
humans and mice and are likely to have evolutionarily
conserved biological roles (Table 2). For example, as shown
in Figure 2, the phosphoinositide-3-kinase regulatory
subunit gene has several AS variants. For two of these AS
variants, the splice patterns are identical to those of
mouse full-length cDNAs. These AS variants may contribute
to the functional diversification of the gene, playing
conserved biological roles in both humans and mice. Further
details of the statistical analysis of the frequencies of
conserved AS variants in various gene groups have been
described previously (16). The comparative genomics viewer
is embedded in the main AS viewer. It can also be accessed
from the summary annotation table on the H-DBAS top page,
and users can search specifically on the comparative
genomics analysis from the Advanced search page linked from
the top page.
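The three-way exon categorization described in this section can be sketched in Python as follows. The sketch assumes that the coverage and identity cut-offs are lower bounds (at least 70% and 60%, respectively) and that an exon whose alignment falls below either cut-off is treated as having no aligned mouse region. Both points, like the function and label names, are illustrative assumptions rather than details confirmed by the text. The reported exon counts are included so that their sum can be checked against the stated total of 207,399.

```python
NO_ALIGNMENT = "no aligned mouse region"
ALIGNED_ONLY = "aligned, no mouse transcript"
CONSERVED = "aligned, mouse transcript found"

# Exon counts per category, as reported in the text.
REPORTED_EXONS = {
    NO_ALIGNMENT: 27_567,
    ALIGNED_ONLY: 22_396,
    CONSERVED: 157_436,
}


def classify_exon(coverage, identity, has_mouse_transcript,
                  min_coverage=0.70, min_identity=0.60):
    """Assign a human exon to one of the three conservation categories.

    coverage / identity: fractions in [0, 1] from the human-mouse genome
    alignment, or None when the exon has no alignment at all.
    """
    if coverage is None or identity is None:
        return NO_ALIGNMENT
    # Assumption: alignments below either cut-off count as not aligned.
    if coverage < min_coverage or identity < min_identity:
        return NO_ALIGNMENT
    return CONSERVED if has_mouse_transcript else ALIGNED_ONLY


if __name__ == "__main__":
    # The three categories should partition all exons of the human RASVs.
    print(sum(REPORTED_EXONS.values()))

    # Hypothetical exons: (coverage, identity, mouse transcript present).
    for exon in [(None, None, False), (0.95, 0.88, True), (0.80, 0.65, False)]:
        print(exon, "->", classify_exon(*exon))
```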
Figure 2. Screenshot of the comparative genomics viewer. AS variants in the phosphatidylinositol 3-kinase regulatory subunit alpha (PI3-kinase p85 subunit alpha) gene are shown for both humans and mice. The exon structures of the human AS variants and the mouse full-length cDNAs are shown in the upper and lower panels, respectively, across the human–mouse genome alignment. In this view option (Exon view), constitutively spliced introns of the transcripts are omitted. Mutually equally spliced variants in humans and mice are indicated by blue and orange arrows, respectively.

Table 2. Statistics of comparative genomics between human and mouse full-length cDNAs
RASV: representative AS variant; ESV: equally spliced variant.

We updated H-DBAS so that AS transcripts having various
types of annotation information can be represented in an
integrative manner. These types of information include
cDNA sequences; RNA-Seq tags derived from RNAs extracted
from nuclear, cytoplasmic and polysome fractions; and the
degree of evolutionary conservation of AS. By enabling the
integrative interpretation of annotation information, we
believe that H-DBAS can serve as a unique and useful
database for the future functional characterization of AS
events. In future, we aim to further enrich the diverse
annotation information connected to each AS event. For this
purpose, we aim to extend similar RNA-Seq analysis to cover
the transcriptomes of mice and other mammals. We also aim
to continue collecting RNA-Seq tags from a wider variety of
cell types cultured under different conditions, in order to
understand which AS events are transcribed in which cell
types and under what cellular conditions. The results of
such extensive analyses will be fed back into the manual
annotations in H-InvDB. With integrative transcriptome
data, we aim to provide expanded knowledge of the
biological significance of the functional diversification
of human genes realized by AS, which should add useful
molecular background to the complex human gene network
created by a limited number of genes.
The authors thank Y. Kawahara, A. Matsuya, H.
Nakaoka, T. Habara, F. Todokoro and C. Yamasaki
for their assistance in genome mapping and ORF prediction.
We also thank E. Sekimori for technical support with
RNA-Seq analysis, M. Nitta for constructing the comparative
genomics viewer, and T. Endo for technical support with
server usage. Finally, we are grateful to all those who
annotated the full-length human cDNAs at the H-Invitational
and H-Invitational 2 conferences.
Integrated database project of the Ministry of Economy,
Trade, and Industry of Japan, Ministry of Education,
Culture, Sports, Science and Technology of Japan,
National Institute of Advanced Industrial Science and
Technology (AIST) and Japan Biological Informatics
Consortium (JBIC). Funding for open access charge:
Conflict of interest statement. None declared.
1. Modrek,B. and Lee,C. (2002) A genomic view of alternative
splicing. Nat. Genet., 30, 13–19.
2. Tress,M.L., Martelli,P.L., Frankish,A., Reeves,G.A.,
Wesselink,J.J., Yeats,C., Olason,P.L., Albrecht,M., Hegyi,H.,
Giorgetti,A. et al. (2007) The implications of alternative splicing
in the ENCODE protein complement. Proc. Natl Acad. Sci. USA,
3. Imanishi,T., Itoh,T., Suzuki,Y., O’Donovan,C., Fukuchi,S.,
Koyanagi,K.O., Barrero,R.A., Tamura,T., Yamaguchi-Kabata,Y.,
Tanino,M. et al. (2004) Integrative annotation of 21,037
human genes validated by full-length cDNA clones. PLoS Biol.,
4. Takeda,J., Suzuki,Y., Nakao,M., Kuroda,T., Sugano,S.,
Gojobori,T. and Imanishi,T. (2007) H-DBAS: alternative splicing
database of completely sequenced and manually annotated
full-length cDNAs based on H-Invitational. Nucleic Acids Res.,
5. Yamasaki,C., Murakami,K., Fujii,Y., Sato,Y., Harada,E.,
Takeda,J., Taniya,T., Sakate,R., Kikugawa,S., Shimada,M. et al.
(2008) The H-Invitational database (H-InvDB), a comprehensive
annotation resource for human genes and transcripts. Nucleic
Acids Res., 36, D793–D799.
6. Modrek,B. and Lee,C.J. (2003) Alternative splicing in the human,
mouse and rat genomes is associated with an increased frequency
of exon creation and/or loss. Nat. Genet., 34, 177–180.
7. Graveley,B.R. (2008) Molecular biology: power sequencing.
Nature, 453, 1197–1198.
8. Wang,E.T., Sandberg,R., Luo,S., Khrebtukova,I., Zhang,L.,
Mayr,C., Kingsmore,S.F., Schroth,G.P. and Burge,C.B. (2008)
Alternative isoform regulation in human tissue transcriptomes.
Nature, 456, 470–476.
9. Pan,Q., Shai,O., Lee,L.J., Frey,B.J. and Blencowe,B.J. (2008)
Deep surveying of alternative splicing complexity in the human
transcriptome by high-throughput sequencing. Nat. Genet., 40,
10. Karolchik,D., Kuhn,R.M., Baertsch,R., Barber,G.P., Clawson,H.,
Diekhans,M., Giardine,B., Harte,R.A., Hinrichs,A.S., Hsu,F.
et al. (2008) The UCSC Genome Browser Database: 2008 update.
Nucleic Acids Res., 36, D773–D779.
11. Trapnell,C., Pachter,L. and Salzberg,S.L. (2009) TopHat:
discovering splice junctions with RNA-Seq. Bioinformatics, 25,
12. Langmead,B., Trapnell,C., Pop,M. and Salzberg,S.L. (2009)
Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biol., 10, R25.
13. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI
reference sequences (RefSeq): a curated non-redundant sequence
database of genomes, transcripts and proteins. Nucleic Acids Res.,
14. Carninci,P., Kasukawa,T., Katayama,S., Gough,J., Frith,M.C.,
Maeda,N., Oyama,R., Ravasi,T., Lenhard,B., Wells,C. et al.
(2005) The transcriptional landscape of the mammalian genome.
Science, 309, 1559–1563.
15. Gerhard,D.S., Wagner,L., Feingold,E.A., Shenmen,C.M.,
Grouse,L.H., Schuler,G., Klein,S.L., Old,S., Rasooly,R., Good,P.
et al. (2004) The status, quality, and expansion of the NIH
full-length cDNA project: the Mammalian Gene Collection
(MGC). Genome Res., 14, 2121–2127.
16. Takeda,J., Suzuki,Y., Sakate,R., Sato,Y., Seki,M., Irie,T.,
Takeuchi,N., Ueda,T., Nakao,M., Sugano,S. et al. (2008) Low
conservation and species-specific evolution of alternative splicing
in humans and mice: comparative genomics analysis using
well-annotated full-length cDNAs. Nucleic Acids Res., 36,