Integrative annotation of 21,037 human genes validated by full-length cDNA clones.
Tadashi Imanishi, Takeshi Itoh, Yutaka Suzuki, Claire O'Donovan, Satoshi Fukuchi, Kanako O Koyanagi, Roberto A Barrero, Takuro Tamura, Yumi Yamaguchi-Kabata, Motohiko Tanino, Kei Yura, Satoru Miyazaki, Kazuho Ikeo, Keiichi Homma, Arek Kasprzyk, Tetsuo Nishikawa, Mika Hirakawa, Jean Thierry-Mieg, Danielle Thierry-Mieg, Jennifer Ashurst, Libin Jia, Mitsuteru Nakao, Michael A Thomas, Nicola Mulder, Youla Karavidopoulou, Lihua Jin, Sangsoo Kim, Tomohiro Yasuda, Boris Lenhard, Eric Eveno, Yoshiyuki Suzuki, Chisato Yamasaki, Jun-ichi Takeda, Craig Gough, Phillip Hilton, Yasuyuki Fujii, Hiroaki Sakai, Susumu Tanaka, Clara Amid, Matthew Bellgard, Maria de Fatima Bonaldo, Hidemasa Bono, Susan K Bromberg, Anthony J Brookes, Elspeth Bruford, Piero Carninci, Claude Chelala, Christine Couillault, Sandro J de Souza, Marie-Anne Debily, Marie-Dominique Devignes, Inna Dubchak, Toshinori Endo, Anne Estreicher, Eduardo Eyras, Kaoru Fukami-Kobayashi, Gopal R Gopinath, Esther Graudens, Yoonsoo Hahn, Michael Han, Ze-Guang Han, Kousuke Hanada, Hideki Hanaoka, Erimi Harada, Katsuyuki Hashimoto, Ursula Hinz, Momoki Hirai, Teruyoshi Hishiki, Ian Hopkinson, Sandrine Imbeaud, Hidetoshi Inoko, Alexander Kanapin, Yayoi Kaneko, Takeya Kasukawa, Janet Kelso, Paul Kersey, Reiko Kikuno, Kouichi Kimura, Bernhard Korn, Vladimir Kuryshev, Izabela Makalowska, Takashi Makino, Shuhei Mano, Regine Mariage-Samson, Jun Mashima, Hideo Matsuda, Hans-Werner Mewes, Shinsei Minoshima, Keiichi Nagai, Hideki Nagasaki, Naoki Nagata, Rajni Nigam, Osamu Ogasawara, Osamu Ohara, Masafumi Ohtsubo, Norihiro Okada, Toshihisa Okido, Satoshi Oota, Motonori Ota, Toshio Ota, Tetsuji Otsuki, Dominique Piatier-Tonneau, Annemarie Poustka, Shuang-Xi Ren, Naruya Saitou, Katsunaga Sakai, Shigetaka Sakamoto, Ryuichi Sakate, Ingo Schupp, Florence Servant, Stephen Sherry, Rie Shiba, Nobuyoshi Shimizu, Mary Shimoyama, Andrew J Simpson, Bento Soares, Charles Steward, Makiko Suwa, Mami Suzuki, Aiko Takahashi, Gen Tamiya, Hiroshi Tanaka, 
Todd Taylor, Joseph D Terwilliger, Per Unneberg, Vamsi Veeramachaneni, Shinya Watanabe, Laurens Wilming, Norikazu Yasuda, Hyang-Sook Yoo, Marvin Stodolsky, Wojciech Makalowski, Mitiko Go, Kenta Nakai, Toshihisa Takagi, Minoru Kanehisa, Yoshiyuki Sakaki, John Quackenbush, Yasushi Okazaki, Yoshihide Hayashizaki, Winston Hide, Ranajit Chakraborty, Ken Nishikawa, Hideaki Sugawara, Yoshio Tateno, Zhu Chen, Michio Oishi, Peter Tonellato, Rolf Apweiler, Kousaku Okubo, Lukas Wagner, Stefan Wiemann, Robert L Strausberg, Takao Isogai, Charles Auffray, Nobuo Nomura, Takashi Gojobori, Sumio Sugano
ABSTRACT The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.
Integrative Annotation of 21,037 Human Genes
Validated by Full-Length cDNA Clones
Tadashi Imanishi1, Takeshi Itoh1,2, Yutaka Suzuki3,68, Claire O’Donovan4, Satoshi Fukuchi5, Kanako O. Koyanagi6,
Roberto A. Barrero5, Takuro Tamura7,8, Yumi Yamaguchi-Kabata1, Motohiko Tanino1,7, Kei Yura9, Satoru Miyazaki5,
Kazuho Ikeo5, Keiichi Homma5, Arek Kasprzyk4, Tetsuo Nishikawa10,11, Mika Hirakawa12, Jean Thierry-Mieg13,14,
Danielle Thierry-Mieg13,14, Jennifer Ashurst15, Libin Jia16, Mitsuteru Nakao3, Michael A. Thomas17, Nicola Mulder4,
Youla Karavidopoulou4, Lihua Jin5, Sangsoo Kim18, Tomohiro Yasuda11, Boris Lenhard19, Eric Eveno20,21, Yoshiyuki
Suzuki5, Chisato Yamasaki1, Jun-ichi Takeda1, Craig Gough1,7, Phillip Hilton1,7, Yasuyuki Fujii1,7, Hiroaki Sakai1,7,22,
Susumu Tanaka1,7, Clara Amid23, Matthew Bellgard24, Maria de Fatima Bonaldo25, Hidemasa Bono26, Susan K.
Bromberg27, Anthony J. Brookes19, Elspeth Bruford28, Piero Carninci29, Claude Chelala20, Christine Couillault20,21,
Sandro J. de Souza30, Marie-Anne Debily20, Marie-Dominique Devignes31, Inna Dubchak32, Toshinori Endo33, Anne
Estreicher34, Eduardo Eyras15, Kaoru Fukami-Kobayashi35, Gopal R. Gopinath36, Esther Graudens20,21, Yoonsoo Hahn18,
Michael Han23, Ze-Guang Han21,37, Kousuke Hanada5, Hideki Hanaoka1, Erimi Harada1,7, Katsuyuki Hashimoto38, Ursula
Hinz34, Momoki Hirai39, Teruyoshi Hishiki40, Ian Hopkinson41,42, Sandrine Imbeaud20,21, Hidetoshi Inoko1,7,43,
Alexander Kanapin4, Yayoi Kaneko1,7, Takeya Kasukawa26, Janet Kelso44, Paul Kersey4, Reiko Kikuno45, Kouichi
Kimura11, Bernhard Korn46, Vladimir Kuryshev47, Izabela Makalowska48, Takashi Makino5, Shuhei Mano43, Regine
Mariage-Samson20, Jun Mashima5, Hideo Matsuda49, Hans-Werner Mewes23, Shinsei Minoshima50,52, Keiichi Nagai11,
Hideki Nagasaki51, Naoki Nagata1, Rajni Nigam27, Osamu Ogasawara3, Osamu Ohara45, Masafumi Ohtsubo52, Norihiro
Okada53, Toshihisa Okido5, Satoshi Oota35, Motonori Ota54, Toshio Ota22, Tetsuji Otsuki55, Dominique Piatier-
Tonneau20, Annemarie Poustka47, Shuang-Xi Ren21,37, Naruya Saitou56, Katsunaga Sakai5, Shigetaka Sakamoto5,
Ryuichi Sakate39, Ingo Schupp47, Florence Servant4, Stephen Sherry13, Rie Shiba1,7, Nobuyoshi Shimizu52, Mary
Shimoyama27, Andrew J. Simpson30, Bento Soares25, Charles Steward15, Makiko Suwa51, Mami Suzuki5, Aiko
Takahashi1,7, Gen Tamiya1,7,43, Hiroshi Tanaka33, Todd Taylor57, Joseph D. Terwilliger58, Per Unneberg59, Vamsi
Veeramachaneni48, Shinya Watanabe3, Laurens Wilming15, Norikazu Yasuda1,7, Hyang-Sook Yoo18, Marvin Stodolsky60,
Wojciech Makalowski48, Mitiko Go61, Kenta Nakai3, Toshihisa Takagi3, Minoru Kanehisa12, Yoshiyuki Sakaki3,57, John
Quackenbush62, Yasushi Okazaki26, Yoshihide Hayashizaki26, Winston Hide44, Ranajit Chakraborty63, Ken Nishikawa5,
Hideaki Sugawara5, Yoshio Tateno5, Zhu Chen21,37,64, Michio Oishi45, Peter Tonellato65, Rolf Apweiler4, Kousaku
Okubo5,40, Lukas Wagner13, Stefan Wiemann47, Robert L. Strausberg16, Takao Isogai10,66, Charles Auffray20,21, Nobuo
Nomura40, Takashi Gojobori1,5,67*, Sumio Sugano3,40,68
1 Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, 2 Bioinformatics
Laboratory, Genome Research Department, National Institute of Agrobiological Sciences, Ibaraki, Japan, 3 Human Genome Center, The Institute of Medical Science, The
University of Tokyo, Tokyo, Japan, 4 EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, United Kingdom, 5 Center for
Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Shizuoka, Japan, 6 Nara Institute of Science and Technology, Nara, Japan, 7 Integrated
Database Group, Japan Biological Information Research Center, Japan Biological Informatics Consortium, Tokyo, Japan, 8 BITS Company, Shizuoka, Japan, 9 Quantum
Bioinformatics Group, Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, Kyoto, Japan, 10 Reverse Proteomics
Research Institute, Chiba, Japan, 11 Central Research Laboratory, Hitachi, Tokyo, Japan, 12 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto,
Japan, 13 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America, 14 Centre
National de la Recherche Scientifique (CNRS), Laboratoire de Physique Mathematique, Montpellier, France, 15 The Wellcome Trust Sanger Institute, Wellcome Trust Genome
Campus, Cambridge, United Kingdom, 16 National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America, 17 Department of Biological
Sciences, Idaho State University, Pocatello, Idaho, United States of America, 18 Korea Research Institute of Bioscience and Biotechnology, Taejeon, Korea, 19 Center for
Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden, 20 Genexpress—CNRS—Functional Genomics and Systemic Biology for Health, Villejuif Cedex,
France, 21 Sino-French Laboratory in Life Sciences and Genomics, Shanghai, China, 22 Tokyo Research Laboratories, Kyowa Hakko Kogyo Company, Tokyo, Japan,
23 MIPS—Institute for Bioinformatics, GSF—National Research Center for Environment and Health, Neuherberg, Germany, 24 Centre for Bioinformatics and Biological
Computing, School of Information Technology, Murdoch University, Murdoch, Western Australia, Australia, 25 Medical Education and Biomedical Research Facility, University
of Iowa, Iowa City, Iowa, United States of America, 26 Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN Yokohama Institute, Kanagawa, Japan,
27 Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America, 28 HUGO Gene Nomenclature Committee, University College London, London, United
Kingdom, 29 Genome Science Laboratory, RIKEN, Saitama, Japan, 30 Ludwig Institute of Cancer Research, Sao Paulo, Brazil, 31 CNRS, Vandoeuvre les Nancy, France,
32 Lawrence Berkeley National Laboratory, Berkeley, California, United States of America, 33 Department of Bioinformatics, Medical Research Institute, Tokyo Medical and
Dental University, Tokyo, Japan, 34 Swiss Institute of Bioinformatics, Geneva, Switzerland, 35 Bioresource Information Division, RIKEN BioResource Center, RIKEN Tsukuba
Institute, Ibaraki, Japan, 36 Genome Knowledgebase, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America, 37 Chinese National Human
Genome Center at Shanghai, Shanghai, China, 38 Division of Genetic Resources, National Institute of Infectious Diseases, Tokyo, Japan, 39 Graduate School of Frontier
Sciences, Department of Integrated Biosciences, University of Tokyo, Chiba, Japan, 40 Functional Genomics Group, Biological Information Research Center, National Institute
PLoS Biology | http://biology.plosjournals.org | June 2004 | Volume 2 | Issue 6 | Page 0856
of Advanced Industrial Science and Technology, Tokyo, Japan, 41 Department of Primary Care and Population Sciences, Royal Free University College Medical School,
University College London, London, United Kingdom, 42 Clinical and Molecular Genetics Unit, The Institute of Child Health, London, United Kingdom, 43 Department of
Genetic Information, Division of Molecular Life Science, School of Medicine, Tokai University, Kanagawa, Japan, 44 South African National Bioinformatics Institute, University
of the Western Cape, Bellville, South Africa, 45 Kazusa DNA Research Institute, Chiba, Japan, 46 RZPD Resource Center for Genome Research, Heidelberg, Germany,
47 Molecular Genome Analysis, German Cancer Research Center-DKFZ, Heidelberg, Germany, 48 Pennsylvania State University, University Park, Pennsylvania, United States
of America, 49 Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, Osaka, Japan, 50 Medical Photobiology
Department, Photon Medical Research Center, Hamamatsu University School of Medicine, Shizuoka, Japan, 51 Computational Biology Research Center, National Institute of
Advanced Industrial Science and Technology, Tokyo, Japan, 52 Department of Molecular Biology, Keio University School of Medicine, Tokyo, Japan, 53 Department of
Biological Sciences, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Kanagawa, Japan, 54 Global Scientific Information and Computing
Center, Tokyo Institute of Technology, Tokyo, Japan, 55 Molecular Biology Laboratory, Medicinal Research Laboratories, Taisho Pharmaceutical Company, Saitama, Japan,
56 Department of Population Genetics, National Institute of Genetics, Shizuoka, Japan, 57 Human Genome Research Group, Genomic Sciences Center, RIKEN Yokohama
Institute, Kanagawa, Japan, 58 Columbia University and Columbia Genome Center, New York, New York, United States of America, 59 Department of Biotechnology, Royal
Institute of Technology, Stockholm, Sweden, 60 Biology Division and Genome Task Group, Office of Biological and Environmental Research, United States Department of
Energy, Washington, D.C., United States of America, 61 Faculty of Bio-Science, Nagahama Institute of Bio-Science and Technology, Shiga, Japan, 62 Institute for Genomic
Research, Rockville, Maryland, United States of America, 63 Center for Genome Information, Department of Environmental Health, University of Cincinnati, Cincinnati, Ohio,
United States of America, 64 State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Rui-Jin Hospital, Shanghai Second Medical University, Shanghai,
China, 65 PointOne Systems, Wauwatosa, Wisconsin, United States of America, 66 Graduate School of Life and Environmental Sciences, University of Tsukuba, Ibaraki, Japan,
67 Department of Genetics, Graduate University for Advanced Studies, Shizuoka, Japan, 68 Department of Medical Genome Sciences, Graduate School of Frontier Sciences,
University of Tokyo, Tokyo, Japan
Received December 19, 2003; Accepted April 1, 2004; Published April 20, 2004
DOI: 10.1371/journal.pbio.0020162
Copyright: © 2004 Imanishi et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.
Abbreviations: 3D, three-dimensional; AS, alternative splicing; CAI, codon
adaptation index; dbSNP, Single Nucleotide Polymorphism Database; DDBJ, DNA
Data Bank of Japan; EC, Enzyme Commission; EMBL, European Molecular Biology
Laboratories; EST, expressed sequence tag; FANTOM, Functional Annotation of
Mouse; FLcDNA, full-length cDNA; FLJ, Full-Length Long Japan; FTHFD,
formyltetrahydrofolate dehydrogenase; GO, Gene Ontology; GTOP, Genomes TO Protein
structures and functions database; H-Angel, Human Anatomic Gene Expression
Library; H-Inv or H-Invitational, Human Full-Length cDNA Annotation Invitational;
H-InvDB, H-Invitational Database; iAFLP, introduced amplified fragment length
polymorphism; NCBI, National Center for Biotechnology Information; ncRNAs,
non-protein-coding RNAs; OMIM, Online Mendelian Inheritance in Man; ORF, open
reading frame; PDB, Protein Data Bank; RefSeq, Reference Sequence Collection;
SMO, Similarity, Motif, and ORF; SNP, single nucleotide polymorphism
Academic Editor: Richard Roberts, New England Biolabs
*To whom correspondence should be addressed. E-mail: tgojobor@genes.nig.ac.jp
Introduction
The draft sequences of the human, mouse, and rat genomes
are already available (Lander et al. 2001; Marshall 2001;
Venter et al. 2001; Waterston et al. 2002). The next challenge
comes in the understanding of basic human molecular
biology through interpretation of the human genome. To
display biological data optimally we must first characterize
the genome in terms of not only its structure but also
function and diversity. It is of immediate interest to identify
factors involved in the developmental process of organisms,
non-protein-coding functional RNAs, the regulatory network
of gene expression within tissues and its governance over
states of health, and protein–gene and protein–protein
interactions. In doing so, we must integrate this information
in an easily accessible and intuitive format. The human
genome may encode only 30,000 to 40,000 genes (Lander et al.
2001; Venter et al. 2001), suggesting that complex interdependent
gene regulation mechanisms exist to account for the
complex gene networks that differentiate humans from
lower-order organisms. In organisms with small genomes, it
is relatively straightforward to use direct computational
prediction based upon genomic sequence to identify most
genes by their long open reading frames (ORFs). However,
computational gene prediction from the genomic sequence
of organisms with short exons and long introns can be
somewhat error-prone (Ashburner 2000; Reese et al. 2000;
Lander et al. 2001).
Previous efforts to catalogue the human transcriptome
were based on expressed sequence tags (ESTs) used for the
identification of new genes (Adams et al. 1991; Auffray et al.
1995; Houlgatte et al. 1995), chromosomal assignment of
genes (Gieser and Swaroop 1992; Khan et al. 1992; Camargo
et al. 2001), prediction of genes (Nomura et al. 1994), and
assessment of gene expression (Okubo et al. 1992). Recently,
Camargo et al. (2001) generated a large collection of ORF
ESTs, and Saha et al. (2002) conducted a large-scale serial
analysis of gene expression patterns to identify novel human
genes. The availability of human full-length transcripts from
many large-scale sequencing projects (Nomura et al. 1994;
Nagase et al. 2001; Wiemann et al. 2001; Yudate 2001; Kikuno
et al. 2002; Strausberg et al. 2002) has provided a unique
opportunity for the comprehensive evaluation of the human
transcriptome through the annotation of a variety of RNA
transcripts. Protein-coding and non-protein-coding sequences,
alternative splicing (AS) variants, and sense–antisense
RNA pairs could all be functionally identified. We thus
designed an international collaborative project to establish
an integrative annotation database of 41,118 human full-
length cDNAs (FLcDNAs). These cDNAs were collected from
six high-throughput sequencing projects and evaluated at the
first international jamboree, entitled the Human Full-length
cDNA Annotation Invitational (H-Invitational or H-Inv)
(Cyranoski 2002). This event was held in Tokyo, Japan, and
took place from August 25 to September 3, 2002.
Related annotation efforts include the Functional Annotation of Mouse
(FANTOM) project (Kawai et al. 2001; Bono et al. 2002;
Okazaki et al. 2002), FlyBase (GOC 2001), and the RIKEN
Arabidopsis full-length cDNA project (Seki et al. 2002). In our
own project, considerable effort went into every level of the work:
not only the annotation of the cDNAs but also the ways in which
the data can be viewed and queried. These aspects, along with the
application of our results to disease research, distinguish
our project from similar projects.
This manuscript provides the first report by the H-Inv
consortium, showing some of the discoveries made so far and
introducing our new database of the human transcriptome. It
is hoped that this will be the first in a long line of publications
announcing discoveries made by the H-Inv consortium. Here
we describe results from our integrative annotation in four
major areas: mapping the transcriptome onto the human
genome, functional annotation, polymorphism in the transcriptome,
and evolution of the human transcriptome. We
then introduce our new database of the human transcriptome,
the H-Invitational Database (H-InvDB; http://www.h-invitational.jp),
which stores all annotation results by the
consortium. Free and unrestricted access to the H-Inv
annotation work is available through the database. Finally,
we summarize our most important findings thus far in the H-
Inv project in Concluding Remarks.
Results/Discussion
Mapping the Transcriptome onto the Human Genome
Construction of the nonredundant human FLcDNA database.
We present the first experimentally validated nonredundant
transcriptome of human FLcDNAs produced by
six high-throughput cDNA sequencing projects (Ota et al.
1997, 2004; Strausberg et al. 1999; Hu et al. 2000; Wiemann et
al. 2001; Yudate 2001; Kikuno et al. 2002) as of July 15, 2002.
The dataset consists of 41,118 cDNAs (H-Inv cDNAs) that
were derived from 184 diverse cell types and tissues (see
Dataset S1). The number of clones, the number of libraries,
major tissue origins, methods, and URLs of cDNA clones for
each cDNA project are summarized in Table 1. H-Inv cDNAs
include 8,324 cDNAs recently identified by the Full-Length
Long Japan (FLJ) project. The FLJ clones represent about half
of the H-Inv cDNAs (Table 1). The policies for library
selection and the results of initial analysis of the constituent
projects were reported by the participants themselves: the
Chinese National Human Genome Center (CHGC) (Hu et al.
2000), the Deutsches Krebsforschungszentrum (DKFZ/MIPS)
(Wiemann et al. 2001), the Institute of Medical Science at the
University of Tokyo (IMSUT) (Suzuki et al. 1997; Ota et al.
2004), the Kazusa cDNA sequence project of the Kazusa DNA
Research Institute (KDRI) (Hirosawa et al. 1999; Nagase et al.
1999; Suyama et al. 1999; Kikuno et al. 2002), the Helix
Research Institute (HRI) (Yudate et al. 2001), and the
Mammalian Gene Collection (MGC) (Strausberg et al. 1999;
Moonen et al. 2002), as well as FLJ mentioned earlier (Ota et
al. 2004). The variation in tissue origins for library construction
among these six groups resulted in rare occurrences
of sequence redundancy among the collections. In a recent
study, the FLJ project has described the complete sequencing
and characterization of 21,243 human cDNAs (Ota et al.
2004). On the other hand, the H-Inv project characterized
cDNAs from this project and six high-throughput cDNA
producers by using a different suite of computational analysis
techniques and an alternative system of functional annotation.
The 41,118 H-Inv cDNAs were mapped onto the human
genome, and 40,140 were considered successfully aligned. A
cDNA was counted as aligned only if it had at least
95% identity and 90% length coverage against the
genome (Figure 1). The mean identity of all the alignments
between 40,140 mapped cDNAs and genomic sequences was
99.6%, and the mean coverage against the genomic sequence
was 99.6%. In some cases, terminal exons were aligned with
low identity or low coverage. For example, 89% of internal
exons have identity of 99.8% or higher, while only 78% and
50% of the first and last exons do, respectively. These
alignments with low identity or low coverage seemed to be
caused by the unsuccessful alignments of the repetitive
sequences found in UTR regions and the misalignments of
39 terminal poly-A sequences. Although better alignments
could be obtained for these sequences by improving the
mapping procedure, we concluded that the quality of the
FLcDNAs was high overall.
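The alignment criterion described above can be sketched as a simple filter. This is a minimal illustration and not the mapping pipeline used in the project: the `Alignment` record, its field names, and the decision to treat both thresholds as inclusive are assumptions made here; only the 95% identity and 90% coverage values come from the text.

```python
from dataclasses import dataclass

MIN_IDENTITY = 95.0  # percent identity threshold stated in the text
MIN_COVERAGE = 90.0  # percent length coverage threshold stated in the text


@dataclass
class Alignment:
    """Hypothetical summary of one cDNA-to-genome alignment."""
    cdna_id: str
    identity: float  # percent of aligned bases that match the genome
    coverage: float  # percent of the cDNA length that is aligned


def is_mapped(aln: Alignment) -> bool:
    """Return True when both thresholds are met (inclusive bounds assumed)."""
    return aln.identity >= MIN_IDENTITY and aln.coverage >= MIN_COVERAGE


def split_by_mapping(alignments):
    """Partition alignments into (mapped, unmapped) lists."""
    mapped = [a for a in alignments if is_mapped(a)]
    unmapped = [a for a in alignments if not is_mapped(a)]
    return mapped, unmapped
```

Under this sketch, a cDNA aligned at 96% identity but only 80% coverage would fall into the unmapped group and be passed on to cDNA-based clustering.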
Due to redundancy and AS within the human tran-
scriptome, these 40,140 cDNAs were clustered to 20,190 loci
PLoS Biology | http://biology.plosjournals.orgJune 2004 | Volume 2 | Issue 6 | Page 0858
Integrative Annotation of Human Genes
Page 4
(H-Inv loci). For the remaining 978 unmapped cDNAs, we
conducted cDNA-based clustering, which yielded 847 clusters.
The clusters created had an average of 2.0 cDNAs per locus
(Table 2). The average was only 1.2 for unmapped clusters,
probably because many of these genes are encoded by
heterochromatic regions of the human genome and show
limited levels of gene expression. The gene density for each
chromosome varied from 0.6 to 19.0 genes/Mb, with an
average of 6.5 genes/Mb. This distribution of genes over the
genome is far from random. This biased gene localization
concurs with the gene density on chromosomes found in
similar previous reports (Lander et al. 2001; Venter et al.
2001). This indicates that the sampled cDNAs are unbiased
with respect to chromosomal location. Most cDNAs were
mapped only at a single position on the human genome.
However, 1,682 cDNAs could be mapped at multiple positions
(with mean values of 98.2% identity and 98.1% coverage).
The multiple matching may be caused by either recent gene
duplication events or artificial duplication of the human
genome caused by misassembled contigs. In our study we have
selected only the “best” loci for the cDNAs (see Materials and
Methods for details).
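The idea of grouping mapped cDNAs into loci by genome position can be illustrated as follows. The actual clustering rules are given in Materials and Methods; the sketch below simply assumes that cDNAs belong to the same locus when their genomic spans overlap on the same chromosome and strand, and the tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def cluster_into_loci(mappings):
    """Group cDNAs whose genomic spans overlap on the same chromosome and strand.

    mappings: iterable of (cdna_id, chrom, strand, start, end) tuples
    with start <= end. Returns a list of loci, each a list of cDNA IDs.
    """
    by_region = defaultdict(list)
    for cdna_id, chrom, strand, start, end in mappings:
        by_region[(chrom, strand)].append((start, end, cdna_id))

    loci = []
    for key in sorted(by_region):
        current, current_end = [], None
        # Sweep left to right; a gap between spans starts a new locus.
        for start, end, cdna_id in sorted(by_region[key]):
            if current and start > current_end:
                loci.append(current)
                current, current_end = [], None
            current.append(cdna_id)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            loci.append(current)
    return loci
```

A cDNA that maps at several positions would need to be reduced to its single best position before this step, as described above.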
In total, 21,037 clusters (20,190 mapped and 847
unmapped) were identified and entered into the H-InvDB.
We assigned H-Inv cluster IDs (e.g., HIX0000001) to the
clusters and H-Inv cDNA IDs (e.g., HIT000000001) to all
curated cDNAs. A representative sequence was selected from
each cluster and used for further analyses and annotation.
Comparison of the mapped H-Inv cDNAs with other
annotated datasets. In order to evaluate the H-Inv dataset,
we compared all of the mapped H-Inv cDNAs with the
Reference Sequence Collection (RefSeq) mRNA database
(Pruitt and Maglott 2001) (Figure 2). The RefSeq mRNA
database consists of two types of datasets. These are the
curated mRNAs (accession prefix NM and NR) and the model
mRNAs that are provided through automated processing of
the genome annotation (accession prefix XM and XR).
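The two RefSeq dataset types can be told apart from the accession prefix alone. The helper below is a small sketch of that rule; the function name and the "other" fallback label are assumptions introduced here.

```python
CURATED_PREFIXES = ("NM_", "NR_")  # curated mRNAs and non-coding RNAs
MODEL_PREFIXES = ("XM_", "XR_")    # models from automated genome annotation


def refseq_class(accession: str) -> str:
    """Classify a RefSeq accession as 'curated', 'model', or 'other'."""
    acc = accession.strip().upper()
    if acc.startswith(CURATED_PREFIXES):
        return "curated"
    if acc.startswith(MODEL_PREFIXES):
        return "model"
    return "other"
```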
From the comparison, we found that 5,155 (26%) of the H-Inv
loci had no counterparts and were unique to H-Inv.
All of these 5,155 loci are candidates for new human genes,
although non-protein-coding RNAs (ncRNAs) (25%),
hypothetical proteins with ORFs less than 150 amino acids (55%),
and singletons (91%) were enriched in this category. In fact,
1,340 of these H-Inv-unique loci were questionable and
require validation by further experiments because they
consist of only single exons, and the 3' termini of these loci
align with genomic poly-A sequences. This feature suggests
internal poly-A priming, although some occurrences might be
bona fide genes. The most reliable set of newly identified
human genes in our dataset is composed of 1,054
protein-coding and 179 non-protein-coding genes that have multiple
exons. Therefore, at least 6.1% (1,233/20,190) of the H-Inv
loci could be used to newly validate loci that the RefSeq
datasets do not presently cover. These genes are possibly less
expressed since the proportion of singletons (H-Inv loci
consisting of a single H-Inv cDNA) was high (84%).

Table 1. Summary of cDNA Resources
cDNA Sequence Provider* | Number of cDNAs (Without Redundancy) | Number of Library Origins | Major Tissue Library Origins | Method | URL | Reference
CHGC | 758 (754) | 30 | Adrenal gland, hypothalamus, CD34+ stem cell | Selecting FLcDNA clones from EST libraries | http://www.chgc.sh.cn/ | Hu et al. 2000
DKFZ/MIPS | 5,555 (5,521) | 14 | Testis, brain, lymph node | Selecting FLcDNA clones from 5'- and 3'-EST libraries | http://mips.gsf.de/projects/cdna | Wiemann et al. 2001
FLJ/HRI | 8,066 (8,057) | 46 | Teratocarcinoma, placenta, whole embryo | Oligo-capping method and selection by one-pass sequences | http://www.hri.co.jp/HUNT/ | Ota et al. 1997, 2004; Yudate et al. 2001
FLJ/IMSUT | 12,585 (12,560) | 81 | Brain, testis, bone marrow | Oligo-capping method and selection by one-pass sequences | http://cdna.ims.u-tokyo.ac.jp/ | Suzuki et al. 1997; Ota et al. 2004
FLJ/KDRI | 348 (342) | 1 | Spleen | Selection by one-pass sequences | http://www.kazusa.or.jp/NEDO/ | Ota et al. 2004
KDRI | 2,000 (2,000) | 9 | Brain | In vitro protein synthesis and selection by one-pass sequences | http://www.kazusa.or.jp/huge/ | Hirosawa et al. 1997; Nagase et al. 1999; Suyama et al. 1999; Kikuno et al. 2002
MGC/NIH | 11,806 (11,414) | 69 | Placenta, lung, skin | Selecting gene candidates from 5'-EST libraries | http://mgc.nci.nih.gov/ | Strausberg et al. 1999
*FLcDNA data were provided for the H-Inv project by the FLJ project of NEDO (URL: http://www.nedo.go.jp/bio-e/) and six high-throughput cDNA clone producers: the Chinese
National Human Genome Center (CHGC), the Deutsches Krebsforschungszentrum (DKFZ/MIPS), Helix Research Institute (HRI), the Institute of Medical Science in the
University of Tokyo (IMSUT), the Kazusa DNA Research Institute (KDRI), and the Mammalian Gene Collection (MGC/NIH).
DOI: 10.1371/journal.pbio.0020162.t001
On the other hand, 78% (11,974/15,439) of the curated
RefSeq mRNAs were covered by the H-Inv cDNAs. These
figures suggest that further extensive sequencing of FLcDNA
clones will be required in order to cover the entire human
gene set. Nonetheless, this effort provides a systematic
approach using the H-Inv cDNAs, even though a portion of
the cDNAs have already been utilized in the RefSeq datasets.
It is noteworthy that H-Inv cDNAs overlapped 3,061 (17%)
of RefSeq model mRNAs, supporting this proportion of the
hypothetical RefSeq sequences. These newly confirmed 3,061
loci have a mean number of exons greater than RefSeq model
mRNAs that were not confirmed, but smaller than RefSeq
curated mRNAs. The overlap between H-Inv cDNAs and
RefSeq model mRNAs was smaller than that between H-Inv
cDNAs and RefSeq curated mRNAs. This suggests that the
genes predicted from genome annotation may tend to be less
expressed than RefSeq curated genes, or that some may be
artifacts. All these results highlight the great importance of
comprehensive collections of analyzed FLcDNAs for
validating gene prediction from genome sequences. This may be
especially true for higher organisms such as humans.

Table 2. The Clustering Results of Human FLcDNAs onto the Human Genome
Chromosome   Number of Loci   Number of cDNAs   Number of cDNAs/Locus   Number of Loci/Mb
1            1,998            4,057             2.0                     8.1
2            1,408            2,791             2.0                     5.8
3            1,224            2,455             2.0                     6.1
4            809              1,527             1.9                     4.2
5            920              1,851             2.0                     5.1
6            1,027            1,912             1.9                     6.0
7            1,008            1,994             2.0                     6.4
8            761              1,448             1.9                     5.2
9            817              1,630             2.0                     6.0
10           863              1,705             2.0                     6.4
11           1,116            2,245             2.0                     8.3
12           1,014            2,071             2.0                     7.7
13           394              743               1.9                     3.5
14           626              1,363             2.2                     5.9
15           693              1,415             2.0                     6.9
16           865              1,851             2.1                     9.6
17           1,110            2,245             2.0                     13.6
18           334              593               1.8                     4.4
19           1,210            2,378             2.0                     19.0
20           536              1,124             2.1                     8.4
21           197              379               1.9                     4.2
22           480              985               2.1                     9.7
X            646              1,173             1.8                     4.2
Y            29               32                1.1                     0.6
UN(a)        105              173               1.6                     –
Unmapped     847              978               1.2                     –
Total        21,037           41,118            2.0                     –
(a) UN represents contigs that were not mapped onto any chromosome.
DOI: 10.1371/journal.pbio.0020162.t002

Figure 1. Procedure for Mapping and Clustering the H-Inv cDNAs
The cDNAs were mapped to the genome and clustered into loci. The
remaining unmapped cDNAs were clustered based upon the
grouping of significantly similar cDNAs.
DOI: 10.1371/journal.pbio.0020162.g001
Incomplete parts of the human genome sequences. The
existence of 978 unmapped cDNAs (847 clusters) suggests that
the human genome sequence (National Center for Biotechnology
Information [NCBI] build 34 assembly) is not yet
complete. The evidence supporting this statement is twofold.
First, most of those unmapped cDNAs could be partially
mapped to the human genome. Using BLAST, 906 of the
unmapped cDNAs (corresponding to 786 clusters) showed at
least one sequence match to the human genome with a bit
score higher than 100. Second, most of the cDNAs could be
mapped unambiguously to the mouse genome sequences. A
total of 907 unmapped cDNAs (779 clusters; 92%) could be
mapped to the mouse genome with coverage of 90% or
higher. If we adopted less stringent requirements, more
cDNAs could be mapped to the mouse genome. The rest
might be less conserved genes, genes in unfinished sections of
the mouse genome, or genes that were lost in the mouse
genome. Based on these observations, we conclude that the
human genome sequence is not yet complete, leaving some
portions to be sequenced or reassembled.
The proportion of the genome that is incomplete is
estimated to be 3.7%–4.0%. The figure of 4.0% is based
upon the proportion of H-Inv cDNA clusters that could not
be mapped to the genome (847/21,037), while the 3.7%
estimate is based on both H-Inv cDNAs and RefSeq sequences
(only NMs). This statistic indicates that a minimum of one out
of every 25–27 clusters appears to be unrepresented in the
current human genome dataset, in its full form. Possible
reasons for this include unsequenced regions on the human
genome and regions where an error may have occurred
during sequence assembly. If this is the case, this lends
support to the use of cDNA mapping to facilitate the
completion of whole genome sequences (Kent and Haussler
2001). For example, we can predict the arrangement of
contigs based on the order of mapped exons. In addition we
can use the sequences of unmapped exons to search for those
clones that contain unsequenced parts of the genome. The
mapping results of partially mapped cDNAs are thus quite
useful.
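The arithmetic behind the 4.0% figure and the "one out of every 25–27 clusters" statement can be reproduced directly from the numbers given above. The function names are illustrative only; the 3.7% value is taken from the text as stated, since the RefSeq counts behind it are not given in this section.

```python
def percent_unmapped(unmapped_clusters: int, total_clusters: int) -> float:
    """Percentage of clusters that could not be mapped to the genome."""
    return 100.0 * unmapped_clusters / total_clusters


def one_in_n(percent: float) -> float:
    """Convert a percentage into 'one out of every N' form."""
    return 100.0 / percent


# H-Inv clusters only: 847 unmapped out of 21,037 in total.
h_inv_estimate = percent_unmapped(847, 21_037)

# Upper and lower estimates quoted in the text, expressed as 1-in-N.
n_for_upper = one_in_n(4.0)
n_for_lower = one_in_n(3.7)
```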
Primary structure of genes on the human genome. Using
the H-Inv cDNAs, the precise structures of many human
genes could be identified based on the results of our cDNA
mapping (Table S1). The median length of last exons (786 bp)
was found to be longer than that of other exons, and the
median length of first introns (3,152 bp) longer than that of
other introns. These observed characteristics of human gene
structures concur with the previous work using much smaller
datasets (Hawkins 1988; Maroni 1996; Kriventseva and
Gelfand 1999).
In the human genome, 50% of the sequence is occupied by
repetitive elements (Lander et al. 2001). Repetitive elements
were previously regarded by many as simply “junk” DNA.
However, the contribution of these repetitive stretches to
genome evolution has been suggested in recent works
(Makalowski 2000; Deininger and Batzer 2002; Sorek et al.
2002; Lorenc and Makalowski 2003). The 21,037 loci of
representative cDNAs were searched for repetitive elements
using the RepeatMasker program. RepeatMasker indicated
that 9,818 (47%) of the H-Inv cDNAs, including 5,442 coding
hypothetical proteins, contained repetitive sequences. The
existence of Alu repeats in 5% of human cDNAs was reported
previously (Yulug et al. 1995). Our results revealed a
significant number of repetitive sequences including Alu in
the human transcriptome. Among them, 1,866 cDNAs over-
lapped repetitive sequences in their ORFs. Moreover, 554 of
1,866 cDNAs had repetitive sequences contained completely
within their ORFs, including 81 cDNAs that were identical or
similar to known proteins. This may indicate the involvement
of repetitive elements in human transcriptome evolution, as
suggested by the presence of Alu repeats in AS exons (Sorek
et al. 2002) and the contribution to protein variability by
repetitive elements in protein-coding regions (Makalowski
2000). We detected 2,254 and 5,427 cDNAs containing
repetitive sequences in their 5' UTR and 3' UTR, respectively.
The positioning of the repetitive elements suggests they play
a regulatory role in the control of gene expression (Deininger
and Batzer 2002) (see Table S1 or the H-InvDB for details).
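The distinction drawn above between repeats that overlap an ORF, repeats contained completely within an ORF, and repeats in the UTRs comes down to comparing two intervals on the cDNA. The sketch below assumes half-open `(start, end)` coordinates on the sense strand of the cDNA; the function name and labels are hypothetical and are not taken from the RepeatMasker output format.

```python
def repeat_location(repeat, orf):
    """Classify a repeat interval relative to the ORF on the same cDNA.

    Both arguments are half-open (start, end) tuples in cDNA coordinates.
    """
    r_start, r_end = repeat
    o_start, o_end = orf
    if r_start >= r_end or o_start >= o_end:
        raise ValueError("intervals must have start < end")
    if o_start <= r_start and r_end <= o_end:
        return "within ORF"      # repeat contained completely in the ORF
    if r_start < o_end and r_end > o_start:
        return "overlaps ORF"    # partial overlap with the coding region
    if r_end <= o_start:
        return "5' UTR"          # entirely upstream of the ORF
    return "3' UTR"              # entirely downstream of the ORF
```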
Figure 2. A Comparison of the Mapped H-Inv FLcDNAs and the RefSeq
mRNAs
The mapped H-Inv cDNAs, the RefSeq curated mRNAs (accession
prefixes NM and NR), and the RefSeq model mRNAs (accession
prefixes XM and XR) provided by the genome annotation process
were clustered based on the genome position. The numbers of loci
that were identified by clustering are shown.
DOI: 10.1371/journal.pbio.0020162.g002

AS transcripts. We wished to investigate the extent to
which the functional diversity of the human proteome is
affected by AS. In order to do this, we searched for potential
AS isoforms in 7,874 loci that were supported by at least two
H-Inv cDNAs. We examined whether or not these cDNAs
represented mutually exclusive AS isoforms, using a
combination of computational methods and human curation (see
Materials and Methods). All AS isoforms that were supported
independently by both methods were defined as the H-Inv AS
dataset. Our analysis showed that 3,181 loci (40% of the 7,874
loci) encoded 8,553 AS isoforms expressing a total of 18,612
AS exons. On average, 2.7 AS isoforms per locus were
identified in these AS-containing loci. This figure represents
half of the AS isoforms predicted by another group (Lander
et al. 2001). Our result highlights the degree to which
full-length sequencing of redundant clones is necessary when
characterizing the complete human transcriptome. The
relative positions of AS exons on the loci varied: 4,383
isoforms comprising 1,538 loci were 5' terminal AS variants;
5,678 isoforms comprising 1,979 loci were internal AS
variants; and 2,524 isoforms comprising 921 loci were 3'
terminal AS variants.
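The three positional classes above are not mutually exclusive: the isoform counts (4,383 + 5,678 + 2,524) exceed the 8,553 isoforms in the dataset, so one isoform can carry AS exons at more than one position. A minimal sketch of such a labeling is given below. The rule used here (first exon is 5' terminal, last exon is 3' terminal, everything else is internal) is an assumption for illustration, and the function name is hypothetical.

```python
def as_variant_positions(n_exons, as_exon_indices):
    """Return the set of positional labels for an isoform's AS exons.

    n_exons: number of exons in the isoform, ordered 5' to 3'.
    as_exon_indices: 0-based indices of exons that are alternatively spliced.
    """
    labels = set()
    for i in as_exon_indices:
        if not 0 <= i < n_exons:
            raise ValueError(f"exon index {i} out of range")
        if i == 0:
            labels.add("5' terminal")
        elif i == n_exons - 1:
            labels.add("3' terminal")
        else:
            labels.add("internal")
    return labels
```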
The AS isoforms found in the H-Inv AS dataset have
strikingly diverse functions. Motifs are found over a wide
range of protein sequences. For certain types of subcellular
targeting signals, such as signal peptides, position within the
entire protein sequence appears crucial. A total of 3,020 (35%)
AS isoforms contained AS exons that overlapped
protein-coding sequences. Of these 3,020 AS isoforms, 1,660 (55%)
harbored AS exons that encoded functional motifs.
Additionally, 1,475 loci encoded AS isoforms that had different
subcellular localization signals, and 680 loci had AS isoforms
that had different transmembrane domains. These results
suggest marked functional differentiation between the vary-
ing isoforms. If this is the case, it would appear that AS
contributes significantly to the functional diversity of the
human proteome.
As the coverage of the human transcriptome by H-Inv
cDNAs is incomplete, it would be misleading to conjecture
that our dataset comprehensively includes all AS transcripts
from every human gene. However, the current collection is a
robust characterization of the existing functional diversity of
the human proteome, and it represents a valuable resource of
full-length clones for the characterization of experimentally
determined AS isoforms.
In the cases where three-dimensional (3D) structures could
be assigned to H-Inv cDNA protein products, we have
examined the possible impact of AS rearrangements on the
3D structure. Our analysis was performed using the Genomes
TO Protein structures and functions database (GTOP)
(Kawabata et al. 2002). We found that some of the sequence
regions in which internal exons vary between different
isoforms contained regions encoding SCOP domains (Lo
Conte et al. 2000). This discovery allowed us to perform a
simple analysis of the structural effects of AS. Our analysis of
the SCOP domain assignments revealed that the loci
displaying AS are much more likely to contain class c
(β–α–β units, α/β) SCOP domains than class d (segregated α and β
regions, α+β) or class g (small) domains.
An example of exon differences between AS isoforms is
presented in Figure 3. The structures shown are those of
proteins in the Brookhaven Protein Data Bank (PDB) (Ber-
man et al. 2000) to which the amino acid sequences of the
corresponding AS isoforms are aligned. Segments of the AS
isoform sequences that are not aligned with the correspond-
ing 3D structure are shown in purple. Figure 3 demonstrates
that exon differences resulting from AS sometimes give rise
to significant alterations in 3D structure.
Functional Annotation
We predicted the ORFs of 41,118 H-Inv cDNA sequences
using a computational approach (see Figure S1), of which
39,091 (95.1%) were protein coding and the remaining 2,027
(4.9%) were non-protein-coding. Since the structures and
functions of protein products from AS isoforms are expected
to be basically similar, we selected a “representative
transcript” from each of the loci (see Figure S2). Then we
identified 19,660 protein-coding and 1,377 non-protein-
coding loci (Table 3). Human curation suggested that a total
of 86 protein-coding transcripts should be deemed ques-
tionable transcripts. Once identified as dubious these
sequences were excluded from further analysis. The remain-
ing representatives from the 19,574 protein-coding loci were
used to define a set of human proteins (H-Inv proteins). The
tentative functions of the H-Inv proteins were predicted by
computational methods. The computational predictions were
followed by human curation.
Figure 3. An Example of Different Structures Encoded by AS Variants
Exons are presented from the 5' end, with those shared by AS variants
aligned vertically. The AS variants, with accession numbers AK095301
and BC007828, are aligned to the SCOP domain d.136.1.1 and
corresponding PDB structure 1byr. Helices and beta sheets are red
and yellow, respectively. Green bars indicate regions aligned to the
PDB structure, while open rectangles represent gaps in the
alignments. AK095301 is aligned to the entire PDB structure shown, while
BC007828 is lacking the alignment to the purple segment of the
structure.
DOI: 10.1371/journal.pbio.0020162.g003

After determination of the H-Inv proteins, we performed a
standardized functional annotation as illustrated in Figure 4,
during which we assigned the most suitable data source ID to
each H-Inv protein based on the results of similarity search
and InterProScan. We classified the 19,574 H-Inv proteins
according to the levels of the sequence similarity. Using a
system developed for the human cDNA annotation (see
Figure S2), we classified the H-Inv proteins into five
categories (Table 3). Three categories contain translated
gene products that are related to known proteins: 5,074
(25.9%) were defined as identical to a known human protein
(Category I proteins); 4,104 (21.0%) were defined as similar to
a known protein (Category II proteins); and 2,531 (12.9%) as
domain-containing proteins (Category III proteins). In total,
we were able to assign biological function to 59.9% of H-Inv
proteins by similarity or motif searches. The remaining
proteins, for which no biological function was inferred,
were annotated as conserved hypothetical proteins (Category
IV proteins; 1,706, 8.7%) if they had a high level of similarity
to other hypothetical proteins in other species, or as
hypothetical proteins (Category V proteins; 6,159, 31.5%) if
they did not.
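The five similarity categories can be read as a cascade of tests applied to each H-Inv protein. The sketch below encodes one plausible reading of that cascade. The order of precedence (Category I before II before III) and the boolean inputs are assumptions made for illustration; the actual decision rules are those of the curation pipeline in Figure 4.

```python
def h_inv_category(identical_to_known_human: bool,
                   similar_to_known: bool,
                   has_interpro_domain: bool,
                   similar_to_hypothetical: bool) -> str:
    """Assign a similarity category (I to V) from four evidence flags."""
    if identical_to_known_human:
        return "I"    # identical to a known human protein
    if similar_to_known:
        return "II"   # similar to a known protein
    if has_interpro_domain:
        return "III"  # InterPro domain-containing protein
    if similar_to_hypothetical:
        return "IV"   # conserved hypothetical protein
    return "V"        # hypothetical protein


# Category counts reported in the text; they sum to the 19,574 H-Inv proteins.
CATEGORY_COUNTS = {"I": 5074, "II": 4104, "III": 2531, "IV": 1706, "V": 6159}
```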
To predict the functions of hypothetical proteins (Category
IV and V proteins), we used 196 sequence patterns of
functional importance derived from tertiary structures of
protein modules, termed 3D keynotes (Go 1983; Noguti et al.
1993). Application of the 3D keynotes to the H-Inv proteins
resulted in the prediction of functions in 350 hypothetical
proteins (see Protocol S1).
Features of ORFs deduced from human FLcDNAs. The
mean and median lengths of predicted ORFs were calculated
for the 19,574 H-Inv proteins. These were 1,095 bp and 806
bp, respectively (Table 4). The values obtained were smaller
than those from other eukaryotes, and are inconsistent with
estimates reported previously (Shoemaker et al. 2001).
However, as has been seen in the earlier annotation of the
fission yeast genome (Das et al. 1997), our dataset might
contain stretches which mimic short ORFs. This would lead to
a bias in our ORF prediction and result in an erroneous
estimate of the average ORF length. We examined the size
distributions of ORFs from the five categories, and found that
the distribution pattern was quite similar across categories.
The exception was Category V, in which short ORFs were
unusually abundant (Figure S3). Judging from the length
distribution of ORFs in the five categories of H-Inv proteins,
the majority of ORFs shorter than 600 bps in Category V
seemed questionable. In order to have a protein dataset that
contains as many sequences to be further analyzed as
possible, we have taken the longest ORFs over 80 amino
acids if no significant candidates were detected by the
sequence similarity and gene prediction (see Figure S1). The
consequence of this is that Category V appears to contain
short questionable ORFs, a certain fraction of which may be
prediction errors. Nevertheless, these ORFs could be genuine;
our manual curation of the cDNAs left open the possibility that
they are in fact translated in vivo. The existence of many
functional short proteins in the human proteome is already
confirmed, and there are 199 known human proteins that are
80 amino acids or shorter in the current Swiss-Prot database.
We think that the H-Inv hypothetical proteins require
experimental verification in the future. Excluding the
hypothetical proteins from the analysis, we obtained mean
and median lengths for the ORFs of 1,368 bp and 1,130 bp,
respectively, which are reasonably close to those for other
eukaryotes (Table 4).
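The fallback rule described above, taking the longest ORF over 80 amino acids when no other candidate is found, can be sketched as a scan of the three forward reading frames. This is an illustration under stated assumptions and not the prediction scheme of Figure S1: it considers only ATG-to-stop ORFs on the given strand, treats "over 80" as strictly greater than 80, and all names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_AA = 80  # length cutoff in amino acids, taken from the text


def longest_orf(seq: str, min_aa: int = MIN_ORF_AA):
    """Return (start, end, length_aa) of the longest forward-strand ORF.

    start/end are 0-based, end-exclusive, and include the stop codon.
    Only ORFs longer than min_aa amino acids qualify; returns None otherwise.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                length_aa = (i - start) // 3  # codons before the stop
                if length_aa > min_aa and (best is None or length_aa > best[2]):
                    best = (start, i + 3, length_aa)
                start = None  # look for the next ORF in this frame
    return best
```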
Table 3. Statistics Obtained from the Functional Annotation Results
Category                                            Number of Loci
H-Inv proteins
  I. Identical to a known human protein             5,074
  II. Similar to a known protein                    4,104
  III. InterPro domain containing protein           2,531
  IV. Conserved hypothetical protein                1,706
  V. Hypothetical protein                           6,159
  Total number of H-Inv proteins                    19,574
Non-protein-coding transcripts
  Putative ncRNA                                    296
  Uncharacterized transcript                        675
  Unclassifiable                                    329
  Hold                                              77
  Total number of non-protein-coding transcripts    1,377
Questionable transcripts                            86
Total number of H-Inv loci                          21,037
DOI: 10.1371/journal.pbio.0020162.t003

Figure 4. Schematic Diagram of Human Curation for H-Inv Proteins
The diagram illustrates the human curation pipeline to classify H-Inv
proteins into five similarity categories: Category I, II, III, IV, and V
proteins.
DOI: 10.1371/journal.pbio.0020162.g004

Of the 4,104 Category II proteins, 3,948 proteins (96.2%)
were similar to the functionally identified proteins of
mammals (Figure S4). This implies that the predicted
functions in this study were based on the comparative study
with closely related species, so that the functional assignment
retains a high level of accuracy if we suppose that protein
function is more highly conserved in more closely related
species. Moreover, the patterns of codon usage and the codon
adaptation index (CAI; http://biobase.dk/embossdocs/cai.html)
of H-Inv proteins were investigated (Table S2). The results
indicated that the ORF prediction scheme worked equally
well in the five similarity categories of H-Inv proteins.
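One of the simpler codon-usage statistics, the percent GC at the third codon position reported in Table 4, can be computed as follows. This sketch assumes an in-frame ORF whose length is a multiple of three and counts the stop codon if it is included in the input; whether the published values include the stop codon is not stated here, and the function name is hypothetical.

```python
def gc3_percent(orf: str) -> float:
    """Percent of G or C bases at the third position of each codon."""
    orf = orf.upper()
    if len(orf) == 0 or len(orf) % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    third_positions = orf[2::3]  # every third base, starting at codon position 3
    gc = sum(base in "GC" for base in third_positions)
    return 100.0 * gc / len(third_positions)
```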
Each H-Inv protein in the five categories was investigated
in relation to the tissue library of origin (Table S3). We found
that at least 30% of the clones mainly isolated from dermal
connective, muscle, heart, lung, kidney, or bladder tissues
could be classified as Category I proteins. Hypothetical
proteins (Category V), on the other hand, were abundant in
both endocrine and exocrine tissues. This bias may indicate
that expression in some tissues may not have been studied in
enough detail. If this is the case, then there is likely a
significant gap between our current knowledge of the human
proteome and its true dimensions.
Non-protein-coding genes. Over recent years, ncRNAs have
been found to play key roles in a variety of biological
processes in addition to their well-known function in protein
synthesis (Moore and Steitz 2002; Storz 2002). Analysis of the
H-Inv cDNA dataset revealed that 6.5% of the transcripts are
possibly non-protein-coding, although the number is much
smaller than that estimated in mice (Okazaki et al. 2002). We
believe that this difference between the two species is mainly
due to the larger number of mouse libraries that were used
and to a rare-transcript enrichment step that was applied to
these collections.
To identify ncRNAs, we manually annotated 1,377
representative non-protein-coding transcripts, which were
classified into four categories (see Table 3; Figure 5): putative
ncRNAs, uncharacterized transcripts (possible 3' UTR
fragments supported by ESTs), unclassifiable transcripts (possible
genomic fragments), and hold transcripts (not stringently
mapped onto the human genome). Of these, 296 (19.5%) were
putative ncRNAs with no neighboring transcripts in the close
vicinity (> 5 kb) and supported by ESTs with a poly-A signal
or a poly-A tail, indicating that these may represent genuine
ncRNA genes. On the other hand, a large fraction of the
non-protein-coding transcripts (675; 44.5%) were classified as
possible 3' UTRs of genes that were mapped less than 5 kb
upstream. The 5-kb range is an arbitrary distance that we
defined as one of our selection criteria for identifying
ncRNAs. However, authentic non-protein-coding genes
might be located adjacent to other protein-coding genes (as
described earlier). Thus, some of the transcripts initially
annotated as uncharacterized ESTs may correspond to
ncRNAs when these sequences satisfy the other selection
criteria.
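The four-way classification of non-protein-coding transcripts can be summarized as a short decision function. The version below is a simplified sketch of the criteria described in the text and in the Figure 5 caption, not the manual annotation procedure itself: the input flags, the order of the tests, and the handling of a neighbor at exactly 5 kb are assumptions.

```python
NEIGHBOR_DISTANCE_BP = 5_000  # the arbitrary 5-kb range defined in the text


def classify_noncoding(stringently_mapped: bool,
                       distance_to_nearest_gene_bp,
                       has_est_support: bool,
                       has_polya_feature: bool) -> str:
    """Assign one of the four categories to a non-protein-coding transcript.

    distance_to_nearest_gene_bp may be None when no neighboring gene is known.
    """
    if not stringently_mapped:
        return "hold"
    near_gene = (distance_to_nearest_gene_bp is not None
                 and distance_to_nearest_gene_bp < NEIGHBOR_DISTANCE_BP)
    if near_gene and has_est_support:
        return "uncharacterized transcript"  # possible 3' UTR fragment
    if not near_gene and (has_est_support or has_polya_feature):
        return "putative ncRNA"
    return "unclassifiable"  # possible genomic fragment
```

Changing `NEIGHBOR_DISTANCE_BP` shows how sensitive the split between putative ncRNAs and uncharacterized transcripts is to the choice of distance cutoff.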
Figure 5. The Manual Annotation Flow Chart of ncRNAs
Candidate non-protein-coding genes were compared with the human
genome, ESTs, cDNA 3'-end features and the locus genomic
environment. The candidates were then classified into four
categories: hold (cDNAs improperly mapped onto the human genome);
uncharacterized transcripts (transcripts overlapping a sense gene or
located within 5 kb of a neighboring gene with EST support); putative
ncRNAs (multiexon or single exon transcripts supported by ESTs or
3'-end features); and unclassifiable (possible genomic fragments).
DOI: 10.1371/journal.pbio.0020162.g005

Table 4. The Features of Predicted ORFs
Dataset                                    Number of ORFs   Mean (bp)   Median (bp)   Percent GC of Third Codon Position
Human: H-Inv datasets (categories I–IV)    13,415           1,368       1,130         52.3
Human: all of the H-Inv datasets           19,574           1,095       806           52.4
Fly                                        17,878           1,580       1,212         53.9
Worm                                       21,118           1,327       1,038         42.9
Budding yeast                              6,408            1,403       1,128         40.3
Fission yeast                              4,968            1,426       1,161         39.7
Plant                                      27,228           1,269       1,074         44.2
Bacteria                                   4,289            951         834           51.9
Nonredundant proteome datasets of nonhuman species were obtained from the following URLs: fly (Drosophila melanogaster; http://flybase.bio.indiana.edu/), worm
(Caenorhabditis elegans; http://www.wormbase.org/), budding yeast (Saccharomyces cerevisiae; http://www.pasteur.fr/externe), fission yeast (Schizosaccharomyces pombe;
http://www.sanger.ac.uk/), plant (Arabidopsis thaliana; http://mips.gsf.de/proj/thal/index.html), and bacteria (Escherichia coli K12; http://www.ncbi.nlm.nih.gov/).
DOI: 10.1371/journal.pbio.0020162.t004

We defined a manual annotation strategy (Figure 5) that
allowed us to select convincing putative ncRNAs with various
lines of supporting evidence. These are the following: absence
of a neighboring gene in the close vicinity, overlap with
human or mouse ESTs, occurrence in the 3' end of cDNA
sequences, as well as overlap with mouse cDNAs. Out of 296
annotated putative ncRNAs, we identified 47 ncRNAs with
conserved RNA secondary structure motifs (Rivas and Eddy
2001), and nearly 60% of these were found expressed in up to
eight human tissues (data not shown), indicating that the
manual curation strategy employed in this study may
facilitate the identification of novel non-protein-coding
genes in other species.
The functions of human proteins identified through an
analysis of domains. Proteins in many cases are composed of
distinct domains each of which corresponds to a specific
function. The identification and classification of functional
domains are necessary to obtain an overview of the whole
human proteome. In particular, the analysis of functional
domains allows us to elucidate the evolution of the novel
domain architectures of genes that life forms have acquired
in conjunction with environmental changes. The human
proteome deduced from the H-Inv cDNAs was subjected to
InterProScan, which assigned functional motifs from the
PROSITE, PRINTS, SMART, Pfam, and ProDom databases
(Mulder et al. 2003). A total of 19,574 H-Inv proteins were
examined, and 9,802 of them (50.1%) were assigned at least
one InterPro code that was classified into either repeats (a
region that is not expected to fold into a globular domain on
its own), domains (an independent structural unit that can be
found alone or in conjunction with other domains or
repeats), and/or families (a group of evolutionarily related
proteins that share one or more domains/repeats in common)
when compared with those of fly, worm, budding and fission
yeasts, Arabidopsis thaliana, and Escherichia coli (Table S4).
Moreover, the proteins were classified according to the Gene
Ontology (GO) codes that were assigned to InterPro entries
(Table S5).
Identification of human enzymes and metabolic pathways.
One of the most important goals of the functional annotation
of human cDNAs is to predict and discover new, previously
uncharacterized enzymes. In addition, revealing their posi-
tions in the metabolic pathways helps us understand the
underlying biochemical and physiological roles of these
enzymes in the cells. We thus searched for potential enzymes
among the H-Inv proteins, and mapped them to a database of
known metabolic pathways.
We could assign 656 kinds of potential Enzyme Commis-
sion (EC) numbers to 1,892 of the 19,574 H-Inv proteins based
on matches to the InterPro entries and GO assignments and
on the similarity to well-characterized Swiss-Prot proteins
(see Dataset S2). The number of characterized human
enzymes significantly increased through this analysis. The
most abundant enzymes in the H-Inv proteins were protein–
tyrosine kinases (EC 2.7.1.112), which is consistent with the
large number of kinases found in the InterPro assignments.
The other major enzymes were small monomeric GTPase (EC
3.6.1.47), adenosinetriphosphatase (EC 3.6.1.3), phosphopro-
tein phosphatase (EC 3.1.3.16), ubiquitin thiolesterase (EC
3.1.2.15), and ubiquitin-protein ligase (EC 6.3.2.19). These
enzymes are members of large multigene families that are
important for the functions of higher organisms. Further-
more, we could assign 726 EC numbers to mouse representa-
tive transcripts and proteins (Okazaki et al. 2002), and most of
them appeared to be shared between human and mouse (data
not shown). The high similarity of the enzyme repertoire
between these two species is not surprising if we consider the
close evolutionary relatedness between them. It does, how-
ever, indicate the usefulness of the mouse as a model
organism for studies concerning metabolism.
We then mapped all H-Inv proteins on the metabolic
pathways of the KEGG database, a large collection of
information on enzyme reactions (Kanehisa et al. 2002). In
total, we mapped 963 H-Inv proteins on a total of 1,613
KEGG pathways, of which 641 were based on their EC
number assignments (Figure S5). Those based on EC number
assignments do not necessarily function as they are assigned
because they have yet to be verified experimentally. However,
if all other enzymes along the same pathway exist in humans,
the functional assignment has a high probability of being
correct. Using this method, we discovered a total of 32 newly
assigned human enzymes from the H-Inv proteins with the
support of KEGG pathways (Table S6). For example, we
identified (1) pyridoxamine-phosphate oxidase (EC 1.4.3.5;
AK001397), an enzyme in the “salvage pathway” that
reutilizes the coenzyme pyridoxal-5′-phosphate (its role in
epileptogenesis was recently reported [Bahn et al. 2002]); (2)
ATP-hydrolysing 5-oxoprolinase (EC 3.5.2.9; AL096750), which
cleaves 5-oxo-L-proline to form L-glutamate (its deficiency
is described in the Online Mendelian Inheritance in Man
[OMIM] database [ID=260005]); and (3)
N-acetylglucosamine-6-phosphate deacetylase (EC 3.5.1.25;
BC018734), which catalyzes the second step of
N-acetylglucosamine catabolism and whose activity in human
erythrocytes was detected by a biochemical study (Weidanz et
al. 1996). Many of the newly identified
enzymes were supported by currently available experimental
and genomic data. An example is a putative urocanase (EC
4.2.1.49; AK055862) that mapped onto the “histidine
metabolism” pathway, in which urocanic acid is catabolized. A
14C-histidine tracer study unexpectedly revealed that NEUT2
mice deficient in 10-formyltetrahydrofolate dehydrogenase
(FTHFD) excrete urocanic acid in the urine and lack urocanase
activity in their hepatic cytosol (Cook 2001). We then found
that both the FTHFD and AK055862 genes were located within the
same NCBI human contig (NT005588) on Chromosome 3. Moreover,
the distance between the two genes was consistent with the
genetic deletion of NEUT2 (> 30 kb). We thus inferred that
FTHFD and urocanase might be coincidentally defective in
these mice. This analysis supports the conclusion that the
AK055862 protein is a true urocanase. This example
demonstrates that this kind of in silico analysis is a
powerful method for defining the functions of proteins.
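The pathway-support reasoning above can be illustrated with a minimal sketch. It is not the procedure actually used with KEGG in this study: the function name, the toy pathway, and the placeholder labels "EC-A" to "EC-C" are assumptions made for illustration, and only EC 4.2.1.49 comes from the text. The score is the fraction of the other enzymes on a pathway that are already known in human; a value of 1.0 corresponds to the case where the candidate fills the only remaining gap.

```python
from typing import Set


def pathway_support(candidate_ec: str,
                    pathway_ecs: Set[str],
                    known_human_ecs: Set[str]) -> float:
    """Fraction of the other enzymes on a pathway that are already known in human."""
    others = pathway_ecs - {candidate_ec}
    if not others:
        return 0.0  # a single-enzyme pathway offers no supporting context
    return len(others & known_human_ecs) / len(others)


# Toy pathway (hypothetical): placeholder labels stand in for real EC numbers.
toy_pathway = {"EC-A", "4.2.1.49", "EC-B", "EC-C"}

# All neighbouring enzymes are known, so the candidate fills the only gap.
full_support = pathway_support("4.2.1.49", toy_pathway, {"EC-A", "EC-B", "EC-C"})

# One neighbour is missing, so the support is weaker.
partial_support = pathway_support("4.2.1.49", toy_pathway, {"EC-A", "EC-B"})
```

A threshold on this score would still be a judgment call, and a high score remains a prediction that needs experimental verification.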
Polymorphism in the Transcriptome
Sites of potential polymorphism in cDNAs. Due to the
rapidly increasing accumulation of genetic polymorphism
data, it is necessary to classify the polymorphism data with
respect to gene structure in order to elucidate potential
biological effects (Gaudieri et al. 2000; Sachidanandam et al.
2001; Akey et al. 2002; Bamshad and Wooding 2003). For this
purpose, we examined the relationship between publicly
available polymorphism data and the structure of our H-Inv
cDNA sequences. A total of 4 million single nucleotide
polymorphisms (SNPs) and insertion/deletion length varia-
tions (indels) with mapping information from the Single
Nucleotide Polymorphism Database (dbSNP;
http://www.ncbi.nlm.nih.gov/SNP/, build 117) (Sherry et al. 1999)
were used for the search. We could identify 72,027 uniquely
mapped SNPs and indels in the representative H-Inv cDNAs
and observed an average SNP density of 1/689 bp. To classify
SNPs and indels with respect to gene structure, the genomic
coordinates of SNPs were converted into the corresponding
nucleotide positions within the mapped cDNAs. The SNPs
and indels were classified into three categories according to
their positions: 5′ UTR, ORF, and 3′ UTR (Table 5). The
density of indels was higher in 5′ UTRs (1/15,999 bp) and 3′
UTRs (1/12,553 bp) than in ORFs (1/45,490 bp). This is
possibly due to different levels of functional constraints. We
also examined the length of indels and found that, within
ORFs, indels whose length was divisible by three, and which
therefore did not change the reading frame, occurred at a
higher frequency. We observed that the density of SNPs was
higher in both the 5′ and 3′ UTRs (1/569 bp and 1/536 bp,
respectively) than in ORFs (1/833 bp).
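The coordinate conversion and classification described above can be sketched as follows. This is a simplified illustration rather than the code used in this study: the exon representation (1-based, inclusive genomic intervals), the function names, and the toy coordinates are all assumptions.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive genomic start and end of one exon


def genomic_to_cdna(pos: int, exons: List[Exon], strand: str = "+") -> Optional[int]:
    """Return the 1-based cDNA position of a genomic position, or None if it is intronic."""
    ordered = sorted(exons) if strand == "+" else sorted(exons, reverse=True)
    offset = 0  # number of cDNA bases contributed by the exons already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within + 1
        offset += end - start + 1
    return None


def classify_position(cdna_pos: int, orf_start: int, orf_end: int) -> str:
    """Assign a cDNA position to the 5' UTR, the ORF, or the 3' UTR."""
    if cdna_pos < orf_start:
        return "5' UTR"
    if cdna_pos > orf_end:
        return "3' UTR"
    return "ORF"


# Toy transcript (hypothetical): two 100-bp exons, ORF at cDNA positions 51-150.
toy_exons = [(100, 199), (300, 399)]
snp_cdna_pos = genomic_to_cdna(310, toy_exons)          # second exon, 11th base
snp_region = classify_position(snp_cdna_pos, 51, 150)
```

Densities such as those in Table 5 then follow from dividing the total length of each region by the number of variants assigned to it.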
SNPs located in ORFs were classified as either synonymous,
nonsynonymous, or nonsense substitutions (Table 5). We
identified 13,215 nonsynonymous SNPs that affect the amino
acid sequence of a gene product. At least 4,998 of these
nonsynonymous SNPs are ‘‘validated’’ SNPs (as defined by
dbSNP). These data can be used to predict SNPs that affect
gene function. SNPs that create stop codons can cause
polymorphisms that may critically alter gene function. We
identified 358 SNPs that caused either a nonsense mutation
or an extension of the polypeptide. We classified these 358
SNPs into these two types based on the alleles of the cDNA.
Most of these SNPs (315/358) were predicted to cause
truncation of the gene products and produce a shorter
polypeptide compared with the alleles of H-Inv cDNAs. For
example, Reissner’s fiber glycoprotein I (AK093431) contains
a nonsense SNP that results in the loss of the last 277 amino
acids of the protein, and consequently the loss of a
thrombospondin type I domain located in its C-terminal
end. This SNP is highly polymorphic in the Japanese
population, the frequencies of G (normal) and T (termina-
tion) being 0.43 and 0.57, respectively. As seen in this
example, the identification of SNPs within cDNAs provides
important insights into the potential diversity of the human
transcriptome. Thus, polymorphism data cross-referenced to a
comprehensively annotated human transcriptome might
prove to be a valuable tool in the hands of researchers
investigating genetic diseases.
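The classification of coding SNPs used above can be made concrete with a short sketch. It assumes the standard genetic code and, as in Table 5, treats the cDNA allele as the reference; the function name and the example codons are illustrative only and are not taken from the H-Inv pipeline.

```python
BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(codon: str, offset: int, alt: str) -> str:
    """Classify a substitution at `offset` (0-2) of the reference (cDNA) codon."""
    mutated = codon[:offset] + alt + codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return "synonymous at stop codon" if ref_aa == "*" else "synonymous"
    if alt_aa == "*":
        return "truncation"  # nonsense: a sense codon becomes a stop codon
    if ref_aa == "*":
        return "extension"   # the stop codon is lost and the polypeptide is extended
    return "nonsynonymous"


# A G-to-T change turning GAA (Glu) into TAA (stop) truncates the product.
example = classify_coding_snp("GAA", 0, "T")
```

Because the cDNA allele is taken as the reference, swapping the roles of the two alleles turns a "truncation" into an "extension", which is why footnote b of Table 5 states that assumption explicitly.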
Sites of microsatellite repeats. Among the 19,442 repre-
sentative protein-coding cDNAs, we identified a total of 2,934
di-, tri-, tetra-, and penta-nucleotide microsatellite repeat
motifs (Table 6). Interestingly, 1,090 (37.2%) of these were
found in coding regions, the majority of which (86.9%) were
tri-nucleotide repeats. Di-, tetra-, and penta-nucleotide
repeats made up the greatest proportion of repeats in 5′
UTRs and 3′ UTRs. Coding regions contained mostly tri-
Table 5. The Numbers of SNPs and indels Occurring in the Representative cDNAs

                                 5′ UTR               Coding Region           3′ UTR
SNPs(a)
  Synonymous                                          11,014 (1/325 bp)
  Nonsynonymous                                       13,215 (1/1,206 bp)
  Truncation(b)                                       315
  Extension(b)                                        43
  Synonymous SNP at stop codon                        28
  Total                          10,715 (1/569 bp)    24,679(c) (1/833 bp)    31,852 (1/536 bp)
Indels                           381 (1/15,999 bp)    452 (1/45,490 bp)       1,364 (1/12,553 bp)

(a) The numbers of SNPs and indels are summarized for representative cDNA sequences which were mapped on the genome. The numbers in parentheses represent the densities of SNPs and indels.
(b) SNPs that cause nonsense mutation or extension of polypeptides were classified assuming that the cDNAs represent original alleles.
(c) This figure includes 64 unclassifiable SNPs.
DOI: 10.1371/journal.pbio.0020162.t005
Table 6. The Numbers of Microsatellite Repeat Motifs That Occurred in the Representative cDNAs

                 Microsatellite Repeats
                 Di-          Tri-          Tetra-      Penta-     Total
5′ UTR           162 (50)     394 (3)       117 (4)     21 (1)     694 (58)
Coding region    70 (13)      947 (10)      63 (2)      10 (0)     1,090 (25)
3′ UTR           482 (121)    340 (3)       281 (8)     47 (1)     1,150 (133)
Total            714 (184)    1,681 (16)    461 (14)    78 (2)     2,934 (216)

Microsatellites were defined as those sequences having at least ten repeats for di-nucleotide repeats and at least five repeats for tri-, tetra-, and penta-nucleotide repeats. Numbers of polymorphic microsatellites inferred by comparisons of cDNA and genomic sequences are shown in parentheses. See Table S2 for a list of accession numbers for these cDNAs.
DOI: 10.1371/journal.pbio.0020162.t006
nucleotide repeats. This result is consistent with the idea that
microsatellites are prone to mutations that cause changes in
numbers of repeats. Only tri-nucleotide repeats can conserve
original reading frames when extended or shortened by
mutations. A previous study showed that many of the
microsatellite motifs identified in human genomic sequences,
including those in coding regions, are highly polymorphic in
human populations (Matsuzaka et al. 2001). We found this to
be the case in our study: 36 of the microsatellite repeats we
detected were found to be polymorphic in human popula-
tions according to dbSNP records (data not shown). We
identified 216 microsatellite repeats in 213 genes that showed
contradictory numbers of repeats between cDNA and
genome sequences (see Dataset S3). This figure includes 25
microsatellites in ORFs that have the potential to alter the
protein sequences. Individual cases need to be verified by
further experimental studies, but many of these micro-
satellites may really be polymorphic in human populations
and have marked phenotypic effects.
There were 382 cDNAs that possessed two or more
microsatellites in their nucleotide sequences. This is illus-
trated in RBMS1 (BC018951), a cDNA which encodes an RNA-
binding motif. This cDNA has four microsatellites, (GGA)7,
(GAG)9, (GAG)6, and (GCC)6, in its 5′ UTR. These
microsatellites are all located at least 98 bp upstream of the start
codon, but they could still have pronounced regulatory
effects on gene expression. Another example is the cDNA
that encodes CAGH3 (AB058719). This cDNA has four
microsatellites, (CAG)8, (CAG)6, (CAG)8, and (CAG)8, all of
which are located within the ORF. These microsatellites all
encode stretches of poly-glutamine, which are known to have
transcription factor activity (Gerber et al. 1994) and often
cause neurodegenerative diseases when the number of
repeats exceeds a certain limit. A typical example of a
disorder caused by these repeats is Huntington’s disease
(Andrew et al. 1993; Duyao et al. 1993; Snell et al. 1993).
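A minimal sketch of how such repeat motifs can be located is given here. It applies the thresholds stated in the note to Table 6 (at least ten repeats for di-nucleotide motifs and at least five for tri-, tetra-, and penta-nucleotide motifs) using regular expressions; the regular-expression approach, the function name, and the toy sequence are assumptions for illustration and do not describe the program actually used. Only perfect repeats are detected.

```python
import re
from typing import List, Tuple

# Minimum repeat counts per motif length, following the note to Table 6.
MIN_REPEATS = {2: 10, 3: 5, 4: 5, 5: 5}


def find_microsatellites(seq: str) -> List[Tuple[int, str, int]]:
    """Return (0-based start, motif, repeat count) for each perfect repeat run."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # A motif of `unit` bases followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if len(set(motif)) == 1:
                continue  # mononucleotide runs are not counted here
            if unit == 4 and motif[:2] == motif[2:]:
                continue  # e.g. ACAC is a di-nucleotide repeat, not a tetra-nucleotide one
            hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


# Toy sequence (hypothetical): a (CAG)8 run and an (AC)10 run with short flanks.
toy_seq = "TT" + "CAG" * 8 + "TT" + "AC" * 10 + "GG"
toy_hits = find_microsatellites(toy_seq)
```

The motif reported for a run depends on where the match starts, so (CAG)n, (AGC)n, and (GCA)n can describe the same repeat; a production tool would normalize such rotations before counting.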
We also searched for repeat motifs containing the same
amino acid residue in the encoded protein sequences. We
located a total of 3,869 separate positions where the same
amino acid was repeated at least five times. The most
frequent repetitive amino acids are glutamic acid, proline,
serine, alanine, leucine, and glycine. The glutamine repeats of
this nature were found in 160 different locations.
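Runs of a single amino acid of the kind counted above can be found with a similar pattern search. The sketch below reports every position where the same residue occurs at least five times in a row; the function name and the toy peptide are hypothetical.

```python
import re
from typing import List, Tuple

# One residue followed by at least four more copies of itself.
RUN_PATTERN = re.compile(r"([A-Z])\1{4,}")


def single_residue_runs(protein: str) -> List[Tuple[int, str, int]]:
    """Return (0-based start, residue, run length) for runs of five or more."""
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in RUN_PATTERN.finditer(protein.upper())]


# Toy peptide (hypothetical) with a six-glutamine run and a five-proline run.
toy_runs = single_residue_runs("MKQQQQQQAPPPPPLE")
```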
Evolution of the Human Transcriptome
Beyond the study of individual genes, the comparison of
numerous complete genome sequences facilitates the eluci-
dation of evolutionary processes of whole gene sets. More-
over, the FLcDNA datasets of humans and mice give us an
opportunity to investigate the genome-wide evolution of
these two mammals by using the sequences supported by
physical clones. Here we compared our human cDNA
sequences with all proteins available in the public databases.
Focusing on our results, we discuss when and how the human
proteome may have been established during evolution.
Furthermore, the evolution of UTRs is examined through
comparisons with cDNAs from both primates and rodents.
Conserved and derived protein-coding genes in humans.
An advantage of large-scale cDNA sequencing is that it can
generate a nearly complete gene set with good evidence for
transcription. The human proteome deduced from the
FLcDNA sequences gives us an opportunity to decipher the
evolution of the entire proteome. Here we compare the
representative H-Inv cDNAs with the Swiss-Prot and TrEMBL
protein databases using FASTY (Pearson 2000), and we
describe the distributions of the homologs among taxonomic
groups at two different similarity levels. The number of
representative H-Inv cDNAs that have homolog(s) in a given
taxon was counted (Figure S6), and the cDNAs were classified
into functional categories (Figure 6). These results indicated
that homologs of the human proteins were probably
conserved much more in the animal kingdom than in the
others at both moderate (E < 10^-10) and weak (E < 10^-5)
similarity levels (see Figure S6). Moreover, human sequences
had as many nonmammalian animal homologs as mammalian
homologs, with seemingly little bias to any one function (see
Figure 6). This suggests that the genetic background of
humans may have already been established in an early stage of
animal evolution and that many parts of the whole genetic
system have probably been stable throughout animal evolu-
tion despite the seemingly drastic morphological differences
between various animal species. This result is consistent with
our previous observation that the distribution of the func-
tional domains is highly conserved among animal species (see
Table S4). The number of homologs may have been inflated
by recent gene duplication events within the human lineage.
Hence we counted the number of paralog clusters instead of
cDNAs that had homologs in the databases, and obtained
essentially the same results (Figure S7).
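The counting step described above can be sketched as follows, assuming that the similarity search results have already been reduced to (cDNA, taxon, E-value) records. The record layout, the function name, and the toy hits are assumptions; the two thresholds are those given in the text. Passing a cDNA-to-cluster mapping switches the count from cDNAs to paralog clusters.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Hit = Tuple[str, str, float]  # (cDNA id, taxon of the database hit, E-value)


def count_homologs(hits: Iterable[Hit], threshold: float,
                   cluster_of: Optional[Dict[str, str]] = None) -> Dict[str, int]:
    """Count distinct cDNAs (or paralog clusters) with a hit below `threshold` per taxon."""
    per_taxon = defaultdict(set)
    for cdna_id, taxon, evalue in hits:
        if evalue < threshold:
            unit = cluster_of.get(cdna_id, cdna_id) if cluster_of else cdna_id
            per_taxon[taxon].add(unit)
    return {taxon: len(units) for taxon, units in per_taxon.items()}


# Toy hits (hypothetical).
toy_hits = [
    ("cDNA1", "mammal", 1e-40), ("cDNA1", "fungi", 1e-7),
    ("cDNA2", "mammal", 1e-12), ("cDNA2", "animal", 1e-11),
    ("cDNA3", "mammal", 1e-6),
]
moderate = count_homologs(toy_hits, 1e-10)
weak = count_homologs(toy_hits, 1e-5)
by_cluster = count_homologs(
    toy_hits, 1e-5, {"cDNA1": "c1", "cDNA2": "c1", "cDNA3": "c2"})
```

Counting clusters rather than cDNAs removes the inflation that recent duplications would otherwise introduce, which is the motivation given for Figure S7.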
This analysis also revealed a number of potential human-
specific proteins, which did not have any homologs in the
current sequence databases. In this case the creation of
lineage-specific genes through speciation is not completely
excluded. However, most ORFs with no similarity to known
proteins would not be genuine for the reasons discussed
above. Therefore, the number of ‘‘true’’ human-specific
proteins is expected to be relatively small.
We conducted further BLASTP searches matching entries
from the Swiss-Prot database against the H-Inv dataset itself.
Figure 6. The Functional Classification of H-Inv Proteins That Are
Homologous to Proteins in Each Taxonomic Group
The numbers of representative H-Inv cDNAs with sequence
homology to other species’ proteins (E < 10^-5) were calculated. The
cDNAs for which we could not assign any functions were discarded.
Mammalian species were excluded from the ‘‘animal’’ group.
‘‘Eukaryote’’ represents eukaryotic species other than those included
in the mammal, animal, fungi, and plant groups. See also Table S7.
DOI: 10.1371/journal.pbio.0020162.g006
As a result, 12,813 (45.3%) of 28,263 vertebrate proteins had
homologs in nonvertebrates at E < 10^-30. Taking into account
that the dataset is relatively small (approximately 12,000
sequences) and as a result may be biased, animal species may
conceivably share a similar protein-coding gene set.
Ohno (1996) proposed that the emergence of a large
number of animal phyla in a short period of time would
endow them with almost identical genomes. These were
collectively referred to as the pananimalia genome. Our data
support Ohno’s hypothesis from the perspective that the
basic gene repertoires of animals are essentially highly similar
among diverse species that have evolved separately since the
Cambrian explosion. Subsequently, morphological evolution
seems to have been brought about mainly by changes in gene
regulation. The number of transcription regulator homologs
is different between animals and other phyla (Table S7). In
this analysis it was not possible to examine the genes recently
deleted from the human lineage. However, the similarity of
the proteome sets between distantly related mammals such as
human and mouse (Waterston et al. 2002) suggests that not
many genes have been deleted specifically from humans since
humans and mice diverged.
A unique feature of the Animalia proteome is, for example,
the presence of apoptosis regulator homologs, which are
found widely in the animal kingdom, whilst they are rare in
the other phyla (Table S7). Since apoptosis plays an
important role during the development of multicellular
animals, this observation indicates that apoptosis was
established independently of both plants and fungi during
the early evolution of multicellularization in the kingdom
Animalia. Likewise, signal transducers and cell-adhesion
proteins are distinctive. In contrast, enzymes, translation
regulators, molecular chaperones, etc. were highly conserved
among all taxonomic groups. These proteins may have played
such essential roles that any alterations were eliminated by
strong purifying selection. Some of these functions were
presumably derived from ancient endocellular symbionts
(mitochondria and chloroplasts) (Martin 2002).
Evolution of untranslated regions. The UTRs of mRNA are
known to be involved in the regulation of gene expression at
the posttranscriptional level through control of translation
efficiency (Kozak 1989; Geballe and Morris 1994; Sonenberg
1994), mRNA stability (Zaidi and Malter 1994; McCarthy and
Kollmus 1995), and mRNA localization (Curtis et al. 1995;
Lithgow et al. 1997). Only a few studies on very limited
datasets have been carried out so far to describe quantita-
tively either the evolutionary dynamics of mRNA UTRs
(Larizza et al. 2002), or their general structural and composi-
tional features (Pesole et al. 1997). The human transcriptome
presented here along with the murid data obtained mainly
from the FANTOM2 project enables us to establish a
mammalian genome perspective on the subject (Table S8).
A sliding window analysis of UTR sequence identities
between humans and mice revealed a positive correlation
between the number of indels in an untranslated region and
the distance from the coding sequence (Figure 7). Unlike
indels, mismatches are distributed equally along whole
untranslated regions. In other words, indels seem to be less
tolerated in close proximity to a coding sequence, while
substitutions are evenly distributed along the untranslated
regions of the mRNAs. This seems to be a general pattern
observed similarly in other species (data not shown). Indels in
UTRs may have been avoided so that the distance between the
coding region and a signal sequence for regulation in the
UTR could be conserved throughout evolution, while purify-
ing selection against substitutions appeared to be relatively
weak.
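To make the window analysis concrete, the following sketch counts indel and mismatch columns in sliding windows over a pairwise UTR alignment, using the 20-bp window and 10-bp step given in the legend of Figure 7. It is a simplification: it reports raw counts rather than the normalized alignment score plotted in Figure 7, and the input convention (two gapped strings of equal length, with index 0 adjacent to the coding sequence) is an assumption.

```python
from typing import List, Tuple


def window_counts(aln_a: str, aln_b: str,
                  window: int = 20, step: int = 10) -> List[Tuple[int, int, int]]:
    """Return (window start, indel columns, mismatch columns) for each window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    rows = []
    for start in range(0, len(aln_a) - window + 1, step):
        columns = list(zip(aln_a[start:start + window], aln_b[start:start + window]))
        indels = sum(1 for a, b in columns if a == "-" or b == "-")
        mismatches = sum(1 for a, b in columns if "-" not in (a, b) and a != b)
        rows.append((start, indels, mismatches))
    return rows


# Toy alignment (hypothetical): gaps appear only in the part far from the CDS.
toy_a = "A" * 20 + "ACGT-" * 4
toy_b = "A" * 19 + "C" + "ACGTA" * 4
toy_rows = window_counts(toy_a, toy_b)
```

Averaging such counts over all aligned UTR pairs at each window position would give profiles in which the indel count rises with distance from the coding sequence while the mismatch count stays roughly flat, which is the pattern reported here.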
Untranslated region replacement. A replacement of the
entire UTR may lead to drastic changes in gene expression,
especially if a UTR having a posttranscriptional signal is
replaced by another. We compared the evolutionary distances
of UTRs between primate and rodent orthologous sequences.
We based our analysis on the UTR sequence distances that
contradicted the expected phylogenetic tree of relatedness.
We could detect 149 UTR replacements distributed among
different species. Some of the observed replacements may
result from selection of different AS isoforms of a single locus
in different species. This is particularly likely if an AS event
involves an alternative first or last exon. It seems that UTR
replacements are more frequent in rodents than in primates,
but the difference is not statistically significant at the 5%
Figure 7. Window Analysis of Similarity between Human and Mouse
UTRs
Results for 5′ UTRs are presented above and for 3′ UTRs below. The
whole mRNA sequences were aligned using a semiglobal algorithm as
implemented in the map program (Huang 1994) with the following
parameters: match 10, mismatch -3, gap opening penalty -50, gap
extension penalty -5, and longest penalized gap 10; the terminal gaps
are not penalized at all. A window size of 20 bp was used with a step of
10 bp. The analysis window was moved upstream and downstream of
start and stop codons, respectively. The normalized score for a given
window is calculated as a fraction of an average score for all UTRs in
a given window over the maximum score observed in all 5′ or 3′
UTRs, respectively.
DOI: 10.1371/journal.pbio.0020162.g007
significance level (Table S9). We detected a UTR replacement
in less than 2% of the analyzed sequences. The evolutionary
consequences could be significant because the UTR replace-
ment might result in changes in expression level or the loss of
an mRNA localization signal.
The H-Invitational Database
All the results of the mapping of the FLcDNA sequences
onto the human genome, the clustering of FLcDNA sequen-
ces, sequence alignments, detection of AS transcripts,
sequence similarity searches, functional annotation, protein
structure prediction, subcellular localization prediction, SNP
mapping, and evolutionary analysis, as well as the basic
features of FLcDNA sequences, are stored in the H-InvDB
(Figure S8). The H-InvDB is a unique database that integrates
annotation of sequences, structure, function, expression, and
diversity of human genes into a single entity. It is useful as a
platform for conducting in silico data mining. The database
has functions such as a keyword search, a sequence similarity
search, a cDNA search, and a searchable genome browser. It is
hoped that the H-InvDB will become a vital resource in the
support of both basic and applied studies in the fields of
biology and medicine.
We constructed two kinds of specialized subdatabases
within the H-InvDB. The first is the Human Anatomic Gene
Expression Library (H-Angel), a database of expression
patterns that we constructed to obtain a broad outline of
the expression patterns of human genes. We collected gene
expression data from normal and diseased adult human
tissues. The results were generated using three methods on
seven different platforms. These included iAFLP (Kawamoto
et al. 1999; Sese et al. 2001), DNA arrays (long oligomers, short
oligomers [Haverty et al. 2002], cDNA nylon microarrays
[Pietu et al. 1999], and cDNA glass slide microarrays [Arrays/
IMAGE-Genexpress]), and cDNA sequence tags (SAGE [Vel-
culescu et al. 1995; Boon et al. 2002], EST data [Boguski et al.
1993; Kawamoto et al. 2000], and MPSS [Brenner et al. 2000]).
By normalizing levels of gene expression in experiments
conducted with different methods, we determined the gene
expression patterns of 19,276 H-Inv loci in ten major
categories of tissues. This analysis allowed us to clearly
distinguish broadly and evenly expressed housekeeping genes
from those expressed in a more restricted set of tissues
(details will be published elsewhere). The H-Angel database
comprises the largest and most comprehensive collection of
gene expression patterns currently available. Also provided is
a classification of human genes by expression pattern.
The second subdatabase of the H-InvDB is DiseaseInfo
Viewer. This is a database of known and orphan genetic
diseases. We tried to relate H-Inv loci with disease informa-
tion in two ways. Firstly, 613 H-Inv loci that correspond with
known, characterized disease-related genes were identified by
creating links to entries in both LocusLink
(http://www.ncbi.nlm.nih.gov/LocusLink/) and OMIM (Hamosh et al.
2002). Secondly, to explore the possibility that cDNAs encoding
unknown proteins may be related to ‘‘orphan pathologies’’
(diseases that have been mapped to chromosomal regions, but
for which associated genes have not yet been described), we
generated a list of H-Inv loci that co-localized with these
cytogenetic regions. The nonredundant orphan disease data-
set we created consists of 586 diseases identified through
OMIM (http://www.ncbi.nlm.nih.gov/Omim/, ver. Jan. 2003),
with an additional 108 identified from GenAtlas
(http://www.dsi.univ-paris5.fr/genatlas/, ver. Jan. 2003). Using the
OMIM and GenAtlas databases in conjunction with the
annotation results from the H-InvDB may accelerate the
process of identifying candidate genes for human genetic
diseases.
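The co-localization step described above amounts to an interval-overlap query. The sketch below assumes that both the H-Inv loci and the cytogenetic regions of the orphan pathologies have been expressed as genomic intervals; the identifiers, the coordinates, and the function name are hypothetical, and the conversion of cytogenetic bands to base-pair coordinates is not shown.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def colocalized_loci(loci: Dict[str, Interval],
                     regions: Dict[str, Interval]) -> Dict[str, List[str]]:
    """List, for each disease region, the loci that overlap it on the same chromosome."""
    result = {}
    for disease, (r_chrom, r_start, r_end) in regions.items():
        result[disease] = sorted(
            locus for locus, (chrom, start, end) in loci.items()
            if chrom == r_chrom and start <= r_end and end >= r_start
        )
    return result


# Toy data (hypothetical identifiers and coordinates).
toy_loci = {
    "locusA": ("3", 1_000_000, 1_050_000),
    "locusB": ("3", 9_000_000, 9_020_000),
    "locusC": ("7", 1_010_000, 1_030_000),
}
toy_regions = {
    "orphan_disease_1": ("3", 900_000, 1_020_000),
    "orphan_disease_2": ("X", 1, 5_000_000),
}
candidates = colocalized_loci(toy_loci, toy_regions)
```

Each locus returned in this way is only a positional candidate; its functional annotation still has to be weighed against the disease phenotype.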
Concluding Remarks
There are a number of established collections of nonhu-
man cDNAs, such as those of Drosophila melanogaster (Stapleton
et al. 2002), Danio rerio (Clark et al. 2001), Arabidopsis thaliana
(Seki et al. 2002), Plasmodium falciparum (Watanabe et al. 2002),
and Trypanosoma cruzi (Urmenyi et al. 1999). The most extensive
collection of mammalian cDNAs so far has been that of the
RIKEN/FANTOM mouse cDNA project (Kawai et al. 2001;
Okazaki et al. 2002). This wealth of information has spurred a
wide variety of research in the areas of both gene expression
profiling (Miki et al. 2001) and protein–protein interactions
(Suzuki et al. 2001). The H-InvDB provides an integrative
means of performing many more such analyses based on
human cDNAs.
The most important findings that have resulted from the
cDNA annotation are summarized here.
(1) The 41,118 H-Inv cDNAs were found to cluster into
21,037 human gene candidates. Comparison with known and
previously predicted human gene sets revealed that these
21,037 hypothesized gene clusters contain 5,155 new gene
candidates.
(2) The primary structure of 21,037 human gene candidates
was precisely described. For the majority of them we observed
that both first introns and last exons tended to be longer than
the other introns and exons, respectively, implying the
possible existence of intriguing mechanisms of transcrip-
tional control in first introns.
(3) We discovered the existence of 847 human gene
candidates that could not be convincingly mapped to the
human genome. This result suggested that up to 3.7%–4.0%
of the human genome sequences (NCBI build 34 assembly)
may be incomplete, containing either unsequenced regions
or regions where sequence assembly has been performed in
error.
(4) Based on H-Inv cDNAs, we were able to define an
experimentally validated AS dataset. The dataset was com-
posed of 3,181 loci that encoded a total of 8,553 AS isoforms.
In the 55% of ORFs containing AS isoforms, the pattern of
alternative exon usage was found to encode different func-
tional domains at the same loci.
(5) A standardized method of human curation for the H-Inv
cDNAs was created under the tacit consensus of international
collaborations. Using this method, we classified 19,574 H-Inv
proteins into five categories based on sequence similarity and
structural information. We were able to assign functional
definitions to 9,139 proteins, to locate function- or family-
defining InterPro domains in 2,503 further proteins, and to
identify 7,800 transcripts as good candidates for hypothetical
proteins.
(6) A total of 1,892 H-Inv proteins were assigned identities
as one of 656 different EC-numbered enzymes. This enzyme
library includes 32 newly identified human enzymes on
known metabolic pathway maps and comprises the largest
collection of computationally validated human enzymes.
(7) Based on a variety of supporting evidence, 6.5% of H-
Inv loci (1,377 loci) do not have a good protein-coding ORF,
of which 296 loci are strong candidates for ncRNA genes.
(8) We identified and mapped 72,027 SNPs and indels to
unique positions on 16,861 loci. Of these, 13,215 non-
synonymous SNPs, 358 nonsense SNPs, and 452 indels were
found in coding regions and may alter protein sequences,
cause phenotypic effects, or be associated with disease. In
addition, we identified 216 polymorphic microsatellite
repeats on 213 loci, 25 of which were located in coding
regions.
(9) During human proteome analysis, it was suggested that
the basic gene set of humans might have been established in
the early stage of animal evolution. Our analysis of UTRs
revealed that insertions or deletions near coding regions were
rare when compared with substitutions, though in some cases
drastic changes such as UTR replacements occurred.
(10) A consequence of the annotation process and our
related research was the development of the H-InvDB to
contain our annotation work. H-InvDB is a comprehensive
database of human FLcDNA annotations that stores all
information produced in this project. As a subdivision of
H-InvDB, we developed two other specialized subdatabases:
H-Angel and DiseaseInfo Viewer. H-Angel is a database of
gene expression patterns for 19,276 loci. DiseaseInfo Viewer
is a database of known disease-related genes and loci
co-localized with 694 orphan pathologies. These pathologies
have been mapped to chromosomal regions, but their associated
genes have not yet been identified.
In the H-Inv project, we collected as many FLcDNAs as
possible and conducted extensive analyses concerning the
quality of cDNAs, such as detection of frameshift errors,
retained introns, and internal poly-A priming, under a
unified criterion. Although these analyses are still in an
elementary state, we store these results in H-InvDB to share
this information with the biological community. We believe
that this is an important contribution of our project, because
it will provide a reliable way to control the quality of the
cDNA clones. In the future, this information will be useful for
improving the methods of clone library construction.
It has been suggested that the human genome encodes
30,000 to 40,000 genes. In this study we comprehensively
evaluated more than 21,000 human gene candidates (up to
70% of the total). Thus, efforts should be continued by the H-
Inv consortium and others to ‘‘fully’’ characterize the human
transcriptome. For this purpose new technologies should be
implemented that are more sensitive in detecting rarely
expressed genes and AS transcripts. Nevertheless, there are
unavoidable limitations for human cDNA collections, such as
the identification of embryo-specific genes, for which other
approaches should be employed. One alternative is the use of
ab initio predictions from genomic sequences, in conjunction
with expression profiling studies, to identify rarely expressed
genes that share structural similarity to known genes. Addi-
tionally, a better characterization of cis-regulatory element
units may help to define the boundary of other genes that are
undetected by current gene prediction programs. Another
area that remains to be explored is the identification of
potential hidden RNA gene families that may play vital roles,
such as the recently uncovered family of microRNA genes,
which is involved in the regulation of expression of other
genes (for review see Ambros 2001; Moss 2002).
The proteome determination aspects of this project,
including the identification of new enzymes and hypothetical
proteins, should stimulate more focused biochemical studies.
The functional classifications may allow definition of sub-
proteomes that are related to different physiological pro-
cesses. The H-Inv transcriptome based on the definition of a
consensus proteome (the H-Inv proteins) links both the
analysis of genomic DNA and direct proteome analysis with
the study of expressed mRNA analysis from different tissues,
cells, and disease states. It creates a standard for the
comparison of disease-related alterations of the human
proteome. Moreover, comparison with pathogen proteomes
may yield many possible drug target proteins. Also, the
annotation of ncRNAs raises the possibility of novel ‘‘smart’’
therapeutics that could either inhibit or mimic the mecha-
nisms of these RNAs.
The H-Inv project is the first ever comprehensive compi-
lation of curated and annotated human FLcDNAs. The
project may lead to a more complete understanding of the
human transcriptome and, as a result, of the human
proteome. The preceding examples of the importance of
the H-Inv data in understanding human physiology and
evolution represent just a small fraction of the research
potential of the H-InvDB.
In conclusion, the H-InvDB platform constructed to hold
the results of the comprehensive annotations performed by
our international team of collaborators represents a sub-
stantial contribution to resources that are needed for further
exploration of both human biology and pathology.
Materials and Methods
cDNA resources. 41,118 H-Inv cDNAs were sequenced by the
Human Full-Length cDNA Sequencing Project (Ota et al. 1997;
Yudate et al. 2001; Ota et al. 2004) at the Helix Research Institute, the
Institute of Medical Science at the University of Tokyo, and the
Kazusa DNA Research Institute (20,999 sequences in total); the
Kazusa cDNA Sequencing Project (Kikuno et al. 2002) at the Kazusa
DNA Research Institute (2,000 sequences); the Mammalian Gene
Collection (Strausberg et al. 1999) at the National Institutes of Health
in the United States (11,806 sequences); the German Human cDNA
Project (Wiemann et al. 2001) coordinated by the Deutsches
Krebsforschungszentrum in Heidelberg (5,555 sequences); and the
Chinese National Human Genome Center at Shanghai (Hu et al.
2000) (758 sequences).
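As a quick consistency check on the figures above, the per-source counts can be summed and compared with the stated total of 41,118. The following is a minimal Python sketch; the dictionary keys are informal labels chosen here for readability, not identifiers used by the H-Inv project.

```python
# Per-source cDNA counts as quoted in the paragraph above.
# The keys are informal labels (an assumption made for readability).
SOURCE_COUNTS = {
    "Human Full-Length cDNA Sequencing Project": 20_999,
    "Kazusa cDNA Sequencing Project": 2_000,
    "Mammalian Gene Collection": 11_806,
    "German Human cDNA Project": 5_555,
    "Chinese National Human Genome Center at Shanghai": 758,
}
STATED_TOTAL = 41_118


def tally(counts):
    """Return the overall total and each source's share of it (as a fraction)."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = tally(SOURCE_COUNTS)

# The five contributions should add up to the total stated in the text.
assert total == STATED_TOTAL

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The assertion passes only if the five contributions add up to the stated total, so it also guards against transcription errors when the counts are copied elsewhere.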
Mapping human cDNAs to the human genome and the comparison
of the mapped H-Inv cDNAs with other annotated datasets. We have
mapped human cDNA sequences to the human genome sequence
corresponding to the NCBI build 34 assembly. The datasets we used
were a set of 41,118 H-Inv cDNAs and a set of 37,488 human RefSeq
sequences available on 15 July 2002 and on 1 September 2003,
respectively. All the revisions for H-Inv cDNA sequences until August
2003 were applied in the datasets. Before performing the mapping
procedure, all the repetitive and low-complexity sequences in all the
cDNA sequences were masked using RepeatMasker
(http://ftp.genome.washington.edu/RM/RepeatMasker.html) and Repbase 7.5.
Then we used the cross_match program to mask the remaining
vector sequences in each cDNA sequence. Any poly-A tails were also
masked by using a custom-made Perl script. In the first step of the
mapping procedure, we conducted BLASTN (ver.2.2.6) searches of all
the sequences against the human genome sequence and extracted the
corresponding genomic regions for each query sequence. Then we
used est2genome (EMBOSS package ver.2.7.1) to align each sequence
to the genomic region with a threshold of 95% identity and 90%
coverage. Coverage of each cDNA sequence was calculated after excluding
the vector and poly-A regions that were masked in the
previous step. If a sequence mapped to multiple positions on
the human genome, we selected its best locus based on
identity, length coverage, and number of exons. As
a result, 77,315 sequences (including 40,140 cDNAs from the H-Inv
project) were successfully mapped onto the human genome and were
clustered into 38,587 clusters based on sharing at least 1 bp of an