The Sri Lankan Personal Genome Project

Article · November 2011
DOI: 10.4038/sljbmi.v2i1.3711
Leading Article
The Sri Lankan Personal Genome Project: an overview
Prof. Vajira H. W. Dissanayake MBBS, PhD
Professor, Department of Anatomy; Medical Geneticist, Human Genetics Unit, Faculty of Medicine, University
of Colombo, Sri Lanka
E-Mail address:
Pubudu S. Samarakoon BSc, MSc
Tutor, Biomedical Informatics Course, Postgraduate Institute of Medicine, University of Colombo, Sri Lanka.
Current address: Research Fellow, Department of Medical Genetics, University of Oslo, Norway.
E-Mail address:
Dr. Vinod Scaria MBBS
Scientist, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi, India
E-Mail address:
Ashok Patowary BSc, MSc
Scientist, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi, India
E-Mail address:
Dr. Sridhar Sivasubbu PhD
Scientist, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi, India
E-Mail address:
Dr. Rajesh S. Gokhale PhD
Director, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi, India
E-mail address:
Sri Lanka Journal of Bio-Medical Informatics 2011;2(1):4-8
The first Sri Lankan Personal Genome was sequenced heralding the entry of Sri Lanka into the new era of whole
genome sequencing. This paper explains the background and the rationale for the project, gives a brief overview
of what was found in the Sri Lankan Personal Genome, and discusses the future directions of the project.
Keywords - Sri Lankan Personal Genome Project; Sri Lanka; Genome; Sinhalese; Whole Genome Sequencing
With the completion of the Human Genome Project (HGP)(1,2) and a decade of human
genomics behind us, we are at an interesting juncture in human history. New technologies have
enabled the sequencing of complete human genomes at a fraction of what was spent on the
HGP. These technologies have also significantly improved the scale and ease of sequencing.
As a result, it is now possible to sequence entire human genomes and understand the genomic
make-up of an individual, with wide-ranging potential applications in healthcare(3).
Major projects worldwide that followed the Human Genome Project, including the
HapMap Project(4) and the 1000 Genomes Project(5), have been cataloguing human genetic
variation at rapid speed. In addition, they have fuelled the growth of technologies and
analytical methods to scan entire genomes for informative genetic markers, aimed at
understanding the differences and similarities between individuals. This is the first step
towards understanding genotype-phenotype correlations. Such large studies have been
undertaken by multiple groups around the world(6), and have resulted in the
identification of a large number of markers associated with complex disorders and drug
response(7). This is just the tip of the iceberg: new associations continue to be
reported in the scientific literature on a daily basis.
In addition to understanding genetic variations and how they contribute to disease, a large
body of work has been aimed at understanding other important aspects such as epigenetic
mechanisms and genomic regulation. This was made possible by new genomic tools that
enabled researchers to address questions at a genomic level, and by developments in
bioinformatics: cheaper and faster computers that made large-scale data analysis feasible,
and robust algorithms to mine data and model biological phenomena on a genomic scale.
The Sri Lankan personal genome
Today any country aspiring to provide its people access to state of the art healthcare cannot
ignore the rapid advances in the field of human genomics. It is imperative at this juncture for
every country to acquire the much-required tools and know-how. In addition, it is also
necessary to create the baseline data for understanding the genetic diversity of its population.
The Sri Lankan Personal Genome Project is the first step in this direction in Sri Lanka, and
marks the entry of Sri Lanka into the exciting field of whole genome sequencing. Sri Lanka is
home to over 20 million people with rich racial, cultural and linguistic diversity(8). The
earliest evidence (34,000 BP) of anatomically modern man in South Asia is found in Sri
Lanka(9). The rich diversity of human populations on the island has been influenced by
migration from mainland India. Sri Lanka also has a rich heritage in organised medical
care. The hospital at Mihintale (437-367 BC) is the most ancient hospital to be discovered
in the world(10). The Sri Lankan population presently consists of six major populations: the
Sinhalese, Sri Lankan Tamils, Indian Tamils, Moors, Burghers, and Malays(8). It is also home
to smaller populations such as the Vaddhas, the descendants of the original inhabitants
of the island who were geographically isolated from other populations, and the Kaffirs,
descendants of African slaves brought to the island over 500 years ago.
To understand the genetic diversity of the Sri Lankan populations, and to create the baseline
data for disease association studies, we had earlier created the Sri Lankan Genome Variation
Database(11). This database contains information on Single Nucleotide Polymorphisms
(SNPs) found in Sinhalese, Sri Lankan Tamils and Moors, the three major ethnic groups in
the Sri Lankan population. The database presently contains data, including genotype
frequencies, on 34 genomic variations spanning 14 medically important genes. The
database has been designed in keeping with international standards for describing and
annotating variations, including those of the Human Genome Variation Society (HGVS)(12).
In the spirit of collaboration and open access to data, the database also accepts
submissions from the research community and thus offers researchers and clinicians a
standard access point to the spectrum of genetic variations in the population. The resource is
accessible at URL:
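As an illustration of the genotype frequencies such a database reports, the following is a minimal Python sketch of how genotype and allele frequencies could be derived from a list of observed diploid genotypes. The function name and the sample genotypes are hypothetical and are not taken from the Sri Lankan Genome Variation Database.

```python
from collections import Counter


def genotype_and_allele_frequencies(genotypes):
    """Return (genotype_freqs, allele_freqs) for diploid genotypes such as "AG"."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    if any(len(g) != 2 for g in genotypes):
        raise ValueError("each genotype must consist of exactly two alleles")
    # Normalise so that "GA" and "AG" are counted as the same genotype.
    normalised = ["".join(sorted(g.upper())) for g in genotypes]
    genotype_counts = Counter(normalised)
    allele_counts = Counter(allele for g in normalised for allele in g)
    n_genotypes = len(normalised)
    n_alleles = 2 * n_genotypes  # two alleles per individual
    genotype_freqs = {g: c / n_genotypes for g, c in genotype_counts.items()}
    allele_freqs = {a: c / n_alleles for a, c in allele_counts.items()}
    return genotype_freqs, allele_freqs


# Hypothetical genotypes for one SNP in ten individuals (illustrative only).
sample = ["AA", "AG", "GA", "GG", "AA", "AG", "AA", "AG", "GG", "AA"]
genotype_freqs, allele_freqs = genotype_and_allele_frequencies(sample)
print(genotype_freqs)
print(allele_freqs)
```

In practice such frequencies would be computed separately for each ethnic group so that they can be compared across populations.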
As a proof of concept towards the goal of interpreting and analysing complete genome data,
we sequenced a complete genome of an anonymous Sinhalese male of Sri Lankan origin with
both upcountry and low country descent. Sequencing was performed using next-generation
sequencing technology, with over 20x coverage of the genome. Analysis of the genome
identified 2,811,918 SNPs, of which 222,739 were novel in comparison
to dbSNP(13) build 131. These novel SNPs account for about 7.9% of all the SNPs identified in the
genome, which points to the need to sequence more complete genomes to obtain a more
comprehensive picture of the spectrum of genetic variation in humans. The analysis also
identified 489,921 insertion-deletion (InDel) events in the genome.
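The idea of a "novel" SNP can be sketched in a few lines of Python: a called variant is treated as novel if it is absent from a reference catalogue. The article does not describe how the comparison with dbSNP was actually performed, so matching on chromosome, position and alleles is an assumption made here, and the toy variants are hypothetical. The final lines reproduce the proportion of novel SNPs from the two counts reported in the text.

```python
def novel_variants(called, known):
    """Return the variants in `called` that are absent from `known`."""
    known_set = set(known)  # set lookup keeps the comparison fast
    return [variant for variant in called if variant not in known_set]


# Toy (chromosome, position, reference allele, alternate allele) tuples.
called = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 500, "G", "A"),
]
known = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 500, "G", "A"),
]
novel = novel_variants(called, known)
print(novel)

# Counts reported in the article.
TOTAL_SNPS = 2_811_918
NOVEL_SNPS = 222_739
novel_percent = 100 * NOVEL_SNPS / TOTAL_SNPS
print(f"Novel SNPs: {novel_percent:.1f}% of all SNPs identified")
```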
Future directions
The immediate strategic goal of the Sri Lankan Personal Genome Project for 2011-2012 is
to understand in depth the genetic variations and their potential phenotypic consequences.
Thus, for the years 2011-2012 we articulate our research in terms of three main streams:
annotating the genetic variations unique to Sri Lankans; studying the interactions between
genes in relation to disease phenotypes; and visualising the annotation of the Sri Lankan genome
via the “Sri Lankan Genome Browser” and the “Sri Lankan Genome Variation Database”.
The first Sri Lankan Personal Genome has revealed over 2.8 million single nucleotide
variations, of which over 200,000 have hitherto not been identified in other populations,
as revealed by comparison with the variations catalogued in dbSNP build 131.
We hope that in-depth annotation of these variations will provide
crucial insights into phenotypes that could be specific to the Sri Lankan population.
Coupling this knowledge with associated clinical phenotypes and traits could enable
scientists to generate new hypotheses about such associations. These
hypotheses can then be validated by specifically genotyping these variations in the Sri
Lankan population. Recent advances in bioinformatics and data mining offer the
tools required for the annotation and functional interpretation of SNP data [for example, see
the article by Harendra et al. in this issue of the Journal(14)]. The value of this information can
be further enhanced by comparative studies with data from other projects, including
the 1000 Genomes Project and other population-specific personal genome projects.
Recent advances in genomic technologies have enabled researchers to unravel many
biological pathways and processes in molecular detail. It is imperative to exploit these data
and perform integrative analyses so as to understand the biological context and functional
consequences of genomic variations. This would include (i) understanding biological
interaction networks, including genetic interaction networks and protein interaction networks,
from public databases like OMIM and HuGE Navigator; (ii) curation and integration of the
interaction networks so as to understand the molecular processes of diseases and drug metabolic
pathways, including integrating association data from public repositories and resources;
and (iii) integration of the variation
data with the gene interaction network to understand the potential consequences of the
genomic variations, which could then be modelled and validated in model systems.
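Step (iii) above can be pictured with a small Python sketch that overlays variants on a gene interaction network and lists the direct interaction partners of each gene carrying a variant. This is only one simple way to combine the two data types, not the method used by the project. The gene names, variant identifiers and function names are hypothetical placeholders.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected adjacency map from (gene, gene) interaction pairs."""
    network = defaultdict(set)
    for gene_a, gene_b in edges:
        network[gene_a].add(gene_b)
        network[gene_b].add(gene_a)
    return network


def partners_of_variant_genes(variant_to_gene, network):
    """For each gene carrying a variant, list its direct interaction partners."""
    genes_with_variants = set(variant_to_gene.values())
    return {
        gene: sorted(network.get(gene, ()))
        for gene in sorted(genes_with_variants)
    }


# Hypothetical interaction pairs and variant-to-gene assignments.
interactions = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_D", "GENE_E")]
variant_to_gene = {"var1": "GENE_B", "var2": "GENE_D", "var3": "GENE_X"}

network = build_network(interactions)
partners = partners_of_variant_genes(variant_to_gene, network)
for gene, neighbours in partners.items():
    print(gene, "->", neighbours)
```

A gene that carries a variant but has no recorded interactions (GENE_X here) is kept with an empty partner list, so that gaps in the network data remain visible.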
To ensure wide use and ease of interpretation, we have made the genomic
variations and annotations of the Sri Lankan Personal Genome available on the Sri Lankan Genome
Browser, an online genome browser built on the Generic Model Organism Database (GMOD)
GBrowse(15). It will serve as the central hub for the exchange of data and the visualisation of
genomic variations and their annotations, including data that will come out of the Sri
Lankan Personal Genome Project in the future (Figure 1). The resource is freely accessible
online at
Figure 1. The Sri Lankan Genome Browser provides quick visualization of genomic
variations and would ease the annotation and interpretation of genetic variations.
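Genome browsers such as GBrowse can display features supplied in the tab-separated GFF3 format. The article does not say how the variations were loaded into the Sri Lankan Genome Browser, so the Python sketch below is only an assumption about what a single-nucleotide variation might look like as a nine-column GFF3 record. The function name, the source label, the lowercase `ref` and `alt` attribute tags and the example values are all hypothetical.

```python
def snv_to_gff3(seqid, position, ref, alt, variant_id, source="example"):
    """Format one single-nucleotide variation as a nine-column GFF3 line."""
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("this sketch handles single-nucleotide variations only")
    # Lowercase attribute tags are free for application use in GFF3.
    attributes = f"ID={variant_id};ref={ref};alt={alt}"
    columns = [
        seqid,          # sequence (chromosome) name
        source,         # program or project that produced the feature
        "SNV",          # feature type
        str(position),  # start (1-based)
        str(position),  # end (same as start for a single base)
        ".",            # score: not available
        "+",            # strand of the reference sequence
        ".",            # phase: only meaningful for coding features
        attributes,
    ]
    return "\t".join(columns)


# Hypothetical variant, for illustration only.
line = snv_to_gff3("chr1", 1000, "A", "G", "snv_0001")
print("##gff-version 3")
print(line)
```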
In the future, we hope to unravel the genetic diversity of Sri Lankan populations by
sequencing more individuals from different racial groups. We also hope to collaborate both
nationally and internationally to assimilate knowledge and expertise and possibly co-create
resources which would enable the interpretation of data and its application in healthcare. This
includes participation in co-creating open resources like OpenPGx for
interpreting genomic variations and participation in collaborative initiatives aimed at
understanding the diversity of Asian populations. We also recognise that the application of
genomics in healthcare will not be possible without educating and involving medical
professionals in genomics research. This includes educating them on analysing and
interpreting genomic information and on using such information
in their clinical practice.
Declaration of conflicts of interest
The authors declare that they have no conflicts of interest.
The Sri Lankan Personal Genome Project was funded by the NOMA Project of the
Postgraduate Institute of Medicine, University of Colombo, Sri Lanka.
1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of
the human genome. Nature 2001; 409: 860-921. doi:10.1038/35057062
2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. The sequence of the human
genome. Science 2001; 291: 1304-51. doi:10.1126/science.1058040
3. Green ED, Guyer MS. Charting a course for genomic medicine from base pairs to
bedside. Nature 2011; 470: 204-13. doi:10.1038/nature09764
4. The International HapMap Consortium. The International HapMap Project. Nature 2003;
5. 1000 Genomes Project Consortium. A map of human genome variation from population-
scale sequencing. Nature 2010; 467(7319):1061-73
6. Visited on 1/3/2011.
7. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA.
Potential etiologic and functional implications of genome-wide association loci for human
diseases and traits. Proc Natl Acad Sci U S A 2009; 106: 9362-7. doi:
8. Text-11-12-06.pdf. Visited on 1/3/2011.
9. S.U. Deraniyagala. Pre and Proto-historic Settlement in Sri Lanka. Proceedings of the
XIIIth International Congress of the International Union of Prehistoric and Protohistoric
Sciences. 1998:277-285.
10. Müller-Dietz HE, Die Krankenhaus-ruinen in Mihintale (Ceylon). Historia Hospitalium
11. Samarakoon PS, Jayasekara RW, Dissanayake VHW. The Sri Lankan Genome Variation
Database. Sri Lanka Journal of Biomedical Informatics 2011;2(1):10-21. DOI:
12. Scriver CR, Nowacki PM, Lehvaslaiho H. Guidelines and recommendations for content,
structure, and deployment of mutation databases. Hum Mutat 1999; 13:344-50. PMID:
13. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP:
the NCBI database of genetic variation. Nucleic Acids Res. 2001; 29:308-11.
14. Harendra GG, Jayasekara RW, Dissanayake VHW. In silico analysis of Single Nucleotide
Polymorphisms (SNPs) in the Heparin-Binding EGF-like Growth Factor (HBEGF) gene
and their allelic profiles in the Sri Lankan population: a comprehensive approach to
prioritise SNPs for candidate gene studies. Sri Lanka Journal of Bio-Medical Informatics
2011;2(1):22-38. DOI:
15. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE,
Harris TW, Arva A, Lewis S. The generic genome browser: a building block for a model
organism system database. Genome Res. 2002; 12:1599-610. doi: 10.1101/gr.403602