
DNA sequence and analysis of human chromosome 8


The International Human Genome Sequencing Consortium (IHGSC) recently completed a sequence of the human genome. As part of this project, we have focused on chromosome 8. Although some chromosomes exhibit extreme characteristics in terms of length, gene content, repeat content and fraction segmentally duplicated, chromosome 8 is distinctly typical in character, being very close to the genome median in each of these aspects. This work describes a finished sequence and gene catalogue for the chromosome, which represents just over 5% of the euchromatic human genome. A unique feature of the chromosome is a vast region of approximately 15 megabases on distal 8p that appears to have a strikingly high mutation rate, which has accelerated in the hominids relative to other sequenced mammals. This fast-evolving region contains a number of genes related to innate immunity and the nervous system, including loci that appear to be under positive selection; these include the major defensin (DEF) gene cluster and MCPH1, a gene that may have contributed to the evolution of expanded brain size in the great apes. The data from chromosome 8 should allow a better understanding of both normal and disease biology and genome evolution.
© 2006 Nature Publishing Group
Chad Nusbaum¹, Tarjei S. Mikkelsen¹, Michael C. Zody¹, Shuichi Asakawa², Stefan Taudien³, Manuel Garber¹, Chinnappa D. Kodira¹, Mary G. Schueler⁴, Atsushi Shimizu², Charles A. Whittaker¹, Jean L. Chang¹, Christina A. Cuomo¹, Ken Dewar¹, Michael G. FitzGerald¹, Xiaoping Yang¹, Nicole R. Allen¹, Scott Anderson¹, Teruyo Asakawa², Karin Blechschmidt³, Toby Bloom¹, Mark L. Borowsky¹, Jonathan Butler¹, April Cook¹, Benjamin Corum¹, Kurt DeArellano¹, David DeCaprio¹, Kathleen T. Dooley¹, Lester Dorris III¹, Reinhard Engels¹, Gernot Glöckner³, Nabil Hafez¹, Daniel S. Hagopian¹, Jennifer L. Hall¹, Sabine K. Ishikawa², David B. Jaffe¹, Asha Kamat¹, Jun Kudoh², Rüdiger Lehmann³, Tashi Lokitsang¹, Pendexter Macdonald¹, John E. Major¹, Charles D. Matthews¹, Evan Mauceli¹, Uwe Menzel³, Atanas H. Mihalev¹, Shinsei Minoshima², Yuji Murayama², Jerome W. Naylor¹, Robert Nicol¹, Cindy Nguyen¹, Sinéad B. O’Leary¹, Keith O’Neill¹, Stephen C. J. Parker¹, Andreas Polley³, Christina K. Raymond¹, Kathrin Reichwald³, Joseph Rodriguez¹, Takashi Sasaki², Markus Schilhabel³, Roman Siddiqui³, Cherylyn L. Smith¹, Tam P. Sneddon⁵, Jessica A. Talamas¹, Pema Tenzin¹, Kerri Topham¹, Vijay Venkataraman¹, Gaiping Wen³, Satoru Yamazaki², Sarah K. Young¹, Qiandong Zeng¹, Andrew R. Zimmer¹, Andre Rosenthal³, Bruce W. Birren¹, Matthias Platzer³, Nobuyoshi Shimizu² & Eric S. Lander¹
The International Human Genome Sequencing Consortium (IHGSC) recently completed a sequence of the human genome (ref. 1). As part of this project, we have focused on chromosome 8. Although some chromosomes exhibit extreme characteristics in terms of length, gene content, repeat content and fraction segmentally duplicated, chromosome 8 is distinctly typical in character, being very close to the genome median in each of these aspects. This work describes a finished sequence and gene catalogue for the chromosome, which represents just over 5% of the euchromatic human genome. A unique feature of the chromosome is a vast region of ~15 megabases on distal 8p that appears to have a strikingly high mutation rate, which has accelerated in the hominids relative to other sequenced mammals. This fast-evolving region contains a number of genes related to innate immunity and the nervous system, including loci that appear to be under positive selection (ref. 2); these include the major defensin (DEF) gene cluster (refs 3, 4) and MCPH1 (refs 5, 6), a gene that may have contributed to the evolution of expanded brain size in the great apes. The data from chromosome 8 should allow a better understanding of both normal and disease biology and genome evolution.
The finished sequence of chromosome 8 contains 145,556,489 bases and is interrupted by only four euchromatic gaps, one gap at the 8p telomere and one gap containing the centromeric heterochromatin (Fig. 1 and Supplementary Table S1). These gaps are refractory to current cloning and mapping technology. The estimated total size of the euchromatic gaps is 427 kilobases (kb), based on direct sizing of three gaps and estimation of the remaining two gaps at the genome-wide average of ~100 kb each. This corresponds to ~0.3% of the euchromatic length of the chromosome, similar to the genome average (refs 1, 7–11). In all, 182.3 megabases (Mb) of finished sequence were generated by the Broad Institute of MIT and Harvard (formerly Whitehead Institute/MIT Center for Genome Research (WICGR)), 27.9 Mb by Keio University School of Medicine, 8.4 Mb by the Institute of Molecular Biotechnology in Jena, and 5.8 Mb by 10 other groups (Supplementary Tables S2 and S3). These sequences (which include overlap) were combined to yield the finished path (see Methods).
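The gap arithmetic in the preceding paragraph can be checked with a short sketch. The constant and function names are illustrative, and the euchromatic length is approximated here as finished sequence plus estimated gaps, which is an assumption rather than the definition used for the published figure.

```python
# Figures quoted in the text; the names are illustrative, not from the paper.
FINISHED_BASES = 145_556_489    # finished sequence length (bp)
EUCHROMATIC_GAP_TOTAL_KB = 427  # estimated total size of euchromatic gaps (kb)
UNSIZED_GAPS = 2                # gaps whose size was estimated, not measured
ASSUMED_GAP_KB = 100            # genome-wide average assumed per unsized gap (kb)


def directly_sized_total_kb() -> int:
    """Combined size (kb) attributable to the three directly sized gaps."""
    return EUCHROMATIC_GAP_TOTAL_KB - UNSIZED_GAPS * ASSUMED_GAP_KB


def gap_fraction() -> float:
    """Gap size as a fraction of euchromatic length.

    Assumption: euchromatic length ~ finished sequence + estimated gaps.
    """
    gap_bp = EUCHROMATIC_GAP_TOTAL_KB * 1_000
    return gap_bp / (FINISHED_BASES + gap_bp)


if __name__ == "__main__":
    print(f"directly sized gaps: {directly_sized_total_kb()} kb")
    print(f"gap fraction: {gap_fraction():.2%}")
```

Under this approximation the fraction comes out just under 0.3%, in line with the value quoted above.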
We assessed the local accuracy of the clone path by aligning paired-end sequences from a human Fosmid library (WIBR2, representing ×10 physical coverage) to the finished sequence (ref. 7). Errors in the clone path were detected by identifying discrepancies between the predicted and observed distances between Fosmid ends (ref. 7). This revealed two deleted clones, which were replaced. Finally, an independent quality assessment exercise commissioned by NHGRI estimated the accuracy of the finished sequence at less than 1 error in 100,000 bases (ref. 12; J. Schmutz, personal communication).
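The paired-end check described above can be illustrated with a minimal sketch. It is not the pipeline used for this work: the `EndPair` type, the 40-kb expected insert size and the 10-kb tolerance are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndPair:
    """Hypothetical record of one Fosmid's two end alignments."""
    clone_id: str
    left_pos: int   # position of one end on the finished sequence (bp)
    right_pos: int  # position of the mate on the finished sequence (bp)


def flag_discrepancies(pairs, expected_bp=40_000, tolerance_bp=10_000):
    """Return (clone_id, observed - expected) for pairs outside the tolerance.

    A strongly negative difference (ends closer than expected) suggests that
    sequence is missing from the finished path; a strongly positive one
    suggests extra sequence. Both thresholds are assumed values.
    """
    flagged = []
    for pair in pairs:
        observed = abs(pair.right_pos - pair.left_pos)
        delta = observed - expected_bp
        if abs(delta) > tolerance_bp:
            flagged.append((pair.clone_id, delta))
    return flagged


if __name__ == "__main__":
    toy_pairs = [  # made-up coordinates, for illustration only
        EndPair("fosmid_a", 100_000, 140_500),
        EndPair("fosmid_b", 200_000, 215_000),
    ]
    print(flag_discrepancies(toy_pairs))
```

In practice a region would be called discrepant only when several independent Fosmids agree, which the ×10 physical coverage makes possible; the sketch flags single pairs for brevity.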
Several analyses support the idea that nearly the entire euchromatic region of chromosome 8 is present and accurately represented. From the well-curated RefSeq (ref. 13) data set, 681 transcripts (from 573 unique genes) mapped to chromosome 8. All but one of these are present and complete in the finished sequence. The finished sequence shows excellent co-linearity with the genetic map (ref. 14; Supplementary Fig. S1). Among 247 sequence-based genetic markers (Supplementary Table S4) there are six discrepancies. One discrepancy consists of eight markers and spans a region in 8p23 known to be the site of a polymorphic inversion in the human population (refs 15, 16; see below). Five discrepancies each consist of single markers out of order by one position; all occur in small regions where the genetic map shows no recombination in one of the two sexes (Supplementary Table S4). The sequence also shows good agreement with the radiation hybrid (RH) map (ref. 17; Supplementary Table S5).

¹Broad Institute of MIT and Harvard, 320 Charles St, Cambridge, Massachusetts 02141, USA.
²Department of Molecular Biology, Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan.
³Genome Analysis, Institute of Molecular Biotechnology, Beutenbergstrasse 11, Jena 07745, Germany.
⁴National Human Genome Research Institute, National Institutes of Health, 50 South Drive Rm 5529, Bethesda, Maryland 20892, USA.
⁵HUGO Gene Nomenclature Committee (HGNC), The Galton Laboratory, Department of Biology, University College London, Wolfson House, 4 Stephenson Way, London NW1 2HE, UK.
†Present addresses: MIT Center for Cancer Research, 77 Massachusetts Avenue E18-570, Cambridge, Massachusetts 02139, USA (C.A.W.); McGill University and Genome Quebec Innovation Centre, Montreal, Quebec H3A 1A4, Canada (K.D.); Department of Genetics and Pathology, Uppsala University, SE-751 85 Uppsala, Sweden (U.M.); Photon Medical Research Center, Hamamatsu University School of Medicine, Handayama, Hamamatsu, Shizuoka 431-3192, Japan (S.M.); Boston University Bioinformatics and Systems Biology Program, 24 Cummington St, Boston, Massachusetts 02215, USA (S.C.J.P.); TraitGenetics GmbH, Am Schwabeplan 1b, 06466 Gatersleben, Germany (A.P.); University Clinic for Child and Adolescent Psychiatry, University of Duisburg-Essen, Virchowstr. 174, 45147 Essen, Germany (K.R.); GSF-Forschungszentrum für Umwelt und Gesundheit, Ingolstädter Landstraße 1, 85674 Neuherberg, Germany (G.W.); Signature Diagnostics AG, Voltaireweg 4B, 14469 Potsdam, Germany (A.R.).

Nature, Vol 439, 19 January 2006, doi:10.1038/nature04406
We produced a manually curated gene catalogue, containing 793 gene loci and 301 pseudogene loci (see Methods). The catalogue includes all previously known genes on chromosome 8 (Table 1). According to the Hawk2 categorization scheme (ref. 18), there are 614 ‘known’ genes, 109 ‘novel CDS’, 43 ‘novel transcripts’, 14 ‘putatives’ and 13 ‘gene fragments’. The small sets of loci in the novel and putative categories were annotated by spliced expressed sequence tag (EST) evidence only; some ‘putative novel’ loci may prove to be pseudogenes. Comparison of manual annotation performed at the Broad Institute of MIT and Harvard with manual annotation for specific regions done at Jena and Keio indicated that they were largely the same, and that virtually all differences could be attributed to edge effects (see Supplementary Information).
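As a consistency check on the catalogue, the category counts above and the per-category mean gene lengths from Table 1 can be combined in a few lines. This is only a sketch; the dictionary layout and function names are illustrative.

```python
# (gene count, mean gene length in bp) per category, as listed in Table 1.
CATEGORIES = {
    "Known gene": (614, 81_744),
    "Novel CDS": (109, 21_890),
    "Novel transcript": (43, 96_268),
    "Putative gene": (14, 45_433),
    "Gene fragment": (13, 648),
}
PSEUDOGENES = 301


def total_genes() -> int:
    """Sum of gene loci over all categories."""
    return sum(count for count, _ in CATEGORIES.values())


def category_percentages() -> dict:
    """Share of gene loci per category, rounded to whole percent."""
    total = total_genes()
    return {name: round(100 * count / total)
            for name, (count, _) in CATEGORIES.items()}


def weighted_mean_gene_length() -> int:
    """Count-weighted mean gene length (bp) over all categories."""
    summed = sum(count * length for count, length in CATEGORIES.values())
    return round(summed / total_genes())


def pseudogene_percentage() -> int:
    """Pseudogenes as a share of all annotated loci (genes + pseudogenes)."""
    return round(100 * PSEUDOGENES / (total_genes() + PSEUDOGENES))


if __name__ == "__main__":
    print(total_genes(), category_percentages())
    print(weighted_mean_gene_length(), pseudogene_percentage())
```

The totals reproduce the 793 gene loci, the rounded gene percentages, the 72,334-bp overall mean gene length and the 28% pseudogene share given in Table 1, which suggests that the pseudogene percentage is taken over all 1,094 loci while the gene percentages are taken over the 793 gene loci.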
Full-length transcripts of known genes contain an average of 9.9 exons, comparable to recently published reports (refs 8–11, 19), and have an average length of 3,056 base pairs (bp); internal exons have an average length of 155 bp. There is evidence of extensive alternative splicing. Gene loci have an average of 4.1 distinct transcripts, with 63% having at least two transcripts, values that are similar to recent reports (refs 8, 9, 11, 20). Of the 301 pseudogenes on chromosome 8, ~84% are processed pseudogenes arising from retrotransposition; the remaining 16% are unprocessed. We also identified 13 tRNA genes (Supplementary Table S6). Examples of genes that represent extremes from these averages are described in Supplementary Information.
Several aspects of the genome landscape are notable. The overall gene density is 5.6 genes Mb⁻¹, below the genome average of ~10 genes Mb⁻¹. Gene distribution is highly heterogeneous, with 44 gene deserts (≥500 kb without a coding gene, Supplementary Table S7) that together comprise 41.9 Mb or ~29% of the total length. The overall G+C content is 39.2%, but varies substantially across the chromosome (Fig. 1). Nearly half of the chromosome is composed of repeat sequences, with transposable element fossils comprising 44.5%, low complexity sequence (including simple sequence repeats and satellite sequences) comprising 1.8%, and segmental duplications comprising ~2.1% (with interchromosomal and intrachromosomal duplications at ~1.5% each, with some sequence included in both categories) (E. Eichler and X. She, personal communication).
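A gene-desert scan of the kind implied above can be sketched as follows. The interval convention (0-based, half-open), the treatment of chromosome ends as possible deserts and the toy coordinates are assumptions for illustration; the procedure behind Supplementary Table S7 may differ.

```python
def find_gene_deserts(gene_intervals, chrom_length, min_size=500_000):
    """Return (start, end) stretches of at least min_size bp containing no gene.

    gene_intervals: iterable of (start, end) pairs, 0-based and half-open.
    Overlapping or nested genes are handled by tracking the furthest end seen.
    """
    deserts = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(gene_intervals):
        if start - cursor >= min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor >= min_size:
        deserts.append((cursor, chrom_length))
    return deserts


def desert_fraction(deserts, chrom_length):
    """Fraction of the chromosome covered by the given deserts."""
    return sum(end - start for start, end in deserts) / chrom_length


if __name__ == "__main__":
    toy_genes = [  # made-up coordinates, not real annotations
        (100_000, 200_000), (150_000, 300_000),
        (900_000, 950_000), (1_000_000, 1_200_000),
    ]
    found = find_gene_deserts(toy_genes, chrom_length=2_000_000)
    print(found, desert_fraction(found, 2_000_000))
```

For the reported totals, 41.9 Mb of desert over 145,556,489 finished bases gives about 28.8%, consistent with the ~29% quoted.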
Chromosome 8 is the first human autosome and one of only two chromosomes (the other being chromosome X; ref. 20) for which sequences span the entire pericentromeric region. The regions on both arms stretch from unique euchromatin through pericentromeric satellites and into the higher-order alpha-satellite array (Fig. 2). Three variant higher-order repeat units populate the chromosome 8 higher-order array, D8Z2 (ref. 21 and Supplementary Information). The proximal termini of both the 8p and 8q sequence contigs comprise nine copies of the 1.9-kb unit. The p and q arm higher-order units are highly similar to each other (96–98% identity) and occur in the same head-to-tail orientation, indicating that these sequences sample the edges of the chromosome 8-specific array. Analysis of the finished pericentromeric sequence of chromosome 8 is essential to test and further develop primate centromere evolution hypotheses using an autosomal model.
The most striking feature on chromosome 8 emerges from
evolutionary and population genetic comparisons (Fig. 3). The
most distal 15 Mb on chromosome 8p show an extremely high
divergence between human and chimpanzee (0.021 substitutions
per site, 4.0 s.d. above the mean of 0.012). The region also shows a
strikingly high polymorphism rate in the human population (0.0018,
3.2 s.d. above the mean of 0.0010). The peak divergence reaches
0.032 (8.6 s.d.), and diversity 0.0028 (7.1 s.d.), across a 1-Mb region
(3.3–4.3 Mb) overlapping the CSMD1 gene. This is the highest
divergence level seen across all autosomes and chromosome
X. Only regions of chromosome Y may be more rapidly diverging,
driven by the high mutation rate in the male germ line. We
excluded trivial explanations for this observation, such as unresolved
segmental duplications (Supplementary Information). Diversity is
also locally high in the chimpanzee, although the data are more
limited.
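The standard deviations behind these figures are not stated, but they can be backed out from each reported value, mean and number of s.d. above the mean. The sketch below does only that; because the published numbers are rounded, the results are rough, and the labels are illustrative.

```python
def implied_sd(value: float, mean: float, z: float) -> float:
    """Standard deviation implied by a value reported as z s.d. above the mean."""
    return (value - mean) / z


# (value, genome mean, reported s.d. above the mean), as quoted in the text.
REPORTED = {
    "divergence, distal 15 Mb": (0.021, 0.012, 4.0),
    "diversity, distal 15 Mb": (0.0018, 0.0010, 3.2),
}

if __name__ == "__main__":
    for label, (value, mean, z) in REPORTED.items():
        print(f"{label}: implied s.d. ~ {implied_sd(value, mean, z):.5f}")
```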
The high rate of divergence and diversity at distal 8p might reflect
either an extraordinary mutation rate or population genetic history.
The latter alternative would require an unusually long coalescence
time to the most recent common ancestor over a very large region;
this would be remarkable inasmuch as local coalescence times tend to
be correlated over short distances, as the correlation falls below 0.5
within 20 kb (ref. 22). We sought to resolve the issue by examining
the divergence rates with more distant mammalian species, where the
impact of population genetic history should be negligible.
Figure 1 | Overview of human chromosome 8. The features are addressed in order from top to bottom. In the cartoon, blue shading indicates gene deserts (≥500 kb with no transcript, Supplementary Table S7); telomeres (pTEL and qTEL), the centromere (CEN) and euchromatic sequence gaps (red lines) are indicated. The following features are represented in discrete windows of 100 kb: G+C content (on a scale from 30–70%); densities of LINEs (long interspersed nucleotide elements; red) and SINEs (short interspersed nucleotide elements; blue); and densities of transcripts (all are counts of elements). The box at the bottom shows blocks of conserved synteny (100-kb resolution) with dog, mouse and rat as determined for this work. Chromosomes are numbered, and are coloured arbitrarily for ease of distinction.

Comparison of ancestral interspersed repeats in the human, dog (ref. 23) and mouse (ref. 24) genomes reveals that the region exhibits above-average lineage-specific divergence rates on all three lineages across 100 million years of evolution, but that the rate is the most elevated relative to the genome-wide mean in the lineage leading to humans. The greatest elevation is seen in the most distal 6 Mb of 8p, where the ancestral interspersed repeat divergence rates in the orthologous sequences have been 0.19 (3.3 s.d. above the mean of 0.14) on the human lineage and 0.41 (1.0 s.d. above the mean of 0.38) in the mouse lineage since the primate–rodent split, and 0.24 (1.9 s.d. above the mean of 0.20) in the dog lineage since the divergence from the common boreo-eutherian ancestor.
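The lineage comparison can be made explicit by expressing each regional rate relative to its genome-wide mean. The sketch below uses only the values quoted in the preceding paragraph; the names are illustrative.

```python
# lineage: (rate in the distal 6 Mb of 8p, genome-wide mean, reported s.d. above mean)
AIR_DIVERGENCE = {
    "human": (0.19, 0.14, 3.3),
    "mouse": (0.41, 0.38, 1.0),
    "dog": (0.24, 0.20, 1.9),
}


def relative_elevation(region_rate: float, genome_mean: float) -> float:
    """Fractional excess of the regional rate over the genome-wide mean."""
    return region_rate / genome_mean - 1.0


def rank_lineages() -> list:
    """Lineages ordered from most to least elevated relative to their own mean."""
    return sorted(
        AIR_DIVERGENCE,
        key=lambda name: relative_elevation(*AIR_DIVERGENCE[name][:2]),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_lineages():
        region, mean, z = AIR_DIVERGENCE[name]
        print(f"{name}: +{relative_elevation(region, mean):.0%} ({z} s.d.)")
```

On these rounded figures the human lineage shows roughly a 36% excess, dog about 20% and mouse about 8%, the same ordering as the reported s.d. values.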
The biological basis for the apparently high mutation rate is unclear. Three major factors have been associated with high mutation rates in the human genome: proximity to telomeres, high recombination rate and high A+T content (refs 25, 26). The region on chromosome 8p has all three factors. The mean sex-averaged recombination rate across the first 6 Mb is 2.7 cM Mb⁻¹, with a 1-Mb window peak of 3.5, as compared to the genome-wide average of 1.2. The region from 2.5–6 Mb is 62% A+T, as compared to a genome-wide average of 59%. It is unusual in this regard, because subtelomeric regions with high recombination rates are typically (A+T)-poor. Notably, the region is not subtelomeric in the mouse, where the lowest rate elevation is observed.
The distal region on chromosome 8p also contains at least two loci that appear to be undergoing positive selection (Fig. 3). The first locus is the major cluster of defensin genes, which lies within the region of high mutation (5.5–7.5 Mb), although ~2.5 Mb from the peak. The defensin genes express small cationic antimicrobial peptides crucial to the innate immune response (ref. 27). Studies (refs 2, 3) have suggested that defensins have been under positive selection, with a high ratio of non-synonymous to synonymous changes detected in the mature peptide coding exon. Moreover, gene and segmental duplication within the cluster have led to extensive copy number (refs 28, 29) and haplotype (ref. 30) polymorphism within and across populations, which are thought to influence variation in disease susceptibility and contribute to ongoing adaptive evolution in both the human and chimpanzee species. The second locus showing positive selection is MCPH1, mutations in which cause microcephaly (Online Mendelian Inheritance in Man (OMIM): 251200); there is clear evidence of accelerated non-synonymous divergence correlating with the expansion of brain size throughout the lineage from simian ancestors to the human and chimpanzee (refs 4, 5).

Table 1 | Chromosome 8 gene content

| Category | Gene number | Gene percentage | Gene length (bp)* | Number of alternative transcripts | Transcript length (bp)† | Number of exons per transcript‡ | Internal exon length (bp)§ | Intron length (bp)∥ | CpG-5′ association¶ |
| Known gene | 614 | 77 | 81,744 | 4.1 | 3,056 | 9.9 | 155 (n=5,725) | 9,630 (n=7,710) | 77 |
| Novel transcript | 43 | 5 | 96,268 | 1.8 | 1,116 | 3.8 | 146 (n=127) | 27,207 (n=248) | 42 |
| Putative gene | 14 | 2 | 45,433 | 1.2 | 714 | 2.4 | 123 (n=21) | 23,787 (n=57) | 36 |
| Novel CDS | 109 | 14 | 21,890 | 1.9 | 1,142 | 5.0 | 138 (n=487) | 5,103 (n=625) | 29 |
| Gene fragment | 13 | 2 | 648 | 1.0 | 648 | 1.0 | | | 8 |
| Total | 793 | | 72,334 | | | | | | |
| Pseudogene | 301 | 28 | 1,334 | 1.0 | 875 | 1.3 | 195 (n=50) | 1,430 (n=97) | 5 |

*Average chromosomal distance from the beginning of the 5′-most exon to the 3′-most exon in all transcripts in a gene.
†Average length summed across the footprint of all exons in all transcripts in a gene (total exon space per gene).
‡Average number of exons in transcripts. Exons common to different transcripts were counted once per transcript.
§Average length of exons using the footprint of all non-terminal exons of all transcripts in a gene. Unique overlapping exons or contained exons are counted separately, making this an average length of unique exons in a gene.
∥Average length of unique introns in a gene. In the case of exon skipping, both the shorter and longer versions of the overlapping introns were counted towards the average.
¶Percentage of genes with a transcript having a CpG island (as assessed by FirstEF) within −2 kb and +1 kb of transcription start.

Figure 2 | 8p and 8q pericentromeric contigs extend into chromosome 8-specific higher-order alpha satellite, D8Z2. The pericentromeric region of chromosome 8 is shown as a truncated ideogram with the extent of sequence coverage shown below by black bars. Dotter plots show self–self alignments of the most proximal ~100 kb from each arm including ~36 kb of the chromosome-specific alpha satellite array (D8Z2). Junctions between the arm-specific satellite region and D8Z2 are marked with blue arrows. Dark blocks indicate the highly repetitive nature of the satellite region and mark similarity between monomers within each satellite family. Gaps in the dark blocks occur where interspersed elements (LINEs, SINEs and long terminal repeats) interrupt the satellite sequences. In the alpha satellite array dotter plot (bottom), D8Z2 from 8p (~18 kb) is joined with that of 8q (~18 kb). The plot reveals the periodic nature of the centromeric, higher-order alpha satellite array with black horizontal lines indicating near identity of sequences spaced at ~1.9-kb intervals. The regions outlined in blue are self–self alignments (‘8p’ and ‘8q’), whereas the remaining rectangular region of the plot is an alignment of 8p versus 8q D8Z2.
To investigate the diversity of copy number in the defensin clusters, we resequenced several dozen polymerase chain reaction (PCR) products from representative intervals from DEFB105A (beta-defensin cluster) and DEFA1 (alpha-defensin cluster) in 14 chimpanzees, 1 gibbon, 1 macaque and 4 breeds of dog (see Methods and Supplementary Information). In all species studied, the gene family has multiple members, and the members are more similar within a species than across species. Thus, the defensin clusters have either independently duplicated in each species or have undergone gene conversion events within species.
Finally, we note that the majority of the genes in the region of high
divergence in distal 8p play important roles in development or
signalling in the nervous system. Notably, the extremely large
CSMD1 gene, which lies at the peak of divergence and diversity, is
widely expressed in brain tissues. High regional mutation rates and
positive selection are generally assumed to be distinct, but it is
possible that the former may facilitate the latter by increasing the
rate of appearance of potentially advantageous single, or interacting,
alleles (see also ref. 31). It is intriguing to speculate whether the
accelerated divergence rate of this region has contributed to the rapid
expansion and evolution of the primate brain.
METHODS
See Supplementary Information for details on clone path building, generation of sequence map, sizing of gaps and gene annotation. The final version of the clone path is available in AGP format (see http://www.ncbi.nlm.nih.gov/genome/guide/glossary.htm) at http://www.broad.mit.edu/tools/data/data-human.html.
Gene amplification and sequencing. TBLASTN (http://www.ncbi.nlm.nih.gov/BLAST) was used to identify DEFB105 and DEFA1 orthologues in 16 chimpanzees, 1 gibbon, 1 macaque and 4 dog breeds (akita, golden retriever, greyhound and mastiff). PCR primers for gene amplification were designed using Primer3 (http://frodo.wi.mit.edu/primer3) based on the species reference sequence. Human and macaque primers were used for gibbon. Amplified products were cloned, and for each individual/gene combination, 48 or 96 clones were sequenced.
Haplotype analysis. Neighbourhood Quality Standard (NQS; ref. 32) scores were computed for all sequenced products using the published constraints (ref. 32). Reads were trimmed to the first and last three consecutive NQS bases, and aligned to the reference sequence using PatternHunter (http://www.bioinformaticssolutions.com). Multiple sequence alignments were built from the pairwise alignments and inspected to find SNPs that were: at NQS bases, supported by at least two reads, and in a ten-base window where not more than two other variations were observed. To minimize false positives due to errors during PCR amplification, we restricted our analysis to haplotypes that differed in >3 bases.
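The three SNP criteria and the haplotype threshold described above can be sketched as simple filters. This is an illustration under stated assumptions, not the code used for this work: the `Variant` type is hypothetical, NQS status is taken as a precomputed flag, and the ten-base window is read as five bases to either side of the candidate site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical candidate SNP in a multiple sequence alignment."""
    pos: int               # alignment column
    supporting_reads: int  # reads carrying the variant allele
    at_nqs_base: bool      # True if the site meets the NQS constraints


def filter_snps(variants, min_reads=2, window=10, max_neighbours=2):
    """Keep variants at NQS bases, seen in >= min_reads reads, with at most
    max_neighbours other variations nearby (assumed: within window // 2 bases)."""
    half = window // 2
    positions = [v.pos for v in variants]
    kept = []
    for v in variants:
        neighbours = sum(
            1 for p in positions if p != v.pos and abs(p - v.pos) <= half
        )
        if (v.at_nqs_base and v.supporting_reads >= min_reads
                and neighbours <= max_neighbours):
            kept.append(v)
    return kept


def differ_enough(hap_a: str, hap_b: str, min_diff: int = 3) -> bool:
    """True if two equal-length haplotypes differ at more than min_diff bases."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must be aligned to the same length")
    return sum(a != b for a, b in zip(hap_a, hap_b)) > min_diff
```

Requiring more than three differences between haplotypes trades sensitivity for robustness: closely related true haplotypes are discarded together with PCR artefacts.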
Received 5 August; accepted 6 October 2005.
Figure 3 | Diversity and divergence on 8p. Coloured lines indicate the distribution of human diversity (blue) and human–chimpanzee divergence (red). Values of genome averages and of 2 standard deviations from the means are indicated (dark and light dashed lines, respectively). Features mentioned in the text are indicated in the bottom panel, including genes, two low copy repeats (LCRs) and the common 8p23 inversion. Vertical ticks in the LCR boxes indicate olfactory receptor genes or pseudogenes, and vertical ticks in the DEF cluster boxes represent individual defensin (DEF) genes. There is a discontinuity in the divergence plot from 6.98 to 8.13 Mb. This region, corresponding to the REPD repeat, is also highly duplicated in the chimpanzee, making it impossible to align sequence with high enough confidence to call divergence.

1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Vallender, E. J. & Lahn, B. T. Positive selection on the human genome. Hum. Mol. Genet. 13 (suppl. 2), R245–R254 (2004).
3. Maxwell, A. I., Morrison, G. M. & Dorin, J. R. Rapid sequence divergence in mammalian β-defensins by adaptive evolution. Mol. Immunol. 40, 413–421 (2003).
4. Xiao, Y. et al. A genome-wide screen identifies a single β-defensin gene cluster in the chicken: implications for the origin and evolution of mammalian defensins. BMC Genom. 5, 56 (2004).
5. Evans, P. D., Anderson, J. R., Vallender, E. J., Choi, S. S. & Lahn, B. T. Reconstructing the evolutionary history of microcephalin, a gene controlling human brain size. Hum. Mol. Genet. 13, 1139–1145 (2004).
6. Evans, P. D. et al. Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans. Science 309, 1717–1720 (2005).
7. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
8. Grimwood, J. et al. The DNA sequence and biology of human chromosome 19. Nature 428, 529–535 (2004).
9. Deloukas, P. et al. The DNA sequence and comparative analysis of human chromosome 10. Nature 429, 375–381 (2004).
10. Martin, J. et al. The sequence and analysis of duplication-rich human chromosome 16. Nature 432, 988–994 (2004).
11. Nusbaum, C. et al. DNA sequence and analysis of human chromosome 18. Nature 437, 551–555 (2005).
12. Schmutz, J. et al. Quality assessment of the human genome sequence. Nature 429, 365–368 (2004).
13. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, D501–D504 (2005).
14. Kong, A. et al. A high-resolution recombination map of the human genome. Nature Genet. 31, 225–226 (2002).
15. Giglio, S. et al. Olfactory receptor-gene clusters, genomic-inversion polymorphisms, and common chromosome rearrangements. Am. J. Hum. Genet. 68, 874–883 (2001).
16. Shimokawa, O. et al. Molecular characterization of inv dup del(8p): analysis of five cases. Am. J. Med. Genet. A 128, 133–137 (2004).
17. Deloukas, P. et al. A physical map of 30,000 genes. Science 282, 744–746 (1998).
18. Ashurst, J. L. et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 33, D459–D465 (2005).
19. Hillier, L. W. et al. Generation and annotation of the DNA sequences of human chromosomes 2 and 4. Nature 434, 724–731 (2005).
20. Ross, M. T. et al. The DNA sequence of the human X chromosome. Nature 434, 325–337 (2005).
21. Ge, Y., Wagner, M. J., Siciliano, M. & Wells, D. E. Sequence, higher order repeat structure, and long-range organization of alpha satellite DNA specific to human chromosome 8. Genomics 13, 585–593 (1992).
22. Reich, D. E. et al. Human genome sequence variation and the influence of gene history, mutation and recombination. Nature Genet. 32, 135–142 (2002).
23. Lindblad-Toh, K. et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438, 803–819 (2005).
24. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
25. The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005).
26. Hellmann, I. et al. Why do human diversity levels vary at a megabase scale? Genome Res. 15, 1222–1231 (2005).
27. Lehrer, R. I. Primate defensins. Nature Rev. Microbiol. 2, 727–738 (2004).
28. Hollox, E. J., Armour, J. A. & Barber, J. C. Extensive normal copy number variation of a β-defensin antimicrobial-gene cluster. Am. J. Hum. Genet. 73, 591–600 (2003).
29. Mars, W. M. et al. Inheritance of unequal numbers of the genes encoding the human neutrophil defensins HP-1 and HP-3. J. Biol. Chem. 270, 30371–30376 (1995).
30. Taudien, S. et al. Polymorphic segmental duplications at 8p23.1 challenge the determination of individual defensin gene repertoires and the assembly of a contiguous human reference sequence. BMC Genom. 5, 92 (2004).
31. Wyckoff, G. J., Malcom, C. M., Vallender, E. J. & Lahn, B. T. A highly unexpected strong correlation between fixation probability of nonsynonymous mutations and mutation rate. Trends Genet. 21, 381–385 (2005).
32. Altshuler, D. et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407, 513–516 (2000).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements We thank L. Gaffney for help with figures and tables;
L. French and her group at the Sanger Institute for attempting fibre FISH
analysis to size some clone gaps in the tiling path of chromosome 8; E. Eichler
and X. She for sharing their data on segmental duplications; T. Furey for help
with lists of genetic markers and placement of RefSeq genes; M. Kamal for
assistance and advice with synteny analysis; and K. Lindblad-Toh for sharing
data from the dog genome project. We also acknowledge the HUGO Gene
Nomenclature Committee (S. Povey, chair) for assigning official gene symbols.
We are deeply grateful to all the members, present and past, of the Genome
Sequencing Platform of the Broad Institute (and Whitehead Center for Genome
Research), Keio University School of Medicine and the Institute of Molecular
Biology at Jena for their dedication and for the consistent high quality of their
data that made this work possible. This work was supported by grants from the
National Human Genome Research Institute, RIKEN, the ‘Research for the
Future’ Program from the Japan Society for the Promotion of Science (JSPS), the
Ministry of Education, Culture, Sports, Science and Technology of Japan
(MEXT), the Federal German Ministry of Education, Research and Technology,
and the Thüringer Kultusministerium.
Author Information Accession numbers for all clones contributing to the
finished sequence of human chromosome 8 can be found in Supplementary
Table S2. The updated human chromosome 8 sequence can be accessed
through GenBank accession number NC_000008. Reprints and permissions
information is available at npg.nature.com/reprintsandpermissions. The authors
declare no competing financial interests. Correspondence and requests for
materials should be addressed to C.N. (chad@broad.mit.edu) or N.S.
(shimizu@dmb.med.keio.ac.jp).
... Protein kinases are enzymes which catalyze transfer of the -phosphate of adenosine triphosphate (ATP: energy carrying molecule) to amino acid side chains in substrate proteins such as serine, threonine, and tyrosine residues. Many critical protein kinase drug targets in cancer and non-cancerous conditions-including receptor kinases, enzymes, ion channels, and cancer research, thereby permitting the reclassification of adaptive mutability [19][20][21] into 'relative mutability' [22], as explored by our recent investigations [23,24]. We defined relative mutability using two factors associated with high mutation rates in human chromosomes, enabling the analysis of both inherited and somatic mutations. ...
... We defined relative mutability using two factors associated with high mutation rates in human chromosomes, enabling the analysis of both inherited and somatic mutations. Previously, several factors have been reported to be associated with high mutation rates in human genomes, including 1) recombination rate [25], 2) proximity to a telomere, and 3) high adenine/thymine (A+T) content [24,26]. Among these factors, we have previously demonstrated that proximity to a telomere [27] and nucleotide composition (A+T content) can explain some of the genetic mutations linked to monogenic and/or polygenic diseases [23,28]. ...
... The National Institute of Health (NIH) of the United States has recently released more than 300 understudied druggable genomes entitled as the "Commercializing Understudied Proteins from the Illuminating the Druggable Genome" project (PA- [19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34]. From that list, 129 druggable candidates were classified as protein kinases. ...
Article
Full-text available
Mutations of protein kinases and cytokines are common and can cause cancer and other diseases. However, our understanding of the mutability of these genes remains rudimentary. Therefore, given previously known factors that are associated with high mutation rates, we analyzed how many genes encoding druggable kinases match (i) proximity to telomeres or (ii) high A+T content. We extracted this genomic information using the National Institute of Health Genome Data Viewer. First, among 129 druggable human kinase genes studied, 106 genes satisfied either factor (i) or (ii), resulting in an 82% match. Moreover, a similar 85% match rate was found in 73 genes encoding pro-inflammatory cytokines of multisystem inflammatory syndrome in children. Based on these promising matching rates, we further compared these two factors using 20 de novo mutations of mice exposed to space-like ionizing radiation, in order to determine whether these seemingly random mutations were similarly predictable with this strategy. However, only 10 of these 20 murine genetic loci met (i) or (ii), giving only a 50% match. When compared with the mechanisms of top-selling FDA-approved drugs, these data suggest that matching rate analysis on druggable targets is feasible as a way to systematically prioritize the relative mutability (and therefore therapeutic potential) of the novel candidates.
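The matching rates quoted in this abstract are simple proportions, and the arithmetic can be checked with a minimal sketch. The function name `match_rate` is a hypothetical helper introduced here for illustration, not something taken from the cited study; only the counts (106 of 129 kinase genes, 10 of 20 murine loci) come from the abstract.

```python
def match_rate(matched: int, total: int) -> float:
    """Return the percentage of genes meeting factor (i) or (ii).

    Hypothetical helper: the cited abstract reports only counts and
    rounded percentages, not how they were computed.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= matched <= total:
        raise ValueError("matched must lie between 0 and total")
    return 100.0 * matched / total


# Counts as stated in the abstract.
kinase_rate = match_rate(106, 129)  # druggable human kinase genes
murine_rate = match_rate(10, 20)    # de novo mutations in irradiated mice

print(f"kinases: {kinase_rate:.0f}%")  # prints "kinases: 82%"
print(f"murine loci: {murine_rate:.0f}%")  # prints "murine loci: 50%"
```

The abstract gives the 85% cytokine figure without the underlying count, so it is left out of the sketch rather than back-calculated.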
... Discharged home in a satisfactory condition at the age of 2 months. Karyotyping of the newborn gave the result mos47,XY,+8[7]/46,XY[9], a mosaic variant of trisomy 8. Mosaic trisomy 8 is a rare chromosomal anomaly. ...
... The short arm of chromosome 8 contains about 484 annotated genes (NCBI Build 36.3) of the human genome [9,10]. More than 50 genes on the duplicated short arm of chromosome 8, from 8p11.1 to 8p23, are associated with various genetic disorders and diseases. ...
Article
Warkany syndrome, or trisomy 8 mosaicism (T8M), is a known chromosomal anomaly with a frequency of 1:25,000 to 1:50,000 births and is more common in males than in females (5:1). A 28-year-old woman with an aggravated somatic history was admitted to the Department of Pediatric Surgery at Almazov NMRC at 39 weeks' gestation. The pregnancy was achieved with assisted reproductive technology (in vitro fertilization) because of male-factor infertility: overpermation and asthenospermia caused by mumps contracted in childhood. Multiple congenital malformations of the fetus were diagnosed antenatally: agenesis of the corpus callosum, triventricular hydrocephalus, macrocephaly, bilateral ureterohydronephrosis, and megacystis. Both parents were karyotyped, with normal results: 46,XX (female) and 46,XY (male). The fetal karyogram showed 46,XY, a normal male karyotype. From the first day of life, the child had generalized myoclonic epileptic seizures. Neurosonography confirmed a malformation of the brain. Echocardiography showed a ventricular septal defect (VSD). Abdominal ultrasound revealed bilateral ureterohydronephrosis with dilation of the pelvicalyceal system and ureters. At 28 days of age, because of neurogenic bladder dysfunction associated with a tethered spinal cord, a vesicostomy was performed to provide continuous urine drainage. Diagnostic laparoscopy revealed a formation 2 cm in diameter in the right iliac region, a diverticular-cystic duplication of the ileum. A wedge resection of the base of the diverticulum was performed, and the operation was completed with a vesicostomy. The child was discharged home in satisfactory condition at the age of 2 months. Karyotyping of the newborn gave the result mos47,XY,+8[7]/46,XY[9], a mosaic variant of trisomy 8.
Mosaic trisomy 8 is a rare chromosomal anomaly. The clinical case described here is characterized by a combination of severe congenital defects that has not previously been reported. This is the first birth of a child with trisomy 8 following assisted reproductive technology (in vitro fertilization).
... We have previously shown that there are two factors based on gene characteristics that are associated with high mutation rates (Nusbaum et al., 2006), and can be used to assess the relative mutability of genes associated with various human diseases (Lucas et al., 2021;McKnight et al., 2021) and those targeted by specific drugs (Raines et al., 2022). These two factors are proximity to telomeres and high A + T content and were proposed in studies of human chromosomes (Nusbaum et al., 2006). Using these two factors as a screening tool, we asked if the mutability of the six genes of interest with one housekeeping gene we assayed from the caudate nucleus of human samples are similar or different across mammalian species. ...
Article
Symptoms of normal pressure hydrocephalus (NPH) and Alzheimer’s disease (AD) are somewhat similar, and the two conditions are commonly misdiagnosed. Although there are fluid markers detectable in humans with NPH and AD, it remains poorly understood which biomarker best represents genetic characteristics that are consistent across species. Here, we hypothesize that NPH can be differentiated from AD with mRNA biomarkers of unvaried proximity to telomeres. We examined human caudate nucleus tissue samples for the expression of transient receptor potential cation channel subfamily V member 4 (TRPV4) and amyloid precursor protein (APP). Using the Genome Data Viewer, we analyzed the mutability of TRPV4 and other genes in mice, rats, and humans by matching the nucleotides of six genes of interest and one housekeeping gene against two factors associated with high mutation rate: (i) proximity to telomeres or (ii) high adenine and thymine (A+T) content. We found that TRPV4 and microtubule associated protein tau (MAPT) mRNA were elevated in NPH. In AD, mRNA expression of TRPV4 was unaltered, unlike APP and other genes. In mice, rats, and humans, the nucleotide size of TRPV4 did not vary, while in other genes the sizes were inconsistent. Proximity to telomeres in TRPV4 was <50 Mb across species. Our analyses reveal that TRPV4 gene size and mutability are conserved across three species, suggesting that TRPV4 can be a potential link in the pathophysiology of chronic hydrocephalus in aged humans (>65 years) and laboratory rodents at comparable ages.
... In 2007, chromosome 8 was fully sequenced and analyzed by the Human Genome Project, resulting in the description of the first human defensin gene family landscape [44][45][46] (Fig. 1a). The first defensin database was established in the same year, incorporating 350 defensins 47 (Fig. 1a). ...
Article
As a family of cationic host defense peptides, defensins are mainly synthesized by Paneth cells, neutrophils, and epithelial cells, contributing to host defense. Their biological functions in innate immunity, as well as their structure and activity relationships, along with their mechanisms of action and therapeutic potential, have been of great interest in recent years. To highlight the key research into the role of defensins in human and animal health, we first describe their research history, structural features, evolution, and antimicrobial mechanisms. Next, we cover the role of defensins in immune homeostasis, chemotaxis, mucosal barrier function, gut microbiota regulation, intestinal development and regulation of cell death. Further, we discuss their clinical relevance and therapeutic potential in various diseases, including infectious disease, inflammatory bowel disease, diabetes and obesity, chronic inflammatory lung disease, periodontitis and cancer. Finally, we summarize the current knowledge regarding the nutrient-dependent regulation of defensins, including fatty acids, amino acids, microelements, plant extracts, and probiotics, while considering the clinical application of such regulation. Together, the review summarizes the various biological functions, mechanism of actions and potential clinical significance of defensins, along with the challenges in developing defensins-based therapy, thus providing crucial insights into their biology and potential clinical utility.
... Indeed, of the six chromosomes, four (except Chr16 and Chr22) showed clearly better regional correlations in non-SD regions than in SDs (Supplementary Fig. 10). The poor regional correlations of Chr8 were ascribable to underestimated mutation rates in the region from 0 Mb to 25 Mb (Supplementary Fig. 11), a region reported to have a strikingly high mutation rate [29]. Chr8, Chr9 and Chr16 were reported to have many clustered mutations in regions with accelerated maternal mutation rates, through a unique mutational mechanism [30]. ...
Article
Germline mutation rates are essential for genetic and evolutionary analyses. Yet, estimating accurate fine-scale mutation rates across the genome is a great challenge, due to relatively few observed mutations and intricate relationships between predictors and mutation rates. Here, we present Mutation Rate Learner (MuRaL), a deep learning framework to predict mutation rates at the nucleotide level using only genomic sequences as input. Harnessing human germline variants for comprehensive assessment, we show that MuRaL achieves better predictive performance than current state-of-the-art methods. Moreover, MuRaL can build models with relatively few training mutations and a moderate number of sequenced individuals, and can leverage transfer learning to further reduce data and time demands. We apply MuRaL to produce genome-wide mutation rate maps for four representative species—Homo sapiens, Macaca mulatta, Drosophila melanogaster and Arabidopsis thaliana—demonstrating its high applicability. As an example, we use improved mutation rate estimates to stratify human genes into distinct groups that are enriched for different functions, and highlight that many developmental genes are subject to high mutational burden. The open-source software and generated mutation rate maps can greatly facilitate related research. Mutation rates are crucial for genetic and evolutionary analyses. Fang et al. present a generalizable deep learning method to build fine-scale mutation rate maps with DNA sequences as input, which can benefit analyses reliant on mutation rates.
... Two factors have been associated with high gene mutation rates and cause potential diseases: (i) proximity to telomeres (<50 Mbp) and/or (ii) high adenine and thymine (A + T) content (>59%) [51][52][53][54]. For human RhoA, it shows proximity (49 Mbp) to its telomeres with an A + T content of 56%, which are both very close to the critical values. ...
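The two-factor screen described in this excerpt can be sketched as a pair of threshold checks. The cutoffs (<50 Mbp from a telomere, >59% A+T) and the RhoA values (49 Mbp, 56%) are taken from the excerpt; the `Gene` class, the field names, and `factors_met` are hypothetical names introduced for illustration, and the sketch assumes the two factors are combined with a plain "either/or" rule, as the "and/or" wording suggests.

```python
from dataclasses import dataclass

# Cutoffs as quoted in the excerpt.
TELOMERE_CUTOFF_MBP = 50.0   # factor (i): closer than this to a telomere
AT_CUTOFF_PERCENT = 59.0     # factor (ii): A+T content above this


@dataclass(frozen=True)
class Gene:
    """Hypothetical record; field names are assumptions for illustration."""
    name: str
    telomere_distance_mbp: float
    at_percent: float


def factors_met(gene: Gene) -> tuple[bool, bool]:
    """Return (meets factor i, meets factor ii) for one gene."""
    near_telomere = gene.telomere_distance_mbp < TELOMERE_CUTOFF_MBP
    at_rich = gene.at_percent > AT_CUTOFF_PERCENT
    return near_telomere, at_rich


# Values for human RhoA as stated in the excerpt.
rhoa = Gene(name="RHOA", telomere_distance_mbp=49.0, at_percent=56.0)
near_telomere, at_rich = factors_met(rhoa)
matches = near_telomere or at_rich

print(rhoa.name, near_telomere, at_rich, matches)  # prints "RHOA True False True"
```

Under these assumptions RhoA meets factor (i) by a margin of 1 Mbp and misses factor (ii) by 3 percentage points, which is consistent with the excerpt's remark that both values lie close to the critical thresholds.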
Article
RhoA, a member of Rho GTPases, regulates myriad cellular processes. Abnormal expression of RhoA has been implicated in various diseases, including cancers, developmental disorders and bacterial infections. RhoA mutations G14V and Q63L have been reported to constitutively activate RhoA. To figure out the mechanisms, in total, 1.8 μs molecular dynamics (MD) simulations were performed here on RhoAWT and mutants G14V and Q63L in GTP-bound forms, followed by dynamic analysis. Both mutations were found to affect the conformational dynamics of RhoA switch regions, especially switch I, shifting the whole ensemble from the wild type’s open inactive state to different active-like states, where T37 and Mg2+ played important roles. In RhoAG14V, both switches underwent thorough state transition, whereas in RhoAQ63L, only switch I was sustained in a much more closed conformation with additional hydrophobic interactions introduced by L63. Moreover, significantly decreased solvent exposure of the GTP-binding site was observed in both mutants with the surrounding hydrophobic regions expanded, which furnished access to water molecules required for hydrolysis more difficult and thereby impaired GTP hydrolysis. These structural and dynamic differences first suggested the potential activation mechanism of RhoAG14V and RhoAQ63L. Together, our findings complemented the understanding of RhoA activation at the atomic level and can be utilized in the development of novel therapies for RhoA-related diseases.
... Interestingly, evolutionarily younger HOR arrays (more homogenous/less mutated) locate close to the functional core of the centromere, flanked by layers of increasingly divergent ancestral alpha satellites (Schueler et al. 2005;Rudd et al. 2006;Shepelev et al. 2009). Finally, the regions flanking the HOR arrays, called the pericentromeres, devoid of HOR units, contain arrays of alpha satellite monomers interspersed with LINEs, SINEs, and various other repeat elements (Schueler et al. 2001;Nusbaum et al. 2006;Rudd et al. 2006). These regions are marked by characteristic constitutive heterochromatin, namely Heterochromatin Protein 1 (HP1) binding, enrichment of H3K9me2/3, H3K27me3 and H4K20me3, hypoacetylation of histones H3 and H4, and transcriptional silencing (reviewed in (Fioriniello et al. 2020)). ...
Centromeres are key architectural components of chromosomes. Here, we examine their construction, maintenance, and functionality. Focusing on the mammalian centromere-specific histone H3 variant, CENP-A, we highlight its coevolution with both centromeric DNA and its chaperone, HJURP. We then consider CENP-A de novo deposition and the importance of centromeric DNA recently uncovered with the added value from new ultra-long-read sequencing. We next review how to ensure the maintenance of CENP-A at the centromere throughout the cell cycle. Finally, we discuss the impact of disrupting CENP-A regulation on cancer and cell fate.
Keywords: Centromere; CENP-A; Alpha-satellite; HJURP; Histone variant; Genome architecture; Cell cycle; Cancer
Article
Proximity to telomeres (i) and high adenine and thymine (A + T) content (ii) are two factors associated with high mutation rates in human chromosomes. We have previously shown that >100 human genes when mutated to cause congenital hydrocephalus (CH) meet either factor (i) or (ii) at 91% matching, while two factors are poorly satisfied in human genes associated with familial Parkinson's disease (fPD) at 59%. Using the sets of mouse, rat, and human chromosomes, we found that 7 genes associated with CH were located on the X chromosome of mice, rats, and humans. However, genes associated with fPD were in different autosomes depending on species. While the contribution of proximity to telomeres in the autosome was comparable in CH and fPD, high A + T content played a pivotal contribution in X-linked CH (43% in all three species) than in fPD (6% in rodents or 13% in humans). Low A + T content found in fPD cases suggests that PARK family genes harbor roughly 3 times higher chances of methylations in CpG sites or epigenetic changes than X-linked genes.
Article
R-spondins are secretory proteins localized in the endoplasmic reticulum and Golgi bodies and are processed through the secretory pathway. Among the R-spondin family, RSPO2 has emanated as a novel regulator of Wnt signaling, which has now been acknowledged in numerous in vitro and in vivo studies. Cancer is an abnormal growth of cells that proliferates and spreads uncontrollably due to the accumulation of genetic and epigenetic factors that constitutively activate Wnt signaling in various types of cancer. Colorectal cancer (CRC) begins when cells in the colon and rectum follow an indefinite pattern of division due to aberrant Wnt activation as one of the key hallmarks. Decades-long progress in research on R-spondins has demonstrated their oncogenic function in distinct cancer types, particularly CRC. As a critical regulator of the Wnt pathway, it modulates several phenotypes of cells, such as cell proliferation, invasion, migration, and cancer stem cell properties. Recently, RSPO mutations, gene rearrangements, fusions, copy number alterations, and altered gene expression have also been identified in a variety of cancers, including CRC. In this review, we addressed the recent updates regarding the recurrently altered R-spondins with special emphasis on the RSPO2 gene and its involvement in potentiating Wnt signaling in CRC. In addition to the compelling physiological and biological roles in cellular fate and regulation, we propose that RSPO2 would be valuable as a potential biomarker for prognostic, diagnostic, and therapeutic use in CRC.
Article
Chiral vicinal diols are important intermediates in the synthesis of pharmaceuticals. Epoxide hydrolases catalyze hydrolytic ring opening of epoxides to produce the corresponding vicinal diols, providing an attractive way to access these building blocks under mild conditions in a stereoselective and atom-efficient manner. In this study, an epoxide hydrolase is identified and engineered to form (3S,4S)-tetrahydrofurandiol in high optical purity via the desymmetrization of meso-3,4-epoxytetrahydrofuran. In nine rounds of directed evolution, the enzyme's native (3R,4R)-stereopreference was reversed and its activity was dramatically improved to achieve quantitative yield under remarkably high 500 g/L substrate concentration and low enzyme loading. Computational modelling provides insights on the changes in enzyme-substrate interaction that result in divergent enantioselectivities afforded by evolved variants.
Article
The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
Article
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence. The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century 1–3 sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same. The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant. Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome.
The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly. The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species. Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear. The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex
Article
Chromosome 19 has the highest gene density of all human chromosomes, more than double the genome-wide average. The large clustered gene families, corresponding high G + C content, CpG islands and density of repetitive DNA indicate a chromosome rich in biological and evolutionary significance. Here we describe 55.8 million base pairs of highly accurate finished sequence representing 99.9% of the euchromatin portion of the chromosome. Manual curation of gene loci reveals 1,461 protein-coding genes and 321 pseudogenes. Among these are genes directly implicated in mendelian disorders, including familial hypercholesterolaemia and insulin-resistant diabetes. Nearly one-quarter of these genes belong to tandemly arranged families, encompassing more than 25% of the chromosome. Comparative analyses show a fascinating picture of conservation and divergence, revealing large blocks of gene orthology with rodents, scattered regions with more recent gene family expansions and deletions, and segments of coding and non-coding conservation with the distant fish species Takifugu.
Article
Here we report a high-quality draft genome sequence of the domestic dog (Canis familiaris), together with a dense map of single nucleotide polymorphisms (SNPs) across breeds. The dog is of particular interest because it provides important evolutionary information and because existing breeds show great phenotypic diversity for morphological, physiological and behavioural traits. We use sequence comparison with the primate and rodent lineages to shed light on the structure and evolution of genomes and genes. Notably, the majority of the most highly conserved non-coding sequences in mammalian genomes are clustered near a small subset of genes with important roles in development. Analysis of SNPs reveals long-range haplotypes across the entire dog genome, and defines the nature of genetic diversity within and across breeds. The current SNP map now makes it possible for genome-wide association studies to identify genes responsible for diseases and traits, with important consequences for human and companion animal health.
Article
Here we present a draft genome sequence of the common chimpanzee (Pan troglodytes). Through comparison with the human genome, we have generated a largely complete catalogue of the genetic differences that have accumulated since the human and chimpanzee species diverged from our common ancestor, constituting approximately thirty-five million single-nucleotide changes, five million insertion/deletion events, and various chromosomal rearrangements. We use this catalogue to explore the magnitude and regional variation of mutational forces shaping these two genomes, and the strength of positive and negative selection acting on their genes. In particular, we find that the patterns of evolution in human and chimpanzee protein-coding genes are highly correlated and dominated by the fixation of neutral and slightly deleterious alleles. We also use the chimpanzee genome as an outgroup to investigate human population genetics and identify signatures of selective sweeps in recent human evolution.
Article
A map of 30,181 human gene–based markers was assembled and integrated with the current genetic map by radiation hybrid mapping. The new gene map contains nearly twice as many genes as the previous release, includes most genes that encode proteins of known function, and is twofold to threefold more accurate than the previous version. A redesigned, more informative and functional World Wide Web site (www.ncbi.nlm.nih.gov/genemap) provides the mapping information and associated data and annotations. This resource constitutes an important infrastructure and tool for the study of complex genetic traits, the positional cloning of disease genes, the cross-referencing of mammalian genomes, and validated human transcribed sequences for large-scale studies of gene expression.