How Genomes are Sequenced and Why it Matters: Implications for Studies in Comparative Genomics of Humans and Chimpanzees

Answers Research Journal 4 (2011):81–88.
www.answersingenesis.org/contents/379/arj/v4/genomes_chimpanzees_humans.pdf
Abstract
Claims about high genomic DNA sequence similarity between humans and chimpanzees are
typically made to audiences that do not understand the various layers of technology and ideological
bias imposed upon the origination of the data in question. The recent human-chimp Y-chromosome
project introduced a number of important genomic tools to achieve a considerably less-biased
analysis. The results indicated a much higher level of dissimilarity in both gene content and overall
sequence similarity than the previously reported levels up to 99% similarity. As of yet, no similar study
utilizing a less-biased genomic framework for autosomal regions has been reported. When evaluating
comparisons between genomes using DNA sequence, it is important to understand the nature of how
that sequence was obtained and bioinformatically manipulated before drawing any conclusions. It is
not uncommon to arrange the sequence of a genome for which little is known by using the genome
of a hypothetical closely related organism that has better developed genomic resources. It is also
QRWXQFRPPRQWRÀUVWVFUHHQ WKHIUDPHZRUNPRGHOJHQRPHWRÀQGUHJLRQVRIKLJKVLPLODULW\SULRUWR
DQ\ FRPSDUDWLYH DQDO\VHV DQG WR HYHQ RPLWJDSV LQ WKHÀQDO '1$DOLJQPHQWV EHIRUHGHWHUPLQLQJ
sequence identity. As a result, evolutionary bias literally colors every aspect of the DNA analysis and
annotation. Understanding the technology used to produce a comparative genomic product for
LQWHUJHQRPHVWXGLHVLVUHTXLUHGSULRUWRPDNLQJDQ\GHÀQLWLYHFRQFOXVLRQVDERXWWKHGDWDSUHVHQWHG
At present, a considerably more unbiased approach to comparative genomics needs to be applied
to the analysis and annotation of genomes.
Keywords: comparative genomics, human-chimp similarity, human genome, chimpanzee genome,
DNA sequencing, genome sequencing, cloning DNA
ISSN: 1937-9056 Copyright © 2011 Answers in Genesis. All rights reserved. Consent is given to unlimited copying, downloading, quoting from, and distribution of this article for
QRQFRPPHUFLDOQRQVDOHSXUSRVHVRQO\SURYLGHGWKHIROORZLQJFRQGLWLRQVDUHPHWWKHDXWKRURIWKHDUWLFOHLVFOHDUO\LGHQWLÀHG$QVZHUVLQ*HQHVLVLVDFNQRZOHGJHGDVWKHFRS\ULJKW
RZQHU$QVZHUV5HVHDUFK-RXUQDODQGLWVZHEVLWHZZZDQVZHUVUHVHDUFKMRXUQDORUJDUHDFNQRZOHGJHGDVWKHSXEOLFDWLRQVRXUFHDQGWKHLQWHJULW\RIWKHZRUNLVQRWFRPSURPLVHG
LQDQ\ZD\)RUPRUHLQIRUPDWLRQZULWHWR$QVZHUVLQ*HQHVLV32%R[+HEURQ.<$WWQ(GLWRU$QVZHUV5HVHDUFK-RXUQDO
The views expressed are those of the writer(s) and not necessarily those of the $QVZHUV5HVHDUFK-RXUQDOEditor or of Answers in Genesis.
Jeffrey P. Tomkins, Institute for Creation Research, 1806 Royal Lane, Dallas, TX 75229
Introduction
The ability to sequence the DNA of an organism’s
JHQRPH ZDV DQ LPSRUWDQW VFLHQWLÀF DGYDQFH WKDW
radically changed many aspects of molecular
biology and genetics in both the academic and
private sectors. Unfortunately, many discussions
and interpretations surrounding genomic sequence,
particularly those of a comparative nature, are
errant or misleading because of the type of DNA
sequence in question. Depending on the type of
research approach and technologies used to produce
the overall DNA sequence assembly for a particular
organism, certain limitations to its application and
XVDJHPXVWEHWDNHQLQWRDFFRXQWZKHQDSSO\LQJLW
for any comparative purpose.
Not surprisingly, the role of available research
funds weighed against the cost per base of DNA
sequence is, in most cases, the deciding factor on the
overall amount and quality of sequence produced.
*HWWLQJ PRUH ´EDQJ IRU WKH EXFNµ LV JHQHUDOO\ WKH
way grant funds are used when it comes to DNA
sequencing. This general ideology is true of many post-
KXPDQJHQRPH UHVHDUFKSURMHFWVZKLFK LQFRUSRUDWH
a DNA sequencing strategy called “whole genome
VKRWJXQVHTXHQFLQJµ 7KLVW\SHRIWHFKQRORJ\WDNHV
RQSDUWLFXODUVLJQLÀFDQFHZKHQ WDNLQJ LQWRDFFRXQW
the massive amounts of data now being produced
XVLQJQH[WJHQHUDWLRQ´PDVVLYHO\SDUDOOHOµVHTXHQFLQJ
technologies.
In 2004, the human genome was formally completed
LQ UHJDUG WR VHTXHQFLQJ WKH PDMRU HXFKURPDWLF
sections (International Human Genome Sequencing
Consortium 2004). In 2005 (The Chimpanzee
Sequencing and Analysis Consortium), a rough
draft of the chimpanzee genome was reported with
the hope that its availability would vindicate the
claims of biologists who had been promoting high
VLPLODULW\RUJUHDWHU%ULWWHQDVVRFLDWHG
with an ape to human evolutionary transition. Years
before the DNA revolution began, chimpanzees were
often positioned in the evolutionary tree closest to
humans out of all the extant apes. Some biologists
even went so far as to say that humans and chimps
should be placed in the same genus and considered
separate species (Wildman et al. 2003). However,
most scientists recognize the vast behavioral and
anatomical differences that exist between humans
and chimps and do not agree that they should be
placed in the same genus (Taylor 2009). In addition,
recent research has shown that some sections of the
human genome are more similar to orangutan, and
not chimpanzee, producing evolutionarily aberrant
DNA patterns called “incomplete lineage sorting”
(Hobolth et al. 2011).
Brief History of DNA Sequencing Technology
7R IXOO\ XQGHUVWDQG WKH UDPLÀFDWLRQV RI WKH
incredibly large amount of DNA sequence data
currently available today in the world’s public
UHSRVLWRULHVLWLVLPSRUWDQWWRÀUVWWDNHDEULHIORRN
at the history of DNA sequencing technologies. This
ZLOOKHOSH[SODLQZK\FHUWDLQDSSURDFKHVZHUHWDNHQ
to sequence certain organisms and also allows an
understanding of the resulting overall quality and
usability for that particular sequence set. For a
WLPHOLQHRIVHOHFWHGPDMRUHYHQWVLQ WKHKLVWRU\RI
DNA sequencing research related to sequencing, see
Fig. 1.
Fig. 1. Timeline showing significant milestones related to
the history of DNA sequencing.
The whole modern phenomenon of DNA
VHTXHQFLQJ ZDV LQWURGXFHG E\ WKH ZRUN RI ELRORJLVW
DQG FKHPLVW )UHG 6DQJHU 6DQJHU 1LFNOHQ DQG
Coulson 1977), research that earned him the Nobel
Prize. Surprisingly, the basic chemistry invented by
Fred Sanger, referred to as Sanger-style sequencing,
has remained essentially the same from its earliest
years until the present time. Drastic improvements in
Sanger-style DNA sequencing since 1977 were largely
achieved through four areas:
1. the introduction of the polymerase chain reaction
(PCR) and improvements in the basic chemical
components (various enzymes, reagents and DNA
fragment labeling),
2. the automation of sample preparation via large-
scale microtiter plate (primarily 96 and 384-well
formats) systems using robotically automated
pipetting and thermo-cycler platforms,
3. automated laser-based fragment detection
systems which evolved from 96-lane slab gel
systems to extremely high-throughput/automated
robotic platforms using large arrays of individual
capillaries that could resolve DNA fragments in 96
RUPRUHVHTXHQFLQJ UHDFWLRQV LQ D PDWWHU RI MXVW
D FRXSOH RI KRXUV DQG WKHQ DXWRPDWLFDOO\ UHORDG
themselves, and
4. bioinformatic and computational advances in
hardware and software to edit, process, and submit
massive amounts of DNA sequence data to both
local and off-site database repositories. Advances
in laboratory information management systems
(LIMS) contributed to the overall automation and
integration of the overall process.
One important feature of modern Sanger-style
sequencing is the long high-quality read lengths that
can be achieved. Under relatively optimal conditions,
high-quality DNA sequence with a rate of only 1 error
in 10,000 bases can be routinely obtained with average
individual read lengths up to ~1,200 bases. The public
KXPDQJHQRPHSURMHFWZDV ODUJHO\FRPSOHWHGXVLQJ
Sanger-style technology on DNA libraries constructed
from mapped large-insert DNA clones (International
Human Genome Sequencing Consortium 2001, 2004).
Slab-gel DNA sequencers were used at the beginning
RIWKHSURMHFWDQGZHUHHYHQWXDOO\UHSODFHGZLWKÀUVW
generation capillary technology.
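As a rough illustration of what an error rate of 1 in 10,000 means, the figure can be expressed on the logarithmic Phred quality scale widely used for sequencing reads, Q = -10 log10(P). The Python sketch below is not part of the original article, and the function name is hypothetical.

```python
import math

def phred_quality(error_probability: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)

# One error in 10,000 bases, the Sanger figure quoted in the text.
sanger_quality = phred_quality(1 / 10_000)
print(f"1 error in 10,000 bases corresponds to about Q{round(sanger_quality)}")
```

On this scale every 10 points represents a tenfold drop in the expected error rate.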
Currently, next generation DNA sequencing
technologies based on an overall strategy called
PDVVLYHO\SDUDOOHOVHTXHQFLQJ0DUGLV5RJHUV
and Venter 2005), have increased overall total DNA
VHTXHQFH RXWSXW +RZHYHU RQH LQKHUHQW GUDZEDFN
to massively parallel sequencing as a whole is the
dramatic reduction in the amount of high quality
sequence per individual read. Based on the next
generation technology variant, individual read
lengths vary from about 25 bases to 100 bases (Mardis
2008) with some recent claims by machine suppliers
as high as 400. The overall trend is that the more
EXON VHTXHQFH SURGXFHG E\ D SDUWLFXODU WHFKQRORJ\
within a certain span of time, the shorter the average
read length of the individual sequences. Massively
SDUDOOHO VHTXHQFLQJ KDV LPSRUWDQW UDPLÀFDWLRQVIRU
comparative genomics that will be discussed after
VRPHEDFNJURXQGLQIRUPDWLRQRQJHQRPHVHTXHQFLQJ
strategies is discussed.
Approaches To Genome Sequencing
7KH ÀUVW JHQRPHV VHTXHQFHG ZHUH VPDOO DQG
microbial in nature and included several species of
EDFWHULD)UDVHUHWDO0XVKHJLDQDQG.RRQLQ
1996). This is because the DNA in bacterial genomes
is relatively void of non-protein coding DNA sequence
ZKLFKLVRIWHQUHSHWLWLYHDQGGLIÀFXOWWRVHTXHQFHDQG
computationally assemble. With highly repetitive
JHQRPHVHTXHQFHLQKLJKHUHXNDU\RWHVFHUWDLQEORFNV
of DNA sequence are repeated for very long stretches.
The problem in such cases is not that the chemistry is
unable to sequence the DNA, but the computational
assembly of the repetitive sequence reads to form
a single long error-free contiguous DNA sequence
(contig) is confounded. In addition to the computational
limitations of assembling highly repetitive sequences,
the incorporation of a single errant sequence into a
contig can also pull in a large number of other related
errant sequences, producing sequencing chimeras. To
VROYHWKLVSUREOHPWHFKQLTXHVWRMXPSRYHUWKHVHDUHDV
RIWKHJHQRPHXVLQJYDULRXVW\SHVRIIUDPHZRUNVDQG
bridging scaffolds were implemented. Nevertheless,
JHQRPHVHTXHQFLQJÀUVWWHVWHGWKHZDWHUVZLWKVPDOO
non-repetitive genomes that were easily assembled
and then moved on to some of the more challenging
HXNDU\RWLFJHQRPHVVXFKDVIUXLWÁ\QHPDWRGHDQG
human.
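The assembly problem posed by repeats can be sketched with a toy example. The Python snippet below is an illustration only and does not come from the article: it builds a hypothetical random genome containing two copies of a 300-base repeat and measures what fraction of read-length windows occur exactly once. Reads shorter than the repeat cannot all be placed unambiguously, while reads longer than the repeat can.

```python
import random

def unique_window_fraction(genome: str, read_len: int) -> float:
    """Fraction of read-length windows whose sequence occurs exactly once."""
    windows = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = {}
    for window in windows:
        counts[window] = counts.get(window, 0) + 1
    return sum(1 for window in windows if counts[window] == 1) / len(windows)

rng = random.Random(0)  # fixed seed so the toy genome is reproducible

def random_dna(length: int) -> str:
    """Random sequence over the four DNA bases."""
    return "".join(rng.choice("ACGT") for _ in range(length))

# Hypothetical toy genome: unique flanks around two copies of a 300-base repeat.
repeat = random_dna(300)
genome = random_dna(500) + repeat + random_dna(500) + repeat + random_dna(500)

for read_len in (25, 100, 400):
    fraction = unique_window_fraction(genome, read_len)
    print(f"read length {read_len}: {fraction:.3f} of windows are unique")
```

Real genomes contain far longer and more numerous repeats than this toy case, so the effect shown here is a lower bound on the difficulty.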
Genetic Maps
)RU WKHSXEOLFKXPDQJHQRPHSURMHFWDVZHOODV
VHYHUDO RWKHU LQLWLDO HXNDU\RWLF JHQRPHV VXFK DV
QHPDWRGHDQGIUXLWÁ\DIUDPHZRUNEDVHGDSSURDFK
was developed to methodically sequence the genomes.
,QDIUDPHZRUNDSSURDFKDYDULHW\RIJHQRPLFWRROV
DUHLQWHJUDWHG WR ÀUVWIRUPD JHQRPLFVFDIIROGWKDW
can be used to identify targeted regions to sequence
in addition to arranging and orienting sequencing
UHDGV 0H\HUV 6FDODEULQ DQG 0RUJDQWH 
:DUUHQ HW DO  7KH ÀUVW SDUW RI WKH VFDIIROG
is called a molecular genetic map, which involves
WKH SODFHPHQW RI '1$ ODQGPDUNV WKURXJKRXW WKH
JHQRPH E\ REVHUYLQJ KRZ '1$ PDUNHUV VHJUHJDWH
in the offspring of controlled matings or in the case
of humans, utilizing the extant pedigrees of large
families (Kong et al. 2002).
*HQHWLF PDSSLQJ SURMHFWV SURGXFH KXQGUHGV WR
WKRXVDQGVRI'1$PDUNHUV SRVLWLRQHGLQWKHSURSHU
order along chromosomes and separated by relative
frequency-based distances called centimorgans.
Without going into any more detail than this, it is
VXIÀFLHQWWRQRWHWKDWWKHSURFHVVRIJHQHWLFPDSSLQJ
can produce a rather detailed map of a genome
WKDW VKRZV VSHFLÀF ODQGPDUNV DORQJ FKURPRVRPHV
PXFK OLNH D URDGPDS VKRZV FLWLHV SRVLWLRQHG DORQJ
a highway (see Fig. 2 for an example of a genetic
map). While genetic maps can be rather detailed, the
GLVWDQFHEHWZHHQODQGPDUNVLVQRWDSK\VLFDOGLVWDQFH
that can be measured in actual base pairs of DNA,
but rather represents a centimorgan unit which is a
relative distance based on frequency of recombination
EHWZHHQOLQNHGFKURPRVRPDOVLWHV
Fig. 2. Hypothetical genetic map showing sequence tagged
sites (STS) or genetic markers with recombination-based
distances between them demarcated in centimorgans
(cM), also referred to as map units. Genetic marker
nomenclature is diverse; the STS usage in this figure is
for illustration purposes.
Physical (Contig-Based) Clone Maps
The second key component of a genomic framework
is a physical map, often referred to as a contig-based
clone map which provides literal physical distances
between points in the genome (Meyers, Scalabrin, and
Morgante […]; Warren et al. 2006). Cloning DNA
fragments was a technology first developed in the
early 1970s shortly after the discovery of restriction
enzymes, proteins that cut DNA at specific sequence
sites. In cloning DNA, the restriction fragments of the
target organism’s DNA are placed in a small piece of
engineered circular DNA called a plasmid.
These plasmids are then transferred into lab
strains of E. coli where they are maintained,
replicated, and frozen for storage. The cloned
DNA can be placed in arrayed sets of clones in
microtiter plates called libraries.
These libraries are often frozen at extremely
low temperatures (–60° to –80°C) and can be
stored for years or discarded following their
use as sequencing reagents. Early bacterial
cloning systems only allowed for the cloning of
small DNA fragments of no more than 10,000
bases (10 kb). Later attempts at cloning large
DNA fragments that would facilitate the
representation of entire genomes at redundant
levels in single libraries were initially made
using yeast as a cloning vector, but the yeast
V\VWHP ZDV WHFKQLFDOO\ FKDOOHQJLQJ GLIÀFXOW
to automate and produced libraries with high
levels of chimeric clones.
The revolution in large fragment DNA cloning
ZDV ÀUVW UHSRUWHG LQ  DQG GHVFULEHG D QHZ
type of single-copy plasmid vector called a Bacterial
$UWLÀFLDO&KURPRVRPH %$&6KL]X\DHWDO
The BAC system allowed for the cloning of very large
SLHFHV RI '1$  WR NE XVLQJ HVWDEOLVKHG E.
coli SURWRFROV ZLWK RQO\ PRGHUDWH PRGLÀFDWLRQ ,Q
BAC cloning, the target substrate represents size-
selected large fragment portions of partially digested
DNA. The large partially digested fragments provide
the ability to contiguously assemble overlapping
clones into a genomic physical map. Given this level
of cloning capacity, BAC libraries that represented a
10-fold redundant coverage (or more) of a large
JHQRPHOLNHWKDWRIKXPDQVFRXOGEHGHYHORSHG7KH
ÀUVW UHSRUWHG XVH RI %$& OLEUDULHV ZDV IRU KXPDQ
DNA, but the technology was subsequently utilized
for many animal and plant taxa.
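The relationship between insert size, genome size, and library redundancy is simple arithmetic, sketched below in Python. The genome size (3 billion bases) and average insert size (150 kb) are round-number assumptions chosen for illustration, not values taken from the article, and the function names are hypothetical. The second function uses the standard probability that a given locus is present in at least one randomly drawn clone.

```python
import math

def clones_for_coverage(genome_bp: int, insert_bp: int, fold: float) -> int:
    """Number of clones whose combined inserts equal `fold` genome equivalents."""
    return math.ceil(fold * genome_bp / insert_bp)

def probability_locus_cloned(genome_bp: int, insert_bp: int, n_clones: int) -> float:
    """Chance that a given locus lies in at least one of n random clones."""
    return 1.0 - (1.0 - insert_bp / genome_bp) ** n_clones

GENOME_BP = 3_000_000_000   # assumed round figure for a human-sized genome
INSERT_BP = 150_000         # assumed average BAC insert size

n_clones = clones_for_coverage(GENOME_BP, INSERT_BP, fold=10)
p_present = probability_locus_cloned(GENOME_BP, INSERT_BP, n_clones)
print(f"{n_clones:,} clones give 10-fold coverage; P(locus present) = {p_present:.5f}")
```

The calculation assumes clones are drawn uniformly at random, which real libraries only approximate.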
While BAC libraries could be applied to a variety
of genomic applications, their primary utility was in
the development of contig-based clone maps that could
be integrated with genetic maps to form an elaborate
SK\VLFDOJHQHWLF IUDPHZRUN IRU JHQRPH VHTXHQFLQJ
0H\HUV6FDODEULQDQG0RUJDQWH:DUUHQHW
al. 2006). In developing a contig-based clone map,
WKH FORQHV LQ D %$& OLEUDU\ DUH ÀUVW ÀQJHUSULQWHG
meaning that the DNA of each clone fragment is
systematically cut with one or more restriction
enzymes. The fragments are then separated based
on size through a process called electrophoresis. The
patterns of fragmentation are then digitized and
SODFHGLQDGDWDEDVHRIFORQHÀQJHUSULQWV&ORQHVZLWK
VKDUHG IUDJPHQWDWLRQ SDWWHUQV ÀQJHUSULQWV DUH
computationally assembled into sets of overlapping
clones to form large reconstructed sections of
FKURPRVRPHVÀJ
Through a process of tagging the BAC clones in
D SK\VLFDO PDS ZLWK FRUUHVSRQGLQJ PDUNHUV IURP
a genetic map, based on sequence similarity, the
physical map could be integrated with the genetic
PDSÀJ.QRZOHGJHRI%$&FORQH DQGIUDJPHQW
size in a physical-genetic map allows for the
calculation of actual physical distance or base pairs
RI'1$ EHWZHHQJHQHWLFPDUNHUV7KLVLVDQDORJRXV
to determining the actual mileage between cities on a
map. Conversely, the clone-based contigs themselves
can now be positionally oriented in the genome based
RQWKHOLQNDJHJURXSVFRUUHVSRQGLQJWRFKURPRVRPHV
in the genetic map. By assembling the clone contigs
LQWRWKHLUUHVSHFWLYHOLQNDJHJURXSVEDVHGRQWKHLU
DVVRFLDWLRQWRFRUUHVSRQGLQJJHQHWLFPDUNHUVHQWLUH
chromosomes can be reconstructed. The end result
is a highly accurate map of the entire genome of an
RUJDQLVP WKDW FDQ VHUYH DV D IUDPHZRUN WRRO IRU D
YDULHW\RIDSSOLFDWLRQVLQFOXGLQJWKHLGHQWLÀFDWLRQRI
genes of interest, targeted genome sequencing, and
complete genome sequencing.
Sequencing Strategies Developed
in the Human Genome Project
7KHSXEOLFVHFWRURIWKHKXPDQJHQRPHSURMHFWZDV
a consortium of laboratories around the world located
ODUJHO\LQWKH86$(QJODQG)UDQFHDQG-DSDQ8VLQJ
the physical-genetic map, the various labs were each
DVVLJQHG DVSHFLÀFVHWRI RYHUODSSLQJ %$&FORQHV WR
sequence in a methodical clone-by-clone highly ordered
strategy. Multiple locations on chromosomes were
being sequenced at the same time, each initiated by a
single BAC called a seed clone. Despite this technology,
there are still regions of the human genome which
remain unsequenced due to their highly repetitive and
variable nature. These regions are so large that they
cannot be bridged by a BAC clone.
Fig. 3. Development of a physical framework for an isolated
section of a hypothetical genome. The illustration shows how
overlapping large fragment clones form a contig. The addition
of genetic markers to the contig is also illustrated to form the
physical-genetic genomic framework. Entire chromosomes and
genomes can be assembled via the development of these contigs
which are oriented and positioned with the genetic markers.
Each BAC clone selected for genome sequencing
became the chief substrate for DNA sequencing. This
was accomplished by the physical shearing of the 100
WR NE %$& FORQH IROORZHG E\ HQGUHSDLU RI WKH
fragments, and cloning into a small-insert plasmid
sequencing vector. The BAC sub-clones are then
production sequenced en masse until about an 8- to
10-fold redundant coverage of the original BAC
clone has been achieved. Following assembly of the
production sequence reads, in most cases there
remain gaps in the sequence that need to be closed
LQDSURFHVVFDOOHG´ÀQLVKLQJµRU´JDSFORVXUHµ*DS
closure often requires the use of a variety of techniques
and chemistries and typically costs as much or more
than the original production sequencing operation.
In cases where a gap could not be closed with actual
DNA sequence, it was often bridged with paired reads
from both sides of the gap with a large DNA clone of
NQRZQVL]H
This whole process of methodical genome
sequencing is quite involved, time consuming, and
expensive. As a result, government DNA sequencing
funding strategies were changed after the human
genome and several model genomes were completed.
Whole Genome Shotgun Sequencing (WGSS)
In contrast to the effort by the public sector, which
GLGQRWSURGXFHDZRUNDEOHGUDIWRIWKHJHQRPHXQWLO
 DQG D QHDUFRPSOHWH ÀQDO YHUVLRQ LQ 
research scientist Craig Venter in the private sector
(Celera Genomics), proposed a more rapid approach
,VWUDLO HW DO  9HQWHU HW DO  :HEHU DQG
Myers 1997). Venter’s method employed a technique
FDOOHG´ZKROHJHQRPH VKRWJXQVHTXHQFLQJ µ:*66
in which construction of an initial genetic-physical
IUDPHZRUN PD\ EHE\SDVVHG,Q VXFK D SURMHFWWKH
entire genome is fragmented en masse and cloned
as large batches of random fragments. To improve
the process, multiple types of plasmid vectors and
fragment sizes are cloned, providing multiple libraries
for sequencing. The clones in each of the libraries are
then production sequenced en masse to certain levels
of genomic redundancy based on research funds.
The caveat of the propaganda surrounding Venter’s
´ZKROHJHQRPH VKRWJXQ VHTXHQFLQJµ HIIRUW ZDV WKH
fact that his laboratory still relied on the use of the
SK\VLFDOJHQHWLF IUDPHZRUN GHYHORSHG E\ WKH SXEOLF
VHFWRURIWKHKXPDQSURMHFWWRVRUWRXWWKHKXJHPDVV
of random DNA sequences and sequencing contigs.
7KLVFDYHDWHYHQWKRXJKFOHDUO\RXWOLQHGLQWKHRIÀFLDO
MRXUQDO SXEOLFDWLRQ 9HQWHU HW DO  ZDV QHYHU
widely discussed in the popular media. Nevertheless,
WKH FRQFHSW RI ´ZKROHJHQRPH VKRWJXQ VHTXHQFLQJµ
became quite popular and was subsequently used as
a cost-effective strategy for genome sequencing for a
wide variety of other plant and animal genomes.
Chimpanzee Shotgun Sequence
and the Human Framework
:KLOH RQH ZRXOG WKLQN WKDW WKH EDVLF WHFKQLFDO
process of producing a genomic sequence would be free
of any philosophical constraints, this is not always
the case. Perhaps the most dramatic example of this
LV WKH FKLPSDQ]HH JHQRPH SURMHFW ZKLFK FRQVLVWHG
of an initial 5-fold redundant shotgun coverage
(The Chimpanzee Genome Consortium 2005). In
FRQWUDVWWRWKH KXPDQ JHQRPHSURMHFWIXQGLQJZDV
OLPLWHG DQG WKH SURMHFWLQLWLDOO\ HPSOR\HG D ´ZKROH
JHQRPHVKRWJXQVHTXHQFLQJµVWUDWHJ\WKDWSURGXFHG
a 5-fold redundant coverage. However, to organize
the millions of sequencing reads, the human genome
SK\VLFDOIUDPHZRUNZDVLQLWLDOO\XVHGDVDVFDIIROG,Q
other words, the chimp genomic sequence was sorted
out and organized according to the human genomic
IUDPHZRUN XQGHU WKH DVVXPSWLRQ WKDW FKLPSDQ]HH
and human are genetically similar, which evolutionists
assume is due to a shared common ancestor about one
to six million years ago.
One concern regarding the use of the human
JHQRPH DV D IUDPHZRUN IRU FKLPSDQ]HH LV WKH
SRVVLELOLW\WKDWWKHUHPD\EHDPDMRUVL]HGLVFUHSDQF\
8VLQJ ÁRZ F\WRPHWU\ WR HVWLPDWH QXFOHDU '1$
content, the human genome is widely used as a
calibration standard at 7.0 picograms for a 2C diploid
cell (Dolezel and Greilhuber 2010), and listed at
3.5 pg for a 1C equivalent at www.genomesize.com. At
WKHVDPHZHEVLWHWKHUHDUHÀYHUHIHUHQFHGHVWLPDWHV
for chimpanzee which range from 3.46 to 3.85 for
& D  WR   LQFUHDVH LQ JHQRPH VL]H FRPSDUHG
to human. The reported average estimated genome
size increase of chimpanzee over human is about 5%.
Interestingly, in 2009, statistics for the chimpanzee
genome sequencing effort posted on the Washington
University Genome Center web site indicated that the
total amount of contiguously assembled chimpanzee
sequence was close to 20% more than the same
parameter for the human genome. However, the
sequencing statistics for chimpanzee were removed
from the web in 2010 even though a new build version
was announced. At the time of this writing (2011),
no current chimpanzee genome assembly statistics
are listed online although DNA sequence and BAC
FORQHÀQJHUSULQWGDWDDUHIUHHO\ DYDLODEOHIRUSXEOLF
download.
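The genome size comparison above can be checked with a short calculation. The Python sketch below uses only the 1C values quoted in the text (3.5 pg for human; 3.46 and 3.85 pg as the ends of the chimpanzee range). The picogram-to-base-pair factor of 0.978 billion bases per picogram is a commonly used conversion that does not appear in the article, and the function names are hypothetical.

```python
HUMAN_1C_PG = 3.5                 # human 1C value quoted in the text
CHIMP_1C_RANGE_PG = (3.46, 3.85)  # lowest and highest chimpanzee estimates quoted
GBP_PER_PG = 0.978                # assumed conversion: 1 pg of DNA is about 0.978 Gbp

def percent_difference(value: float, reference: float) -> float:
    """Signed percentage difference of value relative to reference."""
    return 100.0 * (value - reference) / reference

def picograms_to_gbp(picograms: float) -> float:
    """Convert a DNA mass in picograms to billions of base pairs."""
    return picograms * GBP_PER_PG

for estimate in CHIMP_1C_RANGE_PG:
    diff = percent_difference(estimate, HUMAN_1C_PG)
    print(f"{estimate} pg = {picograms_to_gbp(estimate):.2f} Gbp, {diff:+.1f}% vs human")
```

Only the two ends of the range are available here, so the average of the five individual estimates cannot be recomputed from the text alone.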
Perhaps the most startling human-chimpanzee
genome data of recent times are the results
from comparing DNA sequence from human
and chimpanzee Y-chromosomes (Hughes et al.
2010). Specifically, this recent study involved the
comparison of the male-specific regions of the Y
chromosome (MSY). While much of the human Y
chromosome has been sequenced, only the MSY
region of the chimpanzee Y chromosome was
sequenced to a high level of completion and then
compared to the corresponding region in the human
Y-chromosome.
What made this study unique was that the MSY
region in chimpanzee was largely assembled and
constructed based on a clone-based physical map for
FKLPSDQ]HH QRW WKH KXPDQ SK\VLFDO IUDPHZRUN
This allowed for a relatively reasonable comparison
of the MSY sequence between human and chimp, the
ÀUVW WLPH VXFK DQ DSSDUHQWO\ XQELDVHG ODUJHVFDOH
comparison had actually been done. The results were
completely unexpected and radically contradicted
the standard evolutionary dogma which pervades
WKH VFLHQWLÀF FRPPXQLW\ 7KH UHVHDUFK SDSHU WLWOH
was well chosen and a very accurate one-sentence
VXPPDU\ RI WKH SURMHFW ´&KLPSDQ]HH DQG KXPDQ
FKURPRVRPHV DUHUHPDUNDEO\GLYHUJHQWLQVWUXFWXUH
DQG JHQH FRQWHQWµ 3HUKDSV WKH PRVW LQWHUHVWLQJ
highlight of the study was the difference in gene
content. While the non-genic areas between human
and chimp in the MSY region were also dramatically
different, the human MSY contained 78 genes while
the chimpanzee only contained 37, a 53% difference
in total gene content alone. In addition, the human
MSY contained 27 different classes of genes (gene
families/categories) while chimpanzee contained
only 18, meaning that nine entire classes or gene
categories were not even present in the chimpanzee
MSY region. Perhaps the best way to summarize the
unprecedented project is to quote some lines from the
original research report.
+HUH ZH ÀQLVKHG VHTXHQFLQJ RI WKH PDOHVSHFLÀF
region of the Y chromosome (MSY) in our closest
living relative, the chimpanzee, achieving levels of
accuracy and completion previously reached for the
human MSY. By comparing the MSYs of the two
species we show that they differ radically in sequence
structure and gene content . . . The chimpanzee MSY
contains twice as many massive palindromes as the
human MSY, yet it has lost large fractions of the
MSY protein-coding genes and gene families present
in the last common ancestor (excerpt from abstract,
Hughes et al. 2010, p. 536).
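The gene counts reported for the two MSY regions can be compared with simple arithmetic. The Python sketch below uses only figures stated in the text (78 and 37 genes; 27 gene classes in human, with nine of them absent in chimpanzee). The helper function is hypothetical, and the percentages depend on which species is taken as the baseline; human is used here.

```python
def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

human_genes, chimp_genes = 78, 37
human_classes = 27
chimp_classes = human_classes - 9  # nine classes are reported absent in chimpanzee

genes_retained = percent(chimp_genes, human_genes)
genes_fewer = percent(human_genes - chimp_genes, human_genes)
classes_retained = percent(chimp_classes, human_classes)

print(f"chimpanzee MSY genes: {genes_retained:.1f}% of the human count")
print(f"difference in gene count: {genes_fewer:.1f}% of the human count")
print(f"chimpanzee MSY gene classes: {chimp_classes} ({classes_retained:.1f}% of human)")
```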
A number of autosomal comparative studies
have been done using both coding and non-coding
sequences. Two of the most prominent studies are
ZRUWKPHQWLRQLQJEULHÁ\7KHÀUVWLVDFRPSDUDWLYH
study between human chromosome 21 and chimpanzee
FKURPRVRPH  VRFDOOHG KRPRORJV :DWDQDEH HW
al. 2004). The chimpanzee sequence was somewhat
limited at the time, but in contrast to the recent Y-
FKURPRVRPHSURMHFWDSK\VLFDOPDSIRUFKLPSDQ]HH
was not utilized. Large insert clones were selected
by screening libraries with human probes and only
WKH PRVW KLJKO\ DOLJQDEOH KXPDQOLNH FORQHV ZHUH
selected. These hand selected and sequenced clones
ZHUH RULHQWHG RQ WKH KXPDQ SK\VLFDO IUDPHZRUN
with the non-alignable sections and gaps ignored.
As a result, the data regarding genomic similarity
was biased or constricted to those areas which were
previously determined to be strong candidates for
similarity.
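How the treatment of gaps changes a reported identity figure can be shown with a small worked example. The Python sketch below is illustrative only: the two aligned sequences are invented, and the function simply contrasts counting gapped columns as differences with leaving them out of the calculation.

```python
def percent_identity(aligned_a: str, aligned_b: str, count_gaps: bool) -> float:
    """Percent identity of two aligned sequences, with '-' marking a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == "-" and base_b == "-":
            continue  # column carries no information
        if base_a == "-" or base_b == "-":
            if count_gaps:
                compared += 1  # a gapped column counts as a difference
            continue
        compared += 1
        if base_a == base_b:
            matches += 1
    return 100.0 * matches / compared

# Invented alignment: 12 matches, 1 substitution, 3 gapped columns.
seq_a = "ACGTACGT--ACGTAC"
seq_b = "ACGTTCGTGGACG-AC"

print(f"gaps ignored: {percent_identity(seq_a, seq_b, count_gaps=False):.1f}%")
print(f"gaps counted: {percent_identity(seq_a, seq_b, count_gaps=True):.1f}%")
```

The same alignment yields two quite different figures, which is why the gap convention needs to be stated alongside any similarity percentage.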
Although the authors provide interesting data for
the selected regions they analyzed, they do not commit
WRDQ\GHÀQLWLYHOHYHORIRYHUDOOVHTXHQFHVLPLODULW\
other than to say that 83% of the translated protein
coding regions would produce differences in protein
sequence between human and chimp. Considering
that only similar DNA clones were selected, the fact
that 83% of the actual coding sequence would produce
different proteins is indicative of more dissimilarity
WKDQ VLPLODULW\ :H DOVR QRZ NQRZ WKDW SURWHLQ
translation is a complicated mix of non-protein
coding DNA regulation features where a single
gene under differential control can produce a wide
YDULHW\RIWUDQVFULSWV%DUDVKHWDO:DQJDQG
Burge 2008). Nevertheless, evolutionists will cite the
Watanabe et al. (2004) study as a conclusive genomic
effort for high sequence similarity.
The second study of interest is a whole genome
type of comparison using chimpanzee genomic
sequences derived from the ends of large insert clones,
called BAC-end sequences (BES) (Britten 2002).
7KH FKLPSDQ]HH VHTXHQFHV DUH ÀUVW VFUHHQHG IRU
DQ\WKLQJWKDW·VKXPDQOLNHDQGKLJKO\DOLJQDEOHDQG
then the best candidates are passed along for more
detailed analyses. It should also be noted that such a
procedure eliminates large portions of important non-
coding regulatory sequences. Sequences of selected
interest are then, once again, positioned using the
KXPDQ SK\VLFDO IUDPHZRUN DQG WKHQ HYDOXDWHG IRU
similarity.
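The effect of pre-screening on a summary statistic can be sketched with synthetic numbers. The Python snippet below is purely illustrative: the per-fragment identities are drawn from an arbitrary uniform distribution invented for this example and say nothing about real genomes. It only shows that averaging over fragments that passed a similarity screen must give a higher figure than averaging over all fragments.

```python
import random

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def screen(identities, threshold):
    """Keep only fragments whose identity meets the screening threshold."""
    return [value for value in identities if value >= threshold]

rng = random.Random(1)  # fixed seed for reproducibility
# Synthetic per-fragment identities (invented for illustration, not real data).
identities = [rng.uniform(0.60, 1.00) for _ in range(10_000)]
screened = screen(identities, threshold=0.95)

print(f"all fragments:      n={len(identities)}, mean identity={mean(identities):.3f}")
print(f"screened fragments: n={len(screened)}, mean identity={mean(screened):.3f}")
```

How large the effect is in a real study depends entirely on the actual distribution of fragment identities and on the screening threshold used.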
The Y-chromosome project evaluated only a single
isolated portion of the Y-chromosome, the only
part that was readily alignable, but was novel in that
it utilized an actual physical framework derived
for the chimpanzee genome to isolate and target
sequence for comparison. The section that was
chosen for the Y-chromosome effort also appears
to be the most readily amenable to comparative
study. A physical map assembly has recently been
reported for chimpanzee (Warren et al. 2006).
However, the only published genomic sequence
comparison between human and chimpanzee using
VSHFLHV VSHFLÀF SK\VLFDO IUDPHZRUNV KDV EHHQ WKH
<FKURPRVRPH SURMHFW ,W ZRXOG EH TXLWH YDOXDEOH
WR HYROXWLRQLVWV DQG FUHDWLRQLVWV DOLNH LI XQELDVHG
large-scale autosomal comparisons between human
and chimpanzee could be completed now that the
resources are available. In fact, the results of the Y-
chromosome study demand that similar approaches
EHWDNHQIRUWKHUHVWRIWKHJHQRPH
Implications for Next Generation Sequencing
Technologies
Massively parallel DNA sequencing representing
next generation technologies refers to literally
thousands of individual reactions conducted
simultaneously by a single machine (see Mardis 2008
for a technological review). The different proprietary
DNA sequencing systems being utilized are based on a
VLQJOHJHQHUDOFRQFHSWWKHDPSOLÀFDWLRQRILQGLYLGXDO
DNA strands in a massively parallel (simultaneous)
fashion. The strand being copied from the template
fragment in each individual reaction is systematically
interrogated by high precision optics such that the
consecutive addition of nucleotide bases up to a
threshold level is determined. In general, for each
WHFKQRORJ\ WKH PRUH EXON '1$ VHTXHQFH REWDLQHG
in a single machine run (~6 to 8 hours), the shorter
the individual read lengths. As mentioned previously,
current systems typically produce 25 to 100 bases
of high quality sequence with some companies now
claiming routine reads up to 400 bases. Despite
WKH PDUNHG UHGXFWLRQ LQ UHDG OHQJWK FRPSDUHG WR
Sanger-style methodologies (still commonly used),
the two primary advantages include: no DNA cloning/
bacterial manipulation is required and the production
of megabase quantities of DNA sequence in a single
run.
The new massively parallel sequencing
technology has proven ideal for the sequencing of
microbial genomes, whole microbial communities
(metagenomics), diverse types of transcriptomes, and
eukaryotic genome resequencing for polymorphism
detection (genetic variation). The DNA substrate for
these technologies is often randomly sheared whole
JHQRPH VKRWJXQ IUDJPHQWV VLPLODU WR WKH ÀUVW
step of DNA preparation used in WGSS discussed
previously. Because of this, the same problems apply to
the resulting genomic sequences. In fact, the problem
of sorting out and aligning sequences in the genome is
even worse because of the short read lengths. In other
words, you will need an existing physical framework to
sort out the data, particularly in eukaryotic genomes
like human. While the new sequencing technologies
are extremely innovative, there are caveats that must
be understood to properly utilize them.
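The effect of read length on placement can be illustrated with a minimal sketch. It is not taken from any sequencing pipeline: the toy sequence, the repeat, and the function names `count_placements` and `ambiguous_fraction` are all hypothetical and chosen only for illustration. The sketch counts how many reads of a given length match more than one position in a sequence that contains two copies of a repeat.

```python
def count_placements(genome: str, read: str) -> int:
    """Count every (possibly overlapping) exact occurrence of read in genome."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


def ambiguous_fraction(genome: str, read_length: int) -> float:
    """Fraction of all error-free reads of this length that match more than one position."""
    total = len(genome) - read_length + 1
    if total <= 0:
        raise ValueError("read_length exceeds genome length")
    ambiguous = sum(
        1
        for i in range(total)
        if count_placements(genome, genome[i:i + read_length]) > 1
    )
    return ambiguous / total


# Hypothetical toy "genome": unique flanking sequence around two copies of a repeat.
REPEAT = "ACGTACGTAC"
TOY_GENOME = "TTGCA" + REPEAT + "GGATCCT" + REPEAT + "CATGA"

if __name__ == "__main__":
    for read_length in (4, 8, 12):
        fraction = ambiguous_fraction(TOY_GENOME, read_length)
        print(f"read length {read_length}: {fraction:.2f} of reads are ambiguous")
```

In this toy case, reads longer than the repeat can be placed uniquely, while shorter reads cannot. Real eukaryotic repeats are far longer than current short reads, which is why an independent framework is needed to order the data.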
Conclusion
In the early days of biotechnology, it became
apparent that humans, apes, and other mammals
shared protein sequences that were very similar. In
fact, many human proteins exhibit high amino acid
similarity in both ape and non-primate mammalian
taxa (Clamp et al. 2007). One of the primary issues
of concern in various evolutionary studies is that
PRVW VFLHQWLVWV RQO\ WDNH LQWR DFFRXQW VLPLODULWLHV
between biological sequences present in both human
and apes that are pre-selected and already considered
similar at some level. Also, DNA sequences that do
not align well are often discarded or gaps may not
be accounted for in alignment analyses. Another
important consideration is whether an expressed
genomic product does the same thing in humans
as it does in apes, and whether it is expressed in the same way.
These factors are often not given proper recognition.
A majority of the public and scientific community are
not aware of these caveats and still hold to the
dogma that the human genome is 98 to 99% similar
to chimpanzee, which is most likely not the case. The
fact is that major differences between the structure of
the human and chimpanzee genomes are now being
documented as the genomic resources improve.
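How much the treatment of gaps can change a reported similarity figure can be shown with a small sketch. The two aligned strings below are invented for illustration and are not real human or chimpanzee data, and the function name `identity_stats` is hypothetical. Counting each gapped column as a difference is only one possible convention (another counts each insertion or deletion event once), so the numbers show the direction of the effect rather than any published value.

```python
def identity_stats(aligned_a: str, aligned_b: str) -> dict:
    """Percent identity of a pairwise alignment, with and without gap columns.

    Both inputs must be the same length, with '-' marking a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = mismatches = gap_columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # a column that is a gap in both rows carries no information
        if a == "-" or b == "-":
            gap_columns += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1

    ungapped = matches + mismatches
    compared = ungapped + gap_columns
    if ungapped == 0 or compared == 0:
        raise ValueError("alignment has no comparable columns")

    return {
        # Gapped columns are dropped from the denominator.
        "identity_ignoring_gaps": matches / ungapped,
        # Each gapped column is counted as a difference.
        "identity_counting_gaps": matches / compared,
    }


# Invented toy alignment: one substitution and two 2-base indels.
SEQ_A = "ACGTACGTAC--GTACGTAC"
SEQ_B = "ACGTTCGTACGGGTAC--AC"

if __name__ == "__main__":
    for name, value in identity_stats(SEQ_A, SEQ_B).items():
        print(f"{name}: {value:.4f}")
```

The same alignment gives a noticeably lower identity once gapped columns are included in the denominator. Sequence that is discarded before alignment never enters either calculation.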
When evaluating comparisons between genomes
using DNA sequence, it is important to understand
the nature of how that sequence was obtained and
bioinformatically manipulated. It is not uncommon
to arrange the DNA sequence of a genome for which
OLWWOHLVNQRZQE\XVLQJWKHJHQRPHRIDK\SRWKHWLFDO
HYROXWLRQDU\ FRPPRQ DQFHVWRU RU µFORVH UHODWLYHµ
that has better-developed genomic resources. This
obviously introduces an evolutionary bias at several
levels. Furthermore, sequence comparisons that
have yielded high similarities have typically relied on DNA
clones and regions that were screened and selected beforehand based on
some level of similarity. While many DNA sequences
in eukaryotic genomes are difficult to work with due
to their repetitive nature, they also contain critical
regulatory features that are now appearing to be
MXVWDVLPSRUWDQWDVWKHJHQHVWKHPVHOYHVIRUSURSHU
function. Understanding the technology used to
produce a genomic DNA sequence product is critical
SULRUWRPDNLQJDQ\GHÀQLWLYH FRQFOXVLRQVDERXWWKH
data in question.
Most biologists among creationists and evolutionists
would expect DNA sequence similarities between
humans and apes due to shared anatomical and
physiological features. However, it is very likely that
earlier comparative genomic studies, constrained
by limited resources and propelled primarily by
evolutionary dogma, need to be repeated using better
tools and less bias.
References
Barash, Y. et al. 2010. Deciphering the splicing code. Nature
465:53–59.
Britten, R. J. 2002. Divergence between samples of chimpanzee
and human DNA sequences is 5% counting indels.
Proceedings of the National Academy of Sciences of the
United States of America 99, no. 21:13633–13635.
Clamp et al. 2007. Distinguishing protein-coding and
noncoding genes in the human genome. Proceedings of
the National Academy of Sciences of the United States of
America 104, no. 40:19428–19433.
Dolezel, J. and J. Greilhuber. 2010. Nuclear genome size: Are
we getting closer? Cytometry Part A 77, no. 7:635–642.
Fraser, C. M. et al. 1995. The minimal gene complement of
Mycoplasma genitalium. Science 270, no. 5235:397–403.
Hobolth, A. et al. 2011. Incomplete lineage sorting patterns
among human, chimpanzee, and orangutan suggest recent
orangutan speciation and widespread selection. Genome
Research 21, no. 5:349–356.
Hughes, J. F. et al. 2010. Chimpanzee and human Y
chromosomes are remarkably divergent in structure and
gene content. Nature 463:536–539.
International Human Genome Sequencing Consortium. 2001.
Initial sequencing and analysis of the human genome.
Nature 409:861–920.
International Human Genome Sequencing Consortium. 2004.
Finishing the euchromatic sequence of the human genome.
Nature 431:931–945.
Istrail et al. 2004. Whole-genome shotgun assembly and
comparison of human genome assemblies. Proceedings of
the National Academy of Sciences of the United States of
America 101, no. 7:1916–1921.
Kong, A. et al. 2002. A high-resolution recombination map of
the human genome. Nature Genetics 31:241–247.
Mardis, E. R. 2008. Next-generation sequencing methods.
Annual Review of Genomics and Human Genetics 9:
387–402.
Meyers, B. C., S. Scalabrin, and M. Morgante. 2004. Mapping
and sequencing genomes: Let’s get physical. Nature Reviews
Genetics 5, no. 8:578–588.
Mushegian, A. R. and E. V. Koonin. 1996. A minimal gene set
for cellular life derived by comparison of complete bacterial
genomes. Proceedings of the National Academy of Sciences
of the United States of America 93, no. 19:10268–10273.
Rogers, Y.-H. and J. C. Venter. 2005. Massively parallel
sequencing. Nature 437:326–327.
Sanger, F., S. Nicklen, and A. R. Coulson. 1977. DNA
sequencing with chain-terminating inhibitors. Proceedings
of the National Academy of Sciences of the United States of
America 74, no. 12:5463–5467.
Shizuya, H. et al. 1992. Cloning and stable maintenance of 300-
kilobase-pair fragments of human DNA in Escherichia coli
using an F-factor-based vector. Proceedings of the National
Academy of Sciences of the United States of America 89, no.
18:8794–8797.
Taylor, J. 2009. Not a chimp: The hunt to find the genes that
make us human. Oxford University Press, New York, New
York.
The Chimpanzee Sequencing and Analysis Consortium. 2005.
Initial sequence of the chimpanzee genome and comparison
with the human genome. Nature 437:69–87.
Venter, J. C. et al. 2001. The sequence of the human genome.
Science 291, no. 5507:1304–1351.
Wang, Z. and C. B. Burge. 2008. Splicing regulation: From a
parts list of regulatory elements to an integrated splicing
code. RNA 14:802–813.
Warren, R. L. et al. 2006. Physical map-assisted whole-genome
shotgun assemblies. Genome Research 16:768–775.
Watanabe et al. 2004. DNA sequence and comparative analysis
of chimpanzee chromosome 22. Nature 429:382–388.
Weber, J. L. and E. W. Myers. 1997. Human whole-genome
shotgun sequencing. Genome Research 7:401–409.
Wildman, D. E. et al. 2003. Implications of natural selection
in shaping 99.4% nonsynonymous DNA identity between
humans and chimpanzees: Enlarging genus Homo.
Proceedings of the National Academy of Sciences of the
United States of America 100, no. 12:7181–7188.