ArticlePDF AvailableLiterature Review

The Human Genome: An Introduction

  • Janssen Pharmaceutica, Beerse, Belgium


Content may be subject to copyright.
The Human Genome: An Introduction
Department Pharmacogenomics, Janssen Research Foundation, Beerse, Belgium
The Oncologist 2001;6:100-109
Correspondence: Jeroen Aerssens, Ph.D., Department of Pharmacogenomics, Janssen Research Foundation, Turnhoutseweg 30,
Beerse B-2340, Belgium. Telephone: 32-14-606146; Fax: 32-14-607162; e-mail: Received September
29, 2000; accepted for publication November 28, 2000. ©AlphaMed Press 1083-7159/2001/$5.00/0
Fundamentals of Cancer Medicine
During the past two decades, tremendous progress has
been made in genetics and genomics. Diseases that run in fam-
ilies have been recognized for many centuries, but it was only
in the early 1980s that the first mutations in a gene responsible
for a disease could be identified. Subsequently, numerous dis-
coveries of disease-related mutations in other genes have been
found, initially in rare single-gene disorders, but more recently
also in common disorders such as Alzheimer’s disease and
cancer. Applications of this newly discovered information pro-
vide new opportunities for progression in medical science,
including the design of genetic tests to diagnose or predict
(subtypes of) diseases, the redefinition of diseases and the
understanding of their pathogenesis based on the molecular
mechanisms behind them, and the selection of new target mol-
ecules for drug discovery. The time has now come that such
applications will transfer further toward the day-to-day prac-
tice of the clinician and transform the practice of clinical med-
icine. For this to happen, an appropriate education and
understanding of the basic concepts of genetics and genomics
by clinicians is needed. This article aims to provide a general
introduction of these concepts for clinicians not familiar with
these fields. The list of references and the websites indicated in
the manuscript should encourage the reader to get a broader
appreciation of this research area.
Our inherited information is encoded in a macromolecule
called “DNA” (deoxyribonucleic acid). Basically, our DNA is
like a bar code—it encrypts information. The vast majority of
DNA molecules are stored within the nucleus of the cell and
covered with proteins, together forming chromosomes. DNA
completely dissolves in aqueous solutions (and thus in its nat-
ural environment). When precipitated in alcoholic solvents
(ethanol, isopropanol) during extraction procedures, the DNA
becomes visible as a white viscous clot.
All multicellular organisms, including humans, start
their life as a single cell (the fertilized egg), and almost all
cells of the organism developed from this single cell contain
a full and identical copy of the DNA of this single cell. In
humans, each somatic cell contains approximately 0.006
nanogram of DNA which harbors all the genetic information
needed to develop and function normally. For an adult
human body, this makes a total of about 600 grams of DNA.
For clinical genetic analyses, genomic DNA is usually iso-
lated from leukocytes, although identical DNA could be iso-
lated from virtually any other cell of an individual, as well.
From 5 to10 ml whole blood, approximately 100-200 micro-
grams of DNA can be extracted, which is in most cases more
than sufficient to perform genetic analyses.
The DNA in our cells is divided over a constant number
of chromosomes (46 in humans), each of them with a specific
size and form, as can be observed under the microscope using
specific coloring techniques (e.g., Giemsa staining). Two sex
chromosomes (X and Y) determine the gender of an individ-
ual (XX for females, XY for males); the other 44 so-called
autosomal chromosomes are not different in males and
females. Together, these 46 chromosomes comprise two
nearly identical copies of the whole genome: one copy of the
genome is inherited from the father (via a set of 22 autosomal
Aerssens, Armstrong, Gilissen et al.
chromosomes and an X or Y chromosome) and the other
copy from the mother (via another set of 22 autosomal
chromosomes and one X chromosome).
The autosomal chromosomes derived from the father and
mother are two-by-two homologous: they look similar under
the microscope and comprise the same genes, or eventually,
variants of the same genes. These chromosomes are num-
bered from 1 to 22, mainly based on size (1 being the largest
and 22 the smallest chromosome). Thus, each somatic cell
contains two copies of each of these 22 chromosomes, and
thus two copies of each of the genes located on these chro-
mosomes. One could compare this with a library which con-
tains two copies of each book, although there might
sometimes be different editions of each specific book. In
women, the two X chromosomes are also homologous: one
copy is inherited from their mother and the other copy from
their father. Men, on the contrary, have one X chromosome
inherited from their mother and one Y chromosome inherited
from their father. The Y chromosome is much smaller than
the X chromosome and contains many fewer genes as well.
In germ cells, only one copy of each homologous chro-
mosome is present (thus in total, 22 autosomal and one sex
chromosome). During reproduction, two germ cells (one
egg cell and one sperm cell) combine their genetic infor-
mation so that the offspring will contain two copies of each
chromosome: one from the father and one from the mother
(Fig. 1). Thus, the gender of the offspring, determined by
the combination of the sex chromosomes of the mother
(always X) and the father (X or Y) is completely dependent
on whether the sperm cell contains an X or Y chromosome.
From a structural point of view, the DNA looks like a long
chain of connected letters without any spaces or punctuation
marks (Fig. 2A). The total physical length of all the DNA
chains in each of our cells is approximately two meters, with a
diameter of 0.000002 mm. In order to write the DNA text, the
body has at its disposal four different but related building
blocks (called nucleotides or, more precisely, deoxyribonu-
cleotides): A, C, G, and T, representing respectively adenine,
cytosine, guanine, and thymine. These nucleotides are con-
nected by a deoxyribose-phosphate backbone. Each phosphate
links the hydroxyl group on the 3 carbon atom of a deoxyri-
bose of one nucleotide to the hydroxyl group on the 5carbon
atom in the deoxyribose group of the adjacent nucleotide.
Importantly, all the information content encrypted within the
DNA is in the specific sequence of these nucleotides. The
information stored within the DNA code can be used for trans-
lation into functional activity (i.e., production of proteins) only
in one orientation on the backbone, namely the 5-to-3direc-
tion. Therefore, nucleotide sequences are usually displayed
from the 5-to-3end (from left to right).
Attached to this DNA strand is a second DNA strand
which is the exact complement of the first one. This is pos-
sible because of the complementary chemical structure of
the DNA building blocks. Two kinds of base pairs, often
referred to as complementary base pairs, exist in all DNA:
the As on one strand always pair with Ts on the other strand
(via two hydrogen bonds) while Cs pair with Gs (via three
hydrogen bonds). Thus, if the sequence of one strand is
known, the sequence of the com-
plementary strand can easily be
Figure 1. Schematic overview of the
inheritance of our genetic informa-
tion. Each individual has two copies of
each chromosome (each harboring
one copy of the genes on the chromo-
some). The germ cells comprise only
one of these copies. Through combina-
tion of the genetic material from the
sperm cell and egg cell, the offspring
inherits one copy of each chromosome
from the father and one from the
mother. Assume a particular gene in
the DNA which determines the pheno-
type hair style, and for which two
variants exist (A and G). The geno-
type, which is the combination of the
variants on the two inherited homolo-
gous chromosomes in an individual
(e.g., A/G), will determine the pheno-
type. In the example, the G/G genotype is linked with straight hair, while individuals with the other genotypes (A/G and A/A) have curly hair
(indicating that the G variant, associated with straight hair, is a recessive characteristic).
The Human Genome: An Introduction
derived. The two deoxyribose-phosphate backbones have
opposite 5-to-3 orientations and are wound around each
other to form a double-helix structure.
The size of DNA molecules is noted as the number of base
pairs (bp) or a multiple (1,000 bp = 1 kbp, 1,000 kbp = 1 Mbp).
Our complete library of genetic information is called the
human genome, and comprises somewhat more than three bil-
lion bp (3,000 Mbp), distributed over 22 autosomal chromo-
somes (numbered from 1 to 22) and two sex chromosomes
(X and Y). A printed edition of this sequence would require
approximately one million printed pages with single-line spac-
ing. The elucidation of this genomic DNA sequence is of
extreme interest, as it containsin encrypted formall the
inherited information needed to develop and direct the func-
tioning of the human body. Table 1 summarizes the dimensions
and information content of the human genome in numbers.
The unit of information in the DNA is the gene, which is
a stretch of DNA sequence that contains the code for the pro-
duction of a protein, which is a single piece of the whole
machinery required by the cell to normally function in its
environment. More specifically, each gene comprises all the
detailed instructions that determine the precise composition
of a specific protein, as well as the regulatory instructions
Figure 2. (A) The human genome sequence can be compared with a text lacking any spaces or punctuation marks. It is extremely difficult to
read the genomic text without an analysis tool; even special software programs which have been specifically written to identify meaningful sen-
tences (genes) only have a limited success rate. (B) This is because in between the words (the exons) of the text which form a meaningful sen-
tence (a gene), variable amounts of nonsense letters (the introns) are placed. In fact, more than 90% of the human sequence consists of text from
which the meaning is currently not understood. (C) Variation frequently occurs in the human genome (about one letter differs in every 1,000 let-
ters between the genomic texts of two individuals). This might have consequences on the meaning of the sentence, or eventually make the sen-
tence unreadable. These variations in the DNA are called mutations or polymorphisms (depending on their frequency and on whether there is
or is not a direct link to the cause of a disease).
Table 1. The human genome in numbers
3,000,000,000 nucleotides in the human genome (estimated)
22 autosomal chromosomes and two sex chromosomes (X and Y)
46 chromosomes in each somatic cell (two copies of the whole
30,000120,000 genes in the human genome (estimated)
35,000 nucleotide sequences per gene at the genomic level, including
intronic sequences (on average)
1,500 nucleotides directly coding sequence per gene (on average)
less than 5% of the human genome sequence directly encodes for
four different nucleotides in the genome (adenine, cytosine, guanine,
three nucleotides comprised in a codon which encodes 1 amino acid
one nucleotide difference between two unrelated individuals per
1,000 nucleotides sequence (on average)
Aerssens, Armstrong, Gilissen et al.
that determine when this specific protein will be produced
and in what quantity. The size of a gene at the genomic level
can vary widely (usually between 10,000 and 150,000 bp).
Although most of the genomic DNA sequence is currently
known, there is still a large debate ongoing on the number
of genes present: estimates of experts in the field vary
between 30,000 and 120,000, with an average around 60,000
( Very intriguingly, the
regions in the genomic DNA which encode for proteins
account for about 150 million nucleotides, which is less than
5% of the complete human genome. Apart from some of the
DNA sequences which comprise instructions needed to reg-
ulate the expression of the genes and specific instructions for
the chromosomes to function correctly, the significance of
the other 95% of the genome is at present largely unknown
and/or poorly understood. This latter part of the genome con-
tains large numbers of highly repeated DNA sequence fami-
lies. Two major types of repeat families can be distinguished:
tandemly repeated DNA and interspersed repetitive DNA.
Tandemly repeated DNA families consist of long or short
arrays of DNA repeat units, with the repeat being a simple or
moderately complex sequence (size usually between 2 and
100 bp). Depending on the size of arrays of repeat units, this
is called satellite DNA (>100 kbp), minisatellite DNA (0.1-
20 kbp), or microsatellite DNA (<150 bp). Interspersed
repetitive DNA consists of individual repeat units which are
not clustered at a specific location on a chromosome, but are
dispersed at numerous locations. Among these are the SINEs
(short interspersed nuclear elements) and the LINEs (long
interspersed nuclear elements). Well-known examples are
Alu repeats (SINE with full-length of 280 bp; approximately
1,000,000 copies in the human genome) and LINE-1 or L1
element (LINE with full-length of 6.1 kbp; approximately
80,000 copies in the human genome).
In 1987, a worldwide scientific effort called the Human
Genome Project was initiated to unravel the complete DNA
sequence of the human genome. Recently, a first draft
covering 85%-90% of the complete human genome
sequence (3.12 billion bp) has been announced simultane-
ously by scientists of the publicly funded Human Genome
Project (;;;
and the private company Celera Genomics (Rockville, MD; As a consequence of the strategies used to
determine the human genome sequence, experts anticipate
that it will be another two years before the complete human
genome sequence will be known with a confidence of more
than 99.99% [1]. Although the complete sequence of the
whole human genome will soon be known, it is expected that
it will take many more decades before all this information
(i.e., identification of all genes and their regulation, signifi-
cance of genetic variations, etc.) will be fully understood.
Nevertheless, the scientific importance of the achievements
reached so far by the Human Genome Project can hardly be
overestimated and is at least of the same order of magnitude
as the Apollo lunar program.
The main role of the DNA in the cell is to permanently
store and make available all the information needed to regulate
each of the activities in the cell. The production of proteins
which are the functionally active molecules in the celltakes
place in the cytoplasm of the cell. Since the instructions for how
to make the proteins is within the DNA which is stored in the
nucleus, an intermediate molecule (messenger RNA, or
mRNA) is used to transfer this information from the nucleus to
the cytoplasmic protein factory. As a matter of comparison,
imagine a library (the genomic DNA) which contains many
books (genes) on many different topics. A reader makes a copy
(the mRNA) of a specific book which contains the specific
information on how to make a cake and takes this to his home
as he is not allowed to make cake in the library. At home, the
person can then make a cake (the protein) using all the required
ingredients and supplies as described in the copied information.
When and how much of a gene should be expressed in a
cell is directed by specific proteins (transcription factors) which
are present in the nucleus and which can interact in a stimula-
tory or inhibitory manner with regulatory sequences in the
DNA flanking the coding part of the gene. When this fine-tuned
regulation mechanism indicates that additional copies of the
gene should be expressed, an enzyme in the nucleus (RNA
polymerase) transcribes the genetic information from the DNA
template into an RNA (ribonucleic acid) copy. The structure of
RNA is similar to a single-strand DNA molecule, although
thymine (T) is replaced by uracil (U). Because the protein-cod-
ing information in the DNA is interrupted by irrelevant
sequences (called introns), the RNA must be further edited
(spliced) to remove these intron sequences and join the coding
sequences (called exons) (Fig. 2B). In some genes, a choice
between several alternative exons is being made during this
splicing process, which will result in different proteins. The
RNA molecule that results from transcription and splicing is
called messenger RNA (mRNA). This mRNA (on average
1,500 bp) is transported to the cytoplasm where it is used as a
template for the generation of a protein. Thus, the mRNA is
threaded through ribosomes as a tape is threaded through the
head of a tape player in order to decode the information and
assemble the amino acids into chains. For the decoding, each
subsequent group (called a codon) of three nucleotides on the
mRNA specifies a new amino acid. Mostly starting from a
The Human Genome: An Introduction
so-called start codon with the sequence ATG (which encodes
a methionine), each adjacent codon on the mRNA specifies the
next amino acid to be linked to the growing protein chain. After
completion of this translation process, additional modifications
are made to the protein (e.g., phosphorylation, glycosylation),
resulting in a mature and functional protein.
In summary, the properties of each protein depend on the
sequence of the amino acids used to construct it, and this
sequence in turn is determined directly by the nucleotide
sequence of the mRNA, which in turn is an (edited) copy of
the genomic DNA sequence (Fig. 3). It should be noted that,
although generally the information for making any single
protein is always encoded by a single gene, one gene may (as
a result of differential splicing) carry the information needed
to make several (usually related) proteins.
As indicated above, the DNA content is identical in each
cell or tissue type of the body; however, not all our cells are
identical in terms of structure, function, or behavior. What
makes them different is the pattern of genes which are
expressed and translated into proteins during the life cycle of
the cells. Some cell types express many genes (e.g., in brain
cells approximately 30,000 genes are expressed), while in
others a large number of the genes are transcriptionally
inactive (e.g., in red blood cells only 30 genes are expressed).
Apart from an overall switching of the expression of specific
genes from on to off (or vice versa), fine-tuning of the
expression level of specific genes might also occur. Changes
in the level of expression may be the result of a disease or
may eventually lead to a disease. Therefore, there is an enor-
mous scientific interest in studying and comparing the level
of expression of genes, i.e., gene expression in disease status
versus in healthy controls.
Analysis of mRNA samples is very useful, as these con-
tain only the transcribed sequences of the human genome
(and thus the genes). Therefore, several research groups and
biotech companies have cloned and analyzed large libraries
from mRNA sequences. For example, the mRNA extracted
from a brain sample contains copies of thousands of tran-
scribed genes which might be of interest. Technically, in
order to clone the transcribed genes, the mRNA molecules
first need to be converted into double-strand molecules. This
can be done by adding the complement nucleotides on a sec-
ond DNA strand (a process called reverse transcription),
resulting in double-stranded DNA molecules which contain
only the exon sequences of the genes (but not the intron
sequences). These are called cDNA molecules (copy DNA),
and contain the open reading frame of the gene which can
easily be converted into the amino acid sequence of the
resulting protein. In large projects, several thousands of these
Figure 3. Schematic overview of how the information comprised within the genetic code is being used to synthesize the proteins. This
involves the processes of transcription from DNA into RNA, RNA splicing to form mRNA, transport of the mRNA from the nucleus to the cyto-
plasm, translation into a chain of amino acids, and finally post-translational modifications and folding of the synthesized protein. Note that at
both the 5
and 3
ends of the coding region in the exonic sequence, an untranslated region (UTR) is also transcribed and spliced into mRNA
(respectively 5
-UTR and 3
-UTR regions).
Aerssens, Armstrong, Gilissen et al.
cDNA clones have been partially sequenced and have
revealed previously unknown fragments of expressed genes
(often called ESTs, expressed sequence tagged sites).
Comparison of databases of EST sequences might eventu-
ally also reveal new information on tissue-specific expression
of some genes. In the laboratory, the evaluation of gene
expression levels in tissue samples can now also be evaluated
simultaneously in thousands of genes, thanks to the enormous
progression in the development of microarray technology
(more popularly, DNA chip technology) during the last few
years. Today, this technology enables scientists to simultane-
ously compare the expression levels of several thousand genes
in a single experiment, on a surface smaller than a stamp. This
technology is based on the hybridization of RNA samples
(e.g., extracted from diseased and healthy tissue) on glass
slides (DNA chips) containing DNA molecules with the spe-
cific sequences of thousands of different genes. The inten-
sity of the hybridization signals, which are a measure of the
expression levels of the different genes, can be evaluated
using powerful software [2].
When the DNA sequence of a gene is identified in dif-
ferent individuals from the population, some differences in
the nucleotide sequence are often detected (Fig. 2C). The
information content of DNA can be altered dramatically by
such variations in the nucleotide sequence, especially if
these differences are located in protein-coding or regulatory
sequences. The consequence of such variations might lead
to the insertion of a different amino acid on a specific posi-
tion in the protein, or to a different level of expression of a
protein. Variations located in the intronic regions of genes or
outside the genes will usually have fewer consequences. The
different forms of a genetic variation are called the alleles
of the variation. Frequently occurring variations are often
called polymorphisms, while more rare variations (with
allele frequency below 1%) and variations with a direct rela-
tionship to a disease are often called mutations (although
these definitions are arbitrary). Genetic variations can
involve only 1 bp (called single nucleotide polymorphism,
SNP), a few bp (e.g., di- and trinucleotide repeat polymor-
phisms), up to large stretches of DNA. Roughly, the varia-
tions can be divided into substitutions, insertions, deletions,
amplifications, and translocations (Fig. 4).
The major contributors to genetic variation, comprising
some 80% of all known polymorphisms, are the single
nucleotide polymorphisms. An SNP located in the coding
region of a gene is indicated as cSNP. It has been estimated
that, on average, the DNA sequence of two unrelated individ-
uals differs in 0.1% (1 in 1,000 bp), which would in the com-
plete genome account for three million nucleotides. As a
Figure 4. Schematic summary of the various forms of variations which occur in genes. These might involve only one or a few base pairs (small
mutations) or large genomic regions (large mutations). Adapted from [9].
The Human Genome: An Introduction
comparison, the DNA sequence of a human and a chimpanzee
is estimated to differ 2% (1 in 50 base pairs).
There is major interest among scientists in studying
variations in genes, especially in the regulatory and protein-
coding sequences, because such variations might be directly
related to specific diseases or other specific characteristics
(e.g., eye or hair color). The investigation of potential rela-
tionships of variations in specific genes with a specific dis-
order might be very useful if the candidate gene(s) to be
investigated can be well chosen. Such choice of candidate
genes could be based on scientific knowledge or on new exper-
imental evidence (e.g., altered serum level of a protein in a spe-
cific patient group, microarray expression experiments, etc.).
Good candidate genes are, unfortunately, not always avail-
able. Therefore, genetic approaches have been developed in
the past based on the analysis of highly polymorphic dinu-
cleotide repeat markers (microsatellites) in DNA samples from
individuals from large families with multiple disease-affected
individuals. The strategy is based on the identification of chro-
mosomal markers cosegregating with the disease in the fami-
lies. Such linkage studies are very attractive because they
allow identification of a chromosomal region on the genetic
map which contains a disease-causing gene without requiring
any functional knowledge of the disease gene. Once a chro-
mosomal region with significant linkage is found, the disease-
causing gene needs to be cloned and the responsible
mutation(s) identified. This positional cloning strategy has
been very successful, especially for identifying genes involved
in single-gene disorders (also called simple genetic disorders,
or Mendelian inherited disorders). Indeed even for some more
common disorders such as breast cancer or Alzheimers dis-
ease, a positional cloning strategy has been successfully
applied and has led to the identification of the genes involved
(BRCA1 and BRCA2 in familial breast cancer, and presenilin
genes in early onset Alzheimers disease, respectively).
Although it is clear that genetic tests for these mutations might
be extremely useful for predicting disease risk in other mem-
bers of these families, it should be noted that defects in these
genes can explain only a small fraction (usually less than 5%)
of the whole population of patients suffering these common
disorders, consisting mainly of non-familial cases.
Unfortunately, however, the resolution that is obtained
using these family studies is rather limitedat the very best,
up to a region of about one million bp. As this is still a very
large regionand may eventually contain more than 50
genesit is key to refining this region of interest. Because of
their high frequency in the genome, the analysis of SNP mark-
ers has been proposed as a possible tool. SNP markers in the
region of interest can be analyzed in a population of affected
individuals and a population of matched healthy controls. For
each of the analyzed SNPs, the allele frequency in both pop-
ulations is then compared. A statistically significant differ-
ence in allele frequency of a genetic marker is suggestive for
an association of this marker with the disease.
Following the successes in genetic mapping and identifi-
cation of the molecular basis of Mendelian traits, attention
has rapidly shifted to more complex and more prevalent
genetic disorders that involve multiple genes and environ-
mental effects (e.g., cardiovascular disease, diabetes, and
schizophrenia). It is believed that SNPs could probably be
the best available markers in the search for the origins of
complex genetic diseases. Moreover, it has been hypothe-
sized that ultimately, if enough SNP markers would become
available with a chromosomal localization evenly dispersed
over the whole human genome, it should be feasible to
directly perform population-based whole-genome associa-
tion studies which would permit skipping of the initial step of
family-based linkage studies. As a consequence of the great
promise of SNPs, ten of the worlds pharmaceutical giants,
along with five academic partners, entered into a close col-
laboration in April 1999 called The SNP Consortium. The
major mission of this consortium is to create a high-quality,
dense, genome-wide SNP map, which will be made available
to the public. More specifically, The SNP Consortium aims
to generate genome SNP maps which would allow whole-
genome, population-based association studies. It is estimated
that this will require at least one marker every 5 to 50 kbp of
DNA. To cover the whole genome at this resolution would
require the identification and chromosomal localization of
200,000-300,000 new SNP markers. In July 2000, already
more than 800,000 SNPs were made available to the public
A powerful tool for understanding the human genome is
comparison with the genome information from other organ-
isms; this area of research is called comparative genomics
[3]. The currently available sequence technology allows deter-
mination of the complete genome sequence of organisms
within reasonable time frames. In 1995, the first entire sequence
of an organism, Haemophilus influenza (1.8 Mbp), was pub-
lished. Since then, the complete genome sequence of a con-
stantly growing list of microorganisms (bacteria and viruses)
became known (size usually between 0.5-5 Mbp). The
sequence information can be used to identify specific genes and
their structure, regulation, and function, which might poten-
tially lead to new drugs which target a specific microorganism.
Aerssens, Armstrong, Gilissen et al.
Saccharomyces cereviseae (bakers yeast) was the first
eukaryotic organism from which the entire genome sequence
(15 Mbp) was published (
Saccharomyces/). An enormous amount of information is
known about the structure, regulation, and function of yeast
genes. Of particular interest for developmental biology
research is the availability of the complete genome sequence
of the long roundworm Caenorhabditis elegans (97 Mbp);
this animal consists of 959 somatic cells, the exact lineage of
which is known for every cell (
A cross-comparison of the complete gene sets of S. cerevisiae
(6,000 genes) and C. elegans (19,000 genes) has revealed
that 23% of the proteins encoded by yeast genes have appar-
ent homologues in the nematode worm, reflecting functions
common to both organisms [4].
The fruit fly (Drosophila melanogaster) (137 Mbp) has a
long history in genetic research, especially for its ease of cor-
relation of genotype and phenotype. Because crucially impor-
tant gene functions and developmental processes appear to be
highly conserved between species, the relevance for human
disease research becomes clear. Moreover, there is also an
important conservation in the area of the cell-cycle control
genes (and DNA repair and apoptosis), with immediate
relevance to human cancer (
The mouse genome (3,000 Mbp) shows large subchromo-
somal areas with a strong conservation of linkage (synteny)
between mouse and humans. This implies that, based on the
chromosomal localization of a gene on the mouse genome,
predictions can be made on the chromosomal localization of its
human homologue. Nearly every human gene appears to
have a mouse homologue (
Homology/). Because of their small body size, the short gen-
eration time, and the technical ability to modify the DNA con-
tent of mice cells at the germline level, these animals provide
also a powerful tool for studying gene expression and function
and for creating models of human disease (eventually by
means of knock-out and/or transgenic mice) [5].
An interesting observation emerging from comparing
complete gene sets in model organisms known to date is that
gene number is not necessarily a good measure of complex-
ity. For example, the fruit fly would be considered more
anatomically complex (with 10× more cells) than the nema-
tode, and the fruit fly undergoes a more complex develop-
mental process than C. elegans. Yet the fruit fly genome
contains only 13,000 genes, compared with the 19,000 genes
found in the nematode genome. It is generally expected that
comparative genomic assessments will become increas-
ingly important because they allow expansion of the utility
of the genomic information known and documented (anno-
tated) in one species toward other species, including
The two copies of a specific gene inherited from the
father and the mother are not always identical, because for
most genes many variants (alleles) exist. The combination of
the two alleles present on the two homologous chromosomes
of the DNA is defined as the genotype for a specific genetic
variation. For example, imagine an SNP in a gene with two
possible alternative alleles: allele A and allele G. The possible
genotypes are thus A/A, A/G, and G/G. When two different
alleles of a gene are identified on the two homologous chro-
mosomes of an individual, the genotype is called heterozy-
gous (e.g., A/G); if the same allele is present on both
homologous chromosomes, the genotype is homozygous
(e.g., A/A and G/G). More generally, the genotype of an indi-
vidual can be defined as the complete composition of an
individuals genome (including all the information on the
variations within his/her genome), as has been defined at
conception. When used in clinical genetic applications,
however, the term genotype usually refers to some specific
variation(s) in a small part of the DNA, often named for the
gene involved. For example, the APOE gene in the DNA can
exist as allele e2, e3, or e4, the latter of which is associated
with an increased risk of developing Alzheimers disease. The
age of onset of the disease is lower in individuals with a geno-
type harboring one or more copies of the e4 allele, namely the
e2/e4 or e3/e4, and especially the e4/e4 genotype.
Opposite to the genotype is the phenotype, which can
be defined as the combination of all the observable or mea-
surable characteristics of an individual (e.g., eye color, hair
style, body height, affected by disease, etc.). The phenotype
isat least partiallydetermined by the genotype, because it
depends on the level at which specific genes can be expressed.
The latter depends on the variations in the DNA sequence but
also on the environmental influences (e.g., nutrition status).
Very importantly, it should be pointed out that although
the phenotype may appear to be equal in two individuals, their
genotypes might be different. This could be due to a significant
environmental influence which overrules the genetic impreg-
nation, or alternatively because some of the possible genotypes
do not result in phenotypic differences (e.g., genotypes A/A
and A/G can both have straight hair, while only genotype G/G
shows curly hairFig. 1). In a molecular diagnostic setting,
the determination of the genotype usually aims to predict the
phenotype. Indeed the genotype might sometimes be fully pre-
dictable (especially in single-gene Mendelian inherited disor-
ders, such as cystic fibrosis). Unfortunately, most often the
analyzed genotype is not fully predictable for the phenotype
but merely allows assignment of a certain risk level to an indi-
vidual for expressing or developing a specific phenotype. For
example, susceptibility-conferring genotypes at the BRCA1
and BRCA2 gene loci confer a relative risk of breast cancer of
The Human Genome: An Introduction
about 5. Consequently, it is strongly advised that all results of
genetic testing are accompanied by an interpretation for each
molecular genetic diagnostic report to be used in the clinic.
As mentioned above, familial cases of some specific can-
cer types are known, indicative of an inherited trait similar to
any other genetic disorder. For some of these cancer types,
successful positional cloning projects have allowed identifi-
cation of genes harboring mutations which cause the disease.
As these mutations are inherited by the next generation, it
implies that the responsible gene defect is present in the
germline cells. At present, more than 20 different hereditary
cancer syndromes have been defined and attributed to specific
germline mutations. Collectively, these syndromes affect
approximately 1% of all cancer patients [6]. For several of the
inherited cancer syndromes, genetic testing for disease suscep-
tibility is feasible and already part of the clinical management
of affected families. Controversy on its value has been raised,
however, especially in cases where the risks of developing
cancer associated with a predisposing mutation are less cer-
tain, or where there is no effective intervention to offer those
with a positive result [7].
Most cancer patients do not have any pronounced fam-
ily history, yet genomic defects are at the basis of the dis-
ease. The genetic information present in normal cells can
also be altered (e.g., due to incorrect DNA duplication dur-
ing cell division), either by gross chromosomal changes
such as translocations, deletions, inversions, and amplifica-
tions, or through more subtle changes such as point muta-
tions and microdeletions [8]. The accumulation of these
genetic alterations can finally lead to the expression of the
full cancer phenotype. It should be noticed thatapart from
the familial casesthese changes in the DNA do occur in
the somatic cells and do not transfer to the germline cells.
Consequently, these abnormalities are not inherited by the
children of these patients.
Historically, chromosomal abnormalities in tumors were
first recognized when an unusually small chromosome, the
Philadelphia chromosome, was observed in white blood
cells as a hallmark of chronic myeloid leukemia. The signifi-
cance of these chromosomal abnormalities has only relatively
recently become clear by a combination of improved cytoge-
netics and molecular biology. The central concept is that of
proto-oncogenes and tumor suppressor genes: normal cellular
genes controlling growth, development, differentiation, DNA
repair, and DNA modification become deregulated in the neo-
plastic cancer cell due to mutations, fusions, or deletions. The
normal structure of a resident proto-oncogene may be con-
verted to a dominant oncogene by mutations or chromosomal
rearrangements. Such conversion in one copy of the gene (one
chromosome homologue) is sufficient to result in neoplastic
transformation. On the contrary, loss or inactivation of tumor
suppressor genes may release a cell from constraints imposed
by these genes, resulting in uncontrolled growth. Their behav-
ior is recessive, and both allele copies must be lost for tumor
activation to occur. Therefore, recurrent deletions of chromo-
somal material are recognized as indications for the presence
of tumor-suppressor genes. On the other hand, recurring spe-
cific chromosomal aberrations (translocations, amplifications,
and inversions) have been instrumental in identifying proto-
oncogenes. The cloning of the chromosomal breakpoints of
such aberrations has proven to be an effective strategy for
identifying mutant genes in tumors (e.g., ETV6 in leukemia,
c-MYC in Burkitts lymphoma). At present, more than 50
chromosomal translocation breakpoints have been molecu-
larly cloned and the involved genes identified. The vast
majority of these tumors were of hematopoietic origin, as
cytogenetic data on these tumors are easier to obtain and are
thus more extensively studied. It is clear that cytogenetic
analysis might allow subtyping of patients with an apparently
similar phenotype. Depending on the chromosomal abnor-
malities (and thus the genes involved), the efficacy of therapy
can be predicted to a certain extent. Therefore, cytogenetic
analysis has now become a routine analysis in many centers
CancerCyto.htmlx). Finally, the colorful fluorescence in situ
hybridization (FISH) technique allows direct identification
of the breakpoint region involved in a chromosomal abnor-
mality at a relative high resolution (10-100 kbp). Specific
probes to be used by FISH which recognize recurrent abnor-
malities of specific regions of the genome are now commer-
cially available for routine analysis of cancer cells derived
from oncology patients.
A nice overview on the currently known genetics and
genomics behind cancers is provided in a subsection of the
web site of the National Center for Biotechnology
Information (NCBI), which specifically deals with this
topic (
Ongoing research in oncology is further directed toward a
more complete and basic understanding of why some
somatic cells at a certain point in time become tumor cells.
In this respect, the National Cancer Institute coordinates the
Cancer Genome Anatomy Project (CGAP), which provides
a valuable resource of information and technological tools
required to analyze the molecular anatomy of the cancer
cell (
Originally, molecular genetics was used in medicine only
to identify gene defects in major single-gene disorders such as
Aerssens, Armstrong, Gilissen et al.
cystic fibrosis. The excitement in the field has shifted gradu-
ally toward more common and complex genetic disorders. It is
therefore not surprising that the amount of genetic information
on an ever-increasing number of diseases has exploded over
the past few years. The Online Mendelian Inheritance In Man
(OMIM) is a database of bibliographic information about
human genes and genetic disorders and is freely available
online (http://www.ncbi.nlm.nih.giv/omim/). With more than
10,000 entries (newly defined for each distinct disease gene or
genetic disorder for which sufficient information exists), it
provides probably the most comprehensive, authoritative, and
timely compendium of information in human genetics.
Clinicians can use OMIM as an aid in differential diagnosis by
searching the database using key clinical features of a patient.
An important question for the clinician is which impact on
future medical practice might be expected from ongoing
research activities in genomics and genetics. The main contri-
bution to date has been in the identification of new molecular
targets for drug action, which might in the long term result in
new and better drug therapies. It is expected, however, that
clinical practice will also be increasingly affected by new diag-
nostic tests based on genetic markers associated with increased
disease risk, therapeutic efficacy, or adverse events. In this
respect, the research area designated as pharmacogenomics is
expected to become a driving force toward a more rational use
of pharmaceutical products. Expensive therapies might possi-
bly no longer be authorized without a definite diagnosis based
on a genetic test. Validated genetic tests enabling prediction of
increased risk on disease development might eventually lead to
a shift from curative toward predictive treatment, long before
clinical symptoms of the disease can be observed.
In conclusion, it might be expected that genomics and
genetics will largely impact future medical practice. Several
genomics-based applications are on their way to enter the
clinic within the next few years, and many more will most
probably follow in a later stage. For clinicians of the 21st
century, it will be key to be well prepared and open-minded
for this molecular future of medicine.
1 Macilwain C. World leaders heap praise on human genome
landmark. Nature 2000;405:983-984.
2 Lockhart DJ, Winzeler EA. Genomics, gene expression and
DNA arrays. Nature 2000;405:827-836.
3 Bentley DR. Decoding the human genome sequence. Hum
Mol Genet 2000;9:2353-2358.
4 Rubin GM, Yandell MD, Wortman JR et al. Comparative
genomics of the eukaryotes. Science 2000;87:2204-2215.
5OBrien S, Menotti-Raymond M, Murphy WJ et al. The
promise of comparative genomics in mammals. Science
6 Fearon ER. Human cancer syndromes: clues to the origin and
nature of cancer. Science 1997;278:1043-1050.
7 Ponder B. Genetic testing for cancer risk. Science
8 Lengauer C, Kinzler KW, Vogelstein B. Genetic instabilities
in human cancers. Nature 1998;396:643-649.
9 Varmus H, Weinberg RA. Genes and the biology of cancer.
New York: Scientific American Library, 1993:10-14.
Brown PO, Hartwell L. Genomics and human diseasevaria-
tions on variation. Nat Genet 1998;18:91-93.
Collins FS, Guyer MS, Chakravarti A. Variations on a theme:
cataloging human DNA sequence variation. Science
Collins FS. Medical and societal consequences of the human
genome project. N Engl J Med 1999;341:28-37.
Hamosh A, Scott AE, Amberger J et al. Online Mendelian
Inheritance in Man (OMIM). Hum Mutat 2000;15:57-61.
Holtzman NA, Marteau TM. Will genetics revolutionize medi-
cine? N Engl J Med 2000;343:141-144.
Lander ES, Schork NJ. Genetic dissection of complex traits.
Science 1994;265:2037-2048.
Poste G. Molecular medicine and information-based targeted
healthcare. Nat Biotech 1998;16(suppl 1):19-21.
Roses AD. Pharmacogenetics and future drug development and
delivery. The Lancet 2000;355:1358-1361.
Schafer AJ, Hawkins JR. DNA variation and the future of human
genetics. Nat Biotech 1998;16:33-39.
Strachan T, Read AP. Human Molecular Genetics, 2nd Ed. Oxford:
Bios Scientific Publishers Ltd., 1999:1-53, 139-168, 295-314,
351-375, 427-444.
Wolf CR, Smith G, Smith RL. Pharmacogenetics. BMJ
... Assume that k 2: 1 and that C1 and C2 do not induce [1,1] In other words, a row with only two 2's is never resolved in a way which would introduce a conflict which is not introduced by another row with more than two 2's. This leads us to the following definition of a special type of hypergraph coverings. ...
... Assume that k 2: 1 and that C1 and C2 do not induce [1,1] In other words, a row with only two 2's is never resolved in a way which would introduce a conflict which is not introduced by another row with more than two 2's. This leads us to the following definition of a special type of hypergraph coverings. ...
... Cl, C2 induce all three pairs[1,1]' [0,1] and[1,0] in A, and there exists an assignment I B such that(3) for' every x rCjcz E XA, IB(X rCjcz ) = 1 if Cl, C2 induce both [0,1] and[1,0] in A; andWhen considering the GTNH problem, the above rules (except for (4)) do not apply since violating rules(1 )-(3) just creates a conflict between corresponding columns / characters in B. Based on the above characterization by Bafna [8], we have the following lemma about the relationship between the conflict graph of B and the assignment IB. Lemma 15. ...
... RNA help carry out this blueprint during organism life. Usually DNA consists of a long linear chains of subunits called nucleotides [3] that is a long but finite string over the 4 letter "nucleotide" alphabet {A; C; T; G}(e.g. of DNA sequence ….ATCGCTGA ...). The Central Dogma of molecular biology is shown in figure 1, and it was first proposed by Francis Crick [2]. ...
Full-text available
Deep learning (DL) methods have been achieving amazing results in solving a variety of problems in many different fields especially in the area of big data. With the advances of the big data era in bioinformatics, applying DL techniques, the DNA sequences can be classified with accurate and scalable prediction. The strength of DL methods come from the development of software and hardware, such as processing abilities graphical processing units (GPU) for the hardware and new learning or inference algorithms for the software, which reducing the main primary difficulties that faced the training process. In This work, we start from the previous classification methods such as alignment methods pointing out the problems, which are face to use these methods. After that, we demonstrate deep learning, from artificial neural networks to hyperparameter tuning, and the most recent state-of-the-art DL architectures used in DNA classification. After that, the paper ended with limitations and suggestions.
... Deoxyribonucleic Acid (DNA) is the most important chemical compounds in living cells, bacteria, and some viruses [1,2]. It is composed of a long sequence of nucleotides that is long but finite string over the four character {A, C, T, G} called nucleotides as shown in figure1. ...
Full-text available
Abstract—the analyzing hidden features of DNA sequence is main challenge in bioinformatics. Since learning from DNA sequences based on analytical approaches to identify the hidden patterns provides a vital role in various genomic applications. In classification tasks, Re-current neural network with Bidirectional Long Short-Term Memory (BLSTM) is usually used for sequential data that is strongly dependent on the feature's extraction stage. Recently, digital signal processing (DSP) techniques such as spectral transformations has been used in genomic data for extracting the hidden features and periodicities within the DNA fragments. The objective of this paper is comparing different spectral transformations of DNA sequences such as Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT) and Discrete Wavelet Transform (DWT) based on BLSTM to achieve high performance in the taxonomic classification of bacteria. The reason for applying these transformations emerges from its wide and effective use for extracting features, decorrelation, ordering and dimensionality reduction purposes in the fields of speech and image processing. Evaluation metrics such as F1 score and accuracy show that DWT features give surpassing performance compared with other features.
... Deoxyribonucleic Acid (DNA) is the most important chemical compounds in living cells, bacteria, and some viruses [1,2]. It is composed of a long sequence of nucleotides that is long but finite string over the four character {A, C, T, G} called nucleotides as shown in figure1. ...
Conference Paper
Full-text available
the analyzing hidden features of DNA sequence is main challenge in bioinformatics. Since learning from DNA sequences based on analytical approaches to identify the hidden patterns provides a vital role in various genomic applications. In classification tasks, Recurrent neural network with Bidirectional Long Short-Term Memory (BLSTM) is usually used for sequential data that is strongly dependent on the feature's extraction stage. Recently, digital signal processing (DSP) techniques such as spectral transformations has been used in genomic data for extracting the hidden features and periodicities within the DNA fragments. The objective of this paper is comparing different spectral transformations of DNA sequences such as Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT) and Discrete Wavelet Transform (DWT) based on BLSTM to achieve high performance in the taxonomic classification of bacteria. The reason for applying these transformations emerges from its wide and effective use for extracting features, decorrelation, ordering and dimensionality reduction purposes in the fields of speech and image processing. Evaluation metrics such as F1 score and accuracy show that DWT features give surpassing performance compared with other features.
... The genome is the hereditary information of all organisms, which is recorded in the DNA or in some viruses in the RNA. The DNA can be represented by a sequence of nucleotides that is long but finite over the 4 letter "nucleotide" alphabet {A, C, T, G}. [1][2][3] Recently, an increasing amount of genomic data has become available. Analyzing information contained in the growing DNA sequence database of the human and all other organisms has been considered as the most dominant and the common target in biology. ...
Full-text available
For high accuracy classification of DNA sequences through Convolutional Neural Networks (CNNs), it is essential to use an efficient sequence representation that can accelerate similarity comparison between DNA sequences. In addition, CNN networks can be improved by avoiding the dimensionality problem associated with multi-layer CNN features. This paper presents a new approach for classification of bacterial DNA sequences based on a custom layer. A CNN is used with Frequency Chaos Game Representation (FCGR) of DNA. The FCGR is adopted as a sequence representation method with a suitable choice of the frequency k-lengthen words occurrence in DNA sequences. The DNA sequence is mapped using FCGR that produces an image of a gene sequence. This sequence displays both local and global patterns. A pre-trained CNN is built for image classification. First, the image is converted to feature maps through convolutional layers. This is sometimes followed by a down-sampling operation that reduces the spatial size of the feature map and removes redundant spatial information using the pooling layers. The Random Projection (RP) with an activation function, which carries data with a decent variety with some randomness, is suggested instead of the pooling layers. The feature reduction is achieved while keeping the high accuracy for classifying bacteria into taxonomic levels. The simulation results show that the proposed CNN based on RP has a trade-off between accuracy score and processing time.
... In the following we shall consider three levels: genes, cells and blood. The human genoma length in a single cell is about 2 m. 12 Taking into account that the total number of cells in an adult organism is ca. 6 × 10 13 , 13 the total length of the human genetic material is around 1 2 × 10 14 m. ...
The architecture of a surveillance system, able to make up for the poor ability of the immune system in recognizing endogeneous diseases (like cancer, arteriosclerotic vascular disease, Alzheimer disease, etc.), is designed. Based on the use of swarms of nanobots communicating with a resident central unit, this architecture can actually be implemented, though at a rudimentary level, with current technologies based on the functionalization of integrated circuits.
In order to understand the molecular biology of cancer it is necessary to first obtain an understanding of normal cell biology. This chapter discusses the key features of a malignant clone: immortality, independence, mobility, subversion, instability, and angiogenesis induction. It briefly recaps the features of normal controls on cell growth, maturation, division and death. Apoptosis is the mechanism by which the body eliminates cells which have acquired defects in their genome, which is a vital part of the defences against cancer. The malignant transformation requires a single cell to acquire a complex of genetic changes, which allow that cell to subvert normal controls regulating cell growth, division and death. Translocations may lead to production of a novel fusion gene not seen in nature, or lead to a functional gene coming under the influence of an inappropriate control gene. There are basically three classes of cancer genes: oncogenes, tumour suppressor genes, and DNA repair genes.
Interventional cardiology took off in 1977, with the development of percutaneous coronary balloon angioplasty (1). Despite immediate success in regaining vessel patency, long-term results were undermined by luminal loss secondary to the vessel injury induced by the balloon (2,3). A decade later a new technique involving stent deployment had been designed to overcome vessel recoil (4), initially, considered the major contributor to the loss of lumen diameter. Rather than ridding interventional cardiology of restenosis, a purely iatrogenic process, stents have shifted the focus of attention toward the phenomenon of intimal hyperplasia (5).
Congenital heart disease remains a significant cause of morbidity and mortality. In recent years, significant advances in molecular genetics, improved understanding of morphogenesis, recognition of specific patterning of abnormalities within and between species, and the impact of the Human Genome Project have accounted for these advances. Continued rapid developments in genomics and proteomics are anticipated. Epidemiologic investigations continue to be necessary to assess the influence of the environment on genetics. We are on the threshold of influencing the occurrence of congenital heart diseases.
Full-text available
The use of DNA variants in the mapping of the human genome and in the positional cloning of monogenic disease genes is well established. Determining the genetic bases of the more common "multifactorial" diseases, however, presents a major challenge. The genetics of these diseases are complicated by the interplay between many genes and the environment. These investigations will require large numbers of DNA markers and the technology to screen large populations with these markers. The systematic identification of the common DNA polymorphisms in the human genome coupled with the development of high throughput screening methods should allow ultimately the elucidation of the genetic component of most clinical and nonclinical phenotypes.
A comparative analysis of the genomes ofDrosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae—and the proteins they are predicted to encode—was undertaken in the context of cellular, developmental, and evolutionary processes. The nonredundant protein sets of flies and worms are similar in size and are only twice that of yeast, but different gene families are expanded in each genome, and the multidomain proteins and signaling pathways of the fly and worm are far more complex than those of yeast. The fly has orthologs to 177 of the 289 human disease genes examined and provides the foundation for rapid analysis of some of the basic processes involved in human disease.
Online Mendelian Inheritance In Man (OMIM) is a public database of bibliographic information about human genes and genetic disorders. Begun by Dr. Victor McKusick as the authoritative reference Mendelian Inheritance in Man, it is now distributed electronically by the National Center for Biotechnology Information (NCBI). Material in OMIM is derived from the biomedical literature and is written by Dr. McKusick and his colleagues at Johns Hopkins University and elsewhere. Each OMIM entry has a full text summary of a genetic phenotype and/or gene and has copious links to other genetic resources such as DNA and protein sequence, PubMed references, mutation databases, approved gene nomenclature, and more. In addition, NCBI's neighboring feature allows users to identify related articles from PubMed selected on the basis of key words in the OMIM entry. Through its many features, OMIM is increasingly becoming a major gateway for clinicians, students, and basic researchers to the ever-growing literature and resources of human genetics.
Medical genetics was revolutionized during the 1980s by the application of genetic mapping to locate the genes responsible for simple Mendelian diseases. Most diseases and traits, however, do not follow simple inheritance patterns. Genetics have thus begun taking up the even greater challenge of the genetic dissection of complex traits. Four major approaches have been developed: linkage analysis, allele-sharing methods, association studies, and polygenic analysis of experimental crosses. This article synthesizes the current state of the genetic dissection of complex traits--describing the methods, limitations, and recent applications to biological problems.
Genetic testing for cancer susceptibility is already part of the clinical management of families with some of the well-defined (but uncommon) inherited cancer syndromes. In cases where the risks associated with a predisposing mutation are less certain, or where there is no clearly effective intervention to offer those with a positive result, its use is more controversial. Careful evaluation of costs and benefits, and of the efficacy of interventions in those found to be at risk, is essential and is only just beginning. An immediate challenge is to ensure that both health professionals and the public understand clearly the issues involved.
More than 20 different hereditary cancer syndromes have now been defined and attributed to specific germline mutations in various inherited cancer genes. Collectively, the syndromes affect about 1 percent of cancer patients. An individual who carries a mutant allele of an inherited cancer gene has a variable risk of cancer that is influenced by the particular mutation, other cellular genes, and dietary, lifestyle, and environmental factors. Though hereditary cancer syndromes are rare, their study has provided powerful insights into more common forms of cancer. Somatic mutations in sporadic cancers frequently alter the inherited cancer genes, and the functions of cell signaling pathways have been illuminated by study of the affected genes. Further investigation of inherited mutations that affect susceptibility to cancer will aid efforts to effectively prevent, detect, and treat the disease.