Bioinformatics: Science, medicine, and the future

Clinical review
Science, medicine, and the future
Bioinformatics
Ardeshir Bayat
An unprecedented wealth of biological data has been generated by the human genome project and sequencing projects in other organisms. The huge demand for analysis and interpretation of these data is being managed by the evolving science of bioinformatics. Bioinformatics is defined as the application of tools of computation and analysis to the capture and interpretation of biological data. It is an interdisciplinary field, which harnesses computer science, mathematics, physics, and biology (fig 1). Bioinformatics is essential for management of data in modern biology and medicine. This paper describes the main tools of the bioinformatician and discusses how they are being used to interpret biological data and to further understanding of disease. The potential clinical applications of these data in drug discovery and development are also discussed.
Methods
This article is based on personal experience in bioinformatics and on selected articles in recent issues of Nature Genetics, Nature Reviews Genetics, Nature Medicine, and Science. Key terms including bioinformatics, comparative and functional genomics, proteomics, microarray, disease, and medicine were used to search for relevant articles in the peer reviewed scientific literature.
Bioinformatics and its impact on genomics
Last year it was announced that the entire human genome had been mapped as a result of the efforts of the worldwide human genome project and a private genomic company.1,2 However, in recent years, the scientific world has also witnessed the completion of whole genome sequences of many other organisms. The analysis of the emerging genomic sequence data and the human genome project is a landmark achievement for bioinformatics.
A novel strategy for random sequencing of the whole genome (the so called “shot gun” technique) was used to sequence the genome of Haemophilus influenzae in 1995.3 This was the very first complete genome of any free living organism to be sequenced. Other bacterial genomes, such as those of Mycoplasma genitalium and Mycobacterium tuberculosis, were sequenced soon after,4,5 and the sequence of the plague bacterium Yersinia pestis was recently completed.6 The sequence and annotation of the first eukaryotic genome, that of Saccharomyces cerevisiae (a yeast),7 was followed by those of other eukaryotic species such as Caenorhabditis elegans (a worm),8 Drosophila melanogaster (fruit fly),9 and Arabidopsis thaliana (mustard weed)10 (see fig A on bmj.com).
Sequencing of several other species, including zebrafish, pufferfish, mouse, rat, and non-human primates, is either under way or nearing completion by both private and public sequencing initiatives.11 The knowledge obtained from these sequence data will have considerable implications for our understanding of biology and medicine. As a result of comparative genomic and proteomic research, we will soon be able to not only locate each human gene but also fully understand its function.12

An additional figure appears on bmj.com

Fig 1 Interaction of disciplines (biology, medicine, maths/physics, and computer science) that have contributed to the formation of bioinformatics

Summary points
Bioinformatics is the application of tools of computation and analysis to the capture and interpretation of biological data
Bioinformatics is essential for management of data in modern biology and medicine
The bioinformatics toolbox includes computer software programs such as BLAST and Ensembl, which depend on the availability of the internet
Analysis of genome sequence data, particularly the analysis of the human genome project, is one of the main achievements of bioinformatics to date
Prospects in the field of bioinformatics include its future contribution to functional understanding of the human genome, leading to enhanced discovery of drug targets and individualised therapy

Bioinformatics, a new interdisciplinary science, is essential to managing, understanding, and harnessing clinical benefit from new genetic data

Centre for Integrated Genomic Medical Research, University of Manchester, Manchester M13 9PT
Ardeshir Bayat, MRC fellow
Correspondence to: ardeshir.bayat@man.ac.uk

BMJ 2002;324:1018–22
BMJ VOLUME 324 27 APRIL 2002 bmj.com
Bioinformatic tools
The main tools of a bioinformatician are computer software programs and the internet. A fundamental activity is sequence analysis of DNA and proteins using various programs and databases available on the world wide web. Anyone, from clinicians to molecular biologists, with access to the internet and relevant websites can now freely discover the composition of biological molecules such as nucleic acids and proteins by using basic bioinformatic tools. This does not imply that handling and analysis of raw genomic data can easily be carried out by all. Bioinformatics is an evolving discipline, and expert bioinformaticians now use complex software programs for retrieving, sorting out, analysing, predicting, and storing DNA and protein sequence data.
Large commercial enterprises such as pharmaceutical companies employ bioinformaticians to meet the large scale and complicated bioinformatic needs of these industries. With an ever-increasing need for constant input from bioinformatic experts, most biomedical laboratories may soon have their own in-house bioinformatician. The individual researcher, beyond a basic acquisition and analysis of simple data, would certainly need external bioinformatic advice for any complex analysis.
The growth of bioinformatics has been a global venture, creating computer networks that have allowed easy access to biological data and enabled the development of software programs for effortless analysis. Multiple international projects aimed at providing gene and protein databases are available freely to the whole scientific community via the internet.
Bioinformatic analysis
The escalating amount of data from the genome projects has necessitated computer databases that feature rapid assimilation, usable formats and algorithm software programs for efficient management of biological data.13 Because of the diverse nature of emerging data, no single comprehensive database exists for accessing all this information. However, a growing number of databases that contain helpful information for clinicians and researchers are available. Information provided by most of these databases is free of charge to academics, although some sites require subscription and industrial users pay a licence fee for particular sites. Examples range from sites providing comprehensive descriptions of clinical disorders, listing disease susceptibility genetic mutations and polymorphisms, to those enabling a search for disease genes given a DNA sequence (box).
Useful bioinformatic websites (available freely on the internet)
National Center for Biotechnology Information (www.ncbi.nlm.nih.gov): maintains bioinformatic tools and databases
National Center for Genome Resources (www.ncgr.org/): links scientists to bioinformatics solutions by collaborations, data, and software development
Genbank (www.ncbi.nlm.nih.gov/Genbank): stores and archives DNA sequences from both large scale genome projects and individual laboratories
Unigene (www.ncbi.nlm.nih.gov/UniGene): gene sequence collection containing data on map location of genes in chromosomes
European Bioinformatic Institute (www.ebi.ac.uk): centre for research and services in bioinformatics; manages databases of biological data
Ensembl (www.ensembl.org): automatic annotation database on genomes
BioInform (www.bioinform.com): global bioinformatics news service
SWISS-PROT (www.expasy.org/sprot/): important protein database with sequence data from all organisms, which has a high level of annotation (includes function, structure, and variations) and is minimally redundant (few duplicate copies)
International Society for Computational Biology (www.iscb.org/): aims to advance scientific understanding of living systems through computation; has useful bioinformatic links

Fig 2 Ensembl website: a genomic data search facility freely available on the internet. Ensembl is a joint project between the European Bioinformatic Institute and the Sanger Centre, which is capable of automatically tracking the sequenced pieces of the human genome and assembling and analysing them to identify genes and other features of interest to biomedical researchers

These databases include both “public” repositories of gene data and those developed by private companies. The easiest way to identify databases is by searching for bioinformatic tools and databases in any one of the commonly used search engines. Another way to identify bioinformatic sources is through database links and searchable indexes provided by one of the major public databases. For example, the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) provides the Entrez browser, which is an integrated database retrieval system that allows integration of DNA and protein sequence databases. The European Bioinformatic Institute archives gene and protein data from genome studies of all organisms, whereas Ensembl produces and maintains automatic annotation on eukaryotic genomes (fig 2). The quality and reliability of databases vary; certainly some of the better known and more established ones, such as those above, are superior to others.
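As an illustration of retrieving a sequence record from the National Center for Biotechnology Information programmatically, the sketch below only builds a request address and does not contact the server. It assumes the Entrez programming utilities (E-utilities) `efetch` endpoint, which this article does not describe; the helper name `efetch_url` and the example accession are illustrative choices, not part of the original text.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities "efetch" endpoint (not described in the article).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nucleotide", rettype="fasta", retmode="text"):
    """Build (but do not send) a request URL for one sequence record."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


# Illustrative accession only; any valid nucleotide accession could be used.
url = efetch_url("NM_000546")
print(url)
```

Keeping URL construction separate from the network call makes the request easy to inspect before any data are downloaded.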
One of the simplest and better known search tools is called BLAST (basic local alignment search tool, at www.ncbi.nlm.nih.gov/BLAST/). This algorithm software is capable of searching databases for genes with similar nucleotide structure (fig 3) and allows comparison of an unknown DNA or amino acid sequence with hundreds or thousands of sequences from human or other organisms until a match is found. Databases of known sequences are thus used to identify similar sequences, which may be homologues of the query sequence. Homology implies that sequences may be related by divergence from a common ancestor or share common functional aspects. When a database is searched with a newly determined sequence (the query sequence), local alignment occurs between the query sequence and any similar sequence in the database. The result of the search is sorted in order of priority on the basis of maximum similarity. The sequence with the highest score in the database of known genes is the most likely homologue. If homologues or related molecules exist for a query sequence, then a newly discovered protein may be modelled and the gene product may be predicted without the need for further laboratory experiments.
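The idea of scoring local alignments and ranking database sequences by similarity can be sketched as follows. This is a minimal illustration using exact Smith-Waterman dynamic programming, not the faster heuristic that BLAST itself uses; the scoring values, function names, and the three toy database sequences are assumptions made for the example.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences."""
    prev = [0] * (len(subject) + 1)  # previous row of the scoring matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always zero
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_hits(query, database):
    """Score the query against every database entry, highest score first."""
    scored = [(name, smith_waterman_score(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy database; real searches run against millions of sequences.
database = {
    "seq_a": "TTTTACGTACGTTTTT",
    "seq_b": "GGGGGGGGGGGG",
    "seq_c": "CCACGTACCC",
}
hits = rank_hits("ACGTACGT", database)
for name, score in hits:
    print(name, score)
```

The top ranked entry is the candidate homologue; in practice a score is also judged against what would be expected by chance before homology is inferred.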
Functional genomics
Since the completion of the first draft of the human genome,1,2 the emphasis has been changing from genes themselves to gene products. Functional genomics assigns functional relevance to genomic information. It is the study of genes, their resulting proteins, and the role played by the proteins.
Analysis and interpretation of biological data considers information not only at the level of the genome but at the level of the proteome and the transcriptome (fig 4). Proteomics is the analysis of the total amount of proteins (proteome) expressed by a cell, and transcriptomics refers to the analysis of the messenger RNA transcripts produced by a cell (transcriptome). DNA microarray technology determines the expression level of genes and includes genotyping and DNA sequencing. Gene expression arrays allow simultaneous analysis of the messenger RNA expression levels of thousands of genes in benign and malignant tumours, such as keloid and melanoma. Expression profiles classify tumours and provide potential therapeutic targets.14
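A simple way to compare expression levels between two samples is the log2 fold change for each gene. The sketch below is illustrative only: the gene names and intensity values are invented, and the pseudocount of 1 is an assumed convention to avoid dividing by zero, not a method taken from this article.

```python
import math


def log2_fold_change(tumour, normal, pseudocount=1.0):
    """Log2 ratio of tumour to normal expression for one gene."""
    return math.log2((tumour + pseudocount) / (normal + pseudocount))


def rank_by_fold_change(tumour_profile, normal_profile):
    """Rank genes measured in both samples by absolute log2 fold change."""
    changes = {
        gene: log2_fold_change(tumour_profile[gene], normal_profile[gene])
        for gene in tumour_profile
        if gene in normal_profile
    }
    return sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical array intensities (arbitrary units) for three made-up genes.
tumour = {"GENE_A": 799.0, "GENE_B": 99.0, "GENE_C": 49.0}
normal = {"GENE_A": 99.0, "GENE_B": 99.0, "GENE_C": 199.0}

ranked = rank_by_fold_change(tumour, normal)
for gene, change in ranked:
    print(f"{gene}: {change:+.1f}")
```

Genes at the top of such a ranking are candidates for follow up; real analyses would also normalise the arrays and test for statistical significance across replicates.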
Bioinformatic protein research draws on annotated protein and two dimensional electrophoresis databases. After separation, identification, and characterisation of a protein, the next challenge in bioinformatics is the prediction of its structure. Structural biologists also use bioinformatics to handle the vast and complex data from x ray crystallography, nuclear magnetic resonance, and electron microscopy investigations to create three dimensional models of molecules.15

Fig 3 Web page illustrating freely available BLAST services run by the National Center for Biotechnology Information. BLAST (basic local alignment search tool) is a set of similarity search programs designed to explore all of the available DNA sequence databases

[Figure 4 diagram: DNA (genomics) makes RNA (transcriptomics), which makes protein (proteomics), with complexity increasing at each level; images I-IV illustrate the levels]
Fig 4 Schematic diagram representing complexity of genomic data processing. Analysis and interpretation of biological data considers information at every level from the genome (total genetic content) to the proteome (total protein content) and transcriptome (total messenger RNA content) of the cell. The images numbered I-IV to the right of the diagram represent relevant examples of DNA (image I is base pair nucleotides); RNA (image II is a microarray showing levels of gene expression); and protein (image III is a structure of a single protein; image IV is a two dimensional gel electrophoresis showing separation of all proteins of a cell; each spot corresponds to a different protein chain)
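The flow of information from genome to transcriptome to proteome (DNA makes RNA makes protein) can be sketched in a few lines. This is a simplified illustration that assumes the input is a coding strand read in frame from its first base and uses the standard genetic code; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """DNA coding strand to messenger RNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Messenger RNA to a one letter protein sequence, stopping at a stop codon."""
    dna = mrna.upper().replace("U", "T")  # the lookup table is keyed on DNA letters
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical twelve base open reading frame.
mrna = transcribe("ATGGCCAAGTAA")
protein = translate(mrna)
print(mrna, protein)
```

Real gene prediction is far harder than this: it has to find the reading frame, handle introns, and cope with sequencing errors.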
Other applications of bioinformatics
Apart from analysis of genome sequence data, bioinformatics is now being used for a vast array of other important tasks, including analysis of gene variation and expression, analysis and prediction of gene and protein structure and function, prediction and detection of gene regulation networks, simulation environments for whole cell modelling, complex modelling of gene regulatory dynamics and networks, and presentation and analysis of molecular pathways in order to understand gene-disease interactions.16 Although on a smaller scale, simpler bioinformatic tasks valuable to the clinical researcher can vary from designing primers (short oligonucleotide sequences needed for DNA amplification in polymerase chain reaction experiments) to predicting the function of gene products.
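Two quantities commonly checked when designing a primer are its GC content and an estimate of its melting temperature. The sketch below uses the Wallace rule (2 °C for each A or T, 4 °C for each G or C), a rough approximation intended for short oligonucleotides; the rule, the function names, and the example primer are assumptions for illustration and are not taken from this article.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(sequence):
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


# Hypothetical 18 base primer.
primer = "AGCTTGATATTATACGCG"
print(f"GC content: {gc_content(primer):.2f}")
print(f"Estimated Tm: {wallace_tm(primer)} C")
print(f"Reverse complement: {reverse_complement(primer)}")
```

Dedicated primer design software also checks for hairpins, primer dimers, and specificity against the rest of the genome, none of which this sketch attempts.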
Clinical application of bioinformatics
The clinical applications of bioinformatics can be viewed in the immediate, short, and long term. The human genome project plans to complete the human sequence by 2003, producing a database of all the variations in sequence that distinguish us all. The project could have considerable impact on people living in 2020: for example, a complete list of human gene products may provide new drugs, and gene therapy for single gene diseases may become routine (www.ornl.gov/hgmis/medicine/tnty.html).
Basic bioinformatic tools are already accessed in certain clinical situations to aid in diagnosis and treatment plans. For example, PubMed (www.nlm.nih.gov) is accessed freely for biomedical journals cited in Medline, and OMIM (Online Mendelian Inheritance in Man at www3.ncbi.nlm.nih.gov/Omim/), a search tool for human genes and genetic disorders, is used by clinicians to obtain information on genetic disorders in the clinic or hospital setting. An example of the application of bioinformatics in new therapeutic advances is the development of novel designer targeted drugs such as imatinib mesylate (Gleevec), which interferes with the abnormal protein made in chronic myeloid leukaemia.17 (Imatinib mesylate was synthesised at Novartis Pharmaceuticals by identifying a lead in a high throughput screen for tyrosine kinase inhibitors and optimising its activity for specific kinases.) The ability to identify and target specific genetic markers by using bioinformatic tools facilitated the discovery of this drug.
In the short term, as a result of the emerging bioinformatic analysis of the human genome project, more disease genes will be identified and new drug targets will be simultaneously discovered. Bioinformatics will serve to identify susceptibility genes and illuminate the pathogenic pathways involved in illness, and will therefore provide an opportunity for development of targeted therapy. Recently, potential targets in cancers were identified from gene expression profiles.18
In the longer term, integrative bioinformatic analysis of genomic, pathological, and clinical data in clinical trials will reveal potential adverse drug reactions in individuals by use of simple genetic tests. Ultimately, pharmacogenomics (using genetic information to individualise drug treatment) is likely to bring about a new age of personalised medicine; patients will carry gene cards with their own unique genetic profile for certain drugs aimed at individualised therapy and targeted medicine free from side effects.
Future directions
The practice of studying genetic disorders is changing from investigation of single genes in isolation to discovering cellular networks of genes, understanding their complex interactions, and identifying their role in disease.19 As a result of this, a whole new age of individually tailored medicine will emerge. Bioinformatics will guide and help molecular biologists and clinical researchers to capitalise on the advantages brought by computational biology.20 The clinical research teams that will be most successful in the coming decades will be those that can switch effortlessly between the laboratory bench, clinical practice, and the use of these sophisticated computational tools.

Additional educational resources
Journals
Specific bioinformatic journals exist (for example, www.bioinformatics.oupjournals.org), but papers from every area of science and medicine involving bioinformatic analysis are published in any biomedical journal. Examples include:
The human genome (special issue). Nature 2001;409:813-933.
The human genome (special issue). Science 2001;5507:1145-434.
The human genome (special issue). JAMA 2001;286:2211-333.
The human genome (special issue). Scientific American 2000;283:38-57.
Luscombe NM, Greenbaum D, Gerstein M. What is bioinformatics? Method Inform Med 2001;40:346-58.
Online Lectures on Bioinformatics (www.lectures.molgen.mpg.de/)
Books
Mount DW. Bioinformatics: sequence and genome analysis. Cold Spring Harbor Laboratory Press, 2001.
Baxevanis AD, Ouellette BFF. Bioinformatics: a practical guide to the analysis of genes and proteins. 2nd ed. John Wiley and Sons, 2001.
Lengauer T (Ed). Bioinformatics. Wiley-VCH Series, 2001. (Methods and principles in medicinal chemistry series.)
Higgins D, Taylor W. Bioinformatics. Oxford University Press, 2000. (Practical approach series.)
Baldi P, Brunak S. Bioinformatics. 2nd ed. MIT Press, 2001. (Adaptive computation and machine learning series.)
BMJ archive
Aitman TJ. DNA microarrays in medical practice. BMJ 2001;323:611-5.
Mathew CG. Postgenomic technologies: hunting the genes for common disorders. BMJ 2001;322:1031-4.
Stewart A, Haites N, Rose P. Online medical genetics resources: a UK perspective. BMJ 2001;322:1037-9.
Savill J. Molecular genetic approaches to understanding disease. BMJ 1997;314:126-9.
I thank Tessa Richards, Dipak Roy, and Professor Bill Ollier for
advice on the preparation of this manuscript and Andy Brass for
providing me with some of the diagrams.
Funding: Medical Research Council.
Competing interests: None declared.
1 International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 2001;409:860-921.
2 Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science 2001;291:1304-51.
3 Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995;269:496-512.
4 Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, et al. The minimal gene complement of Mycoplasma genitalium. Science 1995;270:397-403.
5 Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 1998;393:537-44.
6 Parkhill J, Wren BW, Thomson NR, Titball RW, Holden MT, Prentice MB, et al. Genome sequence of Yersinia pestis, the causative agent of plague. Nature 2001;413:523-7.
7 Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, et al. Life with 6000 genes. Science 1996;274:546.
8 The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998;282:2012-8.
9 Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, et al. A whole-genome assembly of Drosophila. Science 2000;287:2196-204.
10 Arabidopsis Genomics Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000;408:796-815.
11 Stein L. Genome annotation: from sequence to biology. Nat Rev Genet 2001;2:493-503.
12 Subramanian G, Adams MD, Venter JC, Broder S. Implications of the human genome for understanding human biology and medicine. JAMA 2001;286:2296-306.
13 Benton D. Bioinformatics: principles and potential of a new multidisciplinary tool. Trends Biotech 1996;14:261-312.
14 Maggio ET, Ramnarayan K. Recent developments in computational proteomics. Trends Biotech 2001;19:266-72.
15 Burley SK, Almo SC, Bonanno JB, Capel M, Chance MR, Gaasterland T, et al. Structural genomics: beyond the human genome project. Nat Genet 1999;23:151-7.
16 Tsoka S, Ouzounis CA. Recent developments and future directions in computational genomics. FEBS Lett 2000;480:42-8.
17 Druker BJ, Sawyers CL, Kantarjian H, Resta DJ, Reese SF, Ford JM, et al. Activity of a specific inhibitor of the BCR-ABL tyrosine kinase in the blast crisis of chronic myeloid leukemia and acute lymphoblastic leukemia with the Philadelphia chromosome. N Engl J Med 2001;344:1038-42.
18 Graeber TG, Eisenberg D. Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles. Nat Genet 2001;29:295-300.
19 Debouk C, Metcalf B. The impact of genomics on drug discovery. Annu Rev Pharmacol Toxicol 2000;40:193-208.
20 Butler D. Are you ready for the revolution? Nature 2001;409:758-60.
When I use a word
Meta-
Mr John Gleave, a neurosurgeon, has written to ask me
the origin of the meta- in meta-analysis. The answer
comes from Aristotle.
The Greek preposition μετά (meta) had several
meanings, depending on whether it governed the
accusative, genitive, or dative case. With the accusative
it could mean coming into or among, in pursuit of, or
coming after in place or time; with the genitive it could
mean in the midst of, between, or in common with;
and with the dative it could mean in the company of or
over and above. It was also used as a prefix to express
such notions as sharing, being in the midst of,
succession, pursuit, reversal, and (most commonly)
change. Examples of the last include metabolism,
metamorphosis, and metaplasia.
In scientific English words its uses include
“consequent upon” (as in the obsolete terms
meta-arthritic, metapneumonic), “behind” or “beyond”
in an anatomical sense (metabranchial, metacarpal,
metaphysis), “coming later” (metaphase, which comes
after prophase), or “changing” (metachromasia, a
property of materials that stain a different colour from
the stain used). In geology meta- is used to distinguish
various types of metamorphic processes. And chemists
use meta- to differentiate certain metameric chemical
compounds (such as metacresol, paracresol,
orthocresol).
And so to Aristotle. Some 250 years after his death,
Aristotle’s manuscripts came into the hands of
Andronicus of Rhodes, who edited them. Andronicus
called one set of papers The Physics (τὰ φυσικά), dealing
as they did with natural science. Then he published a
set of papers that he called The Metaphysics (τὰ μετὰ τὰ
φυσικά), simply because it came after The Physics.
However, because The Metaphysics dealt with what
Aristotle called “primary philosophy,” or ontology,
metaphysics came to be misunderstood as “the science
of that which transcends the physical.”
As a result, the prefix meta- was then used to
designate any higher science (actual or hypothetical)
that dealt with more fundamental problems than the
original science itself. This use first appeared in the
early 17th century (John Donne, for example, writes
about metatheology) but did not become really
popular until the middle of the 19th century. Examples
include metaethics (the study of the foundations of
ethics, especially the nature of ethical statements) and
metahistory (an inquiry into the principles that govern
historical events).
Then, from about 1940, it became commonplace to
prefix meta- to designate concern with basic principles.
A metacriterion is a criterion that defines criteria. A
metatheorem is a theorem about theorems. A
metalanguage is a language that supplies terms for
analysing a language; a metametalanguage does the
same for a metalanguage. And Jean Tinguely described
his machine-like sculptures as “metamechanical.” (But
a metaphysician is not a doctor’s doctor.)
In these poststructuralist times we recognise many
metaforms. Mantissa, a medical novel by John Fowles, is
metafiction; François Truffaut’s film La Nuit Américaine
is metacinema; several paintings by Magritte, notably
La Condition Humaine, are meta-art; and John Cage’s
piano piece 4’33’’ is metamusic.
So meta-analysis is an analysis of analyses, in which
sets of previously published (or unpublished) data are
themselves subjected as a whole to further analysis. In
this statistical sense it was first used in the 1970s by GV
Glass (Educ Res 1976;3(Nov):2). As he wrote, “The term
is a bit grand, but it is precise and apt.” Incidentally,
meta-analysis should not be confused with metanalysis,
which is the process whereby, for example, “a nadder”
becomes “an adder” (see BMJ 1999;318:1758 and
2000;321:953).
I trust that this cures Mr Gleave’s metagrobolism.
Jeff Aronson clinical pharmacologist, Oxford
... The automation of data management and analysis reduces errors and enhances efficiency. By bridging various disciplines, bioinformatics drives innovation, advancing our understanding of cancer biology and paving the way for more effective diagnostic, prognostic, and therapeutic strategies [12][13][14]. ...
Article
Full-text available
Backgrounds: Renal cell carcinoma (RCC) is the most common type of kidney cancer in adults. RCC begins in the renal tubule epithelial cells, essential for blood filtration and urine production. Methods: In this study, we aim to uncover the molecular mechanisms underlying kidney renal clear cell carci-noma (KIRC) by analyzing various non-coding RNAs (ncRNAs) and protein-coding genes involved in the disease. Using high-throughput sequencing datasets from the Gene Expression Omnibus (GEO), we identified differentially expressed mRNAs (DEMs), miRNAs (DEMIs), and circRNAs (DECs) in KIRC samples compared to normal kidney tissues. Our approach combined differential expression analysis, functional enrichment through Gene Ontology (GO) and KEGG pathway mapping, and a Protein-Protein Interaction (PPI) network to identify crucial hub genes in KIRC progression. Results: Key findings include the identification of hub genes such as EGFR, FN1, IL6, and ITGAM, which were closely associated with immune responses, cell signaling, and metabolic dysregulation in KIRC. Further analysis indicated that these genes could be potential biomarkers for prognosis and therapeutic targets. We constructed a competitive endogenous RNA (ceRNA) network involving lncRNAs, circRNAs, and miRNAs, suggesting complex regulatory interactions that drive KIRC pathogenesis. Additionally, the study examined drug sensitivity associated with the expression of hub genes, revealing the potential for personalized treatments. Immune cell infiltration patterns showed significant correlations with hub gene expression, highlighting the importance of immune modulation in KIRC. Conclusion: This research provides a foundation for developing targeted therapies and diagnostic biomarkers for KIRC while underscoring the need for experimental validation to confirm these bioinformatics insights.
... Bioinformatics studies have been found very useful in aiding and complementing studies from wet labs (Vignani et.al, 2019). These computeraided studies have made drug discovery and design more feasible by revealing hidden information about structural and functional knowledge of nucleic acids, proteins, and potential drug targets (Bayat, 2002;Behera et.al, 2021a, b). This study employed bioinformatics techniques to investigate the inhibitory effects of selected ligands from popularly consumed vegetables, herbs, spices and medicinal plants on the COX-2 enzyme. ...
Article
Introduction: Inflammation has been shown to be implicated in many communicable and noncommunicable diseases. Several studies have indicated the beneficial/protective effects of phytochemicals from many commonly used herbs, spices, vegetables and medicinal plants against ailments that have inflammatory components. Our study investigated the possible anti-inflammatory properties of ligands from these commonly used plants by exploring their interactions with the cyclooxygenase-2 enzyme, using bioinformatics techniques. Cyclooxygenase-2 (COX-2) is a key enzyme involved in the production of prostaglandins implicated in inflammatory disorders. Materials and Methods: Twenty-eight ligands from plants were used for the study; ibuprofen and celecoxib served as reference ligands. The 3-D structures of the 30 ligands were retrieved from the PubChem database in their Structure Data Format (SDF). COX-2 was retrieved in its Protein Data Bank (PDB) format. The ligands and the protein were converted to their pdbqt formats and subjected to molecular docking through standard bioinformatics procedures. One of the ligands (quercetin) was further subjected to molecular dynamics simulation using the Desmond Maestro software. Results: Many of the ligands compared very well with celecoxib in their binding properties and exhibited more negative binding energies than ibuprofen. Additional interactions of H bonds and hydrophobic bonds were noticed post molecular dynamics simulation of quercetin with COX-2, indicating dynamic forces fluctuations. MD simulations showed that Gln42, Gly45, Pro 153, Pro154 and Glu465 were the best amino acid side chains that interacted with quercetin for the stabilization of the protein-ligand complex. The energy values and protein-ligand interactions indicate affinity and stability of complex. 
Conclusion: Many of the ligands subjected to molecular docking and MD simulation can be taken as promising drug targets and subjected to ADMET (absorption, distribution, metabolism, excretion, toxicity) properties analysis and clinical trials. This is especially important in view of the various side effects of both selective and nonselective NSAIDs. In addition, the authors, through the findings of this study, recommend more consumption of natural foods that have health benefits rather than processed and artificial products.
... In recent years, the scientific world has developed a lot of mapping of the human genome, even now starting to map genomes for other organisms. The analysis of emerging genomic sequence data and genome projects of humans and other organisms is an important achievement for bioinformatics (Bayat, 2002). ...
Article
The rapid growth in data volume is a pressing problem: larger datasets demand more storage and take longer to process, while fast processing remains essential. This research addresses both the storage and the processing problems with a string-matching computational model that uses the Boyer-Moore-Horspool algorithm on the big data platform Apache Spark, with the Hadoop Distributed File System (HDFS) as data storage on the cluster. The study compares string-matching process times for a stand-alone program and for Apache Spark on a single node and on 3, 5, 11 and 16 nodes, using HDFS storage on clusters on Google Cloud Platform. The case study is in bioinformatics and addresses two problems in biology: the search for motifs that distinguish flowering plants from other plant groups, and the search for motifs to detect symptoms of begomovirus, the cause of curly leaf disease. The differences in execution time were not significant, because the data used could still be processed by classical programs. The accuracy of the program run on Apache Spark is 83.5%.
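The Boyer-Moore-Horspool matching named in this abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: the function names `build_shift_table` and `horspool_search` and the example sequence are hypothetical and chosen only for illustration. The idea is to align the pattern with a window of the text, compare, and then jump ahead by a distance looked up from the last character of the window.

```python
def build_shift_table(pattern):
    """Map each character of pattern[:-1] to its distance from the pattern's end.

    Later occurrences overwrite earlier ones, so the rightmost occurrence
    (the smallest safe shift) is the one that is kept.
    """
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = build_shift_table(pattern)
    hits = []
    i = 0
    while i <= n - m:
        # Compare the current window with the pattern.
        if text[i:i + m] == pattern:
            hits.append(i)
        # Jump by the shift of the window's last character;
        # characters absent from the table allow a full jump of m.
        i += shift.get(text[i + m - 1], m)
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence and motif, for illustration only.
    print(horspool_search("GATTACAGATTACA", "TTAC"))
```

In a distributed setting such as the one the abstract describes, a function of this kind would presumably be applied to each partition of the sequence data, but how the study splits and distributes its data is not stated here.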
... These tools play a crucial role in modern biological research, enabling researchers to analyze biological data more efficiently and accurately, leading to new discoveries and insights into the workings of living organisms (Table 6-4.6). Bioinformatics encompasses a wide range of tools, methods, and applications [70]. In Table 6-4.6, ...
Chapter
Article
In recent years, advancements in gene structure prediction have been significantly driven by the integration of deep learning technologies into bioinformatics. With the transition from traditional thermodynamics and comparative genomics methods to modern deep learning-based models such as CDSBERT, DNABERT, RNA-FM, and PlantRNA-FM, prediction accuracy and generalization have seen remarkable improvements. These models, leveraging genome sequence data along with secondary and tertiary structure information, have facilitated diverse applications in studying gene functions across animals, plants, and humans. They also hold substantial potential for applications in early disease diagnosis, personalized treatment, and genomic evolution research. This review combines traditional gene structure prediction methods with advancements in deep learning, showcasing applications in functional region annotation, protein-RNA interactions, and cross-species genome analysis. It highlights their contributions to animal, plant, and human disease research while exploring future opportunities in cancer mutation prediction, RNA vaccine design, and CRISPR gene editing optimization. The review also emphasizes future directions, such as model refinement, multimodal integration, and global collaboration. By offering a concise overview and forward-looking insights, this article aims to provide a foundational resource and practical guidance for advancing nucleic acid structure prediction research.
Chapter
Worldwide problems, including increasing population, depletion of natural resources, and climate change, have led to a food crisis. Moreover, various diseases affecting livestock are impacting humans. Hence, the scientific community has been advised to adopt new and innovative approaches, and the use of bioinformatics in veterinary science is one such useful technique. In the current artificial intelligence (AI) era, several problems can be solved by decoding livestock systems using bioinformatic tools. Bioinformatics is a vast subject that has applications in various fields, and the approach incorporating it with veterinary and animal science is called “vetinformatics.” It can help understand the complex molecular mechanisms in farm animals that in turn would aid in developing practices for caring, breeding, and disease management, leading to robust livestock productivity. This chapter seeks to introduce the concept behind vetinformatics, the related challenges and opportunities, and its applications in veterinary and animal sciences.
Chapter
RNA viruses are responsible for numerous animal diseases, leading to significant economic impact on the livestock and poultry industry. A safe and effective method to generate broad and long-lasting immunity against these viruses is challenging, necessitating the development of reverse genetics technologies. Reverse genetics is vital in vaccine development, allowing precise modification of viral genomes to create safer and more effective vaccines. Its significance lies in enabling rapid vaccine generation, tailoring vaccines to specific viral strains, creating multivalent vaccines, ensuring vaccine safety, studying immune responses, and developing novel vaccine platforms. Overall, reverse genetics revolutionizes vaccine development by providing unprecedented control over viral genomes, leading to safer, more effective, and tailored vaccines against various infectious diseases. Bioinformatics plays a crucial role in developing reverse genetics-based vaccine platforms, which manipulate the genetic makeup of an organism to examine the function of specific genetic mutations or markers. Different tools and databases are used to analyse gene sequences, anticipate the consequences of genetic alterations, and create and assess the effectiveness of techniques for introducing those modifications. Computational methods can also be used to interpret and analyse data generated from reverse genetics research, such as transcriptome and proteome profiling, to better understand the functional effect of genetic mutations. Bioinformatics also predicts how genetic alterations impact protein structure and function, designs specific oligonucleotides, and facilitates the generation of reverse genetic clones through in silico cloning. Integrating bioinformatics with CRISPR-Cas9 and TALENs improves the efficiency and specificity of these genome editing systems by selecting the most effective guide RNAs to minimize off-target effects and by optimizing TALE proteins for precise editing.
This chapter delves into current research on developing reverse genetic platforms for vaccine development against RNA viruses in domestic animals, reviews the use of computational tools and methodologies in creating efficient reverse genetic platforms, and discusses the challenges and future directions of this research area.
Article
The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety of repeated sequences, both local and dispersed. The distinctive distribution of some repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.
Article
An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism.
Article
The complete nucleotide sequence (580,070 base pairs) of the Mycoplasma genitalium genome, the smallest known genome of any free-living organism, has been determined by whole-genome random sequencing and assembly. A total of only 470 predicted coding regions were identified that include genes required for DNA replication, transcription and translation, DNA repair, cellular transport, and energy metabolism. Comparison of this genome to that of Haemophilus influenzae suggests that differences in genome content are reflected as profound differences in physiology and metabolic capacity between these two organisms.
Article
The mapping of the human genome was completed earlier this year and efforts are underway to understand the role of gene products (i.e. proteins) in biological pathways and human disease and to exploit their functional roles to derive protein therapeutics and protein-based drugs. A key component to the next revolution in the 'post-genomic' era will be the increasingly widespread use of protein structure in rational experimental design. Improvements in quality, availability and utility of large-scale three- and four-dimensional protein structural information are enabling a revolution in rational design, having particular impact on drug discovery and optimization. New computational methodologies now yield modeled structures that are, in many cases, quantitatively comparable with crystal structures, at a fraction of the cost.
Article
High-throughput gene sequencing has revolutionized the process used to identify novel molecular targets for drug discovery. Thousands of new gene sequences have been generated but only a limited number of these can be converted into validated targets likely to be involved in disease. We describe here some of the approaches used at SmithKline Beecham to select and validate novel targets. These include the identification of selective tissue gene product expression, such as for cathepsin K, a novel osteoclast-specific cysteine protease. We also describe the discovery and functional characterization of novel members of the G-protein coupled receptor superfamily and their pairing with natural ligands. Lastly, we discuss the promises of gene microarrays and proteomics, developing technologies that allow the parallel analyses of tissue expression patterns of thousands of genes or proteins, respectively.
Article
The flowering plant Arabidopsis thaliana is an important model system for identifying genes and determining their functions. Here we report the analysis of the genomic sequence of Arabidopsis. The sequenced regions cover 115.4 megabases of the 125-megabase genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole-genome duplication, followed by subsequent gene loss and extensive local gene duplications, giving rise to a dynamic genome enriched by lateral gene transfer from a cyanobacterial-like ancestor of the plastid. The genome contains 25,498 genes encoding proteins from 11,000 families, similar to the functional diversity of Drosophila and Caenorhabditis elegans— the other sequenced multicellular eukaryotes. Arabidopsis has many families of new proteins but also lacks several common protein families, indicating that the sets of common proteins have undergone differential expansion and contraction in the three multicellular eukaryotes. This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.