WholeCellKB: Model organism databases for comprehensive whole-cell models

Jonathan R. Karr¹, Jayodita C. Sanghvi², Derek N. Macklin², Abhishek Arora³ and Markus W. Covert²,*
¹Graduate Program in Biophysics, ²Department of Bioengineering and ³Department of Electrical Engineering, Stanford University, 318 Campus Drive West, Stanford, CA 94305, USA
Received August 15, 2012; Revised October 1, 2012; Accepted October 19, 2012
ABSTRACT
Whole-cell models promise to greatly facilitate the analysis of complex biological behaviors. Whole-cell model development requires comprehensive model organism databases. WholeCellKB (http://wholecellkb.stanford.edu) is an open-source web-based software program for constructing model organism databases. WholeCellKB provides an extensive and fully customizable data model that fully describes individual species including the structure and function of each gene, protein, reaction and pathway. We used WholeCellKB to create WholeCellKB-MG, a comprehensive database of the Gram-positive bacterium Mycoplasma genitalium using over 900 sources. WholeCellKB-MG is extensively cross-referenced to existing resources including BioCyc, KEGG and UniProt. WholeCellKB-MG is freely accessible through a web-based user interface as well as through a RESTful web service.
INTRODUCTION
A primary challenge in computational biology is to predict how complex phenotypes such as growth and replication arise from networks of individual molecules. Whole-cell models promise to tackle this challenge by integrating heterogeneous molecular data into predictive computational models. This integration requires model organism databases which comprehensively provide readily computable molecular data.
WholeCellKB is an open-source, web-based software program for developing comprehensive model organism databases for whole-cell models. As illustrated in Figure 1, WholeCellKB enables whole-cell modeling by organizing diverse molecular data from primary research articles, reviews, books and databases into a single database. The WholeCellKB data model supports detailed descriptions of individual species including their genes, operons, proteins, macromolecular complexes, molecular interactions, chemical reactions and pathways. Importantly, WholeCellKB also facilitates extensive source documentation. We used WholeCellKB to develop WholeCellKB-MG, an extensive database of the pathogenic Gram-positive bacterium Mycoplasma genitalium.
Here, we describe WholeCellKB-MG’s content, curation, user interface and implementation. We also compare WholeCellKB-MG to existing resources, highlighting WholeCellKB-MG’s greater scope and granularity. Finally, we discuss our future plans for WholeCellKB.
CONTENT
Our goal was to create a database comprehensive enough to enable a whole-cell model (1). As illustrated in Figure 2, WholeCellKB-MG broadly represents M. genitalium molecular biology including (i) its subcellular organization; (ii) its chromosome sequence; (iii) the location, length, direction and essentiality of each gene; (iv) the organization and promoter of each transcription unit; (v) the expression and degradation rate of each RNA transcript; (vi) the specific folding and maturation pathway of each RNA and protein species including the localization, N-terminal cleavage, signal sequence, prosthetic groups, disulfide bonds and chaperone interactions of each protein species; (vii) the subunit composition of each macromolecular complex; (viii) its genetic code; (ix) the binding sites and footprint of every DNA-binding protein; (x) the structure, charge and hydrophobicity of every metabolite; (xi) the stoichiometry, catalysis, coenzymes, energetics and kinetics of every chemical reaction; (xii) the regulatory role of each transcription factor; (xiii) its chemical composition and (xiv) the composition of its laboratory growth medium. Table 1 summarizes WholeCellKB-MG’s size and content.
*To whom correspondence should be addressed. Tel: +1 650 7256615; Fax: +1 650 7211409; Email: mcovert@stanford.edu
Nucleic Acids Research, 2012, 1–6
doi:10.1093/nar/gks1108
Nucleic Acids Research Advance Access published November 21, 2012
© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.
CURATION
We curated WholeCellKB-MG in five steps based on >900 primary research articles, reviews, books and databases. First, we curated the overall structure of M. genitalium including its size, shape, subcellular organization and chemical composition based on several experimental studies including Morowitz et al. (2). We also assembled the chemical composition of Mycoplasma laboratory growth medium based on analyses reported by Solabia (3).
Second, we curated the structure of the M. genitalium chromosome including its sequence, the location, length and direction of each gene and its transcription unit organization based on the Comprehensive Microbial Resource (CMR) annotation (4) and a recent study by Güell et al. (5). We reconstructed the location of each promoter and the expression, degradation rate and essentiality of each gene product from four recent studies (6–9). We catalogued DNA-binding sites and transcriptional regulatory interactions from several sources including DBTBS (10).
Third, we assembled the structure of each RNA and protein gene product. We compiled the post-transcriptional processing and modification of each RNA transcript from several sources including Peil (11). We reconstructed the signal sequence, localization, chaperone-mediated folding, post-translational modification, disulfide bonds, subunit composition and DNA footprint of each protein and macromolecular complex from a large number of primary research articles, computational models and databases. We assembled the chemical regulation of each gene product from several sources including DrugBank (12). We used ExPASy ProtParam (13) to calculate the pI, extinction coefficient, half-life, instability index, aliphatic index and grand average of hydropathy of every protein species.
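Two of the ProtParam-style descriptors named above can be reproduced from their published definitions. The following is a minimal Python sketch, not ExPASy's own code: it assumes the standard Kyte-Doolittle hydropathy scale for the grand average of hydropathy and Ikai's formula for the aliphatic index, and the function names are illustrative choices rather than anything defined by WholeCellKB.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


def aliphatic_index(sequence):
    """Aliphatic index: X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)),
    where X is the mole percent of each residue in the sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")

    def mole_percent(residue):
        return 100.0 * sequence.count(residue) / len(sequence)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


if __name__ == "__main__":
    peptide = "MKVLA"  # toy sequence, not a real M. genitalium protein
    print(gravy(peptide), aliphatic_index(peptide))
```

Non-standard residue codes raise a KeyError in this sketch; a production tool would need an explicit policy for them.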
Fourth, we curated the specific chemical reactions catalyzed by each gene product starting from the CMR (4), GenBank (14), KEGG (15) and UniProt (16) genome annotations and the reconstructed RNA and protein maturation pathways. To maximize the scope of the database and to fill gaps in the genome annotation, we expanded each gene product’s annotation based on primary research articles we identified by searching PubMed (17) and Google Scholar (http://scholar.google.com). We consulted BioCyc (18), KEGG (15), two flux-balance analysis (FBA) models of bacterial metabolism (19,20) and hundreds of additional primary research articles to curate the stoichiometry of each chemical reaction. We assembled the thermodynamics and kinetics of each chemical reaction from several databases including BRENDA (21), SABIO-RK (22) and UniProt (16) and an FBA model (20).
Finally, we compiled the M. genitalium metabolome. We included all metabolites involved in the reconstructed reactions, biomass or growth medium. We curated the empirical formula, structure, charge and intracellular concentration of each metabolite from several databases including BioCyc (18), CyberCell (23) and PubChem (24) and a comprehensive mass-spectrometry study (25). We used ChemAxon Marvin (http://www.chemaxon.com/products/marvin) to calculate the molecular weight, van der Waals volume, pI, logD and logP of each metabolite.
To create a comprehensive description of M. genitalium physiology, we based WholeCellKB-MG on studies of closely related organisms where studies of M. genitalium were unavailable. In cases where multiple observations were available, we based the reconstruction on the most closely related organism. We used bi-directional best BLAST (26) to identify homologous genes. To provide model transparency, we tracked the species, experimental conditions and citation of each piece of evidence.
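The bi-directional best BLAST criterion mentioned above pairs two genes only when each is the other's top-scoring hit. The sketch below illustrates that selection step in Python under simplifying assumptions: hits are supplied as (query, subject, bit score) tuples already parsed from BLAST output, ties are broken by first occurrence, and the gene identifiers are made up for illustration.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical identifiers and scores, for illustration only.
    genome_a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0), ("a2", "b2", 120.0)]
    genome_b_vs_a = [("b1", "a1", 245.0), ("b2", "a1", 95.0), ("b2", "a2", 80.0)]
    print(reciprocal_best_hits(genome_a_vs_b, genome_b_vs_a))
```

In this toy input, a2's best hit is b2, but b2's best hit is a1, so only the a1/b1 pair satisfies the reciprocal criterion.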
COMPARISON TO EXISTING RESOURCES
WholeCellKB represents the specific molecular interactions of individual species similar to previous databases such as BioCyc (18,27) and BiGG (28). In particular, WholeCellKB’s data model, user interface and species-specific content were heavily inspired by BioCyc.
Importantly, WholeCellKB-MG also has several major differences from existing resources. First, WholeCellKB-MG more broadly represents cell physiology. WholeCellKB-MG represents the molecular details of 28 cellular processes including well-studied processes such as metabolism as well as less well-understood processes such as DNA damage and repair and RNA and protein degradation. The online documentation at http://wholecellkb.stanford.edu/about provides further information about the WholeCellKB-MG data model and how WholeCellKB-MG represents each cellular process. Figure 3 compares WholeCellKB-MG’s content to that of several existing databases.

Figure 1. WholeCellKB-MG enables whole-cell modeling by integrating diverse data sources into a single database. (a) Currently, WholeCellKB-MG integrates >900 primary research articles, reviews, books and databases. (b) WholeCellKB-MG comprehensively represents all aspects of molecular physiology including metabolomics, genomics, transcriptomics and proteomics. (c) WholeCellKB-MG provides molecular data for whole-cell models.

Second, whole-cell modeling requires model organism databases which explicitly define the participants of each molecular interaction and chemical reaction. WholeCellKB-MG addresses this need by representing the specific molecules involved in every molecular interaction and by requiring structures for each molecule. For example, WholeCellKB-MG represents the specific RNA bases involved in every RNA methylation reaction, whereas existing resources lump RNA methylation interactions into a single generic reaction. WholeCellKB-MG represents every major cellular process including RNA processing and protein processing, modification and translocation with similarly fine molecular resolution.
Third, where available WholeCellKB-MG contains not only structural but also quantitative functional descriptions of each molecule and molecular interaction. For example, WholeCellKB-MG contains chemical reaction rate laws and kinetic parameters, RNA transcript expressions and half-lives, and cellular and growth medium chemical compositions. In total, WholeCellKB-MG represents 1836 heterogeneous model parameters. Table 2 summarizes how WholeCellKB represents these heterogeneous parameters using several types of database entries.
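To make concrete how parameters of this kind feed a model, the Python sketch below evaluates a rate law from K_m and V_max values and converts an RNA half-life into a decay rate constant. The Michaelis-Menten form and first-order decay are assumptions chosen for illustration; the rate laws actually stored in WholeCellKB-MG may take other forms, and the numbers used here are invented.

```python
import math


def michaelis_menten_rate(v_max, k_m, substrate):
    """Reaction rate v = V_max * [S] / (K_m + [S]) (assumed rate-law form)."""
    if k_m <= 0 or substrate < 0:
        raise ValueError("K_m must be positive and [S] non-negative")
    return v_max * substrate / (k_m + substrate)


def decay_rate_from_half_life(half_life):
    """First-order decay constant k = ln(2) / t_half (assumed kinetics)."""
    if half_life <= 0:
        raise ValueError("half-life must be positive")
    return math.log(2) / half_life


if __name__ == "__main__":
    # Invented values: V_max = 10 (rate units), K_m = 2, [S] = 2 (same units).
    print(michaelis_menten_rate(v_max=10.0, k_m=2.0, substrate=2.0))
    # Invented half-life of 5 time units.
    print(decay_rate_from_half_life(5.0))
```

When the substrate concentration equals K_m, the rate is half of V_max, which is a quick sanity check on curated values.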
DATA INPUT
WholeCellKB provides administrators with two editing interfaces: (i) a web form to edit single entries and (ii) an Excel-based interface to simultaneously edit multiple entries. We believe that these two interfaces enable collaborative model organism database development.
At the beginning of our M. genitalium curation efforts, we primarily used the batch interface to quickly import large amounts of data from other genome annotations. We continued to use the batch interface throughout the project to import high-throughput molecular data. Later in our M. genitalium curation efforts, we primarily used the form interface to refine our annotation based on specific biochemical studies. Overall, we found that WholeCellKB improved the quality of our annotation and in particular encouraged us to thoroughly annotate the original source of each datum.

Figure 2. WholeCellKB aims to comprehensively describe cell physiology including the structure and dynamics of every metabolite, gene, RNA transcript and protein. Boxes illustrate several molecular properties represented by WholeCellKB.

Table 1. WholeCellKB-MG size

Entry type                                Number
Cellular state                            16
Chromosome feature                        2305
Compartment                               6
Gene                                      525
Metabolite                                722
Pathway                                   17
Process                                   28
Protein complex                           201
Protein monomer                           482
Reaction                                  1857
Transcription unit                        335
Transcriptional regulatory interaction    30

Data submitted to WholeCellKB was extensively validated to ensure consistency and correctness. For example, WholeCellKB checked that each chemical formula was valid, that each reaction was mass-balanced and that every molecule and kinetic parameter was defined in each reaction rate law. WholeCellKB provided hints on how to correct invalid data such as the atom imbalance of invalid reactions.
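The formula and mass-balance checks described above can be illustrated with a short Python sketch. This is not WholeCellKB's validation code: it assumes flat empirical formulas without parentheses or charges, does not verify that element symbols are real elements, and uses function names invented for this example.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "C6" or "O".
FORMULA_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Count atoms in a flat empirical formula such as 'C6H12O6'."""
    # Anything left after removing all valid tokens means the formula is malformed.
    if not formula or FORMULA_TOKEN.sub("", formula):
        raise ValueError(f"invalid chemical formula: {formula!r}")
    atoms = Counter()
    for element, count in FORMULA_TOKEN.findall(formula):
        atoms[element] += int(count) if count else 1
    return atoms


def atom_imbalance(reactants, products):
    """Return {element: products minus reactants} for every unbalanced element.

    reactants, products: dicts mapping formula -> stoichiometric coefficient.
    An empty result means the reaction is mass-balanced.
    """
    net = Counter()
    for side, sign in ((reactants, -1), (products, 1)):
        for formula, coefficient in side.items():
            for element, n in parse_formula(formula).items():
                net[element] += sign * coefficient * n
    return {element: n for element, n in net.items() if n != 0}


if __name__ == "__main__":
    # Balanced: C6H12O6 + 6 O2 -> 6 CO2 + 6 H2O
    print(atom_imbalance({"C6H12O6": 1, "O2": 6}, {"CO2": 6, "H2O": 6}))
    # Unbalanced: H2 + O2 -> H2O (one oxygen atom is missing on the product side)
    print(atom_imbalance({"H2": 1, "O2": 1}, {"H2O": 1}))
```

Returning the per-element difference, rather than a plain true/false result, is what allows a curation tool to hint at how an invalid reaction could be corrected.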
DATA ACCESS
WholeCellKB-MG is freely accessible through a simple and intuitive web-based interface at http://wholecellkb.stanford.edu. This web-based interface allows users to quickly browse, search and export the database. It also allows administrators to add, edit and delete entries. Importantly, the interface is extensively commented and hyperlinked, allowing users to easily find the primary source of each datum.
WholeCellKB-MG is also accessible through a RESTful interface. This interface provides the content of every HTML page in JSON and XML formats. We are currently using this interface to develop software for visualizing whole-cell simulations.
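A client for such an interface might look like the following Python sketch. Only the base URL comes from this article. The `format` query parameter and the example page path are assumptions made for illustration, so the online documentation should be consulted for the actual URL scheme, and the sketch makes no claim about whether the server currently responds.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://wholecellkb.stanford.edu"  # base URL given in the article


def entry_url(path, fmt="json"):
    """Build the URL of a machine-readable version of an HTML page.

    ASSUMPTION: the format is selected with a 'format' query parameter.
    """
    query = urlencode({"format": fmt})
    return f"{BASE_URL}/{path.lstrip('/')}?{query}"


def fetch_entry(path, timeout=30):
    """Download a page as JSON and decode it into Python objects."""
    with urlopen(entry_url(path), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical page path, shown only to demonstrate URL construction.
    print(entry_url("some/entry/page"))
```

Keeping URL construction separate from the network call makes the assumed URL scheme easy to adjust and to test without contacting the server.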
DEVELOPER API
WholeCellKB was designed to enable modelers to develop model organism databases for whole-cell models, including designing custom data models and user interfaces. WholeCellKB provides a framework for viewing, searching, exporting and editing database entries which developers can combine with custom data models and HTML templates. This allows developers to build custom model organism databases with minimal effort and without any knowledge of database design. Furthermore, because WholeCellKB is open source and implemented with Python, modelers can easily display scientific calculations alongside curated data in the user interface. The online documentation provides further instructions on how to customize WholeCellKB.
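As a rough picture of what a custom data model entry with source documentation could contain, the sketch below uses plain Python dataclasses. It is deliberately framework-agnostic and is not the Django model code that WholeCellKB uses; every class name, field name and example value is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Evidence:
    """Provenance of one observation: organism, conditions and source."""
    species: str
    conditions: str
    citation: str


@dataclass
class Gene:
    """A hypothetical gene entry with cross-references and evidence."""
    identifier: str
    coordinate: int          # 1-based start position on the chromosome
    length: int              # length in nucleotides
    direction: str           # "forward" or "reverse"
    essential: bool
    cross_references: Dict[str, str] = field(default_factory=dict)
    evidence: List[Evidence] = field(default_factory=list)

    def __post_init__(self):
        # Simple validation, in the spirit of checking data on submission.
        if self.direction not in ("forward", "reverse"):
            raise ValueError(f"unknown direction: {self.direction!r}")
        if self.coordinate < 1 or self.length < 1:
            raise ValueError("coordinate and length must be positive")

    @property
    def end_coordinate(self):
        """Last position covered by the gene (inclusive)."""
        return self.coordinate + self.length - 1


if __name__ == "__main__":
    gene = Gene(
        identifier="gene_x",  # made-up identifier
        coordinate=100,
        length=300,
        direction="forward",
        essential=True,
        cross_references={"ExampleDB": "EX0001"},  # made-up cross-reference
        evidence=[Evidence("related species", "in vitro", "citation placeholder")],
    )
    print(gene.identifier, gene.end_coordinate)
```

Attaching a list of evidence records to each entry mirrors the practice of tracking the species, experimental conditions and citation behind each datum.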
Figure 3. Detailed comparison of the content of WholeCellKB-MG and several existing biological databases. In addition to containing detailed descriptions of genetics, metabolism and transcriptional regulation comparable to existing resources such as BiGG (28), BioCyc (18) and CMR (4), WholeCellKB-MG has detailed representations of RNA degradation, RNA and protein maturation and protein translocation. Black boxes indicate physiology represented with fine granularity including the specific molecules involved in each specific interaction (e.g. specific metabolites involved in each metabolic reaction). Gray boxes indicate coarsely represented physiology, for example lumping families of similar reactions such as RNA methylation into a single database entry rather than representing the specific RNA bases involved in each individual reaction. White boxes indicate unrepresented physiology.

Table 2. WholeCellKB-MG parameters

Type                          Number
Cell composition              73
Media composition             83
Reaction K_eq                 225
Reaction K_m                  483
Reaction V_max                434
RNA expression                525
RNA half-life                 525
Stimulus values               10
Transcriptional regulation    32
  Activity                    30
  Affinity                    2
Other                         154

IMPLEMENTATION
WholeCellKB was implemented in Python using the Django (http://www.djangoproject.com) web framework and stored using the relational database MySQL (http://www.mysql.com). Full-text search was implemented using Haystack (http://haystacksearch.org) and Xapian (http://xapian.org). Excel, JSON and XML export were implemented using OpenPyXL (http://bitbucket.org/ericgazoni/openpyxl), simplejson (http://pypi.python.org/pypi/simplejson) and xml.dom (http://docs.python.org/library/xml.dom.html). WholeCellKB runs on the Apache (http://www.apache.org) web server using the mod_wsgi (http://code.google.com/p/modwsgi) module. All of the software used to implement WholeCellKB is available open source.
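The JSON and XML export step can be approximated with the Python standard library alone. In the sketch below, the built-in `json` module stands in for simplejson and `xml.dom.minidom` is used as a concrete implementation of the xml.dom interface. The entry is a flat dictionary with invented keys, and the element layout is an assumption rather than WholeCellKB's actual export schema.

```python
import json
from xml.dom.minidom import Document


def to_json(entry):
    """Serialize a flat entry dictionary to a JSON string."""
    return json.dumps(entry, sort_keys=True, indent=2)


def to_xml(entry, root_tag="entry"):
    """Serialize a flat entry dictionary to XML, one child element per key.

    Keys must be valid XML element names; values are written as text.
    """
    document = Document()
    root = document.createElement(root_tag)
    document.appendChild(root)
    for key in sorted(entry):
        child = document.createElement(key)
        child.appendChild(document.createTextNode(str(entry[key])))
        root.appendChild(child)
    return document.toxml()


if __name__ == "__main__":
    # Invented entry, for illustration only.
    example = {"id": "gene_x", "length": 300, "direction": "forward"}
    print(to_json(example))
    print(to_xml(example))
```

Nested entries, such as a reaction with a list of participants, would need a recursive serializer, which is omitted here to keep the sketch short.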
SUMMARY AND FUTURE DIRECTIONS
WholeCellKB-MG is an extensive database of
M. genitalium designed to facilitate whole-cell modeling.
Currently, we are continuing to curate the database as well
as starting to create equally comprehensive databases of
other model microorganisms. Beyond facilitating realistic
whole-cell models, we believe that these databases are
useful platforms for experimental and computational
biologists.
We created WholeCellKB-MG using WholeCellKB, an
open-source, web-based software program which enables
modelers to quickly develop model organism databases
for whole-cell modeling.
Beyond continuing to curate model organisms, we also plan to continue to strengthen the WholeCellKB software. We plan to add tools for importing databases curated with other tools such as PathwayTools (27), storing the detailed history of each database entry and comparing model organism databases, as well as to expand the search functionality of the RESTful API. As the whole-cell modeling community grows, we also plan to enable open editing similar to Wikipedia. Finally, we are currently using WholeCellKB’s RESTful API to develop tools for visualizing whole-cell simulations.
We hope that other researchers will use WholeCellKB to develop model organism databases and whole-cell models. We believe that WholeCellKB will not only speed up database curation and whole-cell model development but also encourage best annotation practices. Ultimately, we hope that WholeCellKB in combination with whole-cell models will accelerate biological discovery and bioengineering.
ACKNOWLEDGEMENTS
We thank Elsa Birch, Nick Ruggero and Ruby Lee for
enlightening discussions on database design, curation,
modeling and visualization.
FUNDING
NIH Director’s Pioneer Award [5DP1LM01150-05] and a
Hellman Faculty Scholarship (to M.W.C.); NDSEG, NSF
and Stanford Graduate Fellowships (to J.R.K.); NSF and
Bio-X Graduate Student Fellowships (to J.C.S.) and a
Stanford Graduate Fellowship (to D.N.M.). Funding for
open access charge: NIH Director’s Pioneer Award
[5DP1LM01150-05].
Conflict of interest statement. None declared.
REFERENCES
1. Karr,J.R., Sanghvi,J.C., Macklin,D.N., Jacobs,J.M.,
Gutschow,M.V., Bolival,B., Assad-Garcia,N., Glass,J.I. and
Covert,M.W. (2012) A whole-cell computational model predicts
phenotype from genotype. Cell, 150, 389–401.
2. Morowitz,H.J., Tourtellotte,M.E., Guild,W.R., Castro,E. and
Woese,C. (1962) The chemical composition and submicroscopic
morphology of Mycoplasma gallisepticum, Avian PPLO 5969.
J. Mol. Biol., 4, 93–103.
3. Solabia. (2011) Biotechnology Products, Retrieved from http://
www.solabia.com/ (14 March 2011, date last accessed).
4. Davidsen,T., Beck,E., Ganapathy,A., Montgomery,R., Zafar,N.,
Yang,Q., Madupu,R., Goetz,P., Galinsky,K., White,O. et al.
(2010) The comprehensive microbial resource. Nucleic Acids Res.,
38, D340–D345.
5. Güell,M., van Noort,V., Yus,E., Chen,W.H., Leigh-Bell,J.,
Michalodimitrakis,K., Yamada,T., Arumugam,M., Doerks,T.,
Kühner,S. et al. (2009) Transcriptome complexity in a
genome-reduced bacterium. Science, 326, 1268–1271.
6. Weiner,J. 3rd, Herrmann,R. and Browning,G.F. (2000)
Transcription in Mycoplasma pneumoniae. Nucleic Acids Res., 2,
241–249.
7. Weiner,J. 3rd, Zimmerman,C.U., Göhlmann,H.W. and
Herrmann,R. (2003) Transcription profiles of the bacterium
Mycoplasma pneumoniae grown at different temperatures. Nucleic
Acids Res., 37, 6306–6320.
8. Bernstein,J.A., Khodursky,A.B., Lin,P.H., Lin-Chao,S. and
Cohen,S.N. (2002) Global analysis of mRNA decay and
abundance in Escherichia coli at single-gene resolution using
two-color fluorescent DNA microarrays. Proc. Natl Acad. Sci.
USA, 22, 235–244.
9. Glass,J.I., Assad-Garcia,N., Alperovich,N., Yooseph,S.,
Lewis,M.R., Maruf,M., Hutchison,C.A. 3rd, Smith,H.O. and
Venter,J.C. (2006) Essential genes of a minimal bacterium. Proc.
Natl Acad. Sci. USA, 77, 1175–1181.
10. Sierro,N., Makita,Y., de Hoon,M. and Nakai,K. (2008) DBTBS:
a database of transcriptional regulation in Bacillus subtilis
containing upstream intergenic conservation information. Nucleic
Acids Res., 5, e8664.
11. Peil,L. (2009) Ribosome assembly factors in Escherichia coli,
Master Thesis. Tartu University.
12. Knox,C., Law,V., Jewison,T., Liu,P., Ly,S., Frolkis,A., Pon,A.,
Banco,K., Mak,C., Neveu,V. et al. (2011) DrugBank 3.0: a
comprehensive resource for ‘omics’ research on drugs. Nucleic
Acids Res., 14, D554–D556.
13. Gasteiger,E., Hoogland,C., Gattiker,A., Duvaud,S., Wilkins,M.R.,
Appel,R.D. and Bairoch,A. (2005) Protein identification and
analysis tools on the ExPASy server. In: Gasteiger,E.,
Hoogland,C., Gattiker,A., Duvaud,S., Wilkins,M.R., Appel,R.D.
and Bairoch,A. (eds), The Proteomics Protocols Handbook.
Humana Press, Totowa, NJ, pp. 571–607.
14. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and
Sayers,E.W. (2011) GenBank. Nucleic Acids Res., 39, D32–D37.
15. Kanehisa,M., Goto,S., Sato,Y., Furumichi,M. and Tanabe,M.
(2012) KEGG for integration and interpretation of large-scale
molecular datasets. Nucleic Acids Res., 40, D109–D114.
16. The UniProt Consortium. (2012) Reorganizing the protein space
at the Universal Protein Resource (UniProt). Nucleic Acids Res.,
40, D71–D75.
17. Sayers,E.W., Barrett,T., Benson,D.A., Bolton,E., Bryant,S.H.,
Canese,K., Chetvernin,V., Church,D.M., Dicuccio,M., Federhen,S.
et al. (2010) Database resources of the National Center for
Biotechnology Information. Nucleic Acids Res., 38, D5–D16.
18. Keseler,I.M., Collado-Vides,J., Santos-Zavaleta,A., Peralta-Gil,M.,
Gama-Castro,S., Muniz-Rascado,L., Bonavides-Martinez,C.,
Paley,S., Krummenacker,M., Altman,T. et al. (2011) EcoCyc: a
comprehensive database of Escherichia coli biology. Nucleic Acids
Res., 39, D583–D590.
19. Suthers,P.F., Dasika,M.S., Kumar,V.S., Denisov,G., Glass,J.I.
and Maranas,C.D. (2009) A genome-scale metabolic
reconstruction of Mycoplasma genitalium, iPS189. PLoS Comput.
Biol., 26, 4694–4708.
20. Feist,A.M., Henry,C.S., Reed,J.L., Krummenacker,M.,
Joyce,A.R., Karp,P.D., Broadbelt,L.J., Hatzimanikatis,V. and
Palsson,B.Ø. (2007) A genome-scale metabolic reconstruction
for Escherichia coli K-12 MG1655 that accounts for 1260
ORFs and thermodynamic information. Mol. Syst. Biol., 28,
15–33.
21. Scheer,M., Grote,A., Chang,A., Schomburg,I., Munaretto,C.,
Rother,M., Söhngen,C., Stelzer,M., Thiele,J. and Schomburg,D.
(2011) BRENDA, the enzyme information system in 2011.
Nucleic Acids Res., 39, D670–D676.
22. Wittig,U., Kania,R., Golebiewski,M., Rey,M., Shi,L., Jong,L.,
Algaa,E., Weidemann,A., Sauer-Danzwith,H., Mir,S. et al. (2012)
SABIO-RK—database for biochemical reaction kinetics. Nucleic
Acids Res., 40, D790–D796.
23. Sundararaj,S., Guo,A., Habibi-Nazhad,B., Rouani,M.,
Stothard,P., Ellison,M. and Wishart,D.S. (2004) The
CyberCell Database (CCDB): a comprehensive, self-updating,
relational database to coordinate and facilitate in
silico modeling of Escherichia coli. Nucleic Acids Res., 32,
D293–D295.
24. Bolton,E., Wang,Y., Thiessen,P.A. and Bryant,S.H. (2008)
PubChem: integrated platform of small molecules and biological
activities. In: Bolton,E., Wang,Y., Thiessen,P.A. and Bryant,S.H.
(eds), Annual Reports in Computational Chemistry. American
Chemical Society, Washington, DC, pp. 217–241.
25. Bennett,B.D., Kimball,E.H., Gao,M., Osterhout,R., Van Dien,S.J.
and Rabinowitz,J.D. (2009) Absolute metabolite concentrations
and implied enzyme active site occupancy in Escherichia coli. Nat.
Chem. Biol., 5, 593–599.
26. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J.
(1990) Basic local alignment search tool. J. Mol. Biol., 215,
403–410.
27. Karp,P.D., Paley,S.M., Krummenacker,M., Latendresse,M.,
Dale,J.M., Lee,T.J., Kaipa,P., Gilham,F., Spaulding,A.,
Popescu,L. et al. (2010) Pathway tools version 13.0: integrated
software for pathway/genome informatics and systems biology.
Brief. Bioinform., 11, 40–79.
28. Schellenberger,J., Park,J.O., Conrad,T.M. and Palsson,B.Ø. (2010)
BiGG: a biochemical genetic and genomic knowledgebase of large
scale metabolic reconstructions. BMC Bioinformatics, 11, 213.
6 Nucleic Acids Research, 2012
at Stanford University on December 5, 2012http://nar.oxfordjournals.org/Downloaded from
... Besides providing A3D results on solubility and stability, the A3D-MODB incorporates an additional wealth of information on membrane proteins' topology by incorporating TOPCONS ( 28 ) and the per-residue model confidence metric (pLDDT) provided by AlphaFold. Moreover, each A3D-MODB entry has been linked to UniProt Protein Knowledgebase (UniProt KB) ( 29 ) and organism-specific gold standard databases such as the Human Protein Atlas, EchoBASE, MGD , SGD , Wormbase, FlyBase, RGD, TAIR, ZFIN, Pom-Base, WholeCellKB-MG (30)(31)(32)(33)(34)(35)(36)(37)(38)(39). To our knowledge, A3D-MODB represents the most comprehensive resource available for structure-based aggregation predictions and is designed to significantly advance protein aggregation research at the proteomic level. ...
Article
Full-text available
Protein aggregation has been associated with aging and different pathologies and represents a bottleneck in the industrial production of biotherapeutics. Numerous past studies performed in Escherichia coli and other model organisms have allowed to dissect the biophysical principles underlying this process. This knowledge fuelled the development of computational tools, such as Aggrescan 3D (A3D) to forecast and re-design protein aggregation. Here, we present the A3D Model Organism Database (A3D-MODB) http://biocomp.chem.uw.edu.pl/A3D2/MODB, a comprehensive resource for the study of structural protein aggregation in the proteomes of 12 key model species spanning distant biological clades. In addition to A3D predictions, this resource incorporates information useful for contextualizing protein aggregation, including membrane protein topology and structural model confidence, as an indirect reporter of protein disorder. The database is openly accessible without any need for registration. We foresee A3D-MOBD evolving into a central hub for conducting comprehensive, multi-species analyses of protein aggregation, fostering the development of protein-based solutions for medical, biotechnological, agricultural and industrial applications.
... The capability to simulate the integrated function of every gene and molecule in a cell would assist clinicians in individualizing therapy [18,32,22] and enable computeraided designs in synthetic biology [33,29]. Their development would encourage the unification of our currently disconnected and heterogeneous biological datasets [6,24] and facilitate the discovery of emergent phenomena [46]. This worthy goal has been touted as a 'grand challenge' for 21 st century systems biology [47] and will require extensive interdisciplinary collaboration if we are to be successful [25]. ...
Preprint
Full-text available
Enzyme-catalysed reactions involve two distinct timescales. There is a short timescale on which enzymes bind to substrate molecules to produce bound complexes, and a comparatively long timescale on which the complex is transformed into a product. The rate at which the substrate is converted into product is characteristically non-linear and is traditionally derived by applying of singular perturbation theory to the system's governing equations. Central to this analysis is the assumption that complex formation is effectively instantaneous on the timescale over which significant substrate degradation occurs. This prevents many particle-based simulations of reaction-diffusion systems from accurately modelling enzyme kinetics since they rely on proximity based reaction conditions that do not correctly model the fast reactions associated with the complex on the long timescale. In this paper we derive a new proximity based reaction condition that correctly incorporates the reactions that occur on the short timescale for a specific enzymatic system. We present proof of concept particle-based simulations that demonstrate that non-linear reaction rates typical of enzyme kinetics can be reproduced without needing to explicitly simulate reactions on the short timescale.
... Machine learning can automatically rebuild knowledge bases for data sorting and cleaning [63,122,123]. Open-source tools can be used for data training [64,65,124,125]. (2) Submodel integration. ...
Article
Full-text available
Genome-scale metabolic models (GEMs) are effective tools for metabolic engineering and have been widely used to guide cell metabolic regulation. However, the single gene–protein-reaction data type in GEMs limits the understanding of biological complexity. As a result, multiscale models that add constraints or integrate omics data based on GEMs have been developed to more accurately predict phenotype from genotype. This review summarized the recent advances in the development of multiscale GEMs, including multiconstraint, multiomic, and whole-cell models, and outlined machine learning applications in GEM construction. This review focused on the frameworks, toolkits, and algorithms for constructing multiscale GEMs. The challenges and perspectives of multiscale GEM development are also discussed.
... Information about MG molecular content, abundance, and localization was obtained from the WC-MG model [27] and its related web-based databases WholeCellKB [47], WholeCellSimDB [48] and WholeCellViz [49]. WholeCellKB is an extensive database that collects detailed information about every species described in the MG-WC model, including the structure and function of each gene, protein, reaction and pathway. ...
Article
Building structural models of entire cells has been a long-standing cross-discipline challenge for the research community, as it requires an unprecedented level of integration between multiple sources of biological data and enhanced methods for computational modeling and visualization. Here, we present the first 3D structural models of an entire Mycoplasma genitalium (MG) cell, built using the CellPACK suite of computational modeling tools. Our model recapitulates the data described in recent whole-cell system biology simulations and provides a structural representation for all MG proteins, DNA and RNA molecules, obtained by combining experimental and homology-modeled structures and lattice-based models of the genome. We establish a framework for gathering, curating and evaluating these structures, exposing current weaknesses of modeling methods and the boundaries of MG structural knowledge, and visualization methods to explore functional characteristics of the genome and proteome. We compare two approaches for data gathering, a manually-curated workflow and an automated workflow that uses homologous structures, both of which are appropriate for the analysis of mesoscale properties such as crowding and volume occupancy. Analysis of model quality provides estimates of the regularization that will be required when these models are used as starting points for atomic molecular dynamics simulations.
... Indeed, the latest effort of the PDB to support integrative structures based on varied data from multiple methods (67) is narrowing the gap between the PDB and the whole-cell mapping. Other key community resources provide for standardization, archival, and dissemination of models, thus facilitating explicit and implicit collaboration among a diverse set of researchers (9, 13, 68–70). ...
Article
Significance Cells are the basic units of life, yet their architecture and function remain to be fully characterized. This work describes Bayesian metamodeling, a modeling approach that divides and conquers a large problem of modeling numerous aspects of the cell into computing a number of smaller models of different types, followed by assembling these models into a complete map of the cell. Metamodeling enables a facile collaboration of multiple research groups and communities, thus maximizing the sharing of expertise, resources, data, and models. A proof of principle is provided by a model of glucose-stimulated insulin secretion produced by the Pancreatic β-Cell Consortium.
Article
Mathematical modeling plays a vital role in mammalian synthetic biology by providing a framework to design and optimize circuits and engineered bioprocesses, predict their behavior, and guide experimental design. Here, we review recent models used in the literature, considering mathematical frameworks at the molecular, cellular, and system levels. We report key challenges in the field and discuss opportunities for genome-scale models, machine learning, and cybergenetics to expand the capabilities of model-driven mammalian cell biodesign.
Article
Whole-cell modeling is “the ultimate goal” of computational systems biology and “a grand challenge for 21st century” (Tomita, Trends in Biotechnology, 2001, 19(6), 205–10). These complex, highly detailed models account for the activity of every molecule in a cell and serve as comprehensive knowledgebases for the modeled system. Their scope and utility far surpass those of other systems models. In fact, whole-cell models (WCMs) are an amalgam of several types of “system” models. The models are simulated using a hybrid modeling method in which each biological process is simulated with the mathematical method appropriate to it. Given the complexity of the models, the process of developing and curating them is labor-intensive, and to date only a handful have been developed. While whole-cell models provide valuable biological insights, and to date have identified some novel biological phenomena, their most important contribution has been to highlight the discrepancy between available data and the observations that are used for the parametrization and validation of complex biological models. Another realization has been that current whole-cell modeling simulators are slow, and that models mimicking more complex (e.g., multi-cellular) biosystems need to be executed in an accelerated fashion on high-performance computing platforms. In this manuscript, we review the progress of whole-cell modeling to date and discuss some of the ways that they can be improved.
Article
With advances in high‐throughput, large‐scale in vivo measurement and genome modification techniques at the single‐nucleotide level, there is an increasing demand for the development of new technologies for the flexible design and control of cellular systems. Computer‐aided design is a powerful tool to design new cells. Whole‐cell modeling aims to integrate various cellular subsystems, determine their interactions and cooperative mechanisms, and predict comprehensive cellular behaviors by computational simulations on a genome‐wide scale. It has been applied to prokaryotes, yeasts, and higher eukaryotic cells, and utilized in a wide range of applications, including production of valuable substances, drug discovery, and controlled differentiation. Whole‐cell modeling, consisting of several thousand elements with diverse scales and properties, requires innovative model construction, simulation, and analysis techniques. Furthermore, whole‐cell modeling has been extended to multiple scales, including high‐resolution modeling at the single‐nucleotide and single‐amino acid levels and multicellular modeling of tissues and organs. This review presents an overview of the current state of whole‐cell modeling, discusses the novel computational and experimental technologies driving it, and introduces further developments toward multihierarchical modeling on a whole‐genome scale.
Article
Despite substantial potential to transform bioscience, medicine, and bioengineering, whole-cell models remain elusive. One of the biggest challenges to whole-cell models is assembling the large and diverse array of data needed to model an entire cell. Thanks to rapid advances in experimentation, much of the necessary data is becoming available. Furthermore, investigators are increasingly sharing their data due to growing recognition of the importance of research that is transparent and reproducible to others. However, the scattered organization of this data continues to hamper modeling. Toward more predictive models, we highlight the challenges to assembling the data needed for whole-cell modeling and outline how we can overcome these challenges by working together to build a central data warehouse.
Data
The mission of UniProt is to support biological research by providing a freely accessible, stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces. UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. A key development at UniProt is the provision of complete, reference and representative proteomes. UniProt is updated and distributed every 4 weeks and can be accessed online for searches or download at http://www.uniprot.org.
Article
Applying microarray technology, we have investigated the transcriptome of the small bacterium Mycoplasma pneumoniae grown at three different temperature conditions: 32°C, 37°C, and 32°C followed by a heat shock for 15 min at 43°C, before isolating the RNA. From 688 proposed open-reading frames, 676 were investigated and 564 were found to be expressed (P < 0.001; 606 with P < 0.01) and at least 33 (P < 0.001; 77 at P < 0.01) regulated. By quantitative real-time PCR of selected mRNA species, the expression data could be linked to absolute molecule numbers. We found M. pneumoniae to be regulated at the transcriptional level. Forty-seven genes were found to be significantly up-regulated after heat shock (P < 0.01). Among those were the conserved heat shock genes like dnaK, lonA and clpB, but also several genes coding for ribosomal proteins and 10 genes of unassigned functions. In addition, 30 genes were found to be down-regulated under the applied heat shock conditions. Furthermore, we have compared different methods of cDNA synthesis (random hexamer versus gene-specific primers, different RNA concentrations) and various normalization strategies of the raw microarray data.
Chapter
Protein identification and analysis software performs a central role in the investigation of proteins from two-dimensional (2-D) gels and mass spectrometry. For protein identification, the user matches certain empirically acquired information against a protein database to define a protein as already known or as novel. For protein analysis, information in protein databases can be used to predict certain properties about a protein, which can be useful for its empirical investigation. The two processes are thus complementary. Although there are numerous programs available for those applications, we have developed a set of original tools with a few main goals in mind. Specifically, these are: 1. To utilize the extensive annotation available in the Swiss-Prot database (1) wherever possible, in particular the position-specific annotation in the Swiss-Prot feature tables to take into account posttranslational modifications and protein processing. 2. To develop tools specifically, but not exclusively, applicable to proteins prepared by two-dimensional gel electrophoresis and peptide mass fingerprinting experiments. 3. To make all tools available on the World-Wide Web (WWW), and freely usable by the scientific community.
Article
PubChem is an open repository for experimental data identifying the biological activities of small molecules. PubChem contents include more than 1000 bioassays, 28 million bioassay test outcomes, 40 million substance contributed descriptions, and 19 million unique compound structures contributed from over 70 depositing organizations. PubChem provides a significant, publicly accessible platform for mining the biological information of small molecules.
Article
Understanding how complex phenotypes arise from individual molecules and their interactions is a primary challenge in biology that computational approaches are poised to tackle. We report a whole-cell computational model of the life cycle of the human pathogen Mycoplasma genitalium that includes all of its molecular components and their interactions. An integrative approach to modeling that combines diverse mathematics enabled the simultaneous inclusion of fundamentally different cellular processes and experimental measurements. Our whole-cell model accounts for all annotated gene functions and was validated against a broad range of data. The model provides insights into many previously unobserved cellular behaviors, including in vivo rates of protein-DNA association and an inverse relationship between the durations of DNA replication initiation and replication. In addition, experimental analysis directed by model predictions identified previously undetected kinetic parameters and biological functions. We conclude that comprehensive whole-cell models can be used to facilitate biological discovery.
Article
Studies have been carried out on the chemical composition and morphological subunits of Mycoplasma gallisepticum (avian PPLO 5969). The morphological units which were identified were cell membrane, double-stranded DNA, ribosomes and soluble protein. DNA and RNA base ratios were determined as was amino acid composition. An analysis of the lipid components was carried out. A very striking feature of these cells is the very small size and low total content of DNA.