WholeCellKB: Model organism databases for comprehensive whole-cell models

November 2012
Nucleic Acids Research 41(Database issue)

DOI:10.1093/nar/gks1108

License
CC BY-NC 3.0

Authors:

Jonathan Karr

Icahn School of Medicine at Mount Sinai

WholeCellKB: model organism databases for comprehensive whole-cell models

Jonathan R. Karr, Jayodita C. Sanghvi, Derek N. Macklin, Abhishek Arora and Markus W. Covert*

Graduate Program in Biophysics, Department of Bioengineering and Department of Electrical Engineering, Stanford University, 318 Campus Drive West, Stanford, CA 94305, USA

Received August 15, 2012; Revised October 1, 2012; Accepted October 19, 2012

ABSTRACT

Whole-cell models promise to greatly facilitate the analysis of complex biological behaviors. Whole-cell model development requires comprehensive model organism databases. WholeCellKB (http://wholecellkb.stanford.edu) is an open-source web-based software program for constructing model organism databases. WholeCellKB provides an extensive and fully customizable data model that fully describes individual species including the structure and function of each gene, protein, reaction and pathway. We used WholeCellKB to create WholeCellKB-MG, a comprehensive database of the Gram-positive bacterium Mycoplasma genitalium using over 900 sources. WholeCellKB-MG is extensively cross-referenced to existing resources including BioCyc, KEGG and UniProt. WholeCellKB-MG is freely accessible through a web-based user interface as well as through a RESTful web service.

INTRODUCTION

A primary challenge in computational biology is to predict how complex phenotypes such as growth and replication arise from networks of individual molecules. Whole-cell models promise to tackle this challenge by integrating heterogeneous molecular data into predictive computational models. This integration requires model organism databases which comprehensively provide readily computable molecular data.

WholeCellKB is an open-source, web-based software program for developing comprehensive model organism databases for whole-cell models. As illustrated in Figure 1, WholeCellKB enables whole-cell modeling by organizing diverse molecular data from primary research articles, reviews, books and databases into a single database. The WholeCellKB data model supports detailed descriptions of individual species including their genes, operons, proteins, macromolecular complexes, molecular interactions, chemical reactions and pathways. Importantly, WholeCellKB also facilitates extensive source documentation. We used WholeCellKB to develop WholeCellKB-MG, an extensive database of the pathogenic Gram-positive bacterium Mycoplasma genitalium.

Here, we describe WholeCellKB-MG’s content, curation, user interface and implementation. We also compare WholeCellKB-MG to existing resources, highlighting WholeCellKB-MG’s greater scope and granularity. Finally, we discuss our future plans for WholeCellKB.

CONTENT

Our goal was to create a database comprehensive enough to enable a whole-cell model (1). As illustrated in Figure 2, WholeCellKB-MG broadly represents M. genitalium molecular biology including (i) its subcellular organization; (ii) its chromosome sequence; (iii) the location, length, direction and essentiality of each gene; (iv) the organization and promoter of each transcription unit; (v) the expression and degradation rate of each RNA transcript; (vi) the specific folding and maturation pathway of each RNA and protein species including the localization, N-terminal cleavage, signal sequence, prosthetic groups, disulfide bonds and chaperone interactions of each protein species; (vii) the subunit composition of each macromolecular complex; (viii) its genetic code; (ix) the binding sites and footprint of every DNA-binding protein; (x) the structure, charge and hydrophobicity of every metabolite; (xi) the stoichiometry, catalysis, coenzymes, energetics and kinetics of every chemical reaction; (xii) the regulatory role of each transcription factor; (xiii) its chemical composition and (xiv) the composition of its laboratory growth medium. Table 1 summarizes WholeCellKB-MG’s size and content.

CURATION

We curated WholeCellKB-MG in five steps based on >900 primary research articles, reviews, books and databases. First, we curated the overall structure of M. genitalium including its size, shape, subcellular organization and chemical composition based on several experimental studies including Morowitz et al. (2). We also assembled the chemical composition of Mycoplasma laboratory growth medium based on analyses reported by Solabia (3).

*To whom correspondence should be addressed. Tel: +1 650 7256615; Fax: +1 650 7211409; Email: mcovert@stanford.edu

Nucleic Acids Research, 2012, 1–6. doi:10.1093/nar/gks1108. Advance Access published November 21, 2012.

© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.

Second, we curated the structure of the M. genitalium chromosome including its sequence, the location, length and direction of each gene and its transcription unit organization based on the Comprehensive Microbial Resource (CMR) annotation (4) and a recent study by Güell et al. (5). We reconstructed the location of each promoter and the expression, degradation rate and essentiality of each gene product from four recent studies (6–9). We catalogued DNA-binding sites and transcriptional regulatory interactions from several sources including DBTBS (10).

Third, we assembled the structure of each RNA and protein gene product. We compiled the post-transcriptional processing and modification of each RNA transcript from several sources including Peil (11). We reconstructed the signal sequence, localization, chaperone-mediated folding, post-translational modification, disulfide bonds, subunit composition and DNA footprint of each protein and macromolecular complex from a large number of primary research articles, computational models and databases. We assembled the chemical regulation of each gene product from several sources including DrugBank (12). We used ExPASy ProtParam (13) to calculate the pI, extinction coefficient, half-life, instability index, aliphatic index and grand average of hydropathy of every protein species.
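Two of the ProtParam quantities named above have simple closed forms, which makes them easy to sanity-check outside the ExPASy server. The sketch below is not the authors' code and not the ProtParam implementation. It assumes the standard Kyte-Doolittle hydropathy scale for the grand average of hydropathy (GRAVY) and Ikai's formula for the aliphatic index; the function names `gravy` and `aliphatic_index` are illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.

    Raises KeyError for non-standard residue letters and ValueError for an
    empty sequence.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence: str) -> float:
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percent of each residue in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))
```

Both functions take a one-letter amino acid sequence. A positive GRAVY value indicates a more hydrophobic protein on this scale.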

Fourth, we curated the specific chemical reactions catalyzed by each gene product starting from the CMR (4), GenBank (14), KEGG (15) and UniProt (16) genome annotations and the reconstructed RNA and protein maturation pathways. To maximize the scope of the database and to fill gaps in the genome annotation, we expanded each gene product’s annotation based on primary research articles we identified by searching PubMed (17) and Google Scholar (http://scholar.google.com). We consulted BioCyc (18), KEGG (15), two flux-balance analysis (FBA) models of bacterial metabolism (19,20) and hundreds of additional primary research articles to curate the stoichiometry of each chemical reaction. We assembled the thermodynamics and kinetics of each chemical reaction from several databases including BRENDA (21), SABIO-RK (22) and UniProt (16) and an FBA model (20).

Finally, we compiled the M. genitalium metabolome. We included all metabolites involved in the reconstructed reactions, biomass or growth medium. We curated the empirical formula, structure, charge and intracellular concentration of each metabolite from several databases including BioCyc (18), CyberCell (23) and PubChem (24) and a comprehensive mass-spectrometry study (25). We used ChemAxon Marvin (http://www.chemaxon.com/products/marvin) to calculate the molecular weight, van der Waals volume, pI, logD and logP of each metabolite.

In order to create a comprehensive description of M. genitalium physiology, we based WholeCellKB-MG on studies of closely related organisms where studies of M. genitalium were unavailable. In cases where multiple observations were available, we based the reconstruction on the most closely related organism. We used bi-directional best BLAST (26) to identify homologous genes. To provide model transparency, we tracked the species, experimental conditions and citation of each piece of evidence.
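Bi-directional best BLAST treats two genes as homologous when each is the other's top-scoring hit. A minimal sketch of that reciprocity test follows. It assumes the BLAST output has already been reduced to tables mapping (query, subject) pairs to bit scores. The helper names and gene identifiers are hypothetical, and any E-value or coverage cutoffs, which the text does not specify, are left out.

```python
def best_hits(scores):
    """Return {query: best-scoring subject} from {(query, subject): score}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit AND a is b's best hit."""
    forward = best_hits(a_vs_b)   # genome A genes searched against genome B
    reverse = best_hits(b_vs_a)   # genome B genes searched against genome A
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical identifiers and bit scores, for illustration only.
a_vs_b = {("a1", "b1"): 200.0, ("a1", "b2"): 50.0, ("a2", "b2"): 80.0}
b_vs_a = {("b1", "a1"): 190.0, ("b2", "a1"): 60.0, ("b2", "a2"): 55.0}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a2` is dropped because its best hit `b2` prefers `a1`, so only the pair (`a1`, `b1`) survives the two-way test.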

COMPARISON TO EXISTING RESOURCES

WholeCellKB represents the specific molecular interactions of individual species similar to previous databases such as BioCyc (18,27) and BiGG (28). In particular, WholeCellKB’s data model, user interface and species-specific content were heavily inspired by BioCyc.

Importantly, WholeCellKB-MG also has several major differences from existing resources. First, WholeCellKB-MG more broadly represents cell physiology. WholeCellKB-MG represents the molecular details of 28 cellular processes including well-studied processes such as metabolism as well as less well-understood processes such as DNA damage and repair and RNA and protein degradation. The online documentation at http://wholecellkb.stanford.edu/about provides further information about the WholeCellKB-MG data model and how WholeCellKB-MG represents each cellular process. Figure 3 compares WholeCellKB-MG’s content to that of several existing databases.

Figure 1. WholeCellKB-MG enables whole-cell modeling by integrating diverse data sources into a single database. (a) Currently, WholeCellKB-MG integrates >900 primary research articles, reviews, books and databases. (b) WholeCellKB-MG comprehensively represents all aspects of molecular physiology including metabolomics, genomics, transcriptomics and proteomics. (c) WholeCellKB-MG provides molecular data for whole-cell models.

Second, whole-cell modeling requires model organism databases which explicitly define the participants of each molecular interaction and chemical reaction. WholeCellKB-MG addresses this need by representing the specific molecules involved in every molecular interaction and by requiring structures for each molecule. For example, WholeCellKB-MG represents the specific RNA bases involved in every RNA methylation reaction, whereas existing resources lump RNA methylation interactions into a single generic reaction. WholeCellKB-MG represents every major cellular process including RNA processing and protein processing, modification and translocation with similarly fine molecular resolution.

Third, where available WholeCellKB-MG contains not only structural but also quantitative functional descriptions of each molecule and molecular interaction. For example, WholeCellKB-MG contains chemical reaction rate laws and kinetic parameters, RNA transcript expressions and half-lives, and cellular and growth medium chemical compositions. In total, WholeCellKB-MG represents 1836 heterogeneous model parameters. Table 2 summarizes how WholeCellKB represents these heterogeneous parameters using several types of database entries.

DATA INPUT

WholeCellKB provides administrators with two editing interfaces: (i) a web form to edit single entries and (ii) an Excel-based interface to simultaneously edit multiple entries. We believe that these two interfaces enable collaborative model organism database development.

In the beginning of our M. genitalium curation efforts, we primarily used the batch interface to quickly import large amounts of data from other genome annotations. We continued to use the batch interface throughout the project to import high-throughput molecular data. Later in our M. genitalium curation efforts, we primarily used the form interface to refine our annotation based on specific biochemical studies. Overall, we found that WholeCellKB improved the quality of our annotation and in particular encouraged us to thoroughly annotate the original source of each datum.

Figure 2. WholeCellKB aims to comprehensively describe cell physiology including the structure and dynamics of every metabolite, gene, RNA transcript and protein. Boxes illustrate several molecular properties represented by WholeCellKB.

Table 1. WholeCellKB-MG size

Entry type                                Number
Cellular state                            16
Chromosome feature                        2305
Compartment                               6
Gene                                      525
Metabolite                                722
Pathway                                   17
Process                                   28
Protein complex                           201
Protein monomer                           482
Reaction                                  1857
Transcription unit                        335
Transcriptional regulatory interaction    30

Data submitted to WholeCellKB was extensively validated to ensure consistency and correctness. For example, WholeCellKB checked that each chemical formula was valid, that each reaction was mass-balanced and that every molecule and kinetic parameter was defined in each reaction rate law. WholeCellKB provided hints on how to correct invalid data such as the atom imbalance of invalid reactions.
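The formula and mass-balance checks described above can be illustrated with a short sketch. This is not WholeCellKB's validation code. It assumes flat empirical formulas such as `C6H12O6` (no parentheses, charges or isotopes) and a reaction given as a mapping from metabolite identifier to signed stoichiometric coefficient, negative for reactants. All function names and identifiers are hypothetical.

```python
import re
from collections import Counter

# One element symbol (capital letter, optional lowercase) plus optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Parse a flat empirical formula into element counts.

    Only the syntax is checked; symbols are not verified against the
    periodic table.
    """
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:          # unparseable characters were skipped
            break
        counts[match.group(1)] += int(match.group(2) or 1)
        pos = match.end()
    if not formula or pos != len(formula):
        raise ValueError(f"invalid chemical formula: {formula!r}")
    return counts


def atom_imbalance(reaction, formulas):
    """Net atoms per element (products minus reactants); {} means balanced."""
    net = Counter()
    for metabolite, coefficient in reaction.items():
        for element, n in parse_formula(formulas[metabolite]).items():
            net[element] += coefficient * n
    return {element: n for element, n in net.items() if n != 0}


def imbalance_hint(imbalance):
    """Turn an imbalance into a short message a curator could act on."""
    if not imbalance:
        return "reaction is mass-balanced"
    parts = []
    for element in sorted(imbalance):
        n = imbalance[element]
        side = "product" if n > 0 else "reactant"
        parts.append(f"{abs(n)} excess {element} on the {side} side")
    return "; ".join(parts)
```

For instance, with formulas `{"H2": "H2", "O2": "O2", "H2O": "H2O"}`, the reaction `{"H2": -2, "O2": -1, "H2O": 2}` is balanced, while `{"H2": -1, "O2": -1, "H2O": 1}` leaves one oxygen atom unaccounted for on the reactant side.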

DATA ACCESS

WholeCellKB-MG is freely accessible through a simple and intuitive web-based interface at http://wholecellkb.stanford.edu. This web-based interface allows users to quickly browse, search and export the database. It also allows administrators to add, edit and delete entries. Importantly, the interface is extensively commented and hyperlinked, allowing users to easily find the primary source of each datum.

WholeCellKB-MG is also accessible through a RESTful interface. This interface provides the content of every HTML page in JSON and XML formats. We are currently using this interface to develop software for visualizing whole-cell simulations.
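Because every HTML page has a JSON and an XML counterpart, a client can in principle reuse the page URLs it already knows. The sketch below shows one way such a client might look, using only the Python standard library. The text does not say how a format is requested, so the `format` query parameter is an assumption made for illustration, and `with_format` and `fetch_json` are hypothetical helper names; the online documentation should be consulted for the actual convention.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def with_format(page_url, fmt):
    """Return page_url with a `format` query parameter set to fmt.

    ASSUMPTION: the service selects JSON or XML through a `format` query
    parameter. This is illustrative, not taken from the paper.
    """
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    parts = urlsplit(page_url)
    query = dict(parse_qsl(parts.query))
    query["format"] = fmt
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_json(page_url, timeout=30):
    """Download the JSON counterpart of an HTML page and decode it."""
    with urlopen(with_format(page_url, "json"), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Under the stated assumption, `with_format("http://wholecellkb.stanford.edu/about", "json")` yields the same address with `?format=json` appended, and `fetch_json` would return the decoded payload as ordinary Python dictionaries and lists.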

DEVELOPER API

WholeCellKB was designed to enable modelers to develop model organism databases for whole-cell models, including designing custom data models and user interfaces. WholeCellKB provides a framework for viewing, searching, exporting and editing database entries which developers can combine with custom data models and HTML templates. This allows developers to build custom model organism databases with minimal effort and without any knowledge of database design. Furthermore, because WholeCellKB is open source and implemented with Python, modelers can easily display scientific calculations alongside curated data in the user interface. The online documentation provides further instructions on how to customize WholeCellKB.

IMPLEMENTATION

WholeCellKB was implemented in Python using the Django (http://www.djangoproject.com) web framework and stored using the relational database MySQL (http://www.mysql.com). Full-text search was implemented using Haystack (http://haystacksearch.org) and Xapian (http://xapian.org). Excel, JSON and XML export were implemented using OpenPyXL (http://bitbucket.org/ericgazoni/openpyxl), simplejson (http://pypi.python.org/pypi/simplejson) and xml.dom (http://docs.python.org/library/xml.dom.html). WholeCellKB runs on the Apache (http://www.apache.org) web server using the mod_wsgi (http://code.google.com/p/modwsgi) module. All of the software used to implement WholeCellKB is available open source.

Figure 3. Detailed comparison of the content of WholeCellKB-MG and several existing biological databases. In addition to containing detailed descriptions of genetics, metabolism and transcriptional regulation comparable to existing resources such as BiGG (28), BioCyc (18) and CMR (4), WholeCellKB-MG has detailed representations of RNA degradation, RNA and protein maturation and protein translocation. Black boxes indicate physiology represented with fine granularity including the specific molecules involved in each specific interaction (e.g. specific metabolites involved in each metabolic reaction). Gray boxes indicate coarsely represented physiology, for example lumping families of similar reactions such as RNA methylation into a single database entry rather than representing the specific RNA bases involved in each individual reaction. White boxes indicate unrepresented physiology.

Table 2. WholeCellKB-MG parameters

Type                          Number
Cell composition              73
Media composition             83
Reaction K                    225
Reaction K                    483
Reaction Vmax                 434
RNA expression                525
RNA half-life                 525
Stimulus values               10
Transcriptional regulation    32
  Activity                    30
  Affinity                    2
Other                         154

SUMMARY AND FUTURE DIRECTIONS

WholeCellKB-MG is an extensive database of M. genitalium designed to facilitate whole-cell modeling. Currently, we are continuing to curate the database as well as starting to create equally comprehensive databases of other model microorganisms. Beyond facilitating realistic whole-cell models, we believe that these databases are useful platforms for experimental and computational biologists.

We created WholeCellKB-MG using WholeCellKB, an open-source, web-based software program which enables modelers to quickly develop model organism databases for whole-cell modeling.

Beyond continuing to curate model organisms, we also plan to continue to strengthen the WholeCellKB software. We plan to add additional tools for importing databases curated with other tools such as PathwayTools (27), storing the detailed history of each database entry and comparing model organism databases as well as expanding the search functionality of the RESTful API. As the whole-cell modeling community grows, in the future we also plan to enable open-editing similar to Wikipedia. Finally, we are currently using WholeCellKB’s RESTful API to develop tools for visualizing whole-cell simulations.

We hope that other researchers will use WholeCellKB to develop model organism databases and whole-cell models. We believe that WholeCellKB will not only speed up database curation and whole-cell model development but also encourage best annotation practices. Ultimately, we hope that WholeCellKB in combination with whole-cell models will accelerate biological discovery and bioengineering.

ACKNOWLEDGEMENTS

We thank Elsa Birch, Nick Ruggero and Ruby Lee for enlightening discussions on database design, curation, modeling and visualization.

FUNDING

NIH Director’s Pioneer Award [5DP1LM01150-05] and a Hellman Faculty Scholarship (to M.W.C.); NDSEG, NSF and Stanford Graduate Fellowships (to J.R.K.); NSF and Bio-X Graduate Student Fellowships (to J.C.S.) and a Stanford Graduate Fellowship (to D.N.M.). Funding for open access charge: NIH Director’s Pioneer Award [5DP1LM01150-05].

Conflict of interest statement. None declared.

REFERENCES

1. Karr,J.R., Sanghvi,J.C., Macklin,D.N., Jacobs,J.M., Gutschow,M.V., Bolival,B., Assad-Garcia,N., Glass,J.I. and Covert,M.W. (2012) A whole-cell computational model predicts phenotype from genotype. Cell, 150, 389–401.

2. Morowitz,H.J., Tourtellotte,M.E., Guild,W.R., Castro,E. and Woese,C. (1962) The chemical composition and submicroscopic morphology of Mycoplasma gallisepticum, Avian PPLO 5969. J. Mol. Biol., 4, 93–103.

3. Solabia. (2011) Biotechnology Products, Retrieved from http://www.solabia.com/ (14 March 2011, date last accessed).

4. Davidsen,T., Beck,E., Ganapathy,A., Montgomery,R., Zafar,N., Yang,Q., Madupu,R., Goetz,P., Galinsky,K., White,O. et al. (2010) The comprehensive microbial resource. Nucleic Acids Res., 38, D340–D345.

5. Güell,M., van Noort,V., Yus,E., Chen,W.H., Leigh-Bell,J., Michalodimitrakis,K., Yamada,T., Arumugam,M., Doerks,T., Kühner,S. et al. (2009) Transcriptome complexity in a genome-reduced bacterium. Science, 326, 1268–1271.

6. Weiner,J. 3rd, Herrmann,R. and Browning,G.F. (2000) Transcription in Mycoplasma pneumoniae. Nucleic Acids Res., 2, 241–249.

7. Weiner,J. 3rd, Zimmerman,C.U., Göhlmann,H.W. and Herrmann,R. (2003) Transcription profiles of the bacterium Mycoplasma pneumoniae grown at different temperatures. Nucleic Acids Res., 37, 6306–6320.

8. Bernstein,J.A., Khodursky,A.B., Lin,P.H., Lin-Chao,S. and Cohen,S.N. (2002) Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc. Natl Acad. Sci. USA, 22, 235–244.

9. Glass,J.I., Assad-Garcia,N., Alperovich,N., Yooseph,S., Lewis,M.R., Maruf,M., Hutchison,C.A. 3rd, Smith,H.O. and Venter,J.C. (2006) Essential genes of a minimal bacterium. Proc. Natl Acad. Sci. USA, 77, 1175–1181.

10. Sierro,N., Makita,Y., de Hoon,M. and Nakai,K. (2008) DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res., 5, e8664.

11. Peil,L. (2009) Ribosome assembly factors in Escherichia coli, Master Thesis. Tartu University.

12. Knox,C., Law,V., Jewison,T., Liu,P., Ly,S., Frolkis,A., Pon,A., Banco,K., Mak,C., Neveu,V. et al. (2011) DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res., 14, D554–D556.

13. Gasteiger,E., Hoogland,C., Gattiker,A., Duvaud,S., Wilkins,M.R., Appel,R.D. and Bairoch,A. (2005) Protein identification and analysis tools on the ExPASy server. In: Gasteiger,E., Hoogland,C., Gattiker,A., Duvaud,S., Wilkins,M.R., Appel,R.D. and Bairoch,A. (eds), The Proteomics Protocols Handbook. Humana Press, Totowa, NJ, pp. 571–607.

14. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Sayers,E.W. (2011) GenBank. Nucleic Acids Res., 39, D32–D37.

15. Kanehisa,M., Goto,S., Sato,Y., Furumichi,M. and Tanabe,M. (2012) KEGG for integration and interpretation of large-scale molecular datasets. Nucleic Acids Res., 40, D109–D114.

16. The UniProt Consortium. (2012) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40, D71–D75.

17. Sayers,E.W., Barrett,T., Benson,D.A., Bolton,E., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., Dicuccio,M., Federhen,S. et al. (2010) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 38, D5–D16.

18. Keseler,I.M., Collado-Vides,J., Santos-Zavaleta,A., Peralta-Gil,M., Gama-Castro,S., Muniz-Rascado,L., Bonavides-Martinez,C., Paley,S., Krummenacker,M., Altman,T. et al. (2011) EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res., 39, D583–D590.

19. Suthers,P.F., Dasika,M.S., Kumar,V.S., Denisov,G., Glass,J.I. and Maranas,C.D. (2009) A genome-scale metabolic reconstruction of Mycoplasma genitalium, iPS189. PLoS Comput. Biol., 26, 4694–4708.

20. Feist,A.M., Henry,C.S., Reed,J.L., Krummenacker,M., Joyce,A.R., Karp,P.D., Broadbelt,L.J., Hatzimanikatis,V. and Palsson,B.Ø. (2007) A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Mol. Syst. Biol., 28, 15–33.

21. Scheer,M., Grote,A., Chang,A., Schomburg,I., Munaretto,C., Rother,M., Söhngen,C., Stelzer,M., Thiele,J. and Schomburg,D. (2011) BRENDA, the enzyme information system in 2011. Nucleic Acids Res., 39, D670–D676.

22. Wittig,U., Kania,R., Golebiewski,M., Rey,M., Shi,L., Jong,L., Algaa,E., Weidemann,A., Sauer-Danzwith,H., Mir,S. et al. (2012) SABIO-RK – database for biochemical reaction kinetics. Nucleic Acids Res., 40, D790–D796.

23. Sundararaj,S., Guo,A., Habibi-Nazhad,B., Rouani,M., Stothard,P., Ellison,M. and Wishart,D.S. (2004) The CyberCell Database (CCDB): a comprehensive, self-updating, relational database to coordinate and facilitate in silico modeling of Escherichia coli. Nucleic Acids Res., 32, D293–D295.

24. Bolton,E., Wang,Y., Thiessen,P.A. and Bryant,S.H. (2008) PubChem: integrated platform of small molecules and biological activities. In: Bolton,E., Wang,Y., Thiessen,P.A. and Bryant,S.H. (eds), Annual Reports in Computational Chemistry. American Chemical Society, Washington, DC, pp. 217–241.

25. Bennett,B.D., Kimball,E.H., Gao,M., Osterhout,R., Van Dien,S.J. and Rabinowitz,J.D. (2009) Absolute metabolite concentrations and implied enzyme active site occupancy in Escherichia coli. Nat. Chem. Biol., 5, 593–599.

26. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

27. Karp,P.D., Paley,S.M., Krummenacker,M., Latendresse,M., Dale,J.M., Lee,T.J., Kaipa,P., Gilham,F., Spaulding,A., Popescu,L. et al. (2010) Pathway tools version 13.0: integrated software for pathway/genome informatics and systems biology. Brief. Bioinform., 11, 40–79.

28. Schellenberger,J., Park,J.O., Conrad,T.M. and Palsson,B.Ø. (2010) BiGG: a biochemical genetic and genomic knowledgebase of large scale metabolic reconstructions. BMC Bioinformatics, 11, 213.

6 Nucleic Acids Research, 2012

at Stanford University on December 5, 2012http://nar.oxfordjournals.org/Downloaded from

A3D Model Organism Database (A3D-MODB): a database for proteome aggregation predictions in model organisms

Article

Full-text available

Oct 2023

Protein aggregation has been associated with aging and different pathologies and represents a bottleneck in the industrial production of biotherapeutics. Numerous past studies performed in Escherichia coli and other model organisms have allowed to dissect the biophysical principles underlying this process. This knowledge fuelled the development of computational tools, such as Aggrescan 3D (A3D) to forecast and re-design protein aggregation. Here, we present the A3D Model Organism Database (A3D-MODB) http://biocomp.chem.uw.edu.pl/A3D2/MODB, a comprehensive resource for the study of structural protein aggregation in the proteomes of 12 key model species spanning distant biological clades. In addition to A3D predictions, this resource incorporates information useful for contextualizing protein aggregation, including membrane protein topology and structural model confidence, as an indirect reporter of protein disorder. The database is openly accessible without any need for registration. We foresee A3D-MOBD evolving into a central hub for conducting comprehensive, multi-species analyses of protein aggregation, fostering the development of protein-based solutions for medical, biotechnological, agricultural and industrial applications.

Enzyme kinetics simulation at the scale of individual particles

Preprint

Full-text available

Feb 2023

Enzyme-catalysed reactions involve two distinct timescales. There is a short timescale on which enzymes bind to substrate molecules to produce bound complexes, and a comparatively long timescale on which the complex is transformed into a product. The rate at which the substrate is converted into product is characteristically non-linear and is traditionally derived by applying of singular perturbation theory to the system's governing equations. Central to this analysis is the assumption that complex formation is effectively instantaneous on the timescale over which significant substrate degradation occurs. This prevents many particle-based simulations of reaction-diffusion systems from accurately modelling enzyme kinetics since they rely on proximity based reaction conditions that do not correctly model the fast reactions associated with the complex on the long timescale. In this paper we derive a new proximity based reaction condition that correctly incorporates the reactions that occur on the short timescale for a specific enzymatic system. We present proof of concept particle-based simulations that demonstrate that non-linear reaction rates typical of enzyme kinetics can be reproduced without needing to explicitly simulate reactions on the short timescale.

Construction of Multiscale Genome-Scale Metabolic Models: Frameworks and Challenges

Article

Full-text available

May 2022

Genome-scale metabolic models (GEMs) are effective tools for metabolic engineering and have been widely used to guide cell metabolic regulation. However, the single gene–protein-reaction data type in GEMs limits the understanding of biological complexity. As a result, multiscale models that add constraints or integrate omics data based on GEMs have been developed to more accurately predict phenotype from genotype. This review summarized the recent advances in the development of multiscale GEMs, including multiconstraint, multiomic, and whole-cell models, and outlined machine learning applications in GEM construction. This review focused on the frameworks, toolkits, and algorithms for constructing multiscale GEMs. The challenges and perspectives of multiscale GEM development are also discussed.

Building Structural Models of a Whole Mycoplasma Cell

Article

Nov 2021
J MOL BIOL

Building structural models of entire cells has been a long-standing cross-discipline challenge for the research community, as it requires an unprecedented level of integration between multiple sources of biological data and enhanced methods for computational modeling and visualization. Here, we present the first 3D structural models of an entire Mycoplasma genitalium (MG) cell, built using the CellPACK suite of computational modeling tools. Our model recapitulates the data described in recent whole-cell system biology simulations and provides a structural representation for all MG proteins, DNA and RNA molecules, obtained by combining experimental and homology-modeled structures and lattice-based models of the genome. We establish a framework for gathering, curating and evaluating these structures, exposing current weaknesses of modeling methods and the boundaries of MG structural knowledge, and visualization methods to explore functional characteristics of the genome and proteome. We compare two approaches for data gathering, a manually-curated workflow and an automated workflow that uses homologous structures, both of which are appropriate for the analysis of mesoscale properties such as crowding and volume occupancy. Analysis of model quality provides estimates of the regularization that will be required when these models are used as starting points for atomic molecular dynamics simulations.

Bayesian metamodeling of complex biological systems across varying representations

Article

Aug 2021

Significance: Cells are the basic units of life, yet their architecture and function remain to be fully characterized. This work describes Bayesian metamodeling, a modeling approach that divides and conquers a large problem of modeling numerous aspects of the cell into computing a number of smaller models of different types, followed by assembling these models into a complete map of the cell. Metamodeling enables a facile collaboration of multiple research groups and communities, thus maximizing the sharing of expertise, resources, data, and models. A proof of principle is provided by a model of glucose-stimulated insulin secretion produced by the Pancreatic β-Cell Consortium.

Mechanistic Model-Driven Biodesign in Mammalian Synthetic Biology

Article

Mar 2024

Mathematical modeling plays a vital role in mammalian synthetic biology by providing a framework to design and optimize circuits and engineered bioprocesses, predict their behavior, and guide experimental design. Here, we review recent models used in the literature, considering mathematical frameworks at the molecular, cellular, and system levels. We report key challenges in the field and discuss opportunities for genome-scale models, machine learning, and cybergenetics to expand the capabilities of model-driven mammalian cell biodesign.

Multi-scale models of whole cells: progress and challenges

Article

Nov 2023

Whole-cell modeling is “the ultimate goal” of computational systems biology and “a grand challenge for 21st century” (Tomita, Trends in Biotechnology, 2001, 19(6), 205–10). These complex, highly detailed models account for the activity of every molecule in a cell and serve as comprehensive knowledgebases for the modeled system. Their scope and utility far surpass those of other systems models. In fact, whole-cell models (WCMs) are an amalgam of several types of “system” models. The models are simulated using a hybrid method in which each biological process is simulated with the mathematical method appropriate to it. Given their complexity, developing and curating these models is labor-intensive, and to date only a handful have been developed. While whole-cell models provide valuable biological insights and have identified some novel biological phenomena, their most important contribution has been to highlight the discrepancy between available data and the observations used to parametrize and validate complex biological models. Another realization has been that current whole-cell simulators are slow; running models of more complex (e.g., multi-cellular) biosystems will require accelerated execution on high-performance computing platforms. In this manuscript, we review the progress of whole-cell modeling to date and discuss some of the ways in which these models can be improved.

The Ontology of Physics for Biology: Semantic Modeling of Multiscale, Multidomain Physiological Systems

Book

Nov 2023

Technologies for whole‐cell modeling: Genome‐wide reconstruction of a cell in silico

Article

Oct 2023

With advances in high‐throughput, large‐scale in vivo measurement and genome modification techniques at the single‐nucleotide level, there is an increasing demand for the development of new technologies for the flexible design and control of cellular systems. Computer‐aided design is a powerful tool to design new cells. Whole‐cell modeling aims to integrate various cellular subsystems, determine their interactions and cooperative mechanisms, and predict comprehensive cellular behaviors by computational simulations on a genome‐wide scale. It has been applied to prokaryotes, yeasts, and higher eukaryotic cells, and utilized in a wide range of applications, including production of valuable substances, drug discovery, and controlled differentiation. Whole‐cell modeling, consisting of several thousand elements with diverse scales and properties, requires innovative model construction, simulation, and analysis techniques. Furthermore, whole‐cell modeling has been extended to multiple scales, including high‐resolution modeling at the single‐nucleotide and single‐amino acid levels and multicellular modeling of tissues and organs. This review presents an overview of the current state of whole‐cell modeling, discusses the novel computational and experimental technologies driving it, and introduces further developments toward multihierarchical modeling on a whole‐genome scale.

Centralizing data to unlock whole-cell models

Article

Jun 2021

Despite substantial potential to transform bioscience, medicine, and bioengineering, whole-cell models remain elusive. One of the biggest challenges to whole-cell models is assembling the large and diverse array of data needed to model an entire cell. Thanks to rapid advances in experimentation, much of the necessary data is becoming available. Furthermore, investigators are increasingly sharing their data due to growing recognition of the importance of research that is transparent and reproducible to others. However, the scattered organization of this data continues to hamper modeling. Toward more predictive models, we highlight the challenges to assembling the data needed for whole-cell modeling and outline how we can overcome these challenges by working together to build a central data warehouse.

Basic Local Alignment Search Tool

Article

Oct 1990

Stephen F Altschul

Reorganizing the protein space at the Universal Protein Resource (UniProt). The UniProt Consortium. Nucleic Acids Res. 2011;40(D1):D71–D75. PMC3245120; PMID 22102590.

Data

Nov 2012

Philippe Le Mercier

The mission of UniProt is to support biological research by providing a freely accessible, stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces. UniProt comprises four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. A key development at UniProt is the provision of complete, reference and representative proteomes. UniProt is updated and distributed every 4 weeks and can be accessed online for searches or download at http://www.uniprot.org.

Transcription profiles of the bacterium Mycoplasma pneumoniae grown at different temperatures

Article

Nov 2003
NUCLEIC ACIDS RES

Applying microarray technology, we have investigated the transcriptome of the small bacterium Mycoplasma pneumoniae grown at three different temperature conditions: 32, 37 and 32 degrees C followed by a heat shock for 15 min at 43 degrees C, before isolating the RNA. From 688 proposed open-reading frames, 676 were investigated and 564 were found to be expressed (P < 0.001; 606 with P < 0.01) and at least 33 (P < 0.001; 77 at P < 0.01) regulated. By quantitative real-time PCR of selected mRNA species, the expression data could be linked to absolute molecule numbers. We found M. pneumoniae to be regulated at the transcriptional level. Forty-seven genes were found to be significantly up-regulated after heat shock (P < 0.01). Among those were the conserved heat shock genes like dnaK, lonA and clpB, but also several genes coding for ribosomal proteins and 10 genes of unassigned functions. In addition, 30 genes were found to be down-regulated under the applied heat shock conditions. Furthermore, we have compared different methods of cDNA synthesis (random hexamer versus gene-specific primers, different RNA concentrations) and various normalization strategies of the raw microarray data.

Protein Identification and Analysis Tool on the ExPASy Server

Chapter

Sep 2007

Protein identification and analysis software performs a central role in the investigation of proteins from two-dimensional (2-D) gels and mass spectrometry. For protein identification, the user matches certain empirically acquired information against a protein database to define a protein as already known or as novel. For protein analysis, information in protein databases can be used to predict certain properties about a protein, which can be useful for its empirical investigation. The two processes are thus complementary. Although there are numerous programs available for those applications, we have developed a set of original tools with a few main goals in mind. Specifically, these are:
1. To utilize the extensive annotation available in the Swiss-Prot database (1) wherever possible, in particular the position-specific annotation in the Swiss-Prot feature tables to take into account posttranslational modifications and protein processing.
2. To develop tools specifically, but not exclusively, applicable to proteins prepared by two-dimensional gel electrophoresis and peptide mass fingerprinting experiments.
3. To make all tools available on the World-Wide Web (WWW), and freely usable by the scientific community.

In The Proteomics Protocols Handbook

Book

Jan 2005

A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information

Article

Jan 2007
MOL SYST BIOL

Transcription profiles of the bacterium Mycoplasma pneumoniae grown at different temperatures

Article

Nov 2003
NUCLEIC ACIDS RES

J. Weiner III

Applying microarray technology, we have investigated the transcriptome of the small bacterium Mycoplasma pneumoniae grown at three different temperature conditions: 32, 37 and 32°C followed by a heat shock for 15 min at 43°C, before isolating the RNA. From 688 proposed open‐reading frames, 676 were investigated and 564 were found to be expressed (P < 0.001; 606 with P < 0.01) and at least 33 (P < 0.001; 77 at P < 0.01) regulated. By quantitative real‐time PCR of selected mRNA species, the expression data could be linked to absolute molecule numbers. We found M. pneumoniae to be regulated at the transcriptional level. Forty‐seven genes were found to be significantly up‐regulated after heat shock (P < 0.01). Among those were the conserved heat shock genes like dnaK, lonA and clpB, but also several genes coding for ribosomal proteins and 10 genes of unassigned functions. In addition, 30 genes were found to be down‐regulated under the applied heat shock conditions. Furthermore, we have compared different methods of cDNA synthesis (random hexamer versus gene‐specific primers, different RNA concentrations) and various normalization strategies of the raw microarray data.

Chapter 12 PubChem: Integrated Platform of Small Molecules and Biological Activities

Article

Dec 2008

PubChem is an open repository for experimental data identifying the biological activities of small molecules. PubChem contains more than 1000 bioassays, 28 million bioassay test outcomes, 40 million contributed substance descriptions, and 19 million unique compound structures, contributed by over 70 depositing organizations. PubChem provides a significant, publicly accessible platform for mining the biological information of small molecules.

A Whole-Cell Computational Model Predicts Phenotype from Genotype

Article

Jul 2012

Understanding how complex phenotypes arise from individual molecules and their interactions is a primary challenge in biology that computational approaches are poised to tackle. We report a whole-cell computational model of the life cycle of the human pathogen Mycoplasma genitalium that includes all of its molecular components and their interactions. An integrative approach to modeling that combines diverse mathematics enabled the simultaneous inclusion of fundamentally different cellular processes and experimental measurements. Our whole-cell model accounts for all annotated gene functions and was validated against a broad range of data. The model provides insights into many previously unobserved cellular behaviors, including in vivo rates of protein-DNA association and an inverse relationship between the durations of DNA replication initiation and replication. In addition, experimental analysis directed by model predictions identified previously undetected kinetic parameters and biological functions. We conclude that comprehensive whole-cell models can be used to facilitate biological discovery.

The chemical composition and submicroscopic morphology of Mycoplasma gallisepticum, Avian PPLO A5969

Article

Mar 1962

Studies have been carried out on the chemical composition and morphological subunits of Mycoplasma gallisepticum (avian PPLO 5969). The morphological units which were identified were cell membrane, double-stranded DNA, ribosomes and soluble protein. DNA and RNA base ratios were determined as was amino acid composition. An analysis of the lipid components was carried out. A very striking feature of these cells is the very small size and low total content of DNA.
