Chromosomal origin of replication coordinates logically distinct types of bacterial genetic regulation

For a long time it has been hypothesized that bacterial gene regulation involves an intricate interplay of the transcriptional regulatory network (TRN) and the spatial organization of genes in the chromosome. Here we explore this hypothesis both on a structural and on a functional level. On the structural level, we study the TRN as a spatially embedded network. On the functional level, we analyze gene expression patterns from a network perspective ("digital control"), as well as from the perspective of the spatial organization of the chromosome ("analog control"). Our structural analysis reveals the outstanding relevance of the symmetry axis defined by the origin (Ori) and terminus (Ter) of replication for the network embedding and, thus, suggests the co-evolution of two regulatory infrastructures, namely the transcriptional regulatory network and the spatial arrangement of genes on the chromosome, to optimize the cross-talk between two fundamental biological processes: genomic expression and replication. This observation is confirmed by the functional analysis based on the differential gene expression patterns of more than 4000 pairs of microarray and RNA-Seq datasets for E. coli from the Colombos Database using complex network and machine learning methods. This large-scale analysis supports the notion that two logically distinct types of genetic control are cooperating to regulate gene expression in a complementary manner. Moreover, we find that the position of the gene relative to the Ori is a feature of very high predictive value for gene expression, indicating that the Ori-Ter symmetry axis coordinates the action of distinct genetic control mechanisms. npj Systems Biology and Applications (2020) 6:5 ; https://doi.
Kosmas Kosmidis (1,2), Kim Philipp Jablonski (3,5), Georgi Muskhelishvili (4) and Marc-Thorsten Hütt (3)
npj Systems Biology and Applications (2020) 6:5 ; https://doi.org/10.1038/s41540-020-0124-1
INTRODUCTION
In spite of the tremendous progress made in Systems Biology [1–3] and the construction of computational
models of biological cells [4,5], we still lack the appropriate understanding of the underlying principles
of genetic regulation to predict, for example, the gene expression pattern of a bacterium. Since the
beginning of Systems Biology, the investigation of bacterial gene regulation has been an important source
of hypotheses about the principles of biological regulation [6–9]. The transcriptional regulatory network
(TRN) of the classical model organism Escherichia coli has been the subject of a vast number of statistical
analyses. In fact, this network was the first example of a complex network for which a non-random network
motif distribution (deviations from randomness of the counts of small subgraphs) was reported [6,10]. In
spite of its prominence and the diversity of investigations, this network has been mostly studied in
isolation. It is becoming ever clearer that beyond network topology itself, the spatial embedding of
complex networks provides an important additional layer of information for understanding a network's
function [11,12]. While this aspect has been explored in transportation networks [13,14], brain networks
[15,16] and a wide range of other natural and technical systems [17,18], it has not been studied in much
detail in the gene regulatory system. In particular, only a few aspects of the spatial embedding of the
E. coli TRN have been studied before, e.g., the spatial (i.e., chromosomal) distribution of genes with and
without a reported link in the TRN [19,20] and the orientation of genes on the genome [21].
It is also intuitive (and in fact a prominent research trend of recent years; see, e.g., [12]) that
spatially embedded networks need to be analyzed with a different set of tools than graphs without such a
spatial embedding. For example, the concept of a dimension, which has rarely been discussed in complex
network theory, was found to be an important property of spatially embedded networks [22]. In a spatially
embedded network, long-ranging links typically have a different systemic purpose than short-ranging links.
For example, in social networks most people have their friends in their neighborhood, and the arrangement
of connections in power grids and transportation networks obviously depends on the distance between the
connected units. Considering the network of passenger flights, it is systemically plausible that such
links occur only above a certain spatial distance. The transcriptional regulatory network is a somewhat
non-standard example of a spatially embedded network, as the "space", i.e., the 3D organization of the
circular chromosome, is not immediately obvious. We explore the hypothesis that bacterial gene regulation
is organized as an interplay of two distinct types of control: one exerted by the TRN ("digital control")
and one arising from the spatial organization of the chromosome ("analog control"). This hypothesis has
been formulated [23,24], put into a broader context [25,26] and supported by statistical analyses [27–29]
in a range of studies over the last decade, but has yet to be confirmed as a consistent organizational
principle across all layers of quantitatively assessable information.
Here we first address the interplay of digital and analog control on a purely structural level, by
analyzing the chromosomal embedding of the TRN of E. coli. Next, we extend this investigation to a
functional level by employing a method proposed in ref. [27] to quantify the strengths of the two control
types, using data from the COLOMBOS database [30], and perform a large-scale study of the interplay
between digital and analog control.
(1) Division of Theoretical Physics, Physics Department, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece.
(2) PharmaInformatics Unit, Research Center ATHENA, Athens, Greece.
(3) Department of Life Sciences and Chemistry, Jacobs University, Bremen, Germany.
(4) Department of Biology, Agricultural University of Georgia, Tbilisi, Georgia.
(5) Present address: Department of Biosystems Science and Engineering, ETH Zürich, Zürich, Switzerland.
email: m.huett@jacobs-university.de
www.nature.com/npjsba
Published in partnership with the Systems Biology Institute
COLOMBOS is a collection of expression data from published microarray and RNA-Seq experiments performed
in E. coli (and several other prokaryotes). COLOMBOS combines expression data analyses across different
research papers, labs, and platforms. A key idea in COLOMBOS is to compare relative expression values to
a reference state as sets of "condition contrasts". This should correct for platform-dependent
differences between studies. The expression data contained within the database are also linked to a
manually curated, standardized condition annotation and ontology. Figures 1 and 2 summarize the design of
our study.
By analyzing (i) the distribution statistics of links in the transcriptional regulatory network, (ii) the
agreement of gene expression patterns with both the TRN and the distribution of genes in the genome, and
(iii) the "learnability" of gene expression patterns by a decision tree employing various chromosomal and
regulatory features, we have been able to establish two main components of the logic underlying bacterial
gene expression: (1) The Ori-Ter axis is a relevant organizer of gene expression; (2) Chromosomal
structure ("analog control") and transcription factors ("digital control") contribute to regulation in an
"either-or" fashion, with one level of control buffering the other.
RESULTS
Structural evidence for the interplay of digital and analog control and the relevance of the Ori-Ter axis
We start by employing methods from statistical physics of
complex networks, in order to identify the non-random features
characterizing the chromosomal embedding of the transcriptional
regulatory network.
A key assumption of our investigation is that in spite of the complex and spatiotemporally variable 3D
organization of the E. coli chromosome, the linear organization given by the positional order of genes
along the chromosome is a relevant coordinate system for investigating a non-random spatial embedding of
the TRN. This assumption is strongly supported by the study of distributions of genes with and without
TRN participation [20], the statistical analysis of gene expression patterns along the chromosomal
coordinates [27,31] and the phenotypic changes contingent on positional shifts of genes encoding the
transcription factors in the chromosome [32,33].
Perhaps unsurprisingly, application of standard tools for the analysis of spatially embedded networks
does not reveal striking non-random features of the chromosomal embedding of the TRN and only provides
weak evidence for the co-evolution of these two biological structures [34] (see Supplementary Figs 1, 2).
As a consequence, we develop and apply a set of methods tailor-made for the biological system analyzed
here, thus addressing the core question: Is the transcriptional regulatory network systematically shaped
by the chromosomal embedding? Given the known relevance for genetic regulation of the Ori-Ter axis, which
divides the circular chromosome into the right and left arms [31], we can rephrase the question: Is the
chromosomal embedding of the transcriptional regulatory network particular with respect to the Ori-Ter
axis? Our analysis strategy in the following is to compute network quantities under rotation of this axis
and see whether the true axis stands out. The method, termed EDURA (edge distribution under rotation of
an axis; see Fig. 2), is based on the principles described below.
Given the position of an axis, we distinguish between six categories of edges, namely edges on the right
chromosomal arm pointing away from the origin of replication Ori, $r^{+}$, or towards the Ori, $r^{-}$,
and the same on the left arm, $l^{+}$ and $l^{-}$, respectively, as well as edges across the two
chromosomal arms, from right to left, $rl$, and from left to right, $lr$. Figure 2 illustrates these
categories using a small sample graph. The labels "left" and "right" are understood looking from Ori to
Ter. A schematic representation of the EDURA method is given in Supplementary Fig. 3.
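The six edge categories can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: genes are reduced to integer coordinates on a circle of length `genome_len`, the axis is assumed to cut the circle into two halves of equal length, and the half reached by increasing coordinates from the assumed Ori is arbitrarily called the right arm. The function names are hypothetical.

```python
from collections import Counter

CATEGORIES = ("r+", "r-", "l+", "l-", "rl", "lr")

def arm_and_distance(pos, ori, genome_len):
    """Return (arm, distance from Ori) for a position on a circular genome.

    Assumption: the arm reached by increasing coordinates from Ori is 'r',
    and the opposite end of the axis lies exactly half a genome away.
    """
    offset = (pos - ori) % genome_len
    if offset <= genome_len / 2:
        return "r", offset
    return "l", genome_len - offset

def edge_category(src, tgt, ori, genome_len):
    """Classify a directed edge src -> tgt relative to an axis through ori."""
    src_arm, src_dist = arm_and_distance(src, ori, genome_len)
    tgt_arm, tgt_dist = arm_and_distance(tgt, ori, genome_len)
    if src_arm != tgt_arm:
        return src_arm + tgt_arm  # 'rl' or 'lr': edge across the arms
    # '+' points away from Ori, '-' towards it (ties are counted as '-').
    return src_arm + ("+" if tgt_dist > src_dist else "-")

def category_counts(edges, ori, genome_len):
    """Count all six categories for one assumed axis position."""
    counts = Counter(edge_category(s, t, ori, genome_len) for s, t in edges)
    return {c: counts.get(c, 0) for c in CATEGORIES}
```

Rotating the axis then amounts to re-evaluating `category_counts` for a series of `ori` values, for example `[category_counts(edges, o, genome_len) for o in range(0, genome_len, step)]` with a step size of one's choosing.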
It is clear that, when a different axis is chosen, these edge categories change. The counts $n(r^{+})$,
etc. can now be evaluated for each position of an axis. Under rotation of the axis an edge will undergo a
systematic sequence of category transitions. Starting from $lr$ as an example, a typical sequence under
axis rotation will be $lr \to r^{+} \to rl \to l^{-}$. As a consequence, counts of link types are highly
correlated and strongly dependent on gene density and node degree. Figure 3 shows the counts $n(r^{+})$,
etc. as a function of the axis position for the real chromosomally embedded E. coli TRN. The highly
volatile nature of these counts, as well as the strong influence of gene density, systematic transitions
(leading to high correlations among the curves) and contribution of hubs (e.g., dramatic changes in the
curves due to many edges changing category at the same time) are clearly visible.
Fig. 1 Summary of our investigation and overview of the workflow. On the structural level (obtained from
RegulonDB [55]), the spatial embedding of the TRN within the circular chromosome is evaluated via the
EDURA (Edge Distribution Under Rotation of an Axis) method. The functional level is contributed by the
COLOMBOS database [30] and analyzed jointly with the structural information using the concepts of digital
and analog control strengths [27], as well as decision trees.
Fig. 2 Illustration of the edge categories used in the subsequent analysis. The light blue circle
represents the circular chromosome, while the dots represent genes (red: right chromosomal arm; blue:
left chromosomal arm). Directed edges indicate interactions between genes. The dashed blue line denotes
the axis used for the assignment of edge categories (with the longer end representing Ori).
In order to remove these direct effects on the edge categories, we have to (1) consider asymmetries,
rather than absolute counts, of the edge categories and (2) subtract the average signal observed in a
randomized graph (see Methods section). The ± asymmetry, indicating a mismatch between edges going down
and going up for one chromosomal arm, can be defined as
\[ A^{r}_{\pm} = \frac{n(r^{+}) - n(r^{-})}{n(r^{+}) + n(r^{-}) + n(rl)}, \tag{1} \]
and accordingly for $A^{l}_{\pm}$. The cross-along asymmetry, measuring the asymmetry between edges along
the chromosomal arms and across them, is defined as
\[ A_{\leftrightarrow\updownarrow} = \frac{n(lr) + n(rl) - n(l^{+}) - n(l^{-}) - n(r^{+}) - n(r^{-})}{n(lr) + n(rl) + n(l^{+}) + n(l^{-}) + n(r^{+}) + n(r^{-})}, \tag{2} \]
which is the number of edges across the arms minus the number of edges along the arms, normalized by the
total number of edges. Without any clear systematics with respect to a given axis, the asymmetries will
display strong correlations, due to the transition rules outlined above. Any disruption of these
correlations at some axis position is an indicator of the non-random features of the network for this
axis.
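As a worked illustration of Eqs. (1) and (2), the two asymmetries can be computed directly from a dictionary of category counts. The sketch below makes two assumptions that the text does not spell out: the left-arm version of Eq. (1) is taken to use $n(lr)$ in the denominator (mirroring $n(rl)$ for the right arm), and an empty denominator is mapped to 0.0. The subtraction of the randomized-graph average described above is not included, and the function names are hypothetical.

```python
def pm_asymmetry(n, arm):
    """Eq. (1) for arm 'r'; for arm 'l' the mirrored form is assumed.

    n maps the labels 'r+', 'r-', 'l+', 'l-', 'rl', 'lr' to edge counts.
    """
    # Assumption: edges leaving the arm enter the denominator (rl for r, lr for l).
    crossing = n["rl"] if arm == "r" else n["lr"]
    away, towards = n[arm + "+"], n[arm + "-"]
    denom = away + towards + crossing
    return (away - towards) / denom if denom else 0.0

def cross_along_asymmetry(n):
    """Eq. (2): (edges across the arms - edges along the arms) / all edges."""
    across = n["lr"] + n["rl"]
    along = n["l+"] + n["l-"] + n["r+"] + n["r-"]
    total = across + along
    return (across - along) / total if total else 0.0

# Made-up counts for one axis position, purely for illustration.
counts = {"r+": 3, "r-": 1, "l+": 2, "l-": 2, "rl": 4, "lr": 0}
print(pm_asymmetry(counts, "r"))      # (3 - 1) / (3 + 1 + 4) = 0.25
print(pm_asymmetry(counts, "l"))      # (2 - 2) / (2 + 2 + 0) = 0.0
print(cross_along_asymmetry(counts))  # (4 - 8) / 12
```

Evaluating both arm asymmetries for every axis position and correlating them in a sliding window is one way to look for the drop in correlation that the text describes; the window size would be a further free parameter.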
Using randomly generated networks with a systematic edge distribution with respect to an axis
("systematic random networks"; see Methods section) we can test and calibrate the EDURA method (see
Supplementary Information). These tests show that, indeed, a specific axis inscribed in the edge
distribution is detected via the EDURA method as drastic drops in correlation among the edge category
asymmetries for this axis position (see Supplementary Fig. 4). In Fig. 4 the same analysis is performed
for the real E. coli TRN.
The first statistical analysis supports the earlier findings [20], where via point process statistics it
was found that genes under known direct transcriptional regulation are systematically more distal on the
chromosome than genes without transcriptional regulation, confirming the general idea that bacterial gene
regulation is organized as an interplay of network-based (digital) control and (analog) control based on
the spatial organization of the chromosome. In order to go beyond the confirmation of the previous
finding [20] we resort to a common technique of network science, the comparison with randomized graphs as
a method for identifying the higher-order non-random features of a given network. In this way we find
structural evidence for the chromosomal embedding of the network being systematic with respect to one
characteristic spatial axis in the circular chromosome (Fig. 4). This axis is defined by the origin (Ori)
and terminus (Ter) of replication. Previously it was argued that the spatiotemporal organization of
genomic transcription indeed follows the two replichores as the main spatial organizers [31]. Here we
find that the network architecture itself carries an evolutionary imprint of this bilinear space defined
by the two replichores.
Functional evidence for the interplay of digital and analog control
In order to assess the functional implications of this structural interdependence of the transcriptional
regulatory network and the spatial organization of the chromosome we resort to the COLOMBOS
representation of the Gene Expression Omnibus (GEO) database. For a large number of gene expression
datasets, we measure the agreement of significant expression changes with the network and with
chromosomal neighborhoods by employing the quantification methods for digital and analog control
strengths defined previously [27].
We start our investigation by creating the transcriptional regulatory network (TRN) and the gene
proximity network (GPN) of the E. coli genome. Details of these two networks are found in the Methods
section.
Fig. 3 Analysis of edge categories in the E. coli TRN. a Representation of the chromosomally embedded
E. coli TRN. As in Fig. 2, the large light blue circle represents the circular chromosome and the blue
line indicates the Ori-Ter axis. Black dashes on the chromosome indicate genes. b Edge categories for the
chromosomally embedded E. coli TRN from a as a function of the axis position. The true Ori and Ter
positions are indicated as a reference.
Fig. 4 Edge category asymmetry analysis for the real E. coli TRN. Asymmetries as a function of the
assumed axis position (upper panel). Correlation coefficient of $A^{r}_{\pm}$ and $A^{l}_{\pm}$ as a
function of the assumed axis position (lower panel).
Then, for each of the ~4000 experimental datasets obtained from the COLOMBOS database we generate
"effective" TRN and GPN networks by removing the nodes that are not significantly differentially
expressed as well as all their links.
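The construction of an effective network can be sketched as a simple filtering step. The example below is a minimal illustration, not the authors' pipeline: the network is a plain list of directed edges, differential expression is decided by an absolute log fold change cutoff, and the cutoff value of 1.0, the gene names and the fold changes are all made up for the example.

```python
def effective_network(edges, log_fc, threshold=1.0):
    """Restrict a network to significantly differentially expressed genes.

    edges:     iterable of (source, target) gene pairs (TRN or GPN links)
    log_fc:    dict mapping gene name -> log fold change in one contrast
    threshold: hypothetical cutoff on |logFC| (an assumption of this sketch)
    """
    active = {gene for gene, fc in log_fc.items() if abs(fc) >= threshold}
    # Removing a node also removes all of its links, so an edge survives
    # only if both of its endpoints are differentially expressed.
    kept = [(u, v) for u, v in edges if u in active and v in active]
    return active, kept

# Toy data, for illustration only.
trn_edges = [("crp", "lacZ"), ("crp", "araB"), ("fis", "araB")]
contrast = {"crp": 2.1, "lacZ": -1.5, "araB": 0.2, "fis": 1.3}
nodes, links = effective_network(trn_edges, contrast)
print(sorted(nodes))  # ['crp', 'fis', 'lacZ']
print(links)          # [('crp', 'lacZ')]
```

The same function applies unchanged to the GPN, since only the edge list differs between the two networks.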
Supplementary Fig. 5 (left) shows an example of such an effective TRN. It depicts data from the
experiment GSE10158, which contains microarray data on the expression profile of E. coli treated with
cefsulodin and mecillinam, both alone and in combination. The figure shows the 22 genes whose expression
level was significantly altered comparing the contrasts with ids GSM256904_ch1 and GSM256868_ch1. Genes
on the graph are positioned on a circle according to their coordinates on the E. coli chromosome.
Supplementary Fig. 5 (right) presents a view of an "extended" TRN subgraph, which contains the
differentially expressed genes (blue points) plus all the genes that are connected to the differentially
expressed ones in the E. coli TRN although without significantly altered expression levels (yellow
points). The extended network comprises 90 genes. A complete understanding of regulatory control, towards
which the present manuscript is a first step, should aim at explaining why the "blue" genes were
differentially expressed while the "yellow" ones were not, despite their immediate connection in the TRN,
which indicates a strong interaction between the two.
Figures 5 and 6 show scatter plots of the digital vs. analog control strengths of more than 4000
effective E. coli networks derived from the COLOMBOS database (see Methods section and particularly the
"Effective Networks" subsection). Figure 5 shows data for 104 high-quality RNA-Seq experiments. The
results demonstrate an anti-correlation between digital and analog control strength with a Spearman
correlation coefficient of -0.34.
Figure 6 shows data for 3969 effective networks constructed from contrasts of microarray experiments. The
results demonstrate once more an anti-correlation between digital and analog control strengths, with a
Spearman correlation coefficient of -0.16. A heatmap representation of the corresponding rank-based
scatterplots is given in Supplementary Figs 15 (RNA-Seq) and 16 (microarray). In line with the negative
correlations discussed here, these heatmap representations show the buffering relationship between
digital and analog control: high rank values of analog control go along with low rank values of digital
control and vice versa.
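For readers unfamiliar with the rank-based measure used here, the following is a small, dependency-free sketch of the Spearman correlation coefficient: both variables are replaced by their ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. In practice one would more likely call `scipy.stats.spearmanr`; the control strength values in the example are invented.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman coefficient as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented control strengths for four effective networks.
digital = [0.9, 0.4, 0.7, 0.1]
analog = [0.2, 0.6, 0.3, 0.8]
print(spearman(digital, analog))  # negative: high digital goes with low analog
```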
Our analysis reveals the systematic anti-correlation of digital and analog control as a large global
trend in bacterial transcriptomes. In this way, it confirms the buffering of these two categories of
bacterial gene regulation that was hypothesized before [27] based on a set of transcriptome profiles
obtained under a dedicated experimental variation of both the machinery of digital control (via the
analysis of hub mutants in the transcriptional regulatory network) and analog control (via the alteration
of supercoiling energy in the genome induced by topoisomerase poisons). Our new results show that the
anti-correlation of these distinct logical types of regulation is not limited to perturbations of the
regulatory machinery, but persists across a wide range of experimental conditions (phenotypes) and
genotypes.
Strikingly, the anti-correlation between digital and analog control strengths is much stronger for
significantly downregulated than for upregulated genes, when these two categories are analyzed separately
(see Supplementary Figs 8, 9). This statistical difference between downregulated and upregulated genes is
in line with the earlier observation [35] that the "ground state" (or default state) in prokaryotic gene
regulation is nonrestrictive (or "on"), as opposed to eukaryotic gene regulation, where the ground state
is restrictive (or "off"). It is then intuitive that the systematic interplay between digital and analog
control reveals itself in the pattern of deviations from the ground state, i.e., in the downregulated
genes.
We have re-computed our results varying the two main parameters of our analysis, namely the logFC
threshold for determining differentially expressed genes and the distance threshold defining links in the
GPN (see Supplementary Figs 11, 12). This parameter variation confirms the robustness and strong
statistical validity of our findings. As an additional confirmation we computed the interplay of digital
and analog control strengths also on the operon level (see Supplementary Figs 13, 14), leading
qualitatively to the same results.
Fig. 5 Digital vs. analog control for gene-level RNA-Seq data. Data for 104 effective networks resulting
from contrasts of RNA-Seq experiments have been analyzed. Central panel: Scatter plot of the digital vs.
analog control strengths. Top panel: Histogram of the distribution of analog control strength. Right
panel: Histogram of the distribution of digital control strength.
Fig. 6 Digital vs. analog control for gene-level microarray data. Data for 3969 effective networks
resulting from contrasts of microarray experiments have been analyzed. Central panel: Scatter plot of the
digital vs. analog control strengths. Top panel: Histogram of the distribution of analog control
strength. Right panel: Histogram of the distribution of digital control strength.
Summarizing, our results demonstrate for the first time that the tight coupling between digital
(network-based) and analog (chromosomal) control goes far beyond the response to perturbations
specifically designed to affect one of the two control types (as reported previously [27]), but is a
universal property of bacterial gene regulation.
Decision tree analysis of transcriptome profiles and the relevance of the Ori-Ter axis
In order to ascertain that the anti-correlation observed in the large set of transcriptome profiles is
not dependent on our choice of quantification method, namely the digital and analog control strengths, we
also used a machine learning framework to measure which structural features are employed to predict
significant expression changes from the database of transcriptome profiles, when the feature set consists
of nine quantitative variables (see Methods section, "Decision Trees") selected to represent a wide range
of network and chromosomal properties. This set of features is specifically designed to highlight the
impacts of either digital or analog control in a given expression profile.
Here we are not interested in the quality of the classification (and, hence, do not separate the data
into training and test data), but rather in the features employed by the decision tree to split the genes
in an expression contrast into "differentially expressed" and "not differentially expressed". As our main
goal is to assess the interplay between digital and analog control at work in these gene expression
patterns, we use a set of features which can either be associated with digital control (number of
differentially expressed regulators of a gene in the TRN, hns regulating the gene, fis regulating the
gene, crp regulating the gene) or analog control (number of differentially expressed neighbors in the
GPN, hns binding site density near a gene's location, fis binding site density, crp binding site
density), as well as one feature not directly classifiable as digital or analog control, namely the
position of a gene relative to the Ori. For each gene expression contrast we can now compute the relative
importance of each of these features in classifying genes according to their differential expression. The
question here is whether the decision tree predominantly employs analog features in contrasts with high
analog control strength and, conversely, digital features in contrasts with high digital control
strength.
In this analysis, we use the binding sites only as a proxy for local structural features of the
chromosome. Previous studies demonstrated that the gene order along the Ori-Ter axis is highly conserved
in bacteria [31]. This spatial order is apparent not only for the principal regulatory genes (such as RNA
polymerase sigma factors and nucleoid-associated proteins) but also for their targets. Furthermore, while
the chromosomal position of a gene is thought to be determinative for the gene copy number and expression
level [36–39], recent studies strongly suggest that it is also determinative for the spatial location of
the gene product in the cell. In particular, the regulatory proteins were found to diffuse from their
cognate sites of production, forming gradients [40,41]. This suggests that the genomic distances between
the transcriptional regulators and their targets are subject to evolutionary constraints and that, in
general, regulatory genes would be preferentially positioned in the vicinity of target genes [21].
However, spatial considerations imply a different effect in the case of highly abundant DNA architectural
proteins, such as the nucleoid-associated proteins, that diffuse over relatively large distances, bind
cooperatively at hundreds of chromosomal sites and compact the DNA by constraining DNA supercoils over
extended chromosomal regions [24,42–44]. Variable spatial distributions of nucleoprotein complexes formed
by global regulators such as FIS and H-NS modulate the structural dynamics of the chromosome, exerting
continuous or "analog" effects on genetic expression. These effects are reflected in directionally
coherent changes of transcript patterns involving neighboring genes [45,46] and can be readily measured
by estimating the analog control strength in the effective networks.
We have generated decision trees for each of the microarray and RNA-Seq effective networks in our
possession and used them to estimate the importance of all of the nine features in each case. In order to
exclude the effects of randomness in the estimation of importance, the differentially expressed genes for
each of the individual effective networks were shuffled 100 times and decision trees were used to
estimate the feature importance of these randomized cases. Subsequently, we subtracted the mean of the
randomized importances from the actual feature importance.
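The shuffle-and-subtract correction can be illustrated independently of the particular tree implementation. The sketch below deliberately swaps in a much simpler importance measure, namely the largest Gini impurity decrease achievable by a single threshold split on one feature (a depth-one "stump"), as a stand-in for the full decision trees used in the study. Only the correction logic (100 label shuffles, mean of the shuffled importances subtracted from the actual importance) follows the text; all names, the seed and the toy data are assumptions.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def stump_importance(column, labels):
    """Largest impurity decrease from one threshold split on this feature."""
    parent, n, best = gini(labels), len(labels), 0.0
    for threshold in sorted(set(column))[:-1]:
        left = [y for x, y in zip(column, labels) if x <= threshold]
        right = [y for x, y in zip(column, labels) if x > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best

def corrected_importances(columns, labels, n_shuffles=100, seed=0):
    """Actual importance minus the mean importance under shuffled labels."""
    rng = random.Random(seed)
    actual = [stump_importance(col, labels) for col in columns]
    null_sum = [0.0] * len(columns)
    shuffled = list(labels)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any link between features and labels
        for i, col in enumerate(columns):
            null_sum[i] += stump_importance(col, shuffled)
    return [a - s / n_shuffles for a, s in zip(actual, null_sum)]

# Toy contrast: feature 0 separates the two classes, feature 1 is constant.
labels = [0] * 20 + [1] * 20
columns = [list(range(40)), [5] * 40]
print(corrected_importances(columns, labels))
```

A feature that only looks important because of the class imbalance or the number of candidate splits receives a similar score under shuffling, so its corrected importance stays near zero.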
The results form a matrix of nine columns, with 3969 rows for the microarray data and 104 rows for the
RNA-Seq data. This matrix can be augmented by including the digitalCTC and the analogCTC as two
additional columns, since we are interested in investigating how the proposed measures of digitalCTC and
analogCTC correlate with the features used to characterize the expressed E. coli genes. As before, the
RNA-Seq and microarray experiments were analyzed separately. Figure 7 (left panel) shows the Spearman
correlation coefficient between analogCTC and each of the features for networks derived from the RNA-Seq
experiments. Figure 7 (right panel) shows the same, but for the digitalCTC. Similarly, Fig. 8 (left
panel) shows the Spearman correlation coefficient between analogCTC and each of the features for networks
derived from the microarray experiments and Fig. 8 (right panel) shows the same for the digitalCTC.
In all cases we observe a high correlation between digitalCTC and the "digital" features, i.e., trn cont,
hns dig cont, fis dig cont and crp dig cont. We also observe a high correlation between analogCTC and gpn
cont, which is a very characteristic analog control feature. The overall trend agrees with what is
expected under the assumption of two complementary forms of transcriptional control and provides evidence
that the quantities analogCTC and digitalCTC are a reliable means of measuring the impact of digital and
analog control mechanisms of gene regulation in the expression profiles.
Fig. 7 Feature importance correlations for RNA-Seq data. (Left) Spearman correlation coefficient between
analogCTC and each of the features for networks derived from the RNA-Seq experiments. (Right) The same
for the digitalCTC. The color code above each bar is: blue, analog control feature; red, digital control
feature; green, dual feature (related to the Ori-Ter axis).
The systematic anti-correlation between features associated with digital control and features associated
with analog control, as well as the relevance of the Ori-Ter axis (via the relevance of the distance from
Ori as a feature integrated in the machine learning analysis), confirm the results from the previous two
sections.
More specifically, we find that the relative position of a gene with regard to the chromosomal Ori, that
is, along the Ori-Ter axis, is a feature of very high predictive value for gene expression. Despite its
clear impact, we cannot readily attribute the influence of the Ori-Ter symmetry axis either to a purely
digital or to a purely analog type of control. Therefore, we consider this axis as a third essential and
distinct regulatory element of the system, i.e., a coordinator of gene expression.
DISCUSSION
In this study we set out to integrate two different modes of transcriptional regulation, one mediated by
the TRN and another by the spatial organization of the chromosome. For this purpose we applied a set of
structural and functional analyses. We hypothesized that the TRN is spatially embedded in the chromosome
such that the pattern of directed links is highly non-random with respect to a single axis converting the
circular space (i.e., the circular chromosome) into two linear branches (the chromosomal arms). We set
out to identify this organizational principle, as well as the position of the axis, by computing a set of
statistical quantifiers for each candidate position of such an axis and then observing clear systematics
in the behavior of these topological features emerging when we approach the true axis position. This
novel data analysis is called EDURA (Edge Distribution Under Rotation of an Axis). One counter-intuitive
aspect of the EDURA result is that the disruption of correlations is the relevant signal. This follows
from the transformation properties of link categories indicated in the previous section, with $l^{+}$
transforming first into $rl$ and then into $r^{-}$ upon rotation of the axis. Highly correlated link
counts are therefore the default, while a sudden drop in correlation is indicative of a systematic
arrangement of links with respect to this position of the axis. We have tested and confirmed this view by
a detailed analysis of random graphs with a systematic link distribution bias with respect to a randomly
selected axis (see Methods section, "Systematic random networks"; see Supplementary Fig. 4 for an example
of EDURA for a systematic random network). Beyond the set of findings on bacterial gene regulation, we
are convinced that the EDURA method, together with the axis-systematic random networks, will be of
relevance for the analysis of a range of other spatially embedded networks.
We discovered a systematic orientation of the network with
respect to the Ori-Ter axis, indicative of an evolutionary
co-adaptation of replication and transcriptional regulation. Indeed, in
E. coli and other bacteria the collisions between the transcription
machinery and the replisomes progressing bi-directionally from
the Ori towards the Ter pose problems potentially leading to
genetic instability [47], and this conflict has been widely studied
both in prokaryotes [48] and eukaryotes [49]. It is revealing that we
find evidence for an evolutionary adjustment of these two
fundamental levels of cellular DNA transactions, replication and
gene regulation, on a purely structural level (via the embedding
of the network in the chromosome), as well as on a functional
level in gene expression profiles. This means that replication and
transcription are coordinated from the same assessment center
using the Ori-Ter axis as a system of coordinates, providing a new
rationale for understanding the evolution of chromosomal gene
organization.
Our results confirm previous observations of two logically
distinct types of transcriptional regulation, digital and analog, in
E. coli [20,27] and demonstrate an anti-correlation of digital and
analog control strengths across a wide range of genotypes and
phenotypes. Furthermore, we show the dualistic nature of
digital and analog control via feature selection in a classification of
transcriptome profiles based on machine learning, with additional
evidence for the relevance of the Ori-Ter axis. This can be seen in
Fig. 7 (RNA-Seq) and Fig. 8 (microarray), which show the
correlation of each feature with analog (left) and digital (right)
control strengths. First of all, the previously discussed
anti-correlation of digital and analog control strengths is clearly visible.
Furthermore, the analog features tend to
correlate with analog control strength, while the digital features
rather correlate with digital control strength. Only the position
relative to the Ori stands out: it is (slightly) negatively correlated
with both digital and analog control strengths. These systematics
of the correlations are the same for RNA-Seq data (Fig. 7) and
microarray data (Fig. 8). Hence we denote this symmetry axis as a
coordinator of genetic regulation. It is noteworthy that our data
predominantly consist of statistical signals made visible by
comparison with null models as well as methods from machine
learning. In all cases we average over a wide range of conditions
and individual cases (for example, the transcriptome profiles of
diverse origins or the high variation in gene density across the
chromosome). As a consequence of these averaging procedures,
most of the signals will necessarily be rather faint. We would like
to emphasize, however, that each of the signals reported here is
highly significant and, in combination, the collection of statistical
signals from the structural investigation, the assessment of
transcriptome profiles and the machine learning classification
task furnish a structural and dynamical foundation of digital and
analog control in bacterial gene regulation.
Taken together, these data provide a very clear picture, where
the set of ideas about chromosomal DNA topology as a
fundamental level embedding the transcriptional regulation in
bacteria, discussed in the literature already for several
decades [23,24,26,27,50], is confirmed and the evidence for an evolutionary
alignment of two fundamental levels of cellular organization,
replication and gene regulation, is found on a purely structural
level (via the embedding of the network in the chromosome), as
well as on a functional level in the design of gene expression
patterns. Notably, the embedding of a network in space along a
systemically defined axis can potentially be of relevance for a wide
range of systems. Sensory systems, for example, have an axis along
the hierarchical depth from the input nodes to processing nodes
generating ever more abstract representations of the sensory
input [51,52]. The same is true for manufacturing systems with their
hierarchy of input, production and assembly/output layers [53].
Fig. 8 Feature importance correlations for microarray data. (Left)
Spearman correlation coefficient between analogCTC and each of
the features for networks derived from the microarray experiments.
(Right) The same for the digitalCTC. The color code above each bar
is: blue, analog control feature; red, digital control feature; green,
dual feature (related to the Ori-Ter axis).
The situation we face in our attempts to epitomize the
spatiotemporal constraints imposed on the emergence of gene
expression patterns is reminiscent of the work by Brockmann and
Helbing [54] on the spread of epidemic diseases. They show that the
complex spatiotemporal pattern of disease occurrences becomes
a simple propagating wave on a tree graph derived from shortest
flight distances. Distances in this re-arranged "world" are much
more meaningful than geographic distances. Here, "space" is the
genome and the air traffic network is the TRN, which facilitates the
spreading of information across the genome. In that work [54] the
spatial distance was ignored. Adapting their formalism to the case
where a mixture of spatial and network distances defines the
re-arranged "world" would allow a completely novel view on gene
expression profiles.
METHODS
Transcriptional regulatory network
In our investigation, a TRN is a graph whose nodes represent genes. If a
gene a encodes a transcription factor A, which regulates another gene b,
then a link pointing from a to b is inserted in the graph. The E. coli TRN
used for the present paper was created using data from RegulonDB [55], a
freely available database of the regulatory network of Escherichia coli K-12.
The TRN we have used has 1771 nodes and 3975 edges. Its minimum node
degree is equal to one, the maximum node degree is equal to 496 and the
average degree is equal to 4.49. The largest cluster of the TRN consists of
1678 nodes and has 3788 edges. It is a disassortative network with a
degree assortativity coefficient equal to -0.32, meaning that high degree
nodes tend to connect to low degree nodes and slightly avoid "hub"-"hub"
connections.
It is well known that genes are organized in operons, i.e., groups of genes
sharing a regulatory domain. In order to check for and exclude the
contributions of the "operon" effect, we have also performed our
investigations on a modified version of the TRN where the nodes are
operons instead of genes and a link between two operons is present if a
gene in one operon produces a transcription factor that regulates a gene
of the second operon. The largest cluster of this operon TRN consists of
816 nodes and 1551 edges with max degree equal to 220 and average
degree equal to 3.80.
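As a quick consistency check on the figures quoted above, the average degrees can be recomputed from the node and edge counts. This assumes that the average degree counts incoming and outgoing links together, i.e. $\langle k \rangle = 2E/N$, which the text does not state explicitly:

```latex
\langle k \rangle_{\text{gene TRN}}
  = \frac{2E}{N} = \frac{2 \times 3975}{1771} \approx 4.49,
\qquad
\langle k \rangle_{\text{operon TRN (largest cluster)}}
  = \frac{2 \times 1551}{816} \approx 3.80 .
```

Both values agree with the reported averages under this assumption.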
Gene proximity network
The GPN is an undirected graph of 4609 nodes and 90,878 edges. It is a
formal representation of the spatial organization of the chromosome.
Following the prescription from earlier work [27], nodes represent genes and
are connected to each other if their "centers" are separated by a distance
less than or equal to a distance threshold of $T_{\mathrm{GPN}} = 20$ kilobase
pairs (kbp) on the circular DNA chromosome. The information required for
constructing the GPN, i.e., gene names and their starting and ending
positions on the E. coli chromosome, was again obtained from RegulonDB.
As a gene's center position we have considered the average of its starting
and ending positions.
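The construction rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the toy gene coordinates and the toy genome length are hypothetical, and genes that span the coordinate origin are not treated specially.

```python
from itertools import combinations


def circular_distance(x, y, genome_length):
    """Shortest distance between two positions on a circular chromosome."""
    d = abs(x - y) % genome_length
    return min(d, genome_length - d)


def build_gpn(genes, genome_length, threshold=20_000):
    """Return the undirected GPN edges as a set of (name_a, name_b) pairs.

    genes: dict mapping gene name -> (start, end) in base pairs.
    threshold: distance threshold T_GPN in base pairs (20 kbp in the text).
    """
    # A gene's center is the average of its start and end positions.
    centers = {name: (start + end) / 2.0 for name, (start, end) in genes.items()}
    edges = set()
    for a, b in combinations(sorted(centers), 2):
        if circular_distance(centers[a], centers[b], genome_length) <= threshold:
            edges.add((a, b))
    return edges


# Toy example with made-up coordinates on a 100 kbp circular genome.
toy_genes = {
    "g1": (1_000, 2_000),
    "g2": (15_000, 17_000),
    "g3": (60_000, 61_000),
    "g4": (98_000, 99_000),
}
toy_edges = build_gpn(toy_genes, genome_length=100_000)
```

In the toy example g1 and g4 become neighbors only because the distance is measured around the circle; a linear distance would miss that link.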
Transcriptome profiles and effective networks
COLOMBOS offers RNA-Seq and microarray experimental data containing
differentially expressed genes between pairs of experiments. Differentially
expressed genes are determined by computing the log-fold change
(logFC) between the two experimental conditions. We consider three
modes of differential expression depending on the logFC:
"differentially regulated" (absdge), $|\mathrm{logFC}| > T_{\mathrm{FC}}$;
"upregulated" (posdge), $\mathrm{logFC} > T_{\mathrm{FC}}$;
"downregulated" (negdge), $\mathrm{logFC} < -T_{\mathrm{FC}}$.
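A minimal sketch of the three modes, reading the downregulated condition as logFC < -T_FC. The function name and the toy logFC values are hypothetical; the default threshold of 2.5 is the value used for the reported results.

```python
def differential_modes(log_fc, t_fc=2.5):
    """Split genes into the three modes of differential expression.

    log_fc: dict mapping gene name -> log-fold change between two conditions.
    t_fc: fold-change threshold T_FC.
    """
    return {
        "absdge": {g for g, v in log_fc.items() if abs(v) > t_fc},
        "posdge": {g for g, v in log_fc.items() if v > t_fc},
        "negdge": {g for g, v in log_fc.items() if v < -t_fc},
    }


# Toy values (made up); gene "d" sits exactly on the threshold and is excluded
# because the inequalities are strict.
toy_log_fc = {"a": 3.1, "b": -2.7, "c": 0.4, "d": 2.5}
toy_modes = differential_modes(toy_log_fc)
```

By construction, absdge is the union of posdge and negdge.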
We construct an effective TRN by taking the subgraph from the
complete TRN consisting of all differentially expressed genes and the links
among them (see Supplementary Fig. 5 left for an example). Thus, an
effective TRN is a directed subgraph of the TRN where the nodes are only
the genes whose expression level has been significantly altered. Similarly,
we construct an effective GPN by taking the subgraph from the complete
GPN consisting of all differentially expressed genes and the links among
them. Consequently, each pair of experiments has one effective TRN and
one effective GPN associated with it. In total we have analyzed 104
effective TRNs and effective GPNs from RNA-Seq data and 3969 effective
TRNs and effective GPNs from microarray data.
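The effective networks are induced subgraphs, which can be sketched as follows. The helper names and the toy edge list are hypothetical; the same function applies to the TRN (directed pairs) and the GPN (undirected pairs stored as tuples).

```python
def effective_network(edges, affected):
    """Edges of the subgraph induced by the differentially expressed genes."""
    affected = set(affected)
    return [(u, v) for (u, v) in edges if u in affected and v in affected]


def connected_and_isolated(edges, affected):
    """Split the affected genes into those with and without a link
    inside the effective network."""
    sub_edges = effective_network(edges, affected)
    connected = {node for edge in sub_edges for node in edge}
    isolated = set(affected) - connected
    return connected, isolated


# Toy network with made-up gene names.
toy_trn = [("tf1", "a"), ("tf1", "b"), ("a", "c"), ("x", "y")]
toy_affected = {"tf1", "a", "y", "z"}
toy_effective = effective_network(toy_trn, toy_affected)
toy_connected, toy_isolated = connected_and_isolated(toy_trn, toy_affected)
```

Only the link tf1 -> a survives, because every other link has at least one endpoint outside the affected set.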
Differential expression on the gene level is translated to the operon level
in the following way: an operon is considered differentially expressed if
any gene in the operon is differentially expressed. The same rule is applied
for distinguishing between differentially upregulated and downregulated
operons. A potentially conflicting case, where an operon consists of both
significantly upregulated and downregulated genes in the same
experiment, does not occur in the data sets we analyzed.
Unless indicated otherwise, results are shown for $T_{\mathrm{FC}} = 2.5$ and
$T_{\mathrm{GPN}} = 20$ kbp. The robustness of the results under variation of
$T_{\mathrm{FC}}$ and $T_{\mathrm{GPN}}$ is demonstrated in Supplementary
Figs. 11–14.
Systematic random networks
A key component of the structural part of our analysis is the arrangement
of edges with respect to a given spatial axis. In order to interpret the
statistics observed in the real network, we employ a simple
graph-generation algorithm to create random networks with edge counts that
are systematic with respect to one predefined axis. Parameters of this
algorithm are the number of nodes, $N$; the number of edges in each
category, $n(r^{+})$, $n(r^{-})$, $n(l^{+})$, $n(l^{-})$, $n(rl)$, $n(lr)$,
with respect to the chosen axis (see Fig. 1 for an illustration of the edge
categories); the position of the axis, $a^{*}$; and the size of the genome,
$g$. First, random positions for the $N$ genes are created ($N/2$ per
chromosomal arm). Next, random links within each category are created
according to the axis $a^{*}$. These networks can
subsequently be analyzed via the same analysis pipeline as the real
chromosomally embedded TRN.
Graph randomization
All the results for the edge category asymmetry analysis shown here are
differences between a given graph and a set of randomized graphs,
serving as a null model. Here we keep the gene positions fixed and
randomize the edges via switch randomization. For each randomized
graph, 5000 randomization steps are performed.
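Switch randomization is commonly implemented as repeated degree-preserving edge swaps. The sketch below shows one such scheme for a directed graph; it is an assumption that the randomization used here follows this exact variant, and in particular whether the 5000 steps count attempted or accepted swaps is not specified. Names are hypothetical.

```python
import random


def switch_randomize(edges, n_steps=5000, seed=None):
    """Degree-preserving randomization of a directed edge list.

    Each step picks two edges (a -> b) and (c -> d) and tries to rewire them
    to (a -> d) and (c -> b). Swaps that would create a self-loop or a
    duplicate edge are rejected. Here n_steps counts attempted swaps.
    """
    rng = random.Random(seed)
    edge_list = list(dict.fromkeys(edges))  # drop duplicates, keep order
    edge_set = set(edge_list)
    for _ in range(n_steps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        new1, new2 = (a, d), (c, b)
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue  # rejected: self-loop or multi-edge
        edge_set.difference_update((edge_list[i], edge_list[j]))
        edge_set.update((new1, new2))
        edge_list[i], edge_list[j] = new1, new2
    return edge_list


# Toy directed graph (made up); node positions stay untouched by design.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (2, 5)]
toy_randomized = switch_randomize(toy_edges, n_steps=200, seed=1)
```

Because only link endpoints are exchanged, every node keeps its in-degree and out-degree, and the gene positions remain fixed as required by the null model.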
Control strengths
Qualitatively speaking, each control strength measures the agreement
between a set of genes and a given network. In the induced subgraph
spanned by the set of genes we compute the connectivity (specifically, we
compute the number of nodes with non-zero degree in the subgraph).
Using a null model of randomly drawn gene sets, we then compute the
z-score of this connectivity. This z-score is the control strength. Applying
this procedure to the TRN yields the digital control strength; applying this
procedure to the GPN yields the analog control strength.
For each effective network the control ratio $R$ is calculated as the
number of connected nodes $N_{\mathrm{connected}}$ (i.e., the size of the
connected subnet component) over the number of isolated nodes
$N_{\mathrm{isolated}}$ (i.e., the size of the unconnected subnet component),
$R = N_{\mathrm{connected}} / N_{\mathrm{isolated}}$. The
control type confidence, CTC [27], or control strength, is the z-score of $R$,
calculated from the mean of $R$ and its standard deviation obtained from
10,000 runs of the corresponding null model. In the case of the digital null
model, the same number of affected nodes was mapped randomly on the
TRN. For the analog null model, the same number of affected genes was
mapped randomly on the positions in the circular genome.
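The control type confidence can be sketched as a z-score against randomly drawn gene sets of the same size. This is an illustrative reading of the procedure, not the authors' implementation: the function names are hypothetical, the population standard deviation is an arbitrary choice, and the sketch simply raises an error when a sample has no isolated nodes (R undefined) or the null distribution is degenerate.

```python
import math
import random
import statistics


def control_ratio(edges, affected):
    """R = N_connected / N_isolated in the subgraph induced by `affected`."""
    affected = set(affected)
    connected = set()
    for u, v in edges:
        if u in affected and v in affected:
            connected.update((u, v))
    n_isolated = len(affected) - len(connected)
    return math.inf if n_isolated == 0 else len(connected) / n_isolated


def control_type_confidence(edges, all_nodes, affected, n_runs=10_000, seed=None):
    """z-score of R against gene sets of equal size drawn at random."""
    rng = random.Random(seed)
    nodes = sorted(all_nodes)
    k = len(set(affected))
    observed = control_ratio(edges, affected)
    null = [control_ratio(edges, rng.sample(nodes, k)) for _ in range(n_runs)]
    if not all(math.isfinite(r) for r in null + [observed]):
        raise ValueError("R is undefined for a set without isolated nodes")
    sigma = statistics.pstdev(null)
    if sigma == 0:
        raise ValueError("degenerate null model (zero variance)")
    return (observed - statistics.fmean(null)) / sigma


# Toy network (made up): a short chain embedded in 30 otherwise unlinked nodes.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
toy_affected = [0, 1, 2, 3, 4, 10, 11]
toy_ctc = control_type_confidence(
    toy_edges, range(30), toy_affected, n_runs=500, seed=0
)
```

Passing the TRN edges gives a digital control strength and passing the GPN edges gives an analog one; in the toy case the affected set covers the whole chain, so its ratio exceeds what random sets typically reach and the z-score is positive.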
Decision trees
For the decision tree implementation we choose nine features, which will
be the input of our machine learning model. Our decision tree model will
use these features to predict whether a gene will be differentially
expressed or not. These features are the following:
PosOric = position relative to Ori.
crp density = crp binding sites density, i.e., number of crp binding sites
within a distance of +/-50,000 base pairs around the gene.
hns density = hns binding sites density, i.e., number of hns binding
sites within a distance of +/-50,000 base pairs around the gene.
fis density = fis binding sites density, i.e., number of fis binding sites
within a distance of +/-50,000 base pairs around the gene.
gpn cont = number of affected neighbors in the GPN.
trn cont = number of affected ancestors in the TRN.
hns dig cont = binary variable; 1 if hns is among the gene's direct TRN
predecessors, 0 otherwise.
fis dig cont = binary variable; 1 if fis is among the gene's direct TRN
predecessors, 0 otherwise.
crp dig cont = binary variable; 1 if crp is among the gene's direct TRN
predecessors, 0 otherwise.
In short, we assume that the differential expression of a gene is a
function f of the above nine variables. The range of f is the discrete set
{0, 1}, where the value 1 means that the gene is differentially expressed.
Thus, each of the 4602 E. coli genes is characterized by a nine-dimensional
"vector" with the values of these nine variables as coordinates. For each of
these genes the value of f is known (and that is true for each of the ~4000
experiments of the COLOMBOS database). Predicting the values of f and
comparing them to the known values can be seen as a supervised learning
problem. In fact, a decision tree represents a function that takes as input a
vector of attribute values and returns a "decision", i.e., a single output
value. A decision tree reaches its decision by performing a sequence of tests.
Each internal node in the tree corresponds to a test of the value of one of
the input attributes. The algorithm selects a variable and splits the data at
the value of the variable that maximizes the entropy gain (or equivalently
the Gini impurity gain) from the split [56].
In our case, at each node, which contains for example $N$ genes, we
calculate the value
$$S = -\left(p_1 \ln(p_1) + p_0 \ln(p_0)\right),$$
where $p_1$ is the fraction of expressed genes to total genes $N$ on the
node and $p_0$ is the fraction of silent genes to $N$. Then test splittings
are performed and the quantity
$$G = \frac{n_{\mathrm{left}}}{N} S_{\mathrm{left}} + \frac{n_{\mathrm{right}}}{N} S_{\mathrm{right}}$$
is calculated. The split that maximizes the difference $S - G$ is selected.
The Gini impurity $g = p_1 (1 - p_1) + p_0 (1 - p_0)$
is a valid and often used alternative to the entropy $S$.
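The split criterion can be made concrete with a short sketch that evaluates S, G and the gain S - G for one feature. Function names and the toy data are hypothetical; the threshold convention (values less than or equal to the threshold go left) is an assumption.

```python
import math


def entropy(labels):
    """S = -(p1 ln p1 + p0 ln p0) for a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return -sum(p * math.log(p) for p in (p1, 1.0 - p1) if p > 0)


def gini(labels):
    """g = p1 (1 - p1) + p0 (1 - p0) for a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    p0 = 1.0 - p1
    return p1 * (1 - p1) + p0 * (1 - p0)


def split_gain(values, labels, threshold, impurity=entropy):
    """Gain S - G of splitting one feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    g = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(labels) - g


def best_split(values, labels, impurity=entropy):
    """Return (gain, threshold) of the best split on this feature."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    return max(
        ((split_gain(values, labels, t, impurity), t) for t in candidates),
        default=(0.0, None),
    )


# Toy feature (made up): the two classes separate perfectly at threshold 2.
toy_values = [1, 2, 3, 4]
toy_labels = [0, 0, 1, 1]
toy_gain, toy_threshold = best_split(toy_values, toy_labels)
```

For the toy data the parent entropy is ln 2 and both children are pure, so the best gain equals ln 2 at threshold 2. Passing `impurity=gini` evaluates the same split with the Gini impurity instead.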
The algorithm then does the same for all other variables, finally selecting
the variable and value that lead to the maximum gain among all possible
choices. Thus, the main node splits into two nodes and the process is
repeated for each of them until a perfect classification is reached. Finally,
we calculate the importance of each feature (variable) used for the
classification process. The feature importance values of a single tree are
computed by traversing the tree: for each internal node that splits on
feature i, we compute the error reduction of that node multiplied by the
number of samples that were routed to the node, and sum this quantity over
all such nodes to estimate the feature importance of variable i. The error
reduction depends on the impurity criterion that is used (Gini or entropy).
It is the impurity of the set of examples that gets routed to the internal
node minus the sum of the impurities of the two partitions created by the
split. This is the way that regression trees are implemented in
scikit-learn [57], which is
rapidly becoming a standard machine learning tool.
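The importance calculation can be illustrated on a small hand-written tree. The nested-dictionary representation, the function names and all numbers are hypothetical, and the sketch assumes that the child impurities enter weighted by the fraction of samples routed to each child, which is a common convention rather than something stated in the text.

```python
def feature_importances(node, totals=None):
    """Sum, per feature, the impurity reduction of every internal node
    weighted by the number of samples routed to that node."""
    if totals is None:
        totals = {}
    if "feature" not in node:  # leaf: no split, nothing to add
        return totals
    n = node["n"]
    left, right = node["left"], node["right"]
    reduction = node["impurity"] - (
        left["n"] / n * left["impurity"] + right["n"] / n * right["impurity"]
    )
    totals[node["feature"]] = totals.get(node["feature"], 0.0) + n * reduction
    feature_importances(left, totals)
    feature_importances(right, totals)
    return totals


def normalize(totals):
    """Rescale the importances so that they sum to one."""
    s = sum(totals.values())
    return {k: v / s for k, v in totals.items()} if s > 0 else dict(totals)


# Toy tree with made-up numbers: 8 genes, 2 of them differentially expressed,
# Gini impurity g = p1 (1 - p1) + p0 (1 - p0) stored at every node.
toy_tree = {
    "feature": "PosOric", "n": 8, "impurity": 0.375,
    "left": {"n": 4, "impurity": 0.0},
    "right": {
        "feature": "trn cont", "n": 4, "impurity": 0.5,
        "left": {"n": 2, "impurity": 0.0},
        "right": {"n": 2, "impurity": 0.0},
    },
}
toy_raw = feature_importances(toy_tree)
toy_importance = normalize(toy_raw)
```

In this toy tree the second split contributes more than the root split because it removes all remaining impurity, even though fewer samples reach it.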
Reporting summary
Further information on research design is available in the Nature Research
Reporting Summary linked to this article.
DATA AVAILABILITY
All data analysed during this study are publicly available via the databases referenced
in the article.
CODE AVAILABILITY
The code used to perform the analyses presented in the current study is available
from the corresponding author on reasonable request.
Received: 30 August 2019; Accepted: 21 January 2020;
REFERENCES
1. Westerhoff, H. V. & Palsson, B. O. The evolution of molecular biology into systems
biology. Nat. Biotechnol. 22, 1249 (2004).
2. Hütt, M.-T. Understanding genetic variation-the value of systems biology. Br. J.
Clin. Pharmacol. 77, 597–605 (2014).
3. Palsson, B. & Palsson, B. Ø. Systems Biology (Cambridge University Press,
Cambridge, 2015).
4. Tomita, M. Whole-cell simulation: a grand challenge of the 21st century. Trends
Biotechnol. 19, 205–210 (2001).
5. Karr, J. R. et al. A whole-cell computational model predicts phenotype from
genotype. Cell 150, 389–401 (2012).
6. Milo, R. et al. Network motifs: simple building blocks of complex networks.
Science 298, 824–827 (2002).
7. Babu, M. M., Luscombe, N. M., Aravind, L., Gerstein, M. & Teichmann, S. A.
Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct.
Biol. 14, 283–291 (2004).
8. Yu, H. & Gerstein, M. Genomic analysis of the hierarchical structure of regulatory
networks. Proc. Natl Acad. Sci. 103, 14724–14731 (2006).
9. Alon, U. Network motifs: theory and experimental approaches. Nat. Rev. Genet. 8,
450 (2007).
10. Shen-Orr, S. S., Milo, R., Mangan, S. & Alon, U. Network motifs in the
transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64 (2002).
11. Barthélemy, M. Spatial networks. Phys. Rep. 499, 1–101 (2011).
12. Kosmidis, K., Havlin, S. & Bunde, A. Structural properties of spatially embedded
networks. Europhys. Lett. 82, 48005 (2008).
13. Tero, A. et al. Rules for biologically inspired adaptive network design. Science 327,
439–442 (2010).
14. Morris, R. G. & Barthelemy, M. Transport on coupled spatial networks. Phys. Rev.
Lett. 109, 128703 (2012).
15. Bullmore, E. & Sporns, O. The economy of brain network organization. Nat. Rev.
Neurosci. 13, 336 (2012).
16. Chen, Y., Wang, S., Hilgetag, C. C. & Zhou, C. Trade-off between multiple
constraints enables simultaneous formation of modules and hubs in neural systems.
PLoS Comput. Biol. 9, e1002937 (2013).
17. Crucitti, P., Latora, V. & Porta, S. Centrality measures in spatial networks of urban
streets. Phys. Rev. E 73, 036125 (2006).
18. Gilarranz, L. J. & Bascompte, J. Spatial network structure and metapopulation
persistence. J. Theor. Biol. 297, 11–16 (2012).
19. Warren, P. & Ten Wolde, P. Statistical analysis of the spatial distribution of
operons in the transcriptional regulation network of Escherichia coli. J. Mol. Biol.
342, 1379–1390 (2004).
20. Sonnenschein, N., Hütt, M.-T., Stoyan, H. & Stoyan, D. Ranges of control in the
transcriptional regulation of Escherichia coli. BMC Syst. Biol. 3, 119 (2009).
21. Janga, S. C., Salgado, H. & Martínez-Antonio, A. Transcriptional regulation shapes
the organization of genes on bacterial chromosomes. Nucleic Acids Res. 37,
3680–3688 (2009).
22. Daqing, L., Kosmidis, K., Bunde, A. & Havlin, S. Dimension of spatially embedded
networks. Nat. Phys. 7, 481–484 (2011).
23. Travers, A. & Muskhelishvili, G. DNA supercoiling-a global transcriptional regulator
for enterobacterial growth? Nat. Rev. Microbiol. 3, 157 (2005).
24. Travers, A. & Muskhelishvili, G. Bacterial chromatin. Curr. Opin. Genet. Dev. 15,
507–514 (2005).
25. Muskhelishvili, G., Sobetzko, P., Geertz, M. & Berger, M. General organisational
principles of the transcriptional regulation system: a tree or a circle? Mol. BioSyst.
6, 662–676 (2010).
26. Travers, A., Muskhelishvili, G. & Thompson, J. DNA information: from digital code
to analogue structure. Phil. Trans. R. Soc. A 370, 2960–2986 (2012).
27. Marr, C., Geertz, M., Hütt, M.-T. & Muskhelishvili, G. Dissecting the logical types of
network control in gene expression profiles. BMC Syst. Biol. 2, 18 (2008).
28. Sonnenschein, N., Geertz, M., Muskhelishvili, G. & Hütt, M.-T. Analog regulation of
metabolic demand. BMC Syst. Biol. 5, 40 (2011).
29. Beber, M. E., Sobetzko, P., Muskhelishvili, G. & Hütt, M.-T. Interplay of digital and
analog control in time-resolved gene expression profiles. EPJ Nonlinear Biomed.
Phys. 4, 8 (2016).
30. Moretto, M. et al. COLOMBOS v3.0: leveraging gene expression compendia for
cross-species analyses. Nucleic Acids Res. 44, D620–D623 (2015).
31. Sobetzko, P., Travers, A. & Muskhelishvili, G. Gene order and chromosome
dynamics coordinate spatiotemporal gene expression during the bacterial
growth cycle. Proc. Natl Acad. Sci. 109, E42–E50 (2012).
32. Fitzgerald, S. et al. Re-engineering cellular physiology by rewiring high-level
global regulatory genes. Sci. Rep. 5, 17653 (2015).
33. Gerganova, V. et al. Chromosomal position shift of a regulatory gene alters the
bacterial phenotype. Nucleic Acids Res. 43, 8215–8226 (2015).
34. Kosmidis, K. & Hütt, M.-T. The E. coli transcriptional regulatory network and its
spatial embedding. Eur. Phys. J. E 42, 30 (2019).
35. Struhl, K. Fundamentally different logic of gene regulation in eukaryotes and
prokaryotes. Cell 98, 1–4 (1999).
36. Schmid, M. B. & Roth, J. R. Gene location affects expression level in Salmonella
typhimurium. J. Bacteriol. 169, 2872–2875 (1987).
37. Couturier, E. & Rocha, E. P. Replication-associated gene dosage effects shape the
genomes of fast-growing bacteria but only for transcription and translation
genes. Mol. Microbiol. 59, 1506–1518 (2006).
38. Block, D. H., Hussein, R., Liang, L. W. & Lim, H. N. Regulatory consequences of gene
translocation in bacteria. Nucleic Acids Res. 40, 8979–8992 (2012).
39. Soler-Bistué, A., Timmermans, M. & Mazel, D. The proximity of ribosomal protein
genes to oriC enhances Vibrio cholerae fitness in the absence of multifork
replication. MBio 8, e00097–e000117 (2017).
40. Llopis, P. M. et al. Spatial organization of the flow of genetic information in
bacteria. Nature 466, 77 (2010).
41. Kuhlman, T. E. & Cox, E. C. Gene location and DNA density determine
transcription factor distributions in Escherichia coli. Mol. Syst. Biol. 8, 610
(2012).
42. Luijsterburg, M. S., White, M. F., van Driel, R. & Dame, R. T. The major architects of
chromatin: architectural proteins in bacteria, archaea and eukaryotes. Crit. Rev.
Biochem. Mol. Biol. 43, 393–418 (2008).
43. Dillon, S. C. & Dorman, C. J. Bacterial nucleoid-associated proteins, nucleoid
structure and gene expression. Nat. Rev. Microbiol. 8, 185 (2010).
44. Rimsky, S. & Travers, A. Pervasive regulation of nucleoid structure and
function by nucleoid-associated proteins. Curr. Opin. Microbiol. 14, 136–141
(2011).
45. Sobetzko, P., Glinkowska, M., Travers, A. & Muskhelishvili, G. DNA
thermodynamic stability and supercoil dynamics determine the gene expression
program during the bacterial growth cycle. Mol. BioSyst. 9, 1643–1651
(2013).
46. Jiang, X., Sobetzko, P., Nasser, W., Reverchon, S. & Muskhelishvili, G. Chromosomal
"stress-response" domains govern the spatiotemporal expression of the bacterial
virulence program. MBio 6, e00353–e00415 (2015).
47. Wang, G. & Vasquez, K. M. Effects of replication and transcription on DNA
structure-related genetic instability. Genes 8, 17 (2017).
48. Merrikh, H., Zhang, Y., Grossman, A. D. & Wang, J. D. Replication-transcription
conflicts in bacteria. Nat. Rev. Microbiol. 10, 449 (2012).
49. Brambati, A., Colosio, A., Zardoni, L., Galanti, L. & Liberi, G. Replication and
transcription on a collision course: eukaryotic regulation mechanisms and
implications for DNA stability. Front. Genet. 6, 166 (2015).
50. Dorman, C. J. & Dorman, M. J. DNA supercoiling is a fundamental regulatory
principle in the control of bacterial gene expression. Biophys. Rev. 8, 89–100
(2016).
51. Kashtan, N. & Alon, U. Spontaneous evolution of modularity and network motifs.
Proc. Natl Acad. Sci. USA 102, 13773–13778 (2005).
52. Fretter, C., Lesne, A., Hilgetag, C. C. & Hütt, M.-T. Topological determinants of
self-sustained activity in a simple model of excitable dynamics on graphs. Sci. Rep. 7,
42340 (2017).
53. Beber, M. E. et al. The prescribed output pattern regulates the modular structure
of flow networks. Eur. Phys. J. B 86, 473 (2013).
54. Brockmann, D. & Helbing, D. The hidden geometry of complex, network-driven
contagion phenomena. Science 342, 1337–1342 (2013).
55. Salgado, H. et al. RegulonDB v8.0: omics data sets, evolutionary conservation,
regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res.
41, D203–D213 (2013).
56. Breiman, L., Friedman, J., Olshen, R. & Stone, C. Classification and Regression Trees
(Wadsworth International Group, Belmont, CA, 1984).
57. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res.
12, 2825–2830 (2011).
ACKNOWLEDGEMENTS
M.H. and G.M. acknowledge support from Deutsche Forschungsgemeinschaft (grant
numbers HU 937/8, MU 1549/13).
AUTHOR CONTRIBUTIONS
K.K., G.M. and M.H. conceived the research. K.K., K.J. and M.H. developed the
computational framework. K.K. and K.J. analyzed the data. G.M. and M.H. contributed
to the results interpretation. K.K., G.M. and M.H. wrote the paper.
COMPETING INTERESTS
The authors declare no competing interests.
ADDITIONAL INFORMATION
Supplementary information is available for this paper at
https://doi.org/10.1038/s41540-020-0124-1.
Correspondence and requests for materials should be addressed to M.-T.H.
Reprints and permission information is available at
http://www.nature.com/reprints
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims
in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made. The images or other third party
material in this article are included in the article's Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not included in the
article's Creative Commons license and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2020
K. Kosmidis et al.
9
Published in partnership with the Systems Biology Institute npj Systems Biology and Applications (2020) 5
... The model consists of 2 modules, (1) a CNN taking as input the DNA sequence and outputting a low-dimensional representation of a gene, and (2) a GAT leveraging the gene representations as node features and modelling the interactions between the genes using data from existing protein-protein interaction database [54]. As bacteria show spatial gene expression patterns, with the location of the gene on the genome from the origin of the replication affecting its mean expression [55,56] and genes often clustering together in operons [50,51], we inject positional information by adding absolute (i.e. not relative) positional encodings to the gene embedding. ...
... Bacterial gene expression depends on the position in the genome [55,56]. To account for this, we add fixed absolute positional embedding to the gene representation, right after the CNN and before the graph module (Fig. 4a). ...
Preprint
Full-text available
Rapid detection of antibiotic-resistant bacteria and understanding the mecha- nisms underlying antimicrobial resistance (AMR) are major unsolved problems that pose significant threats to global public health. However, existing methods for predicting antibiotic resistance from genomic sequence data have had lim- ited success due to their inability to model epistatic effects and generalize to novel variants. Here, we present GeneBac, a deep learning method for predicting antibiotic resistance from DNA sequence through the integration of interactions between genes. We apply GeneBac to two distinct bacterial species and show that it can successfully predict the minimum inhibitory concentration (MIC) of multiple antibiotics. We use the WHO Mycobacterium tuberculosis mutation cat- alogue to demonstrate that GeneBac accurately predicts the effects of different variants, including novel variants that have not been observed during training. GeneBac is a modular framework which can be applied to a number of tasks including gene expression prediction, resistant gene identification and strain clus- tering. We leverage this modularity to transfer learn from the transcriptomic data to improve performance on the MIC prediction task.
... (iv) hub-set-orientation prevalence (HSO): a measure based on shortest paths that calculates the orientation of an edge with respect to a set of hubs by fixing an enumeration of edge endpoints and counting which of them is more often reached first from the hub set by following the shortest path [44]. In this study, we use the whole network as the set of hubs; (v) incident degree asymmetry (IDA): the absolute value of asymmetry [45] ...
Article
Full-text available
How important is a single edge of a graph for a specific dynamical task? This question is of practical relevance to many research fields and is pivotal to understanding the structure–function relationships in complex networks more deeply. Here, we design an analysis strategy to answer it and explore the connection of such importance to network topology. Our approach for evaluating dynamical edge importance is based on the differences in time courses between dynamics on the original graph G and on the graph G − missing an edge. To demonstrate the method’s versatility, we apply it to two drastically different classes of dynamics—a minimal model of excitable dynamics, and totalistic cellular automata on graphs as representatives of pattern formation. Our results suggest that the dynamical usage of a graph relies on markedly different topological attributes for these two classes of processes. Finally, we study dynamical edge importance in the macaque cortical area network, to illustrate possible real-world applications. We find that dynamical importance of edges differ between the network and its switch-randomized counterparts, and these differences can be functionally interpreted. Moreover, they are qualitatively distinct for long-time courses and short transients, highlighting different parts of the network’s intended function.
... The temporally organized regulatory cascades alone cannot provide for the unity of the living system: a cascade has a beginning and an end, yet it does not necessarily close onto itself, while as mentioned above, from the systems-theoretical perspective, the living system constitutes a self-referential circuit. What is assumed here is the closure of the system onto itself, and this organization of unity implicates spatial coordinates [131–133]. A relevant example of the integration of cell division and differentiation by coordinating the temporal gene expression and spatial organization of gene products and protein gradients has been provided in studies of the Caulobacter crescentus system [9,134,135]. ...
Article
Full-text available
Living systems are capable on the one hand of eliciting a coordinated response to changing environments (also known as adaptation), and on the other hand, they are capable of reproducing themselves. Notably, adaptation to environmental change requires the monitoring of the surroundings, while reproduction requires monitoring oneself. These two tasks appear separate and make use of different sources of information. Yet, both the process of adaptation as well as that of reproduction are inextricably coupled to alterations in genomic DNA expression, while a cell behaves as an indivisible unity in which apparently independent processes and mechanisms are both integrated and coordinated. We argue that at the most basic level, this integration is enabled by the unique property of the DNA to act as a double coding device harboring two logically distinct types of information. We review biological systems of different complexities and infer that the inter-conversion of these two distinct types of DNA information represents a fundamental self-referential device underlying both systemic integration and coordinated adaptive responses.
... The code underlying AMSD, as well as documentation of the method, is available on GitHub under the MIT license, and is archived at Zenodo (Sasani, 2024). We have also deposited a reproducible Snakemake workflow (Kosmidis et al., 2020) for reproducing all analyses and figures presented in the manuscript. ...
Article
Full-text available
Maintaining germline genome integrity is essential and enormously complex. Although many proteins are involved in DNA replication, proofreading, and repair, mutator alleles have largely eluded detection in mammals. DNA replication and repair proteins often recognize sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of de novo mutations – the frequencies of C>T, A>G, etc. – will differ between genomes that harbor either a mutator or wild-type allele. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene Mutyh that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs (Sasani et al., 2022; Ashbrook et al., 2021). In this study we developed a new method to detect alleles associated with mutation spectrum variation and applied it to mutation data from the BXDs. We discovered an additional C>A mutator locus on chromosome 6 that overlaps Ogg1, a DNA glycosylase involved in the same base-excision repair network as Mutyh (David et al., 2007). Its effect depends on the presence of a mutator allele near Mutyh, and BXDs with mutator alleles at both loci have greater numbers of C>A mutations than those with mutator alleles at either locus alone. Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles and may be applicable to mutation data from humans and other model organisms.
... As a result, the two eligible ORI loci per genome could be unraveled, typically placed at the transition points of nucleotide overrepresentation (Lobry, 1996). Due to the replication initiation gene dnaA at one of those two conversion regions, the ORI could be located exactly with the DoriC 12.0 tool (Dong et al., 2023; Kosmidis et al., 2020; Trojanowski et al., 2018). Interestingly, the J. naejangsanensis ...
Article
Full-text available
Chitin is the second most abundant polysaccharide worldwide as part of arthropods' exoskeletons and fungal cell walls. Low concentrations in soils and sediments indicate rapid decomposition through chitinolytic organisms in terrestrial and aquatic ecosystems. The enacting enzymes, so‐called chitinases, and their products, chitooligosaccharides, exhibit promising characteristics with applications ranging from crop protection to cosmetics, medical, textile, and wastewater industries. Exploring novel chitinolytic organisms is crucial to expand the enzymatical toolkit for biotechnological chitin utilization and to deepen our understanding of diverse catalytic mechanisms. In this study, we present two long‐read sequencing‐based genomes of highly similar Jeongeupia species, which have been screened, isolated, and biochemically characterized from chitin‐amended soil samples. Through metabolic characterization, whole‐genome alignments, and phylogenetic analysis, we could demonstrate how the investigated strains differ from the taxonomically closest strain Jeongeupia naejangsanensis BIO‐TAS4‐2T (DSM 24253). In silico analysis and sequence alignment revealed a multitude of highly conserved chitinolytic enzymes in the investigated Jeongeupia genomes. Based on these results, we suggest that the two strains represent a novel species within the genus of Jeongeupia, which may be useful for environmentally friendly N‐acetylglucosamine production from crustacean shell or fungal biomass waste or as a crop protection agent.
... In a stylized regulatory network a directed cycle is said to be positive if it contains an even number of inhibitory links (possibly zero), and is said to be negative otherwise. Adopting the standard definition of an asymmetry (see, e.g., Ref. [40]), we introduce a quantity A_k(G), quantifying the difference between the numbers of positive and negative cycles of length k in network G: ...
Article
Full-text available
How the architecture of gene regulatory networks shapes gene expression patterns is an open question, which has been approached from a multitude of angles. The dominant strategy has been to identify nonrandom features in these networks and then argue for the function of these features using mechanistic modeling. Here we establish the foundation of an alternative approach by studying the correlation of network eigenvectors with synthetic gene expression data simulated with a basic and popular model of gene expression dynamics: Boolean threshold dynamics in signed directed graphs. We show that eigenvectors of the network adjacency matrix can predict collective states (attractors). However, the overall predictive power depends on details of the network architecture, namely the fraction of positive 3-cycles, in a predictable fashion. Our results are a set of statistical observations, providing a systematic step towards a further theoretical understanding of the role of network eigenvectors in dynamics on graphs.
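The "fraction of positive 3-cycles" mentioned in this abstract can be illustrated with a small sketch. It uses the rule that a directed cycle is positive when it contains an even number of inhibitory links. The edge encoding (+1 for activation, -1 for inhibition), the function names, and the brute-force enumeration are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import permutations
from typing import Dict, Tuple

Edge = Tuple[str, str]


def count_signed_3_cycles(signs: Dict[Edge, int]) -> Tuple[int, int]:
    """Return (positive, negative) counts of directed 3-cycles.

    `signs` maps a directed edge (u, v) to +1 (activation) or -1 (inhibition).
    This encoding is a hypothetical convention chosen for the sketch.
    """
    nodes = sorted({node for edge in signs for node in edge})
    positive = negative = 0
    for a, b, c in permutations(nodes, 3):
        # Each cycle has three rotations; keep only the one that
        # starts at its smallest node so it is counted once.
        if a > b or a > c:
            continue
        cycle = [(a, b), (b, c), (c, a)]
        if all(edge in signs for edge in cycle):
            product = signs[cycle[0]] * signs[cycle[1]] * signs[cycle[2]]
            # An even number of inhibitory links gives a positive product.
            if product > 0:
                positive += 1
            else:
                negative += 1
    return positive, negative


def positive_fraction(positive: int, negative: int) -> float:
    """Fraction of 3-cycles that are positive (NaN if there are none)."""
    total = positive + negative
    return positive / total if total else float("nan")


# Toy network: x -> y -> z -> x has two inhibitory links (positive cycle),
# x -> z -> y -> x has one inhibitory link (negative cycle).
toy_signs = {
    ("x", "y"): 1, ("y", "z"): -1, ("z", "x"): -1,
    ("x", "z"): 1, ("z", "y"): 1, ("y", "x"): -1,
}
pos, neg = count_signed_3_cycles(toy_signs)
```

The enumeration is cubic in the number of nodes, which is adequate for toy graphs; a real regulatory network would likely need a dedicated cycle-enumeration routine.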
... Increasing evidence suggests that gene order within the bacterial chromosome contributes to cellular homeostasis by coordinating key entangled processes such as chromosome structuration, cell cycle, replication, and the expression of genetic information. While genome reshuffling experiments have not yet been performed in bacterial systems (79), approaches such as comparative genomics (17,18,21,23,31,22), systems biology (20), large DNA inversions (80–82), and relocation of individual gene sets (16,19,83–85) have provided support to this notion. Genes encoding the genetic information flow are interesting models to test the role of genomic position on cellular physiology. ...
Article
Full-text available
It is unclear how gene order within the chromosome influences genome evolution. Bacteria cluster transcription and translation genes close to the replication origin (oriC). In Vibrio cholerae, relocation of s10-spc-α locus (S10), the major locus of ribosomal protein genes, to ectopic genomic positions shows that its relative distance to the oriC correlates to a reduction in growth rate, fitness, and infectivity. To test the long-term impact of this trait, we evolved 12 populations of V. cholerae strains bearing S10 at an oriC-proximal or an oriC-distal location for 1,000 generations. During the first 250 generations, positive selection was the main force driving mutation. After 1,000 generations, we observed more nonadaptative mutations and hypermutator genotypes. Populations fixed inactivating mutations at many genes linked to virulence: flagellum, chemotaxis, biofilm, and quorum sensing. Throughout the experiment, all populations increased their growth rates. However, those bearing S10 close to oriC remained the fittest, indicating that suppressor mutations cannot compensate for the genomic position of the main ribosomal protein locus. Selection and sequencing of the fastest-growing clones allowed us to characterize mutations inactivating, among other sites, flagellum master regulators. Reintroduction of these mutations into the wild-type context led to a ≈10% growth improvement. In conclusion, the genomic location of ribosomal protein genes conditions the evolutionary trajectory of V. cholerae. While genomic content is highly plastic in prokaryotes, gene order is an underestimated factor that conditions cellular physiology and evolution. A lack of suppression enables artificial gene relocation as a tool for genetic circuit reprogramming. IMPORTANCE The bacterial chromosome harbors several entangled processes such as replication, transcription, DNA repair, and segregation. 
Replication begins bidirectionally at the replication origin (oriC) and proceeds until the terminal region (ter), organizing the genome along the ori-ter axis. Gene order along this axis could link genome structure to cell physiology. Fast-growing bacteria cluster translation genes near oriC. In Vibrio cholerae, moving them away was feasible but at the cost of losing fitness and infectivity. Here, we evolved strains harboring ribosomal genes close to or far from oriC. Growth rate differences persisted after 1,000 generations. No mutation was able to compensate for the growth defect, showing that ribosomal gene location conditions their evolutionary trajectory. Despite the high plasticity of bacterial genomes, evolution has sculpted gene order to optimize the ecological strategy of the microorganism. We observed growth rate improvement throughout the evolution experiment that occurred at the expense of energetically costly processes such as flagellum biosynthesis and virulence-related functions. From the biotechnological point of view, manipulation of gene order enables altering bacterial growth with no escape events.
Preprint
Maintaining germline genome integrity is essential and enormously complex. Hundreds of proteins are involved in DNA replication and proofreading, and hundreds more are mobilized to repair DNA damage [1]. While loss-of-function mutations in any of the genes encoding these proteins might lead to elevated mutation rates, mutator alleles have largely eluded detection in mammals. DNA replication and repair proteins often recognize particular sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of de novo mutations — that is, the frequency of each individual mutation type (C>T, A>G, etc.) — will differ between genomes that harbor either a mutator or wild-type allele at a given locus. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene Mutyh that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs [2, 3]. In this study we developed a new method, called “inter-haplotype distance,” to detect alleles associated with mutation spectrum variation. By applying this approach to mutation data from the BXDs, we confirmed the presence of the germline mutator locus near Mutyh and discovered an additional C>A mutator locus on chromosome 6 that overlaps Ogg1 and Mbd4, two DNA glycosylases involved in base-excision repair [4, 5]. The effect of a chromosome 6 mutator allele depended on the presence of a mutator allele near Mutyh, and BXDs with mutator alleles at both loci had even greater numbers of C>A mutations than those with mutator alleles at either locus alone. Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles, and may be applicable to mutation data from humans and other model organisms.
Article
The inner physicochemical heterogeneity of bacterial cells generates three-dimensional (3D)-dependent variations of resources for effective expression of given chromosomally located genes. This fact has been exploited for adjusting the most favorable parameters for implanting a complex device for optogenetic control of biofilm formation in the soil bacterium Pseudomonas putida. To this end, a DNA segment encoding a superactive variant of the Caulobacter crescentus diguanylate cyclase PleD expressed under the control of the cyanobacterial light-responsive CcaSR system was placed in a mini-Tn5 transposon vector and randomly inserted through the chromosome of wild-type and biofilm-deficient variants of P. putida lacking the wsp gene cluster. This operation delivered a collection of clones covering a whole range of biofilm-building capacities and dynamic ranges in response to green light. Since the phenotypic output of the device depends on a large number of parameters (multiple promoters, RNA stability, translational efficacy, metabolic precursors, protein folding, etc.), we argue that random chromosomal insertions enable sampling the intracellular milieu for an optimal set of resources that deliver a preset phenotypic specification. Results thus support the notion that the context dependency can be exploited as a tool for multiobjective optimization, rather than a foe to be suppressed in Synthetic Biology constructs.
Article
Full-text available
For a coherent response to environmental changes, bacterial evolution has formed a complex transcriptional regulatory system comprising classical DNA binding proteins, sigma factors and modulation of DNA topology. In this study, we investigate replication-induced gene copy numbers - a regulatory concept that, unlike the others, is not based on modulation of promoter activity but on replication dynamics. We show that a large fraction of genes are predominantly affected by transient copy numbers and identify cellular functions and central pathways governed by this mechanism in Escherichia coli. Furthermore, we show quantitatively that the previously observed spatio-temporal expression pattern between different growth phases mainly emerges from transient chromosomal copy numbers. We extend the analysis to the plant pathogen Dickeya dadantii and the biotechnologically relevant organism Vibrio natriegens. The analysis reveals a connection between growth phase dependent gene expression and evolutionary gene migration in these species. A further extension to the bacterial kingdom indicates that chromosome evolution is governed by growth rate related transient copy numbers.
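As a rough illustration of how replication dynamics alone can produce a position-dependent gene dosage, the sketch below uses the classical Cooper-Helmstetter relation for the average copy number of a locus in an exponentially growing population. This model is not named in the abstract and is swapped in here as a standard textbook approximation. The C and D period defaults (40 and 20 minutes) and the function names are illustrative assumptions, not values from the study.

```python
def mean_copy_number(m: float, tau: float,
                     c_period: float = 40.0, d_period: float = 20.0) -> float:
    """Average copies per cell of a locus at fractional position m.

    m = 0 is the replication origin, m = 1 the terminus.
    tau is the doubling time, in the same units as the C and D periods.
    Cooper-Helmstetter relation: 2 ** ((C * (1 - m) + D) / tau).
    The default C and D values are assumed, illustrative numbers.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie between 0 (origin) and 1 (terminus)")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return 2.0 ** ((c_period * (1.0 - m) + d_period) / tau)


def ori_ter_ratio(tau: float, c_period: float = 40.0) -> float:
    """Ratio of origin-proximal to terminus-proximal copy number."""
    return mean_copy_number(0.0, tau, c_period) / mean_copy_number(1.0, tau, c_period)


# Fast growth (tau shorter than C) gives overlapping replication rounds
# and a steeper dosage gradient than slow growth.
fast = ori_ter_ratio(tau=20.0)
slow = ori_ter_ratio(tau=80.0)
```

Under these assumptions the origin-to-terminus dosage ratio depends only on C and the doubling time, which is one simple way to see why origin-proximal genes gain dosage when growth is fast.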
Article
Full-text available
Recent works suggest that bacterial gene order links chromosome structure to cell homeostasis. Comparative genomics showed that, in fast-growing bacteria, ribosomal protein genes (RP) locate near the replication origin (oriC). We recently showed that Vibrio cholerae employs this positional bias as a growth optimization strategy: under fast-growth conditions, multifork replication increases RP dosage and expression. However, RP location may provide advantages in a dosage-independent manner: for example, the physical proximity of the many ribosomal components, in the context of a crowded cytoplasm, may favor ribosome biogenesis. To uncover putative dosage-independent effects, we studied isogenic V. cholerae derivatives in which the major RP locus, S10-spc-α (S10), was relocated to alternative genomic positions. When bacteria grew fast, bacterial fitness was reduced according to the S10 relative distance to oriC. The growth of wild-type V. cholerae could not be improved by additional copies of the locus, suggesting a physiologically optimized genomic location. Slow growth is expected to uncouple RP position from dosage, since multifork replication does not occur. Under these conditions, we detected a fitness impairment when S10 was far from oriC. Deep sequencing followed by marker frequency analysis in the absence of multifork replication revealed an up to 30% S10 dosage reduction associated with its relocation that closely correlated with fitness alterations. Hence, the impact of S10 location goes beyond a growth optimization strategy during feast periods. RP location may be important during the whole life cycle of this pathogen.
Article
Full-text available
Many repetitive sequences in the human genome can adopt conformations that differ from the canonical B-DNA double helix (i.e., non-B DNA), and can impact important biological processes such as DNA replication, transcription, recombination, telomere maintenance, viral integration, transposome activation, DNA damage and repair. Thus, non-B DNA-forming sequences have been implicated in genetic instability and disease development. In this article, we discuss the interactions of non-B DNA with the replication and/or transcription machinery, particularly in disease states (e.g., tumors) that can lead to an abnormal cellular environment, and how such interactions may alter DNA replication and transcription, leading to potential conflicts at non-B DNA regions, and eventually result in genetic instability and human disease.
Article
Full-text available
Although it has become routine to consider DNA in terms of its role as a carrier of genetic information, it is also an important contributor to the control of gene expression. This regulatory principle arises from its structural properties. DNA is maintained in an underwound state in most bacterial cells and this has important implications both for DNA storage in the nucleoid and for the expression of genetic information. Underwinding of the DNA through reduction in its linking number potentially imparts energy to the duplex that is available to drive DNA transactions, such as transcription, replication and recombination. The topological state of DNA also influences its affinity for some DNA binding proteins, especially in DNA sequences that have a high A + T base content. The underwinding of DNA by the ATP-dependent topoisomerase DNA gyrase creates a continuum between metabolic flux, DNA topology and gene expression that underpins the global response of the genome to changes in the intracellular and external environments. These connections describe a fundamental and generalised mechanism affecting global gene expression that underlies the specific control of transcription operating through conventional transcription factors. This mechanism also provides a basal level of control for genes acquired by horizontal DNA transfer, assisting microbial evolution, including the evolution of pathogenic bacteria.
Article
Full-text available
Background Measuring the agreement between a gene expression profile and a known transcriptional regulatory network is an important step in the functional interpretation of bacterial physiological state. In this way, general design principles can be explored. One such interpretive framework is the relationship of digital control, that is, the impact of sequence-specific interactions, and analog control, i.e., the extent of the influence of chromosomal structure. Methods and Results Here, we present time-resolved gene expression profiles of Escherichia coli’s growth cycle as measured by RNA-seq. We extend methods which have been developed for discrete sets of differentially expressed genes and apply them to the wild type and two mutant time-series for which the global transcriptional regulators fis and hns were inactivated. We test our continuous methods using simulated ‘expression profiles’ generated from random Boolean network dynamics where we observe a clear trade-off between maximum response and level of detail included. In the real time-course expression data, we find strong interdependent changes of digital and analog control during the exponential growth phase and a dominance of analog control during the stationary phase. Conclusions Our investigation puts forward a simple and reliable method for quantifying the match between time-resolved gene expression profiles and a transcriptional regulatory network. The method reveals a systematic compensatory interplay of digital and analog control in the genetic regulation of E. coli’s growth cycle.
Article
Full-text available
Knowledge of global regulatory networks has been exploited to rewire the gene control programmes of the model bacterium Salmonella enterica serovar Typhimurium. The product is an organism with competitive fitness that is superior to that of the wild type but tuneable under specific growth conditions. The paralogous hns and stpA global regulatory genes are located in distinct regions of the chromosome and control hundreds of target genes, many of which contribute to stress resistance. The locations of the hns and stpA open reading frames were exchanged reciprocally, each acquiring the transcription control signals of the other. The new strain had none of the compensatory mutations normally associated with alterations to hns expression in Salmonella; instead it displayed rescheduled expression of the stress and stationary phase sigma factor RpoS and its regulon. Thus the expression patterns of global regulators can be adjusted artificially to manipulate microbial physiology, creating a new and resilient organism.
Article
Full-text available
COLOMBOS is a database that integrates publicly available transcriptomics data for several prokaryotic model organisms. Compared to the previous version it has more than doubled in size, both in terms of species and data available. The manually curated condition annotation has been overhauled as well, giving more complete information about samples’ experimental conditions and their differences. Functionality-wise, cross-species analyses now enable users to analyse expression data for all species simultaneously, and identify candidate genes with evolutionarily conserved expression behaviour. All the expression-based query tools have undergone a substantial improvement, overcoming the limit of enforced co-expression data retrieval and instead enabling the return of more complex patterns of expression behaviour. COLOMBOS is freely available through a web application at http://colombos.net/. The complete database is also accessible via REST API or downloadable as tab-delimited text files.
Article
Usually complex networks are studied as graphs consisting of nodes whose spatial arrangement is of no significance. Several real biological networks are, however, embedded in space. In this paper we study the transcription regulatory network (TRN) of E. coli as a spatially embedded network. The embedding space of this network is the circular E. coli chromosome, i.e. it is practically one dimensional. However, the TRN itself is a high-dimensional network due to the existence of an adequate number of long-range connections. We find that nodes at short topological distance l = 1, 2 tend, on average, to be at shorter spatial distances r, indicating an abundance of short-range connections as well. Community analysis of the TRN reveals the interesting fact that highly interconnected subnets consist of nodes that tend to be in spatial proximity on the circular chromosome. We also find indications that for certain transcriptional aspects of E. coli it is advantageous to treat the circular genome as two line segments starting from the OriC and ending at Ter.
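The two notions of position used in this abstract, spatial distance on a circular chromosome and position along one of two segments that start at the origin, can be sketched as follows. The function names, the coordinate convention, and the simplification that the two arms meet exactly opposite the origin are assumptions for illustration. In a real genome the terminus need not lie at the exact midpoint.

```python
from typing import Tuple


def circular_distance(pos_a: int, pos_b: int, genome_length: int) -> int:
    """Shortest distance (in base pairs) between two positions on a circle."""
    d = abs(pos_a - pos_b) % genome_length
    return min(d, genome_length - d)


def ori_relative_position(pos: int, ori: int, genome_length: int) -> Tuple[str, int]:
    """Map a position to (arm, distance from the origin).

    The circle is cut into two line segments that both start at `ori`.
    "right" is the arm of increasing coordinate, "left" the other one.
    Assumption: the arms meet at the point opposite the origin.
    """
    offset = (pos - ori) % genome_length
    if offset <= genome_length - offset:
        return "right", offset
    return "left", genome_length - offset


# Toy chromosome of 1000 bp with the origin at coordinate 900.
wrap_around = circular_distance(10, 990, 1000)
arm_a = ori_relative_position(100, 900, 1000)
arm_b = ori_relative_position(700, 900, 1000)
```

In this toy example the genes at coordinates 100 and 700 are equally far from the origin but sit on opposite arms, which is the kind of symmetry the two-segment view makes explicit.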
Book
Recent technological advances have enabled comprehensive determination of the molecular composition of living cells. The chemical interactions between many of these molecules are known, giving rise to genome-scale reconstructed biochemical reaction networks underlying cellular functions. Mathematical descriptions of the totality of these chemical interactions lead to genome-scale models that allow the computation of physiological functions. Reflecting these recent developments, this textbook explains how such quantitative and computable genotype-phenotype relationships are built using a genome-wide basis of information about the gene portfolio of a target organism. It describes how biological knowledge is assembled to reconstruct biochemical reaction networks, the formulation of computational models of biological functions, and how these models can be used to address key biological questions and enable predictive biology. Developed through extensive classroom use, the book is designed to provide students with a solid conceptual framework and an invaluable set of modeling tools and computational approaches.