
Functional information: Molecular messages

Jack W. Szostak
A quantitative means of comparing the functional abilities of different biopolymers would allow us to dissect out differences and to discern their origins.

In this age of genome sequencing, the idea that biopolymer sequences are a type of molecularly coded information is well established. We are all familiar with the idea that it is the sequence of the nucleotides or amino acids making up DNA, RNA or protein molecules that determines their structure and function. But the recent deluge of phylogenetic sequence data provides thousands of examples of related but different sequences encoding essentially identical structures and functions. More radical are the accumulating examples of both RNA and protein molecules with entirely different structures but similar biochemical functions (for example, various structurally distinct protease enzymes have been identified). Such examples raise important questions about the nature of the information content of biological sequences. How best can we define and quantify the information content of biopolymer sequences?

The information content of biopolymers is usually thought of in terms of the amount of information required to specify a unique sequence or structure. This viewpoint derives from classical information theory, which does not consider the meaning of a message, defining the information content of a string of symbols as simply that required to specify, store or transmit the string. Thus, the unannotated human genome sequence can be encoded in a 750-megabyte file, but this could be greatly reduced in size by the application of standard data-compression techniques to account for internal repetitions.
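
The compression point is easy to demonstrate: a highly repetitive sequence compresses to a tiny fraction of its raw size, while a random sequence of the same length barely compresses at all. The sketch below (an illustration added here, not part of the original essay) uses zlib as a rough stand-in for ideal compression; the exact byte counts depend on the compressor.

```python
import random
import zlib

random.seed(0)
repetitive = b"ACGT" * 250_000                     # 1 MB of pure repeats
random_seq = bytes(random.choice(b"ACGT") for _ in range(1_000_000))

# The repetitive megabyte compresses to a few kilobytes; the random one
# stays near the ~2 bits-per-base entropy floor (roughly 250 KB).
print(len(zlib.compress(repetitive, 9)))
print(len(zlib.compress(random_seq, 9)))
```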

Approaches such as algorithmic complexity further define the amount of information needed to specify sequences with internal order or structure, but fail to account for the redundancy inherent in the fact that many related sequences are structurally and functionally equivalent. This objection is dealt with by physical complexity, a rigorously defined measure of the information content of such degenerate sequences, which is based on functional criteria and is measured by comparing alignable sequences that encode functionally equivalent structures. But different molecular structures may be functionally equivalent. A new measure of information — functional information — is required to account for all possible sequences that could potentially carry out an equivalent biochemical function, independent of the structure or mechanism used.

By analogy with classical information, functional information is simply −log₂ of the probability that a random sequence will encode a molecule with greater than any given degree of function. For RNA sequences of length n, that fraction could vary from 4⁻ⁿ if only a single sequence is active, to 1 if all sequences are active. The corresponding functional-information content would vary from 2n (the amount needed to specify a given random RNA sequence) to 0 bits. As an example, the probability that a random RNA sequence of 70 nucleotides will bind ATP with micromolar affinity has been experimentally determined to be about 10⁻¹¹. This corresponds to a functional-information content of about 37 bits, compared with 140 bits to specify a unique 70-mer sequence. If there are multiple sequences with a given activity, then the corresponding functional information will always be less than the amount of information required to specify any particular sequence. It is important to note that functional information is not a property of any one molecule, but of the ensemble of all possible sequences, ranked by activity.
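
The numbers in this example follow directly from the definition. The short sketch below (added here for illustration; the function name is mine) simply evaluates −log₂ p for the two probabilities quoted above.

```python
import math

def functional_information(p: float) -> float:
    """Functional information in bits: -log2 of the probability that a
    random sequence meets or exceeds the chosen level of function."""
    return -math.log2(p)

# ATP-binding 70-mer example from the text: p ~ 10^-11.
print(functional_information(1e-11))       # ~36.5, i.e. about 37 bits
# A unique 70-mer: p = 4^-70, giving the full 2n = 140 bits.
print(functional_information(4.0 ** -70))  # 140.0
```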

Imagine a pile of DNA, RNA or protein molecules of all possible sequences, sorted by activity with the most active at the top. A horizontal plane through the pile indicates a given level of activity; as this rises, fewer sequences remain above it. The functional information required to specify that activity is −log₂ of the fraction of sequences above the plane. Expressing this fraction in terms of information provides a straightforward, quantitative measure of the difficulty of a task. More information is required to specify molecules that carry out difficult tasks, such as high-affinity binding or the rapid catalysis of chemical reactions with high energy barriers, than is needed to specify weak binders or slow catalysts. But precisely how much more functional information is required to specify a given increase in activity is unknown. If the mechanisms involved in improving activity are similar over a wide range of activities, then power-law behaviour might be expected. Alternatively, if it becomes progressively harder to improve activity as activity increases, then exponential behaviour may be seen. An interesting question is whether the relationship between functional information and activity will be similar in many different systems, suggesting that common principles are at work, or whether each case will be unique.

[Figure: Increasing activity implies fewer sequences and greater functional-information content.]
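
The pile picture translates directly into a computation over an ensemble of measured activities. The sketch below makes it concrete; the toy activity distribution is invented purely for illustration, and only the −log₂-of-the-fraction rule comes from the text.

```python
import math
import random

def functional_information_at(threshold: float, activities: list[float]) -> float:
    """Return -log2 of the fraction of sequences whose activity is at or
    above the threshold: the fraction of the 'pile' above the plane."""
    above = sum(1 for a in activities if a >= threshold)
    if above == 0:
        raise ValueError("no sequence reaches this activity level")
    return -math.log2(above / len(activities))

# Toy ensemble: invented activity scores standing in for measured function.
random.seed(0)
activities = [random.random() ** 4 for _ in range(100_000)]

# Raising the activity plane leaves fewer sequences above it, so the
# functional information required to specify that activity increases.
for theta in (0.1, 0.5, 0.9):
    print(f"theta={theta}: {functional_information_at(theta, activities):.2f} bits")
```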

The challenge in determining experimentally the relationship between functional information and activity is the extreme rarity of functional sequences in populations of random sequences (typically 10⁻¹⁰ to 10⁻¹⁵ for aptamers and ribozymes isolated from random RNA pools). In vitro selection and amplification allow the isolation of rare functional sequences from a large initial pool of random sequences. Unfortunately, the original distribution of functional molecules can be obscured by biases in replication and selection efficiency that accumulate over cycles of enrichment. A radically different approach would be to apply the new single-molecule fluorescence methods to the direct analysis of large sets of random sequences. Such experiments might ultimately allow us to understand why proteins have taken over so much of biochemical function from RNA, and they might also serve to guide and interpret the results of experiments in which new nucleotides or amino acids are used to expand the genetic code as we search for molecules even better than those supplied by nature.
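
A toy simulation (added here for illustration; the 1.5-fold advantage and ten cycles are invented numbers) shows how quickly even a modest replication bias can distort the apparent distribution:

```python
# Two sequence families with equal true frequency in the starting pool,
# but one replicates 1.5x more efficiently per enrichment cycle.
initial = {"family_A": 1.0, "family_B": 1.0}          # equal true abundance
per_cycle_gain = {"family_A": 1.0, "family_B": 1.5}   # hypothetical bias

abundance = dict(initial)
for cycle in range(10):
    abundance = {f: n * per_cycle_gain[f] for f, n in abundance.items()}

total = sum(abundance.values())
for family, n in abundance.items():
    # After 10 cycles family_B is ~98% of the pool, so the selected pool
    # no longer reflects the original distribution of functional molecules.
    print(family, round(n / total, 3))
```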

Jack W. Szostak is in the Howard Hughes Medical Institute and Department of Molecular Biology, Massachusetts General Hospital, Boston, Massachusetts 02114-2696, USA.

FURTHER READING
Hamming, R. W. Coding and Information Theory (Prentice-Hall, Englewood Cliffs, New Jersey, 1987).
Zurek, W. H. in Studies in the Sciences of Complexity Vol. 8 (ed. Pines, D.) (Addison-Wesley, Reading, Massachusetts, 1991).
Adami, C. & Cerf, N. J. Physica D 137, 62–69 (2000).
Wilson, D. S. & Szostak, J. W. Annu. Rev. Biochem. 68, 611–648 (1999).