Engineering genes for predictable protein expression
Claes Gustafsson, Jeremy Minshull, Sridhar Govindarajan, Jon Ness, Alan Villalobos, Mark Welch⇑
DNA2.0 Inc., 1140 O’Brien Drive, Suite A, Menlo Park, CA 94025, USA
Article info
Received 28 December 2011
and in revised form 27 February 2012
Available online 8 March 2012
Heterologous protein expression
Abstract
The DNA sequence used to encode a polypeptide can have dramatic effects on its expression. Lack of read-
ily available tools has until recently inhibited meaningful experimental investigation of this phenome-
non. Advances in synthetic biology and the application of modern engineering approaches now
provide the tools for systematic analysis of the sequence variables affecting heterologous expression of
recombinant proteins. We here discuss how these new tools are being applied and how they circumvent
the constraints of previous approaches, highlighting some of the surprising and promising results emerg-
ing from the developing field of gene engineering.
© 2012 Elsevier Inc. All rights reserved.
Contents
Introduction
Systematic biological engineering
  Strain engineering
  Vector engineering
  Open reading frame engineering
Gene design variables
Variables for ORF engineering
  Local variables for ORF engineering
  Global variables for ORF engineering
Recent experiments with coding sequence variables
The future of coding sequence engineering
Acknowledgments
References
Introduction
Low cost production of proteins in heterologous hosts is a
Enzyme-catalyzed industrial processes are increasingly common
in applications ranging from food processing to manufacture of
small molecule pharmaceuticals. The modern molecular biology
toolbox itself consists largely of heterologously expressed DNA
modifying enzymes. Even manufacturers of high value protein
therapeutics such as insulin and monoclonal antibodies are sensi-
tive to the costs of making protein, particularly as patents expire.
The fermentation and purification steps of the protein produc-
tion process are reasonably well understood and optimized by
chemical process engineering approaches in which influential
parameters are varied systematically and the results used to build
ever improving processes. The resulting statistical models relate
critical input variables such as media components and tempera-
ture to system performance and productivity. Robust and accurate
models are useful in revealing system limitations, allowing engineers
to increase quality, throughput and efficiency while simultaneously
mitigating risk and reducing costs. In contrast to the rich
knowledge of variables affecting fermentation and purification,
relatively little effort has been made to optimize the genetic components
of the expression system, and that effort is usually directed
towards a few defined regulatory elements rather than to the
coding sequence of the gene itself.
⇑ Corresponding author. Fax: +1 650 618 2697. E-mail address: mwelch@DNA20.com (M. Welch).
Protein Expression and Purification 83 (2012) 37–46
The variables within the open reading frame of a gene that af-
fect its expression are numerous and often interdependent. The
relative frequencies of different codons used to represent each
amino acid, the propensity of the 5′-end of the mRNA to fold into
stable secondary structures, the fraction of the mRNA composed
of G and C bases, the presence of cryptic transcriptional terminators,
and other features have been proposed as primary determinants of
a gene’s expression. Until recently researchers have been limited
either to using naturally occurring sequences, or to making a single
new version in which an alternative DNA sequence is used to en-
code the same amino acid sequence, so evidence to support any
hypothesis has been sparse, anecdotal and biased in favor of exper-
iments which yielded positive (and therefore publishable) data.
In the last few years, researchers have used hypothesis-driven
sequence design and testing, as well as techniques of gene synthe-
sis and statistical analysis to begin exploring the factors within the
coding region of a gene that affect its expression. In this review we
summarize the current state of understanding of these factors, and
describe experimental techniques that are refining this under-
standing and identifying new factors that control gene expression.
We illustrate how modern synthetic biology tools can be used to
make a gene’s nucleotide sequence a parameter that can be tuned
for expression as easily as dissolved oxygen concentration or pH is
tuned for fermentation today.
Systematic biological engineering
Maximizing heterologous protein expression is, in principle, a
classic multidimensional optimization problem. Multivariate optimization
is well established in fields outside biotechnology: examples
include such diverse non-biological applications as the design
of text on credit card offers, player drafting strategy for major
league baseball, and predicting customers’ movie rental preferences.
Examples in biological sciences include prediction of
therapy efficacy from genomic data, identifying genes that
functionally interact from microarray data, and QSAR small
molecule drug design. As different as these applications are,
they have all been successfully approached and optimized using
mathematical tools and approaches similar to those we describe
here for gene engineering.
Fig. 1 illustrates some of the variables that affect protein expres-
sion, and the levels within the system at which they operate. To
build a statistical model that can predict the behavior of the
system, input variables (here, rows in Fig. 1) are related to output
variables (here, protein expression). Fundamental considerations
in model building are described in the text box. Mechanistic under-
standing of the system can aid in variable selection, but it is not
required to build predictive models. Instead the method requires
the ability to experimentally make different combinations of vari-
ables, measure the performance of the system in each case, and
evaluate the relative contribution of each variable to system performance.
For any multidimensional optimization problem, success de-
pends on identifying the relevant variables to alter, and on effi-
ciently sampling the variables such that their individual effects
to select variables and their ranges to explore. To combine variables
most efficiently it is standard practice in engineering disciplines to
employ process optimization tools such as Design of Experiment
(DoE)¹, Quality by Design (QbD), Lean Manufacturing and Six Sigma.
These tools provide a framework for experimentation where several
variables and variable combinations can be simultaneously
tested. Multivariate analysis methods can then be used to identify
the effect of each on the performance of the system.
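The logic of testing several variables at once and then separating their effects can be sketched in a few lines of Python. This is a minimal illustration, not a method from any of the cited studies: the three factor names are hypothetical, and the responses are generated from an invented noise-free linear rule so that the recovered effects can be checked by hand.

```python
from itertools import product

# Hypothetical two-level factors, coded -1 (low) and +1 (high).
factors = ["temperature", "inducer", "promoter"]

# Full factorial design: every combination of levels (2**3 = 8 runs).
design = list(product([-1, 1], repeat=len(factors)))

# Invented responses for illustration only, generated from
# y = 10 + 2*temperature - 1*inducer + 3*promoter (no noise).
responses = [10 + 2 * t - 1 * i + 3 * p for t, i, p in design]


def main_effect(column):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for run, y in zip(design, responses) if run[column] == 1]
    low = [y for run, y in zip(design, responses) if run[column] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


effects = {name: main_effect(k) for k, name in enumerate(factors)}
print(effects)
```

Because the design is balanced, each main effect equals twice the corresponding coefficient in the generating rule. With real, noisy measurements the same contrast gives an estimate rather than an exact value, and fractional designs would typically replace the full factorial as the number of variables grows.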
Although this kind of systematic multivariate optimization is
has instead been driven primarily by trial and error, usually with
relatively little sampling of the potential variables that could im-
pact productivity. One early exception is a systematic DoE study
of the impact of promoter sequence substitutions which identified
information was then used to engineer new and stronger promoters.
The widespread adoption of de novo gene synthesis as a source
of genetic constructs, and the exponential growth of known natural
sequences are stimulating more systematic bioengineering ap-
proaches. The following sections briefly summarize developments
in each of the bioengineering domains.
Strain engineering
We here define strain engineering as the modification or aug-
mentation of host genotype to confer some improved property
(e.g., growth rate, carbon source utilization, internal cellular envi-
ronment) that improves protein expression productivity. Early
strain improvement technologies consisted of whole genome
mutagenesis and selection for desired phenotypes. More recently,
with the availability of genome sequence data for many expression
hosts and increased knowledge of gene functions, strain engineer-
ing based on mechanistic understanding has become increasingly
In parallel to rational strain engineering, successful efforts have
been made to alter hosts using libraries of genes, gene fragments or
engineered proteins that are screened for their ability to confer an
improved phenotype. The Stephanopoulos lab at MIT has developed
a process to introduce libraries of mutagenized RNA polymerase
sigma or RNA polymerase alpha factors resulting in Escherichia coli
strains with altered expression profiles [9,10]. Similar work has also
been performed in the Harvard lab of George Church using
oligonucleotide-directed mutations and artificial zinc fingers.
Mammalian cell lines have been engineered using RNAi molecules
to alter the protein production pathways and secretion efficiency
for therapeutic proteins [13,14]. Recent development of TALE nucleases
and zinc fingers allows for precise and efficient editing of
genomes. Full genome sequence data for Chinese hamster
ovary (CHO) cells and Pichia pastoris has recently become
available and should accelerate the use of such editing technologies
in these platform expression hosts. These genome editing technologies
so far have been primarily applied for purposes other than
optimization of protein production, but they offer a tremendous
opportunity to systematically explore strain modifications for protein
expression optimization. An approach of sampling combinations
of various genetic modifications for improved expression of
representative recombinant proteins should yield novel production
Fig. 1. Bioengineering variables in protein manufacturing. Expression of recombinant
proteins is a multivariate optimization problem where each module (fermentation,
strain etc.) is defined by a distinct set of variables. Examples of variables are
listed below each module. Variables can be quantitative (e.g., temperature) or
qualitative (e.g., replication origin).
¹ Abbreviations used: DoE, Design of Experiment; QbD, Quality by Design; CHO,
Chinese hamster ovary; GST, glutathione transferase; MBP, maltose binding protein;
Trx, thioredoxin; ORF, open reading frame; CAI, Codon Adaptation Index; tAI, tRNA
Adaptation Index; eGFP, enhanced green fluorescent protein; scFv, single chain
antibody; NN, Neural Nets; SVM, Support Vector Machine.
Vector engineering
Most protein expression vectors employed today are an assemblage
of natural genetic parts chosen from a limited pool of available,
convenient control elements (including
translation initiation signals) and propagation elements (selection
markers, replication origins). These vectors have not been system-
atically optimized for production – they are simply functional con-
structions that have been employed for lack of better options or
because of historical precedent. There is great potential for system-
atic engineering not simply of individual vector parts but of part
combinations to optimize vector properties.
New promoters, replication origins, cloning sites and selection
tags now appear almost daily from academic labs and companies
such as New England Biolabs, Promega and Life Technologies. Efforts
are ongoing by Biofab (www.biofab.org), Addgene (www.addgene.org)
and others to catalog and standardize these genetic vector
elements and organize distribution processes. Solubility tags
such as glutathione transferase (GST), maltose binding protein
(MBP) and thioredoxin (Trx) are often used independently or in
combination with purification and detection tags such as His-,
Strep- and FLAG-tags. The abundance of convenient fusion tags allows
for multiple options when assessing solubility, stability,
folding and expression issues.
While the number of potentially useful vector elements is
expanding, more work is needed to optimize these elements as
well as combinations of elements in vectors for heterologous pro-
tein expression. With increasingly efficient de novo DNA synthesis
[19,20], and computational tools that facilitate the control and de-
sign of synthetic gene constructs, vector elements can be stored
and distributed as virtual information, to be incorporated indepen-
dently into designed synthetic constructs based on their perfor-
mance rather than having to compromise for the best available
combination of genetic elements found in the lab freezer with
the appropriate cloning sites.
Open reading frame engineering
Synonymous substitutions in the open reading frame (ORF) that
encodes a protein encompass the least well understood set of vari-
ables: they are potentially numerous, and they are intimately
intertwined with the protein they encode. Thus, while it is exper-
imentally straightforward to take a natural promoter and test its
ability to drive transcription of a wide variety of other genes, or
to take a replication origin and test its ability to replicate in a vari-
ety of host organisms, until recently it has been much more diffi-
cult to recode an open reading frame using synonymous
substitutions and test the effect in a meaningful way.
Early heterologous protein expression efforts (such as the pioneering
work by Genentech in 1977 to express human growth hormone
in bacteria) relied on complete gene design and synthesis
since only the amino acid sequence of the human hormone was
known at the time. With the advent of lambda gt11 cloning
and PCR cloning in the late 1980s, gene engineering primarily involved
minor alterations such as the introduction or elimination of
restriction sites in natural sequences to facilitate cloning. With
increased knowledge of genome sequences, biases in codon usage
and other sequence features became evident and provided ratio-
nale for gene sequence modification to improve protein expres-
sion. Over the past few decades, observed natural sequence
biases have been the primary basis for gene engineering ap-
proaches. Such approaches have occasionally proven to be useful
for boosting protein expression over that of recalcitrant wild-type
genes, in some cases by several orders of magnitude. However, the
benefits of such approaches have been inconsistent and, in some
cases, such “optimization” has been shown to worsen protein expression.
It should not be surprising that copying natural biases does not
consistently result in good heterologous protein expression. It is
important to recognize that natural genes have evolved in the con-
text of natural selection. Accordingly, the evolved features and
biases seen in natural gene sequences may not reflect a host’s pref-
erences for high level production of a heterologous protein. Suc-
cessful overexpression in some cases can reach levels over 30% of
the total cellular protein – orders of magnitude higher expression
levels than the most abundant natural proteins. Heterologous
overexpression in E. coli can overburden cell metabolism, leading
to pleiotropic effects that may significantly reduce protein yield
[24–27]. Selective pressures on naturally evolved coding sequences
are likely complex and include interactions with the entire system
in which the genes have evolved: transposon resistance, mRNA
processing, RNAi regulation and assembly into higher order DNA
structures are all intertwined with the protein coding information.
Gene design variables
There are many ways a gene’s sequence can influence protein
expression, including effects on mRNA levels, translation velocity,
efficiency of initiation, and rates of charged tRNA consumption.
Each of these different influences is exerted through distinct inter-
actions with cellular machinery and each can be associated with
different types and combinations of sequence elements. For exam-
ple, AT-rich sequences within the gene could cause premature
transcriptional termination and reduced mRNA levels, mRNA
structure near the ribosome binding site could reduce translational
initiation, and rare codon clusters within the mRNA could cause
translational pausing. All of these sequence elements would have
the observable effect of reducing protein expression. Variables in
the context of biological engineering therefore can range from sin-
gle base pair changes to combinatorial variation of bases in an ele-
ment to replacing large sequence elements in toto. The broad range
of testable variables for gene engineering puts a particular empha-
sis on the methods of variable selection and experimental design
when exploring gene sequence-function relationships.
Not only do the possible variables of gene design range in type
and influence, but they also can show high degrees of co-variation
and complex interdependence. Gene sequence variables range
from local sequence motifs to global characteristics such as codon
usage frequency. Any particular substitution in a sequence can be
interpreted as a change in multiple variables. For example, a syn-
onymous substitution of a single codon will also necessarily influ-
ence local and global codon usage frequency, nucleotide frequency,
local and global mRNA structure, and local sequence patterns. If an
effect is seen upon any particular substitution, then the explana-
tion might be related to any one or a combination of the inter-
twined variables. Distinguishing individual variables becomes
even more complicated when several simultaneous substitutions
are made, as is typical in most gene optimization strategies.
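The way a single synonymous substitution shifts several descriptors at once can be illustrated with a short sketch. The two sequences below are invented toy fragments, and the choice of descriptors (GC fraction, codon counts, and the presence of a run of five T bases as an arbitrary example of a local pattern) is an assumption made for illustration only.

```python
from collections import Counter


def codons(seq):
    """Split an in-frame coding sequence into codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def describe(seq):
    """A few sequence descriptors; the selection is illustrative, not exhaustive."""
    return {
        "gc": gc_fraction(seq),
        "codon_counts": Counter(codons(seq)),
        "has_TTTTT": "TTTTT" in seq,  # arbitrary example of a local pattern
    }


# Toy fragments encoding Met-Phe-Leu-Phe-Lys; only the leucine codon differs.
original = "ATGTTTCTGTTTAAA"
variant = "ATGTTTTTGTTTAAA"  # CTG -> TTG, both encode leucine

before, after = describe(original), describe(variant)
print(before)
print(after)
```

One codon change alters the codon counts, lowers the GC fraction, and creates a new local sequence pattern, so an observed change in expression could not be attributed to any one of these descriptors from this pair alone.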
A comprehensive (at the time) collection of 41 published
examples comparing protein expression levels from wild-type
genes versus genes “optimized” by various algorithms was published
in 2004. Since then, the number of published examples
has increased by orders of magnitude. In the majority of published
examples the studies resulted in increased protein expression
level after gene “optimization”, although improvements vary
from none to >1000-fold. Essentially all these studies, however,
are comparisons of only two genes differing in several gene vari-
ables. Because such limited studies cannot distinguish the relative
contribution of the multiple variables actually being sampled, it is
difficult to assess the robustness and applicability of the design
algorithms for the improvement of other genes. Furthermore, it
cannot be demonstrated that the algorithm has achieved an opti-
mal combination of gene variables to result in the best possible
expression level. Indeed, recent studies of the impact of gene de-
sign on heterologous protein expression have shown that some
of the primary tenets of common gene design algorithms do not
significantly explain variance in heterologous protein expression
yield [29–31]. In order to derive truly robust algorithms, system-
atic assessment of all relevant variables is necessary. In the next
sections we describe potential gene engineering variables in more
detail and current evidence for their importance for heterologous
protein production. We then discuss gene design and synthesis
strategies for exploring gene variable space and identifying gene
design principles for optimal expression of recombinant proteins.
Variables for ORF engineering
global variables. Herein, we define local variables to be those vari-
ables which are dependent on local sequence patterns within the
gene. In general, local variables affect protein expression at the level
mRNA secondary structures. In contrast, global variables are mea-
sures of aggregate gene features. Examples include codon usage
frequency and GC%. Some variable types can be expressed as either
local or global. For example, one might consider mRNA structure
natively one might consider mRNA structure globally as average
a global variable such as GC% may be assessed in specific regions of a
gene (e.g., a window of codons at the start of the ORF). This distinc-
tion between local and global classes is somewhat arbitrary, but
serves to illustrate that gene sequence information relevant for pro-
are many ways to parameterize gene sequence variables.
Local variables for ORF engineering
heterologous protein expression is beyond the scope of this article
and previous works have covered this subject comprehensively
[22,32–35]. There are a number of local variables (sequence motifs)
that are considered to be deleterious for heterologous protein
expression and many gene design algorithms seek to exclude these
from gene sequences. This leads to an inherent conflict in the vari-
none of the genes will express, and there will be no data with which
to build a statistical model. However, eliminating too many se-
related gene variables that are actually relevant.
Local variables that are generally considered deleterious include
RNase sites, DNA recombination sites, transcriptional terminators,
and transcription factor recognition sequences. Some motifs de-
pend on the host organism: cryptic splicing and internal polyA sig-
nals could affect genes to be expressed in eukaryotic hosts, while
Shine–Dalgarno-like sequences may affect genes to be expressed
in prokaryotes. Avoiding motifs with relatively ambiguous sequence
composition, such as RNase E sites and polyadenylation
signals, can be particularly problematic as complete removal of
A variable commonly regarded as local and deleterious is the
occurrence in the gene of any codon utilized rarely in the host
organism transcriptome. Such codons are generally decoded by
rare tRNAs and the presumed logic has been that translational
elongation, and thus protein production, may be slowed by overuse
of low-concentration tRNAs. Engineering E. coli strains to overproduce
rare tRNAs has been shown to improve protein expression
from a number of genes with elevated numbers of rare E. coli
codons [39–41]. However, genes with several rare codons have
been shown to express at a high level in E. coli without increased
production of rare tRNAs [28,42]. Further, a number of examples
have been described where the increased presence of rare codons
in the gene correlates with increased (not decreased, as might be
expected) protein expression [43–45]. Overall, studies have ob-
served little correlation between the number of rare codons in a
gene and either the protein expression level or the expression
improvement upon overproduction of rare tRNAs. These
exceptions do not necessarily disprove the idea that rare codons
can limit translation rate or that this limit can be overcome with
increased levels of cognate tRNAs; however, they do show that
the impact of rare codons on protein expression is not fully understood.
One of many possible reasons that the relationship between
rare codon occurrence and protein expression is complex is that sequence
context may be critical. Studies have indicated that the impact
of rare codons is greater when such codons are clustered (i.e.,
when multiple rare codons are used within a span of a few codons).
Clusters of rare AGA or AGG (arginine) codons may cause
ribosome pausing, ribosomal frameshifting, decreased
translational velocity and amino acid mis-incorporation
in E. coli. Also, the impact of the presumed pausing at rare codons
may depend on position within the ORF. The deleterious impact of
rare codons may be more pronounced when found in the initial
few codons of the open reading frame [51,52].
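Clustering and position can be turned into explicit, testable variables with a simple sliding-window scan. The sketch below is one possible parameterization, not a published one: the function name, the window of five codons, and the threshold of two rare codons are all hypothetical choices, and only the AGA/AGG codon set comes from the discussion above.

```python
def rare_codon_clusters(orf, rare=("AGA", "AGG"), window=5, min_rare=2):
    """Return (start_codon_index, count) for every window of `window` codons
    that contains at least `min_rare` codons from `rare`.

    `window` and `min_rare` are illustrative thresholds, not established values.
    """
    codon_list = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
    hits = []
    for start in range(0, max(len(codon_list) - window + 1, 1)):
        count = sum(c in rare for c in codon_list[start:start + window])
        if count >= min_rare:
            hits.append((start, count))
    return hits


# Toy ORF: ATG, three GCT codons, an AGA-AGG pair, then four GCT codons.
toy_orf = "ATG" + "GCT" * 3 + "AGAAGG" + "GCT" * 4
clusters = rare_codon_clusters(toy_orf)
print(clusters)
```

The start index of each hit records where in the ORF the cluster falls, so the same scan can be used to ask whether clusters near the start of the ORF behave differently from clusters elsewhere.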
A number of other context-dependent codon usage variables
have been suggested by previous work. Cannarozzi et al. observed
biases in local codon usage in genomic E. coli sequences that show
a statistical preference for repeated use of the same cognate tRNA
within clusters of like amino acids. The authors suggest that
tRNA diffusion relative to recharging might limit translation efficiency
and tRNA re-use could promote faster elongation. Several
groups have characterized significant biases in codon pair frequencies
in genomes [54,55]. Measurements of in vivo translational step
times suggest that codon pair context can significantly influence
translational elongation rates in E. coli [56,57], although the impact
of codon pair context on heterologous expression yield is unclear
[58,59]. In our studies, we observed no correlation between multiple
parameterizations of codon pair usage and expression in E. coli.
Another contextual factor that may be significant is position
within the ORF with respect to structure of the encoded protein
[60,61]. Recent data suggest that strategic placement of certain
codons might influence folding efficiency of the encoded protein,
presumably by modulating translation rate at critical points in
Predicted structures in the 5′ UTR of genes have been shown to impair
protein expression in eukaryotes. Strong RNA structures
predicted to form in the region of the RBS or within the initial cod-
in prokaryotes, presumably by interfering with ribosomal binding
and translational initiation [29,67]. Study of gene design variables
related to mRNA structure is complicated by the fact that measures
of structure strength currently rely on computational RNA folding
algorithms that are limited in their ability to predict form and
strength of actual mRNA structures in vivo. One problem is that
commonly used programs cannot reliably predict tertiary structure,
which can contribute significantly to overall structure strength and
form. Ribosomes possess an intrinsic helicase activity that allows
translation through even very strong hairpins. Structures in actively
translated messages are repeatedly unwound as ribosomes
progress. The structures that do form would be transient, dynamic
and depend on ribosome density and translation rate as well
as other potential factors. Ribosome pauses at local sequence elements,
such as rare codon clusters, might significantly alter which
structures can form. Emerging tools such as ribosome profiling
and in vivo probing of mRNA secondary structure may
shed light on the formation of mRNA structure in vivo and its rela-
tionship to translational efficiency, helping us to identify more
meaningful descriptors of mRNA structure for protein expression.
We are clearly still lacking a good understanding of the relationship
between predicted mRNA structure and expression level, making
this an important class of gene variable to continue to explore.
Global variables for ORF engineering
All of the local sequence variables discussed above must also be
considered within the context of global variables. The two global
variables most commonly considered in ORF engineering are codon
usage frequency and G + C nucleotide content (GC%) [22,32–35].
The common use of these variables stems from observations of sig-
nificant codon and/or nucleotide bias in genes, or certain classes of
genes in several organisms. As with local variables, global
variables have not been systematically explored, and most current
“optimization” methods are based on weak experimental evidence
or none at all.
Almost all ORF engineering uses design principles that are ta-
ken directly from natural genes within the expression host. In the
simplest form this means just copying the codon usage frequency
or GC% bias of the expression host organism. A variation on this
approach uses the codon usage of genes that are highly expressed
in the expression host. In 1987, Sharp and Li proposed a
quantitative measure of the codon bias in highly expressed genes
which they named the Codon Adaptation Index (CAI). CAI is a
measure of how often ‘favored’ codons are used within a gene.
‘Favored’ codons refer to the most frequent codon for a given
amino acid either across all genes in the host or only within a
subset of naturally highly expressed genes in the host. A gene
with a CAI of 1 uses only the most favored codon throughout
the entire gene. Gene engineering toward a high CAI therefore
dramatically narrows the codon diversity in the gene as the 61
possible codons are reduced to a set of only the 20 most frequent
codons (and indirectly narrows all variables correlating with co-
don bias). A similar parameterization of codon usage called the
tRNA Adaptation Index (tAI) has been proposed where this index
weights codons not by their relative frequencies in the host, but
by their cognate tRNA concentrations, based on tRNA assignment
The rationale for copying host gene codon biases is straightfor-
ward: if the synthetic gene is similar in codon usage to host genes,
it presumably will utilize the host tRNA supply in balance with the
host transcriptome and therefore be well accommodated and ex-
pressed. This is supported by the observation that codon usage fre-
quency bias is generally correlated with cognate tRNA concentration
in the cell [72,75,76]. The primary assumption in this logic is that
total tRNA concentration is static and correlated to translation
elongation rate and that this rate can be expression-limiting.
Although in some hosts, including E. coli, CAI and tAI show signif-
icant correlation with natural gene expression levels, recent stud-
ies do not support these parameterizations of codon usage as
strong determinants of heterologous protein expression level in
E. coli [29–31,45]. This is potentially a case where correlations
observed in nature reflect the co-evolution of host systems and
gene sequences, but may not predict the outcome of the unnatural
conditions and burden of heterologous protein overexpression.
One concern with using variables like the adaptation indices
and GC% is that they are ambiguous composites of other variables.
For example, two very dissimilar genes, one containing a mixture
of high and low-frequency codons and one using exclusively inter-
mediate frequency codons would have the same CAI value. Simi-
larly, the GC% is a composite number that is ambiguous about
the individual G%, C%, A%, and T%. GC% also is ambiguous with re-
spect to codon usage frequency, as it does not measure which ami-
no acids are encoded by codons ending in A/T and which are
encoded by codons ending in G/C. If individual codon frequencies
do have an effect on expression, these ambiguous compressions
can lose significant information about their contributions. In sys-
tematic gene engineering, care must be taken to define variables
unambiguously so that the results are decipherable and true
knowledge is gained.
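The ambiguity of GC% can be seen in a small Python sketch. The two six-base sequences are hypothetical; both encode Leu-Lys and have identical GC%, yet they differ in which amino acid carries the G/C-ending codon. The helper names are illustrative only.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(b) for b in "GC") / len(seq)


def third_base_is_gc(seq):
    """For each codon, True if its third (wobble) base is G or C."""
    seq = seq.upper()
    return [seq[i + 2] in "GC" for i in range(0, len(seq) - len(seq) % 3, 3)]


# Two hypothetical encodings of the same dipeptide (Leu-Lys).
gene_a = "CTGAAA"  # Leu uses a G-ending codon, Lys an A-ending codon
gene_b = "CTAAAG"  # Leu uses an A-ending codon, Lys a G-ending codon

print(gc_percent(gene_a), gc_percent(gene_b))
print(third_base_is_gc(gene_a), third_base_is_gc(gene_b))
```

A model fit on GC% alone could not distinguish these two genes, even though every individual codon choice differs between them.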
We suggest that the most unambiguous global codon usage var-
iable is the frequency with which each codon is used to encode
each amino acid within the protein. If composite variables that
are intertwined with codon usage are to be explored, these should
be unambiguously defined and validated in studies where the rel-
ative impacts of the component variables (i.e., codon usage fre-
quencies and nucleotide frequencies) can be assessed.
Recent experiments with coding sequence variables
The past few years have seen the beginnings of more systematic
experimental testing of the effects of sequence coding variables.
This hypothesis-testing approach is providing data that will force
significant revisions of current design prejudices based merely on
observations of natural systems.
In one study, Kudla et al. measured expression in E. coli of a
library of 154 genes for enhanced green fluorescent protein (eGFP)
that varied in synonymous codon usage. To assess gene variable
preferences for heterologous protein expression, the authors
used a biased randomization approach in which multiple semi-
random oligonucleotides were assembled to form the eGFP coding
sequences. The oligonucleotides used in the library synthesis were
designed to produce global and regional variation in GC% and CAI.
When expressed in E. coli, the eGFP protein levels varied 250-fold
across the library as measured by fluorescent intensity. Based on
multivariate data analysis, the authors concluded that the majority
of the difference in protein expression can be explained by mRNA
folding near the translational start of the eGFP ORF. No significant
correlation was found between protein expression and GC% or CAI.
In 2010, Allert et al. used a designed variation approach to create
285 genes encoding three different proteins (120 + 39 + 126
gene variants). These genes were designed to sample global
codon usage according to CAI and regional GC% based on observed
biases seen in natural E. coli genes. The relative expression levels
were assessed using in vitro E. coli extracts and ranked on a scale
from 0 (no band on a protein gel) through 3 (strong band). The data
were then fit to a function proposed to model expression. The
were primarily targeted for
C. Gustafsson et al./Protein Expression and Purification 83 (2012) 37–46
authors concluded that the majority of the difference in in vitro
expression could be attributed to low GC content and low predicted
mRNA structure in the 5′ end of the ORF.
In a recent study, we used a designed systematic variation ap-
proach outlined in Fig. 2. In these experiments we created two sets
of ~40 gene variants encoding a single chain antibody (scFv) and a
DNA polymerase. The genes were designed using Design of
Experiments methods to minimize co-variation in the design vari-
ables (i.e., maximized orthogonality). Individual codon choice was
variant was equally distant from any other variants as measured in
Hamming distance (the number of sequence differences at the
nucleotide level). Each gene variant was synthesized as an indepen-
dent gene and the relative protein expression measured. When ex-
pressed in E. coli, both the protein expression levels of the scFv and
the DNA polymerase varied over two orders of magnitude. In agreement
with the studies by Kudla et al. and Allert et al., we did not observe
any significant correlation between heterologous protein
expression and CAI. With our different method of codon usage sampling,
we did observe a correlation between heterologous protein expression
and the frequencies of codons used to
encode a subset of amino acids. In contrast to the Kudla and Allert
studies, we did not see any correlation between predicted mRNA structure and expression.
Fig. 2. Gene engineering by systematic sampling and machine learning. (A) The
variables to assess are defined. (B) DoE methods are used to design the
experimental test gene set in silico. (C) Gene synthesis converts each virtual
sequence to the corresponding physical sequence. (D) Gene variants are experi-
mentally tested for the required function and the output (here protein expression)
is quantified. (E) Multivariate analysis of the experimental data is used to determine
the relative contribution of each variable. (F) The resulting model is validated by its
ability to predict expression of new genes. The process may be repeated if necessary
with new or revised variables or ranges until an optimal or adequate solution is
reached. While an example of gene engineering is illustrated in this figure, the
method is applicable to any multivariate problem in bioengineering.
Fig. 3. Experiment-based gene optimization for expression in E. coli. The perfor-
mance of a regression model fit of expression level as a function of codon usage
frequencies is shown. For each gene variant the measured expression level was
plotted against the expression predicted from a partial-least squares (PLS)
regression model (data from ). Prediction of expression is shown for gene
variants for a phi29 DNA polymerase (red squares) and a scFv (‘‘scFv1’’; blue
diamonds). Expression in each of the two gene sets was normalized to the highest
expression level in that set. R^2(CV) indicates the correlation coefficient for the fit of
the model in cross-validation where random subsets of 20% of the data are left out
of the regression. Measured and predicted expression levels for a scFv1 gene using
codon usage that maximizes the codon adaptation index (HiCAI), which was not
included in the model fitting, is indicated with a green triangle. The lower panels
show results of applying codon usage predicted by the model to yield higher
expression (‘‘Exp’’) versus a high CAI bias (‘‘HiCAI’’) for the design of a number of
different genes. Genes were expressed from a strong repressible promoter using
pJExpress vectors (DNA2.0). Transformed BL21 cells were cultured in Luria broth at
37 °C until mid-log growth, induced with 1 mM IPTG and incubated at 30 °C for 4 h.
PAGE analysis was performed on total culture protein from equivalent cell mass.
Gels were stained using Sypro Ruby (Invitrogen) and imaged by UV fluorescence.
However, this may be explained by the fact that the scFv and the polymerase
gene codings showed low propensity to form structure in
Information derived from the systematic analysis described
above has been used to identify coding sequence variables affecting
protein expression. For heterologous expression in E. coli, multivariate
analysis as described herein identified the frequencies of specific
codons for about six amino acids as the variables that correctly
predict the observed differences in expression. It is at this
point not clear what the biochemical basis is for the correlation.
It may reflect a physiological shock to the host cells as they attempt
to synthesize large amounts of a single protein, biasing
the consumption of the aminoacyl-tRNA population: most of the
best codons for high expression correspond to those tRNAs that
have been shown to remain highly charged under starvation con-
ditions [30,77,78]. In all cases to date, expression is highly corre-
lated with codon usage of a subset of codons. Consistent with
Kudla et al. and Allert et al., we do not see a general preference
for codons used at highest frequency in the genome or in the
highly expressed gene subset of the host.
Much further research is needed to fully understand the exper-
imentally-observed relationships between gene features and
expression level; however, the observed correlations can already
serve as the basis for reliable design algorithms as well as providing
direction for gene improvement strategies. Fig. 3 shows how at
DNA2.0 we have applied the results of our studies to improve gene
design for E. coli. A regression model of the impact of codon usage is
used to predict improved codon usage for the design of new genes.
A handful of examples are shown where we have observed dramatic
improvement in expression by applying codon usage predicted
to be improved relative to a high codon adaptation index
bias. We have now employed experiment-based algorithms to optimize
thousands of genes for E. coli and have observed a strong
improvement in reliability and yield versus our previous algorithms
based on genomic data.
In addition to E. coli, we have also employed experiment-based
optimization to improve gene design algorithms for more than a
dozen different protein expression hosts including P. pastoris,
Saccharomyces cerevisiae, Kluyveromyces lactis,
mammalian CHO and HEK293 cells, insect cells, Clostridium spe-
cies, and dicot and monocot plants (unpublished; some results
Fig. 4. Sequence diversity in three synthetic gene engineering libraries. Un-rooted tree generated by neighbor-joining, based on the pairwise Hamming distances (number of
nucleotide substitutions) among synthetic genes in three different kinds of library. (A) Figure S1 from  (Reprinted with permission from AAAS). The library encodes 168
green fluorescent protein genes generated by degenerate oligonucleotide randomization. (B) One of three libraries generated in . This library encodes 120 triose phosphate
isomerase gene variants generated by a rational designed variation scheme. (C) A set of 48 individually designed eGFP gene variants created by systematic codon usage
variation and codon choice randomization. The scale for each tree indicates the correspondence between branch length and proportion of nucleotide sites that differ between
gene sequences. Homology based neighbor-joining trees B and C were generated in MEGA .
Fig. 5. Codon usage correlation for two synthetic gene engineering libraries. Colors indicate the degree to which each pair of codons is correlated in global usage frequency
either positively (red) or negatively (blue) among genes of the library. Library 5B is the same library as 4B and Library 5C is the same library as 4C. See legend of Fig. 4 for library
details. The DNA sequences for gene variants in 4A  were not publicly disclosed, prohibiting inclusion in this comparison. Plots were generated using PLS Toolbox software
(Eigenvector, Inc.) in a Matlab environment.
are available at www.dna20.com). In every case we have seen a
dramatic dependence of expression on synonymous codon usage
that we can capture in gene design algorithms. Based on these re-
sults and those emerging throughout the field, it is clear that
much valuable knowledge can be gained by experimental interro-
gation of the impacts of gene design on expression. We thus ex-
pect a dramatic improvement of gene design algorithms for any
host of interest as methods for host interrogation evolve.
The future of coding sequence engineering
A successful engineering approach to coding sequence design
requires that gene expression can be predicted with a manageable
set of variables. The searchable ORF engineering space is large and
highly correlated. There can be more than a googol (10^100) different
ways to encode a specific protein sequence – clearly far beyond
what can be exhaustively searched by any technology. Still,
solutions within this space appear plentiful, judged by the number
of reported successes of ORF re-coding, although one must
bear in mind the many failures that go unreported because of the
nature of the scientific literature.
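The size of this search space follows from multiplying the number of synonymous codons available at each position. The Python sketch below counts the encodings of a protein under the standard genetic code; the function name and the example peptide are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log10, prod

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons per amino acid (stop codons excluded).
DEGENERACY = dict(Counter(aa for aa in CODON_TABLE.values() if aa != "*"))


def count_encodings(protein):
    """Number of distinct coding sequences for a one-letter protein string."""
    return prod(DEGENERACY[aa] for aa in protein.upper())


# Hypothetical peptide: 43 repeats of Leu-Ser-Arg, each with six codons.
peptide = "LSR" * 43
print(len(peptide), "residues, about 10^%.1f encodings"
      % log10(count_encodings(peptide)))
```

Even this 129-residue example exceeds a googol of possible encodings, so exhaustive testing is out of reach for any protein of typical length.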
The availability of molecular biology tools, particularly of de
novo synthetic genes, has opened the door to a new era of ORF
engineering. Application of experimental design and multivariate
analysis methods commonly utilized in traditional engineering
practices and fermentation can now be efficiently used to exper-
imentally optimize gene variables. Fig. 2 illustrates how such a
‘‘machine learning’’ approach is applied to gene engineering. Pre-
defined values of variables are distributed in a test matrix accord-
ing to Design of Experiments methodology. Each sample is
subsequently synthesized and the corresponding protein expres-
sion is assessed in the relevant biological system. The resulting
data is used to build predictive statistical models and/or the pro-
cess is repeated with another iteration using knowledge gained to
alter the search space (i.e., variables sampled and their value
ranges). See box ‘Building predictive models’ for a more detailed
description of the process.
The efficiency and outcome of the search is highly dependent on
the means by which variation is created in the genes. A Design of
Experiments method seeks to maximize search efficiency by max-
imizing the number of variables assessed per test while minimiz-
ing co-variation between variables within the test set. A key
advantage of this is that the number of test samples required to assess
the variable space is minimized. This can be critical as
many real-world applications are difficult to assess with
high-throughput assays. Minimally the sample set should be larger than
the number of relevant independent variables to explore. In
practice, some degree of over sampling is usually warranted to
overcome experimental error and ‘‘noise’’ from unaccounted for
variables or variable combinations. Thus, to assess the relative con-
tribution of 10 variables, one would want to search using at least
11 but perhaps >30 systematically varied test samples.
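A minimal sketch of such a test matrix is given below in Python. It assumes two-level variables coded as -1/+1 and uses a Sylvester-type Hadamard construction, which is a simpler stand-in for the Plackett–Burman designs mentioned in Box 1: it only yields run counts that are powers of two, whereas Plackett–Burman designs exist for multiples of four. The function name is hypothetical.

```python
def two_level_design(n_variables):
    """Return rows of -1/+1 settings, one row per test sample.

    Every pair of variable columns is orthogonal, so main effects can be
    estimated without co-variation between the variables.
    """
    # Smallest power-of-two run count with enough columns for the variables.
    runs = 2
    while runs - 1 < n_variables:
        runs *= 2
    # Sylvester construction: H(2n) = [[H, H], [H, -H]].
    h = [[1]]
    while len(h) < runs:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    # Drop the all-ones first column and keep one column per variable.
    return [row[1:n_variables + 1] for row in h]


design = two_level_design(10)
print(len(design), "test samples for", len(design[0]), "variables")
for row in design[:4]:
    print(row)
```

For 10 variables this construction gives 16 test samples, which sits between the minimum of 11 and the more comfortable >30 discussed above; replicates or additional designed samples would provide the oversampling.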
In contrast, a random variation approach will be considerably
less efficient. Although such approaches have the advantage that
diversity can be created with little investment in gene design
and synthesis, there are serious limitations which can compromise
experiments and their interpretation. One is that many more indi-
viduals need to be tested when variation is random versus a prop-
erly designed and synthesized test set. For example, for a library to
have a >99.9% probability of sampling all combinations of four
variables with two values each, one would need to test more than
108 variants randomized in these variables, assuming a ‘perfect’
randomized library. Since true randomization is often difficult to
achieve in practice, one usually would want to significantly oversample
to be sure to sample variation that might be underrepresented.
A synthetic library could sample the 2^4 = 16 possible
variants with exactly 16 test genes. Thus, the most efficient way
to search the space is to strategically design and synthesize indi-
vidual test genes (see Figs. 4 and 5).
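The sampling burden of a random library can be estimated with the following Python sketch. It assumes an idealized library in which every variant is drawn uniformly and independently from the possible combinations (sampling with replacement), and uses the standard inclusion-exclusion formula for the probability that every combination has been seen. The function names are hypothetical, and the result depends on these assumptions.

```python
from fractions import Fraction
from math import comb


def prob_all_sampled(n_combinations, n_draws):
    """Probability that n_draws uniform random picks cover every combination."""
    k = n_combinations
    total = sum((-1) ** j * comb(k, j) * Fraction(k - j, k) ** n_draws
                for j in range(k + 1))
    return float(total)


def draws_needed(n_combinations, target):
    """Smallest number of random draws reaching the target coverage probability."""
    n = n_combinations
    while prob_all_sampled(n_combinations, n) < target:
        n += 1
    return n


# Four variables with two values each give 2**4 = 16 combinations.
combos = 2 ** 4
for n in (16, 50, 100, 200):
    print(n, prob_all_sampled(combos, n))
print("draws for 99.9% coverage:", draws_needed(combos, 0.999))
```

Whatever the exact threshold chosen, the random library needs many times the 16 genes that a designed set requires to cover the same combinations.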
Another advantage of systematically designed and synthesized
libraries is full control of the hierarchy of gene variables. Limited
or full randomization of synonymous codon choice at positions in
sample many global gene variables which are averages of many co-
don choices, such as overall codon usage frequency. In silico design
and gene synthesis are not constrained by any gene sequence vari-
able allowing exploration of any variable type and range.
Figs. 4 and 5 illustrate the advantages of using gene design and
synthesis to create variation. Fig. 4 shows how a designed library
can control local sequence variation relative to alternative non-
systematic methods. The library in Fig. 4A was obtained by assem-
bling oligonucleotides that contained degeneracies at the third
‘‘wobble’’ position of each codon. The library in Fig. 4B was the
product of rational designed variation. The biases of these libraries
are clear in the presence of very dense clusters of related se-
quences. This indicates that some areas of sequence space are
being preferentially explored which might limit interpretation of
the results. For contrast, compare the even sequence space coverage
obtained in a set of individually designed genes, shown in Fig. 4C.
Control over global codon frequencies is illustrated in Fig. 5.
Covariation between usage frequencies of different codons is
shown for two different library designs. In the plots shown, each
of the 61 sense codons is represented on the x and y axes, and
covariation in usage between one codon and each other codon
is indicated by coloration of the resultant grid: red squares indi-
cate where two codons covary positively, blue squares show
where they covary negatively. The sea of red and blue seen for
the library in Fig. 5B can be compared with the minimal covaria-
tion seen between codons in a set of genes that are designed to
diversify codon usage and then individually synthesized, shown
in Fig. 5C. In this latter library the areas of intense blue are neg-
ative covariation between codons where there are only two pos-
sibilities: for example CAT and CAC to encode histidine. Here it is
unavoidable that when one codon is used more, the other must
be used less. Covariation in the Fig. 5B library is largely due to
an intentional global bias in CAI and GC%. While such a library
can assess the impact of the particular bias used to vary the
genes, it cannot provide information on other measures of global
codon usage. The lack of covariation in the Fig. 5C library allows
essentially unambiguous assessment of any global codon fre-
quency variable (including CAI).
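The kind of covariation plotted in Fig. 5 can be sketched in Python as follows. The sketch assumes that codon usage is expressed as the fraction of each amino acid's occurrences encoded by a given codon and that covariation is measured as a Pearson correlation across the genes of a library; the function names and the four-gene toy library are hypothetical and are not the libraries shown in the figure.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_fractions(gene):
    """Fraction of each amino acid's occurrences encoded by each codon."""
    gene = gene.upper()
    cods = [gene[i:i + 3] for i in range(0, len(gene) - len(gene) % 3, 3)]
    counts = Counter(cods)
    aa_totals = Counter(CODON_TABLE[c] for c in cods)
    return {c: counts[c] / aa_totals[aa]
            for c, aa in CODON_TABLE.items() if aa_totals[aa]}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def codon_correlation(library, codon_a, codon_b):
    """Correlation in usage of two codons across the genes of a library."""
    fractions = [codon_fractions(g) for g in library]
    return pearson([f.get(codon_a, 0.0) for f in fractions],
                   [f.get(codon_b, 0.0) for f in fractions])


# Toy library: four encodings of His-His-His using CAT and CAC.
library = ["CATCATCAT", "CATCATCAC", "CATCACCAC", "CACCACCAC"]
print(codon_correlation(library, "CAT", "CAC"))
```

With only two codons for histidine, the usage of CAT and CAC is forced to be negatively correlated, which is the unavoidable covariation noted above for the Fig. 5C library.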
Experimental design of sequences, direct synthesis and multi-
variate analysis of the resultant data is the most promising path to-
wards understanding sequence design principles. It is also an
efficient combination of tools for obtaining useful solutions to indi-
vidual expression problems. By interrogating the search space with
high precision and great information efficiency, the number of
experimental samples can be drastically reduced while producing
data that provides the most information regarding the contribution
of individual variables to system performance. The same underlying
approach can be applied to protein engineering [81–83] and
pathway engineering, and we expect it to be equally applicable
to genome engineering. As the complexity and number
of variables in the designed system increases, systematic sampling
will become even more critical for understanding and optimiza-
tion. Understanding the variables and design parameters for re-
coding open reading frames will provide an excellent test case in
this brave new world of synthetic biology.
Box 1. Building predictive models.
A model is an approximation of how a complex system behaves. A
so-called ‘soft model’ describes, to the extent possible, the
relationships of selected input variables to the functional output.
This is in contrast to a ‘hard model’, which is absolute in its
description of a system (e.g., a comprehensive mechanistic model).
In the case of a soft model for protein expression, an input variable
may be, for example, the pH of the medium, and an output variable
may be, for example, protein expression yield. A good model
identifies the input variables that reliably and quantifiably affect
the output variable, and similarly identifies non-contributing
input variables, so that the user knows which variables and
ranges are relevant. Models may be constructed using
a variety of algorithms such as Neural Nets (NN), Support
Vector Machines (SVM), and linear regression. Simulations of
experiments can be used to test the robustness of the model
and to predict the output from new combinations of variables.
Useful simulations can vary from massive Monte Carlo experiments
to analysis of a small subset of predicted solutions.
A crucial part of any modeling process is the scoring function
of variables and whether or not a given model describes a
system accurately. A true scoring function requires making a
real-world experiment where the variables from the model
are combined and the functional outcome measured. This is
usually very expensive and for most datasets impossible to
explore exhaustively. The goal is always to build and validate
a model such that the number of experiments is minimized
while still allowing for sufficient information for robust
and accurate predictions of the system.
Design of Experiments schemes such as Plackett–Burman
or Taguchi orthogonal factorial design are typically
used to design sets of experiments on which the models
are built. These algorithms distribute the input variables so
as to maximize the potential information derived from the
test set. A proper Design of Experiments allows the impacts
of all input variables to be estimated with a minimum number
of test samples.
Significance of the model is often tested by cross-validation.
Typically, the experimentally derived data is split into
two subsets: training data and verification data. The training
data is used to build a model and the prediction of the left-out
verification data is assessed. It can be concluded that the data
supports a predictive model if verification data is accurately
predicted from training data. The true value of a model rests
not only on its fit to the data used to generate it, but also on
its ability to predict behavior outside the evaluated space for
the tested or homologous systems. Lack of agreement between
theoretical models and experimental measurements
is an important tool to identify weak links in the predictability
of a system and can be used to design experiments to refine
and expand the model.
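The cross-validation step described in Box 1 can be sketched in Python as follows. The sketch assumes ordinary least-squares regression, swapped in here for the partial least squares models referred to in Fig. 3 because it can be written without external libraries, and a five-fold split so that 20% of the data is left out of each fit. The function names and the simulated data are hypothetical.

```python
import random


def fit_least_squares(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


def predict(coef, x):
    """Predicted output for one sample from the fitted coefficients."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def cross_validated_r2(X, y, folds=5, seed=0):
    """R^2 of predictions made only for samples left out of the fit."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    held_out = {}
    for k in range(folds):
        test = set(idx[k::folds])
        train = [i for i in idx if i not in test]
        coef = fit_least_squares([X[i] for i in train], [y[i] for i in train])
        for i in test:
            held_out[i] = predict(coef, X[i])
    mean_y = sum(y) / len(y)
    ss_res = sum((y[i] - held_out[i]) ** 2 for i in idx)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


# Simulated data: expression depends on two hypothetical codon frequencies.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(40)]
y = [2.0 + 3.0 * a - 1.0 * b + rng.gauss(0, 0.1) for a, b in X]
print(fit_least_squares(X, y))
print(cross_validated_r2(X, y))
```

A cross-validated R^2 close to the R^2 of the full fit suggests the model is not merely memorizing the training set; a large drop would point to overfitting or to variables that were not sampled.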
This work was supported by NSF SBIR Grant No. 0750206, the
NIH Roadmap Center Grant No. P50 GM073210 and DNA2.0 Grant
No. 1. We thank Josh Silverman, Calysta Biosystems for input on
 C. Fishman, This is a marketing revolution, Fast Company 24 (1999) 204.
 M. Lewis, Moneyball: The Art of Winning an Unfair Game, W. W. Norton, New
 Y. Koren, The BellKor Solution to the Netflix Grand Prize, Netflix, <http://
 X. Huang, W. Pan, S. Park, X. Han, L.W. Miller, J. Hall, Modeling the relationship
between LVAD support time and gene expression changes in the human heart
by penalized partial least squares, Bioinformatics 20 (2004) 888–894.
 S. Datta, Exploring relationships in gene expressions: a partial least squares
approach, Gene expr. 9 (2001) 249–255.
 A. Tropsha, QSAR in drug discovery, in: K.M. Merz, D. Ringe, C.H. Reynolds
(Eds.), Drug Design: Structure- and Ligand-Based Approaches, Cambridge
University Press, Cambridge, 2010, pp. 151–164.
 D.L. Goetsch, S. Davis, Quality Management for Organizational Excellence.
Introduction to Total Quality, Prentice Hall PTR, 2010.
 J. Jonsson, T. Norberg, L. Carlsson, C. Gustafsson, S. Wold, Quantitative
sequence-activity models (QSAM) - tools for sequence design, Nucleic Acids
Res. 21 (1993) 733–739.
 D. Klein-Marcuschamer, C.N. Santos, H. Yu, G. Stephanopoulos, Mutagenesis of
the bacterial RNA polymerase alpha subunit for improvement of complex
phenotypes, Appl. Environ. Microb. 75 (2009) 2705–2711.
 H. Yu, K. Tyo, H. Alper, D. Klein-Marcuschamer, G. Stephanopoulos, A
high-throughput screen for hyaluronic acid accumulation in recombinant
Escherichia coli transformed by libraries of engineered sigma factors,
Biotechnol. Bioeng. 101 (2008) 788–796.
 H.H. Wang, F.J. Isaacs, P.A. Carr, Z.Z. Sun, G. Xu, C.R. Forest, G.M. Church,
Programming cells by multiplex genome engineering and accelerated
evolution, Nature 460 (2009) 894–898.
 K.S. Park, Y.S. Jang, H. Lee, J.S. Kim, Phenotypic alteration and target gene
identification using combinatorial libraries of zinc finger proteins in
prokaryotic cells, J. Bacteriol. 187 (2005) 5496–5499.
 S.C. Wu, RNA interference technology to improve recombinant protein
production in Chinese hamster ovary cells, Biotechnol. Adv. 27 (2009) 417–
 S. Hammond, K.H. Lee, RNA interference of cofilin in Chinese hamster ovary
cells improves recombinant protein productivity, Biotechnol. Bioeng. 109
 A.J. Bogdanove, D.F. Voytas, TAL effectors: customizable proteins for DNA
targeting, Science 333 (2011) 1843–1846.
 S. Hammond, J.C. Swanberg, M. Kaplarevic, K.H. Lee, Genomic sequencing and
analysis of a Chinese hamster ovary cell line using illumina sequencing
technology, BMC Genomics 12 (2011) 67.
 K. De Schutter, Y.C. Lin, P. Tiels, A. Van Hecke, S. Glinka, J. Weber-Lehmann,
P. Rouze, Y. Van de Peer, N. Callewaert, Genome sequence of the
recombinant protein production host Pichia pastoris, Nat. Biotechnol. 27
 D. Walls, S.T. Loughran, Tagging recombinant proteins to enhance solubility
and aid purification, Methods Mol. Biol. 681 (2011) 151–175.
 M. Welch, A. Villalobos, C. Gustafsson, J. Minshull, Designing genes for
successful protein expression, Methods Enzymol. 498 (2011) 43–66.
 A. Villalobos, J.E. Ness, C. Gustafsson, J. Minshull, S. Govindarajan, Gene
designer: a synthetic biology tool for constructing artificial DNA segments,
BMC Bioinformatics 7 (2006) 285.
 K. Itakura, T. Hirose, R. Crea, A.D. Riggs, H.L. Heyneker, F. Bolivar, H.W. Boyer,
Expression in Escherichia coli of a chemically synthesized gene for the hormone
somatostatin, Science 198 (1977) 1056–1063.
 C. Gustafsson, S. Govindarajan, J. Minshull, Codon bias and heterologous
protein expression, Trends Biotechnol. 22 (2004) 346–353.
 S. Pedersen, P.L. Bloch, S. Reeh, F.C. Neidhardt, Patterns of protein synthesis in
E. coli: a catalog of the amount of 140 individual proteins at different growth
rates, Cell 14 (1978) 179–190.
 J. Bonomo, R.T. Gill, Amino acid content of recombinant proteins influences the
metabolic burden response, Biotechnol. Bioeng. 90 (2005) 116–126.
 M. Gong, F. Gong, C. Yanofsky, Overexpression of tnaC of Escherichia coli
inhibits growth by depleting tRNA2Pro availability, J. Bacteriol. 188 (2006)
 S.W. Harcum, Structured model to predict intracellular amino acid shortages
during recombinant protein overexpression in E. coli, J. Biotechnol. 93 (2002)
 M.V. Rojiani, H. Jakubowski, E. Goldman, Relationship between protein
synthesis and concentrations of charged and uncharged tRNATrp in
Escherichia coli, Proc. Natl. Acad. Sci. USA 87 (1990) 1511–1515.
 G. Wu, Y. Zheng, I. Qureshi, H.T. Zin, T. Beck, B. Bulka, S.J. Freeland, SGDB: a
database of synthetic genes re-designed for optimizing protein over-
expression, Nucleic Acids Res. 35 (2007) D76–79.
 G. Kudla, A.W. Murray, D. Tollervey,
determinants of gene expression in Escherichia coli, Science 324 (2009) 255–
 M. Welch, S. Govindarajan, J.E. Ness, A. Villalobos, A. Gurney, J. Minshull, C.
Gustafsson, Design parameters to control synthetic gene expression in
Escherichia coli, PLoS ONE 4 (2009) e7002.
 M. Allert, J.C. Cox, H.W. Hellinga, Multifactorial determinants of protein
expression in prokaryotic open reading frames, J. Mol. Biol. 402 (2010) 905–
 L. Stewart, A.B. Burgin, Whole gene synthesis: a gene-o-matic future, Front.
Drug Des. Discov. 1 (2005) 297–341.
 J.B. Plotkin, G. Kudla, Synonymous but not the same: the causes and
consequences of codon bias, Nat. Rev. Genet. 12 (2011) 32–42.
 M. Welch, A. Villalobos, C. Gustafsson, J. Minshull, You’re one in a googol:
optimizing genes for protein expression, J R Soc. Interfac. 6 (Suppl 4) (2009)
 C. Gustafsson, Tools designed to regulate translational efficiency, in: C.D.
Smolke (Ed.), The Metabolic Pathway Engineering Handbook: Tools and
Applications, CRC Press, 2009, pp. 1–14.
 C.P. Ehretsmann, A.J. Carpousis, H.M. Krisch, Specificity of Escherichia coli
endoribonuclease RNase E: in vivo and in vitro analysis of mutants in a
bacteriophage T4 mRNA processing site, Genes Dev. 6 (1992) 149–159.
 N.J. Proudfoot, Ending the message: poly(A) signals then and now, Genes Dev.
25 (2011) 1770–1782.
 J.F. Kane, Effects of rare codon clusters on high-level expression of
heterologous proteins in Escherichia coli, Curr. Opin. Biotechnol. 6 (1995)
 N.A. Burgess-Brown, S. Sharma, F. Sobott, C. Loenarz, U. Oppermann, O. Gileadi,
Codon optimization can improve expression of human genes in Escherichia
coli: a multi-gene study, Protein Expres. Purif. 59 (2008) 94–102.
 B.J. Del Tito Jr., J.M. Ward, J. Hodgson, C.J. Gershater, H. Edwards, L.A. Wysocki,
F.A. Watson, G. Sathe, J.F. Kane, Effects of a minor isoleucyl tRNA on
heterologous protein translation in Escherichia coli, J. Bacteriol. 177 (1995)
 A.G. Zdanovsky, M.V. Zdanovskaia,
heterologous expression of clostridial proteins, Appl. Environ. Microbiol. 66
 H. Tegel, S. Tourle, J. Ottosson, A. Persson, Increased levels of recombinant
human proteins with the Escherichia coli strain Rosetta(DE3), Protein Expres.
Purif. 69 (2010) 159–167.
 T. Kolmsee, R. Hengge, Rare codons play a positive role in the expression of the
stationary phase sigma factor RpoS (sigmaS) in Escherichia coli, RNA Biol. 8
 E. Angov, C.J. Hillier, R.L. Kincaid, J.A. Lyon, Heterologous protein expression is
enhanced by harmonizing the codon usage frequencies of the target gene with
those of the expression host, PLoS ONE 3 (2008) e2189.
 X. Wu, H. Jörnvall, K.D. Berndt, U. Oppermann, Codon optimization reveals
critical factors for high level expression of two rare codon genes in Escherichia
coli: RNA stability and secondary structure but not tRNA abundance, Biochem.
Biophys. Res. Commun. 313 (2004) 89–96.
 H. Tegel, J. Ottosson, S. Hober, Enhancing the protein production levels in
Escherichia coli with a strong promoter, FEBS J. 278 (2011) 729–739.
 C. Hayes, B. Bose, R. Sauer, Stop codons preceded by rare arginine codons are
efficient determinants of SsrA tagging in Escherichia coli, Proc. Natl. Acad. Sci.
USA 99 (2002) 3440–3445.
 R.A. Spanjaard, J. van Duin, Translation of the sequence AGG-AGG yields 50%
ribosomal frameshift, Proc. Natl. Acad. Sci. USA 85 (1988) 7967–7971.
 M.A. Sorensen, C.G. Kurland, S. Pedersen, Codon usage determines translation
rate in Escherichia coli, J. Mol. Biol. 207 (1989) 365–377.
 E.B. Kramer, P.J. Farabaugh, The frequency of translational misreading errors in
E. coli is largely determined by tRNA competition, RNA 13 (2007) 87–96.
 J.J. Olivares-Trejo, J.G. Bueno-Martinez, G. Guarneros, J. Hernandez-Sanchez,
The pair of arginine codons AGA AGG close to the initiation codon of the
lambda int gene inhibits cell growth and protein synthesis by accumulating
peptidyl-tRNAArg4, Mol. Microbiol. 49 (2003) 1043–1049.
 G.T. Chen, M. Inouye, Role of the AGA/AGG codons, the rarest codons in global
gene expression in Escherichia coli, Genes Dev. 8 (1994) 2641–2652.
 G. Cannarozzi, N.N. Schraudolph, M. Faty, P. von Rohr, M.T. Friberg, A.C. Roth, P.
Gonnet, G. Gonnet, Y. Barral, A role for codon order in translation dynamics,
Cell 141 (2010) 355–367.
 A. Tats, T. Tenson, M. Remm, Preferred and avoided codon pairs in three
domains of life, BMC Genomics 9 (2008) 463.
 S. Boycheva, G. Chkodrov, I. Ivanov, Codon pairs in the genome of Escherichia
coli, Bioinformatics 19 (2003) 987–998.
 L.S. Folley, M. Yarus, Codon contexts from weakly expressed genes reduce
expression in vivo, J. Mol. Biol. 209 (1989) 359–378.
 B. Irwin, J.D. Heck, G.W. Hatfield, Codon pair utilization biases influence
translational elongation step times, J. Biol. Chem. 270 (1995) 22801–22806.
 L. Cheng, E. Goldman, Absence of effect of varying Thr-Leu codon pairs on
protein synthesis in a T7 system, Biochemistry 40 (2001) 6102–6106.
 M. Friberg, P. von Rohr, G. Gonnet, Limitations of codon adaptation index and
other coding DNA-based features for prediction of protein expression in
Saccharomyces cerevisiae, Yeast 21 (2004) 1083–1093.
 E. Angov, Codon usage: nature’s roadmap to expression and folding of proteins,
Biotechnol. J. 6 (2011) 650–659.
 E. Siller, D.C. DeZwaan, J.F. Anderson, B.C. Freeman, J.M. Barral, Slowing
bacterial translation speed enhances eukaryotic protein folding efficiency, J.
Mol. Biol. 396 (2010) 1310–1318.
 M. Widmann, M. Clairo, J. Dippon, J. Pleiss, Analysis of the distribution of
functionally relevant rare codons, BMC Genomics 9 (2008) 207.
 C. Kimchi-Sarfaty, J.M. Oh, I.W. Kim, Z.E. Sauna, A.M. Calcagno, S.V. Ambudkar,
M.M. Gottesman, A ‘‘silent’’ polymorphism in the MDR1 gene changes
substrate specificity, Science 315 (2007) 525–528.
 C.J. Tsai, Z.E. Sauna, C. Kimchi-Sarfaty, S.V. Ambudkar, M.M. Gottesman, R.
Nussinov, Synonymous mutations and ribosome stalling can lead to altered
folding pathways and distinct minima, J. Mol. Biol. 383 (2008) 281–291.
 A.A. Komar, SNPs, silent but not invisible, Science 315 (2007) 466–467.
 M. Kozak, Influences of mRNA secondary structure on initiation by eukaryotic
ribosomes, Proc. Natl. Acad. Sci. USA 83 (1986) 2850–2854.
 S.M. Studer, S. Joseph, Unfolding of mRNA secondary structure by the bacterial
translation initiation complex, Mol. Cell. 22 (2006) 105–115.
 S. Takyar, R.P. Hickerson, H.F. Noller, mRNA helicase activity of the ribosome,
Cell 120 (2005) 49–58.
 J.D. Wen, L. Lancaster, C. Hodges, A.C. Zeri, S.H. Yoshimura, H.F. Noller, C.
Bustamante, I. Tinoco, Following translation by single ribosomes one codon at
a time, Nature 452 (2008) 598–603.
 N.T. Ingolia, S. Ghaemmaghami, J.R. Newman, J.S. Weissman, Genome-wide
analysis in vivo of translation with nucleotide resolution using ribosome
profiling, Science 324 (2009) 218–223.
 R. Mahendran, M.R. Spottswood, D.L. Miller, RNA editing by cytidine insertion
in mitochondria of Physarum polycephalum, Nature 349 (1991) 434–438.
 T. Ikemura, Codon usage and tRNA content in unicellular and multicellular
organisms, Mol. Biol. Evol. 2 (1985) 13–34.
 P.M. Sharp, W.H. Li, The codon adaptation index–a measure of directional
synonymous codon usage bias, and its potential applications, Nucleic Acids
Res. 15 (1987) 1281–1295.
 M. dos Reis, L. Wernisch, R. Savva, Unexpected correlations between gene
expression and codon usage bias from microarray data for the whole
Escherichia coli K-12 genome, Nucleic Acids Res. 31 (2003) 6976–6985.
 H. Dong, L. Nilsson, C.G. Kurland, Co-variation of tRNA abundance and codon
usage in Escherichia coli at different growth rates, J. Mol. Biol. 260 (1996) 649–
 T. Ikemura, Correlation between the abundance of yeast transfer RNAs and the
occurrence of the respective codons in protein genes. Differences in
synonymous codon choice patterns of yeast and Escherichia coli with
reference to the abundance of isoaccepting transfer RNAs, J. Mol. Biol. 158
 K.A. Dittmar, M.A. Sorensen, J. Elf, M. Ehrenberg, T. Pan, Selective charging of
tRNA isoacceptors induced by amino-acid starvation, EMBO Rep. 6 (2005)
 J. Elf, D. Nilsson, T. Tenson, M. Ehrenberg, Selective charging of tRNA
isoacceptors explains patterns of codon usage, Science 300 (2003) 1718–1722.
 R.A. Fisher, Design of Experiments, Oliver and Boyd, London, UK, 1935.
 D.J. Hand, H. Mannila, P. Smyth, Principles of Data Mining, The MIT Press, 2001.
 S. Lutz, Beyond directed evolution–semi-rational protein engineering and
design, Curr. Opin. Biotechnol. 21 (2010) 734–743.
 C. Gustafsson, S. Govindarajan, J. Minshull, Putting engineering back into
protein engineering: bioinformatic approaches to catalyst design, Curr. Opin.
Biotechnol. 14 (2003) 366–370.
 J. Liao, M.K. Warmuth, S. Govindarajan, J.E. Ness, R.P. Wang, C. Gustafsson, J.
Minshull, Engineering proteinase K using machine learning and synthetic
genes, BMC Biotechnol. 7 (2007) 16.
 P.K. Ajikumar, W.H. Xiao, K.E. Tyo, Y. Wang, F. Simeon, E. Leonard, O. Mucha,
T.H. Phon, B. Pfeifer, G. Stephanopoulos, Isoprenoid pathway optimization for
Taxol precursor overproduction in Escherichia coli, Science 330 (2010) 70–74.
 D.G. Gibson, J.I. Glass, C. Lartigue, V.N. Noskov, R.Y. Chuang, M.A. Algire, G.A.
Benders, M.G. Montague, L. Ma, M.M. Moodie, C. Merryman, S. Vashee, R.
Krishnakumar, N. Assad-Garcia, C. Andrews-Pfannkoch, E.A. Denisova, L.
Young, Z.Q. Qi, T.H. Segall-Shapiro, C.H. Calvey, P.P. Parmar, C.A. Hutchison
3rd, H.O. Smith, J.C. Venter, Creation of a bacterial cell controlled by a
chemically synthesized genome, Science 329 (2010) 52–56.
 K. Tamura, D. Peterson, N. Peterson, G. Stecher, M. Nei, S. Kumar, MEGA5:
molecular evolutionary genetics analysis using maximum likelihood,
evolutionary distance, and maximum parsimony methods, Mol. Biol. Evol. 28
 R.L. Plackett, J.P. Burman, Design of optimum multifactorial experiments,
Biometrika 33 (1946) 305–325.
 G.I. Taguchi, S. Konishi, American Supplier Institute, Orthogonal arrays and
linear graphs: tools for quality engineering, Mich, American Supplier Institute,
C. Gustafsson et al./Protein Expression and Purification 83 (2012) 37–46