Designing genes for successful protein expression.
ABSTRACT DNA sequences are now far more readily available in silico than as physical DNA. De novo gene synthesis is an increasingly cost-effective method for building genetic constructs, and effectively removes the constraint of basing constructs on extant sequences. This allows scientists and engineers to experimentally test their hypotheses relating sequence to function. Molecular biologists, and now synthetic biologists, are characterizing and cataloging genetic elements with specific functions, aiming to combine them to perform complex functions. However, the most common purpose of synthetic genes is for the expression of an encoded protein. The huge number of different proteins makes it impossible to characterize and catalog each functional gene. Instead, it is necessary to abstract design principles from experimental data: data that can be generated by making predictions followed by synthesizing sequences to test those predictions. Because of the degeneracy of the genetic code, design of gene sequences to encode proteins is a high-dimensional problem, so there is no single simple formula to guarantee success. Nevertheless, there are several straightforward steps that can be taken to greatly increase the probability that a designed sequence will result in expression of the encoded protein. In this chapter, we discuss gene sequence parameters that are important for protein expression. We also describe algorithms for optimizing these parameters, and troubleshooting procedures that can be helpful when initial attempts fail. Finally, we show how many of these methods can be accomplished using the synthetic biology software tool Gene Designer.
CHAPTER THREE
Designing Genes for Successful Protein Expression
Mark Welch, Alan Villalobos, Claes Gustafsson, and
2. Gene Design Software
3. General Sequence Parameters Affecting Protein Expression
3.1. Initiation of translation
3.2. Codon bias
3.3. mRNA structure and translational elongation
4. Protein-Specific Factors Providing Additional Complexity
4.1. Protein toxicity
4.2. Transmembrane proteins
4.3. cis-Regulatory regions
Methods in Enzymology, Volume 498
ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00003-6
© 2011 Elsevier Inc.
All rights reserved.
DNA2.0, Inc., Suite A, Menlo Park, California, USA
A major objective of synthetic biology is to characterize biological
components with sufficient precision to enable these components to be
combined to produce predictable outcomes. Progress has been made in
defining functional parameters for some elements. Particularly, those with
regulatory functions that act to control transcription (promoters, operators,
repressors, and activators) are now reasonably well characterized (see
http://www.partsregistry.org; Lisser and Margalit, 1993; Peccoud et al., 2008).
However, reaching the ultimate targets of synthetic biology projects will
require the balanced control of both transcription and translation in order to
achieve controlled protein expression, whether those targets are engineered
pathways for producing metabolites, remodeled photosynthesis, or trees
that can turn into houses. Proteins are not necessarily only components of
regulatory networks; they may also be catalysts that interact with cellular
metabolism, structural parts of the cell, or therapeutically active compounds.
Unfortunately, understanding transcriptional regulation is not sufficient to
provide control of protein production.
The characterization of sequences governing translation has proved
challenging. This is largely because translational determinants interact
with, or are embedded within, the sequences that encode the polypeptide.
Consequently, there is not yet a perfectly robust way to convert a virtual
amino acid sequence to a DNA sequence that will, when introduced into a
desired host cell, yield sufficient protein for a specific downstream applica-
tion. Here, we describe recently developed tools and technologies for gene
design, and discuss the heuristic basis of our understanding of particularly
important design features.
Translation can be controlled at the level of initiation and elongation.
Initiation of translation is primarily dependent on the sequence of the
ribosome binding site (RBS) and early mRNA secondary structure (Allert
et al., 2010; Kudla et al., 2009; Salis et al., 2009). Other determinants of
protein expression are less well understood but equally potent. Different
proteins expressed from the same promoter with the same RBS or 5′
untranslated region (UTR) may be expressed at wildly different levels. Even
different ways of encoding the same protein, under otherwise identical
conditions, can result in protein concentrations differing by 100-fold
(Allert et al., 2010; Kudla et al., 2009; Welch et al., 2009b). Understanding
these determinants would greatly enhance our ability to express proteins at
specific desired levels. In the best case, we could hope to use the control
they offer. At the least, it would be helpful if we could eliminate them so
that we could rely on the controls we do understand. Experimental data on
the influence of gene design on heterologous expression are rapidly grow-
ing, and design algorithms derived from these experiments provide both an
increased probability of success in individual projects and a starting point for
2. Gene Design Software
Backtranslation from a polypeptide sequence to obtain a DNA
sequence requires choosing between an enormous number of possibilities
(Welch et al., 2009b). We use the backtranslation module of Gene Designer,
a free software tool (www.dna20.com/genedesigner2), to select sequences
with specific design characteristics. Backtranslation parameters can be
altered by selecting backtranslation profiles from the Configure menu in
the Project Window (see Fig. 3.1). These parameters will be discussed in
more detail in the following sections.
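
To give a sense of how large the backtranslation search space is, the following minimal sketch counts the distinct DNA sequences that encode a given polypeptide under the standard genetic code. It is an illustration only, not part of Gene Designer; the names CODON_DEGENERACY and count_encodings are hypothetical, and the stop codon is ignored.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# (61 sense codons in total). Hypothetical helper table for illustration.
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(protein: str) -> int:
    """Return how many distinct DNA sequences encode `protein`.

    The count is the product of the codon degeneracy at each position;
    the stop codon is not included.
    """
    return prod(CODON_DEGENERACY[aa] for aa in protein.upper())


if __name__ == "__main__":
    # Met-Lys-Leu: 1 * 2 * 6 possible encodings
    print(count_encodings("MKL"))
```

Because the count is a product over positions, it grows exponentially with protein length, which is why backtranslation relies on design rules and search rather than exhaustive enumeration.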
3. General Sequence Parameters Affecting Protein Expression
Evidence that recoding a gene can radically change its expression has
been accumulating over the past two decades (Gustafsson et al., 2004;
Welch et al., 2009b). However, it is only in the last year or two that
experiments have compared the expression of many different individual
genes encoding the same protein. These experiments are finally allowing
hypotheses about the causes of expression differences to be tested.
3.1. Initiation of translation
A key component affecting initiation of translation in prokaryotes is the
RBS that occurs between 5 and 15 bases upstream of the open reading frame
(ORF) AUG start codon. Binding of the ribosome to the Shine–Dalgarno
(SD) sequence within the RBS localizes the ribosome to the initiation codon.
This binding is primarily due to direct base pairing with the anti-SD region
of the 16S rRNA of the small ribosome subunit and can be greatly influenced
by context (Komarova et al., 2005; Lee et al., 1996; Shultzaberger et al., 2001;
Vimberg et al., 2007). Changes in RBS sequences can change expression
levels over more than three orders of magnitude. Affinity of the RBS for the
ribosome is a critical factor controlling the efficiency with which new
polypeptide chains are initiated. This interaction is in competition with
possible base-pairing interactions involving the RBS region that may form
within the mRNA itself. Thus, SD sequences with weaker base pairing to the
ribosome are more susceptible to interference from mRNA structure. However,
SD sequences that pair too strongly with the ribosome can also
be deleterious, particularly at lower temperatures, by stalling initial elongation
(Komarova et al., 2002; Vimberg et al., 2007). Also critical is the distance
between the RBS and the start codon, with 5–7 bases from the consensus SD
AGGAGG being optimal (Chen et al., 1994). Models that factor in competition
between the anti-SD and mRNA structure, as well as start codon spacing, have
been shown to approximate actual translation initiation rates (de Smit and van
Duin, 2003; Na et al., 2010; Salis et al., 2009).

Figure 3.1 Project Window, My Backtranslation Profiles, and Backtranslation Profile Editor.
From the Configure Menu, choose Backtranslation Profiles to open My Backtranslation
Profiles. Then select a profile and click edit (pencil icon) or double click on a profile to
open the Backtranslation Editor. In the editor, you can change parameters related to the
genetic algorithm, codon usage, sequences to avoid, 5′ structure, repeats, and homolo-
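
The spacing rule for the SD sequence and start codon described above can be checked with a short script. The sketch below is only an illustration: it looks for exact matches to the consensus AGGAGG (real SD sequences are often partial matches), and it assumes that spacing is measured as the number of bases between the last base of the SD and the first base of the start codon. The function names sd_spacing and spacing_is_optimal are hypothetical.

```python
import re

CONSENSUS_SD = "AGGAGG"  # consensus Shine-Dalgarno sequence named in the text


def sd_spacing(sequence: str, start_index: int):
    """Find exact consensus SD matches upstream of the start codon.

    `sequence` holds the 5' UTR followed by the ORF; `start_index` is the
    0-based position of the first base of the start codon.
    Returns a list of (sd_position, spacing), where spacing is the number
    of bases between the end of the SD and the start codon (an assumed
    convention for this sketch).
    """
    upstream = sequence[:start_index].upper().replace("U", "T")
    hits = []
    # A lookahead pattern also reports overlapping matches.
    for match in re.finditer(f"(?={CONSENSUS_SD})", upstream):
        spacing = start_index - (match.start() + len(CONSENSUS_SD))
        hits.append((match.start(), spacing))
    return hits


def spacing_is_optimal(spacing: int, low: int = 5, high: int = 7) -> bool:
    """True if the spacing lies in the 5-7 base window cited in the text."""
    return low <= spacing <= high


if __name__ == "__main__":
    # TTT + AGGAGG + six A's + ATG...: the start codon begins at index 15
    for position, spacing in sd_spacing("TTTAGGAGGAAAAAAATGGCT", 15):
        print(position, spacing, spacing_is_optimal(spacing))
```

A check like this says nothing about mRNA structure around the RBS, which the models cited above treat as a competing interaction.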
Much prior work has demonstrated that mRNA structures that occlude
the region of the RBS and/or start codon in genes expressed in prokaryotes
can impair expression (de Smit and van Duin, 1990, 1994; Griswold et al.,
2003; Kozak, 1986; Kudla et al., 2009; Studer and Joseph, 2006). For this
reason, gene design strategies often avoid such structures in choosing coding
of the first several amino acids. Salis and coworkers have recently developed
a thermodynamic model that captures competition between internal
mRNA structures and the binding of the ribosome to the RBS (Salis
et al., 2009). An alternative mathematical model of initiation has also been
proposed based on similar considerations (Na et al., 2010). The Salis model
is the basis of an online tool that can be used to design RBSs with modified
rates of initiation of translation (http://www.voigtlab.ucsf.edu/software/).
In its current stage of development, this tool is best suited for attenuating
expression of an existing gene.
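
As a rough illustration of how a thermodynamic model of this kind turns free energies into predicted rates, the sketch below assumes a Boltzmann-type relationship in which the initiation rate is proportional to exp(-beta * dG_total). The function name, the default value of beta (0.45 mol/kcal), and the example free energies are assumptions made for illustration; the dG values themselves would have to come from an RNA folding calculation that is not shown here.

```python
import math


def relative_initiation_rate(dg_total: float, dg_reference: float,
                             beta: float = 0.45) -> float:
    """Predicted initiation rate of a design relative to a reference design.

    Assumes rate is proportional to exp(-beta * dG_total), with dG in
    kcal/mol and beta in mol/kcal. Both the functional form and the default
    beta are assumptions of this sketch, not values taken from the chapter.
    """
    return math.exp(-beta * (dg_total - dg_reference))


if __name__ == "__main__":
    # Hypothetical example: a design 2 kcal/mol more favorable than the
    # reference is predicted to initiate exp(0.9) times as often.
    print(relative_initiation_rate(-7.0, -5.0))
```

Under this assumed form, small changes in total free energy produce multiplicative changes in the predicted rate, which is consistent with such a tool being used to tune initiation up or down.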
In eukaryotes, translation initiation is significantly different from that in
prokaryotes, and multiple mechanisms have been characterized. Most initi-
ation of translation from polymerase II-derived transcripts proceeds via
recognition of the m7G cap at the 5′ terminus of the mRNA followed by
scanning of the ribosome to the initiation codon, which is identified by
proximity to the 5′-end and sequence context (Kozak, 1999, 2005; Pestova
et al., 2001; Preiss and Hentze, 1999). Several factors are apparently
involved in unwinding structure in the region from the cap to the start
codon (Parsyan et al., 2009; Pisareva et al., 2008). Alternatively, initiation for
some genes can occur via recognition of internal mRNA elements that
recruit ribosomes to the message and direct them to the start codon (Berry
et al., 2010; Gazo et al., 2004; Pestova et al., 2001).
Numerous lines of evidence suggest that the initial 15–25 codons of the
ORF deserve special consideration in gene optimization (Allert et al., 2010;
Chen and Inouye, 1994; Eyre-Walker and Bulmer, 1993; Gonzalez de
Valdivia and Isaksson, 2004, 2005; Kudla et al., 2009; Stenström and
Isaksson, 2002; Stenström et al., 2001a,b; Tuller et al., 2010). Studies have
shown that the impact of rare codons on translation rate is particularly strong
in these first codons, for expression in both Escherichia coli and Saccharomyces
cerevisiae (Chen and Inouye, 1990, 1994; Hoekema et al., 1987). In E. coli,
peptidyl-tRNA drop-off during translation of the initial codons appears to
be accentuated by the presence of rare or NGG codons (Cruz-Vera et al.,
2004; Gonzalez de Valdivia and Isaksson, 2004, 2005). These effects appear
to be independent of local mRNA secondary structure. The impact of early