Poudeletal. Biotechnol Biofuels (2021) 14:116
Identication andcharacterization
ofproteins ofunknown function (PUFs)
inClostridium thermocellum DSM 1313 strains
aspotential genetic engineering targets
Suresh Poudel1,2,3†, Alexander L. Cope1,3†, Kaela B. O’Dell1,2,4, Adam M. Guss1,4, Hyeongmin Seo2,5,
Cong T. Trinh2,3,4,5 and Robert L. Hettich1*
Background: Mass spectrometry-based proteomics can identify and quantify thousands of proteins from individual
microbial species, but a significant percentage of these proteins are unannotated and hence classified as proteins
of unknown function (PUFs). Due to the difficulty in extracting meaningful metabolic information, PUFs are often
overlooked or discarded during data analysis, even though they might be critically important in functional activities,
in particular for metabolic engineering research.
Results: We optimized and employed a pipeline integrating various “guilt-by-association” (GBA) metrics, includ-
ing differential expression and co-expression analyses of high-throughput mass spectrometry proteome data and
phylogenetic coevolution analysis, and sequence homology-based approaches to determine putative functions for
PUFs in Clostridium thermocellum. Our various analyses provided putative functional information for over 95% of the
PUFs detected by mass spectrometry in a wild-type and/or an engineered strain of C. thermocellum. In particular, we
validated a predicted acyltransferase PUF (WP_003519433.1) with functional activity towards 2-phenylethyl alcohol,
consistent with our GBA and sequence homology-based predictions.
Conclusions: This work demonstrates the value of leveraging sequence homology-based annotations with empiri-
cal evidence based on the concept of GBA to broadly predict putative functions for PUFs, opening avenues to further
interrogation via targeted experiments.
© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Lignocellulose solubilization and fermentation have been
major challenges in the quest to produce cost-effective
cellulosic biofuels. Clostridium thermocellum (which has
also been renamed as Ruminiclostridium thermocellum
[1], Hungateiclostridium thermocellum [2], Acetivibrio
thermocellus [3]) is a fermentative anaerobic thermophile
that has been studied extensively as a possible chassis
organism for this goal. Several attempts have been made
to engineer C. thermocellum strains to produce bioetha-
nol as the major cellulose degradation product at high
yield [4–8], but none of these attempts have matched
conventional bioethanol producers, such as Saccharomy-
ces cerevisiae and Zymomonas mobilis [9, 10].
Although C. thermocellum produces various short-chain alcohols (e.g., ethanol and isobutanol), several other end products are also generated (e.g., formic acid, acetic acid, lactic acid, hydrogen, and amino acids). In particular, the organic acids decrease the pH of the culture media and reduce the yields of alcohols as biofuels. To improve ethanol production, a modified version of C. thermocellum DSM1313 was generated, called strain LL1210, in which the specific genes involved in the production of acetate, lactate, formate, and most hydrogen (Δhpt ΔhydG Δldh Δpfl Δpta-ack) have been deleted, followed by an adaptive laboratory evolution strategy [11]. While LL1210 is among the highest producers of ethanol in titer and yield from lignocellulosic biomass, further advances in this strain are required to examine organism robustness and scalability for industrial applications [10].
Suresh Poudel and Alexander L. Cope contributed equally to this work
1 Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
Full list of author information is available at the end of the article
Interestingly, many of the proteins determined to be
differentially expressed or highly expressed depending on
the specific substrate in C. thermocellum and other cellulolytic
bacteria are annotated as hypothetical proteins,
uncharacterized proteins, domains of unknown function (DUFs),
or a similar term indicating no known function.
We broadly refer to this class of proteins as “proteins
of unknown function” (PUFs). High abundances and/or
differential expression of PUFs that are sensitive to environmental
conditions (specifically, cellulosic substrate
type) suggest a possible role in the metabolism of cellulose
or other key cellular processes. For example, previous
work in the cellulolytic Caldicellulosiruptor bescii
indicated differential abundance of 37 PUFs driven by
the nature of the cellulosic substrates used in the growth
media [12]. Similarly, many PUFs were found to be highly
and/or differentially abundant across four strains of C.
thermocellum (one wild-type parent strain plus three
mutant strains) [11]. For example, WP_003519067.1
(Clo1313_1790), which was a PUF at the time of this
study, was highly abundant across the 4 strains [11], sug-
gesting an important functional role even in mutants
that had undergone adaptive laboratory evolution.
WP_003519067.1 is now annotated in NCBI RefSeq as
a 2Fe-2S ferredoxin based on a conserved domain iden-
tified by NCBI SPARCLE [13]. Some PUFs were highly
abundant in mutants, but not in the wild-type strain,
while other PUFs showed differential abundance across
mutant strains. Such measurements suggest a functional
role, but a key challenge for researchers is to identify the
specific function of a PUF.
As evident from the critical assessment of protein func-
tion annotation (CAFA), functional predictions based
on sequence homology have dramatically improved
over the past 2 decades [14–16]. Despite this progress,
a large number of proteins remain annotated as PUFs
[17]. As of March 2020, a total of 17,929 domains were
deposited in the Pfam database, with 5792 domains (32%
of the total) containing the keyword “unknown func-
tion” [18]. Reports indicate that a large fraction of Pro-
tein Data Bank (PDB) entries are categorized under
“unknown functions” [19, 20]. PUFs are common even in
well-studied species. For example, only 40% of predicted
genes in the model plant Arabidopsis thaliana have reli-
able annotations [21]. Previous efforts have been made to
predict the biochemical functions for protein structures
of unknown function [22] and to characterize essential
domains of unknown function (DUFs) [23]. Even after
a recent attempt to better annotate PUFs in the S. cerevisiae
and human genomes via sequence homology,
greater than 30% of their unknown proteins (600 and
2000 proteins, respectively) remain uncharacterized [24].
In E. coli, 80% of predicted proteins have some functional
annotation, but only 54% have some level of empirical
characterization [25, 26].
Protein characterization via empirical methods is challenging because of the large amount of sequencing data currently available combined with the low-throughput nature of characterization experiments. An alternative approach is to use interaction or co-expression data produced via high-throughput omics-scale measurements to identify proteins of known function with which a PUF is associated, a concept referred to as “guilt-by-association” (GBA) [24, 27–31]. GBA operates under the reasonable assumption that if two proteins physically interact or are co-expressed with one another, they are more likely to be connected in function [29]. Previous work has found significant overlap between co-expression and protein–protein interaction networks, suggesting that functionally related proteins are co-expressed [32]. Using the concept of GBA, PUFs which interact or co-express with proteins of known function may serve functional roles similar to those of their annotated partners, which can be confirmed via targeted characterization experiments.
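The GBA idea can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes that co-expression is scored by Pearson correlation of abundance profiles, and the function names, the 0.9 threshold, and the toy protein data are all hypothetical.

```python
from math import sqrt
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sx * sy)

def gba_candidates(puf_profile, annotated, min_r=0.9):
    """Count annotations among proteins co-expressed with the PUF.

    annotated maps protein ID -> (abundance profile, annotation label).
    Returns (label, count) pairs, most frequent first.
    """
    votes = Counter(
        label
        for profile, label in annotated.values()
        if pearson(puf_profile, profile) >= min_r
    )
    return votes.most_common()

# Hypothetical toy data: abundances at four time points.
annotated = {
    "protA": ([1.0, 2.0, 3.0, 4.0], "glycoside hydrolase"),
    "protB": ([2.0, 4.1, 6.2, 7.9], "glycoside hydrolase"),
    "protC": ([4.0, 3.0, 2.0, 1.0], "flagellar assembly"),
}
puf = [0.5, 1.1, 1.4, 2.0]
print(gba_candidates(puf, annotated))
```

In this toy case the PUF profile rises together with the two hydrolase profiles, so the hydrolase label collects the votes. Such a tally only nominates a putative function, which would still need targeted experimental confirmation.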
Given that approximately 20% of the C. thermocel-
lum genome consists of PUFs, the goal of this work is
to identify putative functional roles for PUFs in C. ther-
mocellum, with a focus on PUFs which may play a role
in cellulose degradation and ethanol production. To
explain the potential roles of these PUFs, a time-course
MS-based proteomics study was performed with C. ther-
mocellum DSM1313 wild-type (∆hpt) and the evolved
LL1210 strain to assess differential and co-expressed
PUFs. ∆hpt is the parent strain for essentially every
mutant ever made in C. thermocellum. It has a deletion
in the hypoxanthine phosphoribosyl transferase (hpt),
which allows use of ∆hpt as a counter-selectable marker
for making gene deletions.
The LL1210 strain was chosen for comparison with the
∆hpt wild-type strain not only because they are genetically
and phenotypically distinct but also because
LL1210 is the highest ethanol-producing strain of C. thermocellum
to date. Therefore, discovery of PUFs within
the strain could lead to advances in improving its
metabolism toward ethanol production as well as overall
growth.
The major aim of this experimental design was to
explore the temporal response of PUFs that are specific
to a particular strain and, more importantly, increase or
accumulate along with the solubilization of the substrate.
GBA evidence was leveraged with various functional pre-
diction tools, structural modeling, phylogenetic analysis,
and gene regulatory information to propose putative
functional roles for many PUFs in C. thermocellum. In an
attempt to validate functional predictions derived here,
PUF candidates which could be tested and verified by a
measurable phenotype effect, either in vitro or in vivo,
were identified. is is a very difficult and unpredictable
process with the risk of no positive return. A range of
PUFs were considered and the best validation candidate
selected. From this, PUF WP_003519433.1 was empiri-
cally validated, showing clear evidence to support the
alcohol acetyltransferase activity prediction.
A visual outline of the GBA approach described in this
manuscript is presented in Fig. 1, which illustrates how
the MS-based proteome information is first connected
with expression networks and then interrogated with a
variety of informatics and structural prediction tools.
PUFs with consistent lines of evidence across multiple
GBA approaches were deemed strong candidates for
putative functional classification.
Fig. 1 A pipeline summarizing the guilt-by-association and functional annotation approaches used in this study. The 344 PUFs measured via LC–MS/MS were subjected to co-expression and differential expression analyses. Structural modeling with SwissModel was used to determine structural templates which best fit a PUF's protein sequence. Domain and function prediction were performed using InterProScan, eggNOG-mapper, and PANNZER2. Phylogenetic coevolution analysis was used to test for coevolution. Gene trees were generated from a homology search in BLAST, followed by alignment of the top 200 hits and tree construction using FastTree. Regulatory information based on shared operons was extracted from the DOOR database
[Figure labels: 826 PUFs in C. thermocellum; 344 PUFs measured via LC–MS/MS; promoter; operator; PUF; Protein A; Protein B]
A total of 1960 proteins out of 3033 possible proteins
(65%) were quantified across all time points (as defined
in the "Methods" section) in both C. thermocellum strains
(∆hpt and LL1210). Figure 2 shows the global
proteome overlap (Venn diagrams) across both strains
(a–c) and the distribution of protein abundances (annotated
proteins versus PUFs) as boxplots (d). In both strains,
each time point had several unique proteins (Fig. 2a
and b); however, a majority of proteins were observed
in both experimental strains (Fig. 2c). This reveals that
while much of the overall protein machinery is constant,
some of the identified proteins are specific for
one strain under the provided growth condition, which
could help to characterize and understand the overall
functionality of that strain. The unique proteins in the
∆hpt strain were enriched in function related to sulfur
compound metabolic process (GO:0044272), drug meta-
bolic process (GO:0017144), oxidation–reduction pro-
cess (GO:0055114), aromatic compound biosynthetic
process (GO:0019438), and water-soluble vitamin bio-
synthetic process (GO:0042364). Notably, many proteins
involved in these processes are perturbed in the LL1210
strain [11]. In contrast, the unique proteins in the LL1210
strain were related to polysaccharide catabolic process
(GO:0000272). Global abundance distribution between
all annotated identified proteins versus PUFs revealed
that PUFs are, on average, lower in abundance across all
conditions (Fig.2d); however, since they are identified,
they likely play key roles in the solubilization of the
In total, 344 PUFs were identified via LC–MS/MS and
were interrogated with GBA and sequence homology-
based analyses. Across all time points, proteins with
functional annotations were on average of higher abun-
dance, as expected, although some PUFs are clearly
highly abundant (Fig.2d). At the time of our experiment,
PUFs WP_003518117.1 and WP_003519055.1 were the
only PUFs found in the top 10% most abundant protein
across all samples. However, these PUFs were recently
annotated as Ig-like domain containing protein and (2Fe-
2S) ferredoxin domain-containing protein, respectively.
e results section will focus on some broad trends
observed related to PUFs (e.g., differential expression
Fig. 2 Summary of protein identifications and abundances. Overlap of protein identifications across time points in a Δhpt and b LL1210,
and c overlap between strains. d Distribution of protein abundances for annotated proteins and PUFs across all strains and time points.
****p < = 0.0001.31
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 5 of 19
Poudeletal. Biotechnol Biofuels (2021) 14:116
patterns, coevolution patterns, etc.) followed by the
description of specific PUFs of interest.
Phylogenetic analysis reveals PUFs coevolving
with cellulose solubilization and cellulosome structural
To reduce ambiguity in unique protein assignments,
orthogroups, or sets of orthologous and paralogous
proteins across species, were determined using the soft-
ware OrthoFinder [33] with 27 species that included
both cellulolytic and non-cellulolytic Firmicutes. Of the
344 PUFs detected via LC–MS/MS, 68 were assigned
to 37 unique orthogroups. To examine coevolution of
PUFs with other proteins, we employed a phylogenetic
method that tests for correlated presence or absence
of traits across species [34]. The traits in this case were
the orthogroups, specifically the 37 containing a PUF
measured via LC–MS/MS. Using the species tree estimated
by OrthoFinder (Additional file 1: Figure S2), our
analysis detected 115 PUFs that exhibited significant
signals of coevolution with another protein orthogroup.
Interestingly, 76 of the 115 significant results indicated
PUFs that coevolved with WP_003516608.1 (YcxB fam-
ily protein), WP_003516626.1 (zinc-finger transcription
factor II domain-containing protein), or ADU74616.1
(PUF, not detected by LC–MS/MS). Although the func-
tion has not been characterized, YcxB family proteins are
predicted to be transmembrane proteins. The set of proteins
shown to be coevolving with PUFs was enriched
in GO terms related to polysaccharide catabolic pro-
cess (GO:0000272), chemotaxis (GO:0006935), cell wall
macromolecule catabolic process (GO:0016998), xylan
metabolic process (GO:0045491), transmembrane sign-
aling receptor activity (GO:0004888), cellulose binding
(GO:0030248), cellulose 1,4-beta-cellobiosidase activity
(GO:0016162), calcium ion binding (GO:0005509), and
O-glucosyl hydrolase activity (GO:0004553), among others,
as shown in Additional file 2. Clearly, many of these
processes are related to the solubilization of cellulose.
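As a rough illustration of what correlated presence or absence across species means, the sketch below scores two orthogroups by the Jaccard index of their presence/absence profiles. This is a deliberately simplified stand-in and an assumption of this example: it ignores the species tree, whereas the phylogenetic method cited above [34] accounts for shared ancestry. The profiles and names are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard index of two presence/absence profiles (sequences of 0/1)."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

# Hypothetical profiles over six species (1 = orthogroup present).
puf_orthogroup = [1, 1, 0, 1, 0, 0]
cellulase_orthogroup = [1, 1, 0, 1, 1, 0]
ribosomal_orthogroup = [1, 1, 1, 1, 1, 1]

print(jaccard(puf_orthogroup, cellulase_orthogroup))
print(jaccard(puf_orthogroup, ribosomal_orthogroup))
```

An orthogroup present in every species, such as the hypothetical ribosomal one, scores lower here because its presence does not track that of the PUF orthogroup. A tree-aware test is still needed to separate genuine coevolution from patterns that closely related species share by descent.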
Comparison ofstrains Δhpt andLL1210 reveals dierential
protein expression ofbothknown andunknown (PUF)
Differential expression analysis of protein abundances
was performed using limma [35] between the two strains
at early-log phase, mid-log phase, and late-log phase.
Results of various functional enrichments can be found
in Additional files 3 and 4. In total, we found 707 unique
proteins that were differentially expressed in at least one
time point, 100 of which were PUFs. For each time point,
there were 393, 414, and 444 differentially expressed pro-
teins between the ∆hpt and LL1210 strains, respectively.
Of these, 57, 53, and 59 were PUFs, with 38, 27, and 37
of these having an absolute log2 fold change of at least
1.5 (see volcano plot, Additional file 5: Figure S3). Sets
of differentially expressed proteins were enriched in
various GO terms and KEGG terms at each time point
(Additional files 3 and 4). Under the assumption of GBA,
PUFs that are differentially expressed alongside annotated proteins
are likely to have functional roles similar to those proteins.
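The fold-change cutoff used above can be sketched as follows. This is only an illustration of the arithmetic, assuming raw abundances per replicate. It does not reproduce limma's moderated statistics or significance testing, and the function names and toy values are hypothetical.

```python
from math import log2

def log2_fold_change(abundance_a, abundance_b):
    """log2 of (mean abundance in B / mean abundance in A).

    Positive values mean the protein is more abundant in B.
    """
    mean_a = sum(abundance_a) / len(abundance_a)
    mean_b = sum(abundance_b) / len(abundance_b)
    return log2(mean_b / mean_a)

def strongly_changed(table, min_abs_lfc=1.5):
    """Keep proteins whose absolute log2 fold change meets the cutoff.

    table maps protein ID -> (replicates in strain A, replicates in strain B).
    """
    result = {}
    for protein, (rep_a, rep_b) in table.items():
        lfc = log2_fold_change(rep_a, rep_b)
        if abs(lfc) >= min_abs_lfc:
            result[protein] = lfc
    return result

# Hypothetical abundances: three replicates per strain.
table = {
    "PUF_1": ([100.0, 120.0, 110.0], [400.0, 440.0, 480.0]),
    "PUF_2": ([200.0, 200.0, 200.0], [250.0, 260.0, 240.0]),
    "PUF_3": ([800.0, 800.0, 800.0], [100.0, 100.0, 100.0]),
}
print(strongly_changed(table))
```

An absolute log2 fold change of 1.5 corresponds to roughly a 2.8-fold difference in abundance in either direction, so PUF_2 in this toy table (a 1.25-fold increase) is filtered out.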
During early-log and late-log phases, differentially
expressed proteins were enriched in GO and KEGG
terms related to flagellum-dependent cell movement
(GO:0071973) and chemotaxis (GO:0006935). These
biological processes appear to be overall up-regulated
in LL1210 during early-log phase, with mean log2 fold
changes of 0.24 and 1.03, respectively. However, these
processes appear to be down-regulated in late-log growth
(mean log2 fold changes of −2.16 and −1.75, respectively). In
addition, gene set enrichment analysis of differentially
expressed proteins revealed that proteins with KEGG
terms related to flagellar assembly (KEGG ID ctx02040)
were less abundant across all 3 time points in LL1210 rel-
ative to ∆hpt. Previous work found that proteins related
to cell motility were down-regulated in the LL1210 strain
[10]. Cellular motility can be an energetically costly pro-
cess, so the already slow-growing strain with a heavily
perturbed proteome could down-regulate cellular motility
processes to channel ATP to other key cellular processes,
consistent with many of our observations. Note
that sporulation genes, specifically the master regulator
SpoA, were mutated in LL1210. In other clostridial species,
mutations in spoA have affected biofilm and flagellum
At all three time points, proteins involved in the acetyl-
CoA biosynthesis (GO:0006085) also appear to be dif-
ferentially expressed (i.e., 1.08, 2.18, 2.80 mean log2
fold change in early, mid-, and late-log phases, respectively).
This result is consistent with the genetic modification
of the LL1210 strain, which started as a strain
with the pyruvate-formate lyase-dependent pathway
converting pyruvate to acetyl-CoA disrupted. In addition,
there is an overall increased abundance in LL1210
of proteins involved in the pantothenate metabolic processes
(GO:0015939) at the mid-log phase (mean log2
fold change 1.97). This finding is particularly interesting
as pantothenate is the precursor for CoA biosynthesis,
which has a range of functions in bacteria [36]. As
the acetyl–CoA pathway has been altered in the LL1210
strain to drive the pyruvate metabolism towards etha-
nol production, this increased abundance of pantoth-
enate metabolism could indicate changes to fatty acid
metabolism. A previous study found that C. thermocel-
lum adapts to the increase of ethanol by remodulation of
the cell membrane [37]. Consistent with this conclusion,
we found that proteins with GO:0006633 (fatty acid bio-
synthetic processes) and GO:0004312 (fatty acid synthase
activity) were both more abundant in LL1210.
At all three time points, proteins with GO term GO:0016730
(oxidoreductase activity, acting on iron–sulfur proteins as
donors) were more abundant in the LL1210 strain relative
to Δhpt (mean log2 fold change 4.14, 2.56, and 2.44 in
early, mid-, and late-log phases, respectively). The reduction
of oxidized ferredoxin (an iron–sulfur protein) is an
important step in the conversion of pyruvate to acetyl-
CoA. Notably, GO:0008901 (ferredoxin hydrogenase
activity) was also more abundant (mean log2 fold change
1.30) in LL1210 at mid-log phase.
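Per-term summaries such as the mean log2 fold changes quoted above amount to grouping per-protein values by annotation term and averaging. The following sketch shows that grouping under stated assumptions: the helper name, the protein IDs, and the fold-change values are hypothetical, and only the GO identifiers are taken from the text.

```python
from collections import defaultdict

def mean_lfc_by_term(lfc, annotations):
    """Average per-protein log2 fold changes over each annotation term.

    lfc maps protein ID -> log2 fold change.
    annotations maps protein ID -> iterable of term IDs.
    Proteins without a measured fold change are skipped.
    """
    grouped = defaultdict(list)
    for protein, terms in annotations.items():
        if protein in lfc:
            for term in terms:
                grouped[term].append(lfc[protein])
    return {term: sum(values) / len(values) for term, values in grouped.items()}

# Hypothetical per-protein values and annotations.
lfc = {"p1": 4.0, "p2": 2.0, "p3": -1.0}
annotations = {
    "p1": ["GO:0016730"],
    "p2": ["GO:0016730", "GO:0008901"],
    "p3": ["GO:0006935"],
    "p4": ["GO:0006935"],  # not quantified, so it is ignored
}
print(mean_lfc_by_term(lfc, annotations))
```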
Aside from identifying differentially expressed proteins at each time point, we also sought to identify differentially co-expressed proteins between the Δhpt and LL1210 strains. We identified 359 such proteins, 50 of which were PUFs. These differentially co-expressed proteins were enriched in GO terms related to dephosphorylation (GO:0016311), positive regulation of gene expression (GO:0010628), cell adhesion (GO:0007155), metal ion transport (GO:0030001), phosphate-containing compound metabolic processes (GO:0006796), cellular response to oxygen-containing compound (GO:1901701), magnesium ion binding (GO:0000287), cyclic-di-GMP binding (GO:0035438), transferase activity, transferring phosphorus-containing groups (GO:0016772), and isomerase activity (GO:0016853), among others. As noted above, differential expression related to iron-binding proteins could be significant due to their role in the conversion of pyruvate to ethanol. We note that two PUFs with GO terms related to iron-ion binding were found to be differentially co-expressed: WP_003512015.1, which is discussed further below, and WP_003515910.1.
Co‑expression analysis
Co-expression analysis was performed separately for
the Δhpt and LL1210 strains to determine clusters of
co-expressed genes. Using the Python tool clust [38], we
identified 11 and 14 clusters of co-expressed proteins in
the Δhpt and LL1210 strains, respectively. The cluster-specific
protein abundance patterns for these strains can be seen
in Additional file 6: Figure S4 and Additional file 7:
Figure S5, respectively. In total, these clusters represented
co-expression patterns of 1226 and 786 proteins
in the Δhpt and LL1210 strains, respectively. Functional
enrichment was performed to assess potential functions
of PUFs based on GBA. Out of the numerous PUFs that
were identified in this study, we will highlight a few below
that are of particular interest due to their potential role
in cellulose solubilization, pyruvate metabolism, and/or
ethanol production.
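Functional enrichment of a co-expression cluster is commonly assessed with a hypergeometric over-representation test. The text does not state which test was applied here, so the sketch below is an assumption rather than the authors' procedure, and the function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(cluster_size, hits_in_cluster,
                       background_size, hits_in_background):
    """Upper-tail hypergeometric probability P(X >= hits_in_cluster).

    X is the number of proteins carrying a given term when cluster_size
    proteins are drawn without replacement from a background in which
    hits_in_background of background_size proteins carry that term.
    """
    total = comb(background_size, cluster_size)
    max_hits = min(cluster_size, hits_in_background)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, cluster_size - k)
        for k in range(hits_in_cluster, max_hits + 1)
    )
    return tail / total

# Small worked case: 10 background proteins, 3 carry the term,
# and a cluster of 4 proteins contains 2 of them.
print(enrichment_p_value(4, 2, 10, 3))
```

In practice one p-value is computed per term and cluster, and the resulting set is corrected for multiple testing before terms are reported as enriched.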
PUF WP_003512015.1 (Clo1313_2169): Evidence
for a rubredoxin protein
GBA and sequence homology-based evidence suggest
that WP_003512015.1 is a rubredoxin protein, a protein
containing one iron atom that serves as an electron carrier.
WP_003512015.1 was found in clusters hpt_C5 and
LL1210_C7. Although LL1210_C7 contained no enriched
GO or KEGG terms, hpt_C5 (Fig. 3a) was enriched in
many functional terms, including 4 iron, 4 sulfur cluster
binding (GO:0051539), metal ion binding (GO:0046872),
and oxidoreductase activity, acting on the CH-OH group
of donors and NAD or NADP as acceptor (GO:0016616).
Interestingly, this PUF falls into clusters which qualitatively
appear to demonstrate differential co-expression
patterns between the Δhpt and LL1210 strains. In the Δhpt
strain, WP_003512015.1 decreases from early-log to
mid-log phase before a large jump in abundance in late-log
phase. The opposite pattern is observed in LL1210,
where there is a small increase in WP_003512015.1 from
early-log to mid-log phase, followed by a sharp decrease
into late-log phase. If this PUF is involved in the oxida-
tion–reduction processes in the conversion of pyruvate
to ethanol, then contrasting patterns between the Δhpt
and LL1210 strain might be expected. WP_003512015.1
was differentially expressed between the two strains
at late-log phase, with a log2 fold change of -1.87. If
WP_003512015.1 has oxidoreductase activity, then this
is consistent with its differential expression along with
many other proteins with similar biological function.
However, WP_003512015.1 was not significant based
on our differential co-expression analysis (dCp = 0.3, q
value = 0.065).
Sequence homology-based evidence strongly supports WP_003512015.1 as a rubredoxin. The best fitting structure from SwissModel is a rubredoxin protein found in Guillardia theta (PDB 1H7V, Fig. 3b), but many other structures were annotated as rubredoxins or rubredoxin-like proteins. Furthermore, examination of the phylogenetic gene tree reveals WP_003512015.1 is closely related to many rubredoxin proteins annotated in UniProt (Fig. 3c). Although the operon for WP_003512015.1 does not regulate expression for any other proteins, it is annotated as a rubredoxin-type protein in the DOOR database [39], consistent with the co-expression and homology-based analyses. This result also highlights the limitations of the RefSeq and GenBank repositories in reflecting the most up-to-date functional annotations.
Fig. 3 Results for possible rubredoxin, PUF WP_003512015.1. a Co-expression cluster of PUF WP_003512015.1, including other PUFs and proteins with GO:0051539 (4 iron, 4 sulfur cluster binding); axes: time point (hpt_T1, hpt_T2, hpt_T3) and normalized abundance. b Best fitting structure of known function from PDB, 1H7V, which is a rubredoxin from G. theta (sequence similarity 0.42 and coverage 0.32). c Phylogenetic gene tree for WP_003512015.1 indicates that this protein is closely related to many rubredoxin proteins
PUF WP_003516357.1 (Clo1313_1439): Evidence
for an ABC transporter
Various lines of evidence suggest that PUF WP_003516357.1 is a component of a sugar ABC transporter. WP_003516357.1 is differentially expressed at all 3 time points between the Δhpt and LL1210 strains, with log2 fold changes of −2.95, −3.31, and −4.20. This indicates relatively lower abundance of WP_003516357.1 in the LL1210 strain. WP_003516357.1 falls into LL1210_C13 (Fig. 4a), which consists of 17 proteins, 3 of which are PUFs. This cluster is enriched in GO terms chemotaxis (GO:0006935), polysaccharide catabolic process (GO:0000272), DNA-dependent DNA replication (GO:0006261), and carbohydrate binding (GO:0030246). Although none of the enriched GO terms directly relate to protein transport, sequence homology-based approaches strongly suggest that this PUF is likely an ABC transporter. InterProScan [40] identifies WP_003516357.1 as an ABC transporter, substrate-binding protein. Furthermore, the vast majority of structural templates fitting to WP_003516357.1 come from sugar ABC transporters, consistent with the enrichment of polysaccharide catabolic process and carbohydrate-binding proteins in LL1210_C13. Accordingly, this small cluster contains WP_003515342.1 (glycoside hydrolase), WP_003519375.1 (cell surface glycoprotein 2), and WP_014522595.1 (cellulosome anchoring protein cohesin subunit). The best fitting structure of known function is annotated as a probable ribose ABC transporter, substrate-binding protein (PDB 5IBQ, Fig. 4b). Additionally, five of the matched structures were related to the transport of arabinose, a monosaccharide found in the hemicellulose of plant cell walls. We note that C. thermocellum is known to use ABC transporter systems for the uptake of oligosaccharides [37]. The gene tree supports this protein as a component of an ABC transporter. WP_003516357.1 is most closely related to a membrane protein, but ABC transporters and ABC-type uncharacterized transporters are also present in the gene tree (Fig. 4c). Taken together, PUF WP_003516357.1 is strongly supported as a protein component of an
ABC transporter possibly involved in the uptake of
PUF WP_003511984.1 (Clo1313_2180): Evidence
foraglycoside hydrolase
During the process of our data analysis for this study,
we focused attention on PUF WP_003511984.1, as we
had strong GBA evidence that it was a glycoside hydro-
lase. Interestingly, in the most recent reannotation of the
C. thermocellum genome, this protein is now labeled as
a putative glycoside hydrolase. Since our examination of
this protein was completed in the absence of that infor-
mation, we hereby present below the evidence we had
that converged on the same functional assignment as the
reannotation, as a type of positive control for our PUF
WP_003511984.1 was not differentially expressed
between the Δhpt and LL1210 strains at any of the
time points; however, it was differentially co-expressed
(dCp = 0.93, q value = 0.034). WP_003511984.1 was
found in clusters hpt_C0 (Fig.5a) and LL1210_C0, which
are two large clusters with 205 and 167 genes, respec-
tively. Both clusters had many enriched GO terms.
e strongest evidence for WP_003511984.1 as a gly-
coside hydrolase was the enrichment of the GO term
macromolecule catabolic process (GO:0009057). We
note that this does not necessarily refer to polysaccharide catabolism. However, the proteins
with this GO term in the hpt_C0 cluster included proteins annotated as glycoside hydrolases (ADU75731.1,
WP_003515281.1, WP_003517278.1), endoglucanases
(WP_003512420.1, WP_003514472.1, WP_003517595.1),
a carbon storage regulator (WP_003513578.1),
N-acetylmuramoyl-L-alanine amidase (a cell wall
hydrolase, WP_003515629.1), carbohydrate-binding
Fig. 4 Results for possible ABC transporter, PUF WP_003516357.1. a Co-expression cluster of PUF WP_003516357.1, including other PUFs and
proteins with GO:0000272 (polysaccharide catabolic process). b Best fitting structure of known function from PDB, 5IBQ, which is annotated
as a probable ribose ABC transporter, substrate-binding protein (sequence similarity 0.27 and coverage 0.67). c Phylogenetic gene tree for
WP_003516357.1, which is closely related to proteins related to ABC transport systems
domain-containing protein (WP_003516871.1), glyco-
syl transferase (WP_003518177.1), and a copper–amine
oxidase (WP_003518386.1), with GO terms related to
polysaccharide catabolic process (GO:0000272), chi-
tin binding (GO:0008061), and carbohydrate binding
Further examination of predicted protein structures
also supports WP_003511984.1 as a glycoside hydrolase.
Predicted structures include multiple beta-galactosidase
structures, consistent with results of EGAD related to
carbohydrate metabolism. The best matching structure
of known function for WP_003511984.1 is annotated as
Cwp19 (PDB 5OQ2). This protein is found in Clostridium
difficile, and the structure represents the glycoside hydrolase domain of Cwp19 (Fig. 5b).
The phylogenetic gene tree also indicates that
WP_003511984.1 is similar in sequence to glycoside hydrolases (Fig. 5c). PANNZER2 [41] annotates this protein as
a potential glycoside hydrolase. WP_003511984.1 was
predicted to have a signal peptide and a transmembrane
region. Taken together, current evidence strongly suggests
that this protein is a glycoside hydrolase. As noted above,
a recent reannotation in the RefSeq database established
Fig. 5 Results for possible glycoside hydrolase, PUF WP_003511984.1. a Co-expression cluster of PUF WP_003511984.1, including other PUFs and
proteins with GO:0009057 (macromolecule catabolic process). b Best fitting structure of known function from PDB, 5OQ2, which is protein Cwp19
in C. difficile and contains a glycoside hydrolase domain (sequence similarity 0.28 and coverage 0.70). c Phylogenetic gene tree for WP_003511984.1
indicates this protein is closely related to a glycoside hydrolase, but many GTP-binding proteins are also present
this as a putative glycoside hydrolase, consistent with the
results presented here.
PUF WP_003519433.1 (Clo1313_1074): Evidence and experimental validation of alcohol acetyltransferase activity
Analyses of WP_003519433.1 at several levels, including
annotation using PANNZER2 and eggNOG-mapper [42],
phylogenetic gene trees, and structural modeling, all
indicated that WP_003519433.1 is a probable alcohol
Fig. 6 Results for possible alcohol acetyltransferase, PUF WP_003519433.1. a Co-expression cluster of PUF WP_003519433.1, including other
PUFs and proteins with GO:0016740 (transferase activity). b Best fitting structure of known function from PDB, 3FOT, which is annotated as a
15-O-acetyltransferase (0.27 sequence similarity and 0.89 sequence coverage). c Cartoon representation of operon structure according to DOOR
database. d Partial phylogenetic gene tree for WP_003519433.1, which is closely related to proteins related to alcohol acetyltransferase. The
complete tree contains many proteins annotated as alcohol acetyltransferases aside from those seen in the partial tree
acetyltransferase (Fig. 6). WP_003519433.1 was not
found to be differentially expressed or differentially co-
expressed between strains. WP_003519433.1 was found
in hpt_C0 and LL1210_C1 clusters (Fig.6a). e strong-
est co-expression evidence is the enrichment of GO
term transferase activity (GO:0016740) in LL1210_C1.
Although this is a broad GO term, we note that pro-
teins falling into this cluster with this GO term could
be an acetyltransferase as this cluster also includes
an N-acetyltransferase (WP_003513195.1) and PUF
WP_003513604.1 with GO term N-acetyltransferase
activity. Cluster LL1210_C1 was also enriched for the
KEGG module Shikimate pathway, which is responsi-
ble for the synthesis of folate and aromatic amino acids.
Notably, PUF WP_003519433.1 has a GO term indicating
that it is possibly a membrane protein, and it fits the structural template of a TRI3 trichothecene 15-O-acetyltransferase from the fungus Fusarium sporotrichioides (PDB
3FOT). PUF WP_003519433.1 is also part of an operon,
a key piece of GBA evidence, with a protein annotated in
DOOR as an esterase/lipase (Fig. 6c). WP_003519433.1
was selected for further characterization. Interestingly,
the phylogenetic gene tree appears to be split between
two major groups: one in which many of the proteins are
annotated as an alcohol acetyltransferase or similar function, and a group that is mostly PUFs (Fig. 6d).
To experimentally validate alcohol acetyltransferase activity (e.g., alcohol + acetyl-CoA → ac(r)yl
acetate + CoA), WP_003519433.1 was N-terminally
His-tagged and expressed in E. coli. Western blot of
the purified protein indicated successful expression
of WP_003519433.1 (Fig. 7a). This protein was then
screened against a library of linear C2-C10 alcohols
for acetyltransferase functional activity, but no activity was observed. Interestingly, the LL1210_C1 cluster
was enriched for proteins involved in KEGG module
M00022, which is part of the shikimate pathway that converts phosphoenolpyruvate and erythrose-4P to chorismate. Overall, the shikimate pathway is involved in the
synthesis of aromatic amino acids, which can be used in
the production of aromatic alcohols [43]. Further screening revealed that WP_003519433.1 has activity toward
the aromatic alcohol 2-phenylethyl alcohol, both in vitro
(data not shown) and in vivo (Fig. 7b and c). The synthesized 2-phenylethyl acetate confirmed WP_003519433.1
as an alcohol acetyltransferase. As this enzyme is active
toward aromatic alcohols, it likely belongs to EC 2.3.1.-
and is different from EC that has substrate specificity toward short-chain alcohols [44–47]. To elucidate
the physiological role of WP_003519433.1, further investigation will focus on characterization of C. thermocellum
strains that overexpress or downregulate this enzyme under
various conditions.
Fig. 7 a Western blot of WP_003519433.1 expression in E. coli. L: protein ladder; C: protein purified from cells without IPTG induction (negative control);
Lane 1: protein purified from cells induced with 0.1 mM IPTG; Lane 2: protein purified from cells induced with 1 mM IPTG. The band signals observed
in lanes 1 and 2 in the red box confirmed the identity of WP_003519433.1 with an expected protein size of 50.4 kDa. b Total ion chromatogram
of high cell density E. coli whole-cell conversion of 2-phenylethyl alcohol. E. coli harboring empty plasmid was used as a negative control. c
Mass-to-charge ratio of the selected 2-phenylethyl acetate peak. Here, the eluted peak at the retention time of 15.5 min in panel b confirmed that
2-phenylethyl acetate was produced by Ec1074 carrying WP_003519433.1, with the expected mass fragmentation shown in panel c
Despite improvements in gene annotation procedures, a
large percentage of genes remain annotated as PUFs in
commonly used genome repositories [19, 20]. Although
current functional prediction tools based on sequence
homology are powerful [14–16,
24, 41, 42], they are limited to the currently known protein sequence space present in databases and assume
sequence similarity implies functional similarity [48]. A
protein which differs significantly from any known pro-
tein sequence may present challenges to current func-
tional prediction tools. While direct characterization
experiments are one option, these are often low through-
put. Other methods based on the concept of guilt-by-
association (GBA), such as co-expression analysis, may
be used to predict putative functions on omics-scale data
[29]. Hypothesized putative functions can serve as the
basis for further characterization experiments, particu-
larly in identifying the types of experiments needed to
confirm a particular function.
To this end, we performed a comprehensive analysis of
PUFs in two distinct strains of C. thermocellum using a
combination of expression analyses (e.g., co-expression
and differential expression analyses), evolutionary anal-
ysis (e.g., coevolution analysis, gene tree estimation),
structural modeling, and sequence homology-based
function predictions to identify putative functions for
PUFs, with a focus on those potentially related to cel-
lulose degradation, redox balance, and ethanol produc-
tion. A total of 344 PUFs were measured via LC–MS/
MS. Differential expression information and co-expres-
sion clusters were generated using proteomics data from
two strains of C. thermocellum (Δhpt and LL1210). Pro-
teins that were differentially abundant across the strains
showed clear enrichment of particular functions, such
as GO:0016730 (oxidoreductase activity acting on iron–sulfur proteins as donors), which was up-regulated in
LL1210. As many PUFs demonstrated differential
expression consistent with proteins of known function,
it is likely that at least some of these PUFs play roles in these
functions under GBA. Importantly, strain LL1210 is an
experimentally evolved strain originating from a strain
with gene deletions in pathways that compete with ethanol production. The parent strain of LL1210 has also been
noted for having perturbed redox
metabolism [49]. Based on previous work, we expected
PUFs with potential functional roles in ethanol production and redox metabolism to show differential (co-)expression or temporal patterns relative to the wild-type
Δhpt, as was observed in this study.
Similar to our differential expression analysis, co-
expression analysis identified many clusters containing
PUFs in both strains. These clusters were often enriched
in various GO and/or KEGG terms, including those
related to redox balance, ethanol production, and cellulose degradation. Coevolution analysis identified a
subset of PUFs which appear to coevolve with proteins
involved in cellulose degradation.
Finally, operon information was also obtained for C.
thermocellum from the DOOR database [39], which
indicates shared regulatory elements of PUFs with
proteins of known function, providing another form
of GBA. GBA evidence was combined with sequence
homology-based information, including domain pre-
diction, structural modeling, and phylogenetic gene
tree analysis to hypothesize putative functions for
PUFs. These are not meant to serve as official annotations; however, they help to narrow down the list
of PUFs with possible interesting putative functions.
These selected candidate PUFs can then be validated
by other experimental methods, such as gene knockout experiments for phenotype perturbations. Given
the evidence presented here, it seems clear that further
characterization of PUFs will be critical for engineering C. thermocellum to improve biofuel production.
Importantly, our combination of GBA approaches
with sequence homology-based functional/struc-
tural prediction identified a putative alcohol acetyl-
transferase for further experimental characterization.
Although co-expression support for this function was
modest, it was strongly supported by both sequence
homology and gene regulatory information. Experi-
mental characterization revealed that this PUF cata-
lyzes ester formation between acetyl-CoA and aromatic
alcohols. While other PUFs had stronger overall evi-
dence, this PUF was chosen for further characterization
in part due to a straightforward experimental path for
validation. A major challenge for targeted experimen-
tal characterization of proteins is the ability to induce
a phenotype when experiments are performed in vivo.
Without a clear, detectable phenotype, such experi-
mental validations are difficult to achieve.
The Δhpt and LL1210 co-expression analyses were
based on protein abundance data across 3 and 4 time
points, respectively, each with 4 replicates. A larger num-
ber of samples would likely result in clusters with clearer
functional groupings based on co-expression patterns,
as previously described [50]. Despite modest statistical
power here, many clusters served as solid evidence for
hypothesized functions of PUFs. Further work focused on
putative functional identification of PUFs should incor-
porate more publicly available proteome measurements
(with appropriate normalization for different mass spec-
trometers, label-free quantification methods, etc.) and/
or measurement of more samples, ideally varying over a
large number of possible growth states and conditions.
Various in-silico approaches were employed to comple-
ment our analyses based on protein abundance measure-
ments. Many of the sequence homology-based functional
annotation tools provided consistent functional informa-
tion. Although this information was often redundant,
suggesting that only one of the tools may be needed, con-
sistent results across tools provide confidence in the pre-
dicted function or domains, helping to eliminate possible
false positives. Coevolution analyses have also been used
previously to test for functional relationships between
proteins [34, 51]. Although these in-silico approaches are
useful on their own, these approaches are built on their
own assumptions, particularly that sequence similarity
implies functional similarity, which may not hold over
large phylogenetic distances. GBA approaches based on
experimental measurements provide another layer of
functional information that, while providing less direct
functional information, can provide greater confidence in
the results of in-silico analyses (and vice-versa).
Although some PUFs were highly abundant in both
strains, most were low abundance proteins on average
(Fig.2d). Why might PUFs tend to be lower abundance
proteins? One possible reason is lower abundance pro-
teins tend to accumulate nonsynonymous substitutions
at a faster rate than high abundance proteins [5254].
We speculate that this could present greater challenges
for functional prediction via sequence homology, par-
ticularly for species which are relatively distant from bet-
ter functionally characterized species (e.g., yeast, E. coli,
mice). To the best of our knowledge, no study has system-
atically investigated if sequence homology-based predic-
tions perform better on highly expressed genes in which
selective constraints on sequence evolution are expected
to be stronger, on average. Another, and we believe more
likely, reason is a bias towards focusing research efforts
on proteins which are more abundant, under the assump-
tion that higher abundance proteins tend to be more
important for a species.
Here, we employed both co-expression and phyloge-
netic analyses of correlated presence/absence of genes
as GBA methods for determining functional roles of
PUFs. Another option is to examine coevolution of pro-
tein abundances across species. Previous work has found
that functionally related proteins coevolve at the level of
gene expression [55–58]. Such methods could be used to
test for functional relationships of PUFs; however, this
approach requires the PUFs to be conserved across spe-
cies, making it most applicable to conserved DUFs. Nota-
bly, most of the previous analyses have used codon-based
proxies of gene expression (e.g., the Codon Adaptation
Index [59]) or data based on RNA-Seq. To date, no work
has examined coevolution of protein abundances, even
though some evidence suggests that protein abundances
are more conserved across species compared to mRNA
abundance [60].
Here, GBA and sequence homology-based approaches
were combined to identify putative functions for proteins
of unknown function (PUFs) in C. thermocellum, with a
specific focus on PUFs possibly related to cellulose deg-
radation and ethanol production. One PUF tentatively
characterized by our GBA approach, WP_003519433.1,
was confirmed experimentally to be an alcohol acetyl-
transferase. As part of this analysis, a table (Additional
file 8) is provided which summarizes the various lines
of evidence accumulated for the PUFs in this study.
The amount of evidence for any given PUF varies. For
example, 216 of the 344 PUFs detected via LC–MS/
MS fell into at least one cluster enriched in at least one
GO term. For 285 of these 344 PUFs, eggNOG-mapper,
PANNZER2, InterProScan, and/or BlastKOALA iden-
tified an annotation, although many of these are non-
specific or indicate that the protein is uncharacterized.
We expect that this table will be of significant interest to
the bioenergy research community who may be eager to
investigate PUFs of potential interest for further characterization in C. thermocellum. The different analyses
presented here can easily be applied to other microbes
of interest. All functional/structural prediction tools are
publicly available, many with easy-to-use web interfaces.
During the course of our study, we identified a few pro-
teins which were annotated in the DOOR database, such
as WP_003512015.1, despite being a PUF in GenBank/RefSeq. Additional file 8 (specifically, columns “Operon
Proteins” and “Operon Functions”) can be referenced
for other examples. This highlights that, in some cases,
major sequence databases such as GenBank/RefSeq may
not provide the most up-to-date information. We are
likely not the first to note this problem, but given the significance of databases like RefSeq for modern biological
research, our work supports the need to more effectively
keep these databases up-to-date.
Bacterial strains andculture conditions
Clostridium thermocellum strains DSM 1313 ∆hpt [4]
and LL1210 [11] were used in this study.
Strains ∆hpt and LL1210 were grown for 30 and
93 h, respectively, inside a Coy anaerobic chamber (Coy
Laboratory Products, Grass Lake, MI) under 85% N2, 10%
CO2, and 5% H2 gases at 55 °C in quadruplicate 500 mL
(total vessel capacity 1 L) cultures in MTC5 media [61],
along with 5 g/L cellobiose supplemented with 2 mM
sodium formate. Formate supplementation in minimal
medium improves growth of C. thermocellum mutant
strains that lack pyruvate-formate lyase (pfl) by improv-
ing C1 metabolism [49], and LL1210 has pfl deleted.
Samples for proteomic analyses were collected in 50 mL
aliquots at timepoints corresponding to the early-log, mid-log, and late-log phases of growth for both strains. Growth
phases were determined from optical density values at each
timepoint plotted as a growth curve. Additional samples
were collected for the lag phase of growth, for a total of
four sampling events for strain LL1210. Cells were centrifuged (3600×g) in 50 mL tubes for 10 min, immediately
quenched with liquid nitrogen, and the supernatants
were discarded. The samples were then stored at −80 °C
until protein isolation and proteomic analysis.
Proteome analyses using LC–MS/MS
Proteins from the ∆hpt and LL1210 strains of C. thermocellum were
proteolytically digested (trypsin) for nano-LC–MS/MS
analysis. An automated 2D LC–MS/MS analysis was carried out for the peptide samples using an Ultimate 3000
connected in-line with a Q Exactive Plus mass spectrometer (Thermo Scientific). A triphasic MudPIT back column
(RP-SCX-RP) was coupled to an in-house pulled nanospray emitter packed with 30 cm of 5 µm Kinetex C18 RP
resin (Phenomenex). For each sample, 12 µg of peptides
were loaded, cleaned to remove salts (if any), and
separated and analyzed across two successive salt cuts of
ammonium acetate (50 mM and 500 mM), each followed
by a 105 min organic gradient. LC-resolved peptides were
analyzed by data-dependent acquisition (DDA) on the
Q Exactive MS.
MS database searching, data analysis, and interpretation
A non-redundant database was made by combining
GenBank and RefSeq C. thermocellum proteome databases. The proteins were grouped at 100% identity using
CD-HIT [62]. MS/MS spectra were searched against this
proteome database concatenated with the cRAP database
(ftp://ftp.thegpm.org/fasta/cRAP) of common contaminants using Tide-search [63], with a
static modification on cysteine (+57.0214 Da) and a
dynamic modification for oxidation of methionine (+15.9949 Da).
Tide-search was followed by Percolator [64] with default parameters to assign spectra to
peptides (peptide-spectrum matches; PSMs). Retention times of each PSM were extracted by parsing the mzML
files with an in-house script, and MS1 apex intensities were
assigned using moFF [65]. The moFF parameters were
set to 10 ppm for the precursor mass tolerance, 4 min
for the XIC time window, and 1 min (equivalent to
60 s) to get the apex for the MS2 peptide/feature. The
peptide intensities were summed to their respective proteins per sample. Protein intensities were then
normalized by protein length and overall abundance
per MS run. Each protein required a minimum of 2
peptides and 2 PSMs to be considered valid. The
resulting normalized protein intensities were
considered valid only if the protein was detected in at least 2 of 4 replicates. Protein abundance distributions were then normalized across samples, and missing values were imputed
to simulate the mass spectrometer limit of detection.
All raw mass spectra for the proteome measurements
have been deposited into the ProteomeXchange repository with the following accession numbers: MassIVE
accession MSV000085237 and ProteomeXchange accession PXD018407 (FTP link to files: ftp://MSV000085237@massive.ucsd.edu; username: MSV000085237;
password: PUF123).
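The roll-up and validity rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the in-house script used in the study: the function names, the input tuple layout, and the exact normalization (dividing by protein length, then by the summed length-normalized intensity of the run) are assumptions, since the text does not give the formula.

```python
from collections import defaultdict

def roll_up(peptides, lengths, min_peptides=2, min_psms=2):
    """Sum peptide intensities per protein for one MS run (hypothetical helper).

    peptides: iterable of (protein_id, peptide_sequence, psm_count, intensity)
    lengths:  dict mapping protein_id -> protein length in residues
    Returns a dict protein_id -> intensity normalized by protein length and
    by the run total (assumed normalization; the paper gives no formula).
    """
    total = defaultdict(float)
    sequences = defaultdict(set)
    psms = defaultdict(int)
    for protein, sequence, n_psm, intensity in peptides:
        total[protein] += intensity
        sequences[protein].add(sequence)
        psms[protein] += n_psm
    # Keep proteins supported by at least 2 distinct peptides and 2 PSMs.
    kept = {
        p: total[p] / lengths[p]
        for p in total
        if len(sequences[p]) >= min_peptides and psms[p] >= min_psms
    }
    run_sum = sum(kept.values())
    return {p: v / run_sum for p, v in kept.items()} if run_sum else {}

def valid_in_replicates(runs, protein, min_runs=2):
    """True if the protein was quantified in at least min_runs replicate runs."""
    return sum(protein in run for run in runs) >= min_runs
```

A protein supported by a single peptide is dropped even when that peptide has many PSMs, which mirrors the requirement for both 2 peptides and 2 PSMs.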
Validation ofalcohol acetyl transferase WP_003519433.1
Plasmid construction
The plasmid pET_1074 was constructed using restriction endonucleases (NEB, MA, USA) and DNA ligase
(NEB, MA, USA) and propagated in E. coli TOP10 (Additional file 9: Table S1). The Clo1313_1074 gene (encoding WP_003519433.1) was PCR-amplified using primers
CCT ACA TGT TTG ACA CTA TTT C-3). The amplified
PCR fragment and plasmid (pETDuet-1) were digested
with BamHI and SacI restriction enzymes, ligated together,
and transformed into E. coli using a heat shock transformation method. Transformed colonies were PCR-verified for
successful plasmid cloning using the same primers. The
constructed plasmid pET_1074 was verified by Sanger sequencing.
Protein expression and purification
To express the His-tagged WP_003519433.1
(Clo1313_1074), E. coli C41 (DE3) pLysS was used to
maximize protein production with tight expression regulation [66]. Ec1074 was cultured in 3 mL lysogeny broth (LB) medium supplemented with 100 µg/mL
ampicillin and 30 µg/mL chloramphenicol in a shaking
incubator at 37 °C overnight (~16 h). The overnight
culture was inoculated into 50 mL fresh LB medium with
10 g/L glucose and the antibiotics and grown in a shaking incubator at 37 °C until the optical density (OD) reached ~0.4. To
induce recombinant protein biosynthesis, 0.1 mM
isopropyl β-D-1-thiogalactopyranoside (IPTG) was
added to the culture, followed by overnight incubation at
18 °C. After the incubation, cells were harvested by centrifugation at 4700 rpm for 10 min. Cell lysis and protein
purification followed the method described previously
with slight modifications [67]. Briefly, the cell pellets
were washed twice with Millipore water before cell lysis
with B-PER complete bacterial protein extraction reagent
(Thermo Fisher Scientific, MA, USA). The cell extracts
were incubated with HisPur Ni–NTA superflow agarose
(Thermo Fisher Scientific, MA, USA) in a batch. After
washing and elution, the eluted protein sample was
desalted using an Amicon centrifugal filter with a 10 kDa molecular weight cut-off (MilliporeSigma, MA, USA). The
desalted protein was quantified by Bradford assay with
bovine serum albumin (BSA) as the reference protein
before the enzyme reaction.
SDS-PAGE and western blot
The His-tag purified WP_003519433.1 (Clo1313_1074)
was qualitatively analyzed by sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) and
western blot. For SDS-PAGE, a Novex WedgeWell 14%
Tris–Glycine gel was used (cat# XP00145BOX, Invitrogen, CA, USA). For western blot, proteins after SDS-PAGE were transferred to a nitrocellulose membrane and
probed with anti-6x-His-tag monoclonal antibody conjugated with horseradish peroxidase (HRP). The blot was
visualized with 1-Step Ultra TMB (Thermo Fisher Scientific,
Screening ofacetyltransferase activity
Acetyltransferase activity of WP_003519433.1
(Clo1313_1074) was screened by an in vitro enzymatic
assay conducted in a 100 µL total reaction volume [67,
68]. The reaction solution consisted of 50 mM Tris–HCl
pH 7.4, 2 mM acetyl-CoA, 0.5 mg of the purified protein,
and various alcohol concentrations: 100 mM
for ethanol, butanol, isobutanol, pentanol, and isoamyl alcohol; 40 mM for hexanol; 20 mM for phenylethyl alcohol;
and 2 mM for octanol and decanol with 20% DMSO. A 100 µL
volume of hexadecane spiked with 10 mg/L n-decane was
overlaid to extract esters. The reaction was carried out at
50 °C for 48 h, and the hexadecane layer was analyzed by
gas chromatography coupled with a mass spectrometer
For invivo verification of the acetyltransferase activity
toward phenylethyl alcohol, the IPTG-induced Ec1074
whole cell was concentrated to ODs of 2, 4, and 8 in 4mL
M9 defined medium containing 10g/L glucose, and 1g/L
yeast extract, 0.1 mM IPTG, and 1 g/L 2-phenylethyl
alcohol, and 1mL of hexadecane with 10mg/L n-decane
was overlaid. e whole-cell reaction was performed in
a 37°C shaking incubator for 48h and the hexadecane
layer was analyzed to detect 2-phenylethyl acetate by
GC/MS analysis todetect esters
A GC (HP 6890, Agilent, CA, USA) equipped with an MS
(HP 5973, Agilent, CA, USA) was used to detect esters
[44–46, 69]. A 1 µL sample was injected into the GC capillary column (Zebron ZB-5, 30 m × 0.25 mm × 0.25 µm,
Phenomenex, CA, USA) in splitless mode at an
injector temperature of 280 °C. Helium was used as the
carrier gas at a flow rate of 0.5 mL/min, and the oven
temperature was programmed as follows: 50 °C initial temperature, 1 °C/min ramp up to 58 °C, 25 °C/min ramp up to
235 °C, 50 °C/min ramp up to 300 °C, and a 2-min bake-out
at 300 °C.
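As a quick check on the oven program above, the total run time can be worked out from the ramp rates. The short sketch below is illustrative only; the helper name is hypothetical, and it assumes no hold at the initial temperature because none is stated in the text.

```python
def oven_program_minutes(start_c, ramps, final_hold_min=0.0):
    """Total GC oven program time in minutes.

    ramps: list of (rate in deg C per min, target temperature in deg C).
    Assumes no hold at the start temperature (not stated in the text).
    """
    minutes = 0.0
    temperature = start_c
    for rate, target in ramps:
        minutes += (target - temperature) / rate
        temperature = target
    return minutes + final_hold_min

# Ramps taken from the method description: 1, 25, and 50 deg C per min.
ramps = [(1.0, 58.0), (25.0, 235.0), (50.0, 300.0)]
total = oven_program_minutes(50.0, ramps, final_hold_min=2.0)
print(round(total, 2))  # 8.00 + 7.08 + 1.30 + 2.00 = 18.38
```

Under that assumption the program lasts about 18.4 min, which is long enough to cover the last selected-ion window, which closes at 15.5 min.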
For the MS system, selected ion monitoring (SIM) mode was used
to detect esters with the following parameters: (a) ethyl
acetate, m/z 45.00 and 61.00 from 4.2 to 4.6 min retention time (RT); (b) isobutyl acetate, m/z 61 and 101 from
6.6 to 7.6 min RT; (c) butyl acetate, m/z 61 and 73 from
7.6 to 8.5 min RT; (d) pentyl acetate, m/z 56, 61, and 73
from 8.5 to 10.1 min RT; (e) isoamyl acetate, m/z 61 and
73 from 10.1 to 10.7 min RT; (f) hexyl acetate, m/z 61 and 129
from 10.7 to 11.5 min RT; (g) octyl acetate, m/z 61 and
173 from 11.5 to 13.2 min RT; (h) n-decane, m/z 78, 99, and
170 from 13.2 to 13.5 min RT; (i) decyl acetate, m/z 61 and
167 from 13.5 to 13.8 min RT; and (j) 2-phenethyl acetate, m/z
61, 104, and 121 from 13.8 to 15.5 min RT.
Coevolution analysis
Protein orthogroups (i.e. both orthologs and paralogs)
and a species tree were determined using OrthoFinder
(see Additional file1: Figure S2 for the species tree and
list of species used) [33]. Coevolution analysis was based
on the correlated presence/absence of orthogroups across
species, similar to phylogenetic profiling, while account-
ing for the shared ancestry of species. Orthogroups were
treated as discrete species traits with a species containing
the orthogroup having a value of 1; otherwise, the spe-
cies was given a value of 0. Orthogroups containing PUFs
identified by LC–MS/MS were then paired with all other
orthogroups. Phylogenetic analysis of discrete trait evo-
lution was performed using the corHMM R package [70],
assuming no hidden states. For each pair of orthogroups,
models were fit either allowing for coevolution or forcing
independent evolution. The two models were compared
using the corrected version of the Akaike Information
Criterion (AICc), which corrects for small sample sizes.
Orthogroups were considered to be coevolving if the
model allowing for coevolution was 2 or more AICc units
better than the model forcing independent evolution.
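The model comparison above can be illustrated with the standard AICc formula, AICc = 2k − 2 ln L + 2k(k + 1)/(n − k − 1). The sketch below is not the corHMM code used in the study. The function names are hypothetical, and the example values are assumptions: 4 and 8 free rates for the independent and dependent models of two binary traits, the number of species as the sample size n, and made-up log-likelihoods.

```python
def aicc(log_likelihood, k, n):
    """Corrected Akaike Information Criterion for k parameters and n samples."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return 2 * k - 2 * log_likelihood + (2 * k * (k + 1)) / (n - k - 1)

def is_coevolving(loglik_dep, k_dep, loglik_indep, k_indep, n, threshold=2.0):
    """Apply the 2-AICc-unit rule: the dependent (coevolution) model must be
    at least `threshold` units better (lower) than the independent model."""
    delta = aicc(loglik_indep, k_indep, n) - aicc(loglik_dep, k_dep, n)
    return delta >= threshold

# Hypothetical log-likelihoods for one orthogroup pair across 30 species.
print(is_coevolving(-40.0, 8, -50.0, 4, n=30))
print(is_coevolving(-44.0, 8, -50.0, 4, n=30))
```

Because the dependent model carries more parameters, a modest gain in likelihood is not enough to pass the threshold, which is the behavior the second call demonstrates.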
Dierential andco‑expression analyses
Differential expression analysis was performed using
the R package limma [35] using the limma-trend
functionality and robust hyperparameter estimation.
Proteins withlow expression (here, defined as a nor-
malized abundance less than 23) on average across both
strains and all time points were excluded due to viola-
tions of limma’s assumptions. Proteins were considered
differentially expressed if they had a Benjamini–Hoch-
berg corrected p-value < 0.05. For the mid-log phase in
LL1210, we chose to treat LL1210 time point T2 as the
mid-log phase based on the PCA (Additional file 10:
Figure S1). Co-expression analysis was performed using
imputed protein abundances for both the Δhpt and
LL1210 strains after filtering out proteins with missing
protein abundances in more than 50% of the measure-
ments for a given strain. Clusters were generated for
each strain using the Python tool clust [38], with fur-
ther removal of genes showing low variation across
time points. Differential co-expression analysis was
performed using the R package DCGL [71] using the
DCp method [72].
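The Benjamini–Hochberg step used to call proteins differentially expressed can be sketched as follows. This illustrates only the multiple-testing adjustment, not limma's moderated statistics, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]  # toy p-values, not data from the study
adjusted = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adjusted]
```

In this toy example only the first protein remains below the 0.05 cut-off after adjustment, even though three raw p-values were below 0.05.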
Gene Ontology enrichment was performed using the
R package topGO [73]. For differential expression and
differential co-expression analysis, the Kolmogorov–
Smirnov test was performed using the weight01 algo-
rithm. For co-expression analysis, the Fisher’s exact test
was used with the weight01 algorithm. We note that,
per the recommendation of the topGO developers, the p
values used for our GO enrichment tests were not corrected for multiple-hypothesis testing. This is because
the weight01 algorithm violates the assumptions of independence (see topGO vignette) made by FDR control
methods such as the Benjamini–Hochberg correction.
Analysis of KEGG terms and modules was performed
using the R package clusterProfiler [74]. We note that GO
enrichment tests and KEGG over-representation tests
were performed using the set of proteins detected via the
LC–MS/MS measurements as the background.
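To make the role of the background set concrete, the sketch below computes a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution) for a single GO term in a single cluster. It is a simplified illustration with a hypothetical function name and toy counts. It does not reproduce topGO's weight01 algorithm, which additionally accounts for the structure of the GO graph.

```python
from math import comb

def enrichment_p(cluster_with_term, cluster_size, background_with_term, background_size):
    """One-sided Fisher's exact test p-value for over-representation.

    The background is the set of proteins detected by LC-MS/MS, so
    background_size counts detected proteins, not all annotated genes.
    """
    k, n = cluster_with_term, cluster_size
    K, N = background_with_term, background_size
    upper = min(n, K)
    # P(X >= k) for X ~ Hypergeometric(N, K, n); comb() returns 0 when
    # the lower argument exceeds the upper one, so impossible tables drop out.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Toy counts: 2 of 4 cluster proteins carry the term, 3 of 10 detected proteins do.
p_value = enrichment_p(2, 4, 3, 10)
```

Restricting the background to detected proteins avoids inflating significance for terms that are simply common among proteins observable by mass spectrometry.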
Sequence‑based functional predictions
In addition to network analysis, other relevant func-
tional features of PUFs were interrogated via a suite of
protein sequence homology approaches. All tools were
run with default settings unless otherwise stated. To
identify possible enzymatic activity, enzyme commission
(EC) numbers and KEGG terms were taken from
PANNZER2 [41] and BlastKOALA [75], respectively.
PANNZER2 was run allowing for 80% minimum alignment
length, minimum query and subject coverage of 0.6,
and a minimum sequence identity of 0.4. Functional/domain
prediction was also performed using eggNOG-mapper [42]
and InterProScan [40]. Gene Ontology
terms were pulled from PANNZER2, InterProScan, and
eggNOG-mapper. Many proteins still had no GO terms
after the initial analysis. These proteins were re-analyzed
with PANNZER2 with a minimum query coverage of 0.2
and a minimum alignment length of 0.2,
and with eggNOG-mapper with a minimum E value of 0.1.
Structural and cellular localization features of PUFs were
further interrogated using SignalP [76], TMHMM [77],
and Swiss-Model servers to determine relevant structural
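The two-pass annotation strategy described in this section (stricter settings first, relaxed settings only for proteins that still lack GO terms) can be sketched as follows. This is a minimal illustration, not the authors' workflow: `strict_annotator` and `relaxed_annotator` are hypothetical callables standing in for separate tool runs, and the protein IDs are invented.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical interface: protein ID -> list of GO terms (empty if none found).
Annotator = Callable[[str], List[str]]


def annotate_two_pass(
    protein_ids: Iterable[str],
    strict_annotator: Annotator,
    relaxed_annotator: Annotator,
) -> Dict[str, Tuple[List[str], str]]:
    """Return GO terms per protein plus the pass that produced them."""
    results: Dict[str, Tuple[List[str], str]] = {}
    for protein_id in protein_ids:
        terms = strict_annotator(protein_id)
        source = "strict"
        if not terms:
            # Only proteins without GO terms are re-analyzed with relaxed settings.
            terms = relaxed_annotator(protein_id)
            source = "relaxed" if terms else "none"
        results[protein_id] = (terms, source)
    return results


# Toy lookups standing in for the two annotation runs.
strict_hits = {"puf1": ["GO:0016746"]}
relaxed_hits = {"puf2": ["GO:0005886"]}
annotations = annotate_two_pass(
    ["puf1", "puf2", "puf3"],
    lambda pid: strict_hits.get(pid, []),
    lambda pid: relaxed_hits.get(pid, []),
)
```

Recording which pass produced each annotation keeps the lower-stringency predictions distinguishable from the default-stringency ones in downstream interpretation.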
Gene regulatory information
Genes which are under the same regulatory control often
serve related functions within the cell. Operon informa-
tion, including annotations, for C. thermocellum was
pulled from the DOOR database [39].
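As an illustration of how operon membership supports guilt-by-association, the sketch below collects the annotations of characterized genes that share an operon with each PUF. The input dictionaries are assumed to have been parsed from an operon table beforehand; the function name, gene IDs, and annotations are hypothetical and do not reflect the DOOR file format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def operon_neighbor_annotations(
    operon_of: Dict[str, str],
    annotation_of: Dict[str, str],
    pufs: Iterable[str],
) -> Dict[str, List[str]]:
    """For each PUF, list annotations of annotated genes in the same operon."""
    members = defaultdict(list)
    for gene, operon in operon_of.items():
        members[operon].append(gene)

    hints: Dict[str, List[str]] = {}
    for puf in pufs:
        operon = operon_of.get(puf)
        if operon is None:
            hints[puf] = []  # PUF not assigned to any operon
            continue
        hints[puf] = sorted(
            {
                annotation_of[gene]
                for gene in members[operon]
                if gene != puf and gene in annotation_of
            }
        )
    return hints


# Hypothetical operon assignments and annotations.
operon_of = {"geneA": "op1", "geneB": "op1", "puf1": "op1", "geneC": "op2", "puf2": "op3"}
annotation_of = {"geneA": "acetyltransferase", "geneB": "ABC transporter", "geneC": "kinase"}
hints = operon_neighbor_annotations(operon_of, annotation_of, ["puf1", "puf2", "puf9"])
```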
Phylogenetic gene trees
Building alocal BLAST database using UniProtKB
To examine possible evolutionary relationships of PUFs
with proteins of known function, phylogenetic gene trees
were created. Homologs for the PUFs of interest were
found using blastp in the BLAST + software suite [78,
79]. FASTA files from Swiss-Prot and TreEMBL were
downloaded from UniProtKB and used to create a cus-
tom protein sequence database. All C. thermocellum
PUFs and DUFs were queried against the custom data-
base using an E value cut-off of 10E5. e searches
were done in CADES server at ORNL.
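The database construction and homology search could be scripted roughly as below. This is a sketch under assumptions rather than the commands actually run: all file and database names are hypothetical, tabular output (`-outfmt 6`) is an illustrative choice, and the E value cut-off is passed in by the caller rather than hard-coded.

```python
import subprocess
from typing import List


def makeblastdb_cmd(fasta: str, db_name: str) -> List[str]:
    """Command to build a protein BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query_fasta: str, db_name: str, evalue: str, out_tsv: str) -> List[str]:
    """Command to search protein queries against the custom database."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", db_name,
        "-evalue", evalue,
        "-outfmt", "6",  # tabular output (illustrative choice)
        "-out", out_tsv,
    ]


def run_search(fasta: str, db_name: str, query_fasta: str, evalue: str, out_tsv: str) -> None:
    """Build the database, then run the search; requires BLAST+ on the PATH."""
    subprocess.run(makeblastdb_cmd(fasta, db_name), check=True)
    subprocess.run(blastp_cmd(query_fasta, db_name, evalue, out_tsv), check=True)


# Hypothetical file names; the E value shown is only an example argument.
db_command = makeblastdb_cmd("uniprotkb_combined.fasta", "uniprotkb_custom")
search_command = blastp_cmd("pufs.fasta", "uniprotkb_custom", "1e-5", "puf_hits.tsv")
```

Building the argument lists separately from running them makes the exact invocation easy to log alongside the results.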
Multiple sequence alignment using MAFFT
Following the BLAST homology search, detected
homologs for each PUF were aligned using the multiple
sequence alignment (MSA) tool MAFFT [80], using
the auto feature to automatically select an appropriate
alignment strategy for the given query. A highly accurate
MSA is necessary for low error rates when computing
phylogenetic gene trees [81]; to this end, alignments
were trimmed using the automated feature of the
MSA trimming tool trimAl [82].
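A sketch of the alignment and trimming step is given below. It assumes that the "auto" feature corresponds to MAFFT's `--auto` option and that the automated trimAl mode corresponds to `-automated1`; the text does not state the exact flags, and the file names are hypothetical.

```python
import subprocess
from typing import List


def mafft_cmd(homologs_fasta: str) -> List[str]:
    """MAFFT command; the alignment is written to standard output."""
    return ["mafft", "--auto", homologs_fasta]


def trimal_cmd(alignment: str, trimmed: str) -> List[str]:
    """trimAl command using its automated trimming heuristic (assumed flag)."""
    return ["trimal", "-in", alignment, "-out", trimmed, "-automated1"]


def align_and_trim(homologs_fasta: str, alignment: str, trimmed: str) -> None:
    """Align homologs, then trim the alignment; requires both tools on the PATH."""
    with open(alignment, "w") as handle:
        subprocess.run(mafft_cmd(homologs_fasta), stdout=handle, check=True)
    subprocess.run(trimal_cmd(alignment, trimmed), check=True)


# Hypothetical per-PUF file names.
align_command = mafft_cmd("puf1_homologs.fasta")
trim_command = trimal_cmd("puf1_homologs.aln", "puf1_homologs.trimmed.aln")
```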
Phylogenetic gene trees using FastTree
FastTree computes approximately maximum-likelihood
phylogenetic trees from MSAs of protein
or nucleotide sequences [83]. Phylogenetic
gene trees were generated from the protein alignments using
the JTT+CAT model, where JTT [85] is a model of
amino acid evolution and CAT is an approximation
used to account for the varying rates of sequence evolution
across amino acid sites [84]. Phylogenetic trees were
visualized using the ggtree R package [86].
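The tree-building call could look like the sketch below. It relies on the assumption that the installed FastTree build uses JTT with the CAT approximation by default for protein alignments, so no model flag is passed; the executable name `FastTree` and the file names are also assumptions.

```python
import subprocess
from typing import List


def fasttree_cmd(trimmed_alignment: str) -> List[str]:
    """FastTree command for a protein alignment; the tree goes to standard output."""
    # No model flag: protein input is assumed to default to JTT+CAT.
    return ["FastTree", trimmed_alignment]


def build_tree(trimmed_alignment: str, newick_out: str) -> None:
    """Write the Newick tree for one alignment; requires FastTree on the PATH."""
    with open(newick_out, "w") as handle:
        subprocess.run(fasttree_cmd(trimmed_alignment), stdout=handle, check=True)


# Hypothetical per-PUF file name.
tree_command = fasttree_cmd("puf1_homologs.trimmed.aln")
```

The resulting Newick file is the kind of input that tree visualization packages such as ggtree read.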
Abbreviations
PUFs: Proteins of unknown function; DUFs: Domains of unknown function; GO:
Gene Ontology; GC/MS: Gas chromatography coupled with mass spectrometry;
LC/MS: Liquid chromatography coupled with mass spectrometry; RT:
Retention time; MSA: Multiple sequence alignment; GBA: Guilt-by-association.
Supplementary Information
The online version contains supplementary material available at
https://doi.org/10.1186/s13068-021-01964-4.
Additional file 1: Figure S2. Species tree estimated by OrthoFinder and
used as the tree for the coevolution analysis.
Additional file 2. Enriched GO terms of proteins shown to be coevolving
with PUFs.
Additional file 3. GO enrichment analyses of differentially expressed
Additional file 4. KEGG enrichment analyses of differentially expressed
Additional file 5: Figure S3. Volcano plots highlighting significantly
different PUFs across strains in A) early-log phase, B) mid-log phase, and C)
late-log phase.
Additional file 6: Figure S4. All clusters generated by clust for the Δhpt
Additional file 7: Figure S5. All clusters generated by clust for the LL1210
Additional file 8. Summary of information for each PUF detected by
Additional file 9: Table S1. Strains and plasmids used in validation of
alcohol acetyl transferase WP_003519433.1.
Additional file 10: Figure S1. PCA analysis of protein abundances across
strains and time points.
Acknowledgements
This study was funded by the Center for Bioenergy Innovation (CBI), a U.S.
Department of Energy Bioenergy Research Center supported by the Office
of Biological and Environmental Research in the DOE Office of Science. Oak
Ridge National Laboratory is managed by University of Tennessee-Battelle LLC
for the Department of Energy under contract DE-AC05-00OR22725.
This manuscript has been authored by UT-Battelle, LLC under Contract
No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United
States Government retains and the publisher, by accepting the article for
publication, acknowledges that the United States Government retains a non-
exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the
published form of this manuscript, or allow others to do so, for United States
Government purposes. The Department of Energy will provide public access
to these results of federally sponsored research in accordance with the DOE
Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Authors’ contributions
R.L.H., S.P., and A.L.C. designed the experiment and outlined the research plan.
S.P. performed the LC–MS/MS measurements and data analysis, and wrote/edited
the manuscript. A.L.C. performed a majority of the data analyses and wrote/edited
the manuscript. K.O. and A.M.G. provided and grew C. thermocellum strains for
the LC–MS/MS measurements and wrote/edited the manuscript. H.S. and C.T.T. performed
the characterization experiments and wrote/edited the manuscript. R.L.H.
also assisted in writing and editing the manuscript. All authors have read and
approved the final manuscript.
Funding
This study was funded by the Center for Bioenergy Innovation (CBI), a U.S.
Department of Energy Bioenergy Research Center supported by the Office of
Biological and Environmental Research in the DOE Office of Science.
Availability of data and materials
All raw mass spectra for the proteome measurements have been deposited
into the ProteomeXchange repository under the following accession numbers:
MassIVE accession MSV000085237; ProteomeXchange accession PXD018407.
FTP link to files: ftp://MSV000085237@massive.ucsd.edu (username
MSV000085237, password PUF123).
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Author details
1 Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831,
USA. 2 The Center for Bioenergy Innovation at Oak Ridge National Laboratory,
Oak Ridge, TN, USA. 3 The Graduate School of Genome Science and Technol-
ogy, University of Tennessee, Knoxville, TN, USA. 4 The Bredesen Center, Uni-
versity of Tennessee, Knoxville, TN, USA. 5 Department of Chemical and Biomo-
lecular Engineering, University of Tennessee, Knoxville, TN, USA.
Received: 29 December 2020 Accepted: 26 April 2021
References
1. Yutin N, Galperin MY. A genomic update on clostridial phylogeny: gram-negative
spore formers and other misplaced clostridia. Environ Microbiol.
2013;15:2631–41. https://doi.org/10.1111/1462-2920.12173.
2. Zhang X, Tu B, Dai LR, Lawson PA, Zheng ZZ, Liu LY, et al. Petroclostridium
xylanilyticum gen. nov., sp. nov., a xylan-degrading bacterium isolated
from an oilfield, and reclassification of clostridial cluster III members
into four novel genera in a new Hungateiclostridiaceae fam. nov. Int J
Syst Evol Microbiol. 2018;68:3197–211. https://doi.org/10.1099/ijsem.0.
3. Tindall BJ. The names Hungateiclostridium Zhang et al. 2018, Hun-
gateiclostridium thermocellum (Viljoen et al. 1926) Zhang et al. 2018,
Hungateiclostridium cellulolyticum (Patel et al. 1980) Zhang et al. 2018,
Hungateiclostridium aldrichii (Yang et al. 1990) Zhang et. Int J Syst Evol
Microbiol. 2019;69:3927–32. https://www.microbiologyresearch.org/docserver/fulltext/ijsem/69/12/3927_ijsem003685.pdf?expires=1614711788&id=id&accname=guest&checksum=60B506A014E496D269B93BFBE549E525.
Accessed 2 Mar 2021.
4. Argyros DA, Tripathi SA, Barrett TF, Rogers SR, Feinberg LF, Olson
DG, et al. High ethanol titers from cellulose by using metabolically
engineered thermophilic, anaerobic microbes. Appl Environ Microbiol.
5. Deng Y, Olson DG, Zhou J, Herring CD, Joe Shaw A, Lynd LR. Redirecting
carbon flux through exogenous pyruvate kinase to achieve high ethanol
yields in Clostridium thermocellum. Metab Eng. 2013;15:151–8.
6. Papanek B, Biswas R, Rydzak T, Guss AM. Elimination of metabolic path-
ways to all traditional fermentation products increases ethanol yields in
Clostridium thermocellum. Metab Eng. 2015;32:49–54.
7. Biswas R, Prabhu S, Lynd LR, Guss AM. Increase in ethanol yield via elimi-
nation of lactate production in an ethanol-tolerant mutant of Clostridium
thermocellum. PLoS ONE. 2014;9:e86389.
8. Biswas R, Zheng T, Olson DG, Lynd LR, Guss AM. Elimination of hydroge-
nase active site assembly blocks H2 production and increases ethanol
yield in Clostridium thermocellum. Biotechnol Biofuels. 2015;8:20.
http://www.biotechnologyforbiofuels.com/content/8/1/20. Accessed 15 Apr
9. Akinosho H, Yee K, Close D, Ragauskas A. The emergence of Clostridium
thermocellum as a high utility candidate for consolidated bioprocessing
applications. Front Chem [Internet]. Frontiers Media S.A.; 2014;2.
www.frontiersin.org. Accessed 4 Mar 2021.
10. Whitham JM, Moon J-W, Rodriguez M, Engle NL, Klingeman DM, Rydzak
T, et al. Clostridium thermocellum LL1210 pH homeostasis mechanisms
informed by transcriptomics and metabolomics. Biotechnol Biofuels.
2018;11:98. https://doi.org/10.1186/s13068-018-1095-y.
11. Tian L, Papanek B, Olson DG, Rydzak T, Holwerda EK, Zheng T, et al.
Simultaneous achievement of high ethanol yield and titer in Clostridium
thermocellum. Biotechnol Biofuels. 2016;9:116.
https://doi.org/10.1186/s13068-016-0528-8.
12. Poudel S, Giannone RJ, Basen M, Nookaew I, Poole FL, Kelly RM, et al. The
diversity and specificity of the extracellular proteome in the cellulolytic
bacterium Caldicellulosiruptor bescii is driven by the nature of the cel-
lulosic growth substrate. Biotechnol Biofuels. 2018;11:80.
13. Marchler-Bauer A, Bo Y, Han L, He J, Lanczycki CJ, Lu S, et al. CDD/SPAR-
CLE: functional classification of proteins via subfamily domain architec-
tures. Nucleic Acids Res. 2017;45:D200–3.
14. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al.
A large-scale evaluation of computational protein function prediction.
Nat Methods. 2013;10:221–7.
15. Jiang Y, Oron TR, Clark WT, Bankapur AR, D’Andrea D, Lepore R, et al. An
expanded evaluation of protein function prediction methods shows an
improvement in accuracy. Genome Biol. 2016;17:184.
https://doi.org/10.1186/s13059-016-1037-6.
16. Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, et al. The
CAFA challenge reports improved protein function prediction and
new functional annotations for hundreds of genes through experimental
screens. Genome Biol. 2019;20:244.
https://doi.org/10.1186/s13059-019-1835-8.
17. Webb B, Sali A. Protein structure modeling with MODELLER. Methods
Mol Biol. 2014;1137:1–15.
18. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy
SR, et al. Pfam: the protein families database. Nucleic Acids Res.
19. McKay T, Hart K, Horn A, Kessler H, Dodge G, Bardhi K, et al. Annotation
of proteins of unknown function: initial enzyme results. J Struct Funct.
20. Nadzirin N, Firdaus-Raih M. Proteins of unknown function in the
protein data bank (PDB): an inventory of true uncharacterized proteins
and computational tools for their analysis. Int J Mol Sci MDPI AG.
21. Niehaus TD, Thamm AMK, De Crécy-Lagard V, Hanson AD. Proteins of
unknown biochemical function: a persistent problem and a roadmap
to help overcome it. Plant Physiol. 2015;169:1436–42.
22. Mills CL, Beuning PJ, Ondrechen MJ. Biochemical functional predic-
tions for protein structures of unknown or uncertain function. Comput
Struct Biotechnol J. 2015;13:182–91.
23. Goodacre NF, Gerloff DL, Uetz P. Protein domains of unknown function
are essential in bacteria. MBio. 2013;5:e00744.
24. Ellens KW, Christian N, Singh C, Satagopam VP, May P, Linster CL. Con-
fronting the catalytic dark matter encoded by sequenced genomes.
Nucleic Acids Res. 2017;45:11495–514.
25. Frishman D. Protein annotation at genomic scale: the current status.
Chem Rev American Chemical Society. 2007;107:3448–66.
26. Hanson AD, Pribat A, de Crécy-Lagard V. “Unknown” proteins and
“orphan” enzymes: the missing half of the engineering parts list - and
how to find it. Biochem J Portland Press. 2010;425:1–11.
27. Walker MG, Volkmuth W, Sprinzak E, Hodgson D, Klingler T. Predic-
tion of gene function by genome-scale expression analysis: prostate
cancer-associated genes. Genome Res. 1999;9:1198–203.
28. Wolfe CJ, Kohane IS, Butte AJ. Systematic survey reveals general applicability
of “guilt-by-association” within gene coexpression networks.
BMC Bioinform. 2005;6:227. https://doi.org/10.1186/1471-2105-6-227.
29. Oliver S. Guilt-by-association goes global. Nature. 2000;403:601–3.
30. Gillis J, Pavlidis P. “Guilt by association” is the exception rather than the
rule in gene networks. PLoS Comput Biol. 2012;8:e1002444.
31. Gillis J, Pavlidis P. The impact of multifunctional genes on “Guilt by
Association” analysis. PLoS ONE. 2011;6:e17258.
https://doi.org/10.1371/journal.pone.0017258.
32. Jansen R, Greenbaum D, Gerstein M. Relating whole-genome
expression data with protein-protein interactions. Genome Res.
33. Emms DM, Kelly S. OrthoFinder: Phylogenetic orthology inference for
comparative genomics. Genome Biol. 2019;20:238.
https://doi.org/10.1186/s13059-019-1832-y.
34. Barker D, Pagel M. Predicting functional gene links from phylogenetic-statistical
analyses of whole genomes. PLoS Comput Biol. 2005;1:e3.
https://doi.org/10.1371/journal.pcbi.0010003.
35. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers
differential expression analyses for RNA-sequencing and microarray stud-
ies. Nucleic Acids Res. 2015;43:e47.
36. Shi L, Tu BP. Acetyl-CoA and the regulation of metabolism: mechanisms
and consequences. Curr Opin Cell Biol. 2015;33:125–31.
37. Poudel S, Giannone RJ, Rodriguez M, Raman B, Martin MZ, Engle NL, et al.
Integrated omics analyses reveal the details of metabolic adaptation of
Clostridium thermocellum to lignocellulose-derived growth inhibitors
released during the deconstruction of switchgrass. Biotechnol Biofuels.
2017;10:1–14. https://doi.org/10.1186/s13068-016-0697-5.
38. Abu-Jamous B, Kelly S. Clust: automatic extraction of optimal co-
expressed gene clusters from gene expression data. Genome Biol.
2018;19:172. https://doi.org/10.1186/s13059-018-1536-8.
39. Mao X, Ma Q, Zhou C, Chen X, Zhang H, Yang J, et al. DOOR 2.0: present-
ing operons and their functions through dynamic and integrated views.
Nucleic Acids Res. 2014;42:D654–9.
40. Zdobnov EM, Apweiler R. InterProScan-An integration platform for the
signature-recognition methods in InterPro. Bioinformatics. 2001;17:847–8.
41. Törönen P, Medlar A, Holm L. PANNZER2: a rapid functional annotation
web server. Nucleic Acids Res. 2018;46:W84–8.
42. Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, Von Mer-
ing C, et al. Fast genome-wide functional annotation through orthology
assignment by eggNOG-mapper. Mol Biol Evol. 2017;34:2115–22.
43. Lonvaud A, Albertin W, Beltran G, González B, Vázquez J, Cullen PJ, et al.
Aromatic Amino Acid-Derived Compounds Induce Morphological
Changes and Modulate the Cell Growth of Wine Yeast Species. Front
Microbiol. 2018;9:1–16. www.frontiersin.org. Accessed 25 Nov 2020.
44. Layton DS, Trinh CT. Engineering modular ester fermentative pathways in
Escherichia coli. Metab Eng. 2014;26:77–88.
45. Layton DS, Trinh CT. Microbial synthesis of a branched-chain ester
platform from organic waste carboxylates. Metab Eng Commun.
46. Layton DS, Trinh CT. Expanding the modular ester fermentative pathways
for combinatorial biosynthesis of esters from volatile organic acids.
Biotechnol Bioeng. 2016;113:1764–76. https://doi.org/10.1002/bit.25947.
47. Rodriguez GM, Tashiro Y, Atsumi S. Expanding ester biosynthesis in
Escherichia coli. Nat Chem Biol. 2014;10:259–65.
48. Joshi T, Xu D. Quantitative assessment of relationship between sequence
similarity and function similarity. BMC Genomics. 2007;8:222.
https://doi.org/10.1186/1471-2164-8-222.
49. Papanek B, O’Dell KB, Manga P, Giannone RJ, Klingeman DM, Hettich RL,
et al. Transcriptomic and proteomic changes from medium supplemen-
tation and strain evolution in high-yielding Clostridium thermocellum
strains. J Ind Microbiol Biotechnol. 2018;45:1007–15.
https://doi.org/10.1007/s10295-018-2073-x.
50. Van Dam S, Võsa U, Van Der Graaf A, Franke L, De Magalhães JP. Gene
co-expression analysis for functional classification and gene-disease predictions.
Brief Bioinform. 2018;19:575–92. http://pcwww.liv.ac.uk/~aging/.
Accessed 30 Apr 2020.
51. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assign-
ing protein functions by comparative genome analysis: protein phyloge-
netic profiles. Proc Natl Acad Sci USA. 1999;96:4285–8.
52. Drummond DA, Raval A, Wilke CO. A single determinant dominates the
rate of yeast protein evolution. Mol Biol Evol. 2005;23:327–37.
53. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. Why highly
expressed proteins evolve slowly. Proc Natl Acad Sci. 2005;102:14338–43.
54. Bloom JD, Drummond DA, Arnold FH, Wilke CO. Structural determinants
of the rate of protein evolution in yeast. Mol Biol Evol. 2006;23:1751–61.
55. Fraser HB, Hirsh AE, Wall DP, Eisen MB. Coevolution of gene expression
among interacting proteins. Proc Natl Acad Sci USA. 2004;101:9033–8.
56. Clark NL, Alani E, Aquadro CF. Evolutionary rate covariation reveals shared
functionality and coexpression of genes. Genome Res. 2012;22:714–20.
57. Martin T, Fraser HB. Comparative expression profiling reveals widespread
coordinated evolution of gene expression across eukaryotes. Nat Com-
mun. 2018;9:4963.
58. Cope AL, O’Meara BC, Gilchrist MA. Gene expression of functionally-
related genes coevolves across fungal species: detecting coevolution
of gene expression using phylogenetic comparative methods. BMC
Genomics. 2020;21:370. https://doi.org/10.1186/s12864-020-6761-3.
59. Sharp PM, Li W. The codon adaptation index - a measure of directional
synonymous codon usage bias, and its potential applications. Nucl Acids
Res. 1987;15:1281–95.
60. Laurent JM, Vogel C, Kwon T, Craig SA, Boutz DR, Huse HK, et al. Protein
abundances are more conserved than mRNA abundances across diverse
taxa. Proteomics. 2010;10:4209–12. https://doi.org/10.1002/pmic.20100
61. Rydzak T, Lynd LR, Guss AM. Elimination of formate production in
Clostridium thermocellum. J Ind Microbiol Biotechnol Springer Verlag.
62. Li W, Godzik A. Cd-hit: A fast program for clustering and comparing large
sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
63. Diament BJ, Noble WS. Faster SEQUEST searching for peptide identifica-
tion from tandem mass spectra. J Proteome Res. 2011;10:3871–9.
64. Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised
learning for peptide identification from shotgun proteomics datasets. Nat
Methods. 2007;4:923–5.
65. Argentini A, Goeminne LJE, Verheggen K, Hulstaert N, Staes A, Clement
L, et al. MoFF: a robust and automated approach to extract peptide ion
intensities. Nat Methods. 2016;13:964–6.
66. Dumon-Seignovert L, Cariot G, Vuillard L. The toxicity of recombinant
proteins in Escherichia coli: a comparison of overexpression in BL21(DE3),
C41(DE3), and C43(DE3). Protein Expr Purif. 2004;37:203–6.
67. Seo H, Lee JW, Garcia S, Trinh CT. Single mutation at a highly conserved
region of chloramphenicol acetyltransferase enables isobutyl acetate
production directly from cellulose by Clostridium thermocellum at
elevated temperatures. Biotechnol Biofuels. 2019;12:245.
https://doi.org/10.1186/s13068-019-1583-8.
68. Seo H, Lee JW, Giannone RJ, Dunlap NJ, Trinh CT. Repurposing chlo-
ramphenicol acetyltransferase for a robust and efficient designer ester
biosynthesis platform. bioRxiv. 2020. https://doi.org/10.1101/2020.11.04.
69. Lee JW, Trinh CT. Microbial biosynthesis of lactate esters. Biotechnol
Biofuels. 2019. https://doi.org/10.1186/s13068-019-1563-z.
70. Beaulieu J, Oliver J, O’Meara BC. corHMM: Analysis of binary character
evolution. R package version 1.22. 2017.
https://cran.r-project.org/package=corHMM
71. Yang J, Yu H, Liu B-H, Zhao Z, Liu L, Ma L-X, et al. DCGL: Differential
Co-expression Analysis and Differential Regulation Analysis of Gene
Expression Microarray Data. R package version 2.1.2. 2014.
https://cran.r-project.org/package=DCGL
72. Yu H, Liu BH, Ye ZQ, Li C, Li YX, Li YY. Link-based quantitative methods to
identify differentially coexpressed genes and gene pairs. BMC Bioinform.
2011;12:315. https://doi.org/10.1186/1471-2105-12-315.
73. Alexa A, Rahnenfuhrer J. topGO: Enrichment Analysis for Gene Ontology.
R package version 2.38.1. 2019.
74. Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing
biological themes among gene clusters. OMICS A J Integr Biol.
75. Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG
tools for functional characterization of genome and metagenome
sequences. J Mol Biol. 2016;428:726–31.
76. Petersen TM, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discrimi-
nating signal peptides from transmembrane regions. Nat Methods.
77. Krogh A, Larsson B, Von Heijne G, Sonnhammer ELL. Predicting trans-
membrane protein topology with a hidden Markov model: application to
complete genomes. J Mol Biol. 2001;305:567–80.
78. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment
search tool. J Mol Biol. 1990;215:403–10.
79. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al.
BLAST+: Architecture and applications. BMC Bioinform. 2009;10:421.
80. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid
multiple sequence alignment based on fast Fourier transform. Nucleic
Acids Res. 2002;30:3059–66.
81. Liu K, Linder CR, Warnow T. Multiple sequence alignment: a major chal-
lenge to large-scale phylogenetics. PLoS Curr. 2010;2:RRN1198.
82. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for
automated alignment trimming in large-scale phylogenetic analyses.
Bioinformatics. 2009;25:1972–3.
83. Price MN, Dehal PS, Arkin AP. FastTree 2 - Approximately maximum-
likelihood trees for large alignments. PLoS ONE. 2010;5:e9490.
84. Stamatakis A. Phylogenetic models of rate heterogeneity: A high per-
formance computing perspective. 20th Int Parallel Distrib Process Symp
IPDPS 2006. IEEE Computer Society; 2006.
85. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data
matrices from protein sequences. Bioinformatics. 1992;8:275–82.
86. Yu G, Smith DK, Zhu H, Guan Y, Lam TT. ggtree: an R package for visualiza-
tion and annotation of phylogenetic trees with their covariates and other
associated data. Methods Ecol Evol. 2017;8:28–36.
https://doi.org/10.1111/2041-210X.12628.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in pub-
lished maps and institutional affiliations.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
... optStoic results are inherently dependent upon the quality of the genome annotation. Given that bioinformatically predicted annotations are not always correct, as seen for clo1313_1686, and that approximately 20% of the C. thermocellum genome encodes proteins of unknown function (also called hypothetical proteins) (53), it is possible that important PP i -supplying mechanisms are currently missed. In an attempt to address this possibility, the KEGG database was probed for PP i -generating cycles carrying out the net conversion, ATP eq 1 P i ! ...
Full-text available
The atypical glycolysis of Clostridium thermocellum is characterized by the use of pyrophosphate (PP i ) as phosphoryl donor for phosphofructokinase (Pfk) and pyruvate phosphate dikinase (Ppdk) reactions. Previously, biosynthetic PP i was calculated to be stoichiometrically insufficient to drive glycolysis. This study investigates the role of a H ⁺ -pumping membrane-bound pyrophosphatase, glycogen cycling, a predicted Ppdk–malate shunt cycle and acetate cycling in generating PP i . Knockout studies and enzyme assays confirmed that clo1313_0823 encodes a membrane-bound pyrophosphatase. Additionally, clo1313_0717-0718 was confirmed to encode ADP-glucose synthase by knockouts, glycogen measurements in C. thermocellum and heterologous expression in E. coli . Unexpectedly, individually-targeted gene deletions of the four putative PP i sources did not have a significant phenotypic effect. Although combinatorial deletion of all four putative PP i sources reduced the growth rate by 22% (0.30±0.01 h ⁻¹ ) and the biomass yield by 38% (0.18±0.00 g biomass g substrate ⁻¹ ), this change was much smaller than what would be expected for stoichiometrically essential PP i -supplying mechanisms. Growth-arrested cells of the quadruple knockout readily fermented cellobiose indicating that the unknown PP i -supplying mechanisms are independent of biosynthesis. An alternative hypothesis that ATP-dependent Pfk activity circumvents a need for PP i altogether, was falsified by enzyme assays, heterologous expression of candidate genes and whole-genome sequencing. As a secondary outcome, enzymatic assays confirmed functional annotation of clo1313_1832 as ATP- and GTP-dependent fructokinase. These results indicate that the four investigated PP i sources individually and combined play no significant PP i -supplying role and the true source(s) of PP i , or alternative phosphorylating mechanisms, that drive glycolysis in C. thermocellum remain(s) elusive. 
IMPORTANCE Increased understanding of the central metabolism of C. thermocellum is important from a fundamental as well as from a sustainability and industrial perspective. In addition to showing that H ⁺ -pumping membrane-bound PPase, glycogen cycling, a Ppdk–malate shunt cycle, and acetate cycling are not significant sources of PP i supply, this study adds functional annotation of four genes and availability of an updated PP i stoichiometry from biosynthesis to the scientific domain. Together, this aids future metabolic engineering attempts aimed to improve C. thermocellum as a cell factory for sustainable and efficient production of ethanol from lignocellulosic material through consolidated bioprocessing with minimal pretreatment. Getting closer to elucidating the elusive source of PP i , or alternative phosphorylating mechanisms, for the atypical glycolysis is itself of fundamental importance. Additionally, the findings of this study directly contribute to investigations into trade-offs between thermodynamic driving force versus energy yield of PP i - and ATP-dependent glycolysis.
Robust and efficient enzymes are essential modules for metabolic engineering and synthetic biology strategies across biological systems to engineer whole-cell biocatalysts. By condensing an acyl-CoA and an alcohol, alcohol acyltransferases (AATs) can serve as an interchangeable metabolic module for microbial biosynthesis of a diverse class of ester molecules with broad applications as flavors, fragrances, solvents, and drop-in biofuels. However, the current lack of robust and efficient AATs significantly limits their compatibility with heterologous precursor pathways and microbial hosts. Through bioprospecting and rational protein engineering, we identified and repurposed chloramphenicol acetyltransferases (CATs) from mesophilic prokaryotes to function as robust and efficient AATs compatible with at least 21 alcohol and 8 acyl-CoA substrates for microbial biosynthesis of linear, branched, saturated, unsaturated and/or aromatic esters. By plugging the best engineered CAT (CATec3 Y20F) into the gram-negative mesophilic bacterium Escherichia coli, we demonstrated that the recombinant strain could effectively convert various alcohols into desirable esters, for instance, achieving a titer of 13.9 g/L isoamyl acetate with 95% conversion by fed-batch fermentation. The recombinant E. coli was also capable of simulating the ester profile of roses with high conversion (>97%) and titer (>1 g/L) from fermentable sugars at 37°C. Likewise, a recombinant gram-positive, cellulolytic, thermophilic bacterium Clostridium thermocellum harboring CATec3 Y20F could produce many of these esters from recalcitrant cellulosic biomass at elevated temperatures (>50°C) due to the engineered enzyme’s remarkable thermostability. Overall, the engineered CATs can serve as a robust and efficient platform for designer ester biosynthesis from renewable and sustainable feedstocks.
Background: Researchers often measure changes in gene expression across conditions to better understand the shared functional roles and regulatory mechanisms of different genes. Analogous to this is comparing gene expression across species, which can improve our understanding of the evolutionary processes shaping the evolution of both individual genes and functional pathways. One area of interest is determining genes showing signals of coevolution, which can also indicate potential functional similarity, analogous to co-expression analysis often performed across conditions for a single species. However, as with any trait, comparing gene expression across species can be confounded by the non-independence of species due to shared ancestry, making standard hypothesis testing inappropriate. Results: We compared RNA-Seq data across 18 fungal species using a multivariate Brownian Motion phylogenetic comparative method (PCM), which allowed us to quantify coevolution between protein pairs while directly accounting for the shared ancestry of the species. Our work indicates that proteins which physically interact show stronger signals of coevolution than randomly generated pairs. Interactions with stronger empirical and computational evidence also show stronger signals of coevolution. We examined the effects of number of protein interactions and gene expression levels on coevolution, finding that both factors are overall poor predictors of the strength of coevolution between a protein pair. Simulations further demonstrate the potential issues of analyzing gene expression coevolution without accounting for shared ancestry in a standard hypothesis testing framework.
Furthermore, our simulations indicate that the use of a randomly generated null distribution as a means of determining statistical significance for detecting coevolving genes with phylogenetically uncorrected correlations, as has previously been done, is less accurate than PCMs, although it is a significant improvement over standard hypothesis testing. These methods are further improved by using a phylogenetically corrected correlation metric. Conclusions: Our work highlights potential benefits of using PCMs to detect gene expression coevolution from high-throughput omics scale data. This framework can be built upon to investigate other evolutionary hypotheses, such as changes in transcription regulatory mechanisms across species.
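To make the "randomly generated null distribution" idea in the abstract above concrete, here is a minimal Python sketch. It is an illustration only, not the cited authors' code: the function names, the toy data, and the choice of Pearson correlation are all assumptions. It also implements the phylogenetically uncorrected variant, which the abstract reports as less accurate than a PCM.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def empirical_p_value(expr, gene_a, gene_b, n_random=1000, seed=0):
    """Share of randomly drawn gene pairs with |r| >= the observed |r|.

    expr maps a gene name to its expression values, one per species.
    No correction for shared ancestry is applied (assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = abs(pearson(expr[gene_a], expr[gene_b]))
    genes = list(expr)
    hits = 0
    for _ in range(n_random):
        a, b = rng.sample(genes, 2)  # two distinct genes, drawn at random
        if abs(pearson(expr[a], expr[b])) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_random + 1)


def make_toy_expression(n_genes=50, n_species=18, seed=1):
    """Hypothetical data: independent genes, except g1 closely tracks g0."""
    rng = random.Random(seed)
    expr = {f"g{i}": [rng.gauss(0, 1) for _ in range(n_species)]
            for i in range(n_genes)}
    expr["g1"] = [v + 0.1 * rng.gauss(0, 1) for v in expr["g0"]]
    return expr


if __name__ == "__main__":
    toy = make_toy_expression()
    print(empirical_p_value(toy, "g0", "g1"))
```

In this toy setting the pair `g0`/`g1` should receive a small empirical p-value because few random pairs correlate as strongly. With real cross-species data, related species would inflate correlations for many pairs, which is the confounding the abstract attributes to shared ancestry.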
Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder's high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics. Each output is benchmarked on appropriate real or simulated datasets, and where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder's comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. OrthoFinder is available at
Background: To produce second-generation biofuels, enzymatic catalysis is required to convert cellulose from lignocellulosic biomass into fermentable sugars. β-Glucosidases finalize the process by hydrolyzing cellobiose into glucose, so the efficiency of cellulose hydrolysis largely depends on the quantity and quality of these enzymes used during saccharification. Accordingly, to reduce biofuel production costs, new microbial strains are needed that can produce highly efficient enzymes on a large scale. Results: We heterologously expressed the fungal β-glucosidase D2-BGL from a Taiwanese indigenous fungus Chaetomella raphigera in Pichia pastoris for constitutive production by fermentation. Recombinant D2-BGL presented significantly higher substrate affinity than the commercial β-glucosidase Novozyme 188 (N188; Km = 0.2 vs 2.14 mM for p-nitrophenyl β-d-glucopyranoside and 0.96 vs 2.38 mM for cellobiose). When combined with RUT-C30 cellulases, it hydrolyzed acid-pretreated lignocellulosic biomasses more efficiently than the commercial cellulase mixture CTec3. The extent of conversion from cellulose to glucose was 83% for sugarcane bagasse and 63% for rice straws. Compared to N188, use of D2-BGL halved the time necessary to produce maximal levels of ethanol by a semi-simultaneous saccharification and fermentation process. We upscaled production of recombinant D2-BGL to 33.6 U/mL within 15 days using a 1-ton bioreactor. Crystal structure analysis revealed that D2-BGL belongs to glycoside hydrolase (GH) family 3. Removing the N-glycosylation N68 or O-glycosylation T431 residues by site-directed mutagenesis negatively affected enzyme production in P. pastoris. The F256 substrate-binding residue in D2-BGL is located in a shorter loop surrounding the active site pocket relative to that of Aspergillus β-glucosidases, and this short loop is responsible for its high substrate affinity toward cellobiose. 
Conclusions: D2-BGL is an efficient supplement for lignocellulosic biomass saccharification, and we upscaled production of this enzyme using a 1-ton bioreactor. Enzyme production could be further improved using optimized fermentation, which could reduce biofuel production costs. Our structure analysis of D2-BGL offers new insights into GH3 β-glucosidases, which will be useful for strain improvements via a structure-based mutagenesis approach.
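The Km values reported in the abstract above can be read through the standard Michaelis-Menten relation v/Vmax = [S]/(Km + [S]). The short Python sketch below does that for the cellobiose values. It assumes simple Michaelis-Menten kinetics, and the substrate concentration is a hypothetical choice; the function and variable names are illustrative. It compares only how saturated each enzyme is, not absolute rates, since Vmax is not given here.

```python
def fractional_rate(substrate_mM, km_mM):
    """Return v / Vmax under Michaelis-Menten kinetics (both inputs in mM)."""
    return substrate_mM / (km_mM + substrate_mM)


# Km for cellobiose as reported in the abstract (mM).
KM_CELLOBIOSE_MM = {"D2-BGL": 0.96, "N188": 2.38}

# Hypothetical substrate concentration chosen for illustration only.
CELLOBIOSE_MM = 1.0

if __name__ == "__main__":
    for enzyme, km in KM_CELLOBIOSE_MM.items():
        saturation = fractional_rate(CELLOBIOSE_MM, km)
        print(f"{enzyme}: v/Vmax = {saturation:.2f} at {CELLOBIOSE_MM} mM cellobiose")
```

A lower Km means the enzyme reaches half of its maximal rate at a lower substrate concentration, which is why D2-BGL is described as having the higher substrate affinity.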
Background: Esters are versatile chemicals and potential drop-in biofuels. To develop a sustainable production platform, microbial ester biosynthesis using alcohol acetyltransferases (AATs) has been studied for decades. The volatility of esters gives high-temperature fermentation an advantage in downstream product separation. However, due to the limited thermostability of known AATs, ester biosynthesis has largely relied on the use of mesophilic microbes. Therefore, developing thermostable AATs is important for ester production directly from lignocellulosic biomass by thermophilic consolidated bioprocessing (CBP) microbes, e.g., Clostridium thermocellum. Results: In this study, we engineered a thermostable chloramphenicol acetyltransferase from Staphylococcus aureus (CATSa) for enhanced isobutyl acetate production at elevated temperatures. We first analyzed the broad alcohol substrate range of CATSa. Then, we targeted a highly conserved region in the binding pocket of CATSa for mutagenesis. The mutagenesis revealed that F97W significantly increased conversion of isobutanol to isobutyl acetate. Using CATSa F97W, we demonstrated direct conversion of cellulose into isobutyl acetate by an engineered C. thermocellum at elevated temperatures. Conclusions: This study highlights that CAT is a potential thermostable AAT that can be harnessed to develop the thermophilic CBP microbial platform for biosynthesis of designer bioesters directly from lignocellulosic biomass.
Background: Green organic solvents such as lactate esters have broad industrial applications and favorable environmental profiles. Thus, manufacturing and use of these biodegradable solvents from renewable feedstocks help benefit the environment. However, to date, the direct microbial biosynthesis of lactate esters from fermentable sugars has not yet been demonstrated. Results: In this study, we present a microbial conversion platform for direct biosynthesis of lactate esters from fermentable sugars. First, we designed a pyruvate-to-lactate ester module, consisting of a lactate dehydrogenase (ldhA) to convert pyruvate to lactate, a propionate CoA-transferase (pct) to convert lactate to lactyl-CoA, and an alcohol acyltransferase (AAT) to condense lactyl-CoA and alcohol(s) to make lactate ester(s). By generating a library of five pyruvate-to-lactate ester modules with divergent AATs, we screened for the best module(s) capable of producing a wide range of linear, branched, and aromatic lactate esters with an external alcohol supply. By co-introducing a pyruvate-to-lactate ester module and an alcohol (i.e., ethanol, isobutanol) module into a modular Escherichia coli (chassis) cell, we demonstrated for the first time the microbial biosynthesis of ethyl and isobutyl lactate esters directly from glucose. In an attempt to enhance ethyl lactate production as a proof-of-study, we re-modularized the pathway into (1) the upstream module to generate the ethanol and lactate precursors and (2) the downstream module to generate lactyl-CoA and condense it with ethanol to produce the target ethyl lactate. By manipulating the metabolic fluxes of the upstream and downstream modules through plasmid copy numbers, promoters, ribosome binding sites, and environmental perturbation, we were able to probe and alleviate the metabolic bottlenecks by improving ethyl lactate production by 4.96-fold. 
We found that AAT is the most rate-limiting step in biosynthesis of lactate esters likely due to its low activity and specificity toward the non-natural substrate lactyl-CoA and alcohols. Conclusions: We have successfully established the biosynthesis pathway of lactate esters from fermentable sugars and demonstrated for the first time the direct fermentative production of lactate esters from glucose using an E. coli modular cell. This study defines a cornerstone for the microbial production of lactate esters as green solvents from renewable resources with novel industrial applications.
Comparative studies of gene expression across species have revealed many important insights, but have also been limited by the number of species represented. Here we develop an approach to identify orthologs between highly diverged transcriptome assemblies, and apply this to 657 RNA-seq gene expression profiles from 309 diverse unicellular eukaryotes. We analyzed the resulting data for coevolutionary patterns, and identify several hundred protein complexes and pathways whose expression levels have evolved in a coordinated fashion across the trillions of generations separating these species, including many gene sets with little or no within-species co-expression across environmental or genetic perturbations. We also detect examples of adaptive evolution, for example of tRNA ligase levels to match genome-wide codon usage. In sum, we find that comparative studies from extremely diverse organisms can reveal new insights into the evolution of gene expression, including coordinated evolution of some of the most conserved protein complexes in eukaryotes.
Identifying co-expressed gene clusters can provide evidence for genetic or physical interactions. Thus, co-expression clustering is a routine step in large-scale analyses of gene expression data. We show that commonly used clustering methods produce results that substantially disagree and that do not match the biological expectations of co-expressed gene clusters. We present clust, a method that solves these problems by extracting clusters matching the biological expectations of co-expressed genes and outperforms widely used methods. Additionally, clust can simultaneously cluster multiple datasets, enabling users to leverage the large quantity of public expression data for novel comparative analysis. Clust is available at
Clostridium thermocellum is a potentially useful organism for the production of lignocellulosic biofuels because of its ability to directly deconstruct cellulose and convert it into ethanol. Previously engineered C. thermocellum strains have achieved higher yields and titers of ethanol. These strains often initially grow more poorly than the wild type. Adaptive laboratory evolution and medium supplementation have been used to improve growth, but the mechanism(s) by which growth improves remain(s) unclear. Here, we studied (1) wild-type C. thermocellum, (2) the slow-growing and high-ethanol-yielding mutant AG553, and (3) the faster-growing evolved mutant AG601, each grown with and without added formate. We used a combination of transcriptomics and proteomics to understand the physiological impact of the metabolic engineering, evolution, and medium supplementation. Medium supplementation with formate improved growth in both AG553 and AG601. Expression of C1 metabolism genes varied with formate addition, supporting the hypothesis that the primary benefit of added formate is the supply of C1 units for biosynthesis. Expression of stress response genes such as those involved in the sporulation cascade was dramatically over-represented in AG553, even after the addition of formate, suggesting that the source of the stress may be other issues such as redox imbalances. The sporulation response is absent in evolved strain AG601, suggesting that sporulation limits the growth of engineered strain AG553. A better understanding of the stress response and mechanisms of improved growth holds promise for informing rational improvement of C. thermocellum for lignocellulosic biofuel production.