The RAST Server: Rapid Annotations using Subsystems Technology

BMC Genomics 9(1):75 · February 2008
DOI: 10.1186/1471-2164-9-75
The number of prokaryotic genome sequences becoming available is growing steadily and is growing faster than our ability to accurately annotate them. We describe a fully automated service for annotating bacterial and archaeal genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user. In addition, the annotated genome can be browsed in an environment that supports comparative analysis with the annotated genomes maintained in the SEED environment. The service normally makes the annotated genome available within 12-24 hours of submission, but ultimately the quality of such a service will be judged in terms of accuracy, consistency, and completeness of the produced annotations. We summarize our attempts to address these issues and discuss plans for incrementally enhancing the service. By providing accurate, rapid annotation freely to the community we have created an important community resource. The service has now been utilized by over 120 external users annotating over 350 distinct genomes.
BioMed Central
BMC Genomics
Open Access
Ramy K Aziz, Daniela Bartels, Aaron A Best, Matthew DeJongh, Terrence Disz, Robert A Edwards, Kevin Formsma, Svetlana Gerdes, Elizabeth M Glass, Michael Kubal, Folker Meyer, Gary J Olsen, Robert Olson, Andrei L Osterman, Ross A Overbeek*, Leslie K McNeil, Daniel Paarmann, Tobias Paczian, Bruce Parrello, Gordon D Pusch, Claudia Reich, Rick Stevens, Olga Vassieva, Veronika Vonstein, Andreas Wilke and Olga Zagnitko
Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
Computation Institute, University of Chicago, Chicago, IL 60637, USA
Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
The Burnham Institute, San Diego, CA 92037, USA
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
Hope College, Holland, MI 49423, USA
University of Tennessee, Health Science Center, Memphis, TN 38136, USA
Department of Microbiology and Immunology, Cairo University, Cairo, Egypt
* Corresponding author
Background: The number of prokaryotic genome sequences becoming available is growing
steadily and is growing faster than our ability to accurately annotate them.
Description: We describe a fully automated service for annotating bacterial and archaeal
genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the
genes, predicts which subsystems are represented in the genome, uses this information to
reconstruct the metabolic network and makes the output easily downloadable for the user. In
addition, the annotated genome can be browsed in an environment that supports comparative
analysis with the annotated genomes maintained in the SEED environment.
The service normally makes the annotated genome available within 12–24 hours of submission, but
ultimately the quality of such a service will be judged in terms of accuracy, consistency, and
completeness of the produced annotations. We summarize our attempts to address these issues
and discuss plans for incrementally enhancing the service.
Conclusion: By providing accurate, rapid annotation freely to the community we have created an
important community resource. The service has now been utilized by over 120 external users
annotating over 350 distinct genomes.
Published: 8 February 2008
Received: 12 September 2007
Accepted: 8 February 2008
BMC Genomics 2008, 9:75 doi:10.1186/1471-2164-9-75
© 2008 Aziz et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In 1995 the first complete genome became available.
Since then, hundreds more have been sequenced, and it
has become clear that thousands will follow shortly. This
has led to the obvious conclusion that most of the anno-
tations that will be associated with these newly-sequenced
genomes will be provided through technologies that are
largely automated, and a growing number of efforts focus-
ing on different aspects of automated annotation have
emerged [1-6]. In this paper we describe the RAST Server,
a fully automated annotation service for complete, or
near-complete, archaeal and bacterial genomes. The serv-
ice seeks to rapidly produce high-quality assessments of
gene functions and an initial metabolic reconstruction.
Initially the server was planned for use by the National
Microbial Pathogen Data Resource (NMPDR) [7] commu-
nity, but very quickly the global utility of such a service
became apparent. Users of the facility upload a genome as
a set of contigs in FASTA format, and they receive access to
an annotated genome in an environment that supports
comparison with an integration of hundreds of existing
genomes. The complete annotation is normally produced
within 12–24 hours, and the existing implementation can
support a throughput of 50–100 genomes per day. How-
ever, it is important to note that speed is not the central
requirement for such a system; accuracy, completeness
and consistency will ultimately be the criteria used to eval-
uate the success or failure of a service such as the one
described. To date, the server has been used by over 120
external users to annotate over 350 genomes.
RAST bases its attempts to achieve accuracy, consistency,
and completeness on the use of a growing library of sub-
systems that are manually curated [8], and on protein fam-
ilies largely derived from the subsystems (FIGfams). In the
sections below we describe the steps the RAST server
implements to automatically produce two classes of
asserted gene functions: subsystem-based assertions are
based on recognition of functional variants of subsystems,
while nonsubsystem-based assertions are filled in using more
common approaches based on integration of evidence
from a number of tools. The fact that RAST distinguishes
these two classes of annotation and uses the relatively reli-
able subsystem-based assertions as the basis for a detailed
metabolic reconstruction makes the RAST annotations an
exceptionally good starting point for a more comprehen-
sive annotation effort.
Besides producing initial assignments of gene function
and a metabolic reconstruction, the RAST server provides
an environment for browsing the annotated genome and
comparing it to the hundreds of genomes maintained
within the SEED [9] integration. The genome viewer
included in RAST supports detailed comparison against
existing genomes, determination of genes that the
genome has in common with specific sets of genomes (or,
genes that distinguish the genome from those in a set of
existing genomes), the ability to display genomic context
around specific genes, and the ability to download rele-
vant information and annotations as desired.
Construction and content
Subsystems: an Overview
It is commonly held that one central role of bioinformat-
ics is to project a relatively small set of assertions of gene
and protein function from the literature (i.e., from wet lab
characterizations) to genes from other genomes. This cap-
tures a kernel of truth (that, ultimately, new assertions of
function are based on wet lab characterizations), but, per-
haps, elevates the role of bioinformatics beyond what is
reasonable to expect. In contrast, we view projection as a
2-step process:
1. In an initial stage, an expert in a biological topic inte-
grates what is known from the literature producing a set of
expert assertions, which include the assertions from the lit-
erature, as well as a far broader set based on judgement
and extrapolation.
2. Bioinformatics tools are developed to project structured
collections of expert assertions (rather than just the wet
lab results captured in the literature) to new genomes.
The process of integrating what is known from the litera-
ture into a set of expert assertions involves highly complex
decisions and is well beyond most of the common bioin-
formatics tools. On the other hand, there is every reason
to believe that fully automated tools can be developed to
project these expert assertions. The more comprehensive
and well structured the collection of expert assertions, the
more rapidly accurate projection technology will be devel-
oped. Here it is worth noting that we speak of "well-struc-
tured" sets of expert assertions, since the developed tools
will almost certainly need to encapsulate numerous rules
covering special cases, and a careful delineation of these
rules can best be achieved by domain experts.
One technology for creating and maintaining expert asser-
tions was developed within the context of The Project to
Annotate 1000 Genomes [10].
This technology involves an expert curator defining a sub-
system as a set of abstract functional roles. Figure 1A shows
a very simple case in which a subsystem named "Tricarbal-
lylate Utilization" is composed of four functional roles.
The subsystem is populated by connecting these functional
roles to specific genes in specific genomes, producing a
subsystem spreadsheet, where each row represents one
genome and each column corresponds to one functional
role as shown in Figure 1B. The proteins encoded by the
genes in one column are used to construct the subsystem-
based FIGfams (discussed below). The cooperative effort
to develop subsystems has produced a publicly available
set of such populated subsystems that now includes over
600 subsystems. These subsystems include assertions of
function for well over 500,000 protein-encoding genes in
over 500 bacterial and archaeal genomes (relating to over
6200 functional roles). This manually curated collection
represents sets of co-curated protein families. While it is
true that the quality of the assertions varies substantially,
it is also true that these structured sets of assertions repre-
sent a major resource in constructing automated annota-
tion systems.
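The spreadsheet structure described above can be pictured as a small table keyed by genome and functional role. The following Python sketch is illustrative only: the role names, genome names, and gene identifiers are hypothetical placeholders and are not taken from the actual Tricarballylate Utilization subsystem or from the SEED data model.

```python
# Hypothetical functional roles (the columns of a toy subsystem).
ROLES = ["role_A", "role_B", "role_C", "role_D"]

# Hypothetical populated spreadsheet: one row per genome,
# each row mapping a functional role to the genes that implement it.
SPREADSHEET = {
    "genome_1": {"role_A": ["g1.peg.10"], "role_B": ["g1.peg.11"],
                 "role_C": ["g1.peg.12"], "role_D": ["g1.peg.13"]},
    "genome_2": {"role_A": ["g2.peg.7"], "role_B": ["g2.peg.8"],
                 "role_D": ["g2.peg.9"]},
}


def column(spreadsheet, role):
    """Collect every gene assigned to one functional role (one column)."""
    genes = []
    for genome in sorted(spreadsheet):
        genes.extend(spreadsheet[genome].get(role, []))
    return genes


def missing_roles(spreadsheet, genome, roles=ROLES):
    """List the roles that have no gene in the given genome's row."""
    return [r for r in roles if not spreadsheet[genome].get(r)]
```

In this picture, the genes returned by `column` for one role are the proteins from which a subsystem-based family would be built, and `missing_roles` hints at which variant of the subsystem a genome might implement.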
FIGfams: Yet Another Set of Protein Families
A number of groups have spent substantial effort building
protein families that now represent resources that are
widely used and valued by the community [11-15]; see
[16] for a more extended discussion. RAST utilizes a new
collection of protein families. This collection is referred to
as the set of FIGfams, and the publication of a detailed
account of them is in preparation. Each FIGfam may be
thought of as a 3-tuple composed of a set of proteins, a
family function, and a decision procedure. The set of pro-
teins are believed to be globally similar (and, presumably,
homologous) and the members all share a common func-
tion. The decision procedure takes as input a protein
sequence and returns a decision about whether or not the
protein could be added to the family (i.e., whether or not
the protein is globally similar to the members and shares
the common function).
Figure 1. Example Tricarballylate Utilization Subsystem. A) The subsystem is comprised of 4 functional roles. B) The Subsystem
Spreadsheet is populated with genes from 5 organisms (simplified from the original subsystem) where each row represents one
organism and each column one functional role. Genes performing the specific functional role in the respective organism populate
the respective cell. Gray shading of cells indicates proximity of the respective genes on the chromosomes. There are two
distinct variants of the subsystem: variant 1, with all 4 functional roles and variant 2 where the 3rd functional role is missing.
Hence, the basic principles underlying FIGfams are quite
similar to those corresponding to the lowest-level PIR
families [17] or the TIGRfam equivalogs [15].
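The 3-tuple view of a FIGfam can be written down directly as a data structure. The sketch below is a minimal illustration, not the FIGfam implementation: the class name, field names, and the toy decision procedure (position-wise identity against a single reference sequence of equal length) are assumptions introduced here, and a real decision procedure would rest on proper sequence similarity.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class FIGfam:
    members: FrozenSet[str]        # protein identifiers (hypothetical)
    function: str                  # the shared family function
    decide: Callable[[str], bool]  # decision procedure on a protein sequence


def toy_decision(reference: str, min_identity: float = 0.9) -> Callable[[str], bool]:
    """Build a stand-in decision procedure (an assumption for illustration).

    Accepts a sequence only if it has the same length as the reference and
    at least `min_identity` of its positions match.
    """
    def decide(seq: str) -> bool:
        if not reference or len(seq) != len(reference):
            return False
        matches = sum(a == b for a, b in zip(seq, reference))
        return matches / len(reference) >= min_identity
    return decide
```

A family would then be created as, for example, `FIGfam(frozenset({"p1"}), "some function", toy_decision("MKTAYIAKQR"))`, and `fam.decide(sequence)` answers whether a new protein could be added.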
The construction of FIGfams is done conservatively: care
is taken to make sure that two proteins included in the
same set actually do share a common function, but if sub-
stantial uncertainty exists about whether or not two pro-
teins actually share the same function they are kept in
distinct families. Two proteins will be placed in the same family only:
1. If both occur in the same column of a manually curated
subsystem spreadsheet (i.e., if they implement the same
functional role) and the region of similarity shared by the
two sequences covers over 70% of each sequence.
2. If they come from closely related genomes (e.g.,
genomes from two strains of the same species), the simi-
larity is high (usually greater than 90% identity), and the
context on the chromosome (i.e., the adjacent genes) can
easily be seen to correspond, then they can be placed in
the same family (even if the function they implement is
yet to be determined).
These are the two cases in which we feel confident in
asserting a common function between two proteins; the
first reflects an expert assertion, and the second an
instance in which divergence is minimal. Construction of
FIGfams using these two grouping principles has led to a
collection of about 17,000 FIGfams that include proteins
related to subsystems (those are the FIGfams that we call
subsystem-based) and over 80,000 that contain only pro-
teins grouped using the second principle (i.e. the non-
subsystem-based FIGfams). Many of the non-subsystem-
based FIGfams contain just 2, 3 or 4 proteins.
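The two grouping principles can be expressed as a single predicate over a pair of proteins. This is a sketch under stated assumptions: the record type and its field names are hypothetical, the 70% coverage threshold comes from the first principle, and the 90% identity threshold reflects the "usually greater than 90%" wording of the second, so it should be read as a typical value rather than a fixed cutoff.

```python
from dataclasses import dataclass


@dataclass
class ProteinPair:
    # All fields are hypothetical summaries of a pairwise comparison.
    same_subsystem_column: bool    # both curated into the same functional role
    coverage_a: float              # fraction of protein A in the similar region
    coverage_b: float              # fraction of protein B in the similar region
    closely_related_genomes: bool  # e.g. two strains of the same species
    identity: float                # fractional sequence identity
    context_conserved: bool        # adjacent genes correspond


def same_family(pair, min_coverage=0.70, min_identity=0.90):
    """Return True if either grouping principle places the pair together."""
    by_subsystem = (pair.same_subsystem_column
                    and pair.coverage_a > min_coverage
                    and pair.coverage_b > min_coverage)
    by_close_strain = (pair.closely_related_genomes
                       and pair.identity > min_identity
                       and pair.context_conserved)
    return by_subsystem or by_close_strain
```

Any pair that satisfies neither test stays in distinct families, which mirrors the conservative construction described above.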
Over time we expect to coalesce the non-subsystem-based
FIGfams. This will be done by creating new, manually
curated subsystems; these will form kernels of new fami-
lies that will group the isolated families that now exist.
It is worth noting that the existing collection of FIGfams
covers most of the central cellular machinery with families
derived from subsystems, and the numerous small non-
subsystem-based FIGfams efficiently support recognition
of genes in close strains. While it is true that we cover a
limited percentage of genes in newly sequenced divergent
genomes, we recognize well over 90% of the genes in
newly sequenced strains that are close to existing anno-
tated genomes. It seems likely that a large percentage of
newly sequenced genomes will be close to existing
genomes (e.g., note projects to sequence tens and soon
hundreds of closely related pathogenic strains), and the
FIGfams already constitute an effective recognition frame-
work in such cases.
The Basic Steps in Annotating a Genome Using RAST
The basic steps used to annotate a genome using RAST are
described in the subsections below. Input to the process is
a prokaryotic genome in the form of a set of contigs in
FASTA format. As described below, the actual RAST server
will allow a user to specify a set of gene calls, but in the
usual case RAST will make its own calls. We now describe
the basic steps in a RAST annotation in detail.
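Since the input is a set of contigs in FASTA format, a reader may find it useful to see how such input can be loaded. The function below is a minimal sketch and is not part of RAST: the function name and the choice to key contigs by the first word of the header line are assumptions made here.

```python
def parse_fasta(text):
    """Parse FASTA text into {contig_id: sequence}.

    The contig id is taken to be the first whitespace-delimited word of
    each header line (an assumption); sequence lines are concatenated
    and upper-cased.
    """
    contigs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Fall back to a generated id if the header is empty.
            name = fields[0] if fields else "contig_%d" % (len(contigs) + 1)
            contigs[name] = []
        elif name is not None:
            contigs[name].append(line)
    return {n: "".join(parts).upper() for n, parts in contigs.items()}
```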
Call the tRNA and rRNA genes
We use existing tools built by other research teams to first
identify both the tRNA and rRNA encoding genes. For the
tRNA genes we use tRNAscan-SE [18] and to identify the
rRNA encoding genes we use a tool "search_for_rnas"
developed by Niels Larsen (available from the author).
We begin the process by calling these genes, which we
believe can be reliably determined. Then, the server will
not consider retaining any protein-encoding gene that sig-
nificantly overlaps any of these regions. Unfortunately,
the public archives do contain putative protein-encoding
genes that are embedded in rRNAs. These gene calls are
almost certainly artefacts of the period in which groups
were learning how to develop proper annotations, and
RAST attempts to avoid propagating these errors.
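The rule that no protein-encoding gene may significantly overlap a called RNA gene amounts to an interval filter. The sketch below assumes that features are `(contig, start, end)` tuples with 1-based inclusive coordinates and `start <= end`, and it uses a hypothetical cutoff of 10% of the candidate's length for "significant"; the text does not state the threshold RAST actually applies.

```python
def overlap_fraction(gene, rna):
    """Fraction of `gene` covered by `rna`; both are (contig, start, end)."""
    if gene[0] != rna[0]:
        return 0.0
    shared = min(gene[2], rna[2]) - max(gene[1], rna[1]) + 1
    return max(shared, 0) / (gene[2] - gene[1] + 1)


def drop_rna_overlaps(candidates, rna_genes, max_fraction=0.1):
    """Keep candidate genes that do not significantly overlap any RNA gene.

    `max_fraction` is an assumed threshold chosen for illustration.
    """
    return [g for g in candidates
            if all(overlap_fraction(g, r) <= max_fraction for r in rna_genes)]
```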
Make an Initial Effort to Call Protein-Encoding Genes
Once the tRNA and rRNA gene-encoding regions are
removed from consideration, we make an initial call using
GLIMMER2 [19]. At this point we are seeking a reasonable
estimate of probable genes, and GLIMMER2 is an excel-
lent tool for that purpose. At this stage, RAST is not con-
cerned about calling spurious genes or getting starts called
accurately. What is needed is that most of the actual pro-
tein-encoding genes are represented in the initial estimate
of putative genes.
Establishing Phylogenetic Context
Once an initial set of protein-encoding genes has been
established, we take representative sequences from a small
set of FIGfams that have the property that they are univer-
sal or nearly universal in prokaryotes. This set includes, for
example, the tRNA synthetases.
Using this small set of representatives we search the pro-
tein-encoding genes from the new genome for occur-
rences of these FIGfams. It should be noted that this is a
very rapid step, since only the new genome is being
searched, and it is being searched using a small set of rep-
resentative protein sequences. The outcome of this initial
scan is a small set (normally, 8–15 genes) that can be used
to estimate the closest phylogenetic neighbours of the
newly-sequenced genome. This can be done by taking
each located gene and blasting it against the genes from
the corresponding FIGfam. Normally, we attempt to
locate the ten closest neighbours, but clearly the approach
is insensitive to the exact number sought. For each
detected gene, we adjust its starting position and move it
from the set of putative genes to a set of determined genes
and the function (i.e., product name) assigned to the gene
is taken from the FIGfam.
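One simple way to turn the similarity hits for the located genes into a list of neighbouring genomes is sketched below. The scoring scheme (summing, per genome, the best score obtained by each located gene) is an assumption made for illustration; the text does not specify how RAST ranks neighbours. The tuple layout of `hits` is likewise hypothetical.

```python
from collections import defaultdict


def closest_neighbours(hits, k=10):
    """Rank genomes by summed best hit score per located gene.

    `hits` is an iterable of (query_gene, genome_id, score) tuples, for
    example from comparing each located gene with the members of its
    FIGfam. Ties are broken by genome id for reproducibility.
    """
    best = defaultdict(dict)
    for gene, genome, score in hits:
        if score > best[genome].get(gene, float("-inf")):
            best[genome][gene] = score
    totals = {genome: sum(per_gene.values()) for genome, per_gene in best.items()}
    return sorted(totals, key=lambda g: (-totals[g], g))[:k]
```

The default `k=10` follows the stated practice of seeking the ten closest neighbours.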
A Targeted Search Based on FIGfams that Occur in Closely
Related Genomes
Once the "neighbouring genomes" have been deter-
mined, we can form the set of FIGfams that are present in
these genomes. This constitutes a set of FIGfams that are
likely to be found in the new genome. For each of these
FIGfams, we search the new genome. Note that we expect
these searches to have a relatively high rate of success.
Whenever we do find a gene, we adjust its starting posi-
tion and move the gene from the set of putative genes to the
set of determined genes. The computational costs required
to locate these genes are low (since we are searching a very
small set of putative genes).
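The bookkeeping in this step can be sketched with two small helpers: one forms the set of FIGfams present in the neighbouring genomes, the other moves a located gene from the putative set to the determined set. The dictionary layouts, key names, and function names are hypothetical and chosen only to illustrate the flow.

```python
def candidate_figfams(neighbours, figfams_by_genome):
    """Union of the FIGfam ids present in the neighbouring genomes."""
    families = set()
    for genome in neighbours:
        families |= figfams_by_genome.get(genome, set())
    return families


def promote(putative, determined, gene_id, new_start, function):
    """Move one gene from `putative` to `determined`.

    The start position is adjusted and the function is taken from the
    matching family, as described in the text.
    """
    gene = dict(putative.pop(gene_id))
    gene["start"] = new_start
    gene["function"] = function
    determined[gene_id] = gene
```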
Recall Protein-Encoding Genes
At this point, we have accumulated a set of determined
genes within the new genome and can now use this excel-
lent training set to recall the protein-encoding genes. In
the case of a genome that is a closely related strain of one
or more existing genomes, this training set may well
include over 90% of the actual protein-encoding genes.
Processing the Remaining Genes Against the Entire FIGfam Collection
The putative genes that remain can be used to search
against the entire collection of FIGfams. This is done by
blasting against a representative set of sequences from the
FIGfams to determine potential families that need to be
checked, and then checking against each family. While
computationally more expensive than the focused
searches in the previous steps, it is still far, far cheaper
than blasting against a large non-redundant protein data-
base. Currently, the collection of representative protein
sequences from FIGfams used to compute potentially relevant
FIGfams includes somewhat over 100,000 protein sequences.
This step amounts to a comprehensive search of the FIG-
fams for each of the remaining putative genes. Once it has
been completed, all of the genes that could be processed
using FIGfams have been processed.
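The two-stage search described here (a cheap screen against representative sequences, followed by a check against each shortlisted family) can be sketched as follows. RAST uses BLAST for the screen; the shared 3-mer count used below is a stand-in chosen only to keep the example self-contained, and all names and the `top_n` shortlist size are assumptions.

```python
def kmer_score(a, b, k=3):
    """Toy similarity: number of distinct k-mers shared by two sequences."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def assign_figfam(seq, representatives, decision_procedures, top_n=5):
    """Screen against representatives, then check each shortlisted family.

    representatives: {figfam_id: representative sequence}
    decision_procedures: {figfam_id: callable(sequence) -> bool}
    Returns the first family whose decision procedure accepts `seq`,
    or None if no shortlisted family does.
    """
    scored = sorted(((kmer_score(seq, rep), fam_id)
                     for fam_id, rep in representatives.items()), reverse=True)
    for score, fam_id in scored[:top_n]:
        if score > 0 and decision_procedures[fam_id](seq):
            return fam_id
    return None
```

The design point is that the expensive per-family check runs only for the few families whose representative resembles the query, which is why the step stays far cheaper than a search against a large non-redundant database.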
Clean Up Remaining Gene Calls (Remove Overlaps and
Adjust Starting Positions)
The putative proteins that remain are processed to attempt
to resolve issues relating to overlapping gene calls, starts
that need to be adjusted, and so forth. In the case of the
RAST server, we do blast the remaining putative genes
against a large non-redundant protein database in an
attempt to determine whether there is similarity-based
evidence that could be used in resolving conflicts.
Process the Remaining, Unannotated Protein-encoding Genes
At this point, final assignments of function are made to
the remaining putative genes. If similarities were com-
puted in the preceding step, these similarities can be
accessed and functions can be asserted. Optionally, one
can employ any of the commonly employed pipeline
technologies to run a suite of tools and produce a more
accurate estimate. The genes processed using this
approach represent most of the overhead in a RAST anno-
tation. By first processing a majority of the genes using
FIGfam-based technology and focused searches, this cost
is minimized by RAST without (we believe) reducing accuracy.
Construct an Initial Metabolic Reconstruction
Once assignments of function have been made, an initial
metabolic reconstruction is formed. For our purposes, this
amounts to connecting genes in the new genome to func-
tional roles in subsystems, determining when a set of con-
nections to a specific subsystem are sufficient to support
an active variant of the subsystem, and tabulating the com-
plete set of active variants. Since the subsystems them-
selves are arranged in crude categories reflecting basic
divisions of function, we can produce a detailed estimate
of the genome contents that got successfully connected to
subsystems (see Figure 2). In the case of a genome like
Buchnera aphidicola, in excess of 82% of the genes fall in
this category; for Escherichia coli O157:H7 the percentage
drops to 76%, while in a relatively diverged genome like
Methanocaldococcus jannaschii DSM 2661 the percentage
that can be connected (at this point in time) is only 22%.
Figure 2 offers a brief overview of the type of display a user
can employ to quickly explore the contents of the new genome.
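Deciding whether the connections to a subsystem support an active variant, and estimating what fraction of the genome was connected to subsystems, can both be sketched in a few lines. The representation of a variant as a set of required roles, the preference for the variant with the most required roles, and all names below are assumptions for illustration; curated variant definitions in the SEED may be richer.

```python
def active_variant(roles_found, variants):
    """Return the code of the most demanding fully satisfied variant.

    variants: {variant_code: set of required functional roles}
    Returns None when no variant has all of its required roles present.
    """
    satisfied = [(len(required), code)
                 for code, required in variants.items()
                 if required <= roles_found]
    return max(satisfied)[1] if satisfied else None


def fraction_in_subsystems(gene_roles, subsystem_roles):
    """Fraction of genes whose assigned role belongs to some subsystem.

    gene_roles: {gene_id: functional role, or None if unassigned}
    """
    if not gene_roles:
        return 0.0
    connected = sum(1 for role in gene_roles.values() if role in subsystem_roles)
    return connected / len(gene_roles)
```

With the two variants of Figure 1 encoded as `{"1": {r1, r2, r3, r4}, "2": {r1, r2, r4}}`, a genome lacking the third role would be assigned variant 2.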
It should be emphasized that the subsystems cover all
modules of cellular machinery – not just the metabolic
pathways. Hence, what we are calling a metabolic recon-
struction (a collection of the active variants of subsystems
that have been identified) is more properly thought of as
a grouping of genes into modules, rather than the recon-
struction of the metabolic network. However, besides sim-
ply compiling the set of active variants of subsystems, the
RAST server uses a set of scenarios encoded in metabolic
subsystems to assemble a metabolic reaction network for
the organism [20]. These scenarios represent components
of the metabolic network in which specific compounds
are labelled as inputs and outputs (i.e., they may be
thought of as directed modules of the metabolic net-
work). The metabolic network is assembled using bio-
chemical reaction information associated with functional
roles in subsystems to find paths through scenarios from
inputs to outputs. Scenarios that are connected by linked
inputs and outputs can be composed to form larger blocks
of the metabolic network, spanning processes that convert
transported nutrients into biomass components. In the
case of newly sequenced genomes that are close to those
our team manually curates, it is possible to directly esti-
mate what percent of the reaction network typically
included in a genome-scale metabolic reconstruction [21]
can be generated automatically. Today the RAST server
produces 70–95% of the reaction network, depending on
the specific species and genome.
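The idea of finding a path through a scenario from its inputs to its outputs can be approximated by forward reachability over the reactions available in the genome. The sketch below is a deliberate simplification and not the method of [20]: it ignores stoichiometry and reversibility, and the compound names in the usage note are abstract placeholders.

```python
def producible(inputs, reactions):
    """Compounds reachable from `inputs` by repeatedly firing reactions.

    reactions: iterable of (substrates, products) pairs of frozensets.
    A reaction fires once all of its substrates are available.
    """
    reactions = list(reactions)
    available = set(inputs)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def scenario_is_supported(inputs, outputs, reactions):
    """True if every output compound can be produced from the inputs."""
    return set(outputs) <= producible(inputs, reactions)
```

For example, with reactions A → B, B → C, and C + X → D, the scenario from input A to output C is supported, while the scenario from A to D is supported only if X is also supplied as an input.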
In the previous sections we have described the basic tech-
nology that underlies the RAST server. We believe that the
issues discussed above determine accuracy and speed of
the system. The usability of the system is largely deter-
mined by the user interface.
We have spent the effort required to build a simple inter-
face that offers the ability to submit genomes, monitor
progress of the annotation, to view the results in a frame-
work allowing comparisons against hundreds of existing
genomes, and the ability to download the results in any of
several formats.
Upload Genome and monitor annotation process
The service is freely available for the annotation of
prokaryotic genomes. The genomes may be "complete" or
they may be in hundreds of contigs (which does impact
the quality of the derived annotations). A new user must
register for the service, which involves giving us contact
information and acquiring a password. By registering
users, we can create a framework in which users have
access to only those genomes that they have submitted. It
allows us also to contact the user once the automatic
annotation has finished or in case user intervention is required.
After login the user can monitor his/her submitted job/
jobs on the Job Overview page (Figure 3). This page lists
for each submitted job its number, submitter, the taxon-
omy ID and Genome name followed by a six-button bar,
where each button represents a step in the RAST annota-
tion service. Depending on the state of each step the but-
ton colour will change from grey (not started) to blue
(queued for computation) to yellow (in progress) to green
(successfully completed) or red (error) as shown in Figure
3. More detailed information about each step can be
viewed after clicking the button bar itself. Figure 4 illustrates
such a Job Detail page with the submission time
stamp and the six steps. Here step one had been completed,
step two was in progress and the other steps had
not yet started.
Figure 2. Genes connected to subsystems and their distribution in different categories. The categories are expandable down
to the specific gene (see Secondary Metabolism).
Browse Genome in SEED-Viewer environment
After the annotation is complete the user can choose to
download the annotated genome in a variety of export
formats (e.g. GenBank, FASTA, GFF3, Excel) or browse the
genome in the comparative environment of the SEED-
Viewer without having the data actually installed in the
SEED. These options remain for 120 days or until the data
are deleted by the user. If desired, the user can request to
have the annotated genome added to the SEED.
The SEED-Viewer environment presents the user with a
variety of options for the immediate analysis of the anno-
tated genome. The Organism Overview page contains
basic information on the Genome such as Taxonomy,
Size, the Number of Contigs, the Number of Coding
Sequences and RNAs and counts of non-hypothetical and
hypothetical gene annotations. In addition it contains the
Number of Subsystems that were automatically deter-
mined to be present in the genome. A bar graph and a pie
chart (shown in Figure 2) illustrate the distribution of
genes connected to the various subsystem groups. Each of
those groups can be expanded (by clicking the "+" button)
down to the specific protein encoding genes (pegs) found
in a given subsystem. This page is also the entry point to a
whole Genome Browser, the Compare Metabolic Reconstruction
tool, the View Features and the View Scenarios pages.
The whole Genome Browser, as shown in Figure 5, allows
the user to zoom from a graphic whole genome presentation
into any desired area of the genome down to a gene
(peg or RNA encoding gene). By clicking on any of the
genome features the user can choose to see the Annotation
Overview page (Figure 6), which includes a graphical
representation of the Genomic Context of the peg of interest
and compares that to regions in other genomes that
have homologous genes.
Figure 3. Job Overview page. The colours in the progress bar have the following meaning: gray – not started, blue – queued for
computation, yellow – in progress, red – requires user input, brown – failed with an error, green – successfully completed.
The Compare Metabolic Reconstruction tool allows the
user to compute a metabolic comparison of the newly-
annotated genome to any genome present in the SEED.
The output of such a computation is a three-column table
(Figure 7) that shows genes that are connected to subsys-
tems and are unique in the query genome (left column) or
unique in the SEED genome (right column) or are found
in both genomes (middle column). To see individual
genes the user needs to un-collapse the three-tier hierarchical
representation of subsystems (by clicking the "+" button).
Figure 4. Job Detail page. The RAST annotation progress can be monitored by each user.
Figure 5. Genome Browser. The annotated genome can be browsed starting from a whole-genome view and zooming in to a specific feature.
Figure 6. Annotation Overview. For each annotated feature RAST presents an overview page, which includes comparative genomics
views and the connections to a subsystem if one was asserted.
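The three-column output of the Compare Metabolic Reconstruction tool corresponds to a pair of set differences and an intersection. The sketch below compares two genomes by the functional roles they connect to subsystems; the function name and the use of role names as the unit of comparison are assumptions made for illustration.

```python
def compare_reconstructions(query_roles, reference_roles):
    """Split subsystem roles into the three columns of the comparison.

    Returns roles found only in the query genome, in both genomes, and
    only in the reference (SEED) genome, each as a sorted list.
    """
    q, r = set(query_roles), set(reference_roles)
    return {
        "query_only": sorted(q - r),
        "both": sorted(q & r),
        "reference_only": sorted(r - q),
    }
```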
Figure 7. Compare Metabolic Reconstruction tool. In the example the RAST metabolic reconstruction of the submitted genome of
S. pyogenes Manfredo was compared to the metabolic reconstruction for S. pyogenes MGAS315, which is part of the comparative
environment of the SEED. All three columns of subsystem categories are expandable. In cases where RAST was conservative
in the assertion of a subsystem a manual attempt to retrieve the missing function/s can be made by clicking the find button.
All annotated features can be viewed and downloaded
from the View Features page (Figure 8). For each peg the
location on the contig, the functional role assignment, its
EC number (if present) and GO category, the connection
to a subsystem and a KEGG reaction (if appropriate) are given.
For each annotated genome a set of metabolic scenarios is
computed and can be viewed on the View Scenarios page
(Figure 9). Again a subsystem hierarchy can be un-col-
lapsed and for each subsystem that has been asserted, a
scenario is given with input and output compounds, their
stoichiometry and a relevant coloured KEGG map (if one is available).
A beta version of the RAST server was made available in
February 2007. Since then we have been addressing per-
formance issues, systematic errors, and all of the details
required to effectively support such a service. Over 120
external users have now registered, and we have processed
over 350 submissions from these users. The total number
of genomes processed exceeds 1200 (including genomes
that we have run through the system for evaluation pur-
poses and to recall annotations in some of the existing
genomes) at the time of writing this manuscript.
Performance analysis
To provide an assessment of the annotation quality of the
new service, we first compared the annotations in
our manually curated SEED annotation framework with
those generated automatically by the RAST server. There
are obvious limitations in using existing SEED genomes to
evaluate the service, and this led us to add a comparison
of RAST annotations to KAAS (KEGG Automatic Annotation
Server) [22] annotations; KAAS is the only other public
annotation service we are aware of that allows
online sequence submission. The output of this comparison
is available online (see the section on availability).
A rough estimate of annotation quality can be gained
by comparing the number of genes linked to subsystems
[8] and the number of genes annotated as hypothetical
proteins (see Figure 10).
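The rough estimate described above amounts to two counts per genome. The following is a minimal sketch of that tally, assuming each annotated feature is available as a (function, subsystems) pair; the function name `summarize_annotations`, the input layout and the toy records are illustrative assumptions, not part of the RAST or SEED software.

```python
def summarize_annotations(features):
    """Count hypothetical proteins and subsystem-linked genes.

    `features` is an iterable of (function, subsystems) pairs, where
    `function` is the assigned functional role (a string) and `subsystems`
    is a list of subsystem names (empty if the gene is in no subsystem).
    This input layout is an assumption made for the sketch.
    """
    summary = {"total": 0, "hypothetical": 0, "in_subsystem": 0}
    for function, subsystems in features:
        summary["total"] += 1
        # Treat any role containing "hypothetical protein" as hypothetical.
        if "hypothetical protein" in function.lower():
            summary["hypothetical"] += 1
        if subsystems:
            summary["in_subsystem"] += 1
    return summary


# Toy records, invented for illustration only.
toy_features = [
    ("DNA polymerase III alpha subunit", ["DNA-replication"]),
    ("hypothetical protein", []),
    ("Conserved hypothetical protein", []),
    ("Uroporphyrinogen-III synthase",
     ["Porphyrin, Heme and Siroheme Biosynthesis"]),
]
summary = summarize_annotations(toy_features)
print(summary)
```

Running the same tally on a manually curated and an automatically annotated version of one genome gives the kind of side-by-side comparison that Figure 10 reports.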
This informal analysis indicates that the RAST server can
successfully project the annotations generated in the
SEED environment, as the number of hypotheticals and
the number of genes linked to subsystems are roughly
equivalent for RAST and SEED. To better understand the
differences in annotation quality we have analyzed the
individual genes in the 5 genomes listed in Figure 10 fur-
ther. To enable comparison of annotations, we generated
a sequence-based matching of genes between the manu-
ally curated version of each of the five genomes (main-
tained within the SEED) and the corresponding RAST
annotated version.
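The text does not specify how the sequence-based matching was computed. As one possible reading, the sketch below pairs genes whose translated protein sequences are exactly identical; this criterion, the function name `match_by_sequence` and the toy sequences are all assumptions. A stricter or looser rule (for example, tolerating different start codons) would change which genes count as matched.

```python
def match_by_sequence(seed_genes, rast_genes):
    """Pair genes from two annotations that have identical protein sequences.

    Both arguments map a gene identifier to its protein sequence.
    Returns (matches, seed_only, rast_only).
    """
    def normalize(seq):
        # Ignore case and a trailing stop-codon marker.
        return seq.upper().rstrip("*")

    rast_by_seq = {}
    for gene_id, seq in rast_genes.items():
        rast_by_seq.setdefault(normalize(seq), []).append(gene_id)

    matches, seed_only, used = [], [], set()
    for gene_id, seq in seed_genes.items():
        candidates = [g for g in rast_by_seq.get(normalize(seq), [])
                      if g not in used]
        if candidates:
            matches.append((gene_id, candidates[0]))
            used.add(candidates[0])
        else:
            seed_only.append(gene_id)
    rast_only = [g for g in rast_genes if g not in used]
    return matches, seed_only, rast_only


# Toy sequences, invented for illustration only.
seed = {"seed.1": "MKTAYIAK", "seed.2": "MSLNQ*", "seed.3": "MADE"}
rast = {"rast.1": "mktayiak", "rast.2": "MSLNQ", "rast.9": "MTTT"}
matches, seed_only, rast_only = match_by_sequence(seed, rast)
print(matches, seed_only, rast_only)
```

The unmatched lists correspond to the "missing in RAST" and "missing in SEED" categories analysed later in Table 3.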
Detailed discussion of the range of "differences" in annotation
Table 1 shows that between 81.7% (M. jannaschii) and
94.9% (Buchnera) of genes matched between RAST and
SEED have identical annotations.
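The columns of Table 1 are tied together by simple arithmetic: the "% identical" value is taken over the matched genes only, so it can be recovered from the gene count, the percentage matched and the number of differing annotations. The short check below uses the figures from Table 1; the function name is ours, and small deviations are expected because the published percentages are rounded.

```python
# (genes, % matched, different, reported % identical), copied from Table 1.
TABLE_1 = {
    "Alcanivorax borkumensis SK2": (2814, 92.8, 164, 93.7),
    "Aquifex aeolicus VF5": (1613, 91.7, 185, 87.5),
    "Buchnera aphidicola str. Bp": (550, 90.0, 25, 94.9),
    "Methanocaldococcus jannaschii DSM 2661": (1844, 90.6, 306, 81.7),
    "Wolinella succinogenes DSM 1740": (2094, 93.8, 314, 84.0),
}


def percent_identical(genes, pct_matched, different):
    """Share of matched genes whose annotations are identical, in percent."""
    matched = genes * pct_matched / 100.0
    return 100.0 * (1.0 - different / matched)


recomputed = {
    name: percent_identical(genes, pct_matched, different)
    for name, (genes, pct_matched, different, _) in TABLE_1.items()
}
for name, value in recomputed.items():
    print(f"{name}: {value:.1f}% (reported {TABLE_1[name][3]}%)")
```

For Buchnera, for instance, 90.0% of 550 genes gives 495 matched genes, and 25 differing annotations leave 470 of 495 identical.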
For the three genomes in Table 2 we have performed a
careful manual analysis of the discrepancies in annotation,
attempting to reconcile annotations that were not
automatically recognized as identical.
Figure 8
View Features page. All annotated features can be viewed and downloaded in table format. For each peg the location on the
contig, the functional role assignment, its EC number (if present) and GO category, the connection to a subsystem and a KEGG
reaction (if appropriate) are given.
Figure 9
View Scenarios page. A genome-specific reaction network can be viewed on a scenario by scenario basis. The scenarios are
organized on the left by subsystems, which are themselves organized by categories of metabolic function. If a path through a
scenario was found in a given subsystem, the subsystem name is highlighted in blue. In this case, one path was found through
the Uroporphyrinogen III generation scenario in the Porphyrin, Heme and Siroheme Biosynthesis subsystem. The table to the
right shows the input and output compounds for the scenario, including their stoichiometry, and the reactions that make up
the path through the scenario.
Figure 10
Comparison of a set of genomes manually curated in the SEED and automatically annotated in RAST. The
number of genes annotated as hypothetical and the number of genes linked to subsystems (our mechanism of manual curation)
is shown to provide an initial assessment of the performance of RAST.
As shown in Table 2 a significant percentage of the differ-
ing annotations can be manually reconciled. For 4.2% of
the 2814 features in A. borkumensis SK2 the RAST server
did not predict an identical function. Slightly worse
results were found for Methanocaldococcus jannaschii DSM
2661 (9.1%) and Wolinella succinogenes DSM1740
(8.01%). As an example of annotations that were judged
as "essentially identical" in our manual comparison, but
viewed as distinct by our automated comparison please
consider the following pairs:
• Type 4 prepilin-like proteins leader peptide processing
• phage DNA polymerase domain protein/DNA polymerase, bacteriophage-type
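To illustrate why an automated comparison treats such pairs as distinct, the sketch below contrasts exact string equality with a token-overlap (Jaccard) score on the second pair above. The Jaccard score is our own illustrative stand-in; the reconciliation reported here was done manually, and nothing in the text indicates that RAST uses this measure.

```python
import re


def tokens(annotation):
    """Lower-cased alphanumeric word tokens of an annotation string."""
    return set(re.findall(r"[a-z0-9]+", annotation.lower()))


def jaccard(a, b):
    """Token-set overlap between two annotation strings (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


seed_call = "phage DNA polymerase domain protein"
rast_call = "DNA polymerase, bacteriophage-type"

strictly_equal = seed_call == rast_call
overlap = jaccard(seed_call, rast_call)
print(strictly_equal, round(overlap, 3))
```

The two strings share only the tokens "dna" and "polymerase", so neither exact matching nor a naive token score recognizes them as the same function, which is why a curator's judgement was needed.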
Detailed discussion of the range of "differences" in gene prediction
A number of genes were missed in the RAST predictions of
genes present in the contigs, and in addition the RAST
server predicted genes that were not present in the manu-
ally curated SEED genomes. Table 3 details the results of a
careful manual analysis of those differences in gene pre-
diction for three genomes from Table 1.
The majority of genes missing in RAST or predicted in
RAST, but not predicted in SEED, are hypothetical and
Our manual analysis of the features predicted by the RAST
server shows that only 1.3% (A. borkumensis SK2 and
Methanocaldococcus jannaschii DSM 2661) of the non-
hypothetical genes in the SEED and only 2.1% for
Wolinella succinogenes DSM 1740 were missed by RAST.
Further analysis revealed that in the case of Methanocaldo-
coccus jannaschii DSM 2661 of the 44 non-hypothetical
genes that the RAST server did not predict, 15 were trans-
posases or recombinases, 5 were small ribosomal proteins
and one was a leader peptide. These 21 genes are hard
cases for the current gene prediction algorithm used in
RAST.
RAST to KAAS annotation comparison
We have compiled a comparison of RAST annotations to
annotations obtained from KAAS for five genomes (Bacillus
subtilis subsp. subtilis str. 168, Escherichia coli K12,
Staphylococcus aureus subsp. aureus COL, Synechocystis sp. PCC
6803, Vibrio cholerae O1 biovar eltor str. N16961).
The RAST annotation to KAAS annotation comparison is
available at the URL given in the "Availability and
requirements" section. The KAAS provides functional annotation
of genes by BLAST comparisons against the manually
curated KEGG GENES database. RAST annotations were
obtained by submitting the DNA sequences of the
genomes (GenBank format obtained from RefSeq) to the
RAST server. KAAS annotations were obtained by submit-
ting the protein sequences for each genome (obtained
from RefSeq) to the KAAS. The resulting RAST and KO
(KEGG Orthology) assignments have been tabulated for
each genome and sorted by GenBank identifiers. In addi-
tion each GenBank identifier has been connected to the
appropriate entry in the Annotation Clearing House
(ACH) [23] to allow comparison to other public annota-
tion resources. The ACH is a framework for comparing
annotations of identical proteins from public resources
Table 2: Analysis of the discrepancies in annotation between SEED and RAST for three genomes

Genome                                    different  really different  manually reconciled
Alcanivorax borkumensis SK2                     164               111                   53
Methanocaldococcus jannaschii DSM 2661          306               153                  153
Wolinella succinogenes DSM 1740                 314               159                  155
Table 1: Differences in annotation

Genome                                    genes  % matched  % identical  different
Alcanivorax borkumensis SK2                2814       92.8         93.7        164
Aquifex aeolicus VF5                       1613       91.7         87.5        185
Buchnera aphidicola str. Bp                 550       90.0         94.9         25
Methanocaldococcus jannaschii DSM 2661     1844       90.6         81.7        306
Wolinella succinogenes DSM 1740            2094       93.8         84.0        314

The total number of genes (genes) is the number annotated in SEED; the percentage of matched genes (% matched) is the share paired by a
sequence-based matching of genes. Of those matched genes, the % identical column gives the share annotated with an identical annotation. The last column
gives the number of predicted genes with different annotations.
such as TIGR-CMR [24], UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL [12], GenBank [25], SEED [9], DOE-JGI
Integrated Microbial Genomes (IMG) [26] and KEGG [27].
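The tabulation described in this section (RAST and KO assignments listed per gene and sorted by GenBank identifier) can be sketched as a merge of two lookup tables. The function name, the "-" placeholder for a missing assignment, and every identifier and assignment in the example are invented for illustration; they are not taken from the published comparison.

```python
def tabulate_assignments(rast, kaas):
    """Merge two {genbank_id: assignment} maps into rows sorted by identifier.

    A gene annotated by only one service gets "-" in the other column.
    """
    rows = []
    for gid in sorted(set(rast) | set(kaas)):
        rows.append((gid, rast.get(gid, "-"), kaas.get(gid, "-")))
    return rows


# Made-up identifiers and assignments, for illustration only.
rast_calls = {
    "ID_0002": "DNA gyrase subunit B",
    "ID_0001": "Chromosomal replication initiator protein DnaA",
}
kaas_calls = {
    "ID_0001": "KO_PLACEHOLDER_A",
    "ID_0003": "KO_PLACEHOLDER_B",
}

table = tabulate_assignments(rast_calls, kaas_calls)
for row in table:
    print("\t".join(row))
```

Keeping the identifier as the first column makes it straightforward to attach further columns, such as a link to the corresponding Annotation Clearing House entry.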
Summary and Discussion of results
There are obvious limitations in using existing SEED
genomes to evaluate the service. However, we believe that
the examples discussed above indicate that the RAST
server has a false-negative rate for gene prediction
(non-hypothetical genes present in the SEED but missed by RAST)
of between 1.3% and 2.1%. While a more comprehensive
analysis is possible, the lack of a gold standard for gene
predictions in diverse genomes leads the authors to
believe that this performance analysis is adequate.
As shown in the examples discussed above, the rate of false
positives is of the same order of magnitude as the rate of
false negatives (Table 3).
The functional annotations generated by the RAST server
are between 91% and 94% identical to those in the SEED.
Again the lack of a "gold standard" for annotations makes
a more formal evaluation problematic, but we believe that
our analysis provides a qualitative estimate of the actual
server performance.
The reader is encouraged to peruse the comparison
of annotations described in the section headed "RAST
to KAAS annotation comparison" (see also the URL provided
in the "Availability and requirements" section) to gain
an appreciation of the relative accuracies provided by the
different annotation services. Alternatively, the reader can
select a well-annotated existing prokaryotic genome from any
source, submit the contigs to the RAST server, and compare
the returned annotations against those in the original
version. This is the most direct way to gain a meaningful
estimate of accuracy, consistency and completeness.
Developments in progress
We envision many additions and improvements to the
RAST Server, several of which are already being addressed
by our team and are discussed in the following paragraphs.
Detection and Processing of "Foreign DNA"
In many genomes, careful analysis of prophages, the rem-
nants of transposition events, insertions resulting from
conjugation, and the resulting pseudogenes is considered
an essential part of a manual annotation effort. It will
become increasingly important that we provide this anal-
ysis rapidly, accurately, and automatically if we wish to
process (for example) hundreds or thousands of closely
related pathogen genomes.
Processing Lower-quality Sequence
As we move to an era in which hundreds of genomes of
less-than-perfect quality are produced, bioinformatics
support will be needed to compensate for frameshifts that
reflect errors in sequence data. At this point most annota-
tion efforts are understandably reluctant to alter the input
sequence or to derive adjusted protein translations in
order to eliminate the impact of what might (or might
not) be a frameshift. We will offer a service that allows a
user to request automated "correction" of what appear to
be frameshifts, recording the alterations in attached annotations.
A Server that Will Support Analysis of Short Fragments of DNA
A simple modification to the step in which the RAST
server establishes the closest phylogenetic neighbours can
be used to allow processing of relatively short fragments
of DNA (typically over 20 kb). We have added this capa-
bility, although it will not be part of this initial release.
While the quality of the annotations is undoubtedly infe-
rior to what can be done with complete genomes, we feel
that many users would value even the limited analysis we
can provide automatically, allowing such a fragment to be
explored in a framework designed to support comparative
analysis.
Table 3: A detailed manual analysis

                                                  missing in RAST    missing in SEED
Genome                                    RNAs  hypoth.  non-hypo  hypoth.  non-hypo
Alcanivorax borkumensis SK2                 51      113        38       49        16
Methanocaldococcus jannaschii DSM 2661      43      105        25       74        19
Wolinella succinogenes DSM 1740             45       40        44       98        22

A detailed manual analysis of the genes called in RAST and in the SEED sheds some light on the differences in the respective predictions. As the
matching was performed on protein sequences, RNAs could not be matched. Genes found in the SEED and not predicted in the RAST were split
into two categories with and without hypothetical annotations. Additional predictions found in RAST but not in SEED were also included and again
split into hypothetical and non-hypothetical.
Construction of Analogous RAST-based Servers for
Metagenomic Data
We have constructed an analogous server, the MG-RAST
(MetaGenome-RAST), that is designed to take as input an
environmental sample in the form of thousands of
"reads". The server uses many aspects of the technology
described within this paper, but also features numerous
additions designed to support the analysis of metagen-
omic data [28].
Processing More Types of Genes
There is a growing awareness of the need to process more
types of RNA genes, as well as properly annotating special-
ized regions of the genome (e.g., the origin of replication).
In many cases, this can be achieved using the growing
number of excellent freely available tools that are being
developed worldwide. We will certainly add these to the
initial step of RAST in which non-protein-encoding genes
are recognized before initiating the main analysis.
We have designed, implemented and released a freely
available public server that will provide initial gene calls,
gene functions, and metabolic reconstructions for bacte-
rial and archaeal genomes. This server provides initial
annotations that we believe to be unusually complete,
consistent and accurate. It achieves these goals by utilizing
the growing collection of subsystems produced by "The
Project to Annotate 1000 Genomes" and a collection of
protein families, which are referred to as FIGfams. The
existing implementation is capable of sustaining a
throughput rate of 50–100 genomes daily.
Availability and requirements
The server is freely available at
The RAST annotation to KAAS annotation comparison is
available at
Authors' contributions
RKA Subsystem creation and maintenance; DB Evaluation
of the RAST output and QC; AAB Development and main-
tenance of metabolic scenarios; MDeJ Development and
maintenance of metabolic scenarios; TD Development
and implementation of Rapid Propagation Technology;
RAE Subsystem creation and maintenance; KF Develop-
ment and maintenance of metabolic scenarios; SG Subsys-
tem creation and maintenance; EG Contributed to
development of user interface; MK Subsystem creation
and maintenance; FM RAST System architecture and eval-
uation of the output, manuscript preparation; GJO Sub-
system creation and maintenance; RO Development and
implementation of Rapid Propagation Technology; ALO
Subsystem creation and maintenance; RAO RAST System
architecture and Development, implementation of Rapid
Propagation Technology, manuscript preparation, corre-
sponding author; LKMcN Testing and evaluation of the
RAST output; DP Interface design and implementation;
TP Interface design and implementation; BP Develop-
ment and implementation of Rapid Propagation Technol-
ogy; GDP Development and implementation of Rapid
Propagation Technology; CR Testing and evaluation of
the RAST output; RS RAST System architecture; OV Sub-
system creation and maintenance; VV Subsystem creation
and maintenance, manuscript preparation; AW Testing
and Monitoring of the RAST server; OZ Subsystem crea-
tion and maintenance. All authors have read and
approved the final manuscript.
This work was funded by the National Institute of Allergy and Infectious
Diseases, National Institutes of Health, Department of Health and Human
Services, under Contract HHSN266200400042C. This work was sup-
ported in part by the U.S. Department of Energy under Contract DE-
We wish to thank the RAST server users for their very helpful feedback.
1. Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J,
Kalinowski J, Linke B, Rupp O, Giegerich R, et al.: GenDB – an open
source genome annotation system for prokaryote genomes.
Nucleic Acids Res 2003, 31(8):2187-2195.
2. Van Domselaar GH, Stothard P, Shrivastava S, Cruz JA, Guo A, Dong
X, Lu P, Szafron D, Greiner R, Wishart DS: BASys: a web server
for automated bacterial genome annotation. Nucleic Acids Res
3. Bryson K, Loux V, Bossy R, Nicolas P, Chaillou S, van de Guchte M,
Penaud S, Maguin E, Hoebeke M, Bessieres P, et al.: AGMIAL:
implementing an annotation strategy for prokaryote
genomes as a distributed system. Nucleic Acids Res 2006,
4. Vallenet D, Labarre L, Rouy Z, Barbe V, Bocs S, Cruveiller S, Lajus A,
Pascal G, Scarpelli C, Medigue C: MaGe: a microbial genome
annotation system supported by synteny results. Nucleic Acids
Res 2006, 34(1):53-65.
5. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M: KAAS: an
automatic genome annotation and pathway reconstruction
server. Nucleic Acids Res 2007:W182-185.
6. Manatee [
7. McNeil LK, Reich C, Aziz RK, Bartels D, Cohoon M, Disz T, Edwards
RA, Gerdes S, Hwang K, Kubal M, et al.: The National Microbial
Pathogen Database Resource (NMPDR): a genomics plat-
form based on subsystem annotation. Nucleic Acids Res
8. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY,
Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, et al.: The
subsystems approach to genome annotation and its use in
the project to annotate 1000 genomes. Nucleic Acids Res 2005,
9. The SEED framework for comparative genomics [http://]
10. The Project to Annotate 1000 Genomes [http://www.the]
11. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram
UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The
COG database: new developments in phylogenetic classifica-
tion of proteins from complete genomes. Nucleic Acids Res
2001, 29(1):22-28.
12. Schneider M, Tognolli M, Bairoch A: The Swiss-Prot protein
knowledgebase and ExPASy: providing the plant community
with high quality proteomic data and tools. Plant Physiol Bio-
chem 2004, 42(12):1013-1021.
13. Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR,
Hu ZZ, Mazumder R, Kumar S, Kourtesis P, et al.: PIRSF: family
classification system at the Protein Information Resource.
Nucleic Acids Res 2004:D112-114.
14. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and
genomes. Nucleic Acids Res 2000, 28(1):27-30.
15. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT,
White O: TIGRFAMs: a protein family resource for the func-
tional identification of proteins. Nucleic Acids Res 2001,
16. Overbeek R, Bartels D, Vonstein V, Meyer F: Annotation of bacte-
rial and archaeal genomes: improving accuracy and consist-
ency. Chem Rev 2007, 107(8):3431-3447.
17. Wu CH, Shivakumar S: Proclass protein family database: new
version with motif alignments. Pac Symp Biocomput
18. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved
detection of transfer RNA genes in genomic sequence.
Nucleic Acids Res 1997, 25(5):955-964.
19. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved
microbial gene identification with GLIMMER. Nucleic Acids Res
1999, 27(23):4636-4641.
20. DeJongh M, Formsma K, Boillot P, Gould J, Rycenga M, Best A:
Toward the automated generation of genome-scale meta-
bolic networks in the SEED. BMC Bioinformatics 2007, 8:139.
21. Becker SA, Palsson BO: Genome-scale reconstruction of the
metabolic network in Staphylococcus aureus N315: an initial
draft to the two-dimensional annotation. BMC Microbiol 2005,
22. KAAS – KEGG Automatic Annotation Server [http://]
23. The Annotation Clearinghouse [http://clearing]
24. TIGR's Comprehensive Microbial Resource [http://]
25. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL:
GenBank. Nucleic Acids Res 2007:D21-25.
26. Markowitz VM, Szeto E, Palaniappan K, Grechkin Y, Chu K, Chen IM,
Dubchak I, Anderson I, Lykidis A, Mavromatis K, et al.: The inte-
grated microbial genomes (IMG) system in 2007: data con-
tent and analysis tool extensions. Nucleic Acids Res
27. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M,
Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al.: KEGG for
linking genomes to life and the environment. Nucleic Acids Res
28. The metagenomics RAST server [http://metagenom]