
Phage Genome Annotation Using the RAST Pipeline

  • Fellowship for the Interpretation of Genomes


Phages are complex biomolecular machines that have to survive in a bacterial world. Phage genomes show many adaptations to their lifestyle, such as shorter genes, reduced capacity for redundant DNA sequences, and the inclusion of tRNAs in their genomes. In addition, phages are not free-living; they require a host for replication and survival. These unique adaptations provide challenges for the bioinformatics analysis of phage genomes. In particular, ORF calling, genome annotation, noncoding RNA (ncRNA) identification, and the identification of transposons and insertions are all complicated in phage genome analysis. We provide a road map through the phage genome annotation pipeline, and discuss the challenges and solutions for phage genome annotation as we have implemented them in the Rapid Annotation using Subsystems Technology (RAST) pipeline.
Chapter 17
Katelyn McNair, Ramy Karam Aziz, Gordon D. Pusch, Ross Overbeek,
Bas E. Dutilh, and Robert Edwards
Key words Phage, Genome annotation, RAST, Functional annotation, Gene predictions
1 The Steps of Phage Genome Annotation
The essential steps in annotating any genome, whether phage,
bacterial, or eukaryotic, consist of identifying the features in the
genome and assigning terms describing roles or functions to those
features. Typical features that can be found in a phage genome
include protein-encoding genes, noncoding RNA genes, insertion
elements and transposons, direct and indirect repeats, origins of
replication, and attachment or integration sites. Annotations are
routinely added only to protein- and RNA-encoding genes, although
labels are often provided for insertion elements or transposons.
Phages are fundamentally dependent on a cellular host to
replicate, and the functions encoded in a phage genome can only be
completely understood in the context of the genome of the host. Thus,
identification or prediction of the bacterial or archaeal host is an
important part of phage annotation. Together, these features provide the
core annotation of phages, and this annotation provides the first
steps to understanding the function of the phage as it interacts with
its host (Fig. 1). We discuss the approaches to identify and annotate
each of these features below, and discuss how these annotations are
performed in the Rapid Annotation Using Subsystems Technology
(RAST) approach [1,2].
Martha R.J. Clokie et al. (eds.), Bacteriophages: Methods and Protocols, Volume 3, Methods in Molecular Biology, vol. 1681, © Springer Science+Business Media LLC 2018
Fig. 1 Pipeline of phage genome annotation starting with DNA sequences and
ending with an annotated genome
Protein-encoding genes are the focus of most automated annotation
systems, and more algorithms have been designed to handle
these features than other features. Generally, a protein-encoding
gene can be identified as a long stretch of sequence in one reading
frame that can be translated into protein sequence without encountering
one of the three stop codons; these long stretches are called
Open Reading Frames (ORFs). In gene calling, the end of an ORF is
unambiguous because each of the three stop codons terminates
translation (unless the phage encodes a suppressor
tRNA, which we do not discuss here). Most algorithms attempt to
identify the longest nonoverlapping ORFs in a genome, based on
the theory that the longer the open reading frame, the less likely it is
to occur by chance. Many alternative gene-finding algorithms
have been developed over the last two decades,
including CRITICA [3], GeneMark [4,5], GISMO [6], Glimmer
[7,8], MetaGeneAnnotator [9], and Prodigal [10]. Most of the
gene-finding algorithms find the same large genes because these are
obvious and have high confidence. The algorithms may differ in the
particular start sites that they identify; there may be multiple methi-
onine (ATG) or valine (GTG) codons that could all be used as the
start codon, and predicting exactly which start codon is the correct
one for a given gene is difficult without a priori knowledge of the
translation boundaries of the gene. In addition, the gene callers also
differ in their ability to identify small protein-encoding genes.
Short genes are statistically difficult to separate from the background
noise of stretches of nucleotides that do not contain a
stop codon, and gene-calling algorithms often use an arbitrary length
cutoff of (for example) 75 amino acids. It remains to be determined
how many small proteins are encoded in phage genomes, and this question
is unlikely to be resolved from a pure bioinformatics standpoint, as
it will require biological validation of bioinformatics predictions or
large-scale proteomic studies.
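The ORF definition above can be illustrated with a minimal six-frame scan. This is only a sketch under stated assumptions, not the algorithm used by Glimmer, Prodigal, or RAST: it assumes the standard stop codons (TAA, TAG, TGA), treats only ATG and GTG as candidate starts, and uses a hypothetical function name (`find_orfs`) and a length cutoff expressed in codons. Stop-free stretches that run off the end of the sequence are ignored.

```python
# Illustrative six-frame ORF scan; names and defaults are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code
START_CODONS = {"ATG", "GTG"}         # candidate starts named in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=75):
    """Report stop-terminated ORFs of at least min_codons codons.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned ("-" means the reverse-complemented sequence).
    Each ORF is reported from its first candidate start codon; any
    further in-frame starts are listed as alternatives.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame  # first codon after the previous stop
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] not in STOP_CODONS:
                    continue
                starts = [i for i in range(run_start, pos, 3)
                          if s[i:i + 3] in START_CODONS]
                if starts and (pos - starts[0]) // 3 >= min_codons:
                    results.append({"strand": strand,
                                    "start": starts[0],
                                    "end": pos + 3,
                                    "alt_starts": starts[1:]})
                run_start = pos + 3
    return results
```

The `alt_starts` list makes the start-site ambiguity explicit: every in-frame ATG or GTG downstream of the first one is an equally valid candidate as far as this scan can tell, which is why gene callers that agree on the stop codon can still disagree on the start.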
Most bacterial genomes are not thought to contain overlapping
open reading frames, and these shadow ORFs are removed during
the annotation step [10]. In viruses, including phages, however,
there are several well-known examples of two different genes encoded
by the same stretch of DNA, such as the Rz/Rz1 system [11]. One
study even suggests that new genes may be born via this process,
providing evidence from the comparative genomics of Rhabdoviridae
genomes [12]. These overlapping genes are generally not
predicted by most bioinformatics approaches, as adding overlapping
ORFs to gene prediction algorithms would introduce an
enormous number of false positives to recover only a few
false negatives. Therefore, most phage protein prediction schemes
ignore overlapping proteins.
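The effect of discarding overlapping ORFs can be sketched with a greedy, longest-first filter. This is a simplified illustration, not the procedure of any particular gene caller; the function name, the `max_overlap` parameter, and the coordinates in the usage example are all hypothetical.

```python
def select_nonoverlapping(orfs, max_overlap=0):
    """Greedy longest-first selection of (start, end) intervals.

    An ORF is kept only if it overlaps every already-kept ORF by at
    most max_overlap bases. Intervals are 0-based and end-exclusive.
    """
    kept = []
    for start, end in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        if all(min(end, k_end) - max(start, k_start) <= max_overlap
               for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


# Hypothetical coordinates: a long gene, a gene embedded within it,
# and a neighbor that overlaps the long gene by 10 bases.
candidates = [(0, 460), (200, 380), (450, 900)]
strict = select_nonoverlapping(candidates)
tolerant = select_nonoverlapping(candidates, max_overlap=30)
```

With a strict filter only the longest ORF survives; allowing a small overlap recovers the neighboring gene, but an embedded gene (the situation of Rz1 within Rz) is removed in both cases.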
Following ORF identification, most bioinformatic gene predic-
tion tools assign a confidence score to the ORFs using a model of
what a gene is expected to look like, based on its nucleotide usage
statistics. These statistics are specific for a species, and depend on
properties like the codon usage and GC content of the genome. In
bacterial genomes, the RAST pipeline starts by identifying highly
conserved genes that are present in nearly every genome. The
statistics from those genes are then used to build a genome-specific
model for open reading frame identification that is applied to the
rest of the genome. In phage genomes, there are typically very few, if
any, highly conserved genes, and never enough to build a reliable
gene model. Therefore, most gene calling is performed by a generic
model that is not trained on the specific genome being annotated
but on the genomes of all phages. By default, the RAST pipeline
uses Glimmer to identify the open reading frames, but options are
available to use MetaGeneAnnotator [9], GeneMark [4], or
Prodigal [10].
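The nucleotide usage statistics mentioned above can be made concrete with a small sketch that computes GC content and codon usage from a set of training genes and then scores a candidate ORF against that usage. The scoring function is a deliberately simple log-odds against a uniform background of 1/64 per codon; it is an assumption chosen for illustration and is not the model used by Glimmer, Prodigal, or RAST. All function names are hypothetical.

```python
import math
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_log_odds(orf, usage, pseudo=1e-3):
    """Sum over codons of log(usage / uniform), with uniform = 1/64.

    Codons absent from the usage table get a small pseudo-frequency.
    Positive scores mean the ORF resembles the training genes more
    than it resembles a uniform random sequence.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return sum(math.log(usage.get(orf[i:i + 3], pseudo) * 64)
               for i in range(0, usable, 3))
```

For a bacterial genome the usage table would be built from the conserved genes of that genome; for a phage, as described above, it would have to come from a generic collection of phage genes instead.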
The functional annotation of protein-encoding phage genes is
usually based on homology searches against existing phages. His-
torically, phage genes were named with a single letter starting at
gpA, and either proceeding along the genome or assigning names
based on the order in which the genes or their products were
found. This resulted in several unrelated proteins from different
phages all having the same name. For example, both terminases and
DNA replication initiation proteins have been annotated as gpA in
different phage genomes available from GenBank. This confusion,
amplified by the explosion of genome sequences in recent years, led
to efforts to categorize phage proteins into either phage orthologous
groups (POGs) [13] or subsystems [14] that have unified the
annotation of many phage proteins. These common, descriptive
names provide a framework for comparing annotations among
different phage genomes. The RAST system uses a combination
of homology, chromosomal clustering, and subsystems to assign
functions to proteins. First, proteins are annotated on the basis of
homology to known proteins. If this initial search yields matches to
proteins that are a component of a subsystem, RAST then tries to
find other members of the subsystem that should be present in the
same genome based on information from the previously annotated
genomes. The advantage of this approach is that the RAST system
can strengthen otherwise weak assertions of homology, based on
predictions from subsystem annotations. Of note, the RAST tools
allow the analysis of proteins in their chromosomal context, which
sometimes helps determine the roles of proteins with unknown
functions based on the functions of their chromosomal neighbors
(e.g., protein subunits encoded by different genes, members of
operons, or transporters of metabolites whose metabolizing
enzymes are encoded on the same cluster). Phage genomes, like
bacterial genomes, also order some of their genes, and this infor-
mation can be leveraged to identify clusters of genes. For example,
the small and large terminase (TerS and TerL) are frequently adja-
cent on the genome, and the identification of one leads to the
identification of the other.
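The use of chromosomal context can be sketched as follows: when a gene with a known partner role is annotated but the partner is missing from the genome, adjacent genes of unknown function become candidates for that role. This is a toy illustration of the idea only; the function name, the role strings, and the one-gene window are assumptions and do not reproduce the RAST implementation.

```python
HYPOTHETICAL = "hypothetical protein"


def suggest_from_neighbors(functions, partners, window=1):
    """Suggest roles for unannotated genes next to a known partner.

    functions: annotated function of each gene, in genome order.
    partners:  maps a role to the role expected to be encoded nearby.
    Returns {gene_index: [suggested roles]} for hypothetical proteins
    within `window` genes of a role whose partner is not yet annotated.
    """
    suggestions = {}
    for i, role in enumerate(functions):
        partner = partners.get(role)
        if partner is None or partner in functions:
            continue  # no expected partner, or it is already annotated
        lo = max(0, i - window)
        hi = min(len(functions), i + window + 1)
        for j in range(lo, hi):
            if j != i and functions[j] == HYPOTHETICAL:
                suggestions.setdefault(j, []).append(partner)
    return suggestions


# Illustrative role names; the strings are placeholders.
genome = [HYPOTHETICAL,
          "Phage terminase, large subunit",
          "Phage portal protein",
          HYPOTHETICAL]
pairs = {"Phage terminase, large subunit": "Phage terminase, small subunit"}
candidates = suggest_from_neighbors(genome, pairs)
```

A suggestion produced this way is a hypothesis to be checked against weak homology evidence, not an annotation in its own right.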
A major difficulty in the functional annotation of protein-
encoding genes on phage genomes by homology searches is the
fact that most proteins have no close homologs in the reference
databases. Especially for novel phages, this results in the majority of
encoded ORFs having no annotated function, or a hypothetical
function at best. A possible solution includes homology-indepen-
dent annotation, based on amino acid usage profiles of the proteins.
One such approach, iVIREONS, uses machine learning to “learn” the characteristics of manually
annotated phage proteins and then tests unknown proteins to see if
they have similar characteristics [15].
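An amino acid usage profile of the kind referred to above can be computed in a few lines. The sketch below only builds the 20-dimensional composition vector and compares two proteins by Euclidean distance; it is not the iVIREONS method, which trains a classifier on annotated proteins, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_profile(protein):
    """Relative frequency of each standard amino acid in a protein.

    Non-standard or ambiguous characters are ignored.
    """
    protein = protein.upper()
    counts = [protein.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def profile_distance(p, q):
    """Euclidean distance between two composition profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

In a homology-independent classifier, vectors like these would serve as input features, so that an unknown protein can be compared with functional classes of annotated proteins even when no sequence alignment is detectable.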
Noncoding RNA (ncRNA) genes. Although ribosomal RNAs
have not yet been found in phage genomes, most pipelines,
including RAST, look for them anyway, as the pipelines were
developed for bacterial genome annotation and the computational
developed for bacterial genome annotation and the computational
cost of looking for rRNA genes is a minimal addition to the pipe-
line. Ribosomal RNA genes are highly conserved and are identified
by extrinsic gene calling—using a database of known RNA genes to
compare against. In contrast to rRNA genes that are recognized by
homology, tRNA genes are recognized by intrinsic gene calling—
using only features of the sequence. They are typically identified by
computational tools built specifically to recognize the secondary
structure of the tRNA molecule [16]. As with tRNAs, the function
of other non-protein coding RNA genes also depends on the
structure of the folded RNA molecule rather than the nucleotide
sequence. Therefore, other noncoding RNA genes are also recog-
nized by their conserved secondary structure rather than homology
to existing sequences [17]. The RAST pipeline uses a manually
curated database of ribosomal RNA genes to find them in a
genome, and uses tRNAscan-SE [16] to identify tRNA genes.
Many phages encode tRNA genes, and it has been proposed that
these may supplement host-encoded tRNAs in translating phage
proteins for codons that are insufficiently covered by the bacterial
tRNAs [18]. Host tRNA genes are also often used as phage
integration sites in the host’s genome (attB). Integration of the
phage disrupts the host gene, and thus carrying a complete, or near
complete, tRNA gene allows the phage to reconstitute the tRNA gene
into which it integrates [19]. There has been little exploration of
the role of ncRNA in phage lifestyle. Recent work has identified
CRISPR/Cas systems in phage
genomes [20] and metagenomes [21], and it is thought that they
are being used to attack other phages that may be infecting the
same host.
Insertion elements and transposons are currently identified by
annotations of protein-encoding genes. Transposases (Tn) are
readily identified as protein-encoding genes, and the similarity
between members of the transposase family, and with other recom-
binases, is high enough that they usually receive accurate annota-
tion. However, the repeats flanking the insertion sequence or
transposon are not typically automatically annotated. There are
boutique databases of these problematic mobile elements
[22,23], but often the classification of insertion (IS) elements is
dependent on one or a few residues. Typically, automatic annotation
systems identify the Tn or IS elements but cannot identify the fine
details responsible for the accurate categorization of these ele-
ments. More work is required to accurately denote the ends of
these mobile elements in automatic phage annotation systems.
Direct and indirect repeats are usually used to identify the ends of
insertion elements and transposons [22], and to predict the ends of
prophages that have been found in bacterial genomes [13]. Stan-
dard informatics approaches can easily identify repeats longer than
approximately 14 nucleotides in a phage genome. Below that
length, repeats are found too frequently to ascertain whether they
are indeed the correct flanking repeats, or randomly occurring
repeated sequence elements. A few websites can be used to identify
repeats in DNA sequences (e.g., [24,25]).
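Exact direct and inverted repeats of a given minimum length can be located with a simple k-mer index, as sketched below. This illustration finds exact matches only and makes no attempt to tolerate mismatches or to extend matches beyond k bases; the function names are hypothetical and the default of 14 nucleotides simply mirrors the threshold mentioned above.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_index(seq, k):
    """Map every k-mer in seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def direct_repeats(seq, k=14):
    """k-mers that occur more than once on the same strand."""
    index = kmer_index(seq.upper(), k)
    return {kmer: pos for kmer, pos in index.items() if len(pos) > 1}


def inverted_repeats(seq, k=14):
    """Pairs (i, j), i < j, where the k-mer at j is the reverse
    complement of the k-mer at i."""
    index = kmer_index(seq.upper(), k)
    hits = []
    for kmer, starts in index.items():
        for j in index.get(reverse_complement(kmer), []):
            hits.extend((i, j) for i in starts if i < j)
    return sorted(hits)
```

Lowering k quickly inflates the number of hits, which is the practical reason short repeats cannot be distinguished from chance matches.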
Phage attachment sites are impossible to detect de novo if only
the phage is known, but if the phage and the host genome
sequences are known, they are trivial to find. The phage carries
the attachment site P (attP) that has sequence homology to the
bacterial attachment site B (attB). Integration is initiated by recom-
bination between attP and attB, resulting in attL and attR sites
that flank the nascent prophage.
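When both genomes are available, a candidate attachment core can be sketched as the longest exact sequence shared by the phage and the host. The example below uses `difflib.SequenceMatcher` from the Python standard library for brevity; this is adequate for short toy sequences but far too slow for real genomes, where an indexed aligner would be used instead. It searches one strand only, and the function name, the minimum length of 15, and the sequences in the usage example are all hypothetical.

```python
from difflib import SequenceMatcher


def candidate_att_core(phage, host, min_len=15):
    """Longest exact substring shared by phage and host, if long enough.

    Returns a dict with the 0-based start of the match in the phage
    (candidate attP) and in the host (candidate attB), plus the shared
    core sequence, or None when the best match is shorter than min_len.
    """
    phage, host = phage.upper(), host.upper()
    matcher = SequenceMatcher(None, phage, host, autojunk=False)
    match = matcher.find_longest_match(0, len(phage), 0, len(host))
    if match.size < min_len:
        return None
    return {"attP_start": match.a,
            "attB_start": match.b,
            "core": phage[match.a:match.a + match.size]}


# Toy sequences with an invented 20-base shared core.
core = "GCGTTCGAATCCGCTACCTG"
phage_seq = "ATATATAT" + core + "CGCGCGCG"
host_seq = "GGGGTTTTGGGG" + core + "AAAACCCC"
hit = candidate_att_core(phage_seq, host_seq)
```

After integration the same core would be found twice in the lysogen, once at each end of the prophage, corresponding to attL and attR.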
Accurately Annotating Phage Metadata. Annotating genomic
metadata is a general challenge to genomics and metagenomics.
With bacteriophages, this issue is even more problematic, given the
lack of systematic nomenclature for viruses (as opposed to the
binomial system used for cellular organisms, see Chapter 15 of this
book). Some attempts were made to suggest systematic nomencla-
ture for viruses similar to those used for plasmids [26], but they are
not widely applied or enforced. In addition to accurate taxonomic
descriptions of viruses, including metadata associated with the virus
(e.g., its morphology, actual host, host range, and lifestyle) is
equally important. These make comparative genomics studies pos-
sible, enable predictive tools such as those that identify the host of
unknown phages [27], or predict the lifestyle of new phages [14]
and improve metagenomic/microbiomic annotations. Other
important types of metadata can be computed from the genomic
information, e.g., a genome’s length, %G+C, and codon usage
[28]. These too have quite powerful applications in comparative
genomics, prophage finding, and metagenomics. For example,
information content of phage genomes has improved prophage
finding [29] and is proposed to improve metagenomic analysis
[30]. As with gene annotation, metadata annotation needs to use
a controlled vocabulary (which has to be consistent but not necessarily
rigid or hierarchical). Spelling inconsistencies (e.g., firmicutes
vs. Firmicutes vs. gram-positive bacteria) or terminology inconsistencies
(e.g., temperate vs. lysogenic lifestyles) are all obstacles
to computational analysis and data propagation.
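A controlled vocabulary can be enforced with a small lookup that maps free-text values to preferred terms and flags anything it does not recognize. The sketch below is illustrative only: the field names, the choice of "temperate" as the preferred term, and the function name are assumptions, and a real vocabulary would be far larger and curated by the community.

```python
# Hypothetical controlled vocabulary: lower-cased variant -> preferred term.
CONTROLLED_TERMS = {
    "lifestyle": {"temperate": "temperate", "lysogenic": "temperate"},
    "host_phylum": {"firmicutes": "Firmicutes"},
}


def normalize(field, value):
    """Return the preferred term for a metadata value, or None.

    Matching ignores case and surrounding or repeated whitespace.
    None signals a value that needs manual curation rather than a
    silent guess.
    """
    key = " ".join(value.strip().lower().split())
    return CONTROLLED_TERMS.get(field, {}).get(key)
```

Here "firmicutes" and "Firmicutes" collapse to one term, while "gram-positive bacteria" is deliberately left unmapped, because it is not equivalent to a single phylum and should be resolved by a curator.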
To summarize, phage annotation involves the identification
and functional description of several types of features, including
protein-encoding genes, RNA genes, insertion elements and trans-
posons, repeats, and attachment sites. Moreover, phage–host asso-
ciations are an important part of understanding phage biology that
can be predicted using a range of computational tools [27]. The
RAST pipeline provides an automated approach to phage genome
annotation. The pipeline currently uses bacterial ORF-finding algo-
rithms to identify the proteins in the genome, and a combination of
homology-based and subsystems-based approaches to decorate
those proteins with their functional annotation. RNA genes are
detected by a combination of extrinsic and intrinsic gene calling
methods. There remain several hurdles to accurate phage genome
annotation, especially the assignment of functions to unknown
proteins, the identification of small proteins in the genome, and
the correct and unambiguous identification of insertion elements
and transposons. The combination of bioinformatics advances and
a better understanding of phage biology will help to improve phage
genome annotation, making this field a fertile area for further
research.
Acknowledgments
This work was supported by grants from the National Science
Foundation MCB-1330800 and DUE-1323809 to RAE. BED
was supported by the Netherlands Organization for Scientific
Research (NWO) Vidi grant 864.14.004.
References
1. Aziz RK, Bartels D, Best AA, DeJongh M,
Disz T, Edwards RA, Formsma K, Gerdes S,
Glass EM, Kubal M, Meyer F, Olsen GJ,
Olson R, Osterman AL, Overbeek RA, McNeil
LK, Paarmann D, Paczian T, Parrello B, Pusch
GD, Reich C, Stevens R, Vassieva O,
Vonstein V, Wilke A, Zagnitko O (2008) The
RAST Server: rapid annotations using subsys-
tems technology. BMC Genomics 9:75
2. Brettin T, Davis JJ, Disz T, Edwards RA,
Gerdes S, Olsen GJ, Olson R, Overbeek R,
Parrello B, Pusch GD, Shukla M, Thomason
JA III, Stevens R, Vonstein V, Wattam AR, Xia F
(2015) RASTtk: a modular and extensible
implementation of the RAST algorithm for
building custom annotation pipelines and
annotating batches of genomes. Sci Rep
3. Badger JH, Olsen GJ (1999) CRITICA: cod-
ing region identification tool invoking compar-
ative analysis. Mol Biol Evol 16:512–524
4. Borodovsky M, McIninch JD, Koonin EV,
Rudd KE, Médigue C, Danchin A (1995)
Detection of new genes in a bacterial genome
using Markov models for three gene classes.
Nucleic Acids Res 23:3554–3562
5. Lukashin AV, Borodovsky M (1998) Gene-
Mark.hmm: new solutions for gene finding.
Nucleic Acids Res 26:1107–1115
6. Krause L, McHardy AC, Pühler A, Stoye J,
Meyer F (2007) GISMO - Gene identification
using a support vector machine for ORF classification.
Nucleic Acids Res 35:540–549
7. Delcher AL, Harmon D, Kasif S, White O,
Salzberg SL (1999) Improved microbial gene
identification with GLIMMER. Nucleic Acids
Res 27:4636–4641
8. Kelley DR, Liu B, Delcher AL, Pop M, Salz-
berg SL (2012) Gene prediction with Glimmer
for metagenomic sequences augmented by
classification and clustering. Nucleic Acids Res
9. Noguchi H, Taniguchi T, Itoh T (2008) Meta-
GeneAnnotator: Detecting species-specific pat-
terns of ribosomal binding site for precise gene
prediction in anonymous prokaryotic and
phage genomes. DNA Res 15:387–396
10. Hyatt D, Chen G-L, LoCascio PF, Land ML,
Larimer FW, Hauser LJ (2010) Prodigal: pro-
karyotic gene recognition and translation initi-
ation site identification. BMC Bioinformatics
11. Summer EJ, Berry J, Tran TAT, Niu L, Struck
DK, Young R (2007) Rz/Rz1 lysis gene
equivalents in phages of Gram-negative hosts.
J Mol Biol 373:1098–1112
12. Walker PJ, Firth C, Widen SG, Blasdell KR,
Guzman H, Wood TG, Paradkar PN, Holmes
EC, Tesh RB, Vasilakis N (2015) Evolution of
genome size and complexity in the Rhabdovir-
idae. PLoS Pathog 11:e1004664
13. Kristensen DM, Waller AS, Yamada T, Bork P,
Mushegian AR, Koonin EV (2013) Ortholo-
gous gene clusters and taxon signature genes
for viruses of prokaryotes. J Bacteriol
14. McNair K, Bailey BA, Edwards RA (2012)
PHACTS, a computational approach to classi-
fying the lifestyle of phages. Bioinformatics
15. Seguritan V, Alves N, Arnoult M, Raymond A,
Lorimer D, Burgin AB, Salamon P, Segall AM
(2012) Artificial neural networks trained to
detect viral and phage structural proteins.
PLoS Comput Biol 8:e1002657
16. Lowe TM, Eddy SR (1997) tRNAscan-SE: a
program for improved detection of transfer
RNA genes in genomic sequence. Nucleic
Acids Res 25:955–964
17. Nawrocki EP (2014) Annotating functional
RNAs in genomes using Infernal. Methods
Mol Biol 1097:163–197
18. Bailly-Bechet M, Vergassola M, Rocha E
(2007) Causes for the intriguing presence of
tRNAs in phages. Genome Res 17:1486–1495
19. Williams KP (2002) Integration sites for
genetic elements in prokaryotic tRNA and
tmRNA genes: sublocation preference of inte-
grase subfamilies. Nucleic Acids Res
20. Seed KD, Lazinski DW, Calderwood SB,
Camilli A (2013) A bacteriophage encodes its
own CRISPR/Cas adaptive response to evade
host innate immunity. Nature 494:489–491
21. Cassman N, Prieto-Davó A, Walsh K, Silva
GGZ, Angly F, Akhter S, Barott K, Busch J,
McDole T, Haggerty JM, Willner D,
Alarcón G, Ulloa O, DeLong EF, Dutilh BE,
Rohwer F, Dinsdale EA (2012) Oxygen minimum
zones harbour novel viral communities
with low diversity. Environ Microbiol
22. Aziz RK, Breitbart M, Edwards RA (2010)
Transposases are the most abundant, most
ubiquitous genes in nature. Nucleic Acids Res
23. Riadi G, Medina-Moenne C, Holmes DS
(2012) TnpPred: a web service for the robust
prediction of prokaryotic transposases. Comp
Funct Genomics 2012:678761
24. Benson G (1999) Tandem repeats finder: a
program to analyze DNA sequences. Nucleic
Acids Res 27:573–580
25. Volfovsky N, Haas BJ, Salzberg SL (2001) A
clustering method for repeat analysis in DNA
sequences. Genome Biol 2:RESEARCH0027
26. Kropinski AM, Prangishvili D, Lavigne R
(2009) Position paper: the creation of a ratio-
nal scheme for the nomenclature of viruses of
Bacteria and Archaea. Environ Microbiol
27. Edwards RA, McNair K, Faust K, Raes J,
Dutilh BE (2016) Computational approaches
to predict bacteriophage–host relationships.
FEMS Microbiol Rev 40:58–72
28. Aziz RK, Dwivedi B, Akhter S, Breitbart M,
Edwards RA (2015) Multidimensional metrics
for estimating phage abundance, distribution,
gene density, and sequence coverage in meta-
genomes. Front Microbiol 6:381
29. Akhter S, Aziz RK, Edwards RA (2012)
PhiSpy: a novel algorithm for finding pro-
phages in bacterial genomes that combines
similarity- and composition-based strategies.
Nucleic Acids Res 40:e126–e126
30. Akhter S, Bailey BA, Salamon P, Aziz RK,
Edwards RA (2013) Applying Shannon’s infor-
mation theory to bacterial and phage genomes
and metagenomes. Sci Rep 3:1033
... [44]. Sequence analysis and annotation of resulting open reading frames (ORFs) were predicted using Prokka v1.14 [45] and Rapid Annotation Subsystems Technology (RAST) [46]. compared with their corresponding sequences of reference bacteriophages deposited in the NCBI database to determine the phylogenetic position of the recently isolated phage [50]. ...
Full-text available
Background Pseudomonas aeruginosa is a nosocomial bacterium responsible for variety of infections. Inappropriate use of antibiotics could lead to emergence of multidrug-resistant (MDR) P. aeruginosa strains. Herein, a virulent phage; vB_PaeM_PS3 was isolated and tested for its application as alternative to antibiotics for controlling P. aeruginosa infections. Methods Phage morphology was observed using transmission electron microscopy (TEM). The phage host range and efficiency of plating (EOP) in addition to phage stability were analyzed. One-step growth curve was performed to detect phage growth kinetics. The impact of isolated phage on planktonic cells and biofilms was assessed. The phage genome was sequenced. Finally, the therapeutic potential of vB_PaeM_PS3 was determined in vivo. Results Isolated phage has an icosahedral head and a contractile tail and was assigned to the family Myoviridae. The phage vB_PaeM_PS3 displayed a broad host range, strong bacteriolytic ability, and higher environmental stability. Isolated phage showed a short latent period and large burst size. Importantly, the phage vB_PaeM_PS3 effectively eradicated bacterial biofilms. The genome of vB_PaeM_PS3 consists of 93,922 bp of dsDNA with 49.39% G + C content. It contains 171 predicted open reading frames (ORFs) and 14 genes as tRNA. Interestingly, the phage vB_PaeM_PS3 significantly attenuated P. aeruginosa virulence in host where the survival of bacteria-infected mice was markedly enhanced following phage treatment. Moreover, the colonizing capability of P. aeruginosa was markedly impaired in phage-treated mice as compared to untreated infected mice. Conclusion Based on these findings, isolated phage vB_PaeM_PS3 could be potentially considered for treating of P. aeruginosa infections.
... La secuenciación del genoma de los fagos y la construcción de bases de datos internacionales donde esta información se vuelca y se comparte son claves para facilitar la identificación de características genéticas indeseables y para descartar rápidamente cepas de fagos no adecuadas en función de las homologías con genes ya conocidos e informados (35)(36)(37). Por ejemplo, portar genes represores e integrasas asociadas con la lisogenia, o genes de virulencia bacteriana, toxinas y/o resistencia a los antibióticos descalificaría un fago para la terapia humana (38,39). De todas maneras, en situaciones en las que es difícil aislar un fago lítico, es posible, como ya se mencionó, eliminar los genes con características indeseables a través de ingeniería genética, volviendo apto a un fago lisogénico (27,40). ...
Full-text available
La creciente resistencia antimicrobiana asociada a la crisis en la producción de nuevos antibióticos y las consecuencias humanas y económicas de este fenómeno, constituyen un complejo escenario que requiere el urgente desarrollo de estrategias antimicrobianas alternativas. Los bacteriófagos son virus que infectan y lisan bacterias. Se conocen desde hace más de un siglo pero en las últimas dos décadas, la administración de bacteriofagos ha ganado popularidad en todo el mundo. Existe un extenso cuerpo de evidencia preclínica y clínica que posiciona a la fagoterapia como una de las principales herramientas para el tratamiento de infecciones difíciles de tratar. Si bien esto es conceptualmente promisorio, su implementación está limitada por la escasez de datos clínicos de seguridad y eficacia, obtenidos acorde a los estándares científicos actuales. Esta revisión describe los datos más relevantes acerca de la biología de los fagos, los aspectos farmacocinéticos y farmacodinámicos conocidos hasta la actualidad, los temas regulatorios y los resultados clínicos más relevantes.
... accessed on 3 September 2021) and RAST (, accessed on 3 September 2021) [19]. Using CGView Server (, ...
Full-text available
Limosilactobacillus fermentum is a bacterium widely used in food production, medicine, and industrial fermentation. However, fermentation could fail due to phage contamination. L. fermentum bacteriophage LFP02 can be induced from L. fermentum IMAU 32579 using mitomycin C. To better understand the characteristics of this phage, its physiological and genomic characteristics were evaluated. The results showed that its optimal multiplicity of infection was 0.01, and the burst size was 148.03 ± 2.65 pfu/infective center. Compared to temperature, pH had a more obvious influence on phage viability, although its adsorption capacity was not affected by the divalent cations (Ca2+ and Mg2+) or chloramphenicol. Its genome size was 43,789 bp and the GC content was 46.06%, including 53 functional proteins. Compared to other L. fermentum phages, phage LFP02 had chromosome deletion, insertion, and inversion, which demonstrated that it was a novel phage. This study could expand the knowledge of the biological characteristics of L. fermentum bacteriophages and provide some theoretical basis for bacteriophage prevention during fermentation.
... Predicted functional categories were associated with all available phage genes using the RAST pipeline [40]. The in silico translated protein sequences were used as queries to search for sequence homologs in the non-redundant protein database at the National Centre for Biotechnology (including the viral genome database). ...
Full-text available
The Oenococcus genus comprises four recognized species, and members have been found in different types of beverages, including wine, kefir, cider and kombucha. In this work, we implemented two complementary strategies to assess whether oenococcal hosts of different species and habitats were connected through their bacteriophages. First, we investigated the diversity of CRISPR-Cas systems using a genome-mining approach, and CRISPR-endowed strains were identified in three species. A census of the spacers from the four identified CRISPR-Cas loci showed that each spacer space was mostly dominated by species-specific sequences. Yet, we characterized a limited records of potentially recent and also ancient infections between O. kitaharae and O. sicerae and phages of O. oeni, suggesting that some related phages have interacted in diverse ways with their Oenococcus hosts over evolutionary time. Second, phage-host interaction analyses were performed experimentally with a diversified panel of phages and strains. None of the tested phages could infect strains across the species barrier. Yet, some infections occurred between phages and hosts from distinct beverages in the O. oeni species.
... The phage genome was reannotated on the RAST server (, accessed on 23 October 2022) by the RASTtk pipeline, following the customized phage annotation pipeline [39]. The annotation of every protein-coding gene was further checked and confirmed by BLASTP searches against the NCBI NR database, filtered for viruses [40] then by PFAM searches in the case of hypothetical proteins. ...
Full-text available
The inadequate therapeutic opportunities associated with carbapenem-resistant Pseudomonas aeruginosa (CRPA) clinical isolates impose a search for innovative strategies. Therefore, our study aimed to characterize and evaluate two locally isolated phages formulated in a hydrogel, both in vitro and in vivo, against CRPA clinical isolates. The two phages were characterized by genomic, microscopic, phenotypic characterization, genomic analysis, in vitro and in vivo analysis in a Pseudomonas aeruginosa-infected skin thermal injury rat model. The two siphoviruses belong to class Caudovirectes and were named vB_Pae_SMP1 and vB_Pae_SMP5. Each phage had an icosahedral head of 60 ± 5 nm and a flexible, non-contractile tail of 170 ± 5 nm long, while vB_Pae_SMP5 had an additional base plate containing a 35 nm fiber observed at the end of the tail. The hydrogel was prepared by mixing 5% w/v carboxymethylcellulose (CMC) into the CRPA propagated phage lysate containing phage titer 108 PFU/mL, pH of 7.7, and a spreadability coefficient of 25. The groups were treated with either Phage vB_Pae_SMP1, vB_Pae_SMP5, or a two-phage cocktail hydrogel cellular subepidermal granulation tissues with abundant records of fibroblastic activity and mixed inflammatory cell infiltrates and showed 17.2%, 25.8%, and 22.2% records of dermal mature collagen fibers, respectively. In conclusion, phage vB_Pae_SMP1 or vB_Pae_SMP5, or the two-phage cocktails formulated as hydrogels, were able to manage the infection of CRPA in burn wounds, and promoted healing at the injury site, as evidenced by the histopathological examination, as well as a decrease in animal mortality rate. Therefore, these phage formulae can be considered promising for clinical investigation in humans for the management of CRPA-associated skin infections.
... [79] pipeline with genetic code 4 (i.e., for Mycoplasma/Spiroplasma), which utilized a database containing all available NCBI Spiroplasma genomes (June 2018). Predicted ORFs at least 100 bp long were retained, allowing for overlapping ORFs [80]. Non-coding RNA regions were annotated with Aragorn v.1.2.38 [81]. ...
Bacteriophages are vastly abundant, diverse, and influential, but with few exceptions (e.g. the Proteobacteria genera Wolbachia and Hamiltonella), the role of phages in heritable bacteria-arthropod interactions, which are ubiquitous and diverse, remains largely unexplored. Despite prior studies documenting phage-like particles in the mollicute Spiroplasma associated with Drosophila flies, genomic sequences of such phage are lacking, and their effects on the Spiroplasma-Drosophila interaction have not been comprehensively characterized. We used a density step gradient to isolate phage-like particles from the male-killing bacterium Spiroplasma poulsonii (strains NSRO and MSRO-Br) harbored by Drosophila melanogaster. Isolated particles were subjected to DNA sequencing, assembly, and annotation. Several lines of evidence suggest that we recovered phage-like particles of similar features (shape, size, DNA content) to those previously reported in Drosophila-associated Spiroplasma strains. We recovered three ~19 kb phage-like contigs (two in NSRO and one in MSRO-Br) containing 21–24 open reading frames, a read-alignment pattern consistent with circular permutation, and terminal redundancy (at least in NSRO). Although our results do not allow us to distinguish whether these phage-like contigs represent infective phage-like particles capable of transmitting their DNA to new hosts, their encoding of several typical phage genes suggests that they are at least remnants of functional phage. We also recovered two smaller non-phage-like contigs encoding a known Spiroplasma toxin (Ribosome Inactivating Protein; RIP), and an insertion element, suggesting that they are packaged into particles. Substantial homology of our particle-derived contigs was found in the genome assemblies of members of the Spiroplasma poulsonii clade.
To kill bacteria, bacteriophages (phages) must first bind to a receptor, triggering the release of the phage DNA into the bacterial cell. Many bacteria secrete polysaccharides that had been thought to shield bacterial cells from phage attack. We use a comprehensive genetic screen to show that the capsule is not a shield but is instead a primary receptor enabling phage predation. Screening of a transposon library to select phage-resistant Klebsiella shows that the first receptor-binding event docks to saccharide epitopes in the capsule. We discover a second step of receptor binding, dictated by specific epitopes in an outer membrane protein. This additional and necessary event precedes phage DNA release to establish a productive infection. That such discrete epitopes dictate two essential binding events for phages has profound implications for understanding the evolution of phage resistance and what dictates host range, two issues critically important to translating knowledge of phage biology into phage therapies.
Weizmannia coagulans (formerly Bacillus coagulans) is a Gram-positive, spore-forming bacterium that causes food spoilage, especially in acidic canned food products. To control W. coagulans, we isolated a bacteriophage, Youna2, from a sewage sludge sample. Morphological analysis revealed that phage Youna2 belongs to the Siphoviridae family, with a non-contractile and flexible tail. Youna2 has a 52,903 bp double-stranded DNA genome containing 61 open reading frames. There are no lysogeny-related genes, suggesting that Youna2 is a virulent phage. plyYouna2, a putative endolysin gene, was identified in the genome of Youna2 and predicted to be composed of an N-acetylmuramoyl-L-alanine amidase domain (PF01520) at the N-terminus and a domain of unknown function, DUF5776 (PF19087), at the C-terminus. While phage Youna2 has a narrow host range, infecting only certain strains of W. coagulans, PlyYouna2 exhibited a broad antimicrobial spectrum beyond the Bacillus genus. Interestingly, PlyYouna2 can lyse Gram-negative bacteria such as Escherichia coli, Yersinia enterocolitica, Pseudomonas putida and Cronobacter sakazakii without other additives to destabilize the bacterial outer membrane. To the best of our knowledge, Youna2 is the first W. coagulans-infecting phage, and we speculate that its endolysin PlyYouna2 can provide the basis for the development of a novel biocontrol agent against various foodborne pathogens.
Autographiviridae is a diverse yet distinct family of bacterial viruses marked by a strictly lytic lifestyle and a generally conserved genome organization. Here, we characterized Pseudomonas aeruginosa phage LUZ100, a distant relative of type phage T7. LUZ100 is a podovirus with a limited host range which likely uses lipopolysaccharide (LPS) as a phage receptor. Interestingly, infection dynamics of LUZ100 indicated moderate adsorption rates and low virulence, hinting at temperate characteristics. This hypothesis was supported by genomic analysis, which showed that LUZ100 shares the conventional T7-like genome organization yet carries key genes associated with a temperate lifestyle. To unravel the peculiar characteristics of LUZ100, ONT-cappable-seq transcriptomics analysis was performed. These data provided a bird's-eye view of the LUZ100 transcriptome and enabled the discovery of key regulatory elements, antisense RNA, and transcriptional unit structures. The transcriptional map of LUZ100 also allowed us to identify new RNA polymerase (RNAP)-promoter pairs that can form the basis for biotechnological parts and tools for new synthetic transcription regulation circuitry. The ONT-cappable-seq data revealed that the LUZ100 integrase and a MarR-like regulator (proposed to be involved in the lytic/lysogeny decision) are actively cotranscribed in an operon. In addition, the presence of a phage-specific promoter transcribing the phage-encoded RNA polymerase raises questions on the regulation of this polymerase and suggests that it is interwoven with the MarR-based regulation. This transcriptomics-driven characterization of LUZ100 supports recent evidence that T7-like phages should not automatically be assumed to have a strictly lytic life cycle.
IMPORTANCE Bacteriophage T7, considered the "model phage" of the Autographiviridae family, is marked by a strictly lytic life cycle and conserved genome organization. Recently, novel phages within this clade have emerged which display characteristics associated with a temperate life cycle. Screening for temperate behavior is of utmost importance in fields like phage therapy, where strictly lytic phages are generally required for therapeutic applications. In this study, we applied an omics-driven approach to characterize the T7-like Pseudomonas aeruginosa phage LUZ100. These results led to the identification of actively transcribed lysogeny-associated genes in the phage genome, pointing out that temperate T7-like phages are emerging more frequently than initially thought. In short, the combination of genomics and transcriptomics allowed us to obtain a better understanding of the biology of nonmodel Autographiviridae phages, which can be used to optimize the implementation of phages and their regulatory elements in phage therapy and biotechnological applications, respectively.
Staphylococcus aureus is one of the major pathogens causing foodborne outbreaks and severe infections worldwide. Generally, various physical and chemical treatments have been applied to control S. aureus in the food industry. However, conventional treatments usually affected food quality and often produced toxic compounds. Therefore, bacteriophage (phage), a natural antimicrobial agent, has been suggested as an alternative strategy to control foodborne pathogens including S. aureus. In this study, KMSP1, a bacteriophage infecting S. aureus, was isolated from a raw milk sample and characterized. Transmission electron microscopy (TEM) analysis revealed that phage KMSP1 belongs to the Myoviridae family. Phage KMSP1 efficiently inhibited bacterial growth for >28 h post-infection. In addition, phage KMSP1 could infect a broad spectrum of S. aureus strains, including methicillin-resistant S. aureus (MRSA) strains. Whole-genome sequence analysis showed that KMSP1 is a lytic phage lacking genes related to lysogen formation, toxin production, and antibiotic resistance. In the genome of KMSP1, the presence of a putative tail lysin containing a cysteine/histidine-dependent amidohydrolase/peptidase (CHAP) domain could be one of the reasons for the effective antimicrobial activity of KMSP1. Furthermore, the high stability of phage KMSP1 at temperatures ranging from 4 to 55 °C and pH ranging from 5 to 11 suggested its potential use in various food systems. Receptor analysis revealed that KMSP1 utilized cell wall teichoic acid (WTA), one of the major virulence factors of S. aureus, as a host receptor. Application of phage KMSP1 at an MOI of 10⁴ achieved a significant reduction of log 8.8 CFU/mL of viable cell number in pasteurized milk and log 4.3 CFU/cm² in sliced cheddar cheese after 24 h. Taken together, the strong antimicrobial activity of phage KMSP1 suggested that it could be developed as a biocontrol agent in dairy products to control S. aureus contamination.
Metagenomics has changed the face of virus discovery by enabling the accurate identification of viral genome sequences without requiring isolation of the viruses. As a result, metagenomic virus discovery leaves the first and most fundamental question about any novel virus unanswered: What host does the virus infect? The diversity of the global virosphere and the volumes of data obtained in metagenomic sequencing projects demand computational tools for virus-host prediction. We focus on bacteriophages (phages, viruses that infect bacteria), the most abundant and diverse group of viruses found in environmental metagenomes. By analyzing 820 phages with annotated hosts, we review and assess the predictive power of in silico phage-host signals. Sequence homology approaches are the most effective at identifying known phage-host pairs. Compositional and abundance-based methods contain significant signal for phage-host classification, providing opportunities for analyzing the unknowns in viral metagenomes. Together, these computational approaches further our knowledge of the interactions between phages and their hosts. Importantly, we find that all reviewed signals significantly link phages to their hosts, illustrating how current knowledge and insights about the interaction mechanisms and ecology of coevolving phages and bacteria can be exploited to predict phage-host relationships, with potential relevance for medical and industrial applications.
Phages are the most abundant biological entities on Earth and play major ecological roles, yet the current sequenced phage genomes do not adequately represent their diversity, and little is known about the abundance and distribution of these sequenced genomes in nature. Although the study of phage ecology has benefited tremendously from the emergence of metagenomic sequencing, a systematic survey of phage genes and genomes in various ecosystems is still lacking, and fundamental questions about phage biology, lifestyle, and ecology remain unanswered. To address these questions and improve comparative analysis of phages in different metagenomes, we screened a core set of publicly available metagenomic samples for sequences related to completely sequenced phages using the web tool, Phage Eco-Locator. We then adopted and deployed an array of mathematical and statistical metrics for a multidimensional estimation of the abundance and distribution of phage genes and genomes in various ecosystems. Experiments using those metrics individually showed their usefulness in emphasizing the pervasive, yet uneven, distribution of known phage sequences in environmental metagenomes. Using these metrics in combination allowed us to resolve phage genomes into clusters that correlated with their genotypes and taxonomic classes as well as their ecological properties. We propose adding this set of metrics to current metaviromic analysis pipelines, where they can provide insight regarding phage mosaicism, habitat specificity, and evolution.
RNA viruses exhibit substantial structural, ecological and genomic diversity. However, genome size in RNA viruses is likely limited by a high mutation rate, resulting in the evolution of various mechanisms to increase complexity while minimising genome expansion. Here we conduct a large-scale analysis of the genome sequences of 99 animal rhabdoviruses, including 45 genomes which we determined de novo, to identify patterns of genome expansion and the evolution of genome complexity. All but seven of the rhabdoviruses clustered into 17 well-supported monophyletic groups, of which eight corresponded to established genera, seven were assigned as new genera, and two were taxonomically ambiguous. We show that the acquisition and loss of new genes appears to have been a central theme of rhabdovirus evolution, and has been associated with the appearance of alternative, overlapping and consecutive ORFs within the major structural protein genes, and the insertion and loss of additional ORFs in each gene junction in a clade-specific manner. Changes in the lengths of gene junctions accounted for as much as 48.5% of the variation in genome size from the smallest to the largest genome, and the frequency with which new ORFs were observed increased in the 3' to 5' direction along the genome. We also identify several new families of accessory genes encoded in these regions, and show that non-canonical expression strategies involving TURBS-like termination-reinitiation, ribosomal frame-shifts and leaky ribosomal scanning appear to be common. We conclude that rhabdoviruses have an unusual capacity for genomic plasticity that may be linked to their discontinuous transcription strategy from the negative-sense single-stranded RNA genome, and propose a model that accounts for the regular occurrence of genome expansion and contraction throughout the evolution of the Rhabdoviridae.
The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Such measurement is valuable in evaluating large data sets, such as metagenomic libraries, to prioritize their analysis and annotation, thus saving computational resources. Here, Shannon's index of complete phage and bacterial genomes was examined. The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be used for rapid screening for sequences with matches in available databases, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis.
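The dependence on word size described above can be illustrated with a minimal sketch of Shannon's index computed over the words of a sequence. The abstract does not give the exact formulation, so this sketch assumes overlapping words and a base-2 logarithm (bits); the function name `shannon_index` and its parameters are hypothetical.

```python
import math
from collections import Counter


def shannon_index(sequence: str, word_size: int = 3) -> float:
    """Shannon uncertainty, in bits, of the overlapping words in a sequence.

    Assumptions (not taken from the source): words overlap, case is
    ignored, and the logarithm is base 2.
    """
    seq = sequence.upper()
    # Collect every overlapping word of the requested length.
    words = [seq[i:i + word_size] for i in range(len(seq) - word_size + 1)]
    if not words:
        # Sequence shorter than one word: no information to measure.
        return 0.0
    counts = Counter(words)
    total = len(words)
    # H = -sum(p * log2(p)) over the observed word frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

With a word size of 1, a sequence containing each of the four nucleotides equally often gives the maximum of 2 bits, while a homopolymer gives 0. Larger word sizes raise the attainable maximum, which is one way the index comes to depend on word size and sequence length.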
Transposases (Tnps) are enzymes that participate in the movement of insertion sequences (ISs) within and between genomes. Genes that encode Tnps are amongst the most abundant and widely distributed genes in nature. However, they are difficult to predict bioinformatically and given the increasing availability of prokaryotic genomes and metagenomes, it is incumbent to develop rapid, high quality automatic annotation of ISs. This need prompted us to develop a web service, termed TnpPred for Tnp discovery. It provides better sensitivity and specificity for Tnp predictions than given by currently available programs as determined by ROC analysis. TnpPred should be useful for improving genome annotation. The TnpPred web service is freely available for noncommercial use.
Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here we present an update of the Phage Orthologous Groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded dataset shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside of known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly if at all covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside of detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
We describe a program, tRNAscan-SE, which identifies 99-100% of transfer RNA genes in DNA sequence while giving less than one false positive per 15 gigabases. Two previously described tRNA detection programs are used as fast, first-pass prefilters to identify candidate tRNAs, which are then analyzed by a highly selective tRNA covariance model. This work represents a practical application of RNA covariance models, which are general, probabilistic secondary structure profiles based on stochastic context-free grammars. tRNAscan-SE searches at approximately 30 000 bp/s. Additional extensions to tRNAscan-SE detect unusual tRNA homologues such as selenocysteine tRNAs, tRNA-derived repetitive elements and tRNA pseudogenes.
Many different types of functional non-coding RNAs participate in a wide range of important cellular functions but the large majority of these RNAs are not routinely annotated in published genomes. Several programs have been developed for identifying RNAs, including specific tools tailored to a particular RNA family as well as more general ones designed to work for any family. Many of these tools utilize covariance models (CMs), statistical models of the conserved sequence, and structure of an RNA family. In this chapter, as an illustrative example, the Infernal software package and CMs from the Rfam database are used to identify RNAs in the genome of the archaeon Methanobrevibacter ruminantium, uncovering some additional RNAs not present in the genome's initial annotation. Analysis of the results and comparison with family-specific methods demonstrate some important strengths and weaknesses of this general approach.
Bacteriophages (or phages) are the most abundant biological entities on earth, and are estimated to outnumber their bacterial prey by tenfold. The constant threat of phage predation has led to the evolution of a broad range of bacterial immunity mechanisms that in turn result in the evolution of diverse phage immune evasion strategies, leading to a dynamic co-evolutionary arms race. Although bacterial innate immune mechanisms against phage abound, the only documented bacterial adaptive immune system is the CRISPR/Cas (clustered regularly interspaced short palindromic repeats/CRISPR-associated proteins) system, which provides sequence-specific protection from invading nucleic acids, including phage. Here we show a remarkable turn of events, in which a phage-encoded CRISPR/Cas system is used to counteract a phage inhibitory chromosomal island of the bacterial host. A successful lytic infection by the phage is dependent on sequence identity between CRISPR spacers and the target chromosomal island. In the absence of such targeting, the phage-encoded CRISPR/Cas system can acquire new spacers to evolve rapidly and ensure effective targeting of the chromosomal island to restore phage replication.