
Phage Genome Annotation Using the RAST Pipeline

Authors:
  • Fellowship for the Interpretation of Genomes

Abstract

Phages are complex biomolecular machines that have to survive in a bacterial world. Phage genomes show many adaptations to their lifestyle, such as shorter genes, reduced capacity for redundant DNA sequences, and the inclusion of tRNA genes. In addition, phages are not free-living: they require a host for replication and survival. These unique adaptations pose challenges for the bioinformatic analysis of phage genomes. In particular, ORF calling, genome annotation, noncoding RNA (ncRNA) identification, and the identification of transposons and insertions are all complicated in phage genome analysis. We provide a road map through the phage genome annotation pipeline and discuss the challenges and solutions for phage genome annotation as implemented in the Rapid Annotation Using Subsystems Technology (RAST) pipeline.
Chapter 17
Katelyn McNair, Ramy Karam Aziz, Gordon D. Pusch, Ross Overbeek,
Bas E. Dutilh, and Robert Edwards
Key words Phage, Genome annotation, RAST, Functional annotation, Gene predictions
1 The Steps of Phage Genome Annotation
The essential steps in annotating any genome, whether phage,
bacterial, or eukaryotic, consist of identifying the features in the
genome and assigning terms describing roles or functions to those
features. Typical features that can be found in a phage genome
include protein-encoding genes, noncoding RNA genes, insertion
elements and transposons, direct and indirect repeats, origins of
replication, and attachment or integration sites. Annotations are
routinely added only to protein- and RNA-encoding genes, although labels
are often provided for insertion elements or transposons. Phages
are fundamentally dependent on a cellular host to
replicate, and the functions encoded on a phage genome can only be completely
understood in the context of the genome of the host. Thus, identification
or prediction of the bacterial or archaeal host is an important
part of phage annotation. Together, these features provide the
core annotation of phages, and this annotation provides the first
steps to understanding the function of the phage as it interacts with
its host (Fig. 1). We discuss the approaches to identify and annotate
each of these features below, and discuss how these annotations are
performed in the Rapid Annotation Using Subsystems Technology
(RAST) approach [1,2].
Martha R.J. Clokie et al. (eds.), Bacteriophages: Methods and Protocols, Volume 3, Methods in Molecular Biology, vol. 1681,
https://doi.org/10.1007/978-1-4939-7343-9_17, © Springer Science+Business Media LLC 2018
Fig. 1 Pipeline of phage genome annotation starting with DNA sequences and
ending with an annotated genome
Protein-encoding genes are the focus of most automated annotation
systems, and more algorithms have been designed to handle
these features than any others. Generally, a protein-encoding
gene can be identified as a long stretch of sequence in one reading
frame that can be translated into protein sequence without encountering
one of the three stop codons; these long stretches are called
open reading frames (ORFs). In gene calling, the end of an ORF is
unambiguous because it is marked by one of only three stop codons
(unless the phage encodes a suppressor tRNA, which we do not
discuss here). Most algorithms attempt to
identify the longest nonoverlapping ORFs in a genome, based on
the theory that the longer the open reading frame the less likely it is
to occur by chance. Many alternative gene-finding algorithms
have been developed over the last two decades,
including CRITICA [3], GeneMark [4,5], GISMO [6], Glimmer
[7,8], MetaGeneAnnotator [9], and Prodigal [10]. Most of the
gene-finding algorithms find the same large genes because these are
obvious and have high confidence. The algorithms may differ in the
particular start sites that they identify; there may be multiple methi-
onine (ATG) or valine (GTG) codons that could all be used as the
start codon, and predicting exactly which start codon is the correct
one for a given gene is difficult without a priori knowledge of the
translation boundaries of the gene. In addition, the gene callers also
differ in their ability to identify small protein-encoding genes.
Short genes are statistically difficult to separate from the background
noise of stretches of nucleotides that happen not to contain a stop
codon, so gene-calling algorithms often apply an arbitrary length
cutoff of (for example) 75 amino acids. It remains to be determined
how many small proteins are encoded in phage genomes, and this is
unlikely to be approached from a pure bioinformatics standpoint, as
it will require biological validation of bioinformatics predictions or
large-scale proteomic studies.
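The ORF definition above can be illustrated with a minimal six-frame scan. This is only a sketch under stated assumptions, not the algorithm used by Glimmer, Prodigal, or RAST: it accepts ATG and GTG as start codons (the two named in the text), always picks the most upstream start, and applies the example cutoff of 75 amino acids. The function names are hypothetical.

```python
# Teaching sketch of ORF calling; real gene callers add statistical models.
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG"}   # assumption: only the two start codons named in the text
MIN_AA = 75               # the example length cutoff mentioned in the text


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_strand(seq, min_aa=MIN_AA):
    """Yield (start, end) of start-to-stop ORFs on one strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    An ORF is reported when it has at least min_aa codons before the stop.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # most upstream start, i.e. the longest ORF
            elif codon in STOPS:
                if start is not None and (i - start) // 3 >= min_aa:
                    yield (start, i + 3)
                start = None


def find_orfs(seq, min_aa=MIN_AA):
    """Return sorted (strand, start, end) tuples in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = [("+", s, e) for s, e in orfs_in_strand(seq, min_aa)]
    # Map reverse-strand coordinates back onto the forward strand.
    found += [("-", n - e, n - s) for s, e in orfs_in_strand(revcomp(seq), min_aa)]
    return sorted(found, key=lambda orf: orf[1])
```

Lowering `min_aa` quickly increases the number of candidate ORFs, which is the statistical problem with short genes described above.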
Most bacterial genomes are not thought to contain overlapping
open reading frames, so candidate ORFs that overlap a predicted
gene (shadow ORFs) are removed during
the annotation step [10]. In viruses, including phages, however,
there are several well-known examples of two different genes encoded by
the same stretch of DNA, such as the Rz/Rz1 system [11]. One
study even suggests that new genes may be born via this process,
providing evidence from the comparative genomics of Rhabdovir-
idae genomes [12]. These overlapping regions are generally not
predicted using most bioinformatics approaches, as adding overlapping
ORFs to gene prediction algorithms would introduce an
enormous number of false positives to recover only a few
false negatives. Therefore, most phage protein prediction schemes
ignore overlapping proteins.
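The trade-off described above can be made concrete with a greedy filter that keeps the longest ORFs and discards anything overlapping them. This is a simplified sketch, not the procedure of any particular gene caller; the function names and the `max_overlap` parameter are hypothetical.

```python
def overlap(a, b):
    """Number of positions shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def drop_shadow_orfs(orfs, max_overlap=0):
    """Greedily keep the longest ORFs.

    An ORF is discarded if it overlaps an already kept ORF by more than
    max_overlap nucleotides. Strand is ignored in this sketch.
    """
    kept = []
    for orf in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        if all(overlap(orf, k) <= max_overlap for k in kept):
            kept.append(orf)
    return sorted(kept)
```

With `max_overlap=0` a gene embedded entirely within another, as in the Rz/Rz1 arrangement, is always discarded, which is the false negative the text refers to; raising the threshold recovers such cases at the cost of more spurious calls.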
Following ORF identification, most bioinformatic gene predic-
tion tools assign a confidence score to the ORFs using a model of
what a gene is expected to look like, based on its nucleotide usage
statistics. These statistics are specific for a species, and depend on
properties like the codon usage and GC content of the genome. In
bacterial genomes, the RAST pipeline starts by identifying highly
conserved genes that are present in nearly every genome. The
statistics from those genes are then used to build a genome-specific
model for open reading frame identification that is applied to the
rest of the genome. In phage genomes, there are typically very few, if
any, highly conserved genes, and never enough to build a reliable
gene model. Therefore, most gene calling is performed by a generic
model that is not trained on the specific genome being annotated
but on the genomes of all phages. By default, the RAST pipeline
uses Glimmer to identify the open reading frames, but options are
available to use MetaGeneAnnotator [9], GeneMark [4], or
Prodigal [10].
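As a rough illustration of scoring ORFs against a gene model built from nucleotide usage statistics, the sketch below estimates codon frequencies from a set of training genes and scores a candidate ORF by a log-odds ratio against a uniform background. The uniform background, the pseudocount, and the function names are simplifying assumptions; the gene callers named above use considerably richer models.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a sequence into complete in-frame codons."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_frequencies(training_genes, pseudocount=1.0):
    """Estimate codon frequencies from training genes (pseudocount avoids zeros)."""
    counts = Counter()
    for gene in training_genes:
        counts.update(codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def log_odds_score(orf, freqs):
    """Sum of log2(model frequency / uniform frequency) over the ORF's codons.

    Positive scores mean the ORF uses codons favoured by the training genes.
    Codons containing ambiguous bases are skipped.
    """
    uniform = 1.0 / len(ALL_CODONS)
    return sum(math.log2(freqs[c] / uniform) for c in codons(orf) if c in freqs)
```

For a bacterial genome the training set would be the conserved genes found in the first pass; for a phage, as noted above, it has to come from a generic collection of phage genes instead.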
The functional annotation of protein-encoding phage genes is
usually based on homology searches against existing phages. His-
torically, phage genes were named with a single letter starting at
gpA, and either proceeding along the genome or assigning names
based on the order in which the genes or their products were
found. This resulted in several unrelated proteins from different
phages all having the same names. For example, both terminases and
DNA replication initiation proteins have been annotated as gpA in
different phage genomes available from GenBank. This confusion,
amplified by the explosion of genome sequences in recent years, led
to efforts to categorize phage proteins into either phage ortholo-
gous groups (POGs) [13] or subsystems [14] that have unified the
annotation of many phage proteins. These common, descriptive
names provide a framework for comparing annotations among
different phage genomes. The RAST system uses a combination
of homology, chromosomal clustering, and subsystems to assign
functions to proteins. First, proteins are annotated on the basis of
homology to known proteins. If this initial search yields matches to
proteins that are a component of a subsystem, RAST then tries to
find other members of the subsystem that should be present in the
same genome based on information from the previously annotated
genomes. The advantage of this approach is that the RAST system
can strengthen otherwise weak assertions of homology, based on
predictions from subsystem annotations. Of note, the RAST tools
allow the analysis of proteins in their chromosomal context, which
sometimes helps determine the roles of proteins with unknown
functions based on the functions of their chromosomal neighbors
(e.g., protein subunits encoded by different genes, members of
operons, or transporters of metabolites whose metabolizing
enzymes are encoded in the same cluster). In phage genomes, as in
bacterial genomes, some genes occur in a conserved order, and this information
can be leveraged to identify clusters of genes. For example,
the small and large terminase (TerS and TerL) are frequently adja-
cent on the genome, and the identification of one leads to the
identification of the other.
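The use of chromosomal context can be sketched as a simple neighbor lookup: when one member of a pair of functions that are usually adjacent has been annotated and the other is missing, unannotated neighbors become candidates for the missing role. The partner table, the function names, and the `window` parameter below are hypothetical, and the output is a list of hypotheses to check, not an annotation. RAST's subsystem logic is more involved than this.

```python
UNKNOWN = "hypothetical protein"

# Hypothetical table of functions that tend to be encoded side by side.
PARTNERS = {
    "terminase large subunit": "terminase small subunit",
    "terminase small subunit": "terminase large subunit",
}


def neighbor_hints(genes, partners=PARTNERS, window=1):
    """Suggest roles for unannotated genes next to an annotated partner.

    genes: list of (gene_id, function) tuples in genome order.
    Returns {gene_id: suggested function}.
    """
    annotated = {function for _, function in genes}
    hints = {}
    for i, (_, function) in enumerate(genes):
        expected = partners.get(function)
        if expected is None or expected in annotated:
            continue  # no known partner, or the partner is already annotated
        lo, hi = max(0, i - window), min(len(genes), i + window + 1)
        for gene_id, neighbor_function in genes[lo:hi]:
            if neighbor_function == UNKNOWN:
                hints.setdefault(gene_id, expected)
    return hints
```

A candidate flagged this way would still need supporting evidence, for example a weak homology hit that becomes credible in this context.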
A major difficulty in the functional annotation of protein-
encoding genes on phage genomes by homology searches is the
fact that most proteins have no close homologs in the reference
databases. Especially for novel phages, this results in the majority of
encoded ORFs having no annotated function, or a hypothetical
function at best. A possible solution includes homology-indepen-
dent annotation, based on amino acid usage profiles of the proteins.
One such approach, iVIREONS (https://vdm.sdsu.edu/ivireons/),
uses machine learning to “learn” the characteristics of manually
annotated phage proteins and then tests unknown proteins to see if
they have similar characteristics [15].
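A homology-independent classifier needs a fixed-length description of each protein. One simple choice, shown here only as an assumed illustration of an amino acid usage profile and not as the actual feature set of iVIREONS, is the vector of the 20 amino acid frequencies, which could then be passed to any machine learning method.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(protein):
    """Return the 20 amino acid frequencies, ordered as in AMINO_ACIDS.

    Characters outside the 20 standard residues (e.g. X or *) are ignored.
    """
    protein = protein.upper()
    counts = [protein.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def profile_distance(p, q):
    """Euclidean distance between two composition vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```

Because the vector has the same length for every protein, sequences with no detectable homologs can still be compared with manually annotated ones.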
Noncoding RNA (ncRNA) genes. Although ribosomal RNA (rRNA) genes
have not yet been found in phage genomes, most pipelines,
including RAST, look for them anyway, as the pipelines were
developed for bacterial genome annotation and the computational
cost of looking for rRNA genes is a minimal addition to the pipe-
line. Ribosomal RNA genes are highly conserved and are identified
by extrinsic gene calling—using a database of known RNA genes to
compare against. In contrast to rRNA genes that are recognized by
homology, tRNA genes are recognized by intrinsic gene calling—
using only features of the sequence. They are typically identified by
computational tools built specifically to recognize the secondary
structure of the tRNA molecule [16]. As with tRNAs, the function
of other non-protein coding RNA genes also depends on the
structure of the folded RNA molecule rather than the nucleotide
sequence. Therefore, other noncoding RNA genes are also recog-
nized by their conserved secondary structure rather than homology
to existing sequences [17]. The RAST pipeline uses a manually
curated database of ribosomal RNA genes to find them in a
genome, and uses tRNAscan-SE [16] to identify tRNA genes.
Many phages encode tRNA genes, and it has been proposed that
these may supplement host-encoded tRNAs in translating phage
proteins for anticodons that are insufficiently covered by the bacterial
tRNAs [18]. tRNA genes in the host’s genome are also often used as phage
integration sites (attB). Integration of the
phage disrupts the host gene, and thus carrying complete, or near
complete, tRNA genes allows the phage to reconstitute a tRNA
into which it can integrate [19]. There has been little exploration of
the role of ncRNA in phage lifestyle. Recent work with
CRISPR/Cas systems has identified the presence of these systems in phage
genomes [20] and metagenomes [21], and it is thought that they
are being used to attack other phages that may be infecting the
same host.
Insertion elements and transposons are currently identified by
annotations of protein-encoding genes. Transposases (Tn) are
readily identified as protein-encoding genes, and the similarity
between members of the transposase family, and with other recom-
binases, is high enough that they usually receive accurate annota-
tion. However, the repeats flanking the insertion sequence or
transposon are not typically automatically annotated. There are
boutique databases of these problematic mobile elements
[22,23], but often the classification of insertion (IS) elements is
dependent on one or a few residues. Typically automatic annotation
systems identify the Tn or IS elements but cannot identify the fine
details responsible for the accurate categorization of these ele-
ments. More work is required to accurately denote the ends of
these mobile elements in automatic phage annotation systems.
Direct and indirect repeats are usually used to identify the ends of
insertion elements and transposons [22], and to predict the ends of
prophages that have been found in bacterial genomes [13]. Standard
informatics approaches can easily identify repeats longer than
approximately 14 nucleotides in a phage genome. Below that
length, repeats are found too frequently to ascertain whether they
are indeed the correct flanking repeats, or randomly occurring
repeated sequence elements. A few websites can be used to identify
repeats in DNA sequences (e.g., [24,25]).
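Exact repeats of a given length can be found by indexing every k-mer in the genome. The sketch below uses k = 14, following the threshold mentioned in the text, and reports direct repeats as well as inverted repeats (a k-mer and its reverse complement). It finds exact matches only, and the function names are hypothetical; the tools cited above also handle imperfect repeats.

```python
from collections import defaultdict


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmer_index(seq, k):
    """Map every k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def direct_repeats(seq, k=14):
    """Return {kmer: [positions]} for k-mers occurring more than once."""
    index = kmer_index(seq.upper(), k)
    return {kmer: pos for kmer, pos in index.items() if len(pos) > 1}


def inverted_repeats(seq, k=14):
    """Return sorted (i, j) pairs, i < j, where the k-mer at i equals
    the reverse complement of the k-mer at j."""
    seq = seq.upper()
    index = kmer_index(seq, k)
    pairs = []
    for j in range(len(seq) - k + 1):
        for i in index.get(revcomp(seq[j:j + k]), []):
            if i < j:
                pairs.append((i, j))
    return sorted(pairs)
```

Running the same functions with a smaller `k` on a real genome shows the problem described above: the number of matches grows rapidly and most of them are chance occurrences.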
Phage attachment sites are impossible to detect de novo if only
the phage is known, but if the phage and the host genome
sequences are known, they are trivial to find. The phage carries
the attachment site P (attP) that has sequence homology to the
bacterial attachment site B (attB). Integration is initiated by recom-
bination between attP and attB, resulting in attL and attR sites
that flank the nascent prophage.
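When both sequences are available, a candidate attachment core can be located as the longest exact match shared by the phage and host genomes. The following is a minimal seed-and-extend sketch with several stated simplifications: it searches the forward strand only, treats both sequences as linear, and uses a hypothetical minimum seed length of 14 nucleotides. The function name is an assumption as well.

```python
def shared_core(phage, host, k=14):
    """Return (phage_pos, host_pos, length) of the longest exact match
    of at least k nucleotides, or None if there is no such match."""
    phage, host = phage.upper(), host.upper()
    index = {}
    for j in range(len(host) - k + 1):
        index.setdefault(host[j:j + k], []).append(j)
    best = None
    for i in range(len(phage) - k + 1):
        for j in index.get(phage[i:i + k], []):
            if i > 0 and j > 0 and phage[i - 1] == host[j - 1]:
                continue  # not the left end of this match; already covered
            length = k
            while (i + length < len(phage) and j + length < len(host)
                   and phage[i + length] == host[j + length]):
                length += 1
            if best is None or length > best[2]:
                best = (i, j, length)
    return best
```

In practice the reverse complement of the phage would be searched too, and a circular phage genome would be extended past its arbitrary end so that a site spanning the junction is not missed.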
Accurately Annotating Phage Metadata. Annotating genomic
metadata is a general challenge to genomics and metagenomics.
With bacteriophages, this issue is even more problematic, given the
lack of systematic nomenclature for viruses (as opposed to the
binomial system used for cellular organisms, see Chapter 15 of this
book). Some attempts were made to suggest systematic nomencla-
ture for viruses similar to those used for plasmids [26], but they are
not widely applied or enforced. In addition to accurate taxonomic
descriptions of viruses, including metadata associated with the virus
(e.g., its morphology, actual host, host range, and lifestyle) is
equally important. These make comparative genomics studies pos-
sible, enable predictive tools such as those that identify the host of
unknown phages [27], or predict the lifestyle of new phages [14]
and improve metagenomic/microbiomic annotations. Other
important types of metadata can be computed from the genomic
information, e.g., a genome’s length, %G+C, and codon usage
[28]. These too have quite powerful applications in comparative
genomics, prophage finding, and metagenomics. For example,
information content of phage genomes has improved prophage
finding [29] and is proposed to improve metagenomic analysis
[30]. As with gene annotation, metadata annotation needs to use
a controlled vocabulary (which has to be consistent but not necessarily
rigid or hierarchical). Spelling inconsistencies (e.g., firmicutes
vs. Firmicutes vs. gram-positive bacteria) or terminology inconsistencies
(e.g., temperate vs. lysogenic lifestyles) are all obstacles
to computational analysis and data propagation.
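A small sketch of how computed and curated metadata might be recorded together: genome length and %G+C are derived from the sequence, while free-text terms are mapped onto a controlled vocabulary and rejected if they are not recognized. The vocabulary table contains only the examples from the text, and the field and function names are hypothetical.

```python
# Hypothetical controlled vocabulary: accepted spelling for each known variant.
CONTROLLED_TERMS = {
    "firmicutes": "Firmicutes",
    "temperate": "temperate",
    "lysogenic": "temperate",
}


def normalize_term(term, vocabulary=CONTROLLED_TERMS):
    """Map a free-text term to its controlled form, or raise ValueError."""
    key = " ".join(term.strip().lower().split())
    if key not in vocabulary:
        raise ValueError(f"term not in controlled vocabulary: {term!r}")
    return vocabulary[key]


def genome_metadata(name, seq, host_taxon, lifestyle):
    """Combine computed values (length, %G+C) with normalized curated terms."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    return {
        "name": name,
        "length": len(seq),
        "gc_percent": round(100 * gc, 2),
        "host_taxon": normalize_term(host_taxon),
        "lifestyle": normalize_term(lifestyle),
    }
```

Rejecting unknown terms instead of passing them through keeps inconsistent spellings from propagating into downstream comparisons.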
To summarize, phage annotation involves the identification
and functional description of several types of features, including
protein-encoding genes, RNA genes, insertion elements and trans-
posons, repeats, and attachment sites. Moreover, phage–host asso-
ciations are an important part of understanding phage biology that
can be predicted using a range of computational tools [27]. The
RAST pipeline provides an automated approach to phage genome
annotation. The pipeline currently uses bacterial ORF-finding algorithms
to identify the proteins in the genome, and a combination of
homology-based and subsystems-based approaches to decorate
those proteins with their functional annotation. RNA genes are
detected by a combination of extrinsic and intrinsic gene calling
methods. There remain several hurdles to accurate phage genome
annotation, especially the assignment of functions to unknown
proteins, the identification of small proteins in the genome, and
the correct and unambiguous identification of insertion elements
and transposons. The combination of bioinformatics advances and
a better understanding of phage biology will help to improve phage
genome annotation, making this field a fertile area for further
exploration.
Acknowledgments
This work was supported by grants from the National Science
Foundation MCB-1330800 and DUE-1323809 to RAE. BED
was supported by the Netherlands Organization for Scientific
Research (NWO) Vidi grant 864.14.004.
References
1. Aziz RK, Bartels D, Best AA, DeJongh M,
Disz T, Edwards RA, Formsma K, Gerdes S,
Glass EM, Kubal M, Meyer F, Olsen GJ,
Olson R, Osterman AL, Overbeek RA, McNeil
LK, Paarmann D, Paczian T, Parrello B, Pusch
GD, Reich C, Stevens R, Vassieva O,
Vonstein V, Wilke A, Zagnitko O (2008) The
RAST Server: rapid annotations using subsys-
tems technology. BMC Genomics 9:75
2. Brettin T, Davis JJ, Disz T, Edwards RA,
Gerdes S, Olsen GJ, Olson R, Overbeek R,
Parrello B, Pusch GD, Shukla M, Thomason
JA III, Stevens R, Vonstein V, Wattam AR, Xia F
(2015) RASTtk: A modular and extensible
implementation of the RAST algorithm for
building custom annotation pipelines and
annotating batches of genomes. Sci Rep
5:8365
3. Badger JH, Olsen GJ (1999) CRITICA: cod-
ing region identification tool invoking compar-
ative analysis. Mol Biol Evol 16:512–524
4. Borodovsky M, McIninch JD, Koonin EV,
Rudd KE, Médigue C, Danchin A (1995)
Detection of new genes in a bacterial genome
using Markov models for three gene classes.
Nucleic Acids Res 23:3554–3562
5. Lukashin AV, Borodovsky M (1998) Gene-
Mark.hmm: new solutions for gene finding.
Nucleic Acids Res 26:1107–1115
6. Krause L, McHardy AC, Pühler A, Stoye J,
Meyer F (2007) GISMO - Gene identification
using a support vector machine for ORF classi-
fication. Nucleic Acids Res 35:540–549
7. Delcher AL, Harmon D, Kasif S, White O,
Salzberg SL (1999) Improved microbial gene
identification with GLIMMER. Nucleic Acids
Res 27:4636–4641
8. Kelley DR, Liu B, Delcher AL, Pop M, Salz-
berg SL (2012) Gene prediction with Glimmer
for metagenomic sequences augmented by
classification and clustering. Nucleic Acids Res
40:e9–e9
9. Noguchi H, Taniguchi T, Itoh T (2008) Meta-
GeneAnnotator: Detecting species-specific pat-
terns of ribosomal binding site for precise gene
prediction in anonymous prokaryotic and
phage genomes. DNA Res 15:387–396
10. Hyatt D, Chen G-L, LoCascio PF, Land ML,
Larimer FW, Hauser LJ (2010) Prodigal: pro-
karyotic gene recognition and translation initi-
ation site identification. BMC Bioinformatics
11:119
11. Summer EJ, Berry J, Tran TAT, Niu L, Struck
DK, Young R (2007) Rz/Rz1 lysis gene
equivalents in phages of Gram-negative hosts.
J Mol Biol 373:1098–1112
12. Walker PJ, Firth C, Widen SG, Blasdell KR,
Guzman H, Wood TG, Paradkar PN, Holmes
EC, Tesh RB, Vasilakis N (2015) Evolution of
genome size and complexity in the Rhabdovir-
idae. PLoS Pathog 11:e1004664
13. Kristensen DM, Waller AS, Yamada T, Bork P,
Mushegian AR, Koonin EV (2013) Ortholo-
gous gene clusters and taxon signature genes
for viruses of prokaryotes. J Bacteriol
195:941–950
14. McNair K, Bailey BA, Edwards RA (2012)
PHACTS, a computational approach to classi-
fying the lifestyle of phages. Bioinformatics
28:614–618
15. Seguritan V, Alves N, Arnoult M, Raymond A,
Lorimer D, Burgin AB, Salamon P, Segall AM
(2012) Artificial neural networks trained to
detect viral and phage structural proteins.
PLoS Comput Biol 8:e1002657
16. Lowe TM, Eddy SR (1997) tRNAscan-SE: a
program for improved detection of transfer
RNA genes in genomic sequence. Nucleic
Acids Res 25:955–964
17. Nawrocki EP (2014) Annotating functional
RNAs in genomes using Infernal. Methods
Mol Biol 1097:163–197
18. Bailly-Bechet M, Vergassola M, Rocha E
(2007) Causes for the intriguing presence of
tRNAs in phages. Genome Res 17:1486–1495
19. Williams KP (2002) Integration sites for
genetic elements in prokaryotic tRNA and
tmRNA genes: sublocation preference of inte-
grase subfamilies. Nucleic Acids Res
30:866–875
20. Seed KD, Lazinski DW, Calderwood SB,
Camilli A (2013) A bacteriophage encodes its
own CRISPR/Cas adaptive response to evade
host innate immunity. Nature 494:489–491
21. Cassman N, Prieto-Davó A, Walsh K, Silva
GGZ, Angly F, Akhter S, Barott K, Busch J,
McDole T, Haggerty JM, Willner D,
Alarcón G, Ulloa O, DeLong EF, Dutilh BE,
Rohwer F, Dinsdale EA (2012) Oxygen mini-
mum zones harbour novel viral communities
with low diversity. Environ Microbiol
14:3043–3065
22. Aziz RK, Breitbart M, Edwards RA (2010)
Transposases are the most abundant, most
ubiquitous genes in nature. Nucleic Acids Res
38:4207–4217
23. Riadi G, Medina-Moenne C, Holmes DS
(2012) TnpPred: a web service for the robust
prediction of prokaryotic transposases. Comp
Funct Genomics 2012:678761
24. Benson G (1999) Tandem repeats finder: a
program to analyze DNA sequences. Nucleic
Acids Res 27:573–580
25. Volfovsky N, Haas BJ, Salzberg SL (2001) A
clustering method for repeat analysis in DNA
sequences. Genome Biol 2:RESEARCH0027
26. Kropinski AM, Prangishvili D, Lavigne R
(2009) Position paper: the creation of a ratio-
nal scheme for the nomenclature of viruses of
Bacteria and Archaea. Environ Microbiol
11:2775–2777
27. Edwards RA, McNair K, Faust K, Raes J,
Dutilh BE (2016) Computational approaches
to predict bacteriophage–host relationships.
FEMS Microbiol Rev 40:58–72
28. Aziz RK, Dwivedi B, Akhter S, Breitbart M,
Edwards RA (2015) Multidimensional metrics
for estimating phage abundance, distribution,
gene density, and sequence coverage in meta-
genomes. Front Microbiol 6:381
29. Akhter S, Aziz RK, Edwards RA (2012)
PhiSpy: a novel algorithm for finding pro-
phages in bacterial genomes that combines
similarity- and composition-based strategies.
Nucleic Acids Res 40:e126–e126
30. Akhter S, Bailey BA, Salamon P, Aziz RK,
Edwards RA (2013) Applying Shannon’s infor-
mation theory to bacterial and phage genomes
and metagenomes. Sci Rep 3:1033
238 Katelyn McNair et al.
... Besides the BLASTn analysis, nucleotide-based intergenomic similarity of phage V5 genome against other Vibrio phage genomes (Supplementary Table 1) was also determined by Virus Intergenomic Distance Calculator (VIRIDIC) at the genus and species-level thresholds of 70% and 95%, respectively [17]. Rapid Annotation using Subsystem Technology (RAST v2.0) pipeline was used for open reading frame (ORF) prediction in the genome assembly [18]. The predicted ORFs were functionally annotated based on NCBI conserved domain, BLASTp, HHpred protein homology detection and InterProScan domain search results [14]. ...
Article
Despite their evolutionary, molecular biology and biotechnological significance, relatively fewer numbers of single-stranded DNA (ssDNA) filamentous phages belonging to the family Inoviridae have been discovered and characterized to date. The present study focused on genome sequencing and characterization of an ssDNA Vibrio parahaemolyticus phage V5 previously isolated from an inland saline shrimp culture farm. The complete circular genome of phage V5 consisted of 6658 bp with GC content of 43.7%. During BLASTn analysis, only 36% of phage V5 genome matched with other Vibrio phage genomes in the NCBI database with a sequence identity value of 79%. During the phylogenetic analysis, phage V5 formed a separate branch in the minor clade. These features indicate the novel nature of the phage V5 genome. Among 10 predicted open reading frames (ORFs) in the phage V5 genome, 6 encoded for the proteins of known biological functions, whereas the rest were classified as hypotheticals. Proteins involved in replication and structural assembly were encoded by the phage genome. However, the absence of genes encoding for DNA/RNA polymerases and tRNAs signified that phage V5 is dependent on the host`s molecular machinery for its propagation. As per our knowledge, this is the first study describing the novel genome sequence of an ssDNA V. parahaemolyticus phage from the inland saline environment.
... For example, RASTtk allows the annotation of RNAs and repeat regions. Although it was conceived to annotate bacterial and archaeal genomes, RASTtk has been a handy tool for phage annotation in the past (86). Moreover, while VIGA relies only on Prodigal for gene calling, its strength is that it includes several modules or routines to comprehensively annotate a viral genome. ...
Article
Full-text available
Over a century of bacteriophage research has uncovered a plethora of fun- damental aspects of their biology, ecology, and evolution. Furthermore, the introduction of community-level studies through metagenomics has revealed unprecedented insights on the impact that phages have on a range of ecological and physiological processes. It was not until the introduction of viral metagenomics that we began to grasp the aston- ishing breadth of genetic diversity encompassed by phage genomes. Novel phage genomes have been reported from a diverse range of biomes at an increasing rate, which has prompted the development of computational tools that support the multile- vel characterization of these novel phages based solely on their genome sequences. The impact of these technologies has been so large that, together with MAGs (Metagenomic Assembled Genomes), we now have UViGs (Uncultivated Viral Genomes), which are now officially recognized by the International Committee for the Taxonomy of Viruses (ICTV), and new taxonomic groups can now be created based exclusively on genomic sequence information. Even though the available tools have immensely con- tributed to our knowledge of phage diversity and ecology, the ongoing surge in soft- ware programs makes it challenging to keep up with them and the purpose each one is designed for. Therefore, in this review, we describe a comprehensive set of currently available computational tools designed for the characterization of phage genome sequences, focusing on five specific analyses: (i) assembly and identification of phage and prophage sequences, (ii) phage genome annotation, (iii) phage taxonomic classifica- tion, (iv) phage-host interaction analysis, and (v) phage microdiversity.
... PHASTER (Ríos-Sandoval et al., 2020) and GeneMarkS (Salisbury and Tsourkas, 2019) were used as the initial tools for open reading frames (ORFs) prediction. All predicted ORFs were confirmed and protein functions were assigned using RAST server and Blast results from NCBI against non-reductant protein database (McNair et al., 2018). Molecular weight and isoelectric point was calculated by Expasy 2 while Pfam HMMER tool 3 and HH phred 4 were utilized for identifying the protein family based on its conserved domain matches. ...
Article
Full-text available
Salmonella gallinarum is a poultry restricted-pathogen causing fowl-typhoid disease in adult birds with mortality rates up-to 80% and exhibit resistance against commonly used antibiotics. In this current study, a temperate broad host range bacteriophage SGP-C was isolated against S. gallinarum from poultry digesta. It showed infection ability in all the 15 tested field strains of S. gallinarum. The SGP-C phage produced circular, turbid plaques with alternate rings. Its optimum activity was observed at pH 7.0 and 37–42°C, with a latent period of 45 min and burst size of 187 virions/bacterial cell. The SGP-C lysogens, SGPC-L5 and SGPC-L6 exhibited super-infection immunity against the same phage, an already reported feature of lysogens. A virulence index of 0.5 and 0.001 as MV50 of SGP-C suggests its moderate virulence. The genome of SGP-C found circular double stranded DNA of 42 Kbp with 50.04% GC content, which encodes 63 ORFs. The presence of repressor gene at ORF49, and absence of tRNA sequence in SGP-C genome indicates its lysogenic nature. Furthermore, from NGS analysis of lysogens we propose that SGP-C genome might exist either as an episome, or both as integrated and temporary episome in the host cell and warrants further studies. Phylogenetic analysis revealed its similarity with Salmonella temperate phages belonging to family Siphoviridae. The encoded proteins by SGP-C genome have not showed homology with any known toxin and virulence factor. Although plenty of lytic bacteriophages against this pathogen are already reported, to our knowledge SGP-C is the first lysogenic phage against S. gallinarum reported so far.
... The phages were annotated by both VIBRANT v1.2.1 (Kieft et al. 2020) and Rapid Annotation using Subsystems Technology (RAST) server (Aziz et al. 2008), after adjusting RAST settings to the optimized phage pipeline McNair et al. 2018). Genes encoding tRNAs were detected by tRNAscan-SE (Lowe and Eddy 1997). ...
Article
Full-text available
Cholera is a devastating diarrheal disease that accounts for more than 10% of children’s lives worldwide, but its treatment is hampered by a rise in antibiotic resistance. One promising alternative to antibiotic therapy is the use of bacteriophages to treat antibiotic-resistant cholera infections, and control Vibrio cholera in clinical cases and in the environment, respectively. Here, we report four novel, closely related environmental myoviruses, VP4, VP6, VP18, and VP24, which we isolated from two environmental toxigenic Vibrio cholerae strains from river Kuja and Usenge beach in Kenya. High-throughput sequencing followed by bioinformatics analysis indicated that the genomes of the four bacteriophages have closely related sequences, with sizes of 148,180 bp, 148,181 bp, 148,179 bp, and 148,179 bp, and a G + C content of 36.4%. The four genomes carry the phoH gene, which is overrepresented in marine cyanophages. The isolated phages displayed a lytic activity against 15 environmental, as well as one clinical, Vibrio cholerae strains. Thus, these novel lytic vibriophages represent potential biocontrol candidates for water decontamination against pathogenic Vibrio cholerae and ought to be considered for future studies of phage therapy.
... The phages were annotated by both VIBRANT v1.2.1 (Kieft et al. 2020) and Rapid Annotation using Subsystems Technology (RAST) server (Aziz et al. 2008), after adjusting RAST settings to the optimized phage pipeline McNair et al. 2018). Genes encoding tRNAs were detected by tRNAscan-SE (Lowe and Eddy 1997). ...
Preprint
Full-text available
Cholera is a devastating diarrheal disease that accounts for more than 10% of children’s lives worldwide, but its treatment is hampered by a rise in antibiotic resistance. One promising alternative to antibiotic therapy is the use of bacteriophages to treat antibiotic-resistant cholera infections, and control Vibrio cholera in clinical cases and in the environment, respectively. Here, we report four novel, closely related environmental myoviruses, VP4, VP6, VP18, and VP24, which we isolated from two environmental toxigenic Vibrio cholerae strains from river Kuja and Usenge beach in Kenya. High-throughput sequencing followed by bioinformatics analysis indicated that the genomes of the four bacteriophages have closely related sequences, with sizes of 148,180 bp, 148,181 bp, 148,179 bp, and 148,179 bp, and a G + C content of 36.4%. The four genomes carry the phoH gene, which is overrepresented in marine cyanophages. The isolated phages displayed a lytic activity against 15 environmental, as well as one clinical, Vibrio cholerae strains. Thus, these novel lytic vibriophages represent potential biocontrol candidates for water decontamination against pathogenic Vibrio cholerae and ought to be considered for future studies of phage therapy.
Article
Relatively few viruses infecting haloarchaea (haloviruses) have been reported. In this study, the genome sequence of VOLN27B, a recently described archaeal tailed virus (arTV) with a myovirus morphotype, is described, along with that of its host, Halorubrum sp. LN27. Halovirus VOLN27B contains a linear, dsDNA genome of 76,891 bp which is predicted to encode 109 proteins and four tRNAs (tRNAThr, tRNAArg, tRNAGly and tRNAAsn). The DNA G+C content of the VOLN27B genome is 56.1 mol%, nearly 10% lower than that of its host strain. A 315 bp LTR (long terminal repeat) was detected in the genome. The genome of its host strain LN27 was 3,301,211 bp (chromosome and 1 plasmid) with a DNA G+C content of 68.3 mol% and 3,142 annotated protein coding genes. At least two hypothetical proviruses were detected in the host genome, which lacked a CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) locus. Sequence similarity and phylogenetic tree reconstructions placed strain LN27 within the genus Halorubrum as a potential new species. VOLN27B exhibits codon usage frequencies distinct from those of its host strain Halorubrum sp. LN27. The organization of the VOLN27B genome shows remarkable synteny and amino acid sequence similarity to the genomes and predicted proteins of HF1-like haloviruses (genus Haloferacalesvirus) and a provirus in the genome of Halorubrum depositum Y78. VOLN27B and its host Halorubrum sp. LN27 comprise a new virus-host system from a hypersaline ecosystem and can be used to further understand the novel biology at extreme salt concentrations.
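The G+C comparison between a virus and its host reported above can be reproduced for any pair of sequences. The following is a minimal sketch, not the authors' method: the function name `gc_content` is hypothetical, and it assumes that ambiguous bases (such as N) are excluded from the denominator.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C content of a DNA sequence as a percentage.

    Assumption: only unambiguous bases (A, C, G, T) are counted, so
    ambiguity codes such as N do not dilute the result.
    """
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / unambiguous


if __name__ == "__main__":
    # Toy sequences only; real genomes would be read from FASTA files.
    virus = "ATGCGCATTA"
    host = "GCGCGGCCAT"
    print(f"virus G+C: {gc_content(virus):.1f}%")
    print(f"host  G+C: {gc_content(host):.1f}%")
```

The difference between the two percentages is in percentage points, which is how the 56.1 mol% and 68.3 mol% values above would be compared.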
Thesis
Excess use of antimicrobials and release into the environment for over half a century have generated a constant selective pressure for resistant bacterial strains. Consequently, we are facing a worldwide antibiotic resistance challenge, with increasing numbers of bacterial infections becoming difficult to treat once again. Avian pathogenic Escherichia coli (APEC) is one of the leading pathogens affecting poultry worldwide, and various multi-drug resistant strains have been isolated. APEC strains with O-serogroup O1, O2, or O78 have been shown to cause the majority of infections. The advances in the therapeutic use of the bacterial viruses, bacteriophages (phages), have highlighted their potential use as an alternative or supplement treatment against bacterial pathogens. However, we are only just beginning to understand the diversity of phages, and the use of phages in therapy requires a detailed characterisation of the candidate phages prior to their application to ensure that they have the expected potential to kill pathogenic bacteria and have therapeutic effects, while minimising negative environmental modifications. The understanding of phage-host interactions has been shown to be essential for the development and application of a successful phage therapy. This PhD dissertation contributes to our understanding of E. coli-infecting phage (coliphage)-host interaction for phage therapy against APEC by determining in vitro growth dynamics as well as the underlying mechanisms of phage resistance.
Article
This article is open access at https://www.liebertpub.com/doi/full/10.1089/phage.2021.0013
Preprint
Viruses are vastly abundant and influential in all ecosystems, and are generally regarded as pathogens. Viruses of prokaryotes (themselves highly diverse and abundant) are known as bacteriophages or phages. Phages engage in diverse associations with their hosts, and contribute to regulation of biogeochemical processes, horizontal movement of genes, and control of bacterial populations. Recent studies have revealed the influential role of phage in the association of arthropods and their heritable endosymbiotic bacteria (e.g. the Proteobacteria genera Wolbachia and Hamiltonella). Despite prior studies (~30 years ago) documenting presence of phage in the mollicute Spiroplasma infecting Drosophila, genomic sequences of such phage are lacking, and their effects on the Spiroplasma-Drosophila interaction have not been comprehensively characterized. The present work isolated phage-like particles from the male-killing Spiroplasma poulsonii (strains NSRO and MSRO-Br) harbored by Drosophila melanogaster. Isolated particles were subjected to DNA sequencing, assembly, and annotation. Our results recovered three ~19 kb phage-like contigs (two in NSRO and one in MSRO-Br), and two smaller non-phage-like contigs encoding a known Spiroplasma toxin and an insertion element. Whole or parts of the particle-derived contigs were found in the genome assemblies of members of the Spiroplasma poulsonii clade. Although our results do not allow us to distinguish whether the contigs obtained represent infective phage-like particles capable of transmitting their DNA to new hosts, their encoding of several typical phage genes suggests that they are at least remnants of functional phage. We discuss potential implications of our findings and suggest future directions.
Chapter
Whole-genome sequencing (WGS) has shown immense value in enabling identification and characterization of bacterial taxa. This is particularly true for mycobacteria, where culture-based characterization becomes delayed by the inherently slow growth rate of these organisms. This chapter reviews the general techniques behind WGS and their optimization, existing techniques for species-level identification and the advantages of WGS for this purpose, and a variety of useful tools for the genomic characterization of mycobacterial strains.
Article
Metagenomics has changed the face of virus discovery by enabling the accurate identification of viral genome sequences without requiring isolation of the viruses. As a result, metagenomic virus discovery leaves the first and most fundamental question about any novel virus unanswered: What host does the virus infect? The diversity of the global virosphere and the volumes of data obtained in metagenomic sequencing projects demand computational tools for virus-host prediction. We focus on bacteriophages (phages, viruses that infect bacteria), the most abundant and diverse group of viruses found in environmental metagenomes. By analyzing 820 phages with annotated hosts, we review and assess the predictive power of in silico phage-host signals. Sequence homology approaches are the most effective at identifying known phage-host pairs. Compositional and abundance-based methods contain significant signal for phage-host classification, providing opportunities for analyzing the unknowns in viral metagenomes. Together, these computational approaches further our knowledge of the interactions between phages and their hosts. Importantly, we find that all reviewed signals significantly link phages to their hosts, illustrating how current knowledge and insights about the interaction mechanisms and ecology of coevolving phages and bacteria can be exploited to predict phage-host relationships, with potential relevance for medical and industrial applications.
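As an illustration of what a compositional phage-host signal can look like, the sketch below compares normalized k-mer frequency profiles of two sequences. It is an assumption-laden example rather than the method evaluated in the study: the function names, the choice of k = 4, and the use of Euclidean distance are all hypothetical choices made here for illustration.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(sequence: str, k: int = 4) -> list[float]:
    """Return normalized frequencies of all 4**k DNA k-mers.

    Overlapping k-mers are counted; k-mers containing characters other
    than A, C, G, T are ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def profile_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two k-mer profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Toy sequences; a smaller distance suggests more similar composition.
    phage = "ATGCGATACGCTTGAGCGATATCGGCTA"
    candidate_host = "ATGCGATTCGCTTGAGCGTTATCGGCAA"
    d = profile_distance(kmer_profile(phage), kmer_profile(candidate_host))
    print(f"tetranucleotide profile distance: {d:.4f}")
```

In practice a phage profile would be compared against many candidate host genomes and the closest profile taken as a prediction, subject to the accuracy limits the abstract describes.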
Article
Phages are the most abundant biological entities on Earth and play major ecological roles, yet the current sequenced phage genomes do not adequately represent their diversity, and little is known about the abundance and distribution of these sequenced genomes in nature. Although the study of phage ecology has benefited tremendously from the emergence of metagenomic sequencing, a systematic survey of phage genes and genomes in various ecosystems is still lacking, and fundamental questions about phage biology, lifestyle, and ecology remain unanswered. To address these questions and improve comparative analysis of phages in different metagenomes, we screened a core set of publicly available metagenomic samples for sequences related to completely sequenced phages using the web tool, Phage Eco-Locator. We then adopted and deployed an array of mathematical and statistical metrics for a multidimensional estimation of the abundance and distribution of phage genes and genomes in various ecosystems. Experiments using those metrics individually showed their usefulness in emphasizing the pervasive, yet uneven, distribution of known phage sequences in environmental metagenomes. Using these metrics in combination allowed us to resolve phage genomes into clusters that correlated with their genotypes and taxonomic classes as well as their ecological properties. We propose adding this set of metrics to current metaviromic analysis pipelines, where they can provide insight regarding phage mosaicism, habitat specificity, and evolution.
Article
RNA viruses exhibit substantial structural, ecological and genomic diversity. However, genome size in RNA viruses is likely limited by a high mutation rate, resulting in the evolution of various mechanisms to increase complexity while minimising genome expansion. Here we conduct a large-scale analysis of the genome sequences of 99 animal rhabdoviruses, including 45 genomes which we determined de novo, to identify patterns of genome expansion and the evolution of genome complexity. All but seven of the rhabdoviruses clustered into 17 well-supported monophyletic groups, of which eight corresponded to established genera, seven were assigned as new genera, and two were taxonomically ambiguous. We show that the acquisition and loss of new genes appears to have been a central theme of rhabdovirus evolution, and has been associated with the appearance of alternative, overlapping and consecutive ORFs within the major structural protein genes, and the insertion and loss of additional ORFs in each gene junction in a clade-specific manner. Changes in the lengths of gene junctions accounted for as much as 48.5% of the variation in genome size from the smallest to the largest genome, and the frequency with which new ORFs were observed increased in the 3' to 5' direction along the genome. We also identify several new families of accessory genes encoded in these regions, and show that non-canonical expression strategies involving TURBS-like termination-reinitiation, ribosomal frame-shifts and leaky ribosomal scanning appear to be common. We conclude that rhabdoviruses have an unusual capacity for genomic plasticity that may be linked to their discontinuous transcription strategy from the negative-sense single-stranded RNA genome, and propose a model that accounts for the regular occurrence of genome expansion and contraction throughout the evolution of the Rhabdoviridae.
Article
The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.
Article
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Such measurement is valuable in evaluating large data sets, such as metagenomic libraries, to prioritize their analysis and annotation, thus saving computational resources. Here, Shannon's index of complete phage and bacterial genomes was examined. The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be used for rapid screening for sequences with matches in available database, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis.
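The dependence of Shannon's index on sequence word size can be seen in a short calculation. This is a minimal sketch under stated assumptions: the abstract does not specify whether words overlap or which logarithm base is used, so overlapping words and log base 2 (bits) are assumed here, and the function name `shannon_index` is hypothetical.

```python
import math
from collections import Counter


def shannon_index(sequence: str, word_size: int = 1) -> float:
    """Shannon uncertainty, in bits per word, of a sequence.

    Assumptions of this sketch: words are overlapping substrings of
    length word_size, and the logarithm is base 2.
    """
    seq = sequence.upper()
    words = [seq[i:i + word_size] for i in range(len(seq) - word_size + 1)]
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    toy = "ATGCGATACGCTTGAGCGATATCGGCTA"
    for k in (1, 2, 3):
        print(f"word size {k}: {shannon_index(toy, k):.3f} bits")
```

For word size 1 the value is bounded by 2 bits (four equally frequent bases), which is one way the index becomes sensitive to GC content.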
Article
Transposases (Tnps) are enzymes that participate in the movement of insertion sequences (ISs) within and between genomes. Genes that encode Tnps are amongst the most abundant and widely distributed genes in nature. However, they are difficult to predict bioinformatically and given the increasing availability of prokaryotic genomes and metagenomes, it is incumbent to develop rapid, high quality automatic annotation of ISs. This need prompted us to develop a web service, termed TnpPred for Tnp discovery. It provides better sensitivity and specificity for Tnp predictions than given by currently available programs as determined by ROC analysis. TnpPred should be useful for improving genome annotation. The TnpPred web service is freely available for noncommercial use.
Article
Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here we present an update of the Phage Orthologous Groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded dataset shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside of known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly if at all covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside of detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
Article
We describe a program, tRNAscan-SE, which identifies 99-100% of transfer RNA genes in DNA sequence while giving less than one false positive per 15 gigabases. Two previously described tRNA detection programs are used as fast, first-pass prefilters to identify candidate tRNAs, which are then analyzed by a highly selective tRNA covariance model. This work represents a practical application of RNA covariance models, which are general, probabilistic secondary structure profiles based on stochastic context-free grammars. tRNAscan-SE searches at approximately 30 000 bp/s. Additional extensions to tRNAscan-SE detect unusual tRNA homologues such as selenocysteine tRNAs, tRNA-derived repetitive elements and tRNA pseudogenes.
Article
Many different types of functional non-coding RNAs participate in a wide range of important cellular functions but the large majority of these RNAs are not routinely annotated in published genomes. Several programs have been developed for identifying RNAs, including specific tools tailored to a particular RNA family as well as more general ones designed to work for any family. Many of these tools utilize covariance models (CMs), statistical models of the conserved sequence, and structure of an RNA family. In this chapter, as an illustrative example, the Infernal software package and CMs from the Rfam database are used to identify RNAs in the genome of the archaeon Methanobrevibacter ruminantium, uncovering some additional RNAs not present in the genome's initial annotation. Analysis of the results and comparison with family-specific methods demonstrate some important strengths and weaknesses of this general approach.
Article
Bacteriophages (or phages) are the most abundant biological entities on earth, and are estimated to outnumber their bacterial prey by tenfold. The constant threat of phage predation has led to the evolution of a broad range of bacterial immunity mechanisms that in turn result in the evolution of diverse phage immune evasion strategies, leading to a dynamic co-evolutionary arms race. Although bacterial innate immune mechanisms against phage abound, the only documented bacterial adaptive immune system is the CRISPR/Cas (clustered regularly interspaced short palindromic repeats/CRISPR-associated proteins) system, which provides sequence-specific protection from invading nucleic acids, including phage. Here we show a remarkable turn of events, in which a phage-encoded CRISPR/Cas system is used to counteract a phage inhibitory chromosomal island of the bacterial host. A successful lytic infection by the phage is dependent on sequence identity between CRISPR spacers and the target chromosomal island. In the absence of such targeting, the phage-encoded CRISPR/Cas system can acquire new spacers to evolve rapidly and ensure effective targeting of the chromosomal island to restore phage replication.