Fig 2 - available via license: Creative Commons Attribution 4.0 International
Spearman correlation between GAMBIT distance and ANI for different GAMBIT parameter values. Each subplot in both A and B represents a different choice of prefix sequence. Subplots show the absolute value of the Spearman correlation ρ vs. value of the k parameter for all pairwise comparisons within genome sets 1-4 (Table 1). Standard values of the prefix and k parameters used throughout the rest of the manuscript are highlighted by a blue subplot border and blue vertical line, respectively. A. Variations of our standard prefix ATGAC with length between 4 and 7 nucleotides. B. Standard prefix sequence plus 7 random sequences of the same length. See S1 Fig for the full set of variations in prefix sequence and length. https://doi.org/10.1371/journal.pone.0277575.g002
Source publication
Whole genome sequencing (WGS) of clinical bacterial isolates has the potential to transform the fields of diagnostics and public health. To realize this potential, bioinformatic software that reports identification results needs to be developed that meets the quality standards of a diagnostic test. We developed GAMBIT (Genomic Approximation Method...
Contexts in source publication
Context 2
... by calculating all pairwise GAMBIT distances for each of our test genome sets and determining the Spearman correlation with ANI. In Fig 2 we show results for a limited set of prefix values along with all values of k. Fig 2A displays variations of the default prefix to different lengths, and Fig 2B compares the default prefix to random sequences of the same length. The full combined set of prefix variations is shown in S1 Fig. ...
Context 4
... panels in these figures use the same axes and are comparable. The set of parameters used in the final version of GAMBIT is shown in Fig 2A in the plot with the blue highlighted border. Beginning with this plot, we would expect to see the highest Spearman correlation in the upper right corner of the plot, which corresponds to the longest k-mers (the longer the k-mers, the more genomic information is retrieved). ...
Context 5
... all four panels in Fig 2A, we observe the effect of changing the length of the prefix. We predict that as prefix length increases, correlation should decrease because less of the genome is being sampled. ...
Context 6
... we observe the effect of the actual prefix sequence. We compare our prefix sequence (at length 5) to seven other randomly generated prefix sequences (Fig 2B). We compared Spearman correlation between ANI and GAMBIT distance for each sequence over k-mer lengths ranging from 7 to 17. ...
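The correlation analysis described in these contexts can be illustrated with a minimal sketch. It computes the Spearman correlation as the Pearson correlation of average ranks and then takes the absolute value, as plotted in Fig 2. The ANI and GAMBIT distance values below are hypothetical and chosen only for illustration; the function names are not taken from the GAMBIT code base.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five genome pairs (not data from the paper).
ani = [99.5, 98.0, 95.2, 90.1, 85.0]
gambit_distance = [0.02, 0.10, 0.35, 0.70, 0.92]

rho = spearman(ani, gambit_distance)
abs_rho = abs(rho)  # Fig 2 plots the absolute value of the correlation
print(rho, abs_rho)
```

Because distance rises as ANI falls, the raw correlation is negative, which is presumably why the figure reports its absolute value.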
Citations
... Following genome assembly, the assembly FASTA files are passed to the Genomic Approximation Method for Bacterial Identification and Tracking (GAMBIT) tool for taxonomic identification (39). GAMBIT infers taxonomy by querying a sample genome against a database of genomes with known taxonomic information and identifying the most similar genome to the query. ...
... In order to infer taxonomic assignments from fungal genomic data, we created a novel fungal GAMBIT database using a similar process as the prokaryotic GAMBIT database (39). The process of creating a GAMBIT database requires the calculation of compressed representations of each genome that will be included in the database, or GAMBIT signatures, which enable the calculation of GAMBIT distances between genomes. ...
... GAMBIT was designed for microbial taxonomic identification by querying genome assemblies against a database and assigning taxonomy based on curated diagnostic thresholds (39). The initial GAMBIT database contained only prokaryotic genomes, but nothing precluded the extension of GAMBIT to eukaryotic microbes. ...
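The steps these excerpts describe (signatures, distances, and a thresholded nearest match) can be sketched as follows. This is a simplified illustration under stated assumptions, not GAMBIT's implementation: it assumes a signature is the set of k-mers that immediately follow each occurrence of the prefix, that the distance is the Jaccard distance between two such sets, and that each taxon has a single maximum-distance threshold. It scans only the forward strand. The sequences, reference names, taxa, thresholds, and k = 4 are all hypothetical.

```python
def kmer_signature(sequence, prefix, k):
    """Collect the k-mers that immediately follow each occurrence of prefix."""
    seq = sequence.upper()
    signature = set()
    start = seq.find(prefix)
    while start != -1:
        begin = start + len(prefix)
        kmer = seq[begin:begin + k]
        # Keep only full-length k-mers made of unambiguous bases.
        if len(kmer) == k and set(kmer) <= set("ACGT"):
            signature.add(kmer)
        start = seq.find(prefix, start + 1)
    return signature


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def closest_reference(query_sig, references):
    """Return (name, taxon, distance) of the reference nearest to the query."""
    name, (taxon, sig) = min(
        references.items(),
        key=lambda item: jaccard_distance(query_sig, item[1][1]),
    )
    return name, taxon, jaccard_distance(query_sig, sig)


def assign_taxon(query_sig, references, thresholds):
    """Report the nearest taxon only if it lies within that taxon's threshold."""
    _, taxon, dist = closest_reference(query_sig, references)
    return taxon if dist <= thresholds[taxon] else None


# Hypothetical toy data; real genomes and k values are far larger.
PREFIX, K = "ATGAC", 4
query = kmer_signature("ATGACAAAATTATGACCCCCGG", PREFIX, K)
references = {
    "ref_1": ("Species A", kmer_signature("GGATGACAAAATTATGACGGGGTT", PREFIX, K)),
    "ref_2": ("Species B", kmer_signature("TTATGACTTTTGGATGACGCGCAA", PREFIX, K)),
}
thresholds = {"Species A": 0.7, "Species B": 0.7}

print(closest_reference(query, references))
print(assign_taxon(query, references, thresholds))
```

Returning no call when the nearest reference falls outside its threshold is one way a tool could avoid false calls in favor of no call; whether GAMBIT behaves exactly this way is not stated in the excerpts above.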
Introduction
The clinical incidence of antimicrobial-resistant fungal infections has dramatically increased in recent years. Certain fungal pathogens colonize various body cavities, leading to life-threatening bloodstream infections. However, the identification and characterization of fungal isolates in laboratories remain a significant diagnostic challenge in medicine and public health. Whole-genome sequencing provides an unbiased and uniform identification pipeline for fungal pathogens, but most bioinformatic analysis pipelines focus on prokaryotic species. To this end, TheiaEuk_Illumina_PE_PHB (TheiaEuk) was designed to focus on genomic analysis specialized for fungal pathogens.
Methods
TheiaEuk was designed using containerized components and written in the Workflow Description Language (WDL) to facilitate deployment on the cloud-based open bioinformatics platform Terra. This species-agnostic workflow enables the analysis of fungal genomes without requiring coding, thereby reducing the entry barrier for laboratory scientists. To demonstrate the usefulness of this pipeline, an ongoing outbreak of Candida auris (C. auris) in southern Nevada was investigated. We performed whole-genome sequence analysis of 752 new C. auris isolates from this outbreak. Furthermore, TheiaEuk was utilized to observe the accumulation of mutations in the FKS1 gene over the course of the outbreak, highlighting the utility of TheiaEuk as a monitor of emerging public health threats when combined with whole-genome sequencing surveillance of fungal pathogens.
Results
A primary result of this work is a curated fungal database containing 5,667 unique genomes representing 245 species. TheiaEuk also incorporates taxon-specific submodules for specific species, including clade-typing for Candida auris (C. auris). In addition, for several fungal species, it performs dynamic reference genome selection and variant calling, reporting mutations found in genes currently associated with antifungal resistance (FKS1, ERG11, FUR1). Using genome assemblies from the ATCC Mycology collection, the taxonomic identification module used by TheiaEuk correctly assigned genomes to the species level in 126/135 (93.3%) of instances and to the genus level in 131/135 (97.0%) of instances, and provided zero false calls. Application of TheiaEuk to actual specimens obtained in the course of work at a local public health laboratory resulted in 13/15 (86.7%) correct calls at the species level, with 2/15 called at the genus level. It made zero incorrect calls. TheiaEuk accurately assessed the clade type of Candida auris in 297/302 (98.3%) of instances.
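As a quick arithmetic check, the proportions reported above can be recomputed from the stated counts. The sketch below uses only the numerators and denominators given in the text; the dictionary labels are shorthand introduced here for illustration.

```python
def percent(correct, total):
    """Percentage of correct calls, rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Counts as stated in the Results paragraph above.
reported = {
    "ATCC species-level": (126, 135),
    "ATCC genus-level": (131, 135),
    "PHL species-level": (13, 15),
    "C. auris clade typing": (297, 302),
}

for label, (correct, total) in reported.items():
    print(f"{label}: {correct}/{total} = {percent(correct, total)}%")
```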
Discussion
TheiaEuk demonstrated effectiveness in identifying fungal species from whole-genome sequence data. It further showed accuracy both in clade-typing of C. auris and in identifying mutations known to be associated with drug resistance in that organism.
... Once the assembly has been generated an assembly quality assessment is performed using QUAST. Using the assembly, species taxon identification is performed by GAMBIT (20). The GAMBIT implementation in TheiaEuk_PE uses a custom fungal database containing 5,667 genomes and 245 species. ...
A Candida auris outbreak has been ongoing in Southern Nevada since August 2021. In this manuscript we describe the sequencing of over 200 C. auris isolates from patients at several facilities. Genetically distinct subgroups of C. auris were detected from Clade I (3 distinct lineages) and III (1 lineage). Open-source bioinformatic tools were developed and implemented to aid in the epidemiological investigation. The work herein compares three methods for C. auris whole genome analysis: Nullarbor, MycoSNP and a new pipeline TheiaEuk. We also describe a novel analysis method focused on elucidating phylogenetic linkages between isolates within an ongoing outbreak. Moreover, this study places the ongoing outbreaks in a global context utilizing existing sequences provided worldwide. Lastly, we describe how the generated results were communicated to the epidemiologists and infection control to generate public health interventions.
We have adopted an open bioinformatics ecosystem to address the challenges of bioinformatics implementation in public health laboratories (PHLs). Bioinformatics implementation for public health requires practitioners to undertake standardized bioinformatic analyses and generate reproducible, validated and auditable results. It is essential that data storage and analysis are scalable, portable and secure, and that implementation of bioinformatics fits within the operational constraints of the laboratory. We address these requirements using Terra, a web-based data analysis platform with a graphical user interface connecting users to bioinformatics analyses without the use of code. We have developed bioinformatics workflows for use with Terra that specifically meet the needs of public health practitioners. These Theiagen workflows perform genome assembly, quality control, and characterization, as well as construction of phylogeny for insights into genomic epidemiology. Additionally, these workflows use open-source containerized software and the WDL workflow language to ensure standardization and interoperability with other bioinformatics solutions, whilst being adaptable by the user. They are all open source and publicly available in Dockstore with the version-controlled code available in public GitHub repositories. They have been written to generate outputs in standardized file formats to allow for further downstream analysis and visualization with separate genomic epidemiology software. As a testament to this solution meeting the requirements for bioinformatic implementation in public health, Theiagen workflows have collectively been used for over 5 million sample analyses in the last 2 years by over 90 public health laboratories in at least 40 different countries. Continued adoption of technological innovations and development of further workflows will ensure that this ecosystem continues to benefit PHLs.
Background: Antimicrobial resistant infections continue to be a leading global public health crisis. Mobile genetic elements, such as plasmids, have been shown to play a major role in the dissemination of antimicrobial resistance genes. Despite its ongoing threat to human health, surveillance in the United States is often limited to phenotypic resistance. Genomic analyses are important to better understand the underlying resistance mechanisms, assess risk, and implement appropriate prevention methods. This study aimed to investigate the extent of plasmid mediated antimicrobial resistance that can be inferred from short read sequences of carbapenem resistant E. coli (CR-Ec) in Alameda County, California. E. coli isolates from healthcare locations in Alameda County were sequenced using an Illumina MiSeq and assembled with Unicycler. Genomes were categorized according to predefined multilocus sequence typing (MLST) and core genome multilocus sequence typing (cgMLST) schemes. Resistance genes were identified and corresponding contigs were predicted to be plasmid-borne or chromosome-borne using two bioinformatic tools (MOB-suite and mlplasmids).
Results: Among 82 CR-Ec isolates identified between 2017 and 2019, twenty-five sequence types (STs) were detected. ST131 was the most prominent (n=17), followed by ST405 (n=12). blaCTX-M genes were the most common ESBL genes, and just over half (18/30) of these genes were predicted to be plasmid-borne by both MOB-suite and mlplasmids. Three genetically related groups of E. coli isolates were identified with cgMLST. One of the groups contained an isolate with a chromosome-borne blaCTX-M-15 gene and an isolate with a plasmid-borne blaCTX-M-15 gene.
Conclusions: This study provides insights into the dominant clonal groups driving carbapenem resistant E. coli infections in Alameda County, CA, USA clinical sites and highlights the relevance of whole-genome sequencing in routine local genomic surveillance. The finding of multi-drug resistant plasmids harboring high-risk resistance genes is of concern as it indicates a risk of dissemination to previously susceptible clonal groups in the community, potentially complicating clinical and public health intervention.