Fig 2 - available from: Nature Communications
mCaller workflow and classification of E. coli sites in R9.4 data. a The pipeline for classification of adenines as methylated or unmethylated. b Probabilities of methylation defined by a neural network classifier for methylated compared to unmethylated positions in E. coli, with a model trained on one dataset and tested on a second. c ROC curves for the neural network model using R9.4 data, showing true positive rate (methylated positions correctly identified) as a function of false positive rate (unmethylated positions called as methylated) with varying probability thresholds for classification. We tested modifications to the standard model ("NN") using only high quality reads (average base quality > 9, "NN_hq") and classifying observations that included a maximum of two skips ("NN_sk"). A curve was calculated for genomic positions with ≥15× coverage by varying the fraction of reads with probability of methylation scores ≥0.5 required to define a position as methylated ("NN_pos"). Boxplot center lines show medians and whiskers show 1.5× interquartile range.
Source publication
The DNA base modification N6-methyladenine (m⁶A) is involved in many pathways related to the survival of bacteria and their interactions with hosts. Nanopore sequencing offers a new, portable method to detect base modifications. Here, we show that a neural network can improve m⁶A detection at trained sequence contexts compared to previously publish...
Contexts in source publication
Context 1
... we hypothesized that patterns in current shifts would be shared enough to predict m⁶A; overall for R9 E. coli data, methylation at the 4th and 5th positions of a 6-mer in particular tended to increase the current with respect to the model values (Figure 1c). We thus used current deviations as features to train four binary classifiers (Fig. 2a), including neural network, random forest, naïve Bayes, and logistic ...
Context 2
... neural network classifier produced the highest accuracy, although a random forest model performed comparably (Supplementary Figure 2). Tested on a second data set from the same E. coli strain produced in a second lab, the model achieved 81.3% accuracy (compared to 80.8% for the random forest model) using all quality levels of reads and comparing methylated positions to a random selection of unmethylated sites in the same genome (Fig. 2b, Supplementary Table 1). ...
Context 3
... on a second data set from the same E. coli strain produced in a second lab, the model achieved 81.3% accuracy (compared to 80.8% for the random forest model) using all quality levels of reads and comparing methylated positions to a random selection of unmethylated sites in the same genome (Fig. 2b, Supplementary Table 1). The Spearman correlation between the probability estimates from the top two predictors, neural network and random forest, was high, at 0.93 (Supplementary Figure 2D). A receiver operator characteristic curve showed the changes in accuracy at varying thresholds for classification (Fig. 2c). ...
Context 4
... sites in the same genome (Fig. 2b, Supplementary Table 1). The Spearman correlation between the probability estimates from the top two predictors, neural network and random forest, was high, at 0.93 (Supplementary Figure 2D). A receiver operator characteristic curve showed the changes in accuracy at varying thresholds for classification (Fig. 2c). Accuracy improved to 84.2% for higher quality reads (mean quality > 9) and decreased to 77.8% with a maximum of two skips per prediction, or 6-mers for which the sequencer missed recording a current value. When summarizing predictions at single sites with a minimum of 15× coverage, the classifier achieved 95.4% accuracy and an area ...
Context 5
... summarizing predictions at single sites with a minimum of 15× coverage, the classifier achieved 95.4% accuracy and an area under the curve (AUC) of 0.99, with comparison to true negatives drawn from unmethylated positions, although these estimated accuracies did not account for bias towards specific sequence contexts, as discussed below. We then tested the hypothesis that methylation would affect a similar range of surrounding current levels as the canonical bases in the ONT models (six) and found that using four or eight 6-mers surrounding a base reduced classification accuracy (Supplementary Figure 2A and C). ...
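The site-level summary described in the caption and contexts above (per-read probabilities thresholded at 0.5, positions kept only at ≥15× coverage, and a ROC curve traced by varying the required fraction of methylated reads) can be illustrated with a minimal sketch. This is not the mCaller implementation: the function names, the input layout of (position, probability) pairs, and the trapezoidal AUC are assumptions made for illustration only.

```python
from collections import defaultdict


def position_fractions(read_calls, min_coverage=15, read_threshold=0.5):
    """Summarize per-read calls into per-position scores (hypothetical helper).

    read_calls: iterable of (position, probability) pairs, one per read observation.
    Returns {position: fraction of reads with probability >= read_threshold},
    keeping only positions covered by at least min_coverage observations.
    """
    by_pos = defaultdict(list)
    for pos, prob in read_calls:
        by_pos[pos].append(prob)
    return {
        pos: sum(p >= read_threshold for p in probs) / len(probs)
        for pos, probs in by_pos.items()
        if len(probs) >= min_coverage
    }


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for every threshold.

    labels: 1 for truly methylated positions, 0 for unmethylated ones.
    Assumes both classes are present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

In use, the scores passed to `roc_points` would be the values returned by `position_fractions`, paired with the known methylation status of each position; positions below the coverage cutoff are simply absent from the summary.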
Citations
... It is known that modified nucleotides affect the signal levels when passing through nanopore [10]. For known modification patterns, such as CpG methylation, it is possible to prepare artificially methylated sequences and either estimate a signal model for contexts containing methylated nucleotides [11,12,13] or to train a machine learning algorithm, such as a neural network, to recognize signals from sequences that contain methylated nucleotides [14,15,16]. We call this approach supervised methylation detection, since substantial amounts of labeled training data are required to build such models. ...
Base calling in nanopore sequencing is a difficult and computationally intensive problem, typically resulting in high error rates. In many applications of nanopore sequencing, analysis of raw signal is a viable alternative. Dynamic time warping (DTW) is an important building block for raw signal analysis. In this paper, we propose several improvements to DTW class of algorithms to better account for specifics of nanopore signal modeling. We have implemented these improvements in a new signal-to-reference alignment tool Nadavca. We demonstrate that Nadavca alignments improve unsupervised methylation detection over Tombo. We also demonstrate that by providing additional information about the discriminative power of positions in the signal, an otherwise unsupervised method can approach the accuracy of supervised models.
Availability and implementation
Nadavca is available under MIT license at https://github.com/fmfi-compbio/nadavca. Nanopore sequencing data sets are available from ENA bioproject PRJEB64246. Jaminaea angkorensis reference genome assembly is available from Zenodo https://doi.org/10.5281/zenodo.8145315.
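For readers unfamiliar with dynamic time warping, the building block named in the abstract above, the following is a minimal sketch of the textbook algorithm. It is not Nadavca's method, which the abstract describes as an improved variant; the function name and the absolute-difference cost are assumptions chosen for illustration.

```python
def dtw_cost(signal, reference):
    """Classic dynamic time warping cost between two numeric sequences.

    Each signal sample is matched to at least one reference value and vice
    versa, preserving order; the cost of a match is the absolute difference.
    """
    inf = float("inf")
    n, m = len(signal), len(reference)
    # cost[i][j] = best cost of aligning signal[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(signal[i - 1] - reference[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # another signal sample on the same reference value
                cost[i][j - 1],      # same signal sample stretched over the next reference value
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]
```

A repeated sample costs nothing under this scheme, which is why warping suits signals where one reference level may span a variable number of measurements.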
... Unlike SMRT sequencing whose sequencing speed is determined by polymerization, ONT devices record a measurement of current at a predefined sampling rate and then aggregate the measurements into strides, which are the smallest length of measurement accepted by the basecaller and represent a single base translocation. Recent work on detecting DNA methylation from the charge in ONT reads provides evidence that DNA modifications can affect ONT TTs (Stoiber et al. 2017;Liu et al. 2019;McIntyre et al. 2019;Ni et al. 2019). ...
Motivation:
Non-canonical (or non-B) DNA are genomic regions whose three-dimensional conformation deviates from the canonical double helix. Non-B DNA play an important role in basic cellular processes and are associated with genomic instability, gene regulation, and oncogenesis. Experimental methods are low-throughput and can detect only a limited set of non-B DNA structures, while computational methods rely on non-B DNA base motifs, which are necessary but not sufficient indicators of non-B structures. Oxford Nanopore sequencing is an efficient and low-cost platform, but it is currently unknown whether nanopore reads can be used for identifying non-B structures.
Results:
We build the first computational pipeline to predict non-B DNA structures from nanopore sequencing. We formalize non-B detection as a novelty detection problem and develop the GoFAE-DND, an autoencoder that uses goodness-of-fit (GoF) tests as a regularizer. A discriminative loss encourages non-B DNA to be poorly reconstructed and optimizing Gaussian GoF tests allows for the computation of P-values that indicate non-B structures. Based on whole genome nanopore sequencing of NA12878, we show that there exist significant differences between the timing of DNA translocation for non-B DNA bases compared with B-DNA. We demonstrate the efficacy of our approach through comparisons with novelty detection methods using experimental data and data synthesized from a new translocation time simulator. Experimental validations suggest that reliable detection of non-B DNA from nanopore sequencing is achievable.
Availability and implementation:
Source code is available at https://github.com/bayesomicslab/ONT-nonb-GoFAE-DND.
... Since the nanopore technique can correlate the values of the electric current intensity with the properties of the molecule, the identification of methylated genes is performed on the same principle, observing differences between unmodified and modified bases [95]. It is not enough just to identify differences in current amplitude, for a satisfactory yield, it is necessary to directly call the modified bases, and for these different algorithms are used [50], such as the Markov model [96] and neural networks [97,98]. Currently, there are multiple software tools developed by researchers for this purpose, but most of them do not fully satisfy the requirements as the methylation levels are not linearly distributed [99]. ...
Modern biomedical sensing techniques have significantly increased in precision and accuracy due to new technologies that enable speed and that can be tailored to be highly specific for markers of a particular disease. Diagnosing early-stage conditions is paramount to treating serious diseases. Usually, in the early stages of the disease, the number of specific biomarkers is very low and sometimes difficult to detect using classical diagnostic methods. Among detection methods, biosensors are currently attracting significant interest in medicine, for advantages such as easy operation, speed, and portability, with additional benefits of low costs and repeated reliable results. Single-molecule sensors such as nanopores that can detect biomolecules at low concentrations have the potential to become clinically relevant. As such, several applications have been introduced in this field for the detection of blood markers, nucleic acids, or proteins. The use of nanopores has yet to reach maturity for standardization as diagnostic techniques, however, they promise enormous potential, as progress is made into stabilizing nanopore structures, enhancing chemistries, and improving data collection and bioinformatic analysis. This review offers a new perspective on current biomolecule sensing techniques, based on various types of nanopores, challenges, and approaches toward implementation in clinical settings.
... In plant genomes, conservation of 6mA enrichment around the transcription start site is correlated with active gene expression, such as in C. reinhardtii, Arabidopsis, and rice [6,12,18]. The development of methodologies for detecting modified bases from ONT (Oxford Nanopore Technologies) data offers an avenue for 6mA identification [19][20][21][22]. Moreover, similar to that of 5mC, the 6mA methylation level can vary in different tissues and respond to multiple abiotic stresses, as observed in Arabidopsis and rice [12,18]. ...
N6-methyladenine (6mA) DNA methylation has emerged as an important epigenetic modification in eukaryotes. Nevertheless, the evolution of the 6mA methylation of homologous genes after species and after gene duplications remains unclear in plants. To understand the evolution of 6mA methylation, we detected the genome-wide 6mA methylation patterns of four lotus plants (Nelumbo nucifera) from different geographic origins by nanopore sequencing and compared them to patterns in Arabidopsis and rice. Within lotus, the genomic distributions of 6mA sites are different from the widely studied 5mC methylation sites. Consistently, in lotus, Arabidopsis and rice, 6mA sites are enriched around transcriptional start sites, positively correlated with gene expression levels, and preferentially retained in highly and broadly expressed orthologs with longer gene lengths and more exons. Among different duplicate genes, 6mA methylation is significantly more enriched and conserved in whole-genome duplicates than in local duplicates. Overall, our study reveals the convergent patterns of 6mA methylation evolution based on both lineage and duplicate gene divergence, which underpin their potential role in gene regulatory evolution in plants.
... In the case of ONT, the presence of a modified nucleotide can lead to an electric current change (Figure 3B). For both technologies, the sequence context has an impact on modification detection and IPD or electric current profiles of positions adjacent to the modified nucleotide may be affected [123][124][125]. ...
... As PacBio and ONT sequencing are based on fundamentally different approaches, they are likely to differ in this aspect as well. Indeed, while ONT detects 5mC more strongly than 6mA in DNA, the opposite is true for PacBio [125][126][127] and, therefore, both technologies are complementary. ...
Long-read sequencing (LRS) technologies have provided extremely powerful tools to explore genomes. While in the early years these methods suffered technical limitations, they have recently made significant progress in terms of read length, throughput, and accuracy and bioinformatics tools have strongly improved. Here, we aim to review the current status of LRS technologies, the development of novel methods, and the impact on genomics research. We will explore the most impactful recent findings made possible by these technologies focusing on high-resolution sequencing of genomes and transcriptomes and the direct detection of DNA and RNA modifications. We will also discuss how LRS methods promise a more comprehensive understanding of human genetic variation, transcriptomics, and epigenetics for the coming years.
... Specialized hardware could reduce the footprint of computer needs while delivering very high AI/ML performance. For example, onboard graphics processing units would accelerate AI/ML capabilities for DNA sequencing requiring real-time base calling for detecting modified nucleic acids (ref. 105). The workshop recommended more investment in the development of hardware enabling AI/ML. ...
Human exploration of deep space will involve missions of substantial distance and duration. To effectively mitigate health hazards, paradigm shifts in astronaut health systems are necessary to enable Earth-independent healthcare, rather than Earth-reliant. Here we present a summary of decadal recommendations from a workshop organized by NASA on artificial intelligence, machine learning and modelling applications that offer key solutions toward these space health challenges. The workshop recommended various biomonitoring approaches, biomarker science, spacecraft/habitat hardware, intelligent software and streamlined data management tools in need of development and integration to enable humanity to thrive in deep space. Participants recommended that these components culminate in a maximally automated, autonomous and intelligent Precision Space Health system, to monitor, aggregate and assess biomedical statuses.
... Public dataset used for error type comparison are collected from the work by Alexa and co-workers (ref. 20). The accession numbers are listed in the key resource table. All original codes have been deposited at Zenodo and are publicly available as of the date of publication. ...
Sequencing of hypervariable regions as well as internal transcribed spacer regions of ribosomal RNA genes (rDNA) is broadly used to identify bacteria and fungi, but taxonomic and phylogenetic resolution is hampered by insufficient sequencing length using high throughput, cost-efficient second-generation sequencing. We developed a method to obtain nearly full-length rDNA by assembling single DNA molecules combining DNA co-barcoding with single-tube long fragment read technology and second-generation sequencing. Benchmarking was performed using mock bacterial and fungal communities as well as two forest soil samples. All mock species rDNA were successfully recovered with identities above 99.5% compared to the reference sequences. From the soil samples we obtained good coverage with identification of more than 20,000 unknown species, as well as high abundance correlation between replicates. This approach provides a cost-effective method for obtaining extensive and accurate information on complex environmental microbial communities.
... Using enriched ON-OFF pairs in our prototype strains for ModP1, ModP2 and ModQ, we used a combination of SMRT sequencing and Oxford Nanopore sequencing with technology specific methylome analysis to determine the specificity of each Mod protein. Although PacBio long read technology is well established as the gold-standard to solve methyltransferase specificities de novo (76,77), Nanopore sequencing to solve the specificity of bacterial DNA methyltransferases has also been used effectively (55,76). This analysis demonstrated that ModQ is an adenine specific phase-variable DNA methyltransferase, methylating the sequence 5′-AG(m6)ATG-3′. ...
Actinobacillus pleuropneumoniae is the cause of porcine pleuropneumonia, a severe respiratory tract infection that is responsible for major economic losses to the swine industry. Many host-adapted bacterial pathogens encode systems known as phasevarions (phase-variable regulons). Phasevarions result from variable expression of cytoplasmic DNA methyltransferases. Variable expression results in genome-wide methylation differences within a bacterial population, leading to altered expression of multiple genes via epigenetic mechanisms. Our examination of a diverse population of A. pleuropneumoniae strains determined that Type I and Type III DNA methyltransferases with the hallmarks of phase variation were present in this species. We demonstrate that phase variation is occurring in these methyltransferases, and show associations between particular Type III methyltransferase alleles and serovar. Using Pacific BioSciences Single-Molecule, Real-Time (SMRT) sequencing and Oxford Nanopore sequencing, we demonstrate the presence of the first ever characterised phase-variable, cytosine-specific Type III DNA methyltransferase. Phase variation of distinct Type III DNA methyltransferase in A. pleuropneumoniae results in the regulation of distinct phasevarions, and in multiple phenotypic differences relevant to pathobiology. Our characterisation of these newly described phasevarions in A. pleuropneumoniae will aid in the selection of stably expressed antigens, and direct and inform development of a rationally designed subunit vaccine against this major veterinary pathogen.
... The real PacBio and ONT datasets for chimpanzee are available in the NCBI under the accession number PRJNA659034. PacBio data for the mock microbial community from ZymoBIOMICS Microbial Community Standards are extracted, which are publicly available from (McIntyre et al., 2019). The ONT data for the same mock standard are obtained from (Nicholls et al., 2019). ...
Metagenomic sequencing facilitates large-scale constitutional analysis and functional characterization of complex microbial communities without cultivation. Recent advances in long-read sequencing techniques utilize long-range information to simplify repeat-aware metagenomic assembly puzzles and complex genome binning tasks. However, it remains methodologically challenging to remove host-derived DNA sequences from the microbial community at the read resolution due to high sequencing error rates and the absence of reference genomes. We here present Symbiont-Screener (https://github.com/BGI-Qingdao/Symbiont-Screener), a reference-free approach to identifying high-confidence host’s long reads from symbionts and contaminants and overcoming the low sequencing accuracy according to a trio-based screening model. The remaining host’s sequences are then automatically grouped by unsupervised clustering. When applied to both simulated and real long-read datasets, it maintains higher precision and recall rates of identifying the host’s raw reads compared to other tools and hence promises the high-quality reconstruction of the host genome and associated metagenomes. Furthermore, we leveraged both PacBio HiFi and nanopore long reads to separate the host’s sequences on a real host-microbe system, an algal-bacterial sample, and retrieved an obvious improvement of host assembly in terms of assembly contiguity, completeness, and purity. More importantly, the residual symbiotic microbiomes illustrate improved genomic profiling and assemblies after the screening, which elucidates a solid basis of data for downstream bioinformatic analyses, thus providing a novel perspective on symbiotic research.
... BOSS-RUNS and readfish depend on reference genomes to infer the origin of sequencing reads. To mimic a more realistic scenario where we do not know the exact bacterial strains, we elected not to use reference genomes from the strains contained in the microbial mixture but, instead, used closely related reference genomes identified in ref. 37. We measured their divergence in terms of the percentage of aligning nucleotides and ANI values using JSpecies (ref. 38), which range from 86.07% to 99.70% and 98.82% to 99.92%, respectively (Supplementary Fig. 9). ...
Nanopore sequencers can select which DNA molecules to sequence, rejecting a molecule after analysis of a small initial part. Currently, selection is based on predetermined regions of interest that remain constant throughout an experiment. Sequencing efforts, thus, cannot be re-focused on molecules likely contributing most to experimental success. Here we present BOSS-RUNS, an algorithmic framework and software to generate dynamically updated decision strategies. We quantify uncertainty at each genome position with real-time updates from data already observed. For each DNA fragment, we decide whether the expected decrease in uncertainty that it would provide warrants fully sequencing it, thus optimizing information gain. BOSS-RUNS mitigates coverage bias between and within members of a microbial community, leading to improved variant calling; for example, low-coverage sites of a species at 1% abundance were reduced by 87.5%, with 12.5% more single-nucleotide polymorphisms detected. Such data-driven updates to molecule selection are applicable to many sequencing scenarios, such as enriching for regions with increased divergence or low coverage, reducing time-to-answer. Nanopore selective sequencing with real-time decision updates mitigates coverage bias.