Base-quality score distributions for real and simulated data sets. (A) Real tomato data. (B) Real cucumber data. (C) Simulated tomato data. (D) Simulated cucumber data. DOI: 10.7717/peerj.10501/fig-4


Source publication
Article
Background: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files...

Contexts in source publication

Context 1
... training set was constructed containing all false positive samples and an equal number of true positives. Training was carried out again on this new set using grid search with five-fold cross-validation. After training, prediction was done on the 80% test data. After all subsets were exhausted, the best performing model was retained for later use (Fig. ...
Context 2
... data set. However, using the predictions to compute new MAPQ scores resulted in improvements in SNP calling performance. The average number of true and false positive SNP calls for the fourteen 3× cucumber subsets is shown in Table 7. We also found that the base-call quality profiles of real and simulated data sets can differ significantly (Fig. 4). In the tomato data sets the default Illumina HiSeq 2000 profile provided by the ART read simulator tracked quite well the actual profile in the real tomato data (Figs. 4A and 4C). This was not the case for the cucumber data. The default profile in ART will produce base-quality score distributions in Fig. 4C. The actual distributions ...
Context 3
... SNP calls for the fourteen 3× cucumber subsets is shown in Table 7. We also found that the base-call quality profiles of real and simulated data sets can differ significantly (Fig. 4). In the tomato data sets the default Illumina HiSeq 2000 profile provided by the ART read simulator tracked quite well the actual profile in the real tomato data (Figs. 4A and 4C). This was not the case for the cucumber data. The default profile in ART will produce base-quality score distributions in Fig. 4C. The actual distributions in the cucumber data are shown in Fig. ...
Context 6
... that tool we generated a profile from the downsampled cucumber data and used it in the cucumber data simulation. Figure 4D shows the base-quality score distributions in the simulated cucumber data. They match very well the distributions in the real data. ...
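
The excerpts above compare base-quality score distributions between real and simulated reads. As a rough illustration of what such a comparison involves, the sketch below computes a per-position mean Phred score from FASTQ quality strings and measures how far two profiles are apart. It is not the paper's method: the function names are hypothetical, the reads are made-up toy strings, and Phred+33 encoding is an assumption that should be checked against the actual data.

```python
from collections import defaultdict
from statistics import mean

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ (Phred+33) encoding


def decode_qualities(quality_string, offset=PHRED_OFFSET):
    """Convert one FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def per_position_mean_quality(quality_strings, offset=PHRED_OFFSET):
    """Return the mean Phred score at each read position (0-based)."""
    scores_by_position = defaultdict(list)
    for qual in quality_strings:
        for pos, score in enumerate(decode_qualities(qual, offset)):
            scores_by_position[pos].append(score)
    return [mean(scores_by_position[pos]) for pos in sorted(scores_by_position)]


def mean_absolute_profile_difference(profile_a, profile_b):
    """Average absolute gap between two profiles over their shared positions."""
    shared = min(len(profile_a), len(profile_b))
    if shared == 0:
        raise ValueError("profiles share no positions")
    # zip stops at the shorter profile, so only shared positions are compared.
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b)) / shared


# Toy, made-up quality strings (not taken from the paper's data).
real_reads = ["IIIIH", "IIHHF", "IIIHF"]
simulated_reads = ["IIIIH", "IIIHH", "IIHHF"]

real_profile = per_position_mean_quality(real_reads)
simulated_profile = per_position_mean_quality(simulated_reads)
print(mean_absolute_profile_difference(real_profile, simulated_profile))
```

A per-position mean is only a summary; the figure shows full distributions, so a closer comparison would keep the score lists per position (for example as quartiles) rather than collapsing them to a single value.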

Citations

... Trimming the reference genome down allows for more accurate mapping quality (MAPQ) scores and alignment accuracy since 50% of the human genome consists of repetitive sequences, with about 89% of these repeats located within introns, allowing for reads to map equally to multiple locations [49,50]. MAPQ scores indicate the quality of the individual read alignment and the probability that a read is misaligned [51]. A large, single spike in coverage is observed in Figure 4B for all samples around the location of MDM4 in human chromosome 1. ...
Article
MDM4 is upregulated in the majority of melanoma cases and has been described as a “key therapeutic target in cutaneous melanoma”. Numerous isoforms of MDM4 exist, with few studies examining their specific expression in human tissues. The changes in splicing of MDM4 during human melanomagenesis are critical to p53 activity and represent potential therapeutic targets. Compounding this, studies relying on short reads lose “connectivity” data, so full transcripts are frequently only inferred from the presence of splice junction reads. To address this problem, long-read nanopore sequencing was utilized to read the entire length of transcripts. Here, MDM4 transcripts, both alternative and canonical, are characterized in a pilot cohort of human melanoma specimens. RT-PCR was first used to identify the presence of novel splice junctions in these specimens. RT-qPCR then quantified the expression of major MDM4 isoforms observed during sequencing. The current study both identifies and quantifies MDM4 isoforms present in melanoma tumor samples. In the current study, we observed high expression levels of MDM4-S, MDM4-FL, MDM4-A, and the previously undescribed Ensembl transcript MDM4-209. A novel transcript lacking both exons 6 and 9 is observed and named MDM4-A/S for its resemblance to both MDM4-A and MDM4-S isoforms.
... Although it is possible to improve MAPQ accuracy through postprocessing (e.g. Ruffalo et al. 2012, Langmead 2017, Cline et al. 2020), this can require a significant amount of additional processing time. ...
Article
Motivation: Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant calling results depends not only on the quality of read alignment and variant calling software but also on the interaction between these complex software tools. Results: In this review, we evaluate short-read aligner performance with the goal of optimizing germline variant calling accuracy. We examine the performance of three general-purpose short-read aligners - BWA-MEM, Bowtie 2, and Arioc - in conjunction with three germline variant callers: DeepVariant, FreeBayes, and GATK HaplotypeCaller. We discuss the behavior of the read aligners with regard to the data elements on which the variant callers rely, and illustrate how the runtime configurations of these software tools combine to affect variant calling performance. Supplementary information: Supplementary information is available at Bioinformatics online.
... While correct read location is desired (high mapping accuracy), an aligner also needs accurate base-level alignments when calling SNPs and, in particular, indels. While such an analysis supplements a mapping accuracy analysis, a caveat is that variant callers use MAPQ scores for SNV and indel prediction [34]; therefore, some callers may be developed or tuned based on popular aligners' MAPQ scores. Specifically, it has been shown that bcftools call was the SNP caller that produced the best result with BWA-MEM alignments out of seven variant calling tools [35]. ...
Article
Read alignment is often the computational bottleneck in analyses. Recently, several advances have been made on seeding methods for fast sequence comparison. We combine two such methods, syncmers and strobemers, in a novel seeding approach for constructing dynamic-sized fuzzy seeds and implement the method in a short-read aligner, strobealign. The seeding is fast to construct and effectively reduces repetitiveness in the seeding step, as shown using a novel metric E-hits. strobealign is several times faster than traditional aligners at similar and sometimes higher accuracy while being both faster and more accurate than more recently proposed aligners for short reads of lengths 150nt and longer. Availability: https://github.com/ksahlin/strobealign
... However, post-processing is strongly recommended in genome analysis since mis- After deduplication and indel realignment, BQSR is recommended to improve the accuracy of base quality scores before the variant calling step (Cline et al., 2020). ...
Thesis
Animals' behaviors and phenotypes are in part embedded in the genetic molecules of life: RNA and DNA. By deciphering the information hidden in these molecules, we can peek into the mysteries of biology. Next generation sequencing (NGS) was developed as a powerful tool for decoding DNA and RNA molecules on a very large scale. NGS has considerably broadened our understanding of all areas of biology, from molecular biology to genetics, medicine, ecology and epidemiology. A cornerstone of NGS data analysis is the comparison with a reference genome. Although scientists use one reference genome per species, the explosive growth of sequencing output has challenged this view by showing that actual DNA and RNA sequences are much more diverse. In this thesis, we propose new bioinformatics protocols for NGS analysis that do not rely on a reference. Our projects aim to exploit the power of alignment-free approaches to discover novel variations in cancer transcriptomes and genomes in hard-to-map regions or regions absent from the reference genome. We applied this strategy to uncover novel phenotype-related events from large-scale cancer cohorts. From the perspective of genome analysis, we uncovered novel recurrent variants from prostate cancer patients. Based on transcriptome analysis, we discovered non-reference events with high replicability. We demonstrate that a large number of novel events relevant to diseases can be discovered in an alignment-free manner. These novel non-reference events do not require a priori knowledge of the human genome or transcriptome and present significant prognostic value and potential to produce neoantigens. In addition, these novel non-reference events involved in cancer risk may orient biologists towards new oncogenesis mechanisms.
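
Several of the citing excerpts above describe MAPQ as reflecting the probability that a read is misaligned. In the SAM format this relationship is Phred-scaled: MAPQ = -10 log10(P(mapping position is wrong)), rounded to the nearest integer. The sketch below converts in both directions. The function names and the cap of 60 are assumptions made for illustration (aligners use different maximum values), and nothing here reproduces how any of the cited tools recompute MAPQ.

```python
import math

MAX_MAPQ = 60  # assumption: illustrative cap; real aligners use different maxima


def mapq_to_error_probability(mapq):
    """Phred-scaled MAPQ -> probability that the mapping position is wrong."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong, max_mapq=MAX_MAPQ):
    """Probability that the mapping is wrong -> integer MAPQ, capped at max_mapq."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must lie in [0, 1]")
    if p_wrong == 0.0:
        # log10(0) is undefined; treat a zero error estimate as the cap.
        return max_mapq
    return min(max_mapq, int(round(-10 * math.log10(p_wrong))))


for mapq in (0, 10, 20, 30):
    print(mapq, mapq_to_error_probability(mapq))
```

Under this scaling, a postprocessing step that yields an estimated misalignment probability for each read could be mapped back to a MAPQ value with `error_probability_to_mapq`, which is one plausible reading of "using the predictions to compute new MAPQ scores" in the excerpt above, not a confirmed description of it.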