Eliot Cline’s research while affiliated with Mae Fah Luang University and other places


Publications (1)


Figure 3. SNP call accuracy plotted as a function of call quality score. DOI: 10.7717/peerj.10501/fig-3
Figure 4. Base-quality score distributions for real and simulated data sets. (A) Real tomato data. (B) Real cucumber data. (C) Simulated tomato data. (D) Simulated cucumber data. DOI: 10.7717/peerj.10501/fig-4
Feature importance rankings sorted by Info Gain.
Algorithms used in the study.
Correctly aligned and incorrectly aligned reads per organism.

Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
  • Article
  • Full-text available

December 2020 · 146 Reads · 4 Citations

Eliot Cline · [...]

Background: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability that the read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings.

Materials and Methods: Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM-formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and output estimates of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates. Next, the SAM files were updated with the new MAPQ scores. Finally, variant calling was performed on the original and recalibrated alignments and the results were compared.

Results: Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration for model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best performing model was used to compute new MAPQ scores. Single Nucleotide Polymorphism (SNP) detection was improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets the precision of SNP calls ranged from 0.91 to 0.95 and was largely unchanged by mapping score recalibration.

Conclusion: Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously. They exploit every sample’s reads to compensate for the low read depth of individual samples. This improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing.
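The abstract says MAPQ scores were re-computed from per-read estimates of the probability of misalignment. A minimal sketch of that conversion is given here, assuming the standard Phred-style definition from the SAM format (MAPQ = -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer). The function names and the cap of 60 are illustrative assumptions, not details taken from the study.

```python
import math

# Assumed upper bound on MAPQ; many aligners cap at 60, but the cap used
# in the study is not stated in the abstract.
MAX_MAPQ = 60


def prob_to_mapq(p_misaligned: float, max_mapq: int = MAX_MAPQ) -> int:
    """Convert a misalignment probability to a Phred-scaled MAPQ score."""
    if not 0.0 <= p_misaligned <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_misaligned == 0.0:
        # log10(0) is undefined, so a zero probability maps to the cap.
        return max_mapq
    score = -10.0 * math.log10(p_misaligned)
    return min(max_mapq, int(round(score)))


def mapq_to_prob(mapq: int) -> float:
    """Invert the Phred scaling: MAPQ back to a misalignment probability."""
    return 10.0 ** (-mapq / 10.0)


if __name__ == "__main__":
    # Hypothetical model outputs for a handful of reads.
    for p in (0.5, 0.1, 0.001, 1e-9):
        print(f"P(misaligned)={p:g} -> MAPQ {prob_to_mapq(p)}")
```

Under this scaling, a read with a 1-in-1000 estimated chance of misalignment receives MAPQ 30, so a model that lowers its probability estimate for a correctly aligned read raises that read's score and makes it more likely to pass a variant caller's MAPQ filter.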

Citations (1)


... Trimming the reference genome down allows for more accurate mapping quality (MAPQ) scores and alignment accuracy since 50% of the human genome consists of repetitive sequences, with about 89% of these repeats located within introns, allowing for reads to map equally to multiple locations [49,50]. MAPQ scores indicate the quality of the individual read alignment and the probability that a read is misaligned [51]. A large, single spike in coverage is observed in Figure 4B for all samples around the location of MDM4 in human chromosome 1. ...

Reference:

Long-Read MDM4 Sequencing Reveals Aberrant Isoform Landscape in Metastatic Melanomas
Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data