ABSTRACT: Motivation:
Determining the methylation state of regions with high copy numbers is challenging for second-generation sequencing, because the read length is insufficient to map reads uniquely, especially when repetitive regions are long and nearly identical to each other. Single-molecule real-time (SMRT) sequencing is a promising method for observing such regions, because it is not vulnerable to GC bias, it produces long read lengths, and its kinetic information is sensitive to DNA modifications.
We propose a novel linear-time algorithm that combines the kinetic information for neighboring CpG sites and increases the confidence in identifying the methylation states of those sites. Using a practical read coverage of ∼30-fold from an inbred medaka (Oryzias latipes) strain, we observed that both the sensitivity and precision of our method on individual CpG sites were ∼93.7%. We also observed a high correlation coefficient (R = 0.884) between our method and bisulfite sequencing, and for 92.0% of CpG sites, methylation levels ranging over [0, 1] were in concordance within an acceptable difference of 0.25. Using this method, we characterized the landscape of the methylation status of repetitive elements, such as LINEs, in the human genome, thereby revealing the strong correlation between CpG density and hypomethylation and detecting hypomethylation hot spots of LTRs and LINEs. We uncovered the methylation states for nearly identical active transposons, two novel LINE insertions of identity ∼99% and length 6050 base pairs (bp) in the human genome, and 16 Tol2 elements of identity >99.8% and length 4682 bp in the medaka genome.
AgIn (Aggregate on Intervals) is available at: https://github.com/hacone/AgIn
CONTACT: firstname.lastname@example.org, email@example.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
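The evaluation figures quoted in this abstract (sensitivity, precision, the correlation coefficient R, and the share of CpG sites concordant within 0.25) are standard summary statistics. The sketch below shows one way they could be computed from per-site values. It is not the AgIn algorithm and not the authors' evaluation code; the function names and the toy inputs are hypothetical and chosen only for illustration.

```python
from math import sqrt


def sensitivity_precision(tp, fp, fn):
    """Return (sensitivity, precision) from true/false positive and false negative counts."""
    return tp / (tp + fn), tp / (tp + fp)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def concordance(xs, ys, tol=0.25):
    """Fraction of sites whose two methylation levels (in [0, 1]) differ by at most tol."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(x - y) <= tol for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    # Toy per-site methylation levels (made up, not data from the paper).
    predicted = [0.0, 0.5, 1.0, 0.9]
    bisulfite = [0.1, 0.9, 1.0, 0.6]
    print(concordance(predicted, bisulfite))
    print(pearson_r(predicted, bisulfite))
```

With real data, `predicted` would hold the per-site levels from the SMRT-based method and `bisulfite` the levels from bisulfite sequencing at the same CpG sites.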
ABSTRACT: Author
We describe the genome assembly of Asian seabass (Lates calcarifer), a marine teleost with aquaculture relevance. Though >500 eukaryotic genome sequences are available in public repositories, the majority are highly fragmented with incomplete assemblies, which explains why considerable effort and resources are often spent to improve their quality after publication. In our study, we employed long read sequencing combined with genetic and optical mapping, and syntenic information to produce a chromosomal level assembly. The largely continuous genome assembly will be useful for comparative genomics and offers an opportunity to look into regions less explored such as tandem repeats (the core component of centromeres and telomeres). In addition, population structure of the species was analysed based on low-coverage genome sequence information from 61 individuals representing diverse geographic locations stretching from North-Western India across South-East Asia and Australia to Papua New Guinea.
ABSTRACT: Cross-validation error analysis to identify the number of clusters (K) that best explains variation in the Asian seabass species complex.
Cross-validation was used to find the number of clusters (K, i.e. populations) that best explains the observed variation. The best model, with the lowest cross-validation error, was obtained at K = 3.
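The selection rule described in this caption (pick the K with the lowest cross-validation error) can be sketched as follows. The sketch assumes a cross-validation error has already been computed for each candidate K by whatever clustering software was used, which the caption does not name. The function name `best_k` and the error values are hypothetical; the numbers are made up so that the minimum falls at K = 3 and are not taken from the study.

```python
def best_k(cv_errors):
    """Return the K with the lowest cross-validation error.

    cv_errors maps each candidate K (int) to its cross-validation error (float).
    Ties are broken in favour of the smaller K, i.e. the simpler model.
    """
    if not cv_errors:
        raise ValueError("cv_errors must not be empty")
    return min(cv_errors, key=lambda k: (cv_errors[k], k))


if __name__ == "__main__":
    # Illustrative values only (not from the paper).
    errors = {1: 0.62, 2: 0.55, 3: 0.51, 4: 0.53, 5: 0.58}
    print(best_k(errors))
```

Breaking ties toward the smaller K is a design choice of this sketch, not something stated in the source.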
ABSTRACT: The number of contigs in the primary Asian seabass genome assembly (v1; 3,917 contigs) compared to those of published fish genome assemblies (see S23 Table for more details).
ABSTRACT: The Asian seabass genome assembly contains a more continuous cluster of MHC-class I genes compared to the well-assembled G. aculeatus genome.
The L. calcarifer MHC-class I genes were found to be located on eight contigs/scaffolds, four of which were placed onto linkage group 3 (LG3). Four of these eight contigs/scaffolds were also >1Mb in length. The dashed connecting-lines indicate gaps introduced during sequence placement of contigs/scaffolds into linkage groups, while the yellow bars within the “scaffold_” sequences indicate Ns introduced during scaffolding. To allow for comparison at the level of contigs/scaffolds, the G. aculeatus chromosome groupX was split at the gapped regions (indicated by the dashed connecting-lines). The G. aculeatus MHC-class I genes were found to occupy 14 contigs/scaffolds, all except one being <113 kb in length.
ABSTRACT: The Asian seabass genome assembly (v2; blue bars) anchored to the 24 linkage groups (white bars) using 772 markers.
Regions indicated in red represent positions of contig/scaffold containing Lca_217 (peri-centromeric sequences).