We present a genome assembly from an individual female Phalera bucephala (the buff-tip; Arthropoda; Insecta; Lepidoptera; Notodontidae). The genome sequence is 933 megabases in span. The majority of the assembly, 99.27%, is scaffolded into 31 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled.
Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying datasets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate dataset for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly introduced genome workflow improves efficiency and shortens runtimes, especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins to transcriptomes and gene sets.
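As a rough illustration of what "completeness and redundancy" mean in this setting, the sketch below turns counts of orthologs found as single-copy, duplicated, fragmented, or missing into percentages. The class name, function name, and example counts are hypothetical and chosen for illustration only; this is not BUSCO's own code or output.

```python
from dataclasses import dataclass


@dataclass
class OrthologCounts:
    """Hypothetical container for the four outcome classes of an ortholog search."""
    single: int       # found complete, exactly once
    duplicated: int   # found complete, more than once
    fragmented: int   # found only partially
    missing: int      # not found

    @property
    def total(self) -> int:
        return self.single + self.duplicated + self.fragmented + self.missing


def summarize(counts: OrthologCounts) -> dict:
    """Return each class as a percentage of all orthologs searched."""
    n = counts.total
    if n == 0:
        raise ValueError("no orthologs were searched")

    def pct(x: int) -> float:
        return round(100.0 * x / n, 1)

    return {
        # Completeness: orthologs recovered in full, whether once or several times.
        "complete": pct(counts.single + counts.duplicated),
        "single": pct(counts.single),
        # Redundancy: complete orthologs present in more than one copy.
        "duplicated": pct(counts.duplicated),
        "fragmented": pct(counts.fragmented),
        "missing": pct(counts.missing),
    }


if __name__ == "__main__":
    # Made-up counts, not results from any real assembly.
    print(summarize(OrthologCounts(single=900, duplicated=50, fragmented=20, missing=30)))
```

A high duplicated percentage in such a summary is the kind of signal that would prompt a closer look at retained haplotypic duplication in an assembly.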
Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved, goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Whilst working towards improved datasets and fully automated pipelines, assembly evaluation and curation are actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.
Insects and pathogens frequently exploit the same host plant and can potentially impact each other's performance. However, studies on plant–pathogen–insect interactions have mainly focused on a fixed temporal setting or on a single interaction partner. In this study, we assessed the impact of time of attacker arrival on the outcome and symmetry of interactions between aphids (Tuberculatus annulatus), powdery mildew (Erysiphe alphitoides), and caterpillars (Phalera bucephala) feeding on pedunculate oak, Quercus robur, and explored how single versus multiple attackers affect oak performance. We used a multifactorial greenhouse experiment in which oak seedlings were infected with either zero, one, two, or three attackers, with the order of attacker arrival differing among treatments. The performances of all involved organisms were monitored throughout the experiment. Overall, attackers had a weak and inconsistent impact on plant performance. Interactions between attackers, when present, were asymmetric. For example, aphids performed worse, but powdery mildew performed better, when co-occurring. Order of arrival strongly affected the outcome of interactions, and early attackers modified the strength and direction of interactions between later-arriving attackers. Our study shows that interactions between plant attackers can be asymmetric, time-dependent, and species specific. This is likely to shape the ecology and evolution of plant–pathogen–insect interactions.
Reconstruction of target genomes from sequence data produced by instruments that are agnostic as to the species-of-origin may be confounded by contaminant DNA. Whether introduced during sample processing or through co-extraction alongside the target DNA, if insufficient care is taken during the assembly process, the final assembled genome may be a mixture of data from several species. Such assemblies can confound sequence-based biological inference and, when deposited in public databases, may be included in downstream analyses by users unaware of underlying problems. We present BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies. BlobToolKit can be used to process assembly, read and analysis files for fully reproducible interactive exploration in the browser-based Viewer. BlobToolKit can be used during assembly to filter non-target DNA, helping researchers produce assemblies with high biological credibility. We have been running an automated BlobToolKit pipeline on eukaryotic assemblies publicly available in the International Nucleotide Sequence Data Collaboration and are making the results available through a public instance of the Viewer at https://blobtoolkit.genomehubs.org/view . We aim to complete analysis of all publicly available genomes and then maintain currency with the flow of new genomes. We have worked to embed these views into the presentation of genome assemblies at the European Nucleotide Archive, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the Viewer.
Rapid development in long read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either only focus on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.
Here we present a novel tool "purge_dups" that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.
The source code is written in C and is available at https://github.com/dfguan/purge_dups.
Supplementary data are available at Bioinformatics online.
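The read-depth half of this idea can be pictured with a toy classifier: a contig covered at roughly half the expected diploid depth is a candidate haplotig, while very low or very high depth suggests junk or collapsed repeats. The function name, labels, and cutoff values below are assumptions made for illustration; they are not taken from the purge_dups source, and the sequence-similarity step that the tool combines with depth is not shown.

```python
def classify_by_depth(mean_depth: float, low: float, mid: float, high: float) -> str:
    """Label a contig from its mean read depth using three hypothetical cutoffs.

    low  : below this, coverage is too thin to trust the contig
    mid  : boundary between haploid-level and diploid-level depth
    high : above this, depth suggests a collapsed repeat
    """
    if not (low < mid < high):
        raise ValueError("cutoffs must satisfy low < mid < high")
    if mean_depth < low:
        return "junk"
    if mean_depth < mid:
        # Haploid-level depth: a candidate for removal only if it also
        # aligns well to another contig (similarity check not shown here).
        return "haploid"
    if mean_depth <= high:
        return "diploid"
    return "repeat"


if __name__ == "__main__":
    # Made-up depths and cutoffs for a sample with about 45x diploid coverage.
    for depth in (3, 20, 45, 200):
        print(depth, classify_by_depth(depth, low=5, mid=30, high=90))
```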
Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Error rates, particularly for inversions and fusions across chromosomes, remain higher than with alternative scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.
We present HiGlass, an open source visualization tool built on web technologies that provides a rich interface for rapid, multiplex, and multiscale navigation of 2D genomic maps alongside 1D genomic tracks, allowing users to combine various data types, synchronize multiple visualization modalities, and share fully customizable views with others. We demonstrate its utility in exploring different experimental conditions, comparing the results of analyses, and creating interactive snapshots to share with collaborators and the broader public. HiGlass is accessible online at http://higlass.io and is also available as a containerized application that can be run on any platform.
Electronic supplementary material
The online version of this article (10.1186/s13059-018-1486-1) contains supplementary material, which is available to authorized users.
For most research approaches, genome analyses are dependent on the existence of a high quality genome reference assembly. However, the local accuracy of an assembly remains difficult to assess and improve. The gEVAL browser allows the user to interrogate an assembly in any region of the genome by comparing it to different datasets and evaluating the concordance. These analyses include: a wide variety of sequence alignments, comparative analyses of multiple genome assemblies, and consistency with optical and other physical maps. gEVAL highlights allelic variations, regions of low complexity, abnormal coverage, and potential sequence and assembly errors, and offers strategies for improvement. Although gEVAL focuses primarily on sequence integrity, it can also display arbitrary annotation including from Ensembl or TrackHub sources. We provide gEVAL web sites for many human, mouse, zebrafish and chicken assemblies to support the Genome Reference Consortium, and gEVAL is also downloadable to enable its use for any organism and assembly.
Availability and implementation:
Web Browser: http://geval.sanger.ac.uk, Plugin: http://wchow.github.io/wtsi-geval-plugin
Contact: email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced PacBio HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultra-long Oxford Nanopore reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance towards the complete assembly of human genomes.
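Two of the quantities in this abstract are easy to make concrete. Homopolymer compression collapses each run of identical bases to a single base, and a Phred-scaled quality value relates per-base accuracy to a score via QV = -10 * log10(error rate), so 99.999% accuracy corresponds to QV50. The sketch below shows both; the function names are illustrative and this is not HiCanu's implementation.

```python
import math
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases to one base, e.g. AAACCG -> ACG."""
    return "".join(base for base, _run in groupby(seq))


def phred_qv(accuracy: float) -> float:
    """Convert per-base accuracy (a fraction strictly between 0 and 1) to a Phred-scaled QV."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    print(homopolymer_compress("AAACCGTTTT"))
    print(round(phred_qv(0.99999)))
```

Compressing reads this way makes two reads that differ only in homopolymer run length look identical, which is why it helps when run-length miscalls are a sequencing platform's dominant error.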
Recent outbreaks of defoliating insects on trees and shrubs used for landscaping major roads in the U.K. are reported. The reasons for some outbreaks were investigated with special reference to effects of the roadside environment. An outbreak of Phalera bucephala on Fagus sylvatica was examined each year between 1975 and 1979 and one of Euproctis similis on Crataegus monogyna from 1975 to 1978. The nitrogen content of vegetation near a heavily used motorway was greatly enhanced, probably by oxides of nitrogen emitted from vehicle exhausts. The increased nitrogen content of the plants probably increases the insect populations. The outbreaks could not be explained by relaxation of predation.
The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing small-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our implementation of this framework in a haplotype-based variant detector,
The Buff-Tip Moth (Phalera bucephala L.) - a Pest of Apple Trees.