Article

A draft reference genome assembly of California Pipevine, Aristolochia californica Torr.


Abstract

The California Pipevine, Aristolochia californica Torr., is the only endemic California species within the cosmopolitan birthwort family Aristolochiaceae. It occurs as an understory vine in riparian and chaparral areas and in forest edges and windrows. The geographic range of this plant species almost entirely overlaps with that of its major specialized herbivore, the California Pipevine Swallowtail Butterfly Battus philenor hirsuta. While this species pair is a useful, ecologically well-understood system to study co-evolution, until recently, genomic resources for both have been lacking. Here, we report a new, chromosome-level assembly of A. californica as part of the California Conservation Genomics Project (CCGP). Following the sequencing and assembly strategy of the CCGP, we used Pacific Biosciences HiFi long reads and Hi-C chromatin proximity sequencing technology to produce a de novo assembled genome. Our genome assembly, the first for any species in the genus, contains 531 scaffolds spanning 661 megabase (Mb) pairs, with a contig N50 of 6.53 Mb, a scaffold N50 of 42.2 Mb, and BUSCO complete score of 98%. In combination with the recently published B. philenor hirsuta reference genome assembly, the A. californica reference genome assembly will be a powerful tool for studying co-evolution in a rapidly changing California landscape.
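The contig and scaffold N50 values quoted in the abstract are length-weighted medians: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The following is a minimal sketch of that calculation, not the CCGP pipeline itself; the function name `n50` and the toy lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running sum reaches at least half of the total assembly length;
    the length of the sequence that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional half-lengths.
        if running * 2 >= total:
            return length


# Hypothetical toy assembly (lengths in arbitrary units, total = 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Applied to contig lengths this gives the contig N50, and applied to scaffold lengths the scaffold N50, which is why scaffolding with Hi-C data can raise the second value without changing the first.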

Article
The increasing availability of genomic resequencing data sets and high-quality reference genomes across the tree of life presents exciting opportunities for comparative population genomic studies. However, substantial challenges prevent the simple reuse of data across different studies and species, arising from variability in variant calling pipelines, data quality, and the need for computationally intensive reanalysis. Here, we present snpArcher, a flexible and highly efficient workflow designed for the analysis of genomic resequencing data in nonmodel organisms. snpArcher provides a standardized variant calling pipeline and includes modules for variant quality control, data visualization, variant filtering, and other downstream analyses. Implemented in Snakemake, snpArcher is user-friendly, reproducible, and designed to be compatible with high-performance computing clusters and cloud environments. To demonstrate the flexibility of this pipeline, we applied snpArcher to 26 public resequencing data sets from nonmammalian vertebrates. These variant data sets are hosted publicly to enable future comparative population genomic analyses. With its extensibility and the availability of public data sets, snpArcher will contribute to a broader understanding of genetic variation across species by facilitating the rapid use and reuse of large genomic data sets.
Article
Landscape genomics can harness environmental and genetic data to inform conservation decisions by providing essential insights into how landscapes shape biodiversity. The massive increase in genetic data afforded by the genomic era provides exceptional resolution for answering critical conservation genetics questions. The accessibility of genomic data for non‐model systems has also enabled a shift away from population‐based sampling to individual‐based sampling, which now provides accurate and robust estimates of genetic variation that can be used to examine the spatial structure of genomic diversity, population connectivity and the nature of environmental adaptation. Nevertheless, the adoption of individual‐based sampling in conservation genetics has been slowed due, in large part, to concerns over how to apply methods developed for population‐based sampling to individual‐based sampling schemes. Here, we discuss the benefits of individual‐based sampling for conservation and describe how landscape genomic methods, paired with individual‐based sampling, can answer fundamental conservation questions. We have curated key landscape genomic methods into a user‐friendly, open‐source workflow, which we provide as a new R package, A Landscape Genomics Analysis Toolkit in R (algatr). The algatr package includes novel added functionality for all of the included methods and extensive vignettes designed with the primary goal of making landscape genomic approaches more accessible and explicitly applicable to conservation biology.
Article
The California Pipevine Swallowtail Butterfly, Battus philenor hirsuta, and its host plant, the California Pipevine or Dutchman's Pipe, Aristolochia californica Torr., are an important California endemic species pair. While this species pair is an ideal system to study co-evolution, genomic resources for both are lacking. Here, we report a new, chromosome-level assembly of B. philenor hirsuta as part of the California Conservation Genomics Project (CCGP). Following the sequencing and assembly strategy of the CCGP, we used Pacific Biosciences HiFi long reads and Hi-C chromatin proximity sequencing technology to produce a de novo assembled genome. Our genome assembly, the first for any species in the genus, contains 109 scaffolds spanning 443 megabase (Mb) pairs, with a contig N50 of 14.6 Mb, a scaffold N50 of 15.2 Mb, and BUSCO complete score of 98.9%. In combination with the forthcoming A. californica reference genome, the B. philenor hirsuta genome will be a powerful tool for documenting landscape genomic diversity and plant-insect co-evolution in a rapidly changing California landscape.
Article
Aristolochic acids (AAs) are a group of nitrophenanthrene carboxylic acids present in many medicinal herbs of the Aristolochia genus that may cause irreversible hepatotoxicity, nephrotoxicity, genotoxicity and carcinogenicity. However, the specific profile of AAs and their toxicity in Aristolochia plants, except for AAs Ι and ΙΙ, still remain unclear. In this study, a total of 52 batches of three medicinal herbs belonging to the Aristolochia family were analyzed for their AA composition profiles and AA contents using the UPLC-QTOF-MS/MS approach. The studied herbs were A. mollissima Hance (AMH), A. debilis Sieb. et Zucc. (ADS), and A. cinnabaria C.Y.Cheng (ACY). Chemometrics methods, including PCA and OPLS-DA, were used for the evaluation of the Aristolochia medicinal herbs. Additionally, cytotoxicity and genotoxicity of the selected AAs and the extracts of AMH and ADS were evaluated in a HepG2 cell line using the MTT method and a Comet assay, respectively. A total of 44 AAs, including 23 aristolochic acids and 21 aristolactams (ALs), were detected in A. mollissima. Moreover, 41 AAs (23 AAs and 18 ALs) were identified from A. debilis Sieb., and 45 AAs (29 AAs and 16 ALs) were identified in A. cinnabaria. Chemometrics results showed that 16, 19, and 22 AAs identified in AMH, ADS, and ACY, respectively, had statistical significance for distinguishing the three medicinal herbs of different origins. In the cytotoxicity assay, compounds AL-BΙΙ, AAΙ and the extract of AMH exhibited significant cytotoxicities against the HepG2 cell line with the IC50 values of 0.2, 9.7 and 50.2 μM, respectively. The results of the Comet assay showed that AAΙ caused relatively higher damage to cellular DNA (TDNA 40–95%) at 50 μM, while AAΙΙ, AMH and ADS extracts (ranged from 10 to 131 μM) caused relatively lower damage to cellular DNA (TDNA 5–20%).
Preprint
Single molecule sequencing requires optimized sample and library preparation protocols to obtain long-read lengths and high sequencing yields. Numerous protocols exist for the extraction of DNA from plant species, but the genomic DNA from these extractions is either too low yield, of insufficient purity for sensitive sequencing platforms, e.g. nanopore sequencing, too fragmented to achieve long reads, or otherwise unattainable from recalcitrant adult tissue. This renders many plant sequencing projects cost prohibitive or methodologically intractable. Existing protocols are also labor intensive, taking days to complete. Our protocol described here yields micrograms of high molecular weight gDNA from a single gram of adult or seedling leaf tissue in only a few hours, and produces high quality sequencing libraries for the Oxford Nanopore system, with typical yields ranging from 3-10 Gb per R9.4.1 flowcell and producing reads averaging 5-8 kb, with read length N50s ranging from 6-30 kb depending on the style of library preparation (details in sequencing outcomes section), and maximum lengths extending up to 200 kb+.
Article
Conservation science and environmental regulation are sibling constructs of the latter half of the 20th century, part of a more general awakening to humanity’s effect on the natural world in the wake of two world wars. Efforts to understand the evolution of biodiversity using the models of population genetics and the data derived from DNA sequencing, paired with legal and political mandates to protect biodiversity through novel laws, regulations, and conventions arose concurrently. The extremely rapid rate of development of new molecular tools to document and compare genetic identities, and the global goal of prioritizing species and habitats for protection are separate enterprises that have benefited from each other, ultimately leading to improved outcomes for each. In this article, we explore how the California Conservation Genomics Project has, and should, contribute to ongoing and future conservation implementation, and how it serves as a model for other geopolitical regions and taxon-oriented conservation efforts. One of our primary conclusions is that conservation genomics can now be applied, at scale, to inform decision-makers and identify regions and their contained species that are most resilient, and most in need of conservation interventions.
Article
Aristolochia hainanensis Merr. 1922, a well-known Chinese medicinal plant, is distributed in Hainan Province and Guangxi Province, China. In the current study, we sequenced the complete chloroplast genome of A. hainanensis. The complete plastome genome was 159,764 bp in length, with a GC content of 38.8%, showing a typical quadripartite organization. The genome contained a large single-copy (LSC) of 89,134 bp, a small single-copy (SSC) of 19,306 bp, and a pair of inverted repeats (IRs) of 25,662 bp. A total of 113 genes were annotated, including 79 protein-coding genes, 30 tRNAs, and four rRNAs. The trnK-UUU gene contained the longest intron (2644 bp). The topology of the maximum-likelihood tree supported a close relationship between A. hainanensis and A. kwangsiensis.
Article
Incorporating measures of taxonomic diversity into research and management plans has long been a tenet of conservation science. Increasingly, active conservation programs are turning towards multi-species landscape and regional conservation actions, and away from single species approaches. This is both a reflection of changing trends in conservation science and advances in foundational technologies, including genomics and geospatial science. Multi-species approaches may provide more fundamental insights into evolutionary processes and equip managers with a more holistic understanding of the landscapes under their jurisdiction. Central to this approach are data generation and analyses which embrace and reflect a broad range of taxonomic diversity. Here we examine the family-level phylogenetic breadth of the California Conservation Genomics Project (CCGP) based on family-level phylogenetic diversity, family-level phylogenetic distinctness, and family richness. We place this in the context of the diversity present in California and compare it to the 35-plus years of genetic research compiled in the CaliPopGen Database. We found that the family-level phylogenetic diversity in the CCGP reflected that of California very well, slightly over-representing chordates and under-representing arthropods, and that 42% of CCGP phylogenetic diversity represented new contributions to genetic data for the state. In one focused effort, the CCGP was able to achieve roughly half the family-level phylogenetic diversity studied over the last several decades. To maximize studied phylogenetic diversity, future work should focus on arthropods, a conclusion that likely reflects the overall lack of attention to this hyper diverse clade.
Article
Genome size variation and the evolutionary forces behind it have long been studied in flowering plants. The genus Oryza, consisting of approximately 25 wild species and two cultivated rice species, harbors eleven extant genome types, six of which are diploid (AA, BB, CC, EE, FF, and GG) and five of which are tetraploid (BBCC, CCDD, HHJJ, HHKK, and KKLL). To obtain the most comprehensive knowledge of genome size variation in the genus Oryza, we performed flow cytometry experiments and estimated genome sizes of 166 accessions belonging to 16 non-AA genome Oryza species. k-mer analyses were then used to verify the experimental results for two accessions of each species. Our results showed that genome sizes varied about fourfold in the genus Oryza, ranging from 279 Mb in Oryza brachyantha (FF) to 1,203 Mb in Oryza ridleyi (HHJJ). There was a 2-fold variation (ranging from 570 to 1,203 Mb) in genome size among the tetraploid species, while the diploid species had 3-fold variation, ranging from 279 Mb in Oryza brachyantha (FF) to 905 Mb in Oryza australiensis (EE). The genome sizes of the tetraploid species were not always two times larger than those of the diploid species, and some diploid species even had larger genome sizes than those of tetraploids. Nevertheless, we found that genome sizes of newly formed allotetraploids (BBCC-) were almost equal to the summed genome sizes of their parental progenitors. Our results showed that the species belonging to the same genome types had similar genome sizes, while genome sizes showed a gradually decreasing trend during the evolutionary process in the clade with AA, BB, CC, and EE genome types. Comparative genomic analyses further showed that the species with different rice genome types may have experienced dissimilar amplification histories of retrotransposons, resulting in remarkably different genome sizes. On the other hand, the closely related rice species may have experienced similar amplification history. We observed that the contents of transposable elements, long terminal repeat (LTR) retrotransposons, and particularly LTR/Gypsy retrotransposons varied largely but were significantly correlated with genome sizes. Therefore, this study demonstrated that LTR retrotransposons act as an active driver of genome size variation in the genus Oryza.
Article
Background: Pacific Biosciences HiFi read technology is currently the industry standard for high accuracy long-read sequencing that has been widely adopted by large sequencing and assembly initiatives for generation of de novo assemblies in non-model organisms. Though adapter contamination filtering is routine in traditional short-read analysis pipelines, it has not been widely adopted for HiFi workflows. Results: Analysis of 55 publicly available HiFi datasets revealed that a read-sanitation step to remove sequence artifacts derived from PacBio library preparation from read pools is necessary as adapter sequences can be erroneously integrated into assemblies. Conclusions: Here we describe the nature of adapter contaminated reads, their consequences in assembly, and present HiFiAdapterFilt, a simple and memory efficient solution for removing adapter contaminated reads prior to assembly.
Article
Aristolochic acids (AAs) and their derivatives exist in multiple Aristolochiaceae species which had been or are being used as medicinal materials. During the past decades, AAs have received increasing attention due to their nephrotoxicity and carcinogenicity. Elimination of AAs in medicinal materials using biotechnological approaches is important to improve medication safety. However, it has not been achieved because of the limited information of AA biosynthesis available. Here, we report a high-quality reference-grade genome assembly of the AA-containing vine, Aristolochia contorta. Total size of the assembly is 209.27 Mb, which is assembled into 7 pseudochromosomes. Synteny analysis, Ks distribution and 4DTv suggest absences of whole-genome duplication events in A. contorta after the angiosperm-wide WGD. Based on genomic, transcriptomic and metabolic data, pathways and candidate genes of benzylisoquinoline alkaloid (BIA) and AA biosynthesis in A. contorta were proposed. Five O-methyltransferase genes, including AcOMT1–3, AcOMT5 and AcOMT7, were cloned and functionally characterized. The results provide a high-quality reference genome for AA-containing species of Aristolochiaceae. It lays a solid foundation for further elucidation of AA biosynthesis and regulation and molecular breeding of Aristolochiaceae medicinal materials.
Article
Aristolochia, a genus in the magnoliid order Piperales, has been famous for centuries for its highly specialized flowers and wide medicinal applications. Here, we present a new, high-quality genome sequence of Aristolochia fimbriata, a species that, similar to Amborella trichopoda, lacks further whole-genome duplications since the origin of extant angiosperms. As such, the A. fimbriata genome is an excellent reference for inferences of angiosperm genome evolution, enabling detection of two novel whole-genome duplications in Piperales and dating of previously reported whole-genome duplications in other magnoliids. Genomic comparisons between A. fimbriata and other angiosperms facilitated the identification of ancient genomic rearrangements suggesting the placement of magnoliids as sister to monocots, whereas phylogenetic inferences based on sequence data we compiled yielded ambiguous relationships. By identifying associated homologues and investigating their evolutionary histories and expression patterns, we revealed highly conserved floral developmental genes and their distinct downstream regulatory network that may contribute to the complex flower morphology in A. fimbriata. Finally, we elucidated the genetic basis underlying the biosynthesis of terpenoids and aristolochic acids in A. fimbriata.
Article
Background and aims: Genome size varies considerably across the diversity of plant life. Although genome size is, by definition, affected by genetic presence/absence variants, which are ubiquitous in population sequencing studies, genome size is often treated as an intrinsic property of a species. Here, we studied intra- and interspecific genome size variation in taxonomically complex British eyebrights (Euphrasia, Orobanchaceae). Our aim is to document genome size diversity and investigate underlying evolutionary processes shaping variation between individuals, populations and species. Methods: We generated genome size data for 192 individuals of diploid and tetraploid Euphrasia and analysed genome size variation in relation to ploidy, taxonomy, population affiliation, and geography. We further compared the genomic repeat content of 30 samples. Key results: We found considerable intraspecific genome size variation, and observed isolation-by-distance for genome size in outcrossing diploids. Tetraploid Euphrasia showed contrasting patterns, with genome size increasing with latitude in outcrossing Euphrasia arctica, but with little genome size variation in the highly selfing Euphrasia micrantha. Interspecific differences in genome size and the genomic proportions of repeat sequences were small. Conclusions: We show the utility of treating genome size as the outcome of polygenic variation. Like other types of genetic variation, such as single nucleotide polymorphisms, genome size variation may be affected by ongoing hybridisation and the extent of population subdivision. In addition to selection on associated traits, genome size is predicted to be affected indirectly by selection due to pleiotropy of the underlying presence/absence variants.
Article
Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying datasets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate dataset for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly-introduced genome workflow increases the efficiency and runtimes especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins, to transcriptomes and gene sets.
Article
High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1–4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.
Article
Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly. Hifiasm is a haplotype-resolved de novo genome assembler for long-read high-fidelity sequencing data based on phased assembly graphs.
Article
Traditionally, reference genomes in crop species rely on the assembly of one accession, thus occulting most of intraspecific diversity. However, rearrangements, gene duplications, and transposable element content may have a large impact on the genomic structure, which could generate new phenotypic traits. Comparing two Brassica rapa genomes recently sequenced and assembled using long-read technology and optical mapping, we investigated structural variants and repetitive content between the two accessions and genome size variation among a core collection. We explored the structural consequences of the presence of large repeated sequences in B. rapa ‘Z1’ genome vs. the B. rapa ‘Chiifu’ genome, using comparative genomics and cytogenetic approaches. First, we showed that large genomic variants on chromosomes A05, A06, A09, and A10 are due to large insertions and inversions when comparing B. rapa ‘Z1’ and B. rapa ‘Chiifu’ at the origin of important length differences in some chromosomes. For instance, lengths of ‘Z1’ and ‘Chiifu’ A06 chromosomes were estimated in silico to be 55 and 29 Mb, respectively. To validate these observations, we compared using fluorescent in situ hybridization (FISH) the two A06 chromosomes present in an F1 hybrid produced by crossing these two varieties. We confirmed a length difference of 17.6% between the A06 chromosomes of ‘Z1’ compared to ‘Chiifu.’ Alternatively, using a copy number variation approach, we were able to quantify the presence of a higher number of rDNA and gypsy elements in ‘Z1’ genome compared to ‘Chiifu’ on different chromosomes including A06. Using flow cytometry, the total genome size of 12 Brassica accessions corresponding to a B. rapa available core collection was estimated and revealed a genome size variation of up to 16% between these accessions as well as some shared inversions. This study revealed the contribution of long-read sequencing of new accessions belonging to different cultigroups of B. rapa and highlighted the potential impact of differential insertion of repeat elements and inversions of large genomic regions in genome size intraspecific variability.
Article
Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.
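To make the idea of reference-free evaluation by k-mer set operations concrete, here is a minimal sketch that compares the k-mers of an assembly with those of the reads and derives a consensus quality value and a completeness fraction. It is an illustration under stated assumptions, not Merqury's implementation: the helper names (`kmers`, `kmer_qv`, `kmer_completeness`) and the toy sequences are hypothetical, k-mers are taken from one strand only, distinct k-mers are counted rather than all occurrences, and every read k-mer is treated as reliable.

```python
import math


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence (single strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_qv(assembly_kmers, read_kmers, k):
    """Phred-scaled consensus quality from k-mer agreement.

    An assembly k-mer absent from the reads is assumed to contain an error.
    If a fraction `shared / total` of k-mers is error-free, the per-base
    probability of being correct is that fraction raised to the power 1/k.
    """
    total = len(assembly_kmers)
    shared = len(assembly_kmers & read_kmers)
    per_base_correct = (shared / total) ** (1.0 / k)
    error_rate = 1.0 - per_base_correct
    if error_rate == 0:
        return math.inf  # no unsupported k-mers were observed
    return -10.0 * math.log10(error_rate)


def kmer_completeness(assembly_kmers, read_kmers):
    """Fraction of read k-mers that are recovered in the assembly."""
    return len(assembly_kmers & read_kmers) / len(read_kmers)


# Hypothetical toy data: two error-free reads and two candidate assemblies.
K = 4
read_set = kmers("ACGTACGGA", K) | kmers("CGGATTC", K)
good_asm = kmers("ACGTACGGATTC", K)
bad_asm = kmers("ACGTACGCATTC", K)  # one substituted base

print(kmer_qv(good_asm, read_set, K), kmer_completeness(good_asm, read_set))
print(kmer_qv(bad_asm, read_set, K), kmer_completeness(bad_asm, read_set))
```

A single substituted base removes up to k read-supported k-mers from the assembly, which is why both the quality value and the completeness drop for the second toy assembly.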
Article
Measuring genome size across different species can yield important insights into evolution of the genome and allow for more informed decisions when designing next-generation genomic sequencing projects. New techniques for estimating genome size using shallow genomic sequence data have emerged which have the potential to augment our knowledge of genome sizes, yet these methods have only been used in a limited number of empirical studies. In this project, we compare estimation methods using next-generation sequencing (k-mer methods and average read depth of single-copy genes) to measurements from flow cytometry, a standard method for genome size measures, using ground beetles (Carabidae) and other members of the beetle suborder Adephaga as our test system. We also present a new protocol for using read-depth of single-copy genes to estimate genome size. Additionally, we report flow cytometry measurements for five previously unmeasured carabid species, as well as 21 new draft genomes and six new draft transcriptomes across eight species of adephagan beetles. No single sequence-based method performed well on all species, and all tended to underestimate the genome sizes, although only slightly in most samples. For one species, Bembidion sp. nr. transversale, most sequence-based methods yielded estimates half the size suggested by flow cytometry.
Article
An important assessment prior to genome assembly and related analyses is genome profiling, where the k-mer frequencies within raw sequencing reads are analyzed to estimate major genome characteristics such as size, heterozygosity, and repetitiveness. Here we introduce GenomeScope 2.0 (https://github.com/tbenavi1/genomescope2.0), which applies combinatorial theory to establish a detailed mathematical model of how k-mer frequencies are distributed in heterozygous and polyploid genomes. We describe and evaluate a practical implementation of the polyploid-aware mixture model that quickly and accurately infers genome properties across thousands of simulated and several real datasets spanning a broad range of complexity. We also present a method called Smudgeplot (https://github.com/KamilSJaron/smudgeplot) to visualize and estimate the ploidy and genome structure of a genome by analyzing heterozygous k-mer pairs. We successfully apply the approach to systems of known variable ploidy levels in the Meloidogyne genus and the extreme case of octoploid Fragaria × ananassa. Prior to genome assembly, the raw sequencing reads must be analyzed for assessment of major genome characteristics such as genome size, heterozygosity, and repetitiveness. For this purpose, the authors introduce GenomeScope 2.0, an extension of GenomeScope for polyploid genomes, and Smudgeplot, which can estimate a genome’s ploidy.
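The simplest form of k-mer genome profiling can be sketched as follows: given a histogram of k-mer multiplicities from the raw reads, the main coverage peak approximates the sequencing depth of single-copy sequence, and dividing the total number of non-error k-mers by that depth gives a rough genome size. This is a deliberately naive peak-based estimate, not the mixture model that GenomeScope 2.0 fits; the function name `estimate_genome_size`, the `min_multiplicity` cutoff, and the toy histogram are all hypothetical.

```python
def estimate_genome_size(histogram, min_multiplicity=5):
    """Naive genome size estimate from a k-mer multiplicity histogram.

    `histogram` maps a multiplicity (how many times a k-mer was seen in the
    reads) to the number of distinct k-mers with that multiplicity.
    Multiplicities below `min_multiplicity` are discarded as likely
    sequencing errors. The most populated remaining multiplicity is taken
    as the single-copy coverage peak, and genome size is approximated as
    total k-mer occurrences divided by that peak.
    """
    kept = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not kept:
        raise ValueError("no k-mers remain after error filtering")
    peak = max(kept, key=kept.get)
    total_occurrences = sum(m * n for m, n in kept.items())
    return round(total_occurrences / peak)


# Hypothetical toy histogram: an error spike at low multiplicity, a main
# peak around 20x, and a small repeat bump at 40x.
toy_histogram = {
    1: 5000, 2: 300,
    18: 800, 19: 1500, 20: 2000, 21: 1500, 22: 800,
    40: 100,
}
toy_size = estimate_genome_size(toy_histogram)
print(toy_size)
```

Because the estimate relies on a single peak, it would be expected to mislead when heterozygosity or polyploidy produces several peaks of comparable height, which is the situation the polyploid-aware model described above is designed to handle.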
Article
Reconstruction of target genomes from sequence data produced by instruments that are agnostic as to the species-of-origin may be confounded by contaminant DNA. Whether introduced during sample processing or through co-extraction alongside the target DNA, if insufficient care is taken during the assembly process, the final assembled genome may be a mixture of data from several species. Such assemblies can confound sequence-based biological inference and, when deposited in public databases, may be included in downstream analyses by users unaware of underlying problems. We present BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies. BlobToolKit can be used to process assembly, read and analysis files for fully reproducible interactive exploration in the browser-based Viewer. BlobToolKit can be used during assembly to filter non-target DNA, helping researchers produce assemblies with high biological credibility. We have been running an automated BlobToolKit pipeline on eukaryotic assemblies publicly available in the International Nucleotide Sequence Data Collaboration and are making the results available through a public instance of the Viewer at https://blobtoolkit.genomehubs.org/view. We aim to complete analysis of all publicly available genomes and then maintain currency with the flow of new genomes. We have worked to embed these views into the presentation of genome assemblies at the European Nucleotide Archive, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the Viewer.
Article
Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Errors, particularly inversions and fusions across chromosomes, remain higher than alternate scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.
Article
Full-text available
Aristolochiaceae, comprising about 600 species, is a unique plant family containing aristolochic acids (AAs). In this study, we sequenced seven species of Aristolochia and retrieved eleven published chloroplast (cp) genomes for comparative genomic analysis and phylogenetic reconstruction. The results show that the cp genomes had a typical quadripartite structure with conserved genome arrangement and moderate divergence. The cp genomes range from 159,308 bp to 160,520 bp in length and have a similar GC content of 38.5%–38.9%. A total of 113 genes were identified, including 79 protein-coding genes, 30 tRNAs and four rRNAs. Although genomic structure and size were highly conserved, the IR-SC boundary regions were variable between these seven cp genomes. The trnH-GUG genes are one of the major differences between the plastomes of the two subgenera Siphisia and Aristolochia. We analyzed the features of nucleotide substitutions, the distribution of repeat sequences and simple sequence repeats (SSRs), and positive selection in the cp genomes, and identified 16 hotspot regions of genome divergence that could be utilized as potential markers for phylogeny reconstruction. Phylogenetic relationships of the family Aristolochiaceae inferred from the 18 cp genome sequences were consistent and robust, using maximum parsimony (MP), maximum likelihood (ML), and Bayesian analysis (BI) methods.
Article
Full-text available
We present HiGlass, an open source visualization tool built on web technologies that provides a rich interface for rapid, multiplex, and multiscale navigation of 2D genomic maps alongside 1D genomic tracks, allowing users to combine various data types, synchronize multiple visualization modalities, and share fully customizable views with others. We demonstrate its utility in exploring different experimental conditions, comparing the results of analyses, and creating interactive snapshots to share with collaborators and the broader public. HiGlass is accessible online at http://higlass.io and is also available as a containerized application that can be run on any platform.
Article
Full-text available
Despite an abundance of new studies about topologically associating domains (TADs), the role of genetic information in TAD formation is still not fully understood. Here we use our software, HiCExplorer (hicexplorer.readthedocs.io), to annotate >2800 high-resolution (570 bp) TAD boundaries in Drosophila melanogaster. We identify eight DNA motifs enriched at boundaries, including a motif bound by the M1BP protein, and two new boundary motifs. In contrast to mammals, the CTCF motif is only enriched on a small fraction of boundaries flanking inactive chromatin while most active boundaries contain the motifs bound by the M1BP or Beaf-32 proteins. We demonstrate that boundaries can be accurately predicted using only the motif sequences at open chromatin sites. We propose that DNA sequence guides the genome architecture by allocation of boundary proteins in the genome. Finally, we present an interactive online database to access and explore the spatial organization of fly, mouse and human genomes, available at http://chorogenome.ie-freiburg.mpg.de.
Article
Full-text available
Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution.
Article
Full-text available
Background: Long read technologies have revolutionized de novo genome assembly by generating contigs orders of magnitude longer than that of short read assemblies. Although assembly contiguity has increased, it usually does not reconstruct a full chromosome or an arm of the chromosome, resulting in an unfinished chromosome level assembly. To increase the contiguity of the assembly to the chromosome level, different strategies are used which exploit long range contact information between chromosomes in the genome. Methods: We develop a scalable and computationally efficient scaffolding method that can boost the assembly contiguity to a large extent using genome-wide chromatin interaction data such as Hi-C. Results: We demonstrate an algorithm that uses Hi-C data for longer-range scaffolding of de novo long read genome assemblies. We tested our methods on the human and goat genome assemblies. We compare our scaffolds with the scaffolds generated by LACHESIS based on various metrics. Conclusion: Our new algorithm SALSA produces more accurate scaffolds compared to the existing state of the art method LACHESIS.
Article
Full-text available
Aristolochiaceae, a family of worldwide distribution comprising about 500 species, is a member of Piperales. Although Piperales is clearly monophyletic, the precise relationship within the order is ambiguous due to inconsistent placement of Lactoris fernandeziana. The appearance in some studies of Lactoris within Aristolochiaceae and the incongruence in generic treatments have also raised questions about the infrastructure of the family. This study addresses the overall generic relationships in Aristolochiaceae and its position in Piperales based on dense taxon sampling and sequence data from the plastid trnL-F region. The study resolved Piperales consisting of two major clades (Piperaceae plus Saururaceae and Lactoridaceae plus Aristolochiaceae) and Lactoris nested within Aristolochiaceae but with low support. The concept of two subfamilies in Aristolochiaceae, Asaroideae and Aristolochioideae, gains maximum statistical support. A generic treatment of Aristolochiaceae based on trnL-F is proposed which is congruent with recent analyses based on morphological characters.
Article
Full-text available
Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. Here, we present QUAST, a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with a reference genome, as well as without a reference. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. In this study, we used QUAST to compare several genome assemblers on three datasets. QUAST tables and plots for all of them are available in the Supplementary Material, and interactive versions of these reports are on the QUAST website. Availability: http://bioinf.spbau.ru/quast. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Background: Previous studies in basal angiosperms have provided insight into the diversity within the angiosperm lineage and helped to polarize analyses of flowering plant evolution. However, there is still not an experimental system for genetic studies among basal angiosperms to facilitate comparative studies and functional investigation. It would be desirable to identify a basal angiosperm experimental system that possesses many of the features found in existing plant model systems (e.g., Arabidopsis and Oryza). Results: We have considered all basal angiosperm families for general characteristics important for experimental systems, including availability to the scientific community, growth habit, and membership in a large basal angiosperm group that displays a wide spectrum of phenotypic diversity. Most basal angiosperms are woody or aquatic, thus are not well-suited for large scale cultivation, and were excluded. We further investigated members of Aristolochiaceae for ease of culture, life cycle, genome size, and chromosome number. We demonstrated self-compatibility for Aristolochia elegans and A. fimbriata, and transformation with a GFP reporter construct for Saruma henryi and A. fimbriata. Furthermore, A. fimbriata was easily cultivated with a life cycle of just three months, could be regenerated in a tissue culture system, and had one of the smallest genomes among basal angiosperms. An extensive multi-tissue EST dataset was produced for A. fimbriata that includes over 3.8 million 454 sequence reads. Conclusions: Aristolochia fimbriata has numerous features that facilitate genetic studies and is suggested as a potential model system for use with a wide variety of technologies. Emerging genetic and genomic tools for A. fimbriata and closely related species can aid the investigation of floral biology, developmental genetics, biochemical pathways important in plant-insect interactions as well as human health, and various other features present in early angiosperms.
Article
Full-text available
Molecular phylogenetic analyses were conducted to determine relationships and to investigate character evolution for the Troidini/Aristolochia interaction, in an attempt to answer the following questions: (1) what is the present pattern of use of Aristolochia by these butterflies; (2) is the pattern we see today related to the phylogeny of plants or to their chemical composition; (3) can the geographical distribution of Aristolochia explain the host plant use observed today; and (4) how did the interaction between Troidini and Aristolochia evolve? Analyses of character optimization suggest that the current pattern of host plant use of these butterflies does not seem to be constrained by the phylogeny of their food plants, neither by the secondary chemicals in these plants nor by their geographical similarity. The current host plant use in these butterflies seems to be simply opportunistic, with species with a wider geographical range using more species of host plants than those with a more restricted distribution. © 2007 The Linnean Society of London, Biological Journal of the Linnean Society, 2007, 90, 247–261.
Article
Full-text available
The California population of the pipevine swallowtail Battus philenor is a specialist on the Dutchman’s pipe Aristolochia californica, an endemic vine that is densely covered with trichomes. Populations of B. philenor outside California use other Aristolochia species that are largely glabrous. The average clutch size of the pipevine swallowtail is larger in California compared with populations elsewhere and larvae feed gregariously until late in the third instar. In the field, caterpillars consumed more leaf material and showed preference for portions of leaves with trichomes removed. However, large groups of caterpillars were consistently observed feeding on the apical portion of the plant, where trichome density was highest. Smaller groups of caterpillars were observed feeding more often on mature leaves on the lower portions of the plant, where trichome density was lower. Laboratory experiments showed that the walking speed of a commonly observed predator, larvae of the green lacewing Chrysopa carnea, was reduced as trichome density increased. Furthermore, lacewing search efficiency and capture rate of a model prey item were compromised by high trichome density. In an additional field experiment, no difference was found in the percentage mortality of groups of four and 12 caterpillars. However, growth rate of the larger group was accelerated by 25% compared with smaller groups. In an experiment using a ladybird beetle larva Hippodamia convergens as the predator, no difference was observed in absolute mortality of caterpillars, suggesting that group size does not function directly as a defence against predators. First instar caterpillars are most vulnerable to predators, thus feeding in larger groups may benefit caterpillars by accelerating growth. Feeding in large groups may also be an effective strategy for B. philenor to overcome plant trichomes and feed on portions of the plant conducive to faster development. However, feeding on areas with dense trichomes does not appear to provide larvae with a refuge from predators.
Article
Full-text available
Gardens with nectar sources and larval host plants have been proposed to stem the decline in butterfly abundance caused by habitat loss. However, no study has provided evidence that gardens benefit butterflies. We examined the use of natural sites and gardens in the San Francisco Bay Area by the butterfly, Battus philenor. We found that natural sites were more likely to attract adult B. philenor, received more oviposition, and had higher juvenile survival than garden sites. Butterflies were more likely to be present in gardens with established populations of the host plant, Aristolochia californica, growing in the sun. Battus philenor are unlikely to visit gardens with host plants planted within the past 7 years. Gardens between the ages of 8–40 years received oviposition, but did not always support completion of larval development of B. philenor. In gardens with host plants over 40 years of age, B. philenor consistently survived from egg to the adult stage. Natural enemy induced mortality of eggs did not differ between garden and natural sites, but overall egg survival was lower in gardens than at natural sites. It is unlikely that gardens serve as 'refugia' for B. philenor in years when populations in natural sites experience low survival or low fecundity. Even in gardens capable of supporting larvae to maturity, the density of eggs and survival rates were lower than in natural populations of the host plant, suggesting that gardens were not optimal habitats. Therefore, without evidence that juvenile abundance and survival rates in gardens match or exceed those in natural sites, it is most likely that gardens act as population sinks for B. philenor.
Article
Full-text available
Riparian vegetation plays an integral role in the ecology of the streams it borders, and in many western US forests, is subjected to frequent wildfire disturbances. Many questions concerning the role of natural fire in the dynamics of riparian zone vegetation remain unanswered. This case study explores the relationships between wildfire burn patterns, stream channel topography, and the short-term response of riparian vegetation to fire along two creeks in the northern Sierra Nevada mixed-conifer forest. Post-fire sampling along 60, 3 m wide transects across riparian zones was used to document the topography, species distribution, sprouting response, and seedling recruitment 1 year after the Lookout fire in the Plumas National Forest, CA. Our results indicate that larger riparian zones acted as natural fire breaks, limiting the progression of the predominantly backing fire downhill toward the stream. On Fourth Water creek's steeper first terraces, where crown fires occurred, the percentage of burned plants that sprouted was higher than in the less-severely burned and more extensive first terraces of Third Water creek (93% versus 33%, P < 0.05). Total seedling recruitment was higher along Fourth Water creek (69 versus 35 seedlings, P < 0.05), while plant regeneration along Third Water creek was primarily vegetative. Along Fourth Water creek, the percent of burned hardwoods that sprouted increased with proximity to the water's edge from 33% on the slope above the riparian zone to 95% on the gravel bar, suggesting that moisture content plays a role in riparian species response to fire. An influx of white fir (Abies concolor Gordon & Glend. (Lindl.)) seedlings on the second terraces of Third Water creek may indicate a shift in species composition if future fires are suppressed and regeneration trends do not change significantly in the next few years. These results contribute to the limited research on natural fire in riparian zones, and can inform management strategies designed to restore and maintain riparian vegetation in the fire-prone forests of the Sierra Nevada.
Article
Full-text available
Many programs for aligning short sequencing reads to a reference genome have been developed in the last 2 years. Most of them are very efficient for short reads but inefficient or not applicable for reads >200 bp because the algorithms are heavily and specifically tuned for short queries with low sequencing error rate. However, some sequencing platforms already produce longer reads and others are expected to become available soon. For longer reads, hashing-based software such as BLAT and SSAHA2 remain the only choices. Nonetheless, these methods are substantially slower than short-read aligners in terms of aligned bases per unit time. We designed and implemented a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences up to 1 Mb against a large sequence database (e.g. the human genome) with a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAT, and is several to tens of times faster than both. Availability: http://bio-bwa.sourceforge.net
Article
The California Conservation Genomics Project (CCGP) is a unique, critically important step forward in the use of comprehensive landscape genetic data to modernize natural resource management at a regional scale. We describe the CCGP, including all aspects of project administration, data collection, current progress, and future challenges. The CCGP will generate, analyze, and curate a single high-quality reference genome and 100-150 resequenced genomes for each of 153 species projects (representing 235 individual species) that span the ecological and phylogenetic breadth of California’s marine, freshwater, and terrestrial ecosystems. The resulting portfolio of roughly 20,000 resequenced genomes will be analyzed with identical informatic and landscape genomic pipelines, providing a comprehensive overview of hotspots of within-species genomic diversity, potential and realized corridors connecting these hotspots, regions of reduced diversity requiring genetic rescue, and the distribution of variation critical for rapid climate adaptation. After two years of concerted effort, full funding ($12M USD) has been secured, species identified, and funds distributed to 68 laboratories and 114 investigators drawn from all 10 University of California campuses. The remaining phases of the CCGP include completion of data collection and analyses, and delivery of the resulting genomic data and inferences to state and federal regulatory agencies to help stabilize species declines. The aspirational goals of the CCGP are to identify geographic regions that are critical to long term preservation of California biodiversity, prioritize those regions based on defensible genomic criteria, and provide foundational knowledge that informs management strategies at both the individual species and ecosystem levels.
Article
Motivation: Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis. Results: We developed a file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium. Availability: Cooler is cross-platform, BSD-licensed, and can be installed from the Python package index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
A phylogenetic analysis was conducted to examine the monophyly and relationships of the four broadly defined genera of Aristolochiaceae. Seventy-two morphological characters were coded from representatives of these genera and from a broad selection of potential outgroups. The data support monophyly of the Aristolochiaceae and monophyly of the broadly defined genera Aristolochia, Thottea, and Asarum. The genera are grouped into two clades within the family, Thottea + Aristolochia and Asarum + Saruma. Based on the results of these analyses, Asaroideae, which have been circumscribed by some authors to consist of Saruma, Asarum, and Thottea, are paraphyletic, and should be emended to exclude Thottea.
Article
Toxic plants with sequestering specialists are presented with a problem because plant-derived toxins protect herbivores against natural enemies. It has been suggested that early induction of toxins and later relaxation of these defenses may help the plant resolve this problem because neonate caterpillars incur the physiological cost of dealing with toxins in early life, but are denied toxins when they are able to sequester them efficiently. In California, the pipevine swallowtail, Battus philenor L. (Lepidoptera: Papilionidae), feeds exclusively on Aristolochia californica Torrey (Aristolochiaceae), an endemic vine that contains toxic alkaloids called aristolochic acids that caterpillars sequester to provide chemical defense in immature and adult stages. In a field experiment, the concentration of aristolochic acids doubled in the plant following leaf damage and returned to constitutive levels after six days. Neonate pipevine swallowtail caterpillars showed no aversion to high levels of aristolochic acid in a preference test. Caterpillars reared on leaves with supplemented aristolochic acid showed no physiological cost or increased mortality compared to caterpillars reared on un-supplemented leaves. Searching efficiency and capture rate of lacewing larvae (Chrysoperla), a common predator of first instar caterpillars, were compromised significantly after feeding on caterpillars reared on leaves with supplemented concentrations of aristolochic acid compared to caterpillars feeding on control plants. Additionally, mortality of lacewings increased when they were provided with a diet of B. philenor caterpillars reared on supplemented leaves compared to caterpillars reared on control leaves. Thus, the induction of aristolochic acids in the plant following leaf damage does not resolve the problem confronted by the plant and may confer benefits to this sequestering specialist.
Article
The pipevine swallowtail, Battus philenor, feeds exclusively on plants in the genus Aristolochia, many of which are known to contain the toxic alkaloids collectively known as aristolochic acids. Pipevine swallowtails sequester these compounds and use them for their own defense against predators. Numerous palatable butterflies are involved in Batesian mimicry complexes with B. philenor over its range. The California subspecies of the pipevine swallowtail, B. philenor hirsuta, has no mimics. Analysis of the butterfly and its only host plant, Aristolochia californica, indicates that both contain aristolochic acid. Aristolochic acids (I) and (II) are the primary aristolochic acids found in A. californica. The highest concentration of aristolochic acids was found in the flowers, which bloom before B. philenor emerges. Concentrations of aristolochic acids decreased in the leaves but not in stem tissue over the course of the season. Butterflies contained primarily aristolochic acid (I). Aristolochic acid content of individuals from Arizona, which are involved in mimicry complexes, did not differ from California populations. Thus, lack of California mimics cannot be attributed to low aristolochic acid content of the model.
Article
Most models of genome size evolution emphasize changes in relative rates of and/or the efficacy of selection on insertions and deletions. However, transposable elements (TEs) are a major contributor to genome size evolution, and since they experience their own selective pressures for expansion, genome size changes may in part be driven by the dynamics of co-evolution between TEs and their hosts. Under this perspective, predictions about the conditions that allow for genome expansion may be altered. In this review, we outline the evidence for TE-host co-evolution, discuss the conditions under which these dynamics can change, and explore the possible contribution to the evolution of genome size. Aided partly by advances in our understanding of the mechanisms of TE silencing via small RNAs, there is growing evidence that the evolution of transposition rates can be important in driving genome expansion and contraction. Shifts in genome size and transposon abundance associated with interspecific hybridization and changes in mating system are consistent with an important role for transposition rate evolution, although other possible explanations persist. More understanding of the potential for the breakdown of host silencing mechanisms and/or the potential for TEs to evade host immune responses will improve our understanding of the importance of changes in TE activity in driving genome size evolution.
Mirnylab/Pairtools: V0.2.0
  • Goloborodko