Article

A robust genome assembly with transcriptomic data from the striped bark scorpion, Centruroides vittatus

Abstract

Scorpions, a seemingly primitive, stinging arthropod taxon, are known to exhibit marked diversity in their venom components. These venoms are known for their human pathology, but they are also important as models for therapeutic and drug development applications. In this study, we report a high-quality genome assembly and annotation of the striped bark scorpion, Centruroides vittatus, created with several shotgun libraries. The final assembly is 760 Mb in size, with a BUSCO score of 97.8%, a GC content of 30.85%, and an N50 of 2.35 Mb. We estimated 36,189 proteins, with 37.32% assigned to Gene Ontology (GO) terms in our GO annotation analysis. We mapped venom toxin genes to 18 contigs and 2 scaffolds. We were also able to identify expression differences between venom gland (telson) and body tissue (carapace), with 19 sodium toxin and 14 potassium toxin genes mapped to 18 contigs and 2 scaffolds. This assembly, along with our transcriptomic data, provides further data to investigate scorpion venom genomics.
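
The abstract summarizes the assembly by its N50 and GC content. As a minimal sketch of how these two statistics are conventionally defined (not the authors' pipeline, which the abstract does not describe), the Python below computes both from a list of sequences. The function names and the toy contigs are hypothetical and serve only as illustration.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def gc_percent(sequences):
    """Return the percentage of G and C among the A, C, G, T bases.

    Ambiguous bases such as N are ignored (an assumption of this sketch).
    """
    gc = 0
    acgt = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


# Hypothetical toy contigs, not data from the assembly.
toy_contigs = ["ATGCGC", "ATAT", "GG", "A"]
toy_n50 = n50([len(c) for c in toy_contigs])
toy_gc = gc_percent(toy_contigs)
```

For a real assembly the same functions would be applied to the contig or scaffold sequences read from a FASTA file; dedicated tools such as QUAST report these values directly.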

... In a study of the A. crassicauda transcriptome assembled by Trinity, a high BUSCO completeness score (>96%) was also reported along with a high "duplicated" BUSCO score, and the clustering step decreased the number of duplicated transcript copies (Salabi and Jafari, 2022). Similarly, in the transcriptome analysis of Centruroides vittatus a high BUSCO score of 97.8% was achieved for the final assembly (Yamashita et al., 2024). ...
Article
Introduction Scorpion venom is a rich source of biologically active peptides and proteins. Transcriptome analysis of the venom gland provides detailed insights about peptide and protein venom components. Following the transcriptome analysis of different species in our previous studies, our research team has focused on Hottentotta zagrosensis, one of the endemic scorpions of Iran, to obtain information about its venom proteins, in order to develop biological research focusing on medicinal applications of scorpion venom components and antivenom production. To gain insights into the protein composition of this scorpion venom, we performed transcriptomic analysis. Methods Transcriptomic analysis of the venom gland of H. zagrosensis, prepared from the Khuzestan province, was performed through Illumina paired-end sequencing (RNA-Seq), Trinity de novo assembly, CD-Hit-EST clustering, and annotation of identified primary structures using bioinformatics approaches. Results Transcriptome analysis showed the presence of 96.4% of complete arthropod BUSCOs, indicating a high-quality assembly. From a total of 45,795,108 paired-end 150 bp trimmed reads, the clustering step resulted in the generation of 101,180 de novo assembled transcripts with an N50 size of 1,149 bp. 96,071 Unigenes and 131,235 transcripts had a significant similarity (E-value 1e-3) with known proteins from UniProt, Swissprot, the Animal toxin annotation project, and the Pfam database. The results were validated using InterProScan. These mainly correspond to ion channel inhibitors, metalloproteinases, neurotoxins, protease inhibitors, protease activators, cysteine-rich secretory proteins, phospholipase A enzymes, antimicrobial peptides, growth factors, lipolysis-activating peptides, hyaluronidase, and phospholipase D. Our venom gland transcriptomic approach identified several biologically active peptides, including five LVP1-alpha and LVP1-beta isoforms, which we named HzLVP1_alpha1, HzLVP1_alpha2, HzLVP1_alpha3, HzLVP1_beta1, and HzLVP1_beta and have extensively characterized here. Discussion Except for HzLVP1_beta1, all other identified LVP1s are predicted to be stable proteins (instability index <40). Moreover, all isoforms of the LVP1 alpha and beta subunits are thermostable, with HzLVP1_alpha2 the most stable (aliphatic index = 71.38). HzLVP1_alpha2 also has the longest half-life. The three-dimensional structures of all identified proteins are compacted by three disulfide bridges. The extra cysteine residue may allow the proteins to form a hetero- or homodimer. LVP1 subunits of H. zagrosensis potentially interact with adipose triglyceride lipase (ATGL) and hormone-sensitive lipase (HSL), two key enzymes in the regulation of lipolysis in adipocytes, suggesting pharmacological properties of these identified proteins.
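The aliphatic index quoted in the abstract above is conventionally computed from the mole percentages of alanine, valine, isoleucine, and leucine, with weights 2.9 for valine and 3.9 for isoleucine plus leucine (the definition used by tools such as ProtParam). The sketch below assumes that standard formula; the abstract does not say which tool the authors used, and the function name and test peptide are hypothetical.

```python
def aliphatic_index(sequence):
    """Aliphatic index = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)),

    where X(aa) is the mole percentage of that residue in the sequence.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    def mole_percent(residue):
        # Percentage of positions occupied by the given one-letter residue.
        return 100.0 * seq.count(residue) / len(seq)

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


# Hypothetical 10-residue peptide, not a sequence from the study:
# one each of A, V, I, L (10% each) and six glycines.
example_index = aliphatic_index("AVILGGGGGG")
```

Higher values indicate a larger relative volume of aliphatic side chains, which is commonly read as a proxy for thermostability of globular proteins.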
Article
Spiders are a hyperdiverse taxon and among the most abundant predators in nearly all terrestrial habitats. Their success is often attributed to key developments in their evolution such as silk and venom production and major apomorphies such as a whole-genome duplication. Resolving deep relationships within the spider tree of life has been historically challenging, making it difficult to measure the relative importance of these novelties for spider evolution. Whole-genome data offer an essential resource in these efforts, but also for functional genomic studies. Here, we present de novo assemblies for three spider species: Ryuthela nishihirai (Liphistiidae), a representative of the ancient Mesothelae, the suborder that is sister to all other extant spiders; Uloborus plumipes (Uloboridae), a cribellate orbweaver whose phylogenetic placement is especially challenging; and Cheiracanthium punctorium (Cheiracanthiidae), which represents only the second family to be sequenced in the hyperdiverse Dionycha clade. These genomes fill critical gaps in the spider tree of life. Using these novel genomes along with 25 previously published ones, we examine the evolutionary history of spidroin gene and structural hox cluster diversity. Our assemblies provide critical genomic resources to facilitate deeper investigations into spider evolution. The near chromosome-level genome of the 'living fossil' R. nishihirai represents an especially important step forward, offering new insights into the origins of spider traits.
Article
Body tissue and venom glands from an eastern population of the scorpion Centruroides vittatus (Say, 1821) were homogenized and their molecular constituents extracted to characterize putative sodium β toxin gene diversity and RT-qPCR, transcriptomic, and proteomic variation. We cloned sodium β toxins from genomic DNA, conducted RT-qPCR experiments with seven sodium β toxin variants, performed venom gland tissue RNA-seq, and isolated venom proteins for mass spectrometry. We identified >70 putative novel sodium β toxin genes, 111 toxin gene transcripts, and 24 different toxin proteins, and quantified sodium β toxin gene expression variation among individuals and between sexes. Our analyses contribute to the growing evidence that venom toxicity among scorpion taxa and their populations may be associated with toxin gene diversity, specific toxin transcript variation, and subsequent protein production. Here, slight transcript variation among toxin gene variants may contribute to the major toxin protein variation in individual scorpion venom composition.
Article
Genome sequencing of a diverse array of arthropod genomes is already underway, and these genomes will be used to study human health, agriculture, biodiversity, and ecology. These new genomes are intended to serve as community resources and provide the foundational information required to apply ‘omics technologies to a more diverse set of species. However, biologists require genome annotation to use these genomes and derive a better understanding of complex biological systems. Genome annotation incorporates two related, but distinct, processes: Demarcating genes and other elements present in genome sequences (structural annotation); and associating a function with genetic elements (functional annotation). While there are well-established and freely available workflows for structural annotation of gene identification in newly assembled genomes, workflows for providing the functional annotation required to support functional genomics studies are less well understood. Genome-scale functional annotation is required for functional modeling (enrichment, networks, etc.). A first-pass genome-wide functional annotation effort can rapidly identify under-represented gene sets for focused community annotation efforts. We present an open-source, open access, and containerized pipeline for genome-scale functional annotation of insect proteomes and apply it to various arthropod species. We show that the performance of the predictions is consistent across a set of arthropod genomes with varying assembly and annotation quality.
Article
Background Artificial selection of modern meat-producing chickens (broilers) for production characteristics has led to dramatic changes in phenotype, yet the impact of this selection on metabolic and molecular mechanisms is poorly understood. The first 3 weeks post-hatch represent a critical period of adjustment, during which the yolk lipid is depleted and the bird transitions to reliance on a carbohydrate-rich diet. As the liver is the major organ involved in macronutrient metabolism and nutrient allocation, a combined transcriptomics and metabolomics approach has been used to evaluate hepatic metabolic reprogramming between Day 4 (D4) and Day 20 (D20) post-hatch. Results Many transcripts and metabolites involved in metabolic pathways differed in their abundance between D4 and D20, representing different stages of metabolism that are enhanced or diminished. For example, at D20 the first stage of glycolysis that utilizes ATP to store or release glucose is enhanced, while at D4, the ATP-generating phase is enhanced to provide energy for rapid cellular proliferation at this time point. This work has also identified several metabolites, including citrate, phosphoenolpyruvate, and glycerol, that appear to play pivotal roles in this reprogramming. Conclusions At Day 4, metabolic flexibility allows for efficiency to meet the demands of rapid liver growth under oxygen-limiting conditions. At Day 20, the liver’s metabolism has shifted to process a carbohydrate-rich diet that supports the rapid overall growth of the modern broiler. Characterizing these metabolic changes associated with normal post-hatch hepatic development has generated testable hypotheses about the involvement of specific genes and metabolites, clarified the importance of hypoxia to rapid organ growth, and contributed to our understanding of the molecular changes affected by decades of artificial selection.
Article
A major challenge with long-read sequencing data is their high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short-read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average, with a median error rate as low as 0.22%. SNP calls in Ratatosk-corrected reads are nearly 99% accurate, and indel call accuracy is increased by up to 37%. An assembly of Ratatosk-corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and fewer misassemblies than a PacBio HiFi reads assembly.
Article
The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license ( https://github.com/Dfam-consortium/RepeatModeler , http://www.repeatmasker.org/RepeatModeler/ ).
Article
Venoms evolved convergently in diverse animal lineages as key adaptations that increase the evolutionary fitness of species, and they are employed in manifold ways for defense, predation, and competition. They constitute complex cocktails of various toxins that feature a broad range of bioactivities. The majority of described venom proteins belong to protein families that are known to comprise housekeeping genes or harbor protein domains that are present in genes with non-venom related functions. However, the evolutionary processes and mechanisms that foster the origin of these venom proteins and triggered their recruitment into the venom delivery system are still debated. In most instances, single or combined proteomic and transcriptomic approaches are applied to describe venom compositions and the biological context of venoms. For neglected species these studies represent crucial contributions to improve our understanding of venom diversity on a broader scale. Nonetheless, the inference of the evolutionary origin of putative toxins in these studies could be misleading without appropriate coverage of gene populations from different tissue samples (gene completeness) or complementary genome data. Providing a valid backbone to correctly map transcriptome and proteome data, whole genome sequences facilitate a clear distinction between variability of venom proteins or toxins due to posttranslational modifications, alternative splicing, and false-positive matches that stem from sequencing or read processing and assembly errors. High-quality whole genome sequence data of venomous species are still sparse and unevenly distributed within taxon lineages. However, to reveal the evolutionary pattern of putative toxins in venomous lineages and to identify ancestral variants of venom proteins, the appropriate sampling of genomes from venomous and non-venomous species is crucial. Nevertheless, larger comparative studies that use multiple whole genome data sets to uncover processes of venom evolution are still sparse. Here, we review the general potential of comparative genomics in venomics to unravel mechanisms and patterns of evolutionary origin of toxin genes. Finally, we discuss the benefit of whole genome data to improve transcriptomics and proteomics-only studies, in particular if datasets are applied to assess the evolutionary origin of venom proteins.
Article
Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers. © 2019, The Author(s), under exclusive licence to Springer Nature America, Inc.
Article
Scorpions are an excellent system for understanding biogeographical patterns. Most major scorpion lineages predate modern landforms, making them suitable for testing hypotheses of vicariance and dispersal. The Caribbean islands are endowed with a rich and largely endemic scorpion fauna, the origins of which have not been previously investigated with modern biogeographical methods. Three sets of hypotheses have been proposed to explain present patterns of diversity in the Caribbean: (1) connections via land bridges, (2) vicariance events, and (3) overwater dispersal from continents and among islands. The present study investigates the biogeographical diversification of the New World buthid scorpion subfamily Centruroidinae Kraus, 1955, a clade of seven genera and more than 110 species; infers the ancestral distributions of these scorpions; and tests the relative roles of vicariance and dispersal in the formation of their present distributions. A fossil-calibrated molecular phylogeny was estimated with a Bayesian criterion to infer the dates of diversification events from which ancestral distributions were reconstructed, and the relative likelihood of models of vicariance vs. dispersal, calculated. Although both the timing of diversification and the ancestral distributions were congruent with the GAARlandia land-bridge hypothesis, there was no significant difference between distance-dependent models with or without the land-bridge. Heteroctenus Pocock, 1893, the Caribbean-endemic sister taxon of Centruroides Marx, 1890 provides evidence for a Caribbean ancestor, which subsequently colonized Central America and North America, and eventually re-colonized the Greater Antilles. This ‘reverse colonization’ event of a continent from an island demonstrates the importance of islands as a potential source of biodiversity.
Article
Arthropod Mycoplasma are little known endosymbionts in insects, primarily known as plant disease vectors. Mycoplasma in other arthropods such as arachnids are unknown. We report the first complete Mycoplasma genome sequenced, identified, and annotated from a scorpion, Centruroides vittatus, and designate it as Mycoplasma vittatus. We find the genome is at least a 683,827 bp single circular chromosome with a GC content of 42.7% and with 987 protein-coding genes. The putative virulence determinants include 11 genes associated with the virulence operon associated with protein synthesis or DNA transcription and ten genes with antibiotic and toxic compound resistance. Comparative analysis revealed that the M. vittatus genome is smaller than other Mycoplasma genomes and exhibits a higher GC content. Phylogenetic analysis shows M. vittatus as part of the Hominis group of Mycoplasma. As arthropod genomes accumulate, further novel Mycoplasma genomes may be identified and characterized.
Article
The pig is a well-studied model animal of biomedical and agricultural importance. Genes of this species, Sus scrofa, are known from experiments and predictions, and collected at the NCBI reference sequence database section. Gene reconstruction from transcribed gene evidence of RNA-seq now can accurately and completely reproduce the biological gene sets of animals and plants. Such a gene set for the pig is reported here, including human orthologs missing from current NCBI and Ensembl reference pig gene sets, additional alternate transcripts, and other improvements. Methodology for accurate and complete gene set reconstruction from RNA is used: the automated SRA2Genes pipeline of EvidentialGene project.
Article
Here we describe NanoPack, a set of tools developed for visualization and processing of long read sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Availability and implementation: The NanoPack tools are written in Python3 and released under the GNU GPL3.0 License. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools. Contact: wouter.decoster@molgen.vib-ua.be. Supplementary information: Supplementary tables and figures are available at Bioinformatics online.
Article
This contribution attempts to bring some general information on the evolution and, in particular, on the geographic distribution of scorpion species noxious to humans. Since 95% of scorpion incidents are caused by specimens of the family Buthidae C. L. Koch, the analysis will be limited to this familial group. As in previous similar contributions, the content of this work is mostly addressed to non-specialists whose research embraces scorpions in several fields such as venom toxins and public health. Only in recent years have efforts been made to create better links between ‘academic scorpion experts’ and other academic non-specialists who use scorpions in their research. Even if greater progress can yet be expected from such exchanges, the exchanged information has proved useful in most fields of scorpion studies. Since the taxonomy of scorpions is complex, misidentifications and even more serious errors concerning scorpion classification/identification are often present in the general literature. Consequently, a precise knowledge of the distribution patterns presented by many scorpion groups and, in particular, those of infamous species, proves to be a key point in the interpretation of final results, leading to a better treatment of the problems caused by infamous scorpion species.
Article
Scorpions are among the oldest terrestrial arthropods, which are distributed worldwide, except for Antarctica and some Pacific islands. Scorpion envenomation represents a public health problem in several parts of the world. Mexico harbors the highest diversity of scorpions in the world, including some of the world’s medically important scorpion species. The systematics and diversity of Mexican scorpion fauna has not been revised in the past decade; and due to recent and exhaustive collection efforts as part of different ongoing major revisionary systematic projects, our understanding of this diversity has changed compared with previous assessments. Given the presence of several medically important scorpion species, the study of their venom in the country is also important. In the present contribution, the diversity of scorpion species in Mexico is revised and updated based on several new systematic contributions; 281 different species are recorded. Commentaries on recent venomic, ecological and behavioral studies of Mexican scorpions are also provided. A list containing the most important peptides identified from 16 different species is included. A graphical representation of the different types of components found in these venoms is also revised. A map with hotspots showing the current knowledge on scorpion distribution and areas explored in Mexico is also provided.
Article
Scorpions represent an iconic lineage of arthropods, historically renowned for their unique bauplan, ancient fossil record and venom potency. Yet, higher level relationships of scorpions, based exclusively on morphology, remain virtually untested, and no multilocus molecular phylogeny has been deployed heretofore towards assessing the basal tree topology. We applied a phylogenomic assessment to resolve scorpion phylogeny, for the first time, to our knowledge, sampling extensive molecular sequence data from all superfamilies and examining basal relationships with up to 5025 genes. Analyses of supermatrices as well as species tree approaches converged upon a robust basal topology of scorpions that is entirely at odds with traditional systematics and controverts previous understanding of scorpion evolutionary history. All analyses unanimously support a single origin of katoikogenic development, a form of parental investment wherein embryos are nurtured by direct connections to the parent's digestive system. Based on the phylogeny obtained herein, we propose the following systematic emendations: Caraboctonidae is transferred to Chactoidea (new superfamilial assignment); superfamily Bothriuroidea (revalidated) is resurrected and Bothriuridae transferred therein; and Chaerilida and Pseudochactida are synonymized with Buthida (new parvordinal synonymies). © 2015 The Author(s) Published by the Royal Society. All rights reserved.
Article
Although many NGS read pre-processing tools already existed, we could not find any tool or combination of tools which met our requirements in terms of flexibility, correct handling of paired-end data, and high performance. We have developed Trimmomatic as a more flexible and efficient pre-processing tool, which could correctly handle paired-end data. The value of NGS read pre-processing is demonstrated for both reference-based and reference-free tasks. Trimmomatic is shown to produce output which is at least competitive with, and in many cases superior to, that produced by other tools, in all scenarios tested. Trimmomatic is licensed under GPL V3. It is cross-platform (Java 1.5+ required) and available from http://www.usadellab.org/cms/index.php?page=trimmomatic CONTACT: usadel@bio1.rwth-aachen.de SUPPLEMENTARY INFORMATION: Manual and source code are available from http://www.usadellab.org/cms/index.php?page=trimmomatic.
Article
The episodic nature of natural selection and the accumulation of extreme sequence divergence in venom-encoding genes over long periods of evolutionary time can obscure the signature of positive Darwinian selection. Recognition of the true biocomplexity is further hampered by the limited taxon selection, with easy to obtain or medically important species typically being the subject of intense venom research, relative to the actual taxonomical diversity in nature. This holds true for scorpions, which are one of the most ancient terrestrial venomous animal lineages. The family Buthidae, which includes all the medically significant species, has been intensely investigated around the globe, while the remaining non-buthid families have been almost completely ignored. Australian scorpion lineages, for instance, have been completely neglected, with only a single scorpion species (Urodacus yaschenkoi) having its venom transcriptome sequenced. Hence, the lack of venom composition and toxin sequence information from an entire continent's worth of scorpions has impeded our understanding of the molecular evolution of scorpion venom. The molecular origin, phylogenetic relationships and evolutionary histories of most scorpion toxin scaffolds remain enigmatic. In this study, we have sequenced venom gland transcriptomes of a wide taxonomical diversity of scorpions from Australia, including buthid and non-buthid representatives. Using state-of-the-art molecular evolutionary analyses, we show that a majority of CSα/β toxin scaffolds have experienced episodic influence of positive selection, while most non-CSα/β linear toxins evolve under the extreme influence of negative selection. For the first time, we have unraveled the molecular origin of the major scorpion toxin scaffolds, such as scorpion venom single von Willebrand factor C-domain peptides (SV-SVC), inhibitor cystine knot (ICK), disulphide-directed beta-hairpin (DDH), bradykinin potentiating peptides (BPP), linear non-disulphide bridged peptides and antimicrobial peptides (AMP). We have thus demonstrated that even neglected lineages of scorpions are a rich pool of novel biochemical components, which have evolved over millions of years to target specific ion channels in prey animals, and as a result, possess tremendous implications in therapeutics.
Article
Second-generation sequencing technologies produce high coverage of the genome by short reads at a very low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this paper we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms very large numbers of paired-end reads into a much smaller number of longer "super-reads." The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced "mazurka"). We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two data sets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads. MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year. Aleksey Zimin, alekseyz@ipst.umd.edu.
Article
Scorpion systematics and taxonomy have recently shown a need for revision, partially due to insights from molecular techniques. Scorpion taxonomy has been difficult with morphological characters as disagreement exists among researchers with character choice for adequate species delimitation in taxonomic studies. Within the family Buthidae, species identification and delimitation is particularly difficult due to the morphological similarity among species and extensive intraspecific morphological diversity. The genus Centruroides in the western hemisphere is a prime example of the difficulty in untangling the taxonomic complexity within buthid scorpions. In this paper, we present phylogeographic, Ecological Niche Modeling, and morphometric analyses to further understand how population diversification may have produced morphological diversity in Centruroides vittatus (Say, 1821). We show that C. vittatus populations in the Big Bend and Trans-Pecos region of Texas, USA are phylogeographically distinct and may predate the Last Glacial Maximum (LGM). In addition, we suggest the extended isolation of Big Bend region populations may have created the C. vittatus variant once known as C. pantheriensis.
Article
De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for a genome sequence; this approach can be usefully applied, for instance, in research on 'non-model organisms' of ecological and evolutionary importance, cancer samples or the microbiome. In this protocol we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-seq data in non-model organisms. We also present Trinity-supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples and approaches to identify protein-coding genes. In the procedure, we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sourceforge.net. The run time of this protocol is highly dependent on the size and complexity of data to be analyzed. The example data set analyzed in the procedure detailed herein can be processed in less than 5 h.
Article
Full-text available
Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. Here, we present QUAST, a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with a reference genome, as well as without a reference. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. In this study, we used QUAST to compare several genome assemblers on three datasets. QUAST tables and plots for all of them are available in the Supplementary Material, and interactive versions of these reports are on the QUAST website. Availability: http://bioinf.spbau.ru/quast . Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Scorpionism in the Americas occurs mainly in Mexico, northern South America and southeast Brazil. This article reviews the local scorpion fauna, available health statistics, and the literature to assess scorpionism in Central America. Notwithstanding its high toxicity in Mexico, most scorpion sting cases in Guatemala, Belize, El Salvador, Nicaragua, and Costa Rica are produced by species in the genus Centruroides that are only mildly toxic to humans despite the existence of ion channel-active toxins in their venoms. Regional morbidity is low with the exception of Panama, where an incidence of 52 cases per 100,000 inhabitants was recorded for 2007, with 28 deaths from 1998 to 2006. Taxa belonging to the genus Tityus (also present in the Atlantic coast of Costa Rica) are responsible for fatalities in Panama, with Tityus pachyurus being the most important species medically. Most Tityus species inhabiting Panama are also found in northern South America from which they probably migrated upon closure of the Panamanian isthmus in the Miocene era. Incorporation of Panama as part of the northern South American endemic area of scorpionism is thereby suggested based on the incidence of these accidents and the geographical distribution of Panamanian Tityus species.
Article
Full-text available
Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
Article
Full-text available
Scorpion venoms have been studied for decades, leading to the identification of hundreds of different toxins with medical and pharmacological implications. However, little emphasis has been given to the description of these arthropods from cellular and evolutionary perspectives. In this report, we describe a transcriptomic analysis of the Mexican scorpion Centruroides noxius Hoffmann, performed with a pyrosequencing platform. Three independent sequencing experiments were carried out, each including three different cDNA libraries constructed from RNA extracted from the whole body of the scorpion after telson removal, and from the venom gland before and after venom extraction. Over three million reads were obtained and assembled in almost 19000 isogroups. Within the telson-specific sequences, 72 isogroups (0.4% of total unique transcripts) were found to be similar to toxins previously reported in other scorpion species, spiders and sea anemones. The annotation pipeline also revealed the presence of important elements of the small non-coding RNA processing machinery, as well as microRNA candidates. A phylogenomic analysis of concatenated essential genes evidenced differential evolution rates in this species, particularly in ribosomal proteins and proteasome components. Additionally, statistical comparison of transcript abundance before and after venom extraction showed that 3% and 2% of the assembled isogroups had higher expression levels in the active and replenishing gland, respectively. Thus, our sequencing and annotation strategies provide a general view of the cellular and molecular processes that take place in these arthropods, allowed the discovery of new pharmacological and biotechnological targets and uncovered several regulatory and metabolic responses behind the assembly of the scorpion venom. 
The results obtained in this report represent the first high-throughput study that thoroughly describes the universe of genes that are expressed in the scorpion Centruroides noxius Hoffmann, a highly relevant organism from medical and evolutionary perspectives.
Article
Full-text available
We describe Trans-ABySS, a de novo short-read transcriptome assembly and analysis pipeline that addresses variation in local read densities by assembling read substrings with varying stringencies and then merging the resulting contigs before analysis. Analyzing 7.4 gigabases of 50-base-pair paired-end Illumina reads from an adult mouse liver poly(A) RNA library, we identified known, new and alternative structures in expressed transcripts, and achieved high sensitivity and specificity relative to reference-based assembly methods.
Article
Full-text available
SOAP2 is a significantly improved version of the short oligonucleotide alignment program that both reduces computer memory usage and increases alignment speed at an unprecedented rate. We used a Burrows Wheeler Transformation (BWT) compression index to substitute the seed strategy for indexing the reference sequence in the main memory. We tested it on the whole human genome and found that this new algorithm reduced memory usage from 14.7 to 5.4 GB and improved alignment speed by 20-30 times. SOAP2 is compatible with both single- and paired-end reads. Additionally, this tool now supports multiple text and compressed file formats. A consensus builder has also been developed for consensus assembly and SNP detection from alignment of short reads on a reference genome. Availability: http://soap.genomics.org.cn.
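The Burrows Wheeler Transformation mentioned above can be illustrated with a small sketch. This is a textbook construction by sorting rotations, not SOAP2's actual index; the function names `bwt` and `inverse_bwt` and the `$` sentinel are assumptions made for this illustration, and the naive approach shown here would not scale to a genome-sized reference.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler transform of `text`.

    Naive textbook version: build every rotation of the sentinel-terminated
    string, sort them, and read off the last column.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Recover the original text from its transform (naive inversion)."""
    table = [""] * len(transformed)
    # Repeatedly prepend the transform column and re-sort; after
    # len(transformed) rounds the table holds all sorted rotations.
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The rotation ending in the sentinel is the original string.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

The transform tends to group identical characters together, which is what makes compressed indexes over a reference sequence practical; production aligners build it via suffix arrays rather than explicit rotations.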
Article
Scorpions constitute a charismatic lineage of arthropods and comprise more than 2500 described species. Found throughout various tropical and temperate habitats, these predatory arachnids have a long evolutionary history, with a fossil record that began in the Silurian. While all scorpions are venomous, the asymmetrically diverse family Buthidae harbors nearly half the diversity of extant scorpions, and all but one of the 58 species that are medically significant to humans. However, the lack of a densely sampled scorpion phylogeny has hindered broader inferences of the diversification dynamics of scorpion toxins. To redress this gap, we assembled a phylogenomic data set of 100 scorpion venom gland transcriptomes and genomes, emphasizing the sampling of highly toxic buthid genera. To infer divergence times of venom gene families, we applied a phylogenomic node dating approach for the species tree in tandem with phylostratigraphic bracketing to estimate the minimum ages of mammal-specific toxins. Our analyses establish a robustly supported phylogeny of scorpions, particularly with regard to relationships between medically significant taxa. Analysis of venom gene families shows that mammal-active sodium channel toxins (NaTx) have independently evolved in five lineages within Buthidae. Temporal windows of mammal-targeting toxin origins are correlated with the basal diversification of major scorpion mammal predators such as shrews, bats, and rodents. These results suggest an evolutionary model of relatively recent diversification of buthid NaTx homologs in response to the diversification of scorpion predators.
Article
Evaluation of the quality of genomic “data products” such as genome assemblies or gene sets is of critical importance in order to recognize possible issues and correct them during the generation of new data. It is equally essential to guide subsequent or comparative analyses with existing data, as the correct interpretation of the results necessarily requires knowledge about the quality level and reliability of the inputs. Using datasets of near universal single‐copy orthologs derived from OrthoDB, BUSCO can estimate the completeness and redundancy of genomic data by providing biologically meaningful metrics based on expected gene content. These can complement technical metrics such as contiguity measures (e.g., number of contigs/scaffolds, and N50 values). Here, we describe the use of the BUSCO tool suite to assess different data types that can range from genome assemblies of single isolates and assembled transcriptomes and annotated gene sets to metagenome‐assembled genomes where the taxonomic origin of the species is unknown. BUSCO is the only tool capable of assessing all these types of sequences from both eukaryotic and prokaryotic species. The protocols detail the various BUSCO running modes and the novel workflows introduced in versions 4 and 5, including the batch analysis on multiple inputs, the auto‐lineage workflow to run assessments without specifying a dataset, and a workflow for the evaluation of (large) eukaryotic genomes. The protocols further cover the BUSCO setup, guidelines to interpret the results, and BUSCO “plugin” workflows for performing common operations in genomics using BUSCO results, such as building phylogenomic trees and visualizing syntenies. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC. [Correction added on May 16, 2022, after first online publication: CSAL funding statement has been added.] 
Basic Protocol 1: Assessing an input sequence with a BUSCO dataset specified manually
Basic Protocol 2: Assessing an input sequence with a dataset automatically selected by BUSCO
Basic Protocol 3: Assessing multiple inputs
Alternate Protocol: Decreasing analysis runtime when assessing a large number of small genomes with BUSCO auto‐lineage workflow and Snakemake
Support Protocol 1: BUSCO setup
Support Protocol 2: Visualizing BUSCO results
Support Protocol 3: Building phylogenomic trees
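The contiguity measures that this abstract contrasts with BUSCO completeness, such as N50, are simple to compute from a list of contig or scaffold lengths. The sketch below uses the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly); the function name `n50` is an assumption for this illustration and is not part of BUSCO.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` returns 8, because the two longest sequences already account for 16 of the 32 total bases.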
Chapter
BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement. From the protein-coding genes predicted by GeneMark-ES/ET, we select a set for training AUGUSTUS, one of the most accurate gene finding tools that, in contrast to GeneMark-ES/ET, integrates extrinsic evidence already into the gene prediction step. The first published version, BRAKER1, integrated genomic footprints of unassembled RNA-Seq reads into the training as well as into the prediction steps. The pipeline has since been extended to the integration of data on mapped cross-species proteins, and to the usage of heterogeneous extrinsic evidence, both RNA-Seq and protein alignments. In this book chapter, we briefly summarize the pipeline methodology and describe how to apply BRAKER in environments characterized by various combinations of external evidence.
Article
Arachnids exhibit tremendous species richness and adaptations of biomedical, industrial, and agricultural importance. Yet genomic resources for arachnids are limited, with the first few spider and scorpion genomes becoming accessible in the last four years. We review key insights from these genome projects, and recommend additional genomes for sequencing, emphasizing taxa of greatest value to the scientific community. We suggest greater sampling of spiders whose genomes are understudied but hold important protein recipes for silk and venom production. We further recommend arachnid genomes to address significant evolutionary topics, including the phenotypic impact of genome duplications. A barrier to high-quality arachnid genomes is the reliance on assemblies based solely on short-read data, a limitation that may be overcome by long-range sequencing and other emerging methods.
Article
Introduction: Previous studies of scorpion envenomation in the United States (US) have focused on Arizona and the bark scorpion, Centruroides sculpturatus. Although many other scorpion species live in the US, information about envenomations in other states is lacking. Methods: Nationwide scorpion exposures from 2005 to 2015 were analyzed using the National Poison Data System. Results: Of the 185,402 total exposures, Arizona (68.2%), Texas (10.3%), and Nevada (4.2%) were the top contributors. However, six other southern states reported greater than 100 cases annually, primarily during the warmer months and evening hours. Envenomations occurred most often in a home (97.8%) and were typically managed on-site (90.1%). Pain was the most common effect nationwide (88.7%). Arizona had the highest frequencies of sensory, neuromuscular, and respiratory effects along with higher hospitalization and ICU admission rates, although the latter appeared to drop over the study period. In contrast, local skin effects such as erythema and edema were more common outside of Arizona. Children under 10 years of age in Arizona and Nevada had the highest rates of systemic effects, hospitalization, and ICU admission. Conclusions: Scorpion envenomations occurred throughout the southern US with similar seasonal and daily variations. Common clinical effects included pain, local edema, and erythema, except in Arizona and Nevada where severe systemic symptoms were more common. Systemic effects correlated with high rates of ICU admissions and intubations, especially in children under 10 years of age.
Article
This review categorizes functionally validated actions of defined scorpion toxin (SCTX) neuropeptides across ion channel subclasses, highlighting key trends in this rapidly evolving field. Scorpion envenomation is a common event in many tropical and subtropical countries, with neuropharmacological actions, particularly autonomic nervous system modulation, causing significant mortality. The primary active agents within scorpion venoms are a diverse group of small neuropeptides that elicit specific potent actions across a wide range of ion channel classes. The identification and functional characterisation of these SCTX peptides has tremendous potential for development of novel pharmaceuticals that advance knowledge of ion channels and establish lead compounds for treatment of excitable tissue disorders. This review delineates the unique specificities of 320 individual SCTX peptides that collectively act on 41 ion channel subclasses. Thus the SCTX research field has significant translational implications for pathophysiology spanning neurotransmission, neurohumoral signalling, sensori-motor systems and excitation-contraction coupling.
Article
To the Editor: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.
Article
Predators feeding on toxic prey may evolve physiological resistance to the preys' toxins. Grasshopper mice (Onychomys spp.) are voracious predators of scorpions in North American deserts. Two species of grasshopper mice (Onychomys torridus and Onychomys arenicola) are broadly sympatric with two species of potentially lethal bark scorpion (Centruroides exilicauda and Centruroides vittatus) in the Sonoran and Chihuahuan deserts, respectively. Bark scorpions produce toxins that selectively bind sodium (Na(+)) and potassium (K(+)) ion channels in vertebrate nerve and muscle tissue. We previously reported that grasshopper mice showed no effects of bark scorpion envenomation following natural stings. Here we conducted a series of toxicity tests to determine whether grasshopper mice have evolved resistance to bark scorpion neurotoxins. Five populations of grasshopper mice, either sympatric with or allopatric to bark scorpions, were injected with bark scorpion venom; LD50s were estimated for each population. All five populations of grasshopper mice demonstrated levels of venom resistance greater than that reported for non-resistant Mus musculus. Moreover, venom resistance in the mice showed intra- and interspecific variability that covaried with bark scorpion sympatry and allopatry, patterns consistent with the hypothesis that venom resistance in grasshopper mice is an adaptive response to feeding on their neurotoxic prey.
Article
We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
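The k-mer representation described above can be sketched in a few lines. This is the generic textbook de Bruijn construction, in which each k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix; it is not Velvet's data structure, and the function name `de_bruijn_graph` is an assumption for this illustration. Error correction, reverse complements and read-pair handling are all omitted.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a simple de Bruijn graph from reads.

    Nodes are (k-1)-mers; every k-mer found in a read adds one edge from
    its prefix to its suffix. Repeated k-mers add repeated edges.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

Contigs correspond to unambiguous paths through such a graph, which is why the structure stays compact even at high coverage: identical k-mers from many reads collapse onto the same nodes.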
Distributions of the scorpions Centruroides vittatus (Say) and Centruroides hentzi (Banks) in the United States and Mexico (Scorpiones, Buthidae)
  • Shelley