Innovative assembly strategy contributes to understanding the evolution and conservation genetics of
the endangered Solenodon paradoxus from the island of Hispaniola
Kirill Grigorev a,1, Sergey Kliver b,1, Pavel Dobrynin b, Aleksey Komissarov b, Walter Wolfsberger a,c, Ksenia
Krasheninnikova b, - a, Adam L. Brandt d,e, Liz A. Paulino f, Rosanna
Carreras f, Luis E. Rodríguez f,  g, Jessica R. Brandt d,h, Filipe Silva i,j-
Martich k, Audrey J. Majeske a, Agostinho Antunes i,j, Alfred L. Roca d,l,  b,m, Juan Carlos
Martínez-Cruzado a and Taras K. Oleksyk a,c2
a Department of Biology, University of Puerto Rico at Mayagüez, Mayagüez, Puerto Rico
b Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St.
Petersburg, Russia
c Biology Department, Uzhhorod National University, Uzhhorod, Ukraine
d Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
e Division of Natural Sciences, St. Norbert College, De Pere, Wisconsin, USA
f Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic
g Department of Conservation and Science, Parque Zoologico Nacional (ZOODOM), Santo Domingo,
Dominican Republic
h Department of Biology, Marian University, Fond du Lac, Wisconsin, USA
i CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto,
Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450208 Porto, Portugal
j Department of Biology, Faculty of Sciences, University of Porto. Rua do Campo Alegre, 4169-007 Porto,
k Instituto de Investigaciones Botánicas y Zoológicas, Universidad Autónoma de Santo Domingo, Santo
Domingo, Dominican Republic
l Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL,
m Oceanographic Center, Nova Southeastern University, Fort Lauderdale, Florida, USA
1 These authors contributed equally. ORCIDs: Kirill Grigorev: 0000-0003-3628-0123 Sergey Kliver: 0000-
2 Corresponding author: Taras K. Oleksyk, ORCID: 0000-0002-8148-3918
Solenodons are insectivores living in Hispaniola and Cuba that form an isolated branch in the tree of
placental mammals, highly divergent from other eulipotyphlan insectivores. The history, unique biology and
adaptations of these enigmatic venomous species could be illuminated by the availability of genome data,
but a whole genome assembly for solenodons has not been previously performed, partially due to the
difficulty in obtaining samples from the field. Island isolation and reduced numbers have likely resulted in
high homozygosity within the Hispaniolan solenodon (Solenodon paradoxus); thus, we tested the
performance of several assembly strategies on the genome of this genetically impoverished species. The
string-graph based assembly strategy seemed a better choice compared to the conventional de Bruijn
graph approach, due to the high levels of homozygosity, which is often a hallmark of endemic or
endangered species. A consensus reference genome was assembled from sequences of five individuals
from the southern subspecies (S. p. woodi). In addition, we obtained sequence data from one sample
of the northern subspecies (S. p. paradoxus). The resulting genome assemblies were compared to each
other, and annotated for genes, with a specific emphasis on venom genes, repeats, variable microsatellite
loci and other genomic variants. Phylogenetic positioning and selection signatures were inferred based on
4,416 single copy orthologs from 10 other mammals. We estimated that solenodons diverged from other
extant mammals 73.6 Mya. Patterns of SNP variation allowed us to infer population demography, which
supported a subspecies split within the Hispaniolan solenodon at least 300 Kya.
Downloaded from
by guest
on 17 March 2018
Keywords: Genome, assembly, de Bruijn, string graph, Solenodon paradoxus, Hispaniola, Caribbean, island
biogeography, selection drift, isolation, evolution, microsatellites
The only two surviving species of solenodons, found on the two largest Caribbean islands,
Hispaniola (Solenodon paradoxus) and Cuba (S. cubanus), are among the few endemic terrestrial mammals
that survived human settlement of these islands. Phenotypically, solenodons somewhat resemble shrews
(Figure 1), but molecular evidence indicates that they are actually the sister-group to all other extant
eulipotyphlan insectivores (hedgehogs, moles, shrews) from which they split in the Cretaceous Period
[1-3]. These enigmatic species have various local names in Cuba and Hispaniola, including oso (bear),
hormiguero (ant-eater), joron (ferret), milquí (or almiquí) and agouta [4,5], all pointing to the first
impression made on the Spanish colonists by its unusual appearance. Today, the Hispaniolan solenodon
(Solenodon paradoxus) is difficult to find in the wild, both because of its nocturnal activity pattern and the
low population numbers. Here, we report the assembly and annotation of the nuclear genome sequences
and genomic variation of two subspecies of S. paradoxus, using analytical strategies that will allow
researchers to formulate hypotheses and develop genetic tools, to assist future studies of evolutionary
inference and conservation applications.
S. paradoxus was originally described from a skin and partial skull at the St. Petersburg Academy of
Sciences in Russia [6]. It has a large head with a long rostrum with tiny eyes and ears partially hidden by the
dusky brown body fur that turns reddish on the sides of the head, throat and upper chest. The tail, legs,
snout, and eyelids of S. paradoxus are hairless. The front legs are noticeably more developed, but all
four have strong claws useful for burrowing (Figure 1). Adult animals measure 49-72 cm in total length and
weigh over 1 kg [7]. Solenodons are social animals: they spend their days in extensive underground tunnel
networks shared by family groups, and come to the surface at night to hunt small vertebrates and large
invertebrates [8]. A unique feature is the os proboscidis, a bone extending forward from the nasal opening
to support the snout cartilage [9]. Solenodons are venomous mammals that display a fascinating strategy
for venom delivery. The second lower incisor of solenodons has a narrow, almost fully enfolded tubular
channel, through which saliva secreted by the submaxillary gland flows into the victim [10]. The genus
name Solenodon means “grooved tooth” in Greek and refers to the shape of this incisor. Although
solenodons rarely bite humans, the bites can be very painful (Nicolás Corona, personal communication),
and even a small injection of venom has been shown to be fatal to mice in minutes [7]. The chemical
composition of solenodon venom has not yet been resolved [11].
Roca et al. 2004 sequenced 13.9 kb of nuclear and mitochondrial sequences of S. paradoxus,
inferring that solenodon divergence from other eulipotyphlan mammals such as shrews and moles dates
back to the Cretaceous Period, ~76 million years ago (Mya), before the mass extinction of the dinosaurs ~66
Mya. Brandt et al. [12] sequenced complete mitogenomes of six Hispaniolan solenodon
specimens from the southern part of Hispaniola (Figure 2), corroborating this conclusion, and estimated that S.
paradoxus diverged from all other mammals approximately 78 Mya. Other studies have reported
similarly deep divergence dates (reviewed in [13]). Whole genome analysis of S. paradoxus could provide
support and validation to the earlier evolutionary studies.
Morphometric studies suggest that southern and northern Hispaniolan solenodons may be
distinctive enough to be considered separate subspecies [2,14,15], a notion supported by recent
mitochondrial DNA studies [12,16]. The southern Hispaniolan solenodons had less genetic diversity than
those in the north, so that the control region sequences of all five southern specimens (the same
individuals used in this study) were identical or nearly identical [12], indicating that Hispaniolan solenodons
have a very low level of mitochondrial diversity.
It may now be imperative to study conservation genomics of solenodons, because their extinction
would mean the loss of an entire evolutionary lineage whose antiquity goes back to the age of dinosaurs. S.
paradoxus survived in spectacular island isolation despite the devastating human impact on biodiversity in
recent centuries [3,12]. Nevertheless, survival of this species is now threatened by deforestation, increasing
human activity, and predation by introduced dogs, cats and mongooses. Its population is declining, its
habitat is severely fragmented, and it is listed as endangered by the IUCN Red List of Threatened Species
(Red List category B2ab, assessed in 2008).
In this study, we assembled the genome of S. paradoxus using low coverage genome data (~5x
each) from five individuals of S. paradoxus woodi. We take advantage of the low individual and population
genetic diversity to pool individual data, and apply a string graph assembly approach resulting in a working
genome assembly of the S. paradoxus genome from the combined paired-end dataset (approximately 26x;
Figure 3). Our methodology introduces a useful pipeline for genome assembly to compensate for the
limited amount of sequencing which, in this instance, performs better than the assembly by a traditional de
Bruijn algorithm (SOAPdenovo2) [17]. We employed the string-graph assembler Fermi [18] as a principal
tool for contig assembly in conjunction with SSPACE [19] and GapCloser [17] for scaffolding. The resulting
genome sequence data was sufficient for high-quality annotation of genes and functional elements, as well
as for comparative genomics and population genetic analyses. Prior to this study, the string-graph
assembler Fermi [18] had been used only in studies for annotation, or as a complementary tool for de novo
assemblies made with de Bruijn algorithms [20]. We present and compare genome assemblies for the
southern subspecies (S. p. woodi) based on several combinations of assembly tools, provide a high-quality
annotation of genome features and describe genetic variation in two subspecies (S. p. woodi and S. p.
paradoxus), make inferences about recent evolution and selection signatures in genes, trace demographic
histories, and develop molecular tools for future conservation studies.
Data description
Sample collection and sequencing
Five adult individuals of S. paradoxus woodi (NCBI Taxon ID:1906352) from the southern Dominican
Republic were collected in the wild following a general field protocol described earlier [12] including two
specimens caught from La Cañada del Verraco, and three from the El Manguito location in the Pedernales
Province. The captured individuals were visually assessed for obvious signs of disease, weighed, measured,
sexed, and released at the capture site, all within 10 minutes of capture. Geographic coordinates were
recorded for every location. In addition, one S. p. paradoxus (Spa-1) sample was acquired through the
collaboration with ZooDom at Santo Domingo, and originated in the Cordillera Septentrional in the
northern part of the island. Figure 2 highlights geographical locations of sample collection points for the
samples used in this study.
The five S. p. woodi samples were sequenced using HiSeq 2000 technology (Illumina Inc.), resulting
in an average of 151,783,327 paired-end reads 101 bp long, or 15.33 Gb of sequence data, per individual. In
addition, DNA extracted from the northern solenodon (S. p. paradoxus) Spa-1 was sequenced using MiSeq
V3 technology (Illumina Inc.), and produced a total of 52,358,830 paired-end reads, equating to
approximately 13.09 Gb of sequence data. Only the samples of S. p. woodi were used for assembly,
since the northern subspecies (S. p. paradoxus) did not have sufficient coverage for de novo assembly.
Further details about sample collection, DNA extraction, library construction and sequencing can be
found in the Methods section. The whole genome shotgun data from this project has been deposited at
DDBJ/ENA/GenBank under the accession NKTL00000000. The version described in this paper is version
NKTL01000000. The genome data has also been deposited into NCBI under BioProject PRJNA368679, and to
the GigaScience GigaDB repository [21].
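The sequencing yields quoted in this section can be checked with simple arithmetic. The sketch below uses only the read counts and the 101 bp read length given in the text; the MiSeq read length is not stated, so it is back-calculated from the reported yield and should be read as an inference, not a reported value.

```python
# Back-of-the-envelope check of the sequencing yields quoted in the text.
# Read counts and the 101 bp HiSeq read length come from the text; the MiSeq
# read length is NOT stated there and is inferred from the reported yield.

HISEQ_READS_PER_SAMPLE = 151_783_327   # average reads per S. p. woodi sample
HISEQ_READ_LENGTH_BP = 101
MISEQ_READS = 52_358_830               # S. p. paradoxus sample Spa-1
MISEQ_YIELD_BP = 13.09e9               # reported yield (approximate)


def yield_gb(n_reads, read_length_bp):
    """Total sequenced bases in gigabases (1 Gb = 1e9 bases)."""
    return n_reads * read_length_bp / 1e9


hiseq_yield = yield_gb(HISEQ_READS_PER_SAMPLE, HISEQ_READ_LENGTH_BP)
implied_miseq_read_length = MISEQ_YIELD_BP / MISEQ_READS

print(f"HiSeq yield per sample: {hiseq_yield:.2f} Gb")
print(f"Implied mean MiSeq read length: {implied_miseq_read_length:.0f} bp")
```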
Read correction
After the reduction of adapter contamination with Cookiecutter [22], the k-mer distribution in the
reads for the five individuals of S. paradoxus woodi was assessed with Jellyfish (Jellyfish,
RRID:SCR_005491)[23]. The predicted mean genome coverage was approximately 5x for each sample
(Figure 3), which is too low for individual de novo genome assembly. However, because of the extremely
low levels of genetic diversity suggested by the earlier study of mitochondrial DNA in the southern
subspecies [12], and in order to increase the average depth of coverage, the reads from the five samples
were combined into a single data set. As a result, the projected mean genome coverage for the combined
genome assembly was 26x. Error correction was applied with QuorUM (QuorUM, RRID:SCR_011840)[24]
using the value k = 31. The k-mer distribution analysis by Jellyfish in the combined and error-corrected data
set indicated very low levels of heterozygosity, in accordance with the hypothesis (see Figure 3 legend),
allowing use of the combined dataset for further genome assembly. The genome size was
estimated using KmerGenie [25] to be 2.06 Gbp.
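The general idea behind k-mer based estimates of coverage and genome size can be sketched as follows. This is not the algorithm used by KmerGenie or Jellyfish; it is a minimal illustration that assumes a histogram with a single main peak (as expected when heterozygosity is very low), and the histogram values and the `min_depth` error cutoff are made up for the example.

```python
from typing import Dict, Tuple


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 3) -> Tuple[int, float]:
    """Estimate (peak_depth, genome_size) from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff is an assumption of this sketch).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(solid, key=solid.get)              # position of the main peak
    total_kmers = sum(d * n for d, n in solid.items())  # k-mer occurrences in the reads
    return peak_depth, total_kmers / peak_depth


# Toy histogram (hypothetical numbers): an error spike at depths 1-2 and a
# single peak around depth 20, with no secondary peak at half that depth.
toy_histogram = {1: 900, 2: 300, 18: 50, 19: 120, 20: 200, 21: 110, 22: 40}
peak, size = estimate_genome_size(toy_histogram)
print(peak, size)
```

A pronounced second peak at roughly half the main depth would point to heterozygous k-mers; its absence is what makes pooling reads from several individuals reasonable.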
Assembly tool combinations
We used several alternative combinations of tools to determine the best approach to an assembly
of the combined genome data, outlined in Table 1. First, the combined libraries of paired-end reads were
assembled into contigs with Fermi, a string graph based tool [18]. Second, the same libraries were also
assembled with SOAPdenovo2, a de Bruijn graph based tool (SOAPdenovo2, RRID:SCR_014986)[17]. The
optimal k-mer length parameter for SOAPdenovo2 was determined to be k = 35 with the use of KmerGenie
[25]. For the scaffolding step we used either SSPACE (SSPACE, RRID:SCR_005056)[19] or the scaffolding
module of SOAPdenovo2 [17]. Finally, for all instances, the GapCloser module of SOAPdenovo2 was used to
fill in gaps in the scaffolds (GapCloser, RRID:SCR_015026)[17]. After assembly, datasets were trimmed:
scaffolds shorter than 1 Kbp were removed from the output. In Table 1, the four possible combinations of
tools used for the assembly are referred to with capital letters A, B, C, and D for brevity. However,
SOAPdenovo2 introduces artifacts at the contig construction stage, which it is specifically designed to
mitigate at later stages, and SSPACE is not aware of such artifacts [26]. For this reason, the assembly
produced by combination D (contig assembly with SOAPdenovo2 and scaffolding with SSPACE) was not
QC and structural comparisons between the assemblies
We used QUAST (QUAST, RRID:SCR_001228)[27] to estimate the common metrics of assembly
quality for all combinations of assembly tools: N50 and gappedness (the percentage of Ns; Table 1).
Fermi-assembled contigs (A and B) were overall longer and fewer in number than the SOAPdenovo2 contigs (C and D).
The assembly completeness was also evaluated with both BUSCO (BUSCO, RRID:SCR_015008)[28] and
CEGMA (CEGMA, RRID:SCR_015055)[29], which assess the presence of conserved genes. Fermi assemblies (A and
B) showed higher levels of completeness than SOAPdenovo2 (86% vs 42%) at the contig level.
However, this difference is partially mitigated at the scaffolding step, where SOAPdenovo2 increases
completeness for the Fermi assembly (A), and more than doubles it for the SOAPdenovo2 assembly (C). To
directly evaluate the quality of all the assemblies we applied REAPR [30]. From the REAPR metrics
presented at the bottom part of Table 1, it appears that, even though the scaffolding step has increased the
final N50 for the C assembly, it contains significantly more regions with high probability of mis-assemblies
(low-scoring regions), fewer error-free bases, and a 3 to 6 times higher number of incorrectly oriented reads
compared to the Fermi based assemblies (A and B) (Table 1).
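For readers unfamiliar with the two summary metrics named above, the following sketch computes N50 and gappedness (percentage of Ns) after discarding scaffolds shorter than 1 Kbp. It is an illustration only, not the QUAST implementation; the function name and the toy scaffolds are hypothetical.

```python
def assembly_stats(scaffolds, min_length=1000):
    """Return (n50, percent_n) for scaffolds at least min_length bp long."""
    kept = [s for s in scaffolds if len(s) >= min_length]
    lengths = sorted((len(s) for s in kept), reverse=True)
    total = sum(lengths)
    if total == 0:
        return 0, 0.0
    # N50: length of the scaffold at which the running sum of lengths
    # (longest first) reaches half of the total assembly length.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    n_count = sum(s.upper().count("N") for s in kept)
    return n50, 100.0 * n_count / total


toy_scaffolds = [
    "A" * 4000,
    "C" * 2500 + "N" * 500,   # 3,000 bp scaffold containing a 500 bp gap
    "G" * 2000,
    "T" * 1000,
    "A" * 600,                # dropped: shorter than 1 Kbp
]
n50, percent_n = assembly_stats(toy_scaffolds)
print(n50, percent_n)
```

As the REAPR results in Table 1 illustrate, a larger N50 does not by itself imply a more accurate assembly, so N50 is best read together with error-oriented metrics.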
We hypothesized that aligning the three genome assemblies to each other would allow us to detect
some of these mis-assemblies. A comparison to the best, most closely related genome assembly (i.e., Sorex
araneus) will reveal several rearrangements that in many cases reflect real evolutionary events. It is
reasonable to assume that, if all the rearrangements that are detected are real, and not due to
assembly artifacts, the number of detected rearrangements vs the Sorex assembly should be the same for all
three Solenodon assemblies (A, B and C). Following the parsimony principle, the assembly showing the most
rearrangements is also likely to contain the most assembly artifacts. Conversely, we expected that the best
of the three assemblies of the Solenodon genome should contain the fewest reversals and
transpositions when compared to the best available closely related genome (Sorex araneus).
To test this hypothesis, the three completed assemblies of Solenodon (A, B and C) were aligned to
each other, and to the outgroup, which was the Sorex genome (SorAra 2.0, NCBI accession number
GCA_000181275.2), using Progressive Cactus [31]. Custom scripts were employed to interpret binary
output of the pairwise genome by genome comparisons, and the resulting coverage metrics are presented
in Table 2. In this comparison, all three Solenodon genome assemblies had a substantial overlap, and
resulted in similar levels of synteny when compared against the Sorex reference assembly, but assemblies A
and B had the fewest differences with Sorex, while assembly C had more differences vs. A, B and Sorex.
Next, each of the three Solenodon assemblies (A, B and C) was compared to the
Sorex assembly: 50 Kbp syntenic blocks were identified using the ragout-maf2 synteny module of the
software package Ragout [32], and the numbers of scaffolds that contained syntenic block rearrangements
were determined. Assembly B had the lowest number of reversals and transpositions when
compared to the S. araneus reference genome (Table 2). Based on the combined results of the evaluations
by REAPR [30], Progressive Cactus [31] and Ragout [32], assembly C (generated by the complete
SOAPdenovo2 run) was not included in further analysis.
Genome annotation and evaluation of assembly completeness
Repeats in assemblies A and B were identified and soft masked using RepeatMasker (RepeatMasker,
RRID:SCR_012954)[33] with the RepBase library [34]. The total percentage of all interspersed repeats
masked in the genome was lower than in S. araneus (22.53% vs 30.48%). One possible reason could be that
a low coverage assembly may perform better in non-repetitive regions. Alternatively, if the repeat content
in S. paradoxus is indeed lower, this would have to be evaluated using a higher quality assembly with the
use of long read data. The total masked repeat content of the S. paradoxus genome including
simple/tandem repeats, satellite DNA, low complexity regions, and other elements is presented in Table 3.
The repeat content can be retrieved from Database S1.
The annotation of protein-coding genes was performed using a combined approach that
synthesized both homology-based and de novo predictions, where de novo predictions were used to fill
gaps and extend homology-based predictions. Gene annotation was performed for both assemblies (A and
B) independently. Proteins of four reference species S. araneus (SorAra 2.0, GCA_000181275.2), Erinaceus
europaeus (EriEur2.0, GCA_000296755.1), Homo sapiens (GRCh38.p7) and Mus musculus (GRCm38.p4)
were aligned to a S. paradoxus assembly with Exonerate [35] with a maximum of three “hits” (matches) per
protein. The obtained alignments were classified into the top (primary) hit and two secondary hits; the
coding sequence (CDS) fragments were cut from each side by 3 bp for the top hits and by 9 bp for secondary
hits. These truncated fragments were clustered and supplied as hints (local pieces of information about the
gene in the input sequence, such as a likely stretch of coding sequence) of the potentially protein-coding
regions to the AUGUSTUS software package (Augustus: Gene Prediction, RRID:SCR_008417)[36], which
predicted genes in the soft-masked Solenodon assembly. Proteins were extracted from the predicted genes
and aligned by HMMER (Hmmer, RRID:SCR_005305)[37] and BLAST (NCBI BLAST, RRID:SCR_004870)[38] to
Pfam (Pfam, RRID:SCR_004726)[39] and Swiss-Prot [40] databases, respectively. Genes supported by hits
to protein databases and hints were retained; the unsupported sequences were discarded. The annotated
genes can be retrieved from Database S2.
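The trimming of aligned CDS fragments before they are passed on as hints can be illustrated with a small helper. The 3 bp and 9 bp margins are taken from the text; the coordinate convention (0-based, end-exclusive), the function name, and the handling of fragments that become empty are assumptions of this sketch, not a description of the authors' scripts.

```python
def trim_cds_fragment(start, end, is_top_hit):
    """Shrink a CDS interval from both sides before using it as a hint.

    Coordinates are assumed to be 0-based and end-exclusive.
    Top hits lose 3 bp per side, secondary hits lose 9 bp per side.
    Returns None if nothing is left after trimming.
    """
    margin = 3 if is_top_hit else 9
    new_start, new_end = start + margin, end - margin
    if new_start >= new_end:
        return None
    return new_start, new_end


print(trim_cds_fragment(100, 200, is_top_hit=True))
print(trim_cds_fragment(100, 200, is_top_hit=False))
print(trim_cds_fragment(100, 112, is_top_hit=False))
```

Trimming secondary hits more aggressively keeps the less reliable alignments from asserting exact exon boundaries.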
Assembly B showed higher support than assembly A (91.7% vs 79.2%) for the protein
coding gene predictions by extrinsic evidence, even though assembly A had a larger N50 value (Table 1).
These values were calculated as the median fraction of exons supported by alignments of proteins from
the reference species to the genome (Figure 4). In other words, assembly B is more useful for gene predictions, and
is likely to contain better gene models that can be used in the downstream analysis. Therefore, based on
two lines of evidence, low rearrangement counts (Table 2) and high support for the gene predictions,
assembly B was chosen for the subsequent analyses as the most useful current representation of the
Solenodon genome.
Non-coding RNA genes
For all non-coding RNA genes except for tRNA and rRNA genes, the search was performed with
INFERNAL (Infernal, RRID:SCR_011809)[41] using the Rfam (Rfam, RRID:SCR_007891)[42] BLASTN hits as
seeds. The tRNA genes were predicted using tRNAScan-SE (tRNAscan-SE, RRID:SCR_010835)[43], and rRNA
genes were predicted with Barrnap (BAsic Rapid Ribosomal RNA Predictor) version 0.6 (Barrnap,
RRID:SCR_015995)[44]. Additionally, RNA genes discovered by RepeatMasker at the earlier stages of the
analysis were used to cross-reference the findings of rRNA and tRNA-finding software. The list of the non-
coding RNA genes can be accessed in Database S3.
Multiple genome alignment, synteny and duplication structure
To compare the Solenodon genome assembly with other mammalian genomes, a multiple
alignment with genomes of related species was performed using Progressive Cactus [31]. Currently
available genomic assemblies of cow (Bos taurus, BosTau 3.1.1, NCBI accession number DAAA00000000.2),
dog (Canis familiaris, CanFam 3.1, GCA_000002285.2), star nosed mole (Condylura cristata, ConCri 1.0,
GCF_000260355.1), common shrew (S. araneus, SorAra 2.0, GCA_000181275.2) and S. paradoxus woodi
(assembly B from this study) were aligned together, guided by a cladogram representing branching order in
a subset of a larger phylogeny (Figure S1). We evaluated the S. paradoxus coverage by comparing it to the
weighted coverages of other genomes in the alignment to the C. familiaris genome (Table 4). Custom
scripts were employed to interpret the binary output of Progressive Cactus (“Cactus”) [31]. Cactus genome
alignments were used to build a “sparse map” of the homologies between a set of input sequences. Once
this sparse map is constructed, in the form of a Cactus graph, the sequences that were initially unaligned in
the sparse map are also aligned [31]. Weighted coverage of a genome by genome comparison was
calculated by binning an alignment into regions of different coverage and averaging these coverages, with
lengths of bins as weights. The weighted coverage of S. paradoxus to C. familiaris was 1.05, which indicated
that the present genome assembly is comparable in quality and duplication structure to other available
mammalian assemblies, whose weighted coverages are close to each other and to 1.0 (Table 4).
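The weighted coverage described in this section is a length-weighted mean, which can be written out in a few lines. The bin values below are invented to reproduce a value of 1.05, and whether uncovered regions (coverage 0) enter the average in the authors' custom scripts is an assumption of this sketch.

```python
def weighted_coverage(bins):
    """Length-weighted mean coverage.

    bins: iterable of (length_bp, coverage) pairs, where coverage is how many
    times a stretch of the reference genome is covered by the aligned genome.
    """
    bins = list(bins)
    total_length = sum(length for length, _ in bins)
    if total_length == 0:
        raise ValueError("no aligned bins")
    return sum(length * cov for length, cov in bins) / total_length


# Hypothetical bins: most of the reference covered once, a small part covered
# twice (e.g. duplicated in the query) and a small part not covered at all.
toy_bins = [(9000, 1), (750, 2), (250, 0)]
wc = weighted_coverage(toy_bins)
print(round(wc, 2))
```

Values well above 1.0 would suggest excess duplication in the aligned assembly, while values well below 1.0 would suggest missing sequence.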
Detection of single-copy orthologs
Single-copy orthologs (single gene copies) are essential for the evolutionary analysis since they
represent a useful conservative homologous set, unlike genes with paralogs, which are difficult to compare
across species. The longest polypeptides coded by each gene of S. paradoxus and of three other Eulipotyphla
(Erinaceus europaeus, S. araneus and C. cristata) were aligned to profile hidden Markov models of the
TreeFam database (Tree families database, RRID:SCR_013401)[45,46] using HMMER [37]. Top hits from
these alignments were extracted and used for assignment of corresponding proteins to families. The same
procedure was performed in order to assign proteins to orthologous groups using profile HMMs of
orthologous groups of the maNOG subset from the eggNOG database (eggNOG, RRID:SCR_002456)[47] as
reference. Orthologous groups and families for which high error rates were observed while testing
assignment of proteins to them were discarded; the rest of the orthologous groups and families were
retained for further analysis. Proteins and the corresponding assignments were obtained from the maNOG
database for seven other species: H. sapiens, M. musculus, B. taurus, C. familiaris, Equus caballus, Mustela
putorius furo, and Monodelphis domestica. Inspection of assignments across all the species yielded 4,416
orthologous groups containing single copy orthologous genes (Database S5).
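Once every protein has been assigned to an orthologous group, selecting the single-copy groups amounts to keeping groups in which each species contributes exactly one gene. The sketch below shows that filtering step only; the input layout, species labels and group IDs are hypothetical and do not reflect the authors' file formats.

```python
from collections import Counter


def single_copy_groups(assignments, species):
    """Return IDs of groups with exactly one gene from every listed species.

    assignments: iterable of (group_id, species_name, gene_id) tuples.
    species: the species that must each be represented exactly once.
    """
    counts = {}
    for group_id, sp, _gene in assignments:
        counts.setdefault(group_id, Counter())[sp] += 1
    return sorted(
        group_id
        for group_id, per_species in counts.items()
        if all(per_species[sp] == 1 for sp in species)
    )


toy_species = ["solenodon", "shrew", "human"]
toy_assignments = [
    ("OG1", "solenodon", "g1"), ("OG1", "shrew", "g2"), ("OG1", "human", "g3"),
    ("OG2", "solenodon", "g4"), ("OG2", "shrew", "g5"),
    ("OG2", "human", "g6"), ("OG2", "human", "g7"),      # paralogs in human
    ("OG3", "solenodon", "g8"), ("OG3", "human", "g9"),  # missing in shrew
]
print(single_copy_groups(toy_assignments, toy_species))
```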
Species tree reconstruction and divergence time estimation
We used our genome assembly to infer phylogenetic relationships between S. paradoxus and other
eutherian species with known genome sequences and estimated their divergence time using the new data.
Based on the alignments of the single-copy orthologous proteins for the species included in the analysis, a
maximum likelihood tree was built using RAxML [48] with the PROTGAMMAAUTO option and the JTT
fitting model tested with 1,000 bootstrap replications. From the codon alignments of single-copy orthologs of
the eleven species, 461,539 four-fold degenerate sites were extracted. The divergence time estimation was
made by the MCMCtree tool from the software package PAML (PAML, RRID:SCR_014932)[49] with the
HKY+G model of nucleotide substitutions and 2,200,000 generations of MCMC (of which the first 200,000
generations were discarded as burn-in). A test for substitution saturation [50,51] was performed using
DAMBE6 [52] for both all 3rd codon positions and only four-fold degenerate sites. In both cases the Iss (index
of substitution saturation) was significantly lower than the threshold value for both symmetrical and
asymmetrical trees, indicating a low saturation level. Therefore, saturation was not detected for either the 3rd
codon positions or the four-fold degenerate sites.
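Four-fold degenerate sites are third codon positions at which any of the four nucleotides encodes the same amino acid. A minimal sketch of extracting them from a codon alignment is given below, assuming the standard genetic code. Requiring all species to share the same first two bases is one conservative choice made for this example; the exact criteria used in the study are not stated in the text.

```python
# First two bases of the eight four-fold degenerate codon families in the
# standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(codon_alignment):
    """Extract third positions of four-fold degenerate codons.

    codon_alignment: dict of species -> aligned in-frame CDS of equal length,
    with gaps written as '-'. A codon column is kept only if all species share
    the same four-fold degenerate prefix and have A/C/G/T at the third position.
    Returns dict of species -> string of extracted third-position bases.
    """
    names = list(codon_alignment)
    length = len(codon_alignment[names[0]])
    out = {name: [] for name in names}
    for i in range(0, length - length % 3, 3):
        codons = [codon_alignment[n][i:i + 3].upper() for n in names]
        prefixes = {c[:2] for c in codons}
        if len(prefixes) != 1 or prefixes.pop() not in FOURFOLD_PREFIXES:
            continue
        if any(c[2] not in "ACGT" for c in codons):
            continue
        for name, codon in zip(names, codons):
            out[name].append(codon[2])
    return {name: "".join(bases) for name, bases in out.items()}


toy_alignment = {
    "a": "GCTATGCTAGGAACG",
    "b": "GCCATGCTGGG-ACG",
    "c": "GCAATGTTGGGCACT",
}
sites = fourfold_sites(toy_alignment)
print(sites)
```

Because substitutions at these positions do not change the protein, they are commonly treated as close to neutral, which is why they are a natural choice for divergence time estimation.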
Divergence times were calibrated using fossil-based priors associated with mammalian evolution,
listed in Table 5 and based on [53-56]. FigTree [57] was used to plot the resulting tree, shown in Figure 5.
According to this analysis, S. paradoxus diverged from other mammals 73.6 Mya (95% confidence interval
of 61.4-88.2 Mya). This is in accordance with earlier estimates based on nuclear and mitochondrial
sequences (e.g., [3,12]) as reviewed by Springer et al. [58]. This date is also much older than the timeframe
of molecular estimates of divergence times between most island taxa and their closest mainland relatives
[59]. Our data support solenodons forming a sister group to other eulipotyphlans, i.e., hedgehogs, shrews
and moles [60-63], with a divergence date as old as splits between some pairs of mammalian orders, such as
between rodents and primates, or carnivores and artiodactyls (Figure 5).
Positively selected genes
To evaluate signatures of selection in the assembled genomes we used a dataset of 4,416
orthologous groups containing single copy orthologous genes of the mammalian species described earlier.
Single copy orthologs were used as a conservative set for comparing coding sequences that
arose only once, in order to avoid the uncertainties associated with paralogs and lineage specific gene
duplications. First, we translated DNA sequences into amino acids, aligned them in MUSCLE (MUSCLE,
RRID:SCR_011812)[64], and then translated the alignments back into DNA code using the original nucleotide
sequences with PAL2NAL [65]. Genic dN/dS ratios were estimated among the 11 mammalian species (including Solenodon)
used in constructing the phylogeny represented in Figure 5.
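The back-translation step, in which a protein alignment is converted into a codon alignment using the original nucleotide sequences, can be sketched as threading codons onto the gapped protein row. This illustrates the principle only and is not PAL2NAL's code; the function name is hypothetical, and the sketch does not verify that each codon actually translates to the aligned residue.

```python
def thread_codons(aligned_protein, cds):
    """Map an ungapped CDS onto one gapped row of a protein alignment.

    Each residue consumes the next codon from cds; each '-' becomes '---',
    so alignment columns are preserved at the codon level.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


print(thread_codons("M-KL", "ATGAAACTG"))
```

Keeping the original nucleotides, instead of re-aligning DNA directly, preserves the reading frame so that synonymous and non-synonymous changes can be counted per codon.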
To estimate dN/dS ratios, we used the codeml module from the PAML package [49]. The dN/dS
ratios were calculated over the entire length of a protein coding gene. The branch-site model was not
included in the current analysis because of the risk of reporting false positives due to sequencing and
alignment errors [66], especially on smaller datasets, and additional uncertainties could be introduced from
the lack of power under synonymous substitution saturation and high variation in the GC content [67].
All the single copy orthologs were plotted in the dN to dS coordinates and color-coded according to
the 96 Gene Ontology generic categories (Figure 6). We retrieved values of dN, dS and ω (ω = dN/dS) for all
single copy orthologs and used human annotation categories, the Python package goatools [68] and the
GO Slim generic database [68] to assign the genes to the major Gene Ontology (GO) categories.
The dN/dS values for the 12 genes exhibiting positive selection (Table 6) are visible above the line
showing dN=dS. Three of these genes belong to the plasma membrane GO category (GO:0005886), while
cytosol (GO:0005829), mitochondrial electron transport chain (GO:0005739), cytoplasm (GO:0005737) and
generation of precursor metabolites (GO:0006091) were represented by one gene each. Five of the genes
exhibiting positive selection signatures could not be assigned to GO categories. Some of these are also
associated with plasma membranes (TMEM56, SMIM3), and one gene (CCRNL4) encodes a protein
highly similar to nocturnin, identified as a circadian clock regulator in Xenopus laevis [69]. The
full list of genes, GO annotations, and associated dN/dS values is given in Database S6.
Traditionally, one of the most commonly used signatures of selection is the ratio of non-
synonymous (dN) to synonymous (dS) substitutions, dN/dS [70]. The synonymous rate (dS) expresses the
rate of unconstrained, neutral evolution, so that when dN/dS<1, the usual interpretation is that negative
selection has taken place on non-synonymous substitutions. Conversely, when dN/dS>1, the interpretation
is that positive selection is likely to have accelerated the rate of fixation of non-synonymous
substitutions. It is possible to quantify the proportion of non-synonymous substitutions that are slightly
deleterious from the differences in dN/dS between rare and common alleles [71][72]. In our comparison, the
average dN/dS of the subset of single copy orthologs, relative to the 10 other mammalian species (Figure 5), is estimated to be
~0.18 or 18%, compared with ~0.25 reported for the human-chimp and ~0.13 reported for the
mouse-rat comparisons [73]. In other words, it suggests that up to 82% of all amino acid replacements in S.
paradoxus are removed by purifying selection [73].
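The interpretation above can be written out as a short Python sketch. The function names are hypothetical, and the per-gene values in the usage lines are invented for illustration only; the only figure taken from the text is the mean dN/dS of about 0.18, from which the 82% follows as 1 - 0.18.

```python
def interpret_omega(dn: float, ds: float) -> str:
    """Classify a gene by its dN/dS ratio, following the usual reading."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    omega = dn / ds
    if omega > 1:
        return "positive selection"
    if omega < 1:
        return "purifying selection"
    return "neutral"


def fraction_removed(mean_omega: float) -> float:
    """Upper bound on the share of amino acid replacements removed by
    purifying selection, taken as 1 - dN/dS."""
    return 1.0 - mean_omega


# Illustrative (made-up) dN and dS values for two genes.
print(interpret_omega(0.02, 0.10))   # dN/dS = 0.2
print(interpret_omega(0.30, 0.20))   # dN/dS = 1.5
# Mean dN/dS of ~0.18 reported in the text.
print(round(fraction_removed(0.18), 2))
```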
Note that purifying selection is the conservative force in molecular evolution, whereas positive
selection is the diversifying force that drives molecular adaptation. Overall, the list of positively selected
genes is relatively short compared to the numbers of positively selected genes reported in other studies (e.g.,
the human to chimpanzee comparison yields several hundred human-specific genes under selection [74–76]).
This observation could be a consequence of the averaging effect of the large comparison group, which
included mammals very distantly related to solenodons.
The dN/dS ratios can also be used as a proxy to illustrate the rate of evolution for proteins. By
looking at the trends in fast-evolving genes (dN/dS > 0.25) we can make inferences about the factors that
shaped the genome of this species during the millions of years of island isolation. To summarize the
functional contributions, we used the PANTHER Overrepresentation Test and GO Ontology database based
on the H. sapiens (Table S1) and M. musculus (Table S2) genes [77]. Interestingly, genes involved in the
inflammatory response and located on cell surfaces were overrepresented among the rapidly
evolving genes in the Solenodon genome compared either to the human or mouse databases (Table S1 and
Venom gene identification
Since solenodon is one of very few venomous eutherian mammals, the putative venom genes were of
special interest in the solenodon genome. While there was no saliva sample in our possession
that could be analyzed for the expressed toxin genes, a comparative genome approach could be applied as
an indirect way to find venom genes orthologous to genes expressed in venom for other species. First, we
identified 6,534 toxin and venom protein representatives (Tox-Prot) from UniProt (UniProt,
RRID:SCR_002380) [78], and queried them with BLAST against the current S. paradoxus genome assembly.
The hit scaffolds were then extracted from the AUGUSTUS CDS prediction file. The same Tox-Prot
sequences were used for Exonerate with the protein-to-genome model. The hits were used as queries
against the NCBI database to confirm gene identity, and were further examined through phylogenetic analyses with
selected model mammalian and venomous reptile genes (also adding randomly selected sequences for each
gene, to reduce clade bias). The retrieved sequences were aligned with MUSCLE [64], followed by a
maximum likelihood (WAG+I+G) phylogenetic reconstruction. Hits were matched against their respective
references in an alignment and visually inspected.
As a result, we identified 44 gene hits of the 16 most relevant protein venom classes (all present in
snakes) in the S. paradoxus genome (Table 7). Inspection of pairwise MUSCLE alignments of the putative
Solenodon venom genes (Database S7) with their animal homologs revealed several interesting cues. The
putative venom genes could not be confirmed through genomic information alone, yet they cannot be
discarded given that they were matched to high homology regions of closely related genes, such as those
originally recruited into venom. There were also unusual insertions not found in other species’ venom
genes. Specifically, an insertion in a serine protease, a gene with a role in coagulation (namely coagulation
factor X), is not present in known homologs. The insertion seems to be located at the start of the second
exon. This particular gene was further analyzed to understand the insertion and its potential functional
consequences (Figure 7). Finally, none of the known venom genes from the most closely related venomous
insectivore (Blarina brevicauda) were found in this study. Our results indicate that a more detailed
study of Solenodon venom genes using a transcriptome obtained from a fresh saliva sample is needed to
address their molecular evolution and function.
Genomic variation and demographic history inference
Once the reference was assembled as a consensus of the sequences obtained
from the five S. p. woodi individuals, polymorphisms were identified in the six individual genomes by
aligning them to the combined reference. Single-nucleotide variants and short indels were identified in
five southern and one northern individual using Bowtie2 (Bowtie, RRID:SCR_005476) [79], SAMtools and
BCFtools (SAMTOOLS, RRID:SCR_002105) [80], and VCFtools (VCFtools, RRID:SCR_001235) [81]. The S. p.
woodi individuals differed from the reference by an average of 1.25 million polymorphisms, and the S. p.
paradoxus individual differed by 2.65 million from the reference assembly.
Whole-genome solenodon SNV rates, defined as the ratio of all observed SNVs to all possible SNV
sites in the genome, were calculated and found to be low relative to other mammals (Figure
8) [82–85]. To enable this comparison, the same calculation was used for every species: SNVs were not filtered
by repetitive regions or mappability mask, and the number of possible SNV sites was defined as the genome
assembly size minus the number of unknown base pairs ('N').
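The definition above amounts to a single division, sketched here in Python. The function names are hypothetical and the toy scaffolds and SNV count are invented for illustration; they are not values from this study.

```python
def count_sites(sequences):
    """Return (assembly_size, n_unknown) for an iterable of scaffold strings."""
    assembly_size = 0
    n_unknown = 0
    for seq in sequences:
        assembly_size += len(seq)
        n_unknown += seq.upper().count("N")
    return assembly_size, n_unknown


def snv_rate(n_snvs: int, assembly_size: int, n_unknown: int) -> float:
    """SNV rate = observed SNVs / (assembly size - unknown 'N' bases)."""
    possible_sites = assembly_size - n_unknown
    if possible_sites <= 0:
        raise ValueError("no callable sites in the assembly")
    return n_snvs / possible_sites


# Toy assembly of two scaffolds (made-up sequences) with 2 observed SNVs.
size, unknown = count_sites(["ACGTNNNNAC", "GGCTA"])
rate = snv_rate(2, size, unknown)
print(size, unknown, rate)
```

Because no repeat or mappability mask is applied, the denominator depends only on the assembly, which is what makes the rates comparable across species processed the same way.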
Based on the variation data from the genomes of two subspecies (S. p. woodi and S. p. paradoxus),
we estimated population dynamics using the Pairwise Sequentially Markovian Coalescent (PSMC) model [86].
PSMC uses a coalescent approach to estimate changes in population size, which allowed us to create a TMRCA
distribution across the genome and estimate the effective population size (Ne) in recent evolutionary
history (e.g., from 10,000 to 1 million years ago).
Demographic history was inferred separately for S. p. woodi and S. p. paradoxus, and the resulting
plots revealed differences in demographic histories of the two subspecies (Figure 9). Each southern
individual was considered separately and their demographic histories overlapped. The difference in
demographic history provides another argument in favor of a subspecies split, as evidenced by distinctly
different effective population sizes at least since 300 Kya. According to this analysis, the northern
solenodon subspecies currently has a much larger Ne, which has expanded relatively recently, between
10,000 and 11,000 years ago (Figure 9). Prior to that, it was the southern subspecies (S. p. woodi) that had a
larger Ne. At the same time, the demographic history inferred for both populations showed similar cyclical
patterns of expansion and contraction around the mean of 6,000 “effective” individuals for the southern
subspecies (S. p. woodi) and 3,000 for the northern subspecies (S. p. paradoxus). One unusual result of this
analysis is that the northern subspecies shows a much lower Ne for all but the most recent time period.
Development of tools to study population and conservation genetics of S. paradoxus
The presence of genome wide sequences of multiple individuals from two subspecies created a
possibility for the development of practical tools for conservation genetics of this endangered species.
Generally, microsatellite loci are both abundant and widely distributed throughout the genome, while
usable loci are characterized by a unique flanking DNA sequence so that a single locus can be
independently amplified in many individuals [87–89]. The major advantages of microsatellite markers are
well known: codominant transmission, high levels of polymorphism leading to high information
content, high mutation rates that allow differentiation between individuals or populations within a species,
and ease of genotyping. While a genome obtained from one individual can be searched for potentially
variable microsatellite loci, this would (1) miss the majority of variable loci not represented in the
individual’s two chromosomes, and (2) produce many false positives that prove monomorphic in
laboratory tests (usually by electrophoresis of the amplified fragments from population samples). The
availability of several genomes allows generation of a more comprehensive set of variable markers,
while reducing false positives.
All three assemblies from this study (A, B and C) were independently analyzed using a short tandem
repeat (STR) detection pipeline. Each assembly was analyzed separately with TRF (Tandem Repeats
Finder) to locate and display tandem repeats [90]. Each of the six individual samples from the two
solenodon subspecies (five from S. p. woodi and one from S. p. paradoxus) was aligned to the reference
assemblies A, B, and C with the Burrows-Wheeler Aligner (Li and Durbin, 2009). Each set of individual alignments
was analyzed with HipSTR [91]. Only loci supported by more than 20 reads in the sample alignments were
considered for further steps in the search for variable microsatellite loci. The result of this search was saved
in a Variant Call Format (VCF) file that includes annotations of all loci that varied between samples
and passed the minimum read requirement, that is, were successfully identified in silico in the
data from at least one individual. The loci that did not pass these criteria were labeled as unsuccessfully
verified and excluded from the list.
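To make the notion of a tandem repeat locus concrete, the following Python sketch finds perfect short tandem repeats with a regular expression. It is a deliberately simplified stand-in, not the algorithm used by TRF (which also detects imperfect repeats); the function name, the unit-length range and the minimum copy number are assumptions chosen for illustration.

```python
import re


def find_perfect_strs(seq: str, min_unit: int = 2, max_unit: int = 6,
                      min_copies: int = 4):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # A unit of unit_len bases followed by at least min_copies - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves built from a shorter motif
            # (e.g. "CACA"), so each locus is reported once with its minimal unit.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


# Toy sequence with a (CA)6 dinucleotide and an (AGC)4 trinucleotide repeat.
toy = "GGG" + "CA" * 6 + "TTT" + "AGC" * 4 + "G"
print(find_perfect_strs(toy))
```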
The remaining loci were subjected to additional filtering: all genotypes that had less than 90%
posterior probability according to HipSTR [91], genotypes with a flank indel in more than 15% of reads, and
genotypes with more than 15% of reads with detected PCR stutter artifacts were discarded. The final set,
which contains loci that have at least two allele calls in two different individuals after filtering, has been
deposited in the polymorphic microsatellite database (Database S8). This database contains a list of
the variable microsatellites discovered, a total of 1,200 bp of flanking sequence for primer construction, and
information on whether and where each locus was found to be variable: between subspecies, or within one of the
subspecies. We also report the type (di-, tri-, etc.), number of repeats, number of variants, % variable, and
provide up to 600 bp of flanking sequence on each side that can be used to develop primer sequences
(Database S8).
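The genotype filters listed above can be sketched in Python as follows. This is an illustration rather than the code used in the study: the GenotypeCall record and its field names are hypothetical (they do not reproduce HipSTR's VCF fields), and the reading of "at least two allele calls in two different individuals" as "at least two individuals retained, with at least two distinct alleles among them" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GenotypeCall:
    sample: str
    alleles: tuple          # repeat-length alleles called for this sample
    posterior: float        # genotype posterior probability
    total_reads: int
    flank_indel_reads: int  # reads with an indel in the flanking sequence
    stutter_reads: int      # reads showing PCR stutter artifacts


def passes_filters(call, min_posterior=0.90, max_flank_indel=0.15,
                   max_stutter=0.15):
    """Apply the three thresholds described in the text to one genotype."""
    if call.total_reads == 0 or call.posterior < min_posterior:
        return False
    if call.flank_indel_reads / call.total_reads > max_flank_indel:
        return False
    return call.stutter_reads / call.total_reads <= max_stutter


def locus_is_variable(calls):
    """Assumed criterion: two or more retained individuals that together
    carry two or more distinct alleles."""
    kept = [c for c in calls if passes_filters(c)]
    samples = {c.sample for c in kept}
    alleles = {a for c in kept for a in c.alleles}
    return len(samples) >= 2 and len(alleles) >= 2


# Made-up calls at one locus for three hypothetical samples.
calls = [
    GenotypeCall("sample1", (10, 10), 0.99, 30, 1, 2),
    GenotypeCall("sample2", (10, 12), 0.95, 25, 0, 1),
    GenotypeCall("sample3", (14, 14), 0.50, 40, 0, 0),  # fails the posterior filter
]
print(locus_is_variable(calls))
```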
In this study, we sequenced and assembled the genome of an endangered Antillean mammal that
survived tens of millions of years of island isolation, but nevertheless is currently threatened with extinction
due to anthropogenic activities. Our approach demonstrated sequencing, assembly and annotation of a
genome of a highly divergent lineage within the placental mammal tree, delivering an important
phylogenetically diverse mammalian genome for analysis in a comparative context [92]. Although the full
description of genome diversity of this rare enigmatic mammal needs to be further improved with more
samples and analyses, our initial assembly of the solenodon genome contributes information and tools for
future studies of evolution and conservation. Future studies can combine the current genome annotations
with additional genetic and ecological data from further sampling.
With the new genome-wide assembly, we inferred a phylogeny that validates previous estimates of
the time of divergence of Solenodon from other eulipotyphlan insectivores [3,12], also providing a window
into genetic underpinnings of adaptive features, including genes responsible for inflammation and venom,
and how these may reflect its adaptation. In addition, we developed tools that will help guide future
genome studies as well as conservation surveys of the remaining solenodon populations on the island of
Hispaniola. In this study, we have taken the first step toward whole-genome analysis of Solenodon. A
more complete genome sequence may provide a better picture of its evolutionary history, possible
signatures of selection, and clues about the genetic basis of adaptive phenotypic features facilitating life on
Caribbean islands, and may contribute to better insight into island evolution and possible responses to current
and future climate change.
The string graph assembly approach for homozygous genomes
The advantages of the string graph assemblies in our particular case can be understood by looking
at the nature of the underlying algorithms. The de Bruijn graph is a mathematical concept that simplifies
genome assembly by reducing information from short next generation sequencing reads, of which there
can be billions, to an optimized computational problem that can be solved efficiently [93]. However, some
information may indeed be lost, as the set of reads is effectively replaced with a set of much shorter k-mers
to produce an optimal assembly path. Usually, this is compensated by overwhelming amounts of data in
high coverage assemblies, and the difference in effectiveness between this and other types of algorithms,
barring speed, becomes less evident. Even as sequencing becomes cheaper, genome projects continue to
rely on ever higher high-quality coverage, increasing the cost of the sequence data rather than trying to
increase the efficacy of the assembly itself. In contrast, string graph-based algorithms for genome
assembly are intrinsically less error-prone than de Bruijn graph-based ones, since building and resolving a
string graph does not require breaking reads into k-mers and therefore does not sacrifice long-range
information [18]. This also helps reduce the probability of mis-assemblies: in theory, any path in a string
graph represents a valid assembly [94,95]. String graph-based approaches have already been applied
successfully to assemblies from high coverage read sets; one example is Assemblathon 2 [96]. In
projects with lower genome coverage like ours, adoption of a string graph based approach might be of
benefit to the genome assembly because it uses more information from the sequences. However, there are
two major downsides for its widespread use: (1) it is more computationally intensive than methods utilizing
de Bruijn graph algorithms, and (2) the implementation of the string graph model is sensitive to sequence
variation, and the effectiveness of this approach may depend on the level of heterozygosity in a DNA
sample. It is worth noting that Fermi [18] was primarily intended for variant annotation via de novo local
assembly, and not for whole genome assembly. Nevertheless, the new genome-wide data produced by our
pipeline was sufficient for the comparative analysis, and has been annotated for the genes and repetitive
elements, and interrogated for phylogeny, demographic history and signatures of selection. In addition,
using the current genome assembly we were able to annotate large transpositions and translocations in the
Solenodon in relation to the closest available high-quality genome assembly (S. araneus).
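The loss of long-range information when reads are replaced by k-mers, discussed above, can be shown with a small Python toy. The reads are invented for illustration and the function name is hypothetical. Two different read sets share the repeat-like core TGCG; once decomposed into 3-mers they become indistinguishable, so the pairing of left and right flanks that each whole read records is lost.

```python
def kmer_set(reads, k):
    """Decompose reads into the set of k-mers a de Bruijn graph is built from."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


# Two made-up read sets that differ only in which flanks are joined
# across the shared core "TGCG".
reads_a = ["ATGCGA", "CTGCGT"]
reads_b = ["ATGCGT", "CTGCGA"]

# With short k-mers the two read sets yield identical k-mer sets.
print(kmer_set(reads_a, 3) == kmer_set(reads_b, 3))
# Keeping reads whole (here k equals the read length) preserves the difference.
print(kmer_set(reads_a, 6) == kmer_set(reads_b, 6))
```

An overlap-based (string graph) assembler works on the intact reads, so in this toy case it would still distinguish the two read sets, whereas a de Bruijn graph built with k = 3 could not.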
Potential implications
Comparative genomics
We have taken advantage of the fact that the genome of this mammal shows reduced
heterozygosity [12], which made it feasible to combine samples from multiple individuals in order to provide
higher coverage and achieve a better assembly using Illumina reads. The current assembly was performed
without mate pair libraries and without high quality DNA; nevertheless, it is comparable in quality
to other available mammalian assemblies. In terms of contig N50 as a measure of contiguity, our assembly
has a contig N50 of 54,944, while among the most closely related available genome sequences, the Sorex
araneus (SorAra2.0) assembly has a contig N50 of 22,623, and the Condylura cristata (ConCri1.0)
assembly has a contig N50 of 46,163. Scaffold N50 values should not be compared, because
this study used only paired-end reads, unlike the S. araneus and C. cristata assemblies. More importantly, the
assembly provided complete or partial annotation for more than 95% of the genes based on the
evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected
from OrthoDB v9 by BUSCO [28]. Among these, 4,416 single copy genes that have clear one-to-one
orthologs across species (single copy orthologs) [97,98] were chosen for a subsequent comparative analysis
involving genes in different mammalian species.
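For readers unfamiliar with the contiguity measure used above, N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal Python sketch follows; the function name and the toy contig lengths are illustrative and are not taken from the assemblies compared here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length


# Toy contig lengths: total 290, half 145; 80 + 70 = 150 reaches it.
print(n50([80, 70, 50, 40, 30, 20]))
```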
Specifically, the repetitive composition of the solenodon genome was evaluated. Compared to the
estimates based on the reference human genome [99], the lower number of SINEs (no
Alu elements) is very conspicuous, and the number of LINEs is substantially lower as well. Transpositions and translocations
between the genomes of S. paradoxus and S. araneus were identified; very few rearrangements and
translocations between the assembly and the S. araneus genome were found. At the same time, higher
coverage would be needed for more detailed analyses, for instance to address the relative length and
similarity of indels and copy number polymorphisms between solenodon populations [100].
Evolutionary genomics
Using the nuclear genomes, we were able to confirm earlier divergence time estimates based on
sets of genes [3], as well as full mitochondrial sequences [12]. The whole genome analysis points to a split
between Solenodon and other eulipotyphlans that occurred around 74 Mya (Figure 5), which is very close
to our earlier estimate of 78 Mya based on the full mitochondrial genome [12]. Our result does not support
the 60 Mya point estimate made by a phylogenetic analysis based on sequences of five slowly evolving
nuclear genes [13].
Our assembly provided enough gene sequences to gain insights into the evolution of functional
elements in the solenodon genome. It is reasonable to suggest that this species historically had low
effective population sizes, if they remained close to those estimated by this study, about 4,000 on
average (Figure 9). Among the 4,416 single copy orthologs analyzed for dN/dS ratios over the entire length
of a protein-coding gene between S. paradoxus and 10 other mammals, 12 genes were identified as
positively selected. Among these, the majority were membrane proteins, with one gene (CCRNL4) similar to
a circadian clock regulator (Table 6). The short list of positively selected genes could
be a consequence of the large comparison group, which included mammals very distantly related to
solenodon; its genes need to be compared with those of more closely related species, for example once the
genome of S. cubanus is reported and better gene annotations for Sorex araneus become available.
Solenodon is one of the few mammals that use venomous saliva to disable prey. It delivers its venom
similarly to snakes, using its teeth to inject venomous saliva into its target. Different approaches could be
used to characterize venom genes, such as the use of non-curated databases to widen the search spectrum,
which may include different molecules that could be found in Solenodon. For example, 6,534 toxin and
venom protein representatives can be found in the UniProt database. It is also important to note that the
database of venom gene sequences may not include those relevant to solenodons given their deep
divergence from any other venomous mammalian species. The venom of Solenodon may contain novel
protein modifications with unknown potential or application, making it valuable for future detailed
Genes associated with venom, such as serine proteases involved in coagulation (namely
coagulation factor X), are of major interest, since factor X in solenodon exhibited unusual insertions when
compared to its homologs (Figure 7). An unusual insertion in a serine protease has
previously been found in another venomous mammalian species, the shrew Blarina brevicauda, but in a different
gene than in solenodon. Coagulation factor X is involved in the circulatory system and is responsible for
activating thrombin and inducing clotting. The insertion in the coagulation factor X gene seems to be a
hydrophilic alpha helix with three potential protein-protein interaction sites. It occurs at the end of the
region annotated as the signal peptide, while having a signal peptide cleavage site itself at the beginning of
its sequence. The factor X protein structure was successfully modeled by Swiss-Model based on the
venomous elapid snake Pseudonaja textilis (pdb: 4bxs), to have a heavy chain that contains the serine
protease activity, which was modeled with a high degree of confidence (Figure 10). The venom
prothrombin activator has an advantage as a toxin in part due to modifications in inhibition sites, making it
difficult to stop its activity. Another advantage is that the molecules are always found in an active form
(Kinin). We hypothesize that the insertion could allow a more successful interaction with molecules capable
of activating the F10 protein. In mice, injections of venom extracted from solenodons and of venom prothrombin activator
can both be lethal in minutes [7,101]. The insertion was also searched against possible mobile
DNA elements, but no matches were found. Our results should be followed up in the future by detailed
pharmacological studies.
Conservation genetics
The low variation that exists between the solenodon sequences is hardly surprising, because the
theoretical consensus in conservation genetics predicts that small populations lose genetic diversity more
rapidly than large populations [102], and measures of genetic diversity have been explicitly suggested to
IUCN as a factor to consider in identifying species of conservation concern [103]. The historical Ne for each
subspecies was examined by our analysis (Figure 9), and showed lower levels recently in S. p. woodi. Due to
the limitations of PSMC, the most recent Ne cannot be calculated from the genome sequences [86].
Therefore, this estimate of diversity does not reflect the recent impact on the solenodon population caused
by anthropogenic factors in the last 10,000 years (Figure 9).
Many endangered species with small populations also have reduced heterozygosity across their
genomes, and would benefit from a computational approach that reduces the cost and optimizes the
amount of data for the genome assembly. Real-life scenarios in which no high-quality DNA can be
produced because of the remoteness of the sampling location or difficulty in transportation and storage, or in which
high coverage cannot be produced due to limited funds, are well known to many, especially in the
field of conservation genetics. Difficult field conditions and international regulations complicate
obtaining samples with high molecular weight DNA. To aid future conservation studies, we have mined the
current dataset for microsatellite markers that are useful within and between subspecies, to be used as
tools for studies on population diversity, census and monitoring.
The comparative analysis of the number and the length of microsatellite alleles pointed once more
to the advantage of assembly B over A and C. The average length of microsatellite short tandem repeats was
highest in assembly B: 20.95 (assembly A) vs. 21.14 (assembly B) vs. 18.86 (assembly C). This may be
a direct consequence of the high number of microsatellite alleles that were successfully genotyped in all of
the southern samples for assembly B (2,660), as well as microsatellites that proved variable between the
two subspecies but fixed within the southern samples (639). The low number of variable microsatellites
between the two subspecies was likely due to the reduced amount of information obtainable from a single
low coverage genome of the northern subspecies (S. p. paradoxus) used in this study. Venn diagrams
showing overlap in microsatellite variation in three assemblies are presented in Figure 11.
Recently, a genetic survey using mitochondrial cytochrome b and control region sequences from 34
solenodon samples identified distinct haplotypes in northern and southern Hispaniola [16], along with a
distinctive third group, a small remnant population at the Massif de la Hotte in the extreme western tip of
Haiti [16,104] not sampled for this study. The north-south subspecies subdivision within S. paradoxus was
further supported by mitogenomic sequences [12]. The island of Hispaniola has been divided into three
main biogeographic regions that differ in climate and habitat. The north and center of the island provide
the largest area with known solenodon populations, and show no discontinuity with the southeast.
However, the solenodon populations in the southwestern part of the island are currently geographically
isolated by Cordillera Central, and may have been isolated in the past by the ancient marine divide across
the Neiba Valley (Figure 2). This geographic isolation is likely the reason why the S. p. paradoxus in the
larger northern area, and S. p. woodi in the southwest, show morphological differences suggestive of
separate subspecies [15]. Future conservation strategies directed at protecting and restoring solenodon
populations on Hispaniola should take into consideration this subdivision, and treat the two subspecies as
two separate conservation units.
Provenance of the samples is shown on the map (Figure 2), with coordinates listed in Table S4.
Solenodons were caught with the help of local guides (Nicolás Corona and Yimell Corona). During the day,
potential locations were inspected for animal tracks, burrows, droppings and other signs of
solenodon activity. At dawn, ambushes were set up in the forested areas along the potential animal trails.
The approaching solenodons were identified by sound, and chased with flashlights once they came near. Since
solenodons move slowly, animals were picked up by their tails, which is the only way to avoid potentially
venomous bites. All wild caught animals were released back into their habitats within 10 minutes after their
capture. Before the release, the animals’ tails were marked with a Sharpie pen to avoid recapturing.
Blood was drawn by a licensed ZooDom veterinarian () from the vena jugularis using a
3mL syringe with a 23G x 1 needle. The blood volume collected never exceeded 1% of body weight of
animals. Before the draw, an aseptic technique was applied using a povidone-iodine solution, followed by
isopropyl alcohol. Once collected, the samples were transferred to a collection tube with anticoagulant (BD
Microtainer, 1.0 mg K2EDTA for 250–500 µL volume). Collection tubes were refrigerated and transported to
the lab at the Instituto Tecnológico de Santo Domingo (INTEC) where DNA was extracted from samples
using the DNeasy Blood & Tissue kit (Qiagen, Hilden, Germany). This study has been reviewed and
approved by the Institutional Animal Care and Use Committee of the University of Puerto Rico at Mayagüez
(UPR-M). All the required collection permits had been obtained before any field work was started. The
samples were collected and exported in compliance with Export Permit No: VAPB-00909 (Dominican
Republic Environment and Natural Resources Ministry Viceministry of Protected Areas and Biodiversity
Department of Biodiversity) and imported in compliance with CITES/ESA Import permit number
14US84465A/9 (University of Illinois Board of Trustees).
Sequences for S. p. woodi were generated by Illumina HiSeq 2000 (Illumina Inc.) with 100 bp paired-end
reads. The Illumina HiSeq generated raw images utilizing HCS (HiSeq Control Software v2.2.38) for
system control and base calling through an integrated primary analysis software called RTA (Real Time
Analysis, v1.18.61.0). The BCL (base calls) binaries were converted into FASTQ utilizing the Illumina package
bcl2fastq (v1.8.4). Sequences for S. p. paradoxus were generated by the Illumina MiSeq V3 (Illumina Inc.) at
the Roy J. Carver Biotechnology Center, University of Illinois. The sequencing data for each sample used in
this study is presented in Table S5.
Availability of supporting data and materials
Database S1: Lists of repeats in the solenodon genome (assemblies A and B)
Database S2: List of protein coding genes in the solenodon genome (assembly B)
also cds for each gene and translated sequences
Database S3: List of the annotated non-coding RNAs in the solenodon genome
Database S5: List of single-copy orthologs in the solenodon genome (columns include: ENOG id, gene
Database S6: List of genes with dN/dS values and GO annotations
Database S7: List of venom genes
Database S8: Microsatellite loci discovered in genomes of two solenodon subspecies, Solenodon paradoxus
paradoxus (northern) and S. p. woodi (southern), with alleles, 600 bp flanking regions (a total of 1,200 bp per
locus), and frequency information for the two subspecies
Database S9: Lists of single nucleotide differences (SND) from the assembled individual genomes of Spa-1
(from Solenodon paradoxus paradoxus) and Spa-K, -L, -M, -N, and -O (from the five S. p. woodi) used to
show estimates of heterozygosity in Figure 8 (see explanation in text)
Supporting raw data is in the NCBI SRA [ENA: NKTL01000000, PROJECT: PRJNA368679] and genome
assemblies, custom codes and annotations are in the GigaScience GigaDB database [21].
bp: base pair; BCL: base call; BUSCO: Benchmarking universal single-copy orthologs; CDS: coding
sequence; CEGMA: Core eukaryotic genes mapping approach; Gb: Giga base; GO: gene ontology; HMM:
hidden Markov model; IUCN: International Union for Conservation of Nature; kb: kilo base; Kya: thousand
(kilo) years ago; Mya: million years ago; PSMC: Pairwise Sequentially Markovian Coalescent; SND: single
nucleotide difference; SNP: Single Nucleotide Polymorphism; SNV: Single Nucleotide Variation; SRA:
Sequence Read Archive; STR: short tandem repeat; VCF: Variant Call Format.
Authors have declared that they do not have any competing interests.
The authors thank Nicolás Corona and Yimell Corona for assistance in collecting samples. A special thanks
to Kara Fore for helping to edit this manuscript. Authors at the University of Puerto Rico at Mayagüez were
supported in part by NSF award #1432092. Authors at the Theodosius Dobzhansky Center for Genome
Bioinformatics were supported by the Russian Ministry of Science Mega-grant (no 11.G34.31.0068), and St.
Petersburg State University grant (no 1.50.1623.2013). Sample collection in Dominican Republic was
performed under permit number 00201171139 from the Ministry of Environment and Natural Resources.
1. MacPhee RDE, Flemming C, Lunde DP. “Last occurrence” of the Antillean insectivoran Nesophontes: new
radiometric dates and their interpretation. American Museum novitates; no. 3261. New York, NY: American
Museum of Natural History; 1999.
2. Ottenwalder JA. Systematics and biogeography of the West Indian genus Solenodon. Biogeogr. West
Indies Patterns Perspect. Second Ed. CRC Press; 2001. p. 253–329.
3. Roca AL, Bar-Gal GK, Eizirik E, Helgen KM, Maria R, Springer MS, et al. Mesozoic origin for West Indian
insectivores. Nature. Nature Publishing Group; 2004;429:649–51.
4. Verill AH. Notes on the Habits and External Characters of the Solenodon of San Domingo (Solenodon
paradoxus). Am. J. Sci. 1907;XXIV:55–7.
5. Allen JA. Notes on Solenodon paradoxus Brandt. Bull. Am. Museum Nat. Hist. 1908;XXIV:505–517.
6. Brandt JF. De Solenodonte: novo mammalium insectivororum genere. Mem. l’Académie Impériale des
Sci. St. Pétersbg. l’Académie Impériale des Sciences de St. Pétersbourg; 1833;2:459–78.
7. Derbridge JJ, Posthumus EE, Chen HL, Koprowski JL. Solenodon paradoxus (Soricomorpha:
Solenodontidae). BioOne; 2015;
8. Feldhamer GA. Mammalogy: adaptation, diversity, ecology. JHU Press; 2007.
9. Wible JR. On the cranial osteology of the Hispaniolan solenodon, Solenodon paradoxus Brandt, 1833
(Mammalia, Lipotyphla, Solenodontidae). Ann. Carnegie Museum. BioOne; 2008;77:321–402.
10. Folinsbee KE, Müller J, Reisz RR. Canine grooves: morphology, function, and relevance to venom. J.
Vertebr. Paleontol. BioOne; 2007;27:547–51.
11. Dufton MJ. Venomous mammals. Pharmacol. Ther. Elsevier; 1992;53:199–215.
12. Brandt AL, Grigorev K, Afanador-Hernández YM, Paulino LA, Murphy WJ, Núñez A, et al. Mitogenomic
sequences support a north–south subspecies subdivision within Solenodon paradoxus. Mitochondrial DNA
Part A [Internet]. Taylor & Francis; 2017;28:662–70. Available from:
13. Sato JJ, Ohdachi SD, Echenique-Diaz LM, Borroto-Páez R, Begué-Quiala G, Delgado-Labañino JL, et al.
Molecular phylogenetic analysis of nuclear genes suggests a Cenozoic over-water dispersal origin for the
Cuban solenodon. Sci. Rep. Nature Publishing Group; 2016;6.
14. Ottenwalder JA. The distribution and habitat of Solenodon in the Dominican Republic. 1985.
15. Ottenwalder JA. The systematics, biology, and conservation of Solenodon. 1991;
16. Turvey ST, Peters S, Brace S, Young RP, Crumpton N, Hansford J, et al. Independent evolutionary
histories in allopatric populations of a threatened Caribbean land mammal. Divers. Distrib. Wiley Online
Library; 2016;
17. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically improved memory-efficient
short-read de novo assembler. Gigascience. BioMed Central; 2012;1:18.
18. Li H. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.
Bioinformatics. Oxford Univ Press; 2012;28:1838–44.
19. Boetzer M, Henkel C V, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using
SSPACE. Bioinformatics. Oxford Univ Press; 2011;27:578–9.
20. Wang Y, Lu Y, Zhang Y, Ning Z, Li Y, Zhao Q, et al. The draft genome of the grass carp
(Ctenopharyngodon idellus) provides insights into its evolution and vegetarian adaptation. Nat. Genet.
Nature Research; 2015;47:625–31.
21. Grigorev K, Kliver S, Dobrynin P, Komissarov A, Wolfsberger W, Krasheninnikova K, et al. Supporting
data for “Innovative assembly strategy contributes to understanding the evolution and conservation
genetics of the endangered Solenodon paradoxus from the island of Hispaniola.” GigaScience Database.
22. Starostina E, Tamazian G, Dobrynin P, O’Brien S, Komissarov A. Cookiecutter: a tool for kmer-based read
filtering and extraction. bioRxiv. Cold Spring Harbor Labs Journals; 2015;24679.
23. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of
k-mers. Bioinformatics. 2011;27:764–70.
24. Marçais G, Yorke JA, Zimin A. QuorUM: an error corrector for Illumina reads. PLoS One.
25. Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly.
Bioinformatics. 2013;btt310.
26. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief.
Bioinform. 2010. p. 473–83.
27. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies.
Bioinformatics. 2013;29:1072–5.
28. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva E V, Zdobnov EM. BUSCO: assessing genome
assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;btv351.
29. Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic
genomes. Bioinformatics. 2007;23:1061–7.
30. Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome
assembly evaluation. Genome Biol. 2013;14:R47.
31. Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: Algorithms for genome multiple
sequence alignment. Genome Res. 2011;21:1512–28.
32. Kolmogorov M, Raney B, Paten B, Pham S. Ragout: a reference-assisted assembly tool for bacterial
genomes. Bioinformatics. 2014;30:i302–i309.
33. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996.
34. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic
genomes. Mob. DNA. 2015;6:11.
35. Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC
Bioinformatics. 2005;6:31.
36. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of
alternative transcripts. Nucleic Acids Res. 2006;34:W435–W439.
37. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic
Acids Res. Oxford Univ Press; 2011;gkr367.
38. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol.
Elsevier; 1990;215:403–10.
39. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, et al. The Pfam protein families
database. Nucleic Acids Res. 2004;32:D138–D141.
40. Consortium U, others. UniProt: a hub for protein information. Nucleic Acids Res. Oxford Univ Press;
41. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. Oxford Univ
Press; 2013;29:2933–5.
42. Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, et al. Rfam 12.0: updates to the
RNA families database. Nucleic Acids Res. 2014;gku1063.
43. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic
sequence. Nucleic Acids Res. 1997;25:955–64.
44. Seemann T, Booth T. BARNAP: BAsic Rapid Ribosomal RNA Predictor [Internet]. Berlin: GitHub; 2013. Accessed 1st March 2018.
45. Ruan J, Li H, Chen Z, Coghlan A, Coin LJM, Guo Y, et al. TreeFam: 2008 update. Nucleic Acids Res.
46. Li H, Coghlan A, Ruan J, Coin LJ, Heriche J-K, Osmotherly L, et al. TreeFam: a curated database of
phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:D572–D580.
47. Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, et al. eggNOG 4.5: a hierarchical
orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences.
Nucleic Acids Res. 2015;gkv1248.
48. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.
Bioinformatics. 2014;30:1312–3.
49. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. SMBE; 2007;24:1586–91.
50. Xia X, Xie Z, Salemi M, Chen L, Wang Y. An index of substitution saturation and its application. Mol.
Phylogenet. Evol. 2003;17.
51. Xia X, Lemey P. Assessing substitution saturation with DAMBE. In: Philippe Lemey, Marco Salemi and
Anne-Mieke Vandamme eds., editor. Phylogenetic Handb. A Pract. Approach to DNA Protein Phylogeny.
2nd ed. Cambridge Univ Press; 2009. p. 615–30.
52. Xia X. DAMBE6: New tools for microbial genomics, phylogenetics and molecular evolution. J. Hered.
53. Ksepka DT, Parham JF, Allman JF, Benton MJ, Carrano MT, Cranston KA, et al. The fossil calibration
database: a new resource for divergence dating. Syst. Biol. 2015;syv025.
54. Benton MJ, Donoghue PCJ, Asher RJ, Friedman M, Near TJ, Vinther J. Constraints on the timescale of
animal evolutionary history. Palaeontol. Electron. Paleontological Society; 2015;18:1–106.
55. Munthe K. Canidae, p. 124–143. Evol. Tert. Mamm. North Am. Cambridge Univ. Press. Cambridge.
56. Wang X, Whistler DP, Takeuchi GT. A new basal skunk Martinogale (Carnivora, Mephitinae) from late
Miocene Dove Spring Formation, California, and origin of new world mephitines. J. Vertebr. Paleontol.
BioOne; 2005;25:936–49.
57. Rambaut A. FigTree [Internet]. 2016. Available from:
Accessed 1st March 2018
58. Springer MS, Murphy WJ, Roca AL. Appropriate fossil calibrations and tree constraints uphold the
Mesozoic divergence of solenodons from other extant mammals. Mol. Phylogenet. Evol. 2018;(in press).
59. Hedges SB. Vicariance and Dispersal in Caribbean Biogeography. Herpetologica. 1996;52:466–73.
60. McDowell SB. The Greater Antillean insectivores. Bull. Am. Museum Nat. Hist. [Internet]. 1958;115:117.
Available from:
61. Butler PM. Phylogeny of the insectivores. In: Benton MJ, editor. Phylogeny Classif. Tetrapods. Oxford:
Clarendon; 1988. p. 117–41.
62. MacPhee RD., Novacek M. Definition and relationships of the Lipotyphla. In: Soule F, Novacek M,
McKenna M, editors. Mamm. Phylogeny, Vol. 2, Placentals. New York: Springer-Verlag; 1993. p. 13–31.
63. McKenna M, Bell S, Simpson S. Classification of mammals above the species level. New York: Columbia
University Press; 1997.
64. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids
Res. 2004;32:1792–7.
65. Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the
corresponding codon alignments. Nucleic Acids Res. 2006;34:W609–W612.
66. Soto-Girón MJ, Ospina OE, Massey SE. Elevated levels of adaption in Helicobacter pylori genomes from
Japan; a link to higher incidences of gastric cancer? Evol. Med. public Heal. 2015;eov005.
67. Gharib WH, Robinson-Rechavi M. The branch-site test of positive selection is surprisingly robust but
lacks power under synonymous substitution saturation and variation in GC. Mol. Biol. Evol. SMBE;
68. Tang H, Klopfenstein D, Pedersen B, Flick P, Sato K, Ramirez F, et al. GOATOOLS: Tools for Gene
Ontology [Internet]. Zenodo; 2015. Available from: Accessed 1st
March 2018
69. Baggs JE, Green CB. Nocturnin, a deadenylase in Xenopus laevis retina: a mechanism for
posttranscriptional control of circadian-related mRNA. Curr. Biol. 2003;13:189–98.
70. Oleksyk TK, Smith MW, O’Brien SJ. Genome-wide scans for footprints of natural selection. Philos. Trans.
R. Soc. London Ser. B Biol. Sci. 2010;365:185–205.
71. Fay JC, Wyckoff GJ, Wu CI. Positive and negative selection on the human genome. Genetics
72. Fay JC, Wu C-I. The Neutral Theory in the Genomic Era. Curr. Opin. Genet. Dev. 2001;11:642–6.
73. Ellegren H. Evolution: Natural selection in the evolution of humans and chimps. Curr. Biol. 2005.
74. Gayà-Vidal M, Albà M. Uncovering adaptive evolution in the human lineage. BMC Genomics
75. Bakewell MA, Shi P, Zhang J. More genes underwent positive selection in chimpanzee evolution than in
human evolution. Proc. Natl. Acad. Sci. 2007;104:7489–94.
76. Olson M V., Varki A. Sequencing the chimpanzee genome: insights into human evolution and disease.
Nat. Rev. Genet. 2003;4:20–8. doi:10.1038/nrg981
77. Mi H, Huang X, Muruganujan A, Tang H, Mills C, Kang D, et al. PANTHER version 11: expanded
annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements.
Nucleic Acids Res. 2016;gkw1138.
78. Jungo F, Bougueleret L, Xenarios I, Poux S. The UniProtKB/Swiss-Prot Tox-Prot program: a central hub of
integrated venom protein data. Toxicon. 2012;60:551–7.
79. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–9.
80. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics.
81. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and
VCFtools. Bioinformatics. 2011;27:2156–8.
82. Dobrynin P, Liu S, Tamazian G, Xiong Z, Yurchenko AA, Krasheninnikova K, et al. Genomic legacy of the
African cheetah, Acinonyx jubatus. Genome Biol. 2015;16:277. doi:10.1186/s13059-015-0837-4
83. Gordon D, Huddleston J, Chaisson MJP, Hill CM, Kronenberg ZN, Munson KM, et al. Long-read sequence
assembly of the gorilla genome. Science. 2016;352:aae0344. doi:10.1126/science.aae0344
84. Li R, Fan W, Tian G, Zhu H, He L, Cai J, et al. The sequence and de novo assembly of the giant panda
genome. Nature. 2010;463:11061106. doi:10.1038/nature08846
85. Cho YS, Hu L, Hou H, Lee H, Xu J, Kwon S, et al. The tiger genome and comparative analysis with lion and
snow leopard genomes. Nat. Commun. 2013;4. doi:10.1038/ncomms3433
86. Li H, Durbin R. Inference of human population history from individual whole-genome sequences.
Nature. 2011;475:493–6.
87. Weber JL. Human DNA polymorphisms and methods of analysis. Curr. Opin. Biotechnol. 1990;1:166–71.
88. Weber JL, Wong C. Mutation of human short tandem repeats. Hum. Mol. Genet. 1993;2:1123–8.
89. Weber JL, May PE. Abundant class of human DNA polymorphisms which can be typed using the
polymerase chain reaction. Am. J. Hum. Genet. 1989;44:388–96. PMC1715443
90. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res.
91. Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de
novo STR variations. Nat Methods. 2017 Jun;14(6):590-592. doi: 10.1038/nmeth.4267.
92. Koepfli K-P, Paten B, O’Brien SJ. The Genome 10K Project: a way forward. Annu. Rev. Anim. Biosci.
Annual Reviews; 2015;3:57–111.
93. Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nat.
Biotechnol. 2011;29:987–91.
94. Myers EW. Toward simplifying and accurately formulating fragment assembly. J. Comput. Biol.
95. Myers EW. The fragment assembly string graph. Bioinformatics. 2005;21:ii79–ii85.
96. Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, et al. Assemblathon 2: evaluating de
novo methods of genome assembly in three vertebrate species. Gigascience. 2013;2:10.
97. Gogarten JP, Olendzenski L. Orthologs, paralogs and genome comparisons. Curr. Opin. Genet. Dev.
1999. p. 630–6.
98. Creevey CJ, Muller J, Doerks T, Thompson JD, Arendt D, Bork P. Identifying Single Copy Orthologs in
Metazoa. PLOS Comput. Biol. 2011;7:e1002269.
99. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges
and solutions. Nat. Rev. Genet. 2012;13:36–46.
100. Volfovsky N, Oleksyk TK, Cruz KC, Truelove AL, Stephens RM, Smith MW. Genome and gene alterations
by insertions and deletions in the evolution of human and chimpanzee chromosome 22. BMC Genomics.
101. Rabb GB. Toxic salivary glands in the primitive insectivore Solenodon. Nat. Hist. Misc. 1959;170:1–3.
102. Allendorf FW, Luikart G. Conservation and the genetics of populations. John Wiley & Sons; 2009.
103. Willoughby JR, Sundaram M, Wijayawardena BK, Kimble SJA, Ji Y, Fernandez NB, et al. The reduction of
genetic diversity in threatened vertebrates and new recommendations regarding IUCN conservation
rankings. Biol. Conserv. 2015;191:495–503.
104. Turvey ST, Meredith HMR, Scofield RP. Continued survival of Hispaniolan solenodon Solenodon
paradoxus in Haiti. Oryx. Cambridge Univ Press; 2008;42:611–4.
105. Consortium GO, others. The Gene Ontology (GO) database and informatics resource. Nucleic Acids
Res. Oxford Univ Press; 2004;32:D258–D261.
Table 1. Description of the assembly strategies and comparison of metrics for the resulting assemblies
Assembly Names
Assembly Tools
Contig assembly tool
Scaffolding tool
Gap closing tool
Assembly Metrics
Total contigs (>1,000 bp)
Contig N50
Contig CEGMA (%) *
Contig BUSCO (%)
Total scaffolds (>1,000 bp)
Final N50
Final CEGMA (%)
Final BUSCO (%)
Percentage of Ns (%)
REAPR error-free bases (%)
REAPR low-scoring regions
REAPR incorrectly oriented reads
* BUSCO [28] and CEGMA [29] percentages are reported for all genes (complete and partial), while the percentage of
complete genes is shown in parentheses.
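The contig and scaffold N50 metrics listed in Table 1 can be illustrated with a minimal Python sketch. The function name `n50` and the example lengths are hypothetical and are not taken from the assemblies in this study; the 1,000 bp cutoff simply mirrors the threshold used in the table, and the tools actually used for evaluation (e.g., QUAST [27]) may handle edge cases differently.

```python
def n50(lengths, min_length=1000):
    """Return the N50 of sequences at least min_length bp long.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total retained bases.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return 0
    half_total = sum(kept) / 2
    running = 0
    for n in kept:
        running += n
        if running >= half_total:
            return n
    return kept[-1]


# Hypothetical contig lengths (bp), for illustration only.
example_lengths = [500, 1200, 3000, 4500, 8000, 15000]
print(n50(example_lengths))
```

Sorting in descending order and accumulating lengths until half of the total is reached is the usual way to compute this metric; sequences below the cutoff are dropped before the total is taken.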
Table 2. Pairwise genomic coverage for the three assemblies and the Sorex araneus genome (SorAra 2.0,
NCBI accession number GCA_000181275.2) obtained from the Progressive Cactus [31] alignments. While
all three assemblies have similar amounts of syntenic coverage to the Sorex genome, assembly B contains the
fewest structural rearrangements (inversions and translocations) compared to the other two
assemblies (A and C).
vs S. paradoxus woodi
vs S. araneus
Pairwise genome coverage (%) *
* Values in cells at the intersection of rows and columns represent the percentage (%) of coverage between the two
compared genome assemblies. Syntenic blocks between each of the three solenodon assemblies (A, B and C) were
compared to the S. araneus assembly, and 50 kbp syntenic blocks were identified using the ragout-maf2synteny
module of the software package Ragout [32].
Table 3. Repeat content of the Solenodon paradoxus genome (Assembly B), annotated by RepeatMasker
[33] with the RepBase library [34].
Length (bp)
Percentage (%)
Total interspersed repeats
LTR elements
DNA elements
Small RNAs
Simple repeats
Low complexity regions
Table 4. The weighted coverages of the genomes in the Progressive Cactus alignment [31], as calculated
against the C. familiaris genome. The weighted coverage of the S. paradoxus genome assembly from our
study is comparable to other high coverage mammalian genome assemblies. The cladogram used for multiple
genome alignment with Progressive Cactus is shown in Figure S1.
Query genome
Dog (Canis familiaris)
Cow (Bos taurus)
Common shrew (Sorex araneus)
Star-nosed mole (Condylura cristata)
Hispaniolan solenodon (Solenodon paradoxus)
* The weighted coverage of a genome to itself is parenthesized as it is not a comparative value
Table 5. Fossil-based priors associated with mammalian evolution used for calibration of divergence times
[53–56]. The 4,416 single copy orthologs identified in our assembly were used for phylogeny inference via
four-fold degenerate sites with programs RAxML [48] and PAML [49]. The resulting phylogenetic tree was
plotted with FigTree [57] and is presented in Figure 5.
Calibration prior
on clade
Node min.
age (Mya)
Node max.
age (Mya)
Opossum - placental
mammals split
Eutheria - Metatheria
Fossil (Benton et al. 2015)
Human - mouse
Archonta - Glires
Biostratigraphy (Benton and
Donoghue, 2007)
Primates, mouse - dog,
horse, cow
Euarchontaglires -
Fossil (Benton et al. 2015)
Dog - ferret
Canidae - Arctoidea
Fossil (Wang et al., 2005; Munthe,
Solenodon - hedgehog,
shrew, mole
Fossil (Benton et al. 2015)
Cow - horse
Artiodactyla as soft
Fossil (Benton et al. 2015)
Table 6. The putative targets of positive selection in the solenodon genome. The dN/dS values and the GO
categories for the 12 genes that showed signatures of positive selection in the Solenodon paradoxus woodi
genome (dN>dS). All other genes are reported in Database S6.
Solenodon gene
GO category
Plasma membrane
Plasma membrane
Plasma membrane
Generation of precursor
metabolites and energy
Table 7. Homologous matches for the most relevant protein venom classes in the Solenodon paradoxus
genome. Genes were identified by querying 6,534 toxin and venom protein representatives found in animal
venoms in Tox-Prot from UniProt [78]. All of the protein groups are present in snake venoms. The sequences
of the putative venom genes from S. paradoxus are available in Database S7.
Protein groups found in animal venoms: Number of matches in the S. paradoxus genome
Metalloproteinase; Serine protease: 8 each
Calglandulin; Nerve growth factors: 4 each
Hydrolase; Kunitz serine protease inhibitor; Nucleotidase; O-methyltransferase; Oxidase; Peptidase; Phosphodiesterase; Phospholipase; Vascular endothelial growth factor: 1 each
Figure 1. The two subspecies of Solenodon paradoxus. A) A captive Hispaniolan solenodon from the
northern subspecies (S. p. paradoxus) photographed at the Santo Domingo Zoo (photo taken by Juan C.
Martínez-Cruzado in 2014). B) A mounted specimen of the southern subspecies (S. p. woodi) photographed
at the Museo Nacional de Historia Natural Prof. Eugenio de Jesús Marcano in Santo Domingo, Dominican
Republic (photo taken by Taras K. Oleksyk in 2017).
Figure 2. Origins of the genomic DNA samples of Solenodon paradoxus from the island of Hispaniola.
Approximate locations of capture for five wild individuals of S. p. woodi: Spa-K and Spa-L from La Cañada
del Verraco, as well as Spa-M, Spa-N, and Spa-O from the El Manguito location in the Pedernales Province
in the southwest corner of the Dominican Republic bordering Haiti. In addition, the location of one S. p. paradoxus
sample (Spa-1) from the Cordillera Septentrional in the northern part of the island is shown. Exact coordinates of each
sample location are listed in Brandt et al. 2017 [12]. The dashed line indicates the position of the Cul de Sac Plain and
Neiba Valley; this region was periodically inundated by a marine canal that separated Hispaniola into north
and south paleo-islands during the Pliocene and Pleistocene [15]. The original map is in the public domain
(courtesy of NASA).
Figure 3. Heterozygosity and k-mer distribution. k-mer distributions for the S. p. woodi reads. Only one
original sample (Spa-K) distribution is shown, as a solid gray line, because the distributions were identical for each
of the individual samples. The predicted mean genome coverage was approximately 5x for each sample
(x=5). One example is plotted by a black solid line on the left. The combined uncorrected dataset, plotted
as a dashed red line, has a maximum at x=26. The combined dataset corrected with QuorUM [24],
plotted as a solid blue line, also has a maximum at x=26. A smaller local maximum on the left side of both
combined distributions, corrected and uncorrected (representing k-mers found once or very few times), is
expected from differences between overlapping reads, most likely sequencing errors. Other local
maxima (seen as a small bulge at x=5) are interpreted as heterozygous sites. These proved to have
almost no impact on the combined sample even after read correction, indicating a lack of heterozygous
sites for this solenodon subspecies. The largest local maxima (to the right) are interpreted as projected
coverage. For the combined samples, this value is x=26.
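The reading of the k-mer distribution described in this caption (skip the left-most error peak, then take the largest remaining maximum as the projected coverage) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `coverage_peak` and the toy histogram are hypothetical, the valley-finding rule is a simple heuristic, and none of it reproduces the authors' actual procedure or data.

```python
def coverage_peak(histogram):
    """Return the multiplicity of the main peak in a k-mer histogram.

    histogram maps multiplicity x (1, 2, 3, ...) to the number of
    distinct k-mers seen x times. The left-most error peak is skipped
    by walking down to the first valley before searching for the maximum.
    """
    xs = sorted(histogram)
    # Walk right while counts keep falling: this descends the error peak.
    i = 0
    while i + 1 < len(xs) and histogram[xs[i + 1]] < histogram[xs[i]]:
        i += 1
    # The main peak is the highest count at or beyond the valley.
    return max(xs[i:], key=lambda x: histogram[x])


# Toy histogram (hypothetical counts): an error peak at x=1,
# a small bulge at x=5, and a main peak at x=26.
toy_histogram = {
    1: 1000, 2: 400, 3: 150, 4: 90, 5: 110, 6: 100,
    10: 200, 20: 600, 26: 900, 30: 700, 40: 100,
}
print(coverage_peak(toy_histogram))
```

In the toy data the small bulge at x=5 is far lower than the main peak, which is the pattern the caption interprets as a near absence of heterozygous sites.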
Figure 4. Distribution of the gene prediction support by extrinsic evidence for Solenodon assemblies A (on
the left) and B (on the right). Proteins of four reference species S. araneus (SorAra 2.0, GCA_000181275.2),
Erinaceus europaeus (EriEur2.0, GCA_000296755.1), Homo sapiens (GRCh38.p7) and Mus musculus
(GRCm38.p4) were aligned to a S. paradoxus assembly with Exonerate [35] with a maximum of three best
matches per protein. Coding sequences (CDS) were cut from each, clustered and uploaded into the
AUGUSTUS software package [36] to predict genes in the soft-masked Solenodon assembly. Proteins from
the predicted genes were aligned by HMMER [37] and BLAST [38] to Pfam [39] and Swiss-Prot [40]
databases. Genes supported by matches to protein databases and “hints” (see definition in main text) were
retained; the rest were discarded. Substantially more transcripts have higher hint support in assembly B.
The annotated genes can be retrieved from Database S2. Assembly C has not been evaluated.
Figure 5. Phylogenetic relationships of Solenodon paradoxus and other mammals from whole-genome data.
A. Maximum likelihood phylogeny showing branch lengths. The tree was built using RAxML [48] with the
PROTGAMMAAUTO option and the JTT fitting model tested with 1,000 bootstrap replicates. B.
Divergence time estimates based on 461,539 four-fold degenerate sites from the codon alignments of single-
copy orthologs and using fossil-based priors (Table 5). The divergence time estimation was made by the
MCMCtree tool from the software package PAML [49] with the HKY+G model of nucleotide substitutions
and 2,200,000 generations of MCMC (of which the first 200,000 generations were discarded as burn-in). The
95% confidence intervals are given in square brackets and depicted as semitransparent boxes around the
nodes. The inferred divergence time of S. paradoxus from other mammals is 73.6 Mya (95% confidence
interval of 61.4-88.2 Mya).
Figure 6. The dN/dS ratios for 4,416 single copy orthologous genes. The dN and dS values were calculated
with the codeml module from the PAML package [49] over the entire length of each protein-coding
gene. Values are color-coded by GO term aggregated by the GO Slim generic database [68,105], and
the color code legend is presented in Figure S2. The solid black line represents dN=dS; dots above it
represent genes showing signatures of positive selection. The figure is truncated at dN=1 and dS=2, so
larger values are not shown on the graph, but all dN/dS, dN, and dS values are available in Database S6.
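The two criteria in this caption (genes lying above the dN=dS line, and genes falling within the truncated plotting area) can be expressed as a short Python sketch. The function name `classify_genes`, the gene names, and the rate values are hypothetical placeholders rather than values from Database S6, and the sketch assumes the per-gene dN and dS estimates have already been obtained (for example from codeml output).

```python
def classify_genes(rates, dn_max=1.0, ds_max=2.0):
    """Split genes by dN versus dS and by whether they fit the plot.

    rates maps a gene name to a (dN, dS) tuple.
    Returns (above_line, plotted): genes with dN > dS, and genes whose
    values fall within the truncated axes (dN <= dn_max, dS <= ds_max).
    """
    above_line = sorted(g for g, (dn, ds) in rates.items() if dn > ds)
    plotted = sorted(
        g for g, (dn, ds) in rates.items() if dn <= dn_max and ds <= ds_max
    )
    return above_line, plotted


# Hypothetical per-gene estimates, for illustration only.
toy_rates = {
    "geneA": (0.05, 0.40),  # below the line
    "geneB": (0.30, 0.10),  # above the line (dN > dS)
    "geneC": (1.50, 2.50),  # outside the plotted area
    "geneD": (0.20, 0.20),  # exactly on the line
}
above, plotted = classify_genes(toy_rates)
print(above, plotted)
```

A gene exactly on the line is not counted as above it, which matches the strict dN>dS condition used for Table 6.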
Figure 7. (A) Predicted coagulation factor X (F10) gene structure, arranged according to the structure of known
homologs (due to the scaffolding, the total gene length is unknown in solenodon). The 21-codon insertion is
highlighted in red on exon two of the solenodon F10 gene. Exons are represented as black boxes and
introns as lines connecting exons. (B) F10 protein sequence alignment showing an unusual insertion in the
Solenodon paradoxus genome that is absent in all other mammalian and reptilian genes retrieved from
Tox-Prot (UniProt) [78]. The insertion of 21 amino acids is indicated with a red-boxed line in the alignment.
(C) Reconstructed mammalian F10 maximum likelihood phylogenetic tree using the GTR+I+Γ model with 1,000
bootstrap replicates (1,590 bp-long alignment). The sets of numbers indicate the approximate likelihood-ratio
test (aLRT) for branches, the Bayesian-like modification of the aLRT, and the bootstrap percentage, respectively.
Figure 8. Low genome heterozygosity in Solenodon paradoxus woodi compared to other mammalian taxa.
The SNV rate in the S. p. woodi genome is shown relative to other mammal genomes as an estimate of
genome diversity (h). The value for each sequenced individual was estimated using all variant positions,
with repetitive regions not filtered. The SNVs are deposited in Database S9.
Taxa shown: domestic cat, giant panda, naked mole rat, African lion, Hispaniolan solenodon, Amur tiger, Tasmanian devil, African cheetah.
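A per-individual SNV rate of the kind compared in Figure 8 can be approximated as the number of heterozygous single-nucleotide genotypes divided by the number of bases considered. The Python sketch below is a simplified illustration and not the authors' pipeline: the function names, the toy records, and the `callable_bases` value are hypothetical, and it assumes VCF-style tab-separated records in which GT is the first FORMAT key and the sample of interest is in the first sample column.

```python
def heterozygous_snv_rate(vcf_lines, callable_bases):
    """Return heterozygous SNVs per base for the first sample column.

    Assumes each record has a FORMAT column whose first key is GT,
    followed by at least one sample column.
    """
    het = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]
        # Keep single-nucleotide records with one alternative allele only.
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue
        # Treat phased (|) and unphased (/) genotypes alike.
        genotype = fields[9].split(":")[0].replace("|", "/")
        alleles = genotype.split("/")
        if len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]:
            het += 1
    return het / callable_bases


def record(chrom, pos, ref, alt, sample):
    """Build one tab-separated VCF-style record (toy helper)."""
    return "\t".join([chrom, str(pos), ".", ref, alt, "50", "PASS", ".", "GT", sample])


# Toy records, for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.2",
    record("scaf1", 100, "A", "G", "0/1"),   # heterozygous SNV, counted
    record("scaf1", 200, "C", "T", "1/1"),   # homozygous, skipped
    record("scaf1", 300, "G", "GA", "0/1"),  # indel, skipped
    record("scaf2", 50, "T", "C", "1|0"),    # phased heterozygous SNV, counted
]
rate = heterozygous_snv_rate(toy_vcf, callable_bases=10_000)
print(f"{rate:.4f}")
```

Because the caption states that repetitive regions were not filtered, the denominator here is left as a plain parameter rather than a masked genome length.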
Figure 9. Demographic history inference for the southern S. p. woodi (red) and the northern S. p. paradoxus
(blue) subspecies using the pairwise sequentially Markovian coalescent (PSMC) model [86].
Figure 10. (A) Simplified version of the coagulation cascade, indicating key steps involving the coagulation
factor X (F10). (B) Protein modeling of solenodon sequence data using SWISS-MODEL. The target model
(4bxs) used was the F10-like protease of the venomous elapid snake Pseudonaja textilis. Due to its location
the insertion cannot be represented in the model (its location is indicated according to the PDB
annotation). Colors indicate model quality, with red being low quality and blue high quality modeling.
Colors also separate F10’s light chain (EGF-like domain) in red from the heavy chain (serine protease
domain) in blue (the half circle line in black separates both domains). (C) Amino acid sequence properties
calculated for the solenodon F10 translated gene, with focus on the insertion region 23-43. One signal
peptide cleavage site was detected between position 25 and 26. Predicted protein interaction sites at
position 26, 29-30 and 32-40. Hydropathy analysis showed a relatively hydrophilic structure for the
Figure 11. Numbers of variable microsatellite alleles discovered in S. paradoxus assemblies. The diagrams
were built independently for Fermi-based assemblies (A and B) and one SOAPdenovo2 based assembly (C).
The red circle indicates microsatellites that were successfully genotyped in all samples with at least one
alternative allele in the southern subspecies (S. p. woodi). The blue circle indicates microsatellites that were
successfully genotyped in all samples with at least one alternative allele in the northern subspecies (S. p.
paradoxus). The overlap indicates microsatellite loci with at least one alternative variant found in both
subspecies. All alleles discovered, the number of fixed alleles in each population, and the number of unique alleles
in each population are presented in Table S3. All the candidate microsatellite loci discovered in this study,
along with their 5’ and 3’ flanking regions, are listed in Database S8.
... It is generally accepted that there are fewer than 20 confirmed venomous mammals, which have different venom delivery systems and functions [1,19,[36][37][38][40][41][42]. The best studied is the platypus (Monotremata: Ornithorhynchus anatinus), in which only the male is venomous, using his venom for intraspecific competition against other males to gain access to females in the breeding season [43,44]. Venom is more widely distributed amongst Eulipotyphyla, including solenodons (Solenodon paradoxus and Atopogale cubana) [18,43] and some shrews (Blarina brevicaudua, Blarinella quadraticauda, Neomys anomalus, N. fodiens and Sorex araneus) [35,42,[44][45][46], which deliver their venom secreted from salivary glands via modified grooves in their incisors to capture prey and keep it alive in "food larders" to eat later. ...
... The best studied is the platypus (Monotremata: Ornithorhynchus anatinus), in which only the male is venomous, using his venom for intraspecific competition against other males to gain access to females in the breeding season [43,44]. Venom is more widely distributed amongst Eulipotyphyla, including solenodons (Solenodon paradoxus and Atopogale cubana) [18,43] and some shrews (Blarina brevicaudua, Blarinella quadraticauda, Neomys anomalus, N. fodiens and Sorex araneus) [35,42,[44][45][46], which deliver their venom secreted from salivary glands via modified grooves in their incisors to capture prey and keep it alive in "food larders" to eat later. Three genera of vampire bats (Desmodontinae: Desmodus rotundus, Diphylla ecaudata and Diaemus youngi) [47,48] secrete their venom from salivary glands, delivering it via cuts to the skin of their prey, with venom facilitating and supporting blood feeding. ...
... Another feature of the slow loris saliva system is the morphology of the ducts of the salivary glands, which contain giant granules delimited by a membrane, which are 1.5 times larger than the largest granules ever reported in animals of similar size [96]. The glands are also characterised by striated ducts containing kallikrein, in an abundance similar to that seen in solenodons and shrew saliva where it is considered a toxin [18,42,43]. This suite of traits suggest that the saliva is specialised for secretion rather than transport. ...
Since the early 2000s, studies of the evolution of venom within animals have rapidly expanded, offering new revelations on the origins and development of venom within various species. The venomous mammals represent excellent opportunities to study venom evolution due to the varying functional usages, the unusual distribution of venom across unrelated mammals and the diverse variety of delivery systems. A group of mammals that excellently represents a combination of these traits are the slow (Nycticebus spp.) and pygmy lorises (Xanthonycticebus spp.) of south-east Asia, which possess the only confirmed two-step venom system. These taxa also present one of the most intriguing mixes of toxic symptoms (cytotoxicity and immunotoxicity) and functional usages (intraspecific competition and ectoparasitic defence) seen in extant animals. We still lack many pieces of the puzzle in understanding how this venom system works, why it evolved, what is involved in the venom system and what triggers the toxic components to work. Here, we review available data building upon a decade of research on this topic, focusing especially on why and how this venom system may have evolved. We discuss that research now suggests that venom in slow lorises has a sophisticated set of multiple uses in both intraspecific competition and the potential to disrupt the immune system of targets; we suggest that an exudate diet reveals several toxic plants consumed by slow and pygmy lorises that could be sequestered into their venom and which may help heal venomous bite wounds; we provide the most up-to-date visual model of the brachial gland exudate secretion protein (BGEsp); and we discuss research on a complement component 1r (C1R) protein in saliva that may solve the mystery of what activates the toxicity of slow and pygmy loris venom. We conclude that the slow and pygmy lorises possess one of the most complex venom systems in extant animals, and while we still have a lot more to understand about their venom system, we are close to a breakthrough, particularly with current technological advances.
... One of these exceptions was the Eulipotyphla (hedgehogs, shrews, moles, and solenodons), where the initial diversification of its crown clade was inferred to have begun around 77.3 Ma in the Late Cretaceous (95% Credible Interval [CI] did not overlap with the age of the K-Pg; Meredith et al., 2011). Other studies often have supported the Late Cretaceous origin of the Eulipotyphla (Roca et al., 2004; Brace et al., 2016; Brandt et al., 2017; Springer et al., 2018; Grigorev et al., 2018). However, a problem with the view of a Late Cretaceous divergence is how the survival of lineages could be explained beyond the impact of the Chicxulub bolide at the K-Pg because it has been suggested that the impact caused catastrophic global environmental change and led to mass extinction of species and ecological disruption (Schulte et al., 2010). ...
... Within the Eulipotyphla, evolutionary rates in the Erinaceidae and Soricidae were inferred to be higher than in other mammalian lineages (Wu et al., 2017). Phylogenetic studies of various scales have also suggested that both lineages seem to have relatively longer branches in phylogenies (Brace et al., 2016; Sato et al., 2016; Esselstyn et al., 2017; Grigorev et al., 2018). It is therefore essential when dating phylogenetic relationships to accommodate the difference in evolutionary rates among lineages (Tamura et al., 2012; Ho, 2014). ...
... All the under- and overestimates observed in Table 4 can be explained by the evolutionary rate difference among the lineages of the Carnivora (slower rate), Eulipotyphla (intermediate rate in this study), and Rodentia (higher rate), as demonstrated in Nabholz et al. (2008), although their study was based only on mitochondrial DNA. We realised that the late Cretaceous origin of the Eulipotyphla, as suggested in most previous studies (Roca et al., 2004; Meredith et al., 2011; Brace et al., 2016; Brandt et al., 2017; Springer et al., 2018; Grigorev et al., 2018), was consistent with the divergence time estimated with incorrect evolutionary rates. We assume that both calibrations above produce true and incorrect estimates, depending upon the lineage in question. ...
The origin of the mammalian order Eulipotyphla has been debated intensively with arguments around whether they began diversifying before or after the Cretaceous-Palaeogene (K-Pg) boundary at 66 Ma. Here, we used an in-solution nucleotide capture method and next generation DNA sequencing to determine the sequence of hundreds of ultra-conserved elements (UCEs), and conducted phylogenomic and molecular dating analyses for the four extant eulipotyphlan lineages: Erinaceidae, Solenodontidae, Soricidae, and Talpidae. Concatenated maximum-likelihood analyses with single or partitioned models and a coalescent species-tree analysis showed that divergences among the four major eulipotyphlan lineages occurred within a short period of evolutionary time, but did not resolve the interrelationships among them. Alternative suboptimal phylogenetic hypotheses received consistently the same amount of support from different UCE loci, and were not significantly different from the maximum likelihood tree topology, suggesting the prevalence of stochastic lineage sorting. Molecular dating analyses that incorporated among-lineage evolutionary rate differences supported a scenario where the four eulipotyphlan families diversified between 57.8 and 63.2 Ma. Given short branch lengths with low support values, traces of rampant genome-wide stochastic lineage sorting, and post-K-Pg diversification, we concluded that the crown eulipotyphlan lineages arose through a rapid diversification after the K-Pg boundary when novel niches were created by the mass extinction of species.
... Why this did not happen earlier is not clear, but it could have been caused by parrots' inability to fly over long distances without feeding, inhospitable landscapes of the emerging land, as well as restricted ecological opportunities when encountering resident Central American fauna [96,101]. While many terrestrial species in the Caribbean are of vicariant origin [91,102,103] or dispersed along the paths of the prevailing water currents [104] and hurricanes [105], birds can disperse between islands where the distance permits a direct flight. This was possibly facilitated by the low sea levels at the time, opening up opportunities: the birds spread across Central America relatively quickly, and the first round of island colonization in the Greater Antilles probably occurred after 3.47 MYA (Figure 2, Supplementary Table S7 node 4). ...
Amazon parrots (Amazona spp.) colonized the islands of the Greater Antilles from the Central American mainland, but there has not been a consensus as to how and when this happened. Today, most of the five remaining island species are listed as endangered, threatened, or vulnerable as a consequence of human activity. We sequenced and annotated full mitochondrial genomes of all the extant Amazon parrot species from the Greater Antilles (A. leucocephala (Cuba), A. agilis, A. collaria (both from Jamaica), A. ventralis (Hispaniola), and A. vittata (Puerto Rico)), A. albifrons from mainland Central America, and A. rhodocorytha from the Atlantic Forest in Brazil. The assembled and annotated mitogenome maps provide information on sequence organization, variation, population diversity, and evolutionary history for the Caribbean species including the critically endangered A. vittata. Despite the larger number of available samples from the Puerto Rican Parrot Recovery Program, the sequence diversity of the A. vittata population in Puerto Rico was the lowest among all parrot species analyzed. Our data support the stepping-stone dispersal and speciation hypothesis, with the process starting approximately 3.47 MYA, when the ancestral population arrived from mainland Central America, leading to diversification across the Greater Antilles and ultimately reaching the island of Puerto Rico 0.67 MYA. The results are presented and discussed in light of the geological history of the Caribbean and in the context of recent parrot evolution, island biogeography, and conservation. This analysis contributes to understanding evolutionary history, empowers subsequent assessments of sequence variation, and helps design future conservation efforts in the Caribbean.
... (2) elucidate the possible detrimental effect that unmanaged captive facilities could have on wild iguana populations; and (3) shed light on the taxonomic dispute of Cyclura iguanas on Mona Island. The collision of the north and south paleo-islands, resulting in the creation of Hispaniola, in the mid-Miocene, as well as subsequent sea inundations submerging the Enriquillo Basin until the mid-Pleistocene have impacted the genetic structure of many species (Glor et al. 2004; Townsend et al. 2007; Gifford and Larson 2008; Sly et al. 2011; Brace et al. 2012; Grigorev et al. 2018) (Fig. 1). A single C. cornuta mtDNA haplotype was found in the region corresponding to the southern paleo-island, or the area south of the Enriquillo Basin (haplotype D; Figs. 2, 4). ...
Hispaniola is the second largest island in the Caribbean and harbors extensive biodiversity. The geologic history and resulting complex topography of the island have led to significant differentiation across various taxonomic groups. Hispaniola is the only Caribbean island with two species of Rock Iguanas, genus Cyclura. Rhinoceros Rock Iguanas (C. cornuta) are wide-ranging across Hispaniola, occurring in isolated pockets, primarily in low elevation xeric areas. To better understand the population structure of this species, we used a combination of mtDNA and nuclear markers to elucidate the genetic variation of wild populations across 13 sampling regions in the Dominican Republic (DR), as well as neighboring Mona Island, home to a Cyclura population of uncertain taxonomic status. Further, we evaluate the origin of iguanas in captive facilities throughout the DR. Our data reveal a high degree of genetic diversity across wild populations within the DR and shed light on the taxonomic status of the Mona Island population. Further, novel genetic diversity is found in captive facilities, most likely resulting from interbreeding between individuals from genetically distinct populations within the captive facilities. Our results suggest that the captive facilities may pose a threat to wild populations and increased regulation of these facilities is needed.
... Recent advances in the amplification of DNA from degraded material opened several windows into the origin of Caribbean mammalian diversity. Insight into the exceptionally rare, giant (~1 kg) insectivore Solenodon using museum collections and degraded modern tissue samples (Grigorev et al. 2018) strongly suggests a Mesozoic, North American origin for the family. Ancient DNA from a ~750-year-old subfossil of Nesophontes represented the first paleogenetic material from the Greater Antilles and was able to confirm longstanding morphological hypotheses uniting Nesophontidae with Solenodontidae as a single lineage, the Solenodonota, having diverged 70 Ma from all other living true insectivores (Eulipotyphla: shrews, hedgehogs, moles) (Roca et al. 2004; Brace et al. 2016). ...
Ancient biomolecule analyses are proving increasingly useful in the study of evolutionary patterns, including extinct organisms. Proteomic sequencing techniques complement genomic approaches, having the potential to examine lineages further back in time than achievable using ancient DNA, given the less stringent preservation requirements. In this study, we demonstrate the ability to use collagen sequence analyses via proteomics to provide species delimitation as a foundation for informing evolutionary patterns. We uncover biogeographic information of an enigmatic and recently extinct lineage of Nesophontes across their range on the Caribbean islands. First, evolutionary relationships reconstructed from collagen sequences reaffirm the affinity of Nesophontes and Solenodon as sister taxa within Solenodonota. This relationship helps lay the foundation for testing geographical isolation hypotheses across islands within the Greater Antilles, including movement from Cuba towards Hispaniola. Second, our results are consistent with Cuba having just two species of Nesophontes (N. micrus and N. major) that exhibit intrapopulation morphological variation. Finally, analysis of the recently described species from the Cayman Islands (N. hemicingulus) indicates that it is a closer relative to the Cuban species N. major, rather than N. micrus as previously speculated. Our proteomic sequencing improves our understanding of the origin, evolution, and distribution of this extinct mammal lineage, particularly with respect to approximate timing of speciation. Such knowledge is vital for this biodiversity hotspot, where the magnitude of recent extinctions may obscure true estimates of species richness in the past.
... Initial analysis via shotgun experiments revealed solenodon venom is primarily composed of proteins that exhibit high-scoring annotations to kallikrein-1-like serine proteases (KLK1-like; 7 of 17 total venom proteins identified), although various other protein types were also detected (Fig. 2 B and C and SI Appendix, Table S2). None of the venom proteins directly identified here show similarity to those recently predicted by other researchers, who used genomic data alone to predict venom toxin identity based on sequence similarity to previously described, yet distinct, animal venom toxins (23). These findings highlight the importance of direct sampling (e.g., gene expression or protein) to robustly characterize proteins associated with venom secretions (25). ...
Venom systems are key adaptations that have evolved throughout the tree of life and typically facilitate predation or defense. Despite venoms being model systems for studying a variety of evolutionary and physiological processes, many taxonomic groups remain understudied, including venomous mammals. Within the order Eulipotyphla, multiple shrew species and solenodons have oral venom systems. Despite morphological variation of their delivery systems, it remains unclear whether venom represents the ancestral state in this group or is the result of multiple independent origins. We investigated the origin and evolution of venom in eulipotyphlans by characterizing the venom system of the endangered Hispaniolan solenodon (Solenodon paradoxus). We constructed a genome to underpin proteomic identifications of solenodon venom toxins, before undertaking evolutionary analyses of those constituents, and functional assessments of the secreted venom. Our findings show that solenodon venom consists of multiple paralogous kallikrein 1 (KLK1) serine proteases, which cause hypotensive effects in vivo, and seem likely to have evolved to facilitate vertebrate prey capture. Comparative analyses provide convincing evidence that the oral venom systems of solenodons and shrews have evolved convergently, with the 4 independent origins of venom in eulipotyphlans outnumbering all other venom origins in mammals. We find that KLK1s have been independently coopted into the venom of shrews and solenodons following their divergence during the late Cretaceous, suggesting that evolutionary constraints may be acting on these genes. Consequently, our findings represent a striking example of convergent molecular evolution and demonstrate that distinct structural backgrounds can yield equivalent functions.
... Variation related to geographic differences is possible given the geologic history of Hispaniola and the Greater Antilles, which previously divided the island into north and south regions until the Pleistocene (Mann et al. 1991; Iturralde-Vinent and MacPhee 1999; Graham 2003). Such a division into the southern paleo-island versus northern paleo-island (Figure 1: dashed line for Neiba Valley) could have produced geographic barriers impeding the populations from interbreeding through allopatry, a pattern that has been noted for a variety of other taxa (Brace et al. 2012; Gifford et al. 2004; Sly et al. 2010; Matos-Maraví et al. 2014; Turvey et al. 2016; Lim et al. 2017; Grigorev et al. 2018). Clinal differences are possible to evaluate as elevation data are available for nearly all of the cave localities, and the elevation differences run from lowland coastal to mountainous (Appendix A). ...
Megalonychid sloth fossils have been found on numerous islands across the Antilles, with most associated with the Greater Antilles. New specimens from the Dominican Republic permit exploration of intraspecific variation ranges of the recognised taxa (Acratocnus, Parocnus, Megalocnus, Neocnus) of Hispaniola. Using length of upper limb bones as a proxy for body mass, sexual dimorphism and geographic variation were explored, along with additional assessments about changes to the known taxonomy resulting from intraspecific variations. Sexual dimorphism is considered to occur, at differing levels, for all Hispaniola sloths, supported by the presence of large and short morphs within individual localities. Geographically, Acratocnus exhibits differences between northern and southern paleo-island sites, suggesting the existence of two subspecies or an unrecognised new, northern paleo-island species. Megalocnus zile is considered a junior synonym of Parocnus serus, as it falls within morphological variation ranges for that taxon. Parocnus specimens from Parque Nacional del Este are ~15% smaller than all other sites, regardless of paleo-island association, and may represent a new species. Results for Neocnus demonstrate the necessity for a reassessment of the current taxonomy and recognised characters. Radiometric dates are needed for more localities in order to rule out patterns resulting from temporal variation.
... We found that RTE1_Sar was probably horizontally transferred between an unsampled parasitic nematode and S. araneus after the split of the lineages leading to S. araneus and the hedgehog (Erinaceus europaeus, ca. 60 million years ago (mya) [15]). ...
Background: As the genomes of more metazoan species are sequenced, reports of horizontal transposon transfers (HTT) have increased. Our understanding of the mechanisms of such events is at an early stage. The close physical relationship between a parasite and its host could facilitate horizontal transfer. To date, two studies have identified horizontal transfer of RTEs, a class of retrotransposable elements, involving parasites: ticks might act as a vector for BovB between ruminants and squamates, and AviRTE was transferred between birds and parasitic nematodes. Results: We searched for RTEs shared between nematode and mammalian genomes. Given their physical proximity, it was necessary to detect and remove sequence contamination from the genome datasets, which would otherwise distort the signal of horizontal transfer. We developed an approach that is based on reads instead of genomic sequences to reliably detect contamination. From comparison of 43 RTEs across 197 genomes, we identified a single putative case of horizontal transfer: we detected RTE1_Sar from Sorex araneus, the common shrew, in parasitic nematodes. From the taxonomic distribution and evolutionary analysis, we show that RTE1_Sar was horizontally transferred. Conclusion: We identified a new horizontal RTE transfer in host-parasite interactions, which suggests that it is not uncommon. Further, we present and provide the workflow of a read-based method to distinguish between contamination and horizontal transfer.
This narrative is a personal view of adventures in genetic science and society that have blessed my life and career across five decades. The advances I enjoyed and the lessons I learned derive from educational training, substantial collaboration, and growing up in the genomics age. I parse the stories into six research disciplines that my students, fellows, and colleagues have entered and in which, in some cases, they have made an important difference. The first is comparative genetics, where evolutionary inference is applied to genome organization, from building gene maps in the 1970s to building whole genome sequences today. The second area tracks the progression of molecular evolutionary advances and applications to resolve the hierarchical relationship among living species in the silence of prehistory. The third endeavor outlines the birth and maturation of genetic studies and application to species conservation. The fourth theme discusses how emerging viruses studied in a genomic sense opened our eyes to host–pathogen interaction and interdependence. The fifth research emphasis outlines the population genetic–based search and discovery of human restriction genes that influence the epidemiological outcome of abrupt outbreaks, notably HIV–AIDS and several cancers. Finally, the last arena explored illustrates how genetic individualization in human and animals has improved forensic evidence in capital crimes. Each discipline has intuitive and technological overlaps, and each has benefitted from the contribution of genetic and genomic principles I learned so long ago from Drosophila. The journey continues. Expected final online publication date for the Annual Review of Animal Biosciences, Volume 8 is February 15, 2020.
DAMBE is a comprehensive software workbench for data analysis in molecular biology, phylogenetics and evolution. Several important new functions have been added since version 5 of DAMBE: 1) comprehensive genomic profiling of translation initiation efficiency of different genes in different prokaryotic species, 2) a new index of translation elongation (ITE) that takes into account both tRNA-mediated selection and background mutation on codon-anticodon adaptation, 3) a new and accurate phylogenetic approach based on pairwise alignment only, which is useful for highly divergent sequences from which a reliable multiple sequence alignment is difficult to obtain. Many other functions have been updated and improved including PWM for motif characterization, Gibbs sampler for de novo motif discovery, hidden Markov models for protein secondary structure prediction, self-organizing map for non-linear clustering of transcriptomic data, comprehensive sequence alignment and phylogenetic functions. DAMBE features a graphic, user-friendly and intuitive interface, and is freely available from
The PANTHER database (Protein ANalysis THrough Evolutionary Relationships) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution.
The Cuban solenodon (Solenodon cubanus) is one of the most enigmatic mammals and is an extremely rare species with a distribution limited to a small part of the island of Cuba. Despite its rarity, in 2012 seven individuals of S. cubanus were captured and sampled successfully for DNA analysis, providing new insights into the evolutionary origin of this species and into the origins of the Caribbean fauna, which remain controversial. We conducted molecular phylogenetic analyses of five nuclear genes (Apob, Atp7a, Bdnf, Brca1 and Rag1; total, 4,602 bp) from 35 species of the mammalian order Eulipotyphla. Based on Bayesian relaxed molecular clock analyses, the family Solenodontidae diverged from other eulipotyphlans in the Paleocene, after the bolide impact on the Yucatan Peninsula, and S. cubanus diverged from the Hispaniolan solenodon (S. paradoxus) in the Early Pliocene. The strikingly recent divergence time estimates suggest that S. cubanus and its ancestral lineage originated via over-water dispersal rather than vicariance events, as had previously been hypothesised.
The distinction between deleterious, neutral, and adaptive mutations is a fundamental problem in the study of molecular evolution. Two significant quantities are the fraction of DNA variation in natural populations that is deleterious and destined to be eliminated and the fraction of fixed differences between species driven by positive Darwinian selection. We estimate these quantities using the large number of human genes for which there are polymorphism and divergence data. The fraction of amino acid mutations that is neutral is estimated to be 0.20 from the ratio of common amino acid (A) to synonymous (S) single nucleotide polymorphisms (SNPs) at frequencies of ≥ 15%. Among the 80% of amino acid mutations that are deleterious, at least 20% are only slightly deleterious and often attain frequencies of 1–10%. We estimate that these slightly deleterious mutations comprise at least 3% of amino acid SNPs in the average individual or at least 300 per diploid genome. This estimate is not sensitive to human population history. The A/S ratio of fixed differences is greater than that of common SNPs and suggests that a large fraction of protein divergence is adaptive and driven by positive Darwinian selection.
The mammalian order Eulipotyphla includes four extant families of insectivorans: Solenodontidae (solenodons); Talpidae (moles); Soricidae (shrews); and Erinaceidae (hedgehogs). Of these, Solenodontidae includes only two extant species, which are endemic to the largest islands of the Greater Antilles: Cuba and Hispaniola. Most molecular studies suggest that eulipotyphlan families diverged from each other across several million years, with the basal split between Solenodontidae and other families occurring in the Late Cretaceous. By contrast, Sato et al. (2016) suggest that eulipotyphlan families diverged from each other in a polytomy ∼58.6 million years ago (Mya). This more recent divergence estimate for Solenodontidae versus other extant eulipotyphlans suggests that solenodons must have arrived in the Greater Antilles via overwater dispersal rather than vicariance. Here, we show that the young timetree estimates for eulipotyphlan families and the polytomy are due to an inverted ingroup-outgroup arrangement of the tree, the result of using Tracer rather than TreeAnnotator to compile interfamilial divergence times, and of not enforcing the monophyly of well-established clades such as Laurasiatheria and Eulipotyphla. Finally, Sato et al.'s (2016) timetree includes several zombie lineages where estimated divergence times are much younger than minimum ages that are implied by the fossil record. We reanalyzed Sato et al.'s (2016) original data with enforced monophyly for well-established clades and updated fossil calibrations that eliminate the inference of zombie lineages. Our resulting timetrees, which were compiled with TreeAnnotator rather than Tracer, produce dates that are in good agreement with other recent studies and place the basal split between Solenodontidae and other eulipotyphlans in the Late Cretaceous.
Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, it has proven problematic to genotype STRs from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping and phasing STRs from Illumina sequencing data, and we report a genome-wide analysis and validation of de novo STR mutations. HipSTR is freely available at
We describe a program, tRNAscan-SE, which identifies 99-100% of transfer RNA genes in DNA sequence while giving less than one false positive per 15 gigabases. Two previously described tRNA detection programs are used as fast, first-pass prefilters to identify candidate tRNAs, which are then analyzed by a highly selective tRNA covariance model. This work represents a practical application of RNA covariance models, which are general, probabilistic secondary structure profiles based on stochastic context-free grammars. tRNAscan-SE searches at approximately 30 000 bp/s. Additional extensions to tRNAscan-SE detect unusual tRNA homologues such as selenocysteine tRNAs, tRNA-derived repetitive elements and tRNA pseudogenes.