
Arian Smit, PhD
Institute for Systems Biology
156
Publications
79,702
Reads
75,346
Citations
Publications (156)
Vocal production learning is a convergently evolved trait in vertebrates. To identify brain genomic elements associated with mammalian vocal learning, we integrated genomic, anatomical and neurophysiological data from the Egyptian fruit-bat with analyses of the genomes of 215 placental mammals. First, we identified a set of proteins evolving more s...
Current knowledge of cancer genomics remains biased against noncoding mutations. To systematically search for regulatory noncoding mutations, we assessed mutations in conserved positions in the genome under the assumption that these are more likely to be functional than mutations in positions with low conservation. To this end, we use whole-genome...
Protein-coding differences between species often fail to explain phenotypic diversity, suggesting the involvement of genomic elements that regulate gene expression such as enhancers. Identifying associations between enhancers and phenotypes is challenging because enhancer activity can be tissue-dependent and functionally conserved despite low seque...
Annotating coding genes and inferring orthologs are two classical challenges in genomics and evolutionary biology that have traditionally been approached separately, limiting scalability. We present TOGA (Tool to infer Orthologs from Genome Alignments), a method that integrates structural gene annotation and orthology inference. TOGA implements a d...
Human accelerated regions (HARs) are conserved genomic loci that evolved at an accelerated rate in the human lineage and may underlie human-specific traits. We generated HARs and chimpanzee accelerated regions with an automated pipeline and an alignment of 241 mammalian genomes. Combining deep learning with chromatin capture experiments in human an...
Thousands of genomic regions have been associated with heritable human diseases, but attempts to elucidate biological mechanisms are impeded by an inability to discern which genomic positions are functionally important. Evolutionary constraint is a powerful predictor of function, agnostic to cell type or disease mechanism. Single-base phyloP scores...
The precise pattern and timing of speciation events that gave rise to all living placental mammals remain controversial. We provide a comprehensive phylogenetic analysis of genetic variation across an alignment of 241 placental mammal genome assemblies, addressing prior concerns regarding limited genomic sampling across species. We compared neutral...
We examined transposable element (TE) content of 248 placental mammal genome assemblies, the largest de novo TE curation effort in eukaryotes to date. We found that although mammals resemble one another in total TE content and diversity, they show substantial differences with regard to recent TE accumulation. This includes multiple recent expansion...
Conserved genomic sequences disrupted in humans may underlie uniquely human phenotypic traits. We identified and characterized 10,032 human-specific conserved deletions (hCONDELs). These short (average 2.56 base pairs) deletions are enriched for human brain functions across genetic, epigenomic, and transcriptomic datasets. Using massively parallel...
We reconstruct the phenotype of Balto, the heroic sled dog renowned for transporting diphtheria antitoxin to Nome, Alaska, in 1925, using evolutionary constraint estimates from the Zoonomia alignment of 240 mammals and 682 genomes from dogs and wolves of the 21st century. Balto shares just part of his diverse ancestry with the eponymous Siberian hu...
Species persistence can be influenced by the amount, type, and distribution of diversity across the genome, suggesting a potential relationship between historical demography and resilience. In this study, we surveyed genetic variation across single genomes of 240 mammals that compose the Zoonomia alignment to evaluate how historical effective popul...
Zoonomia is the largest comparative genomics resource for mammals produced to date. By aligning genomes for 240 species, we identify bases that, when mutated, are likely to affect fitness and alter disease risk. At least 332 million bases (~10.7%) in the human genome are unusually conserved across species (evolutionarily constrained) relative to ne...
Understanding the regulatory landscape of the human genome is a long-standing objective of modern biology. Using the reference-free alignment across 241 mammalian genomes produced by the Zoonomia Consortium, we charted evolutionary trajectories for 0.92 million human candidate cis-regulatory elements (cCREs) and 15.6 million human transcription fac...
Horizontal transfer of transposable elements is an important mechanism contributing to genetic diversity and innovation. Bats (order Chiroptera) have repeatedly been shown to experience horizontal transfer of transposable elements at what appears to be a high rate compared to other mammals. We investigated the occurrence of horizontally transferred...
Evolutionary constraint and acceleration are powerful, cell-type agnostic measures of functional importance. Previous studies in mammals were limited by species number and reliance on human-referenced alignments. We explore the evolution of placental mammals, including humans, through reference-free whole-genome alignment of 240 species and protein...
We examined transposable element (TE) content of 248 placental mammal genome assemblies, the largest de novo TE curation effort in eukaryotes to date. We find that while mammals resemble one another in total TE content and diversity, they show substantial differences with regard to recent TE accumulation. This includes multiple recent expansion and...
The discovery and characterization of transposable element (TE) families are crucial tasks in the process of genome annotation. Careful curation of TE libraries for each organism is necessary as each has been exposed to a unique and often complex set of TE families. De novo methods have been developed; however, a fully automated and accurate approa...
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that incl...
Mobile elements and repetitive genomic regions are sources of lineage-specific genomic innovation and uniquely fingerprint individual genomes. Comprehensive analyses of such repeat elements, including those found in more complex regions of the genome, require a complete, linear genome assembly. We present a de novo repeat discovery and annotation o...
The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families, where structural features and strong signals of selection may assist with alignment. Less...
Background: The rice weevil Sitophilus oryzae is one of the most important agricultural pests, causing extensive damage to cereal in fields and to stored grains. S. oryzae has an intracellular symbiotic relationship (endosymbiosis) with the Gram-negative bacterium Sodalis pierantonius and is a valuable model to decipher host-symbiont molecular inter...
The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families where structural features and strong signals of selection may assist with alignment. Less...
Mobile elements and highly repetitive genomic regions are potent sources of lineage-specific genomic innovation and fingerprint individual genomes. Comprehensive analyses of large, composite or arrayed repeat elements and those found in more complex regions of the genome require a complete, linear genome assembly. Here we present the first de novo...
Transposable elements (TEs) have the ability to alter individual genomic landscapes and shape the course of evolution for species in which they reside. Such profound changes can be understood by studying the biology of the organism and the interplay of the TEs it hosts. Characterizing and curating TEs across a wide range of species is a fundamental...
In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left un...
Background: Among beetles, the rice weevil Sitophilus oryzae is one of the most important pests causing extensive damage to cereal in fields and to stored grains. S. oryzae has an intracellular symbiotic relationship (endosymbiosis) with the Gram-negative bacterium Sodalis pierantonius and is a valuable model to decipher host-symbiont molecular inte...
Annotation of a biological sequence is usually performed by aligning that sequence to a database of known sequence elements. When that database contains elements that are highly similar to each other, the proper annotation may be ambiguous, because several entries in the database produce high-scoring alignments. Typical annotation methods work by a...
Dfam is an open access database of repetitive DNA families, sequence models, and genome annotations. The 3.0–3.3 releases of Dfam ( https://dfam.org ) represent an evolution from a proof-of-principle collection of transposable element families in model organisms into a community resource for a broad range of species, and for both curated and uncura...
The Zoonomia Project is investigating the genomics of shared and specialized traits in eutherian mammals. Here we provide genome assemblies for 131 species, of which all but 9 are previously uncharacterized, and describe a whole-genome alignment of 240 species of considerable phylogenetic diversity, comprising representatives from more than 80% of...
The 3.0-3.2 releases of Dfam (https://dfam.org) represent an evolution from a proof-of-principle collection of transposable element families in model organisms into a community resource for a broad range of species and for both curated and uncurated datasets. In addition, releases since Dfam 3.0 provide auxiliary consensus sequence models, transpos...
The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critica...
We identified a novel repeat family, termed Platy-1, in the Callithrix jacchus (common marmoset) genome that arose around the time of the divergence of platyrrhines and catarrhines, and established itself as a repeat family in New World monkeys (NWMs). A full-length Platy-1 element is ~100 bp in length, making it the shortest known SINE in primates...
Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Dat...
DNA derived from transposable elements (TEs) constitutes large parts of the genomes of complex eukaryotes, with major impacts not only on genomic research but also on how organisms evolve and function. Although a variety of methods and tools have been developed to detect and annotate TEs, there are as yet no standard benchmarks—that is, no standard...
For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its incepti...
To provide context for the diversification of archosaurs--the group that includes crocodilians, dinosaurs, and birds--we generated draft genomes of three crocodilians: Alligator mississippiensis (the American alligator), Crocodylus porosus (the saltwater crocodile), and Gavialis gangeticus (the Indian gharial). We observed an exceptionally slow rat...
To provide context for the diversification of archosaurs—the group that includes crocodilians, dinosaurs, and birds—we generated draft genomes of three crocodilians: Alligator mississippiensis (the American alligator), Crocodylus porosus (the saltwater crocodile), and Gavialis gangeticus (the Indian gharial). We observed an exceptionally slow rate...
Background: Crocodilians are thought to be hosts to a diverse and divergent complement of endogenous retroviruses (ERVs) but a comprehensive investigation is yet to be performed. The recent sequencing of three crocodilian genomes provides an opportunity for a more detailed and accurate representation of the ERV diversity that is present in these spe...
Chicken repeat 1 (CR1) retroposons are Long INterspersed Elements (LINEs) that are ubiquitous within amniote genomes and constitute the most abundant family of transposed elements in birds, crocodilians, turtles, and snakes. They are also present in mammalian genomes, where they reside as numerous relics of ancient retroposition events. Yet, despit...
CGGBP1 (CGG triplet repeat-binding protein 1) regulates cell proliferation, stress response, cytokinesis, telomeric integrity and transcription. It could affect these processes by modulating target gene expression under different conditions. Identification of CGGBP1-target genes and their regulation could reveal how a transcription regulat...
Gibbons are small arboreal apes that display an accelerated rate of evolutionary chromosomal rearrangement and occupy a key node in the primate phylogeny between Old World monkeys and great apes. Here we present the assembly and analysis of a northern white-cheeked gibbon (Nomascus leucogenys) genome. We describe the propensity for a gibbon-specifi...
We report the whole-genome sequence of the common marmoset (Callithrix jacchus). The 2.26-Gb genome of a female marmoset was assembled using Sanger read data (6x) and a whole-genome shotgun strategy. A first analysis has permitted comparison with the genomes of apes and Old World monkeys and the identification of specific features that might contri...
A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences are generally taken from regions of the genome presu...
Genomic information is encoded on a wide range of distance scales, ranging from tens of bases to megabases. We developed a multiscale framework to analyze and visualize the information content of genomic signals. Different types of signals, such as G+C content or DNA methylation, are characterized by distinct patterns of signal enrichment or deplet...
We present a database of repetitive DNA elements, called Dfam (http://dfam.janelia.org). Many genomes contain a large fraction of repetitive DNA, much of which is made up of remnants of transposable elements (TEs). Accurate annotation of TEs enables research into their biology and can shed light on the evolutionary processes that shape genomes. Ide...
Sequence analysis of the orangutan genome revealed that recent proliferative activity of Alu elements has been uncharacteristically quiescent in the Pongo (orangutan) lineage, compared with all previously studied primate genomes. With relatively few young polymorphic insertions, the genomic landscape of the orangutan seemed like the ideal place to...
This file contains the TranspoGene database [42] output for the DGBK gene.
This file provides the Alu subfamily consensus sequences for the three youngest orangutan-specific subfamilies.
Table S2. This file provides FASTA output for three of the potential secondary source elements in the orangutan from the [ponAbe2] genome assembly and Additional file 2: Table S2.
Figure S1 and Table S1. This file contains a sequence alignment report of the Chr7 Alu insertion with flanking sequence for multiple orangutan individuals and other primates obtained by Sanger sequencing of PCR amplicons and a list of the DNA samples used to construct the alignment.
Table S2. This file contains a series of tables listing the PCR primers and conditions, genotypes and allele frequencies, and DNA samples used in the study.
Assignment of alleles to haplotypes for nearly all the variants on all chromosomes can be performed by genetic analysis of a nuclear family with three or more children. Whole-genome sequence data enable deterministic phasing of nearly all sequenced alleles by permitting assignment of recombinations to precise chromosomal positions and specific meio...
'Orang-utan' is derived from a Malay term meaning 'man of the forest' and aptly describes the southeast Asian great apes native to Sumatra and Borneo. The orang-utan species, Pongo abelii (Sumatran) and Pongo pygmaeus (Bornean), are the most phylogenetically distant great apes from humans, thereby providing an informative perspective on hominid evo...
The zebra finch is an important model organism in several fields with unique relevance to human neuroscience. Like other songbirds, the zebra finch communicates through learned vocalizations, an ability otherwise documented only in humans and a few other animals and lacking in the chicken, the only bird with a sequenced genome until now. Here we pre...
We analyzed the whole-genome sequences of a family of four, consisting of two siblings and their parents. Family-based sequencing allowed us to delineate recombination sites precisely, identify 70% of the sequencing errors (resulting in > 99.999% accuracy), and identify very rare single-nucleotide polymorphisms. We also directly estimated a human i...
We present a draft genome sequence of the platypus, Ornithorhynchus anatinus. This monotreme exhibits a fascinating combination of reptilian and mammalian characters. For example, platypuses have a coat of fur adapted to an aquatic lifestyle; platypus females lactate, yet lay eggs; and males are equipped with venom similar to that of reptiles. Anal...
Segmental duplications (SDs) play an important role in genome rearrangement, evolution, and the copy-number variation (CNV) of primate genomes. Such sequences are difficult to detect, a priori, because they share no defining sequence features that distinguish them from unique portions of the genome. Current sequence annotation of segmental duplicat...
The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an...
The completion of the draft sequence of the rhesus macaque genome allowed us to study the genomic composition and evolution of transposable elements in this representative of the Old World monkey lineage, a group of diverse primates closely related to humans. The L1 family of long interspersed elements appears to have evolved as a single lineage, a...
Recent comparative analyses of mammalian sequences have revealed that a large number of nonprotein-coding genomic regions are under strong selective constraint. Here, we report that some of these loci have been derived from a newly defined family of ancient SINEs (short interspersed repetitive elements). This is a surprising result, as SINEs and ot...
The identification and characterization of the complete ensemble of genes is a main goal of deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful...
We investigated the evolution of the families of LINE-1 (L1) retrotransposons that have amplified in the human lineage since the origin of primates. We identified two phases in the evolution of L1. From approximately 70 million years ago (Mya) until approximately 40 Mya, three distinct L1 lineages were simultaneously active in the genome of ancestr...
PASTA: PASs
(A) A statistical analysis of PASs in transcribed nonrepetitive sequence revealed significant orientation biases, after correcting for nucleotide composition skews. In this schematic, PASs are represented by octagons, with color indicating signal strength (darker represents stronger signals). Within repeats, we found biases in PAS streng...
Repeat Content of Known Genes and Novel FEAST Predictions
There is a large set of annotated known genes (blue) composed entirely of repetitive sequence. The novel FEAST predictions (Z > 3, red) with greater than 90% repeats are mostly satellite-rich pericentromeric regions and most probably represent false positives.
Expanded MDS Analysis
We include here the genomewide annotation of known genes (KG), Ensembl genes (ENS), Twinscan (TW), GenScan (GS), Softberry genes (SB), EC genes (EC), GeneID (GID), RNAs (RNA), Mammalian Gene Collection (MGC), pseudogenes (PS), Exoniphy exons (EX), Exoniphy exons bridged when in the same orientation and within 25 kb of each oth...
ROAST: Biased Repeats
(A) Schematic describing the hypothesis of how the introduction of an interrupting signal (red) tends to be rejected, while the same signal in the opposite strand is not disruptive (white) and therefore is neutral. This process yields a strand bias.
(B) The log-likelihood ratio contribution, of a single repeat, to the claim of...
FEAST Scores in Intergenic Segments
We selected the regions between consecutive genes in the same orientation and normalized their orientation to the forward strand. These regions show a prevalence of positive FEAST scores, indicating preference for transcription in the same strand as that of the flanking genes. Legend is the same as for Figure 3....