Effect of positive selection on disease-associated genes. Mean dN/dS between Homo sapiens and Pan troglodytes for disease-associated (green triangles) and nondisease-associated (blue circles) orthologs (A) and paralogs (B) at each taxonomic level. Category axis labels correspond to the taxonomic levels. The inset bar chart displays the percentage of disease-associated and nondisease-associated genes at each taxonomic level.

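As a rough illustration of the quantity plotted in the figure, the sketch below groups genes by taxonomic level and disease status and averages their dN/dS values. The records, the level names, and the function name `mean_dnds_by_level` are hypothetical and chosen only for illustration; they are not taken from the source publication, and the actual study may have computed its means differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (taxonomic level, disease-associated?, dN/dS).
# All values are invented for illustration and are not from the publication.
records = [
    ("Metazoa", True, 0.20),
    ("Metazoa", True, 0.30),
    ("Metazoa", False, 0.10),
    ("Vertebrata", True, 0.40),
    ("Vertebrata", False, 0.20),
    ("Vertebrata", False, 0.40),
]


def mean_dnds_by_level(rows):
    """Return the mean dN/dS for each (level, disease status) group."""
    groups = defaultdict(list)
    for level, is_disease, dnds in rows:
        groups[(level, is_disease)].append(dnds)
    return {key: mean(values) for key, values in groups.items()}


means = mean_dnds_by_level(records)
for (level, is_disease), value in sorted(means.items()):
    label = "disease" if is_disease else "non-disease"
    print(f"{level:12s} {label:12s} mean dN/dS = {value:.2f}")
```

Each (level, status) pair would correspond to one plotted point per panel, with orthologs and paralogs handled as separate input tables.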

Source publication
Article
Over 3,000 human diseases are known to be linked to heritable genetic variation, mapping to over 1,700 unique genes. Dating of the evolutionary age of these disease-associated genes has suggested that they have a tendency to be ancient, specifically coming into existence with early metazoa. The approach taken by past studies, however, assumes that...

Citations

... One of the possible reasons for the erroneous predictions is the inability of the automated algorithms to exclude functionally unrelated homologs from the multiple sequence alignment (MSA) used for variant interpretation. The majority of disease genes have been duplicated in their evolutionary history and usually only one copy of the gene is associated with a disease (7). Therefore, understanding gene's history and . ...
Preprint
Mutations in signal transduction pathways lead to various diseases including cancers. MEK1 kinase, encoded by the human MAP2K1 gene, is one of the central components of the MAPK pathway, and more than a hundred somatic mutations in the MAP2K1 gene have been identified in various tumors. Germline mutations deregulating MEK1 also lead to congenital abnormalities, such as the Cardiofaciocutaneous Syndrome and Arteriovenous Malformation. Evaluating variants associated with a disease is a challenge and computational genomic approaches aid in this process. Establishing the evolutionary history of a gene improves computational prediction of disease-causing mutations; however, the evolutionary history of MEK1 is not well understood. Here, by revealing a precise evolutionary history of MEK1, we construct a well-defined dataset of MEK1 metazoan orthologs, which provides sufficient depth to distinguish between conserved and variable amino acid positions. We used this dataset to match known and predicted disease-causing and benign mutations to evolutionary changes observed in corresponding amino acid positions. We found that all known and the vast majority of suspected disease-causing mutations are evolutionarily intolerable. We selected several MEK1 mutations that cannot be unambiguously assessed by automated variant prediction tools, but that are confidently identified as evolutionarily intolerant and thus “damaging” by our approach, for experimental validation in Drosophila. In all cases, evolutionarily intolerant variants caused increased mortality and severe defects in fruit fly embryos, confirming their damaging nature as predicted by our computational strategy. We anticipate that our analysis will serve as a blueprint to help evaluate known and novel missense variants in MEK1 and that our approach will contribute to improving automated tools for disease-associated variant interpretation.
Significance Statement: High-throughput genome sequencing has significantly improved diagnosis, management, and treatment of genetic diseases and cancers. However, in addition to its indisputable utility, genome sequencing produces many variants that cannot be easily interpreted, so-called variants of uncertain significance (VUS). Various automated bioinformatics tools can help predict the functional consequences of VUS, but their accuracy is relatively low. Here, by tracing the precise evolutionary history of each amino acid position in MEK1 kinase, mutations in which cause neurodegenerative diseases and cancer in humans, we can establish whether VUS seen in humans are evolutionarily tolerable. Using published data and newly performed experiments in an animal model, we show that evolutionarily tolerable variants in MEK1 are benign, whereas intolerable substitutions are damaging. Our approach will help in diagnostics of MEK1-associated diseases, is generalizable to many other disease-associated genes, and can help improve automated predictors.
... The same trend holds for an ever-increasing emergence of disease-associated genes in more recent speciation events (Dickerson & Robertson, 2012; Lopez-Bigas & Ouzounis, 2004), raising the question whether specific residues can be directly implicated in particular diseases. A correlation between intrinsic disorder and various human diseases such as cancer, diabetes, amyloidosis, and neurodegenerative diseases has already been established in specific cases (Choudhary et al., 2022; Monti et al., 2021, 2022), and is emerging as a significant biomedical research endeavour. ...
Article
Background: The evolutionary rate of disordered proteins varies greatly due to the lack of structural constraints. So far, few studies have investigated the presence/absence patterns of intrinsically disordered regions (IDRs) across phylogenies in conjunction with human disease. In this study, we report a genome-wide analysis of compositional bias association with disease in human proteins and their taxonomic distribution. Methods: The human genome protein set provided by the Ensembl database was annotated and analysed with respect to both disease associations and the detection of compositional bias. The Uniprot Reference Proteome dataset, containing 11297 proteomes, was used as the target dataset for the comparative genomics of a well-defined subset of the human genome, including 100 characteristic, compositionally biased proteins, some linked to disease. Results: Cross-evaluation of compositional bias and disease association in the human genome reveals a significant bias towards low complexity regions in disease-associated genes, with charged, hydrophilic amino acids appearing as over-represented. The phylogenetic profiling of 17 disease-associated, low complexity proteins across 11297 proteomes captures characteristic taxonomic distribution patterns. Conclusions: This is the first time that a combined genome-wide analysis of low complexity, disease association and taxonomic distribution of human proteins is reported, covering structural, functional, and evolutionary properties. The reported framework can form the basis for large-scale, follow-up projects, encompassing the entire human genome and all known gene-disease associations.
... FBRSL1 and AUTS2 belong to a tripartite gene family, the AUTS2 family, which also includes Fibrosin (FBRS) (Singh et al., 2015). The AUTS2 family is predicted to be an ohnolog gene family (Singh et al., 2015), representing a group of paralog genes generated by two rounds of whole genome duplication during vertebrate evolution and frequently implicated in human disease (Dickerson and Robertson, 2012; Singh et al., 2012; Malaguti et al., 2014; McLysaght et al., 2014). The AUTS2 family ohnologs show a large overlap of conserved regions, but also unique elements which likely contribute to the functional diversity of the proteins (Sellers et al., 2020). ...
Article
Truncating variants in specific exons of Fibrosin-like protein 1 (FBRSL1) were recently reported to cause a novel malformation and intellectual disability syndrome. The clinical spectrum includes microcephaly, facial dysmorphism, cleft palate, skin creases, skeletal anomalies and contractures, postnatal growth retardation, global developmental delay as well as respiratory problems, hearing impairment and heart defects. The function of FBRSL1 is largely unknown, but pathogenic variants in the FBRSL1 paralog Autism Susceptibility Candidate 2 (AUTS2) are causative for an intellectual disability syndrome with microcephaly (AUTS2 syndrome). Some patients with AUTS2 syndrome also show additional symptoms like heart defects and contractures overlapping with the phenotype presented by patients with FBRSL1 mutations. For AUTS2, a dual function, depending on different isoforms, was described and suggested for FBRSL1. Both nuclear FBRSL1 and AUTS2 are components of the Polycomb subcomplexes PRC1.3 and PRC1.5. These complexes have essential roles in developmental processes, cellular differentiation and proliferation by regulating gene expression via histone modification. In addition, cytoplasmic AUTS2 controls neural development, neuronal migration and neurite extension by regulating the cytoskeleton. Here, we review recent data on FBRSL1 with respect to previously published data on AUTS2 to gain further insights into its molecular function, its role in development as well as its impact on human genetics.
... predictions, namely, PolyPhen2 [32] (PPh2), Sorting Intolerant From Tolerant [33] (SIFT), Missense Badness Polyphen and Constraint [34] (MPC), Missense Tolerance Ratio [35] (MTR), Constrained Coding Regions [36] (CCR) and para-Z-score for paralog conservation [37,38]. The rationale behind the use of these scores is detailed in the supplemental methods (see the Appendix). ...
Article
Background: Analyses of few gene-sets in epilepsy showed a potential to unravel key disease associations. We set out to investigate the burden of ultra-rare variants (URVs) in a comprehensive range of biologically informed gene-sets presumed to be implicated in epileptogenesis. Methods: The burden of 12 URV types in 92 gene-sets was compared between cases and controls using whole exome sequencing data from individuals of European descent with developmental and epileptic encephalopathies (DEE, n = 1,003), genetic generalized epilepsy (GGE, n = 3,064), or non-acquired focal epilepsy (NAFE, n = 3,522), collected by the Epi25 Collaborative, compared to 3,962 ancestry-matched controls. Findings: Missense URVs in highly constrained regions were enriched in neuron-specific and developmental genes, whereas genes not expressed in brain were not affected. GGE featured a higher burden in gene-sets derived from inhibitory vs. excitatory neurons or associated receptors, whereas the opposite was found for NAFE, and DEE featured a burden in both. Top-ranked susceptibility genes from recent genome-wide association studies (GWAS) and gene-sets derived from generalized vs. focal epilepsies revealed specific enrichment patterns of URVs in GGE vs. NAFE. Interpretation: Missense URVs affecting highly constrained sites differentially impact genes expressed in inhibitory vs. excitatory pathways in generalized vs. focal epilepsies. The excess of URVs in top-ranked GWAS risk-genes suggests a convergence of rare deleterious and common risk-variants in the pathogenesis of generalized and focal epilepsies. Funding: DFG Research Unit FOR-2715 (Germany), FNR (Luxembourg), NHGRI (US), NHLBI (US), DAAD (Germany).
... Here, we hypothesize that variation in gene families with related structure and function in the brain will result in subtypes of NDDs with related pathology. With over 80% of Mendelian disease-associated genes being part of gene families and/or having functionally redundant paralogs, this provides an opportunity to divide many NDD candidate genes into subgroups [5,6]. In fact, it has recently been shown that DNVs are enriched among a subset of gene families in probands with NDDs [7]. ...
Article
Background: With the increasing number of genomic sequencing studies, hundreds of genes have been implicated in neurodevelopmental disorders (NDDs). The rate of gene discovery far outpaces our understanding of genotype–phenotype correlations, with clinical characterization remaining a bottleneck for understanding NDDs. Most disease-associated Mendelian genes are members of gene families, and we hypothesize that those with related molecular function share clinical presentations. Methods: We tested our hypothesis by considering gene families that have multiple members with an enrichment of de novo variants among NDDs, as determined by previous meta-analyses. One of these gene families is the heterogeneous nuclear ribonucleoproteins (hnRNPs), which has 33 members, five of which have been recently identified as NDD genes (HNRNPK, HNRNPU, HNRNPH1, HNRNPH2, and HNRNPR) and two of which have significant enrichment in our previous meta-analysis of probands with NDDs (HNRNPU and SYNCRIP). Utilizing protein homology, mutation analyses, gene expression analyses, and phenotypic characterization, we provide evidence for variation in 12 HNRNP genes as candidates for NDDs. Seven are potentially novel while the remaining genes in the family likely do not significantly contribute to NDD risk. Results: We report 119 new NDD cases (64 de novo variants) through sequencing and international collaborations and combined with published clinical case reports. We consider 235 cases with gene-disruptive single-nucleotide variants or indels and 15 cases with small copy number variants. Three hnRNP-encoding genes reach nominal or exome-wide significance for de novo variant enrichment, while nine are candidates for pathogenic mutations. Comparison of HNRNP gene expression shows a pattern consistent with a role in cerebral cortical development with enriched expression among radial glial progenitors. Clinical assessment of probands (n = 188–221) expands the phenotypes associated with HNRNP rare variants, and phenotypes associated with variation in the HNRNP genes distinguish them as a subgroup of NDDs. Conclusions: Overall, our novel approach of exploiting gene families in NDDs identifies new HNRNP-related disorders, expands the phenotypes of known HNRNP-related disorders, strongly implicates disruption of the hnRNPs as a whole in NDDs, and supports that NDD subtypes likely have shared molecular pathogenesis. To date, this is the first study to identify novel genetic disorders based on the presence of disorders in related genes. We also perform the first phenotypic analyses focusing on related genes. Finally, we show that radial glial expression of these genes is likely critical during neurodevelopment. This is important for diagnostics, as well as developing strategies to best study these genes for the development of therapeutics.
... The correct placement of duplication nodes in gene trees is a general problem that is not restricted to whole-genome duplications. Notably, the human genome is composed of at least 40% duplicated genes, and the vast majority of human disease genes have at least one paralog (Zhang 2003; Makino and McLysaght 2010; Dickerson and Robertson 2012; Sacerdot et al. 2018). In addition to the insufficient accounting for other biological events during reconciliation with the species tree (see section 6.3), uncertainties in the gene tree topologies themselves are a major source of errors in duplication placement. ...
Thesis
Whole-genome duplications are major events in the evolutionary history of species. They produce extra gene copies that can acquire new functions and thus contribute to adaptation and diversification. Two whole-genome duplications occurred in the lineage preceding the ancestor of vertebrates, followed by a third at the base of the teleost fishes (dated to 320 million years ago). The impressive diversity of the teleost clade, which represents more than half of extant vertebrate species, makes it possible to explore a wide range of functional and evolutionary questions. Indeed, the recent and ongoing sequencing of many fish species promises to complement the well-established zebrafish model. Nevertheless, their shared whole-genome duplication event poses a challenge for the analysis and comparison of fish genomes. Following the duplication, many genes remain in two copies in the genomes, which complicates the characterization of homology relationships between genes of different species. To solve this problem, I developed a new methodology dedicated to reconstructing gene trees in the context of whole-genome duplications, named SCORPiOs (Synteny-guided CORrection of Paralogies and Orthologies). The notable innovation behind SCORPiOs is the integration of information from the organization of genes within genomes (synteny) to complement methods based on the molecular evolution of sequences. I present how applying this new method to different sets of fish genomes improves our understanding of the evolution and structure of teleost genomes. First, I show that SCORPiOs highlights the contribution of duplicated genes to the evolutionary innovations of teleosts. The precise identification of orthologous and paralogous genes also allowed me to establish the first large-scale map of duplicated regions between fish genomes. This second result is a new resource that should facilitate the transfer of functional annotations between model and non-model species. Finally, I demonstrate how detailed analysis of the disagreements between synteny-based and sequence-based predictions helps to clarify the spatio-temporal patterns of the return to the diploid state after the whole-genome duplication. My work provides a framework to facilitate comparative analyses in teleost fishes and improves our knowledge of genome evolution after whole-genome duplication.
... In general, these results for the CDEN and DDEN gene sets agree with the previous observations regarding epigenetic regulation genes in posttraumatic stress disorder [57]. Another previous study demonstrated that the evolutionary origins of heritable genetic disease genes tend to be ancient, originating with the early metazoans [58]. The present analysis indicated that the origins of the CDEN, DDEN, and ER genes are all more ancient still: these genes likely originated in unicellular eukaryotes. ...
Article
We carried out a system-level analysis of epigenetic regulators (ERs) and detailed the protein-protein interaction (PPI) network characteristics of disease-associated ERs. We found that most diseases associated with ERs can be clustered into two large groups, cancer diseases and developmental diseases. ER genes formed a highly interconnected PPI subnetwork, indicating a high tendency to interact and agglomerate with one another. We used the disease module detection (DIAMOnD) algorithm to expand the PPI subnetworks into a comprehensive cancer disease ER network (CDEN) and developmental disease ER network (DDEN). Using the transcriptome from early mouse developmental stages, we identified the gene co-expression modules significantly enriched for the CDEN and DDEN gene sets, which indicated the stage-dependent roles of ER-related disease genes during early embryonic development. The evolutionary rate and phylogenetic age distribution analysis indicated that the evolution of CDEN and DDEN genes was mostly constrained, and these genes exhibited older evolutionary age. Our analysis of human polymorphism data revealed that genes belonging to DDEN and Seed-DDEN were more likely to show signs of recent positive selection in human history. This finding suggests a potential association between positive selection of ERs and risk of developmental diseases through the mechanism of antagonistic pleiotropy.
... Ohnologs may have facilitated increased genomic, morphological and developmental complexity of vertebrates, for example, the expansion of the vertebrate cerebral cortex, and are associated with signalling pathways and developmental genes in vertebrates [13]. Retained ohnologs are also disproportionately affected by pathogenic copy number variants, have an increased susceptibility to deleterious mutations, and are frequently associated with cancer and other genetic diseases [13][14][15]. Expression analyses in Zebrafish (Danio rerio) show that Auts2, Fbrs and Fbrsl1 display distinct spatiotemporal and isoform-specific neuronal expression patterns throughout embryonic and juvenile development [16]. Auts2 and Fbrsl1 both encode C-terminal isoforms in zebrafish [16]; two C-terminal isoforms of Auts2 (Variants 1 and 2) are documented in mouse (Mus musculus), and a homolog of Variant 2 has been confirmed in humans [2,5]. ...
... The TayD can be divided into three discrete subdomains: nTayD (AUTS2 exons 9-11), mTayD (AUTS2 exons 12-13) and cTayD (AUTS2 exons 14-18). There is no sequence similarity for the mTayD in aAUTS2p and Tay orthologs. ...
... This is supported by previous interaction studies, which show redundant binding activity between AUTS2, FBRS, and FBRSL1 with Polycomb and CK2 subunits [4,22]. As ohnologs are frequently identified as disease-associated genes [14], both FBRS and FBRSL1 should be investigated as potentially important proteins for future research, although they may not be as biologically important as AUTS2, due to their lower levels of internal conservation and the higher tolerance for missense variants occurring within evolutionarily conserved residues. In addition, FBRS is not present within any species of bird and, therefore, may perform either a non-essential or a detrimental function within avian biology. ...
Article
Autism susceptibility candidate 2 (AUTS2) is a neurodevelopmental regulator associated with an autosomal dominant intellectual disability syndrome, AUTS2 syndrome, and is implicated as an important gene in human-specific evolution. AUTS2 exists as part of a tripartite gene family, the AUTS2 family, which includes two relatively undefined proteins, Fibrosin (FBRS) and Fibrosin-like protein 1 (FBRSL1). Evolutionary ancestors of AUTS2 have not been formally identified outside of the Animalia clade. A Drosophila melanogaster protein, Tay bridge, with a role in neurodevelopment, has been shown to display limited similarity to the C-terminal of AUTS2, suggesting that evolutionary ancestors of the AUTS2 family may exist within other Protostome lineages. Here we present an evolutionary analysis of the AUTS2 family, which highlights ancestral homologs of AUTS2 in multiple Protostome species, implicates AUTS2 as the closest human relative to the progenitor of the AUTS2 family, and demonstrates that Tay bridge is a divergent ortholog of the ancestral AUTS2 progenitor gene. We also define regions of high relative sequence identity, with potential functional significance, shared by the extended AUTS2 protein family. Using structural predictions coupled with sequence conservation and human variant data from 15,708 individuals, a putative domain structure for AUTS2 was produced that can be used to aid interpretation of the consequences of nucleotide variation on protein structure and function in human disease. To assess the role of AUTS2 in human-specific evolution, we recalculated allele frequencies at previously identified human derived sites using large population genome data, and show a high prevalence of ancestral alleles, suggesting that AUTS2 may not be a rapidly evolving gene, as previously thought.
... On the other hand, duplication may also have important deleterious effects in humans and can be associated with some diseases [6]. For example, the analysis of human genes linked to diseases made it possible to show that 80% of them have been duplicated in their evolutionary history, the disease-associated mutation being associated with only one of the duplicated copies [7]. Recently, the analysis of the evolution of cancer suppression in mammals revealed that species known to be resistant to cancer contain the most cancer gene copies [8]. ...
Article
Gene duplication is an important evolutionary mechanism that provides new genetic material and thus opportunities for an organism to acquire new gene functions, with major implications such as speciation events. Various processes are known to allow a gene to be duplicated, and different models explain how duplicated genes can be maintained in genomes. Because of their particular importance, the identification of duplicated genes is essential when studying genome evolution, but it can still be a challenge because of the various fates duplicated genes can encounter. In this review, we first describe the evolutionary processes allowing the formation of duplicated genes and then describe the various bioinformatic approaches that can be used to identify them in genome sequences. Indeed, these bioinformatic approaches differ according to the underlying duplication mechanism. Hence, understanding the specificity of the duplicated genes of interest is a great asset for tool selection and should be taken into account when exploring a biological question.
... However, many cancer genes have a more ancient origin and can be traced back to unicellular organisms. These trends seem to apply to the appearance of disease genes [4] and novel genes in general as well [5]. These studies were based on the evolutionary history of the founder domains. ...
... However, new genes can also be generated by duplication either in whole or from part of existing genes, when the duplicate copy of a gene becomes associated with a different phenotype to its paralogous partner. This mechanism can also influence the emergence of disease genes [5]. ...
Article
Cancer is a heterogeneous genetic disease that alters the proper functioning of proteins involved in key regulatory processes such as cell cycle, DNA repair, survival, or apoptosis. Mutations often accumulate in hotspot regions, highlighting critical functional modules within these proteins that need to be altered, amplified, or abolished for tumor formation. Recent evidence suggests that these mutational hotspots can correspond not only to globular domains, but also to intrinsically disordered regions (IDRs), which play a significant role in a subset of cancer types. IDRs have distinct functional properties that originate from their inherent flexibility. Generally, they correspond to more recent evolutionary inventions and show larger sequence variations across species. In this work, we analyzed the evolutionary origin of disordered regions that are specifically targeted in cancer. Surprisingly, the majority of these disordered cancer risk regions showed remarkable conservation with ancient evolutionary origin, stemming from the earliest multicellular animals or even beyond. Nevertheless, we encountered several examples where the mutated region emerged at a later stage compared with the origin of the gene family. We also showed that the cancer risk regions become fixed quickly after their emergence, but evolution continues to tinker with their genes, with novel regulatory elements introduced even at the level of humans. Our concise analysis provides a much clearer picture of the emergence of key regulatory elements in proteins and highlights the importance of taking into account the modular organisation of proteins for the analyses of evolutionary origin.