Dannie Durand's research while affiliated with Carnegie Mellon University and other places

Publications (87)

Article
Full-text available
Motivation: Simulation is an essential technique for generating biomolecular data with a 'known' history for use in validating phylogenetic inference and other evolutionary methods. On longer time scales, simulation supports investigations of equilibrium behavior and provides a formal framework for testing competing evolutionary hypotheses. Twenty...
Article
The exon shuffling theory posits that intronic recombination creates new domain combinations, facilitating the evolution of novel protein function. This theory predicts that introns will be preferentially situated near domain boundaries. Many studies have sought evidence for exon shuffling by testing the correspondence between introns and domain bo...
Data
Specificity residues of predicted Spo0 Proteins. (XLSX)
Article
Full-text available
The evolution of signal transduction pathways is constrained by the requirements of signal fidelity, yet flexibility is necessary to allow pathway remodeling in response to environmental challenges. A detailed understanding of how flexibility and constraint shape bacterial two component signaling systems is emerging, but how new signal transduction...
Data
Phylogram of 84 representative Firmicute species. Phylogram constructed from the concatenated alignment of 50 ribosomal protein sequences from 84 genomes using RAxML [62], as described in Methods. Outgroup rooted with Leptotrichia buccalis. Branch labels represent bootstrap replicates; branch lengths in units of substitutions per site. Colored bran...
Data
Phylogenetic distribution of predicted Spo0 pathway proteins in the Yutin tree. Cladogram of 68 Firmicutes genomes, outgroup rooted using Leptotrichia buccalis and Fusobacterium nucleatum, adapted from Fig 1 in Yutin and Galperin [44]. Leaves are labeled with the taxonomic names used in the original publication. The names of species that have been...
Data
Genome content conservation in regions flanking Spo0B marker genes. Firmicutes cladogram from Fig 3, annotated with Spo0B marker gene neighborhoods. The marker gene neighborhood in each genome was identified as follows: The gene identifiers and domain annotations of the eight genes flanking each Spo0B marker gene (L21, L27, and ObgE) were retrieved...
Data
Spo0 pathway proteins experimentally verified in prior studies. (XLSX)
Data
Time course of Dtox_1918 kinase autophosphorylation. Radiograph of Dtox_1918 time course indicates that peak autophosphorylation is achieved within 15 minutes of addition of radiolabeled ATP. Each lane contains a sample from an incubation of 5 μM kinase in HKEG buffer, supplemented with 5 mM MgCl2, 500 μM ATP, and 0.5 μCi/μL [γ32P]-ATP from a stock...
Data
Genomic Location of Spo0 Proteins. (XLSX)
Data
Comparison of the distribution of predicted Spo0 pathways in three Firmicute phylogenies. (PDF)
Data
Phylogenetic distribution of predicted Spo0 pathway proteins in the Antunes tree. Cladogram of 205 Firmicutes genomes, rooted by 13 outgroup species, adapted from Fig 2 in Antunes et al. [43]. Leaves are labeled with the taxonomic names used in the original publication. The names of species that have been recently reclassified, or are under conside...
Data
Comparison of orphan kinase catalytic domain content. (PDF)
Data
Genome content conservation in regions flanking Spo0F marker genes. Firmicutes cladogram from Fig 3, annotated with Spo0F marker gene neighborhoods. The marker gene neighborhood in each genome was identified as follows: The gene identifiers and domain annotations of the eight genes flanking each Spo0F marker gene (Fructose bisphosphate aldolase, Tr...
Data
Spatial distribution of Spo0-encoding genes along the genome. Genomic location of genes encoding predicted Spo0 proteins in (A) B. subtilis (5 orphan kinases) and (B) all spore-forming members of the representative Firmicutes used in this study for which a complete, fully assembled genome sequence is available (40 genomes, 204 orphan kinases). Norm...
Data
Lack of evidence for Spo0B autophosphorylation. Phosphotransfer from Dtox_1918 was assessed in incubations with Dtox_Spo0F, Dtox_Spo0B, or both, as indicated. Reactions were sampled 5, 15, and 30 minutes after initiation of the reaction and mixed with LDS prior to separation on a 10% SDS-PAGE gel. Dtox_Spo0B phosphorylation is only observed in the...
Data
Orphan kinase catalytic domain content. (XLSX)
Data
Oligonucleotide sequences used in the protein expression constructs. See S4 Text for details. (FASTA)
Data
Quantitative comparison of specificity residues. (PDF)
Data
Construction of expression vectors for N-terminal fusion proteins. (PDF)
Data
Amino acid sequences of N-terminally tagged Spo0 proteins. See S4 Text for details. (FASTA)
Article
Full-text available
Highly Iterated Palindrome 1 (HIP1, GCGATCGC) is hyper-abundant in most cyanobacterial genomes. In some cyanobacteria, average HIP1 abundance exceeds one motif per gene. Such high abundance suggests a significant role in cyanobacterial biology. However, 20 years of study have not revealed whether HIP1 has a function, much less what that function mi...
Article
Full-text available
Streptococcus pneumoniae (pneumococcus) displays broad tissue tropism and infects multiple body sites in the human host. However, infections of the conjunctiva are limited to strains within a distinct phyletic group with multilocus sequence types ST448, ST344, ST1186, ST1270, and ST2315. In this study, we sequenced the genomes of six pneumococcal s...
Conference Paper
Weak branch supports in a gene tree suggest that the signal in sequence data is insufficient to resolve a particular branching order. One approach to reduce uncertainty takes the topology of the species tree into account. Under a maximum parsimony model, the best resolution of the weak branches is the binary tree that minimizes the cost of duplicat...
Article
Full-text available
Motivation: Orthology analysis is a fundamental tool in comparative genomics. Sophisticated methods have been developed to distinguish between orthologs and paralogs and to classify paralogs into subtypes depending on the duplication mechanism and timing, relative to speciation. However, no comparable framework exists for xenologs: gene pairs whos...
Article
Full-text available
Background: Reconstructing evolution provides valuable insights into the processes of gene evolution and function. However, while there have been great advances in algorithms and software to reconstruct the history of gene families, these tools do not model the domain shuffling events (domain duplication, insertion, transfer, and deletion) that dr...
Article
Full-text available
Phylogenetic birth-death models are opening a new window on the processes of genome evolution in studies of the evolution of gene and protein families, protein-protein interaction networks, microRNAs, and copy number variation. Given a species tree and a set of genomic characters in present-day species, the birth-death approach estimates the most l...
Article
Gene functions, interactions, disease associations, and ecological distributions are all correlated with gene age. However, it is challenging to estimate the intricate series of evolutionary events leading to a modern-day gene and then to reduce this history to a single age estimate. Focusing on eukaryotic gene families, we introduce a framework th...
Article
Full-text available
Gene duplication (D), transfer (T), loss (L) and incomplete lineage sorting (I) are crucial to the evolution of gene families and the emergence of novel functions. The history of these events can be inferred via comparison of gene and species trees, a process called reconciliation, yet current reconciliation algorithms model only a subset of these...
Article
Inferring a protein's function by homology is a powerful tool for biologists. The Princeton Protein Orthology Database (P-POD) offers a simple way to visualize and analyze the relationships between homologous proteins in order to infer function. P-POD contains computationally generated analysis distinguishing orthologs from paralogs combined with c...
Article
Full-text available
Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate...
Article
Full-text available
P-POD, the Princeton Protein Orthology Database, classifies proteins from model organisms and medically-important organisms into families of homologs and provides curated evidence from the literature addressing these relationships. The web page for each protein family includes a phylogenetic tree, sequence alignment, and cross-references to disease...
Article
Full-text available
Identifying genomic regions that descended from a common ancestor is important for understanding the function and evolution of genomes. In distantly related genomes, clusters of homologous gene pairs are evidence of candidate homologous regions. Demonstrating the statistical significance of such “gene clusters” is an essential component of comparat...
Article
Full-text available
Reconciliation extracts information from the topological incongruence between gene and species trees to infer duplications and losses in the history of a gene family. The inferred duplication-loss histories provide valuable information for a broad range of biological applications, including ortholog identification, estimating gene duplication times...
Data
Full-text available
Distribution of Neighborhood Correlation scores for all sequence pairs. (0.00 MB PDF)
Data
Precision and Recall for predictions using simple alignment coverage thresholds of 0.3, 0.6, and 0.8 for all families. (0.07 MB DOC)
Data
Distributions of alignment coverage for all families. Distributions of alignment coverage calculated with the optimal alignment length only (FF: blue, FO: red) and with combined non-conflicting alignments (FF: turquoise, FO: brown) for all families. (0.03 MB PDF)
Data
Distributions of BLAST and NC scores for all families. (FF: blue, FO: red). (0.04 MB PDF)
Data
ROC-100k curves for all families. ROC-100k curves of Neighborhood Correlation (blue), PSI-BLAST (magenta), DAC (purple), and BLAST sequence similarity with alignment coverage thresholds of α≥0.0 (red), α≥0.3 (green), α≥0.6 (yellow), and α≥0.8 (orange) for all families. (0.15 MB PDF)
Data
Precision and recall for predictions using combined alignment coverage thresholds of 0.3, 0.6, and 0.8 for all families. (0.07 MB DOC)
Article
Full-text available
We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and co...
Article
Gene clusters that span three or more chromosomal regions are of increasing importance, yet statistical tests to validate such clusters are in their infancy. Current approaches either conduct several pairwise comparisons or consider only the number of genes that occur in all of the regions. In this paper, we provide statistical tests for clusters s...
Article
Homology identification is the first step for many genomic studies. Current methods, based on sequence comparison, can result in a substantial number of mis-assignments due to the similarity of homologous domains in otherwise unrelated sequences. Here we propose methods to detect homologs through explicit comparison of protein domain content. We de...
Conference Paper
Homology identification is the first step for many genomic studies. Current methods, based on sequence comparison, can result in a substantial number of mis-assignments due to the alignment of homologous domains in otherwise unrelated sequences. Here we propose methods to detect homologs through explicit comparison of domain architecture. We develo...
Article
Full-text available
We study properties of multidomain proteins from a graph theoretical perspective. In particular, we demonstrate connections between properties of the domain overlap graph and certain variants of Dollo parsimony models. We apply our graph theoretical results to address several interrelated questions: do proteins acquire new domains infrequently, or...
Article
New genes arise through duplication and modification of DNA sequences on a range of scales: single gene duplication, duplication of large chromosomal fragments and whole-genome duplication. Each duplication mechanism has specific characteristics that influence the fate of the resulting duplicates, such as the size of the duplicated fragment, the po...
Article
Full-text available
Gene family evolution is determined by microevolutionary processes (e.g., point mutations) and macroevolutionary processes (e.g., gene duplication and loss), yet macroevolutionary considerations are rarely incorporated into gene phylogeny reconstruction methods. We present a dynamic program to find the most parsimonious gene family tree with respec...
Article
Statistical validation of gene clusters is imperative for many important applications in comparative genomics which depend on the identification of genomic regions that are historically and/or functionally related. We develop the first rigorous statistical treatment of max-gap clusters, a cluster definition frequently used in empirical studies. We...
Conference Paper
There is widespread interest in comparative genomics in determining if historically and/or functionally related genes are spatially clustered in the genome, and whether the same sets of genes reappear in clusters in two or more genomes. We formalize and analyze the desirable properties of gene clusters and cluster definitions. Through detailed an...
Conference Paper
Identification of homologous chromosomal regions is important for understanding evolutionary processes that shape genome evolution, such as genome rearrangements and large scale duplication events. If these chromosomal regions have diverged significantly, statistical tests to determine whether observed similarities in gene content are due to hi...
Article
Gene family evolution is determined by microevolutionary processes (e.g., point mutations) and macroevolutionary processes (e.g., gene duplication and loss), yet macroevolutionary considerations are rarely incorporated into gene phylogeny reconstruction methods. We present a dynamic program to find the most parsimonious gene family tree with respec...
Conference Paper
Identifying gene clusters, genomic regions that share local similarities in gene organization, is a prerequisite for many different types of genomic analyses, including operon prediction, reconstruction of chromosomal rearrangements, and detection of whole-genome duplications. A number of formal definitions of gene clusters have been proposed, as...
Article
Full-text available
Gene duplications are a widely studied phenomenon. Gene duplications differ from other genomic rearrangements, such as transpositions and reversals, in that the time of duplication can be estimated; that is, we can in some cases calibrate duplications with respect to speciation events. The dating of duplication events has been used to argue for or ag...
Article
A growing imbalance in CPU and I/O speeds has led to a communications bottleneck in distributed architectures, especially for data-intensive applications such as multimedia information systems, databases, and Grand Challenge problems. Our solution is to schedule parallel I/O operations explicitly. We present a class of decentralized scheduling algo...
Article
The number and role of whole-genome duplications in vertebrate evolution has intrigued evolutionary biologists since Ohno first proposed genome duplication as the force driving the 'big leap' in vertebrate morphological innovation. Attempts to resolve these issues have been thwarted by small and noisy datasets, and by lack of computational accuracy...
Article
Comparing chromosomal gene order in two or more related species is an important approach to studying the forces that guide genome organization and evolution. Linked clusters of similar genes found in related genomes are often used to support arguments of evolutionary relatedness or functional selection. However, as the gene order and the gene compl...
Article
Full-text available
Large scale gene duplication is a major force driving the evolution of genetic functional innovation. Whole genome duplications are widely believed to have played an important role in the evolution of the maize, yeast, and vertebrate genomes. The use of evolutionary trees to analyze the history of gene duplication and estimate duplication times pro...
Article
Full-text available
Large scale gene duplication is a major force driving the evolution of genetic functional innovation. Whole genome duplications are widely believed to have played an important role in the evolution of the maize, yeast and vertebrate genomes. The use of evolutionary trees to analyze the history of gene duplication and estimate duplication times prov...
Article
We use stochastic population models to study the evolution of Ultraselfish Gene Complexes (USGC's). USGC's are chromosomal regions characterized by segregation distortion: a heterozygote bearing the USGC passes it to more than 50% of offspring. USGC-bearing homozygotes are sterile. USGC's promote themselves at the expense of other genes in the same...
Article
Full-text available
A growing imbalance in CPU and I/O speeds has led to a communications bottleneck in distributed architectures, especially for data-intensive applications such as multimedia information systems, databases, and Grand Challenge problems. Our solution is to schedule parallel I/O operations explicitly. We present a class of decentralized scheduling algo...
Article
Full-text available
The advent of recombinant DNA technology during the 1970s has led to an inundation of biological sequence data. The compilation and analysis of DNA and protein sequences is now a fundamental task in molecular biology. Computational Molecular Biology is the field of computer science that has emerged to solve algorithmic problems in determi...
Article
Evolutionary trees are frequently used as the underlying model in the design of algorithms, optimization criteria and software packages for multiple sequence alignment (MSA). In this paper, we reexamine the suitability of trees as a universal model for MSA in light of the broad range of biological questions that MSA's are used to address. A tree mo...
Article
The t-haplotype is a chromosomal region in Mus musculus characterized by meiotic drive such that heterozygous males transmit t-bearing chromosomes to roughly 90% of their offspring. Most naturally occurring t-haplotypes express a recessive embryonic lethality, preventing fixation of the t-haplotype. Surprisingly, the t-haplotype occurs in nature as...
Article
Full-text available
Multiple sequence alignment (MSA) is important in functional, structural and evolutionary studies of sequence data. Much research has focussed on posing MSA as an optimization problem, and several optimization criteria have been explored. In this paper, we discuss biological and mathematical problems that arise in cost function design for the multi...
Article
Self-scheduling is a method for task scheduling in parallel programs, in which each processor acquires a new block of tasks for execution whenever it becomes idle. To get the best performance, the block size must be chosen to balance the scheduling overhead against the load imbalance. To determine the best block size, a better understanding of the...
Chapter
The cost of data transfers, and in particular of I/O operations, is a growing problem in parallel computing. This performance bottleneck is especially severe for data-intensive applications such as multimedia information systems, databases, and Grand Challenge problems. A promising approach to alleviating this bottleneck is to schedule parallel I/O...
Conference Paper
Full-text available
We propose a parameterized, randomized edge coloring algorithm for use in coordinating data transfers in fully connected distributed architectures such as parallel I/O subsystems and multimedia information systems. Our approach is to preschedule I/O requests to eliminate contention for I/O ports while maintaining an efficient use of bandwidth. Requ...
Article
Full-text available
The cost of data transfers, and in particular of I/O operations, is a growing problem in parallel computing. A promising approach to alleviating this bottleneck is to schedule parallel I/O operations explicitly. We develop a class of decentralized algorithms for scheduling parallel I/O operations, where the objective is to reduce the time required...
Conference Paper
Self-scheduling is a method for task scheduling in parallel programs, in which each processor acquires a new block of tasks for execution whenever it becomes idle. To get the best performance, the block size must be chosen to balance the scheduling overhead against the load imbalance. To determine the best block size, better analytical models of...
Article
• Macroevolutionary tree refinement: Macroevolutionary considerations are rarely incorporated in gene phylogeny reconstruction methods, leaving an important source of information untapped. Notung 2.0 offers a novel, computationally tractable approach to gene tree reconstruction that takes both micro- and macroevolution into account. Speed is achiev...
Article
A crucial task in the completion of genomic sequencing projects is annotation of predicted genes. Annotation is facilitated by classification of sequences into families. Since members of a family tend to have similar properties, attributes of well studied genes can be used to predict the properties of newly discovered genes. Typically, only a smal...

Citations

... These observations confirm that the presence of spo0A is insufficient to conclude that a bacterium is a sporeformer, although its absence unequivocally indicates that it is not. In agreement with the previous reports (32,33,51), Spo0A-encoding nonsporeformers were found in the classes Bacilli, Clostridia, and Erysipelotrichia. Among Bacilli, our set included six Spo0A-encoding nonsporeformers, all in the order Bacillales: Lentibacillus amyloliquefaciens in the family Bacillaceae, Kurthia zopfii and Planococcus antarcticus in Planococcaceae; Macrococcus caseolyticus in Staphylococcaceae, Novibacillus thermophilus in Thermoactinomycetaceae, and Exiguobacterium sp. from Bacillales family XII, incertae sedis. ...
... PCC 6803 (from here: Synechocystis 6803) HIP1 instances occur at the frequency of one copy in every 1131 bp [1,3]. Statistical analyses supported the hypothesis that HIP1 motifs are maintained by selection, suggesting that HIP1 motifs likely perform biological functions [4]. A relation between the presence of HIP1 motifs and DNA recombination and/or repair processes has been suggested [5]. ...
... Genetic flux events during asymptomatic nasopharyngeal carriage facilitate acquisition of diverse virulence factors that may be responsible for tissue tropism, ultimately determining the nature and course of subsequent invasive pneumococcal disease (IPD) [8,10,19,23,24,25]. Despite the large extent of genetic exchanges resulting in diverse pneumococcal lineages, pneumococci maintain a relatively constant genome size (~2.1 Mb) and GC-content (~39.5%), ...
... In DTL, the problem is NP hard [93]. Heuristics [94] and exact fixed parameter tractable algorithms [93] are possible resolutions. ...
... Methods exist to perform gene tree-species tree reconciliation to detect xenologs (i.e., HGTs) [7,8] and are able to distinguish genes that were transferred horizontally with or without duplication events. However, providing a species tree together with the gene tree is required. ...
... Gene duplication and loss events were reconstructed by reconciling a gene tree with a species tree in NOTUNG version 2.9.1.3 (Chen et al. 2000; Durand et al. 2006; Vernot et al. 2008). A gene tree of putatively functional OR proteins at least 300 amino acids in length was generated using published protein sequences for 4 ants (A. ...
... Multidomain protein families evolve by domain shuffling, a process that includes the acquisition (e.g., insertion, transfer), duplication, fusion, fission, and loss of domains [LB19b; Oak17; WRK12] (Figure 2.5). The gains, losses, and replacements/modifications of domains resulting from this shuffling cause immediate and drastic changes in protein function, enabling functional variation among genes of the same family [Sto+15]. Domain shuffling results mainly from the shuffling of gene exons [CFS10; Pat96], which in turn enables the evolution of phenotypes and the functional diversity of proteins within a family [Kaw+09], including the emergence of protein-protein interactions [CFS10]. ...
... More recent work (see [JWB96] for some examples) has successfully addressed the need for higher throughput I/O, but these systems are almost exclusively tailored for the sort of structured I/O patterns associated with regular scientific applications. Nieuwejaar and Kotz's extensions to I/O paradigms to allow efficient strided access [NK96] are a good example of this line of work, as are [MS96] and [DJT96]. ...
... The analysis was carried out on the combined set of mouse and human sequences. In a preliminary study, we compared the performance of Neighborhood Correlation on a smaller, combined set of mouse and human sequences with its performance on separate sets of mouse and human sequences [75] to determine whether Neighborhood Correlation performs differently on comparisons within and across genomes. The mouse-only and human-only data test the ability to classify paralogs within a single mammalian species, as opposed to the combination of orthologs and paralogs seen in the combined dataset. ...
... However, after subtraction this gene was in the list of 409 zebrafish genes without lamprey homologs and therefore could be classified as novel for fishes as compared to lamprey. This is an example of a gene with a complex evolutionary history which may not have a single well-defined age [11]. That is why we also used another method of estimation of gene orthology, OMA (Orthologous MAtrix). ...