Daniel H Haft

National Institutes of Health, Bethesda, Maryland, United States

Publications (62) · 648.49 Total impact

  • ABSTRACT: The evolution of CRISPR-cas loci, which encode adaptive immune systems in archaea and bacteria, involves rapid changes, in particular numerous rearrangements of the locus architecture and horizontal transfer of complete loci or individual modules. These dynamics complicate straightforward phylogenetic classification, but here we present an approach combining the analysis of signature protein families and features of the architecture of cas loci that unambiguously partitions most CRISPR-cas loci into distinct classes, types and subtypes. The new classification retains the overall structure of the previous version but is expanded to now encompass two classes, five types and 16 subtypes. The relative stability of the classification suggests that the most prevalent variants of CRISPR-Cas systems are already known. However, the existence of rare, currently unclassifiable variants implies that additional types and subtypes remain to be characterized.
    Full-text · Article · Sep 2015 · Nature Reviews Microbiology
  • ABSTRACT: During 11–12 August 2014, a Protein Bioinformatics and Community Resources Retreat was held at the Wellcome Trust Genome Campus in Hinxton, UK. This meeting brought together the principal investigators of several specialized protein resources (such as CAZy, TCDB and MEROPS) as well as those from protein databases from the large Bioinformatics centres (including UniProt and RefSeq). The retreat was divided into five sessions: (1) key challenges, (2) the databases represented, (3) best practices for maintenance and curation, (4) information flow to and from large data centers and (5) communication and funding. An important outcome of this meeting was the creation of a Specialist Protein Resource Network that we believe will improve coordination of the activities of its member resources. We invite further protein database resources to join the network and continue the dialogue.
    Full-text · Article · Jul 2015 · Database The Journal of Biological Databases and Curation
  • Daniel H Haft
    ABSTRACT: Bioinformatics looks to many microbiologists like a service industry. In this view, annotation starts with what is known from experiments in the lab, makes reasonable inferences of which genes match other genes in function, builds databases to make all that we know accessible, but creates nothing truly new. Experiments lead, then biocuration and computational biology follow. But the astounding success of genome sequencing is changing the annotation paradigm. Every genome sequenced is an intercepted coded message from the microbial world, and as all cryptographers know, it is easier to decode a thousand messages than a single message. Some biology is best discovered not by phenomenology, but by decoding genome content, forming hypotheses, and doing the first few rounds of validation computationally. Through such reasoning, a role and function may be assigned to a protein with no sequence similarity to any protein yet studied. Experimentation can follow after the discovery to cement and to extend the findings. Unfortunately, this approach remains so unfamiliar to most bench scientists that lab work and comparative genomics typically segregate to different teams working on unconnected projects. This review will discuss several themes in comparative genomics as a discovery method, including highly derived data, use of patterns of design to reason by analogy, and in silico testing of computationally generated hypotheses. Copyright © 2014 Elsevier Ltd. All rights reserved.
    No preview · Article · Jan 2015 · Current Opinion in Microbiology
  • ABSTRACT: Acinetobacter baumannii is a globally important nosocomial pathogen characterized by an increasing incidence of multidrug resistance. Routes of dissemination and gene flow among health care facilities are poorly resolved and are important for understanding the epidemiology of A. baumannii, minimizing disease transmission, and improving patient outcomes. We used whole-genome sequencing to assess diversity and genome dynamics in 49 isolates from one United States hospital system during one year from 2007 to 2008. Core single-nucleotide-variant-based phylogenetic analysis revealed multiple founder strains and multiple independent strains recovered from the same patient yet was insufficient to fully resolve strain relationships, where gene content and insertion sequence patterns added additional discriminatory power. Gene content comparisons illustrated extensive and redundant antibiotic resistance gene carriage and direct evidence of gene transfer, recombination, gene loss, and mutation. Evidence of barriers to gene flow among hospital components was not found, suggesting complex mixing of strains and a large reservoir of A. baumannii strains capable of colonizing patients. Importance: Genome sequencing was used to characterize multidrug-resistant Acinetobacter baumannii strains from one United States hospital system during a 1-year period to better understand how A. baumannii strains that cause infection are related to one another. Extensive variation in gene content was found, even among strains that were very closely related phylogenetically and epidemiologically. Several mechanisms contributed to this diversity, including transfer of mobile genetic elements, mobilization of insertion sequences, insertion sequence-mediated deletions, and genome-wide homologous recombination. Variation in gene content, however, lacked clear spatial or temporal patterns, suggesting a diverse pool of circulating strains with considerable interaction between strains and hospital locations. Widespread genetic variation among strains from the same hospital and even the same patient, particularly involving antibiotic resistance genes, reinforces the need for molecular diagnostic testing and genomic analysis to determine resistance profiles, rather than a reliance primarily on strain typing and antimicrobial resistance phenotypes for epidemiological studies.
    Full-text · Article · Dec 2014 · mBio
  • ABSTRACT: Toward achieving rapid and large scale genome modification directly in a target organism, we have developed a new genome engineering strategy that uses a combination of bioinformatics aided design, large synthetic DNA and site-specific recombinases. Using Cre recombinase we swapped a target 126-kb segment of the Escherichia coli genome with a 72-kb synthetic DNA cassette, thereby effectively eliminating over 54 kb of genomic DNA from three non-contiguous regions in a single recombination event. We observed complete replacement of the native sequence with the modified synthetic sequence through the action of the Cre recombinase and no competition from homologous recombination. Because of the versatility and high-efficiency of the Cre-lox system, this method can be used in any organism where this system is functional as well as adapted to use with other highly precise genome engineering systems. Compared to present-day iterative approaches in genome engineering, we anticipate this method will greatly speed up the creation of reduced, modularized and optimized genomes through the integration of deletion analyses data, transcriptomics, synthetic biology and site-specific recombination.
    Full-text · Article · Jun 2014 · Nucleic Acids Research
  • ABSTRACT: Whole-genome sequencing has become a powerful and informative approach for determining the genetic basis of known bacterial properties, predicting new properties, and enabling post-genomic tools. However, genome sequencing and annotation are most useful in the context of comparative genomic and evolutionary analysis, which allows the determination of phylogenetic relationships between extant organisms, provides insights into the evolution of different biological systems, and sheds light on processes accounting for organismal diversity. Genome sequence information is currently available for 20 species of Planctomycetes, Verrucomicrobia, and Lentisphaerae as well as for a number of chlamydial species. In this chapter, we show how this information can be employed to infer molecular mechanisms underlying important ecological, physiological, and evolutionary characteristics of these bacteria and address general questions regarding mechanisms of genome evolution in the PVC superphylum. © 2013 Springer Science+Business Media New York. All rights reserved.
    No preview · Article · Jan 2014
  • Kira S. Makarova · Daniel H. Haft · Eugene V. Koonin
    ABSTRACT: This chapter presents an overview of all protein families related to the clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated (Cas) systems, with particular emphasis on their characteristic domains and domain architectures, and briefly discusses the functional and evolutionary implications. By far the most common domain in Cas proteins is the RNA recognition motif (RRM). The RRM domains show remarkable diversity within the CRISPR-Cas systems and in particular comprise the scaffold of the Cascade complex. The combination of experimental structural studies and comparative analysis provides for detailed models of the structures of the Cascade complexes from different CRISPR-Cas types, revealing remarkable architectural uniformity.
    No preview · Chapter · Nov 2013
  • ABSTRACT: Leptospirosis is a globally important, neglected zoonotic infection caused by spirochetes of the genus Leptospira. Since genetic transformation remains technically limited for pathogenic Leptospira, a systems biology pathogenomic approach was used to infer leptospiral virulence genes by whole genome comparison of culture-attenuated Leptospira interrogans serovar Lai with its virulent, isogenic parent. Among the 11 pathogen-specific protein-coding genes in which non-synonymous mutations were found, a putative soluble adenylate cyclase with host cell cAMP-elevating activity, and two members of a previously unstudied ∼15 member paralogous gene family of unknown function were identified. This gene family was also uniquely found in the alpha-proteobacteria Bartonella bacilliformis and Bartonella australis that are geographically restricted to the Andes and Australia, respectively. How the pathogenic Leptospira and these two Bartonella species came to share this expanded gene family remains an evolutionary mystery. In vivo expression analyses demonstrated up-regulation of 10/11 Leptospira genes identified in the attenuation screen, and profound in vivo, tissue-specific up-regulation by members of the paralogous gene family, suggesting a direct role in virulence and host-pathogen interactions. The pathogenomic experimental design here is generalizable as a functional systems biology approach to studying bacterial pathogenesis and virulence and should encourage similar experimental studies of other pathogens.
    Full-text · Article · Oct 2013 · PLoS Neglected Tropical Diseases
  • ABSTRACT: Biological oxidation of methane to methanol by aerobic bacteria is catalysed by two different enzymes, the cytoplasmic or soluble methane monooxygenase (sMMO) and the membrane-bound or particulate methane monooxygenase (pMMO). Expression of MMOs is controlled by a 'copper-switch', i.e. sMMO is only expressed at very low copper : biomass ratios, while pMMO expression increases as this ratio increases. Methanotrophs synthesize a chalkophore, methanobactin, for the binding and import of copper. Previous work suggested that methanobactin was formed from a polypeptide precursor. Here we report that deletion of the gene suspected to encode for this precursor, mbnA, in Methylosinus trichosporium OB3b, abolishes methanobactin production. Further, gene expression assays indicate that methanobactin, together with another polypeptide of previously unknown function, MmoD, play key roles in regulating expression of MMOs. Based on these data, we propose a general model explaining how expression of the MMO operons is regulated by copper, methanobactin and MmoD. The basis of the 'copper-switch' is MmoD, and methanobactin amplifies the magnitude of the switch. Bioinformatic analysis of bacterial genomes indicates that the production of methanobactin-like compounds is not confined to methanotrophs, suggesting that its use as a metal-binding agent and/or role in gene regulation may be widespread in nature.
    Full-text · Article · Apr 2013 · Environmental Microbiology
  • ABSTRACT: Computational prediction of protein function is frequently error-prone and incomplete. In Mycobacterium tuberculosis (Mtb), ∼25% of all genes have no predicted function and are annotated as hypothetical proteins, severely limiting our understanding of Mtb pathogenicity. Here, we utilize a high-throughput quantitative activity-based protein profiling (ABPP) platform to probe, annotate, and validate ATP-binding proteins in Mtb. We experimentally validate prior in silico predictions of >240 proteins and identify 72 hypothetical proteins as ATP binders. ATP interacts with proteins with diverse and unrelated sequences, providing an expanded view of adenosine nucleotide binding in Mtb. Several hypothetical ATP binders are essential or taxonomically limited, suggesting specialized functions in mycobacterial physiology and pathogenicity.
    Preview · Article · Jan 2013 · Chemistry & biology
  • ABSTRACT: TIGRFAMs, available online at http://www.jcvi.org/tigrfams, is a database of protein family definitions. Each entry features a seed alignment of trusted representative sequences, a hidden Markov model (HMM) built from that alignment, cutoff scores that let automated annotation pipelines decide which proteins are members, and annotations for transfer onto member proteins. Most TIGRFAMs models are designated equivalog, meaning they assign a specific name to proteins conserved in function from a common ancestral sequence. Models describing more functionally heterogeneous families are designated subfamily or domain, and assign less specific but more widely applicable annotations. The Genome Properties database, available at http://www.jcvi.org/genome-properties, specifies how computed evidence, including TIGRFAMs HMM results, should be used to judge whether an enzymatic pathway, a protein complex or another type of molecular subsystem is encoded in a genome. TIGRFAMs and Genome Properties content are developed in concert because subsystems reconstruction for large numbers of genomes guides selection of seed alignment sequences and cutoff values during protein family construction. Both databases specialize heavily in bacterial and archaeal subsystems. At present, 4284 models appear in TIGRFAMs, while 628 systems are described by Genome Properties. Content derives both from subsystem discovery work and from biocuration of the scientific literature.
    Full-text · Article · Nov 2012 · Nucleic Acids Research
  • ABSTRACT: Covering: 1988 to 2012. This review presents recommended nomenclature for the biosynthesis of ribosomally synthesized and post-translationally modified peptides (RiPPs), a rapidly growing class of natural products. The current knowledge regarding the biosynthesis of the >20 distinct compound classes is also reviewed, and commonalities are discussed.
    Full-text · Article · Nov 2012 · Natural Product Reports
  • Daniel H Haft · Andrey Tovchigrechko
    No preview · Article · Jun 2012 · Nature Methods
  • ABSTRACT: Biofilms are dense microbial communities. Although widely distributed and medically important, how biofilm cells interact with one another is poorly understood. Recently, we described a novel process whereby myxobacterial biofilm cells exchange their outer membrane (OM) lipoproteins. For the first time we report here the identification of two host proteins, TraAB, required for transfer. These proteins are predicted to localize in the cell envelope; and TraA encodes a distant PA14 lectin-like domain, a cysteine-rich tandem repeat region, and a putative C-terminal protein sorting tag named MYXO-CTERM, while TraB encodes an OmpA-like domain. Importantly, TraAB are required in donors and recipients, suggesting bidirectional transfer. By use of a lipophilic fluorescent dye, we also discovered that OM lipids are exchanged. Similar to lipoproteins, dye transfer requires TraAB function, gliding motility and a structured biofilm. Importantly, OM exchange was found to regulate swarming and development behaviors, suggesting a new role in cell-cell communication. A working model proposes TraA is a cell surface receptor that mediates cell-cell adhesion for OM fusion, in which lipoproteins/lipids are transferred by lateral diffusion. We further hypothesize that cell contact-dependent exchange helps myxobacteria to coordinate their social behaviors.
    Full-text · Article · Apr 2012 · PLoS Genetics
  • ABSTRACT: As the deluge of genomic DNA sequence grows, the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories with the ability to sequence genomes in a high-throughput manner grows, the informatics capability of those labs to accurately identify and annotate all genes within a genome may often be lacking. These issues have led to fears about transitive annotation errors making sequence databases less reliable. During the lifetime of the Pfam protein families database a number of protein families have been built, which were later identified as composed solely of spurious open reading frames (ORFs) either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that these may perform a useful function to identify new spurious ORFs. We have collected these families together in AntiFam along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as a part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
    Full-text · Article · Jan 2012 · Database The Journal of Biological Databases and Curation
  • Daniel H Haft · Samuel H Payne · Jeremy D Selengut
    ABSTRACT: Multiple new prokaryotic C-terminal protein-sorting signals were found that reprise the tripartite architecture shared by LPXTG and PEP-CTERM: motif, TM helix, basic cluster. Defining hidden Markov models were constructed for all. PGF-CTERM occurs in 29 archaeal species, some of which have more than 50 proteins that share the domain. PGF-CTERM proteins include the major cell surface protein in Halobacterium, a glycoprotein with a partially characterized diphytanylglyceryl phosphate linkage near its C terminus. Comparative genomics identifies a distant exosortase homolog, designated archaeosortase A (ArtA), as the likely protein-processing enzyme for PGF-CTERM. Proteomics suggests that the PGF-CTERM region is removed. Additional systems include VPXXXP-CTERM/archaeosortase B in two of the same archaea and PEF-CTERM/archaeosortase C in four others. Bacterial exosortases often fall into subfamilies that partner with very different cohorts of extracellular polymeric substance biosynthesis proteins; several species have multiple systems. Variant systems include the VPDSG-CTERM/exosortase C system unique to certain members of the phylum Verrucomicrobia, VPLPA-CTERM/exosortase D in several alpha- and deltaproteobacterial species, and a dedicated (single-target) VPEID-CTERM/exosortase E system in alphaproteobacteria. Exosortase-related families XrtF in the class Flavobacteria and XrtG in Gram-positive bacteria mark distinctive conserved gene neighborhoods. A picture emerges of an ancient and now well-differentiated superfamily of deeply membrane-embedded protein-processing enzymes. Their target proteins are destined to transit cellular membranes during their biosynthesis, during which most undergo additional posttranslational modifications such as glycosylation.
    Full-text · Article · Jan 2012 · Journal of bacteriology
  • Daniel H Haft · Neha Varghese
    ABSTRACT: The rhomboid family of serine proteases occurs in all domains of life. Its members contain at least six hydrophobic membrane-spanning helices, with an active site serine located deep within the hydrophobic interior of the plasma membrane. The model member GlpG from Escherichia coli is heavily studied through engineered mutant forms, varied model substrates, and multiple X-ray crystal studies, yet its relationship to endogenous substrates is not well understood. Here we describe an apparent membrane anchoring C-terminal homology domain that appears in numerous genera including Shewanella, Vibrio, Acinetobacter, and Ralstonia, but excluding Escherichia and Haemophilus. Individual genomes encode up to thirteen members, usually homologous to each other only in this C-terminal region. The domain's tripartite architecture consists of motif, transmembrane helix, and cluster of basic residues at the protein C-terminus, as also seen with the LPXTG recognition sequence for sortase A and the PEP-CTERM recognition sequence for exosortase. Partial Phylogenetic Profiling identifies a distinctive rhomboid-like protease subfamily almost perfectly co-distributed with this recognition sequence. This protease subfamily and its putative target domain are hereby renamed rhombosortase and GlyGly-CTERM, respectively. The protease and target are encoded by consecutive genes in most genomes with just a single target, but far apart otherwise. The signature motif of the Rhombo-CTERM domain, often SGGS, only partially resembles known cleavage sites of rhomboid protease family model substrates. Some protein families that have several members with C-terminal GlyGly-CTERM domains also have additional members with LPXTG or PEP-CTERM domains instead, suggesting there may be common themes to the post-translational processing of these proteins by three different membrane protein superfamilies.
    Preview · Article · Dec 2011 · PLoS ONE
  • Malay K Basu · Jeremy D Selengut · Daniel H Haft
    ABSTRACT: Phylogenetic profiling is a technique of scoring co-occurrence between a protein family and some other trait, usually another protein family, across a set of taxonomic groups. In spite of several refinements in recent years, the technique still invites significant improvement. To be its most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies. Partial Phylogenetic Profiling (PPP) is an iterative algorithm that scores a given taxonomic profile against the taxonomic distribution of families for all proteins in a genome. The method works through optimizing the boundary of each protein family, rather than by relying on prebuilt protein families or fixed sequence similarity thresholds. Double Partial Phylogenetic Profiling (DPPP) is a related procedure that begins with a single sequence and searches for optimal granularities for its surrounding protein family in order to generate the best query profiles for PPP. We present ProPhylo, a high-performance software package for phylogenetic profiling studies through creating individually optimized protein family boundaries. ProPhylo provides precomputed databases for immediate use and tools for manipulating the taxonomic profiles used as queries. ProPhylo results show universal markers of methanogenesis, a new DNA phosphorothioation-dependent restriction enzyme, and efficacy in guiding protein family construction. The software and the associated databases are freely available under the open source Perl Artistic License from ftp://ftp.jcvi.org/pub/data/ppp/.
    Full-text · Article · Nov 2011 · BMC Bioinformatics
  • Daniel H Haft · Malay Kumar Basu
    ABSTRACT: Data mining methods in bioinformatics and comparative genomics commonly rely on working definitions of protein families from prior computation. Partial phylogenetic profiling (PPP), by contrast, optimizes family sizes during its searches for the cooccurring protein families that serve different roles in the same biological system. In a large-scale investigation of the incredibly diverse radical S-adenosylmethionine (SAM) enzyme superfamily, PPP aided in building a collection of 68 TIGRFAMs hidden Markov models (HMMs) that define nonoverlapping and functionally distinct subfamilies. Many identify radical SAM enzymes as molecular markers for multicomponent biological systems; HMMs defining their partner proteins also were constructed. Newly found systems include five groupings of protein families in which at least one marker is a radical SAM enzyme while another, encoded by an adjacent gene, is a short peptide predicted to be its substrate for posttranslational modification. The most prevalent, in over 125 genomes, featuring a peptide that we designate SCIFF (six cysteines in forty-five residues), is conserved throughout the class Clostridia, a distribution inconsistent with putative bacteriocin activity. A second novel system features a tandem pair of putative peptide-modifying radical SAM enzymes associated with a highly divergent family of peptides in which the only clearly conserved feature is a run of His-Xaa-Ser repeats. A third system pairs a radical SAM domain peptide maturase with selenocysteine-containing targets, suggesting a new biological role for selenium. These and several additional novel maturases that cooccur with predicted target peptides share a C-terminal additional 4Fe4S-binding domain with PqqE, the subtilosin A maturase AlbA, and the predicted mycofactocin and Nif11-class peptide maturases as well as with activators of anaerobic sulfatases and quinohemoprotein amine dehydrogenases. Radical SAM enzymes with this additional domain, as detected by TIGR04085, significantly outnumber lantibiotic synthases and cyclodehydratases combined in reference genomes while being highly enriched for members whose apparent targets are small peptides. Interpretation of comparative genomics evidence suggests unexpected (nonbacteriocin) roles for natural products from several of these systems.
    Preview · Article · Jun 2011 · Journal of bacteriology
  • ABSTRACT: The CRISPR-Cas (clustered regularly interspaced short palindromic repeats-CRISPR-associated proteins) modules are adaptive immunity systems that are present in many archaea and bacteria. These defence systems are encoded by operons that have an extraordinarily diverse architecture and a high rate of evolution for both the cas genes and the unique spacer content. Here, we provide an updated analysis of the evolutionary relationships between CRISPR-Cas systems and Cas proteins. Three major types of CRISPR-Cas system are delineated, with a further division into several subtypes and a few chimeric variants. Given the complexity of the genomic architectures and the extremely dynamic evolution of the CRISPR-Cas systems, a unified classification of these systems should be based on multiple criteria. Accordingly, we propose a 'polythetic' classification that integrates the phylogenies of the most common cas genes, the sequence and organization of the CRISPR repeats and the architecture of the CRISPR-cas loci.
    Full-text · Article · Jun 2011 · Nature Reviews Microbiology

Publication Stats

11k Citations
648.49 Total Impact Points

Institutions

  • 2011-2015
    • National Institutes of Health
      • National Center for Biotechnology Information
      Bethesda, Maryland, United States
  • 2008-2015
    • J. Craig Venter Institute
      • Informatics
      Rockville, Maryland, United States
  • 2010
    • The University of Warwick
      • Biological Sciences
      Coventry, England, United Kingdom
  • 2009
    • University of Melbourne
      • Department of Microbiology and Immunology
      Melbourne, Victoria, Australia
  • 1999-2006
    • Biomedical Research Institute
      Rockville, Maryland, United States
  • 2004
    • George Washington University
      Washington, D.C., United States
  • 2003
    • The Forsyth Institute
      Cambridge, Massachusetts, United States
  • 2002
    • Wellcome Trust Sanger Institute
      Cambridge, England, United Kingdom