Daniel H Haft

J. Craig Venter Institute, Rockville, Maryland, United States

Publications (74) · 731.58 Total impact

  •
    ABSTRACT: The evolution of CRISPR-cas loci, which encode adaptive immune systems in archaea and bacteria, involves rapid changes, in particular numerous rearrangements of the locus architecture and horizontal transfer of complete loci or individual modules. These dynamics complicate straightforward phylogenetic classification, but here we present an approach combining the analysis of signature protein families and features of the architecture of cas loci that unambiguously partitions most CRISPR-cas loci into distinct classes, types and subtypes. The new classification retains the overall structure of the previous version but is expanded to now encompass two classes, five types and 16 subtypes. The relative stability of the classification suggests that the most prevalent variants of CRISPR-Cas systems are already known. However, the existence of rare, currently unclassifiable variants implies that additional types and subtypes remain to be characterized.
    Nature Reviews Microbiology 09/2015; DOI:10.1038/nrmicro3569 · 23.57 Impact Factor
  •
    ABSTRACT: During 11–12 August 2014, a Protein Bioinformatics and Community Resources Retreat was held at the Wellcome Trust Genome Campus in Hinxton, UK. This meeting brought together the principal investigators of several specialized protein resources (such as CAZy, TCDB and MEROPS) as well as those from protein databases from the large Bioinformatics centres (including UniProt and RefSeq). The retreat was divided into five sessions: (1) key challenges, (2) the databases represented, (3) best practices for maintenance and curation, (4) information flow to and from large data centers and (5) communication and funding. An important outcome of this meeting was the creation of a Specialist Protein Resource Network that we believe will improve coordination of the activities of its member resources. We invite further protein database resources to join the network and continue the dialogue.
    Database The Journal of Biological Databases and Curation 07/2015; 2015:bav063. DOI:10.1093/database/bav063 · 3.37 Impact Factor
  • Daniel H Haft
    ABSTRACT: Bioinformatics looks to many microbiologists like a service industry. In this view, annotation starts with what is known from experiments in the lab, makes reasonable inferences of which genes match other genes in function, builds databases to make all that we know accessible, but creates nothing truly new. Experiments lead, then biocuration and computational biology follow. But the astounding success of genome sequencing is changing the annotation paradigm. Every genome sequenced is an intercepted coded message from the microbial world, and as all cryptographers know, it is easier to decode a thousand messages than a single message. Some biology is best discovered not by phenomenology, but by decoding genome content, forming hypotheses, and doing the first few rounds of validation computationally. Through such reasoning, a role and function may be assigned to a protein with no sequence similarity to any protein yet studied. Experimentation can follow after the discovery to cement and to extend the findings. Unfortunately, this approach remains so unfamiliar to most bench scientists that lab work and comparative genomics typically segregate to different teams working on unconnected projects. This review will discuss several themes in comparative genomics as a discovery method, including highly derived data, use of patterns of design to reason by analogy, and in silico testing of computationally generated hypotheses. Copyright © 2014 Elsevier Ltd. All rights reserved.
    Current Opinion in Microbiology 01/2015; 23C:189-196. DOI:10.1016/j.mib.2014.11.017 · 5.90 Impact Factor
  •
    ABSTRACT: Acinetobacter baumannii is a globally important nosocomial pathogen characterized by an increasing incidence of multidrug resistance. Routes of dissemination and gene flow among health care facilities are poorly resolved and are important for understanding the epidemiology of A. baumannii, minimizing disease transmission, and improving patient outcomes. We used whole-genome sequencing to assess diversity and genome dynamics in 49 isolates from one United States hospital system during one year from 2007 to 2008. Core single-nucleotide-variant-based phylogenetic analysis revealed multiple founder strains and multiple independent strains recovered from the same patient yet was insufficient to fully resolve strain relationships, where gene content and insertion sequence patterns added additional discriminatory power. Gene content comparisons illustrated extensive and redundant antibiotic resistance gene carriage and direct evidence of gene transfer, recombination, gene loss, and mutation. Evidence of barriers to gene flow among hospital components was not found, suggesting complex mixing of strains and a large reservoir of A. baumannii strains capable of colonizing patients. Importance: Genome sequencing was used to characterize multidrug-resistant Acinetobacter baumannii strains from one United States hospital system during a 1-year period to better understand how A. baumannii strains that cause infection are related to one another. Extensive variation in gene content was found, even among strains that were very closely related phylogenetically and epidemiologically. Several mechanisms contributed to this diversity, including transfer of mobile genetic elements, mobilization of insertion sequences, insertion sequence-mediated deletions, and genome-wide homologous recombination.
Variation in gene content, however, lacked clear spatial or temporal patterns, suggesting a diverse pool of circulating strains with considerable interaction between strains and hospital locations. Widespread genetic variation among strains from the same hospital and even the same patient, particularly involving antibiotic resistance genes, reinforces the need for molecular diagnostic testing and genomic analysis to determine resistance profiles, rather than a reliance primarily on strain typing and antimicrobial resistance phenotypes for epidemiological studies.
    mBio 12/2014; 5(1). DOI:10.1128/mBio.00963-13 · 6.79 Impact Factor
  •
    ABSTRACT: The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36 766 member database signatures integrated into 26 238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
    Nucleic Acids Research 11/2014; 43(D1). DOI:10.1093/nar/gku1243 · 9.11 Impact Factor
  •
    ABSTRACT: Toward achieving rapid and large scale genome modification directly in a target organism, we have developed a new genome engineering strategy that uses a combination of bioinformatics aided design, large synthetic DNA and site-specific recombinases. Using Cre recombinase we swapped a target 126-kb segment of the Escherichia coli genome with a 72-kb synthetic DNA cassette, thereby effectively eliminating over 54 kb of genomic DNA from three non-contiguous regions in a single recombination event. We observed complete replacement of the native sequence with the modified synthetic sequence through the action of the Cre recombinase and no competition from homologous recombination. Because of the versatility and high-efficiency of the Cre-lox system, this method can be used in any organism where this system is functional as well as adapted to use with other highly precise genome engineering systems. Compared to present-day iterative approaches in genome engineering, we anticipate this method will greatly speed up the creation of reduced, modularized and optimized genomes through the integration of deletion analyses data, transcriptomics, synthetic biology and site-specific recombination.
    Nucleic Acids Research 06/2014; 42(14). DOI:10.1093/nar/gku509 · 9.11 Impact Factor
  • Kira S. Makarova · Daniel H. Haft · Eugene V. Koonin
    ABSTRACT: This chapter presents an overview of all protein families related to the clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated (Cas) systems, with particular emphasis on their characteristic domains and domain architectures, and briefly discusses the functional and evolutionary implications. By far the most common domain in Cas proteins is the RNA recognition motif (RRM). The RRM domains show remarkable diversity within the CRISPR-Cas systems and in particular comprise the scaffold of the Cascade complex. The combination of experimental structural studies and comparative analysis provides for detailed models of the structures of the Cascade complexes from different CRISPR-Cas types, revealing remarkable architectural uniformity.
    Protein Families, 11/2013: pages 341-381; ISBN: 9780470624227
  •
    ABSTRACT: Leptospirosis is a globally important, neglected zoonotic infection caused by spirochetes of the genus Leptospira. Since genetic transformation remains technically limited for pathogenic Leptospira, a systems biology pathogenomic approach was used to infer leptospiral virulence genes by whole genome comparison of culture-attenuated Leptospira interrogans serovar Lai with its virulent, isogenic parent. Among the 11 pathogen-specific protein-coding genes in which non-synonymous mutations were found, a putative soluble adenylate cyclase with host cell cAMP-elevating activity, and two members of a previously unstudied ∼15 member paralogous gene family of unknown function were identified. This gene family was also uniquely found in the alpha-proteobacteria Bartonella bacilliformis and Bartonella australis that are geographically restricted to the Andes and Australia, respectively. How the pathogenic Leptospira and these two Bartonella species came to share this expanded gene family remains an evolutionary mystery. In vivo expression analyses demonstrated up-regulation of 10/11 Leptospira genes identified in the attenuation screen, and profound in vivo, tissue-specific up-regulation by members of the paralogous gene family, suggesting a direct role in virulence and host-pathogen interactions. The pathogenomic experimental design here is generalizable as a functional systems biology approach to studying bacterial pathogenesis and virulence and should encourage similar experimental studies of other pathogens.
    PLoS Neglected Tropical Diseases 10/2013; 7(10):e2468. DOI:10.1371/journal.pntd.0002468 · 4.45 Impact Factor
  •
    ABSTRACT: Experimental data exists for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
    PLoS Biology 08/2013; 11(8):e1001638. DOI:10.1371/journal.pbio.1001638 · 9.34 Impact Factor
  •
    ABSTRACT: Cell surfaces are decorated by a variety of proteins that facilitate interactions with their environments and support cell stability. These secreted proteins are anchored to the cell by mechanisms that are diverse, and, in archaea, poorly understood. Recently published in silico data suggest that in some species a subset of secreted euryarchaeal proteins, which includes the S-layer glycoprotein, is processed and covalently linked to the cell membrane by enzymes referred to as archaeosortases. In silico work led to the proposal that an independent, sortase-like system for proteolysis-coupled, carboxy-terminal lipid modification exists in bacteria (exosortase) and archaea (archaeosortase). Here, we provide the first in vivo characterization of an archaeosortase in the haloarchaeal model organism Haloferax volcanii. Deletion of the artA gene (HVO_0915) resulted in multiple biological phenotypes: (a) poor growth, especially under low-salt conditions, (b) alterations in cell shape and the S-layer, (c) impaired motility, suppressors of which still exhibit poor growth, and (d) impaired conjugation. We studied one of the ArtA substrates, the S-layer glycoprotein, using detailed proteomic analysis. While the carboxy-terminal region of S-layer glycoproteins, consisting of a threonine-rich O-glycosylated region followed by a hydrophobic transmembrane helix, has been notoriously resistant to any proteomic peptide identification, we were able to identify two overlapping peptides from the transmembrane domain present in the ΔartA strain but not in the wild-type strain. This clearly shows that ArtA is involved in carboxy-terminal posttranslational processing of the S-layer glycoprotein. 
As it is known from previous studies that a lipid is covalently attached to the carboxy-terminal region of the S-layer glycoprotein, our data strongly support the conclusion that archaeosortase functions analogously to sortase, mediating proteolysis-coupled, covalent cell surface attachment.
    Molecular Microbiology 05/2013; 88(6). DOI:10.1111/mmi.12248 · 4.42 Impact Factor
  •
    ABSTRACT: Biological oxidation of methane to methanol by aerobic bacteria is catalysed by two different enzymes, the cytoplasmic or soluble methane monooxygenase (sMMO) and the membrane-bound or particulate methane monooxygenase (pMMO). Expression of MMOs is controlled by a 'copper-switch', i.e. sMMO is only expressed at very low copper : biomass ratios, while pMMO expression increases as this ratio increases. Methanotrophs synthesize a chalkophore, methanobactin, for the binding and import of copper. Previous work suggested that methanobactin was formed from a polypeptide precursor. Here we report that deletion of the gene suspected to encode for this precursor, mbnA, in Methylosinus trichosporium OB3b, abolishes methanobactin production. Further, gene expression assays indicate that methanobactin, together with another polypeptide of previously unknown function, MmoD, play key roles in regulating expression of MMOs. Based on these data, we propose a general model explaining how expression of the MMO operons is regulated by copper, methanobactin and MmoD. The basis of the 'copper-switch' is MmoD, and methanobactin amplifies the magnitude of the switch. Bioinformatic analysis of bacterial genomes indicates that the production of methanobactin-like compounds is not confined to methanotrophs, suggesting that its use as a metal-binding agent and/or role in gene regulation may be widespread in nature.
    Environmental Microbiology 04/2013; DOI:10.1111/1462-2920.12150 · 6.20 Impact Factor
  •
    ABSTRACT: Computational prediction of protein function is frequently error-prone and incomplete. In Mycobacterium tuberculosis (Mtb), ∼25% of all genes have no predicted function and are annotated as hypothetical proteins, severely limiting our understanding of Mtb pathogenicity. Here, we utilize a high-throughput quantitative activity-based protein profiling (ABPP) platform to probe, annotate, and validate ATP-binding proteins in Mtb. We experimentally validate prior in silico predictions of >240 proteins and identify 72 hypothetical proteins as ATP binders. ATP interacts with proteins with diverse and unrelated sequences, providing an expanded view of adenosine nucleotide binding in Mtb. Several hypothetical ATP binders are essential or taxonomically limited, suggesting specialized functions in mycobacterial physiology and pathogenicity.
    Chemistry & biology 01/2013; 20(1):123-33. DOI:10.1016/j.chembiol.2012.11.008 · 6.65 Impact Factor
  •
    ABSTRACT: TIGRFAMs, available online at http://www.jcvi.org/tigrfams, is a database of protein family definitions. Each entry features a seed alignment of trusted representative sequences, a hidden Markov model (HMM) built from that alignment, cutoff scores that let automated annotation pipelines decide which proteins are members, and annotations for transfer onto member proteins. Most TIGRFAMs models are designated equivalog, meaning they assign a specific name to proteins conserved in function from a common ancestral sequence. Models describing more functionally heterogeneous families are designated subfamily or domain, and assign less specific but more widely applicable annotations. The Genome Properties database, available at http://www.jcvi.org/genome-properties, specifies how computed evidence, including TIGRFAMs HMM results, should be used to judge whether an enzymatic pathway, a protein complex or another type of molecular subsystem is encoded in a genome. TIGRFAMs and Genome Properties content are developed in concert because subsystems reconstruction for large numbers of genomes guides selection of seed alignment sequences and cutoff values during protein family construction. Both databases specialize heavily in bacterial and archaeal subsystems. At present, 4284 models appear in TIGRFAMs, while 628 systems are described by Genome Properties. Content derives both from subsystem discovery work and from biocuration of the scientific literature.
    Nucleic Acids Research 11/2012; 41(Database issue). DOI:10.1093/nar/gks1234 · 9.11 Impact Factor
  •
    ABSTRACT: Covering: 1988 to 2012. This review presents recommended nomenclature for the biosynthesis of ribosomally synthesized and post-translationally modified peptides (RiPPs), a rapidly growing class of natural products. The current knowledge regarding the biosynthesis of the >20 distinct compound classes is also reviewed, and commonalities are discussed.
    Natural Product Reports 11/2012; 30(1). DOI:10.1039/c2np20085f · 10.11 Impact Factor
  • Daniel H Haft · Andrey Tovchigrechko
    Nature Methods 06/2012; 9(8):793-4. DOI:10.1038/nmeth.2080 · 32.07 Impact Factor
  •
    ABSTRACT: Biofilms are dense microbial communities. Although widely distributed and medically important, how biofilm cells interact with one another is poorly understood. Recently, we described a novel process whereby myxobacterial biofilm cells exchange their outer membrane (OM) lipoproteins. For the first time we report here the identification of two host proteins, TraAB, required for transfer. These proteins are predicted to localize in the cell envelope; and TraA encodes a distant PA14 lectin-like domain, a cysteine-rich tandem repeat region, and a putative C-terminal protein sorting tag named MYXO-CTERM, while TraB encodes an OmpA-like domain. Importantly, TraAB are required in donors and recipients, suggesting bidirectional transfer. By use of a lipophilic fluorescent dye, we also discovered that OM lipids are exchanged. Similar to lipoproteins, dye transfer requires TraAB function, gliding motility and a structured biofilm. Importantly, OM exchange was found to regulate swarming and development behaviors, suggesting a new role in cell-cell communication. A working model proposes TraA is a cell surface receptor that mediates cell-cell adhesion for OM fusion, in which lipoproteins/lipids are transferred by lateral diffusion. We further hypothesize that cell contact-dependent exchange helps myxobacteria to coordinate their social behaviors.
    PLoS Genetics 04/2012; 8(4):e1002626. DOI:10.1371/journal.pgen.1002626 · 7.53 Impact Factor
  •
    ABSTRACT: As the deluge of genomic DNA sequence grows the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories with the ability to sequence genomes in a high-throughput manner grows, the informatics capability of those labs to accurately identify and annotate all genes within a genome may often be lacking. These issues have led to fears about transitive annotation errors making sequence databases less reliable. During the lifetime of the Pfam protein families database a number of protein families have been built, which were later identified as composed solely of spurious open reading frames (ORFs) either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that these may perform a useful function to identify new spurious ORFs. We have collected these families together in AntiFam along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as a part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
    Database The Journal of Biological Databases and Curation 01/2012; 2012:bas003. DOI:10.1093/database/bas003 · 3.37 Impact Factor
  • Daniel H Haft · Samuel H Payne · Jeremy D Selengut
    ABSTRACT: Multiple new prokaryotic C-terminal protein-sorting signals were found that reprise the tripartite architecture shared by LPXTG and PEP-CTERM: motif, TM helix, basic cluster. Defining hidden Markov models were constructed for all. PGF-CTERM occurs in 29 archaeal species, some of which have more than 50 proteins that share the domain. PGF-CTERM proteins include the major cell surface protein in Halobacterium, a glycoprotein with a partially characterized diphytanylglyceryl phosphate linkage near its C terminus. Comparative genomics identifies a distant exosortase homolog, designated archaeosortase A (ArtA), as the likely protein-processing enzyme for PGF-CTERM. Proteomics suggests that the PGF-CTERM region is removed. Additional systems include VPXXXP-CTERM/archaeosortase B in two of the same archaea and PEF-CTERM/archaeosortase C in four others. Bacterial exosortases often fall into subfamilies that partner with very different cohorts of extracellular polymeric substance biosynthesis proteins; several species have multiple systems. Variant systems include the VPDSG-CTERM/exosortase C system unique to certain members of the phylum Verrucomicrobia, VPLPA-CTERM/exosortase D in several alpha- and deltaproteobacterial species, and a dedicated (single-target) VPEID-CTERM/exosortase E system in alphaproteobacteria. Exosortase-related families XrtF in the class Flavobacteria and XrtG in Gram-positive bacteria mark distinctive conserved gene neighborhoods. A picture emerges of an ancient and now well-differentiated superfamily of deeply membrane-embedded protein-processing enzymes. Their target proteins are destined to transit cellular membranes during their biosynthesis, during which most undergo additional posttranslational modifications such as glycosylation.
    Journal of bacteriology 01/2012; 194(1):36-48. DOI:10.1128/JB.06026-11 · 2.81 Impact Factor
  • Daniel H Haft · Neha Varghese
    ABSTRACT: The rhomboid family of serine proteases occurs in all domains of life. Its members contain at least six hydrophobic membrane-spanning helices, with an active site serine located deep within the hydrophobic interior of the plasma membrane. The model member GlpG from Escherichia coli is heavily studied through engineered mutant forms, varied model substrates, and multiple X-ray crystal studies, yet its relationship to endogenous substrates is not well understood. Here we describe an apparent membrane anchoring C-terminal homology domain that appears in numerous genera including Shewanella, Vibrio, Acinetobacter, and Ralstonia, but excluding Escherichia and Haemophilus. Individual genomes encode up to thirteen members, usually homologous to each other only in this C-terminal region. The domain's tripartite architecture consists of motif, transmembrane helix, and cluster of basic residues at the protein C-terminus, as also seen with the LPXTG recognition sequence for sortase A and the PEP-CTERM recognition sequence for exosortase. Partial Phylogenetic Profiling identifies a distinctive rhomboid-like protease subfamily almost perfectly co-distributed with this recognition sequence. This protease subfamily and its putative target domain are hereby renamed rhombosortase and GlyGly-CTERM, respectively. The protease and target are encoded by consecutive genes in most genomes with just a single target, but far apart otherwise. The signature motif of the Rhombo-CTERM domain, often SGGS, only partially resembles known cleavage sites of rhomboid protease family model substrates. Some protein families that have several members with C-terminal GlyGly-CTERM domains also have additional members with LPXTG or PEP-CTERM domains instead, suggesting there may be common themes to the post-translational processing of these proteins by three different membrane protein superfamilies.
    PLoS ONE 12/2011; 6(12):e28886. DOI:10.1371/journal.pone.0028886 · 3.23 Impact Factor
  •
    ABSTRACT: CharProtDB (http://www.jcvi.org/charprotdb/) is a curated database of biochemically characterized proteins. It provides a source of direct rather than transitive assignments of function, designed to support automated annotation pipelines. The initial data set in CharProtDB was collected through manual literature curation over the years by analysts at the J. Craig Venter Institute (JCVI) [formerly The Institute of Genomic Research (TIGR)] as part of their prokaryotic genome sequencing projects. The CharProtDB has been expanded by import of selected records from publicly available protein collections whose biocuration indicated direct rather than homology-based assignment of function. Annotations in CharProtDB include gene name, symbol and various controlled vocabulary terms, including Gene Ontology terms, Enzyme Commission number and TransportDB accession. Each annotation is referenced with the source; ideally a journal reference, or, if imported and lacking one, the original database source.
    Nucleic Acids Research 12/2011; 40(Database issue):D237-41. DOI:10.1093/nar/gkr1133 · 9.11 Impact Factor

Publication Stats

14k Citations
731.58 Total Impact Points


  • 2008–2015
    • J. Craig Venter Institute
      • Informatics
      Rockville, Maryland, United States
  • 2012
    • Pacific Northwest National Laboratory
      • Biological Sciences Division
      Richland, Washington, United States
  • 2011
    • National Institutes of Health
      • National Center for Biotechnology Information
      Bethesda, MD, United States
  • 2010
    • The University of Warwick
      • Biological Sciences
      Coventry, England, United Kingdom
  • 2009
    • University of Melbourne
      • Department of Microbiology and Immunology
      Melbourne, Victoria, Australia
  • 2007
    • University of Maryland, College Park
      • Department of Cell Biology & Molecular Genetics
      CGS, Maryland, United States
  • 1999–2006
    • Biomedical Research Institute, Rockville
      Maryland, United States
  • 2004
    • George Washington University
      Washington, D.C., United States
  • 2003
    • The Forsyth Institute
      Cambridge, Massachusetts, United States
    • University of Oxford
      Oxford, England, United Kingdom
  • 2002
    • Wellcome Trust Sanger Institute
      Cambridge, England, United Kingdom
    • EMBL-EBI
      Cambridge, England, United Kingdom