The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK.
Nucleic Acids Research (Impact Factor: 8.81). 02/2007; 35(Database issue):D291-7. DOI: 10.1093/nar/gkl959
Source: PubMed

ABSTRACT We report the latest release (version 3.0) of the CATH protein domain database. There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto approximately 2 million sequences in completed genomes and UniProt.

  • Source
    ABSTRACT: α Helices are a basic unit of protein secondary structure and therefore the interaction between helices is crucial to understanding tertiary and higher-order folds. Comparing subtle variations in the structural and sequence motifs between membrane and soluble proteins sheds light on the different constraints faced by each environment and elucidates the complex puzzle of membrane protein folding. Here, we demonstrate that membrane and water-soluble helix pairs share a small number of similar folds with various interhelical distances. The composition of the residues that pack at the interface between corresponding motifs shows that hydrophobic residues tend to be more enriched in the water-soluble class of structures and small residues in the transmembrane class. The latter group facilitates packing via sidechain- and backbone-mediated hydrogen bonds within the low-dielectric membrane milieu. The helix-helix interactome space, with its associated sequence preferences and accompanying hydrogen-bonding patterns, should be useful for engineering, prediction, and design of protein structure. Copyright © 2015 Elsevier Ltd. All rights reserved.
  • Source
    ABSTRACT: Structural domains are the basic building blocks of proteins and play important roles in protein evolution and function. However, domains still lack a precise definition, and structural domain databases are updated slowly. Automatic algorithms identify domains slowly, while the number of structurally complex proteins keeps rising. Here, we present a method that identifies structural domains by recognizing the compact, modular segments of polypeptide chains, and we compare results on several data sets to illustrate its effect. The method combines a support vector machine (SVM) with the K-means algorithm. It is faster and more stable than most current algorithms and performs better. The results also indicate that when proteins are represented as alpha-carbon atoms in 3D space, it is feasible to identify structural domains from their spatial structural properties. We have developed a web server to help identify structural domains.
    Scientific Reports 12/2014; 4:7476. DOI:10.1038/srep07476 · 5.08 Impact Factor
  • Source
    ABSTRACT: Q4D059 (UniProt accession number) is an 86-residue protein from Trypanosoma cruzi, conserved in the related kinetoplastid parasites Trypanosoma brucei and Leishmania major. These pathogens are the causal agents of the neglected diseases Chagas disease, sleeping sickness and leishmaniasis, respectively, and recently had their genomes sequenced. Q4D059 shows low sequence similarity to mammalian proteins and, because its essentiality has been demonstrated in T. brucei, it is a potential target for anti-parasitic drugs. The 11 hypothetical proteins homologous to Q4D059 are all uncharacterized proteins of unknown function. Here, the solution structure of Q4D059 was solved by NMR and its backbone dynamics were characterized by 15N relaxation parameters. The structure is composed of a parallel/anti-parallel three-stranded β-sheet packed against four helical regions. The structure is well defined by ca. 9 NOEs per residue and a backbone rmsd of 0.50 ± 0.05 Å for the representative ensemble of 15 lowest-energy structures. The structure is overall rigid except for N-terminal residues A9 to D11 at the beginning of β1 and K38 and V39 at the end of helix H3, which show rapid motion on the ps-ns timescale, and G25 (helix H2), I68 (β2) and V78 (loop 3), which undergo internal motion on the μs-ms timescale. Only limited structural similarities to protein structures deposited in the PDB were found; therefore, functional inferences based on protein structure information are not clear. Q4D059 adopts an α/β fold that is slightly similar to the ATPase sub-domain IIB of the heat-shock protein 70 (HSP70) and to the N-terminal domain of the ribosomal protein L11. Copyright © 2015 Elsevier Inc. All rights reserved.
    Journal of Structural Biology 03/2015; DOI:10.1016/j.jsb.2015.02.007 · 3.37 Impact Factor
