ABSTRACT: When annotating protein sequences with the footprints of evolutionarily conserved domains, conservative score or E-value thresholds need to be applied to RPS-BLAST hits to avoid many false positives. We note that manual inspection and classification of hits gathered at a higher threshold can add a significant amount of valuable domain annotation. We report an automated algorithm that 'rescues' valuable borderline-scoring domain hits that are well-supported by domain architecture (DA, the sequential order of conserved domains in a protein query), including tandem repeats of domain hits reported at a more conservative threshold. This algorithm is now available as a selectable option on the public conserved domain search (CD-Search) pages. We also report on the possibility of 'suppressing' domain hits close to the threshold based on a lack of well-supported DA, and of implementing this conservatively as an option in live conserved domain searches and for pre-computed results. Improving domain annotation consistency will in turn reduce the fraction of NR sequences with incomplete DAs. URL: http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi.
Published by Oxford University Press 2015. This work is written by US Government employees and is in the public domain in the US.
Database: The Journal of Biological Databases and Curation 03/2015; 2015. DOI:10.1093/database/bav012 · 3.37 Impact Factor
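The 'rescue' idea in the abstract above can be illustrated with a minimal Python sketch. It is not the published algorithm: the `Hit` record, the `rescue_hits` function, both E-value cutoffs and the set of known architectures are hypothetical, and a repeat is accepted whenever its domain was already reported at the strict cutoff, without checking adjacency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str    # domain model name (hypothetical labels)
    start: int     # start position of the footprint on the query
    evalue: float  # E-value reported for the hit


def rescue_hits(hits, known_architectures, strict=1e-5, relaxed=1e-2):
    """Report confident hits plus borderline hits supported by architecture.

    The cutoffs `strict` and `relaxed` are illustrative assumptions, not
    values taken from the paper.
    """
    confident = [h for h in hits if h.evalue <= strict]
    borderline = [h for h in hits if strict < h.evalue <= relaxed]
    confident_domains = {h.domain for h in confident}

    kept = list(confident)
    # Consider the best-scoring borderline hits first.
    for hit in sorted(borderline, key=lambda h: h.evalue):
        # Simplified repeat rule: the same domain was already reported
        # at the strict cutoff.
        if hit.domain in confident_domains:
            kept.append(hit)
            continue
        # Otherwise keep the hit only if the resulting order of domains
        # matches a known, well-supported architecture.
        candidate = sorted(kept + [hit], key=lambda h: h.start)
        if tuple(h.domain for h in candidate) in known_architectures:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)
```

A borderline hit that completes a known architecture is kept, while one that would produce an unsupported architecture is dropped. A 'suppress' step as mentioned in the abstract would apply the opposite test to hits just inside the strict cutoff.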
ABSTRACT: NCBI's CDD, the Conserved Domain Database, enters its 15th year as a public resource for the annotation of proteins with the location of conserved domain footprints. Going forward, we strive to improve the coverage and consistency of domain annotation provided by CDD. We maintain a live search system as well as an archive of pre-computed domain annotation for sequences tracked in NCBI's Entrez protein database, which can be retrieved for single sequences or in bulk. We also maintain import procedures so that CDD contains domain models and domain definitions provided by several collections available in the public domain, as well as those produced by an in-house curation effort. The curation effort aims at increasing coverage and providing finer-grained classifications of common protein domains, for which a wealth of functional and structural data has become available. CDD curation generates alignment models of representative sequence fragments, which are in agreement with domain boundaries as observed in protein 3D structure, and which model the structurally conserved cores of domain families as well as annotate conserved features. CDD can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by US Government employees and is in the public domain in the US.
Nucleic Acids Research 11/2014; 43(D1). DOI:10.1093/nar/gku1221 · 9.11 Impact Factor
ABSTRACT: CDD, the Conserved Domain Database, is part of NCBI’s Entrez query and retrieval system and is also accessible via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. CDD provides annotation of protein sequences with the location of conserved domain footprints and functional sites inferred from these footprints. Pre-computed annotation is available via Entrez, and interactive search services accept single protein or nucleotide queries, as well as batch submissions of protein query sequences, utilizing RPS-BLAST to rapidly identify putative matches. CDD incorporates several protein domain and full-length protein model collections, and maintains an active curation effort that aims at providing fine-grained classifications for major and well-characterized protein domain families, as supported by available protein three-dimensional (3D) structure and the published literature. To date, the majority of protein 3D structures are represented by models tracked by CDD, and CDD curators are characterizing novel families that emerge from protein structure determination efforts.
ABSTRACT: Close to 60% of protein sequences tracked in comprehensive databases can be mapped to a known three-dimensional (3D) structure by standard sequence similarity searches. Potentially, a great deal can be learned about proteins or protein families of interest from considering 3D structure, and to this day 3D structure data may remain an underutilized resource. Here we present enhancements in the Molecular Modeling Database (MMDB) and its data presentation, specifically pertaining to biologically relevant complexes and molecular interactions. MMDB is tightly integrated with NCBI's Entrez search and retrieval system, and mirrors the contents of the Protein Data Bank. It links protein 3D structure data with sequence data, sequence classification resources and PubChem, a repository of small-molecule chemical structures and their biological activities, facilitating access to 3D structure data not only for structural biologists, but also for molecular biologists and chemists. MMDB provides a complete set of detailed and pre-computed structural alignments obtained with the VAST algorithm, and provides visualization tools for 3D structure and structure/sequence alignment via the molecular graphics viewer Cn3D. MMDB can be accessed at http://www.ncbi.nlm.nih.gov/structure.
ABSTRACT: NCBI’s Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints. CDD includes manually curated domain models that make use of protein 3D structure to refine domain models and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. As CDD also imports domain family models from a variety of external sources, it is a partially redundant collection. To simplify protein annotation, redundant models and models describing homologous families are clustered into superfamilies. By default, domain footprints are annotated with the corresponding superfamily designation, on top of which specific annotation may indicate high-confidence assignment of family membership. Pre-computed domain annotation is available for proteins in the Entrez/Protein dataset, and a novel interface, Batch CD-Search, allows the computation and download of annotation for large sets of protein queries. CDD can be accessed via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
ABSTRACT: NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution. The collection can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Starting with the latest version of CDD, v2.14, information from redundant and homologous domain models is summarized at a superfamily level, and domain annotation on proteins is flagged as either ‘specific’ (identifying molecular function with high confidence) or as ‘non-specific’ (identifying superfamily membership only).
ABSTRACT: The conserved domain database (CDD) is part of NCBI's Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. Entrez's global query interface can be accessed at http://www.ncbi.nlm.nih.gov/Entrez and will search CDD and many other databases. Domain annotation for proteins in Entrez has been pre-computed and is readily available in the form of 'Conserved Domain' links. Novel protein sequences can be scanned against CDD using the CD-Search service; this service searches databases of CDD-derived profile models with protein sequence queries using BLAST heuristics, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Protein query sequences submitted to NCBI's protein BLAST search service are scanned for conserved domain signatures by default. The CDD collection contains models imported from Pfam, SMART and COG, as well as domain models curated at NCBI. NCBI curated models are organized into hierarchies of domains related by common descent. Here we report on the status of the curation effort and present a novel helper application, CDTree, which enables users of the CDD resource to examine curated hierarchies. More importantly, CDD and CDTree, used in concert, serve as a powerful tool in protein classification, as they allow users to analyze protein sequences in the context of domain family hierarchies.
ABSTRACT: Three-dimensional (3D) structure is now known for a large fraction of all protein families. Thus, it has become rather likely that one will find a homolog with known 3D structure when searching a sequence database with an arbitrary query sequence. Depending on the extent of similarity, such neighbor relationships may allow one to infer biological function and to identify functional sites such as binding motifs or catalytic centers. Entrez's 3D-structure database, the Molecular Modeling Database (MMDB), provides easy access to the richness of 3D structure data and its large potential for functional annotation. Entrez's search engine offers several tools to assist biologist users: (i) links between databases, such as between protein sequences and structures, (ii) pre-computed sequence and structure neighbors, (iii) visualization of structure and sequence/structure alignment. Here, we describe an annotation service that combines some of these tools automatically, Entrez's 'Related Structure' links. For all proteins in Entrez, similar sequences with known 3D structure are detected by BLAST and alignments are recorded. The 'Related Structure' service summarizes this information and presents 3D views mapping sequence residues onto all 3D structures available in MMDB (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=structure).