Recent improvements to the SMART domain-based sequence annotation resource

Abstract

SMART (Simple Modular Architecture Research Tool, http://smart.embl-heidelberg.de) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users’ documents. A SMART mirror has been created at http://smart.ox.ac.uk.
242–244 Nucleic Acids Research, 2002, Vol. 30, No. 1 © 2002 Oxford University Press
Ivica Letunic, Leo Goodstadt¹, Nicholas J. Dickens¹, Tobias Doerks, Joerg Schultz, Richard Mott², Francesca Ciccarelli, Richard R. Copley, Chris P. Ponting¹ and Peer Bork*
EMBL, Meyerhofstrasse 1, 69012 Heidelberg, Germany, ¹MRC Functional Genetics Unit, Department of Human Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK and ²Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK
Received September 18, 2001; Accepted September 24, 2001
INTRODUCTION
The task of identifying homologous domains by sequence
similarity is often made more difficult by differences in
domain architectures and by substantial divergence in
sequence. As the number of completely sequenced eukaryotic
genomes increases, so does the need for accurate prediction of
domain homologies. Accordingly, SMART (1) has been developed
to identify and annotate protein domains, particularly those in
eukaryotes that are mobile and difficult to detect.
SMART consists of a library of Hidden Markov models
(HMMs) (2). These provide a robust statistical model of amino
acid preferences and insertion/deletion probabilities at each
position in a sequence alignment. The current database covers
more than 600 protein domain families. These are linked to
multiple sequence alignments, embodied within a web-based
domain annotation tool. SMART provides facilities to query
the underlying relational database for proteins with particular
domain combinations (with the option of restricting these to
any taxonomic group) and to alert users to sequences that
contain particular domain combinations as these become newly
available in databases.
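The idea of position-specific amino acid preferences can be illustrated with a deliberately simplified sketch. It is not SMART's or HMMER's implementation: it builds a match-state-only profile from a toy alignment and ignores the insertion/deletion probabilities that a full HMM models. The alignment, the uniform background and the pseudocount are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_log_odds(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds} dict per alignment column.

    Simplification: match states only, uniform background (assumed),
    no insert/delete states as a real profile HMM would have.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the per-position log-odds of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Hypothetical toy alignment (not taken from SMART).
toy_alignment = ["GKSTL", "GKTTL", "GRSTI", "GKSSL"]
profile = column_log_odds(toy_alignment)
family_like = score(profile, "GKSTL")
unrelated = score(profile, "WWWWW")
print(family_like > unrelated)
```

A segment that resembles the aligned family scores higher than an unrelated one, because each column rewards the residues observed at that position.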
IMPROVED DOMAIN COVERAGE
The majority of domain alignments represented in SMART
have been established using standard database searching
methods (3,4). Over the past 2 years, in order to augment the
SMART domain set, we have striven to develop semi-automatic
search methods to identify new and biologically interesting
domains. Of more than 200 domains added, many were identified
in-house by investigating sequence regions that had no
previous domain annotations.
IMPROVED ANNOTATION
Improvements in the annotation of domains, with respect to
human disease and cellular localisation, have been implemented
in the latest version of SMART.
SMART now provides information on known human heritable
genetic disorders arising from missense mutations located
within specified domains. Of the 10 121 missense mutations
annotated in SWISS-PROT (http://ca.expasy.org/sprot/; 5),
many of which are derived from OMIM, 3085 could
be mapped onto 170 different SMART domain types in 335 out
of 734 human disease gene sequences (6).
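As a quick check of the proportions implied by these counts, the following sketch uses only the figures quoted above; the variable names are illustrative.

```python
# Counts quoted in the text above.
mapped_mutations = 3085
total_mutations = 10121
genes_with_mapped_mutations = 335
total_disease_genes = 734

fraction_mutations = mapped_mutations / total_mutations
fraction_genes = genes_with_mapped_mutations / total_disease_genes

print(f"Mutations mapped to SMART domains: {fraction_mutations:.1%}")
print(f"Disease genes with a mapped mutation: {fraction_genes:.1%}")
```

Roughly three in ten annotated missense mutations, and a little under half of the disease gene sequences, fall within the mapped set.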
For each domain family, SMART now provides estimated
probabilities that a domain is part of a secreted, cytoplasmic
or nuclear protein. These probabilities derive from
observed patterns of domain co-occurrence and their correlations
with protein localisations. The method for generating these
probability values and an estimate of its accuracy will be
presented elsewhere.
*To whom correspondence should be addressed. Tel: +49 6221 387 526; Fax: +49 6221 387 519; Email: bork@embl-heidelberg.de
Present address:
Joerg Schultz, Cellzome, Meyerhofstrasse 1, 69012 Heidelberg, Germany
STRUCTURAL CHANGES AND THE SMART DATABASE
The core of SMART is a relational database management
system (RDBMS) (3) powered by PostgreSQL
(http://www.postgresql.org) that stores information on SMART
domains. Each domain’s hit borders, raw bit score and Expect
(E) value are recorded, together with protein accession code,
description and species name.
For each protein in the relational database, intrinsic features
such as transmembrane regions (7), coiled-coils (8), signal
peptides (9) and internal repeats (10) are now included. Users
can now query the RDBMS for proteins containing not only
particular domains, but also specified intrinsic features
(‘TRANS’, transmembrane regions; ‘COIL’, coiled-coils;
‘SIGNAL’, signal peptides). For example, it is possible to
identify receptor tyrosine kinases by searching for proteins that
contain both a tyrosine kinase domain and a predicted
transmembrane region (Fig. 1).
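A query of this kind can be sketched as follows. The table layout, column names and accession codes are hypothetical, since the SMART schema is not described here, and an in-memory SQLite database stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: one row per domain or intrinsic-feature hit.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits ("
    "protein_acc TEXT, feature TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?)",
    [
        ("P_A", "SIGNAL", 1, 24),    # invented example rows
        ("P_A", "TRANS", 640, 662),
        ("P_A", "TyrKc", 710, 980),
        ("P_B", "TyrKc", 250, 500),  # kinase without a TRANS region
        ("P_C", "TRANS", 30, 52),    # TRANS region without a kinase
    ],
)

# Proteins that carry BOTH a TyrKc domain and a TRANS feature.
query = """
SELECT protein_acc
FROM hits
WHERE feature IN ('TyrKc', 'TRANS')
GROUP BY protein_acc
HAVING COUNT(DISTINCT feature) = 2
ORDER BY protein_acc
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)  # ['P_A']
```

Grouping by protein and counting distinct features keeps the query short when more features are required; only the list and the count need to change.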
For the latest release of SMART, two new analytical
methods have been employed. TMHMM2 is now being used to
predict transmembrane sequences, since this method
demonstrates 97–98% accuracy for transmembrane prediction (7).
Internal sequence repeats are detected using Prospero (10),
with a significance threshold probability of 10⁻⁴, after first
filtering the sequence for low complexity and coiled-coil
regions.
IMPROVED WEB INTERFACE
SMART provides a World Wide Web-based interface to its
underlying relational database and HMMER-based search
engine (3). In response to rapidly increasing demand, we have
taken steps to dramatically improve the efficiency and
response times of our server. Underlying code has been
modified to use persistent database connections. Many speed
optimisations have also been made, providing users with a
much faster and more productive environment.
Schematic representations of proteins are now generated
dynamically and displayed as a single PNG (Portable Network
Graphics) image. This enables easy ‘copy–paste’ inclusion of
SMART output in users’ publications. SMART multiple
sequence alignments may now be coloured by consensus using
CHROMA (11). This highlights patterns of residue conservation,
which can assist in clarifying questions of homology, and can
draw attention to functional positions such as binding and
active sites.
SMART database querying capabilities have recently been
greatly extended, allowing users to build more complex
queries of the underlying relational database using SQL
commands. The latest release of SMART also introduces
options to retrieve FASTA-formatted sequences of domains or
proteins that have been viewed using Architecture SMART.
This makes it easier for users to generate full alignments of all
occurrences of a particular domain in a given species.
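Outside the web interface, cutting domain sequences out of full-length proteins and writing them in FASTA format could be sketched as below. This is not SMART's code: the record layout, the header format and the example sequence are assumptions, and hit borders are assumed to be 1-based and inclusive.

```python
def domain_fasta(records, width=60):
    """Format domain subsequences as FASTA text.

    Each record is (accession, domain, start, end, sequence); start and
    end are assumed to be 1-based, inclusive hit borders.
    """
    lines = []
    for acc, domain, start, end, sequence in records:
        subsequence = sequence[start - 1:end]
        lines.append(f">{acc}|{domain}|{start}-{end}")
        # Wrap sequence lines at a fixed width, as is conventional in FASTA.
        lines.extend(
            subsequence[i:i + width] for i in range(0, len(subsequence), width)
        )
    return "\n".join(lines) + "\n"


# Invented example record (not a real SMART entry).
records = [("P_A", "TyrKc", 3, 8, "MKTAYIAKQRQISFVK")]
fasta_text = domain_fasta(records)
print(fasta_text, end="")
```

The resulting text can be passed to any multiple alignment program that accepts FASTA input.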
APPLICATION OF SMART
Apart from its use as a web tool, SMART has been applied to
large-scale annotation projects such as the annotation of the
human genome draft sequence (12,13), the investigation of
single domain families in model organisms (14), the study of
sequence conservation in multiple alignments (15) and, in
conjunction with genomic data, for the study of conservation
of gene (i.e. intron/exon) structure (16). SMART will continue
to be a valuable resource for large-scale sequence analysis
studies.
SMART has also been incorporated into other domain and
protein family resources that are used for the primary annotation
of sequence databases. It is a component database of InterPro
(17), which contributes to the annotation of SWISS-PROT
sequences (5), and of the Conserved Domain Database (CDD)
(http://web.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml), which
contributes to the annotation of RefSeq sequences (18).
Figure 1. Using intrinsic features in Architecture SMART queries. The SMART database was queried for all proteins containing a tyrosine kinase domain and a
transmembrane region (TyrKc and TRANS). 387 proteins were found, including the five displayed here. Note that the text colour of domain names has been
designed to correlate both with its subcellular localisation (blue, secreted; black, intracellular) and its catalytic activity (red, catalytic activity).
CONCLUSIONS
Over the past years, SMART has developed and matured into
an important and widely used biological web tool characterised
by stability and fast response times. Our main goal has been to
continue to provide improvements and feature expansion
together with the highest quality of data. We are committed to
maintaining, improving and extending SMART to accommodate the
rising needs of genome and proteome annotation and analysis.
REFERENCES
1. Schultz,J., Milpetz,F., Bork,P. and Ponting,C.P. (1998) SMART, a simple
modular architecture research tool: identification of signaling domains.
Proc. Natl Acad. Sci. USA, 95, 5857–5864.
2. Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological
Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids.
Cambridge University Press, Cambridge, UK.
3. Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000)
SMART: a web-based tool for the study of genetically mobile domains.
Nucleic Acids Res., 28, 231–234.
4. Ponting,C.P., Schultz,J., Copley,R.R., Andrade,M.A. and Bork,P. (2000)
Evolution of domain families. Adv. Protein Chem., 54, 185–244.
5. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence
database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
6. Goodstadt,L. and Ponting,C.P. (2001) Sequence variation and disease in
the wake of the draft human genome. Hum. Mol. Genet., 10, 2209–2214.
7. Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E.L. (2001)
Predicting transmembrane protein topology with a hidden Markov model:
application to complete genomes. J. Mol. Biol., 305, 567–580.
8. Lupas,A., Van Dyke,M. and Stock,J. (1991) Predicting coiled coils from
protein sequences. Science, 252, 1162–1164.
9. von Heijne,G. (1987) Sequence Analysis in Molecular Biology: Treasure
Trove or Trivial Pursuit. Academic Press, San Diego, CA, 429–436.
10. Mott,R. (2000) Accurate formula for P-values of gapped local sequence
and profile alignments. J. Mol. Biol., 300, 649–659.
11. Goodstadt,L. and Ponting,C.P. (2001) CHROMA: consensus-based colouring
of multiple alignments for publication. Bioinformatics, 17, 845–846.
12. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C.,
Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001)
Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
13. Venter,J.C., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J., Sutton,G.G.,
Smith,H.O., Yandell,M., Evans,C.A., Holt,R.A. et al. (2001) The
sequence of the human genome. Science, 291, 1304–1351.
14. Hill,E., Broadbent,I.D., Chothia,C. and Pettitt,J. (2001) Cadherin
superfamily proteins in Caenorhabditis elegans and Drosophila
melanogaster. J. Mol. Biol., 305, 1011–1024.
15. Pei,J. and Grishin,N.V. (2001) AL2CO: calculation of positional conservation
in a protein sequence alignment. Bioinformatics, 17, 700–712.
16. Betts,M.J., Guigo,R., Agarwal,P. and Russell,R.B. (2001) Exon structure
conservation despite low sequence similarity: a relic of dramatic events in
evolution? EMBO J., 20, 5354–5360.
17. Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., Birney,E.,
Biswas,M., Bucher,P., Cerutti,L., Corpet,F., Croning,M.D. et al. (2001)
The InterPro database, an integrated documentation resource for protein
families, domains and functional sites. Nucleic Acids Res., 29, 37–40.
18. Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI
gene-centered resources. Nucleic Acids Res., 29, 137–140.
... We retrieved 97 known B protein sequences (Table 1) using proteins from A. thaliana (AT3G54340 and AT5G20240) and Oryza sativa (OsMADS2, 4, and 16) as query sequences [1,3,6,[18][19][20] in a Basic Local Alignment Search Tool (BLAST) search [20]. Subsequently, the retrieved sequences were entered into Simple Modular Architecture Research Tool (SMART) to confirm they have MADS-box domains [22]. Sequence alignment of the MADS domains was displayed in Additional file 2: Fig. S1. ...
... embl-heide lberg. de/) was used to validate the presence of the MADS domains in the proteins encoded by the target genes [22]. ...
Article
Full-text available
Background MADS-box transcription factors function as homo- or heterodimers and regulate many aspects of plant development; moreover, MADS-box genes have undergone extensive duplication and divergence. For example, the morphological diversity of floral organs is closely related to the functional divergence of the MADS-box gene family. B-class genes (such as Arabidopsis thaliana APETALA3 [ AP3 ] and PISTILLATA [ PI ]) belong to a subgroup of MADS-box genes. Here, we collected 97 MADS-box B protein sequences from 21 seed plant species and examined their motifs to better understand the functional evolution of B proteins. Results We used the MEME tool to identify conserved sequence motifs in these B proteins; unique motif arrangements and sequences were identified in these B proteins. The keratin-like domains of Malus domestica and Populus trichocarpa B proteins differed from those in other angiosperms, suggesting that a novel regulatory network might have evolved in these species. The MADS domains of Nelumbo nucifera , Glycine max , and Amborella trichopoda B-proteins contained motif 9; in contrast, those of other plants contained motif 1. Protein modelling analyses revealed that MADS domains with motif 9 may lack amino acid sites required for DNA-binding. These results suggested that the three species might share an alternative mechanism controlling floral development. Conclusions Amborella trichopoda has B proteins with either motif 1 or motif 9 MADS domains, suggesting that these two types of MADS domains evolved from the ancestral domain into two groups, those with motif 9 ( N. nucifera and G. max ), and those with motif 1. Moreover, our results suggest that the homodimer/heterodimer intermediate transition structure first appeared in A. trichopoda . Therefore, our systematic analysis of the motifs in B proteins sheds light on the evolution of these important transcription factors.
... We compared the results of the two methods to confirm TIFY candidate genes in these species. These candidate genes were identified in their domains with SMART 4 and CDD 5 to ensure that the TIFY domain was in sequence (Letunic et al., 2002;Marchler-Bauer et al., 2002). Finally, the ExPASy 6 ProtParam tool was used to query the physical and chemical properties of the GmTIFYs (Appel et al., 1994;Wang et al., 2020a). ...
Article
Full-text available
TIFY proteins play crucial roles in plant abiotic and biotic stress responses. Our transcriptome data revealed several TIFY family genes with significantly upregulated expression under drought, salt, and ABA treatments. However, the functions of the GmTIFY family genes are still unknown in abiotic stresses. We identified 38 GmTIFY genes and found that TIFY10 homologous genes have the most duplication events, higher selection pressure, and more obvious response to abiotic stresses compared with other homologous genes. Expression pattern analysis showed that GmTIFY10e and GmTIFY10g genes were significantly induced by salt stress. Under salt stress, GmTIFY10e and GmTIFY10g transgenic Arabidopsis plants showed higher root lengths and fresh weights and had significantly better growth than the wild type (WT). In addition, overexpression of GmTIFY10e and GmTIFY10g genes in soybean improved salt tolerance by increasing the PRO, POD, and CAT contents and decreasing the MDA content; on the contrary, RNA interference plants showed sensitivity to salt stress. Overexpression of GmTIFY10e and GmTIFY10g in Arabidopsis and soybean could improve the salt tolerance of plants, while the RNAi of GmTIFY10e and GmTIFY10g significantly increased sensitivity to salt stress in soybean. Further analysis demonstrated that GmTIFY10e and GmTIFY10g genes changed the expression levels of genes related to the ABA signal pathway, including GmSnRK2, GmPP2C, GmMYC2, GmCAT1, and GmPOD. This study provides a basis for comprehensive analysis of the role of soybean TIFY genes in stress response in the future.
... To further confirm, the amino acid sequences of these homologs were searched in HMMER (Hmmer, RRID:SCR_005305) for affirmation based on the hits returned (Potter et al., 2018). Once the homologs were identified, they were further analyzed for domain organization by SMART (SMART, RRID:SCR_005026) (Letunic et al., 2002). After manual evaluation of the domain organization, the domain architecture was constructed to scale using DOG 2.0 software (Ren et al., 2009). ...
Article
Full-text available
The Hippo signaling pathway has been shown to be involved in regulating cellular identity, cell/tissue size maintenance and mechanotransduction. The Hippo pathway consists of a kinase cascade which determines the nucleo-cytoplasmic localization of YAP in the cell. YAP is the effector protein in the Hippo pathway, which acts as a transcriptional cofactor for TEAD. Phosphorylation of YAP upon activation of the Hippo pathway prevents it from entering the nucleus and abrogates its function in the transcription of the target genes. In Cnidaria, the information on the regulatory roles of the Hippo pathway is virtually lacking. Here, we report the existence of a complete set of Hippo pathway core components in Hydra for the first time. By studying their phylogeny and domain organization, we report evolutionary conservation of the components of the Hippo pathway. Protein modelling suggested the conservation of YAP-TEAD interaction in Hydra. Further, we characterized the expression pattern of the homologs of yap, hippo, mob and sav in Hydra using whole-mount RNA in situ hybridization and report their possible role in stem cell maintenance. Immunofluorescence assay revealed that Hvul_YAP expressing cells occur in clusters in the body column and are excluded in the terminally differentiated regions. Actively proliferating cells marked by Ki67 exhibit YAP colocalization in their nuclei. Strikingly, a subset of these colocalized cells is actively recruited to the newly developing bud. Disruption of the YAP-TEAD interaction increased the budding rate indicating a critical role of YAP in regulating cell proliferation in Hydra. Collectively, we posit that the Hippo pathway is an essential signaling system in Hydra; its components are ubiquitously expressed in the Hydra body column and play a crucial role in Hydra tissue homeostasis.
... A comparison of the BnMicEmUP amino acid sequence against a protein sequence database identified a region of the BnMicEmUP proteins sequence that is similar to the predicted Domain of Unknown Function 1118 (DUF1118; Figure 3). Further searches for motifs in BnMicEmUP3 were performed using the domain alignments in PFAM (Bateman et al., 2002), SMART (Letunic et al., 2002), PRINTS (Attwood et al., 2002), and PROSITE databases. These analyses showed that BnMicEmUP contains a putative chloroplast transit peptide (cTP) in the N-terminus of the BnMicEmUP protein. ...
Article
Full-text available
Microspores of Brassica napus can be diverted from normal pollen development into embryogenesis by treating them with a mild heat shock. As microspore embryogenesis closely resembles zygotic embryogenesis, it is used as model for studying the molecular mechanisms controlling embryo formation. A previous study comparing the transcriptomes of three-day-old sorted embryogenic and pollen-like (non-embryogenic) microspores identified a gene homologous to AT1G74730 of unknown function that was upregulated 8-fold in the embryogenic cells. In the current study, the gene was isolated and sequenced from B. napus and named BnMicEmUP ( B. napus microspore embryogenesis upregulated gene). Four forms of BnMicEmUP mRNA and three forms of genomic DNA were identified. BnMicEmUP2,3 was upregulated more than 7-fold by day 3 in embryogenic microspore cultures compared to non-induced cultures. BnMicEmUP1,4 was highly expressed in leaves. Transient expression studies of BnMicEmUP3::GFP fusion protein in Nicotiana benthamiana and in stable Arabidopsis transgenics showed that it accumulates in chloroplasts. The features of the BnMicEmUP protein, which include a chloroplast targeting region, a basic region, and a large region containing 11 complete leucine-rich repeats, suggest that it is similar to a bZIP PEND (plastid envelope DNA-binding protein) protein, a DNA binding protein found in the inner envelope membrane of developing chloroplasts. Here, we report that the BnMicEmUP3 overexpression in Arabidopsis increases the sensitivity of seedlings to exogenous abscisic acid (ABA). The BnMicEmUP proteins appear to be transcription factors that are localized in plastids and are involved in plant responses to biotic and abiotic environmental stresses; as well as the results obtained from this study can be used to improve crop yield.
... Further search algorithms used were there of SMART (www.smart.embl-heidelberg.de/) [9], Protein Families (PFAM, www.sanger.ac.uk/) [10], and ProSite protein family signatures (www.expasy.ch/) [11] databanks. ...
Article
The Crp-Fnr regulators, named after the first two identified members, are DNA-binding proteins which predominantly function as positive transcription factors, though roles of repressors are also important. Among over 1200 proteins with an N-terminally located nucleotide-binding domain similar to the cyclic adenosine monophosphate (cAMP) receptor protein, the distinctive additional trait of the Crp-Fnr superfamily is a C-terminally located helix-turn-helix motif for DNA binding. From a curated database of 369 family members exhibiting both features, we provide a protein tree of Crp-Fnr proteins according to their phylogenetic relationships. This results in the assembly of the regulators ArcR, CooA, CprK, Crp, Dnr, FixK, Flp, Fnr, FnrN, MalR, NnrR, NtcA, PrfA, and YeiL and their homologs in distinct clusters. Lead members and representatives of these groups are described, placing emphasis on the less well-known regulators and target processes. Several more groups consist of sequence-derived proteins of unknown physiological roles; some of them are tight clusters of highly similar members. The Crp-Fnr regulators stand out in responding to a broad spectrum of intracellular and exogenous signals such as cAMP, anoxia, the redox state, oxidative and nitrosative stress, nitric oxide, carbon monoxide, 2-oxoglutarate, or temperature. To accomplish their roles, Crp-Fnr members have intrinsic sensory modules allowing the binding of allosteric effector molecules, or have prosthetic groups for the interaction with the signal. The regulatory adaptability and structural flexibility represented in the Crp-Fnr scaffold has led to the evolution of an important group of physiologically versatile transcription factors.
... Results were downloaded in Newick format for tree representation using the Interactive Tree of Life (iTOL) v3.2.4 (http://itol.embl.de/). SMART (109,110) was used to identify domain architecture and protein motifs with default parameters. The dataset was uploaded for the visualization of protein classification (HK, HHK, GGDEF, EAL, and HD-GYP) and the presence of other domains using iTOL. ...
Preprint
Full-text available
Pseudomonas syringae pv. actinidiae (Psa) is a phytopathogen that causes devastating bacterial canker in kiwifruit. Among five biovars defined by genetic, biochemical and virulence traits, Psa3 is the most aggressive and is responsible for the most recent reported outbreaks, but the molecular basis of its heightened virulence is unclear. We therefore designed the first P. syringae multi-strain whole-genome microarray, encompassing biovars Psa1, Psa2 and Psa3 and the well-established model P. syringae pv. tomato , and analyzed early bacterial responses to an apoplast-like minimal medium. Transcriptomic profiling revealed (i) the strong activation in Psa3 of all hrp / hrc cluster genes, encoding components of the type III secretion system required for bacterial pathogenicity and involved in responses to environmental signals; (ii) potential repression of the hrp / hrc cluster in Psa2; and (iii) activation of flagellum-dependent cell motility and chemotaxis genes in Psa1. The detailed investigation of three gene families encoding upstream regulatory proteins (histidine kinases, their cognate response regulators, and proteins with diguanylate cyclase and/or phosphodiesterase domains) indicated that c-di-GMP may be a key regulator of virulence in Psa biovars. The gene expression data were supported by the quantification of biofilm formation. Our findings suggest that diverse early responses to the host apoplast, even among bacteria belonging to the same pathovar, can lead to different virulence strategies and may explain the differing outcomes of infections. Based on our detailed structural analysis of hrp operons, we also propose a revision of hrp cluster organization and operon regulation in P. syringae. Author summary Pseudomonas syringae pv. actinidiae (Psa) is a bacterial pathogen that infects kiwifruit crops. 
Recent outbreaks have been particularly devastating due to the emergence of a new biovar (Psa3), but the molecular basis of its virulence is unknown so it is difficult to develop mitigation strategies. In this study, we compared the gene expression profiles of Psa3 and various less-virulent biovars in an environment that mimics early infection, to determine the basis of pathogenicity. Genes involved in the assembly and activity of the type III secretion system, which is crucial for the secretion of virulence effectors, were strongly upregulated in Psa3 while lower or not expressed in the other biovars. We also observed the Psa3-specific expression of genes encoding upstream signaling components, confirming that strains of the same bacterial pathovar can respond differently to early contact with their host. Finally, our data suggested a key role in Psa virulence switch ability for the small chemical signaling molecule c-di-GMP, which suppresses the expression of virulence genes. This effect of c-di-GMP levels on Psa3 virulence should be further investigated and confirmed to develop new mitigation methods to target this pathway.
... In one respect, the benchmark does not carry over to just sequence annotation as we used the structure based domain information th a t is not available for all sequences without coordinates. However, domain assignment can still be obtained from databases such as PRODOM (Corpet et al, 2000), SMART (Letunic et al, 2002) and PFAM (Bateman et al, 2002) for many sequences without known structure (see section 1.2.3 for an introduction into domain databases). ...
Thesis
A strategy for protein structure and function based annotation of genomes was developed, evaluated and applied to the proteins of several genomes including the human genome. First the performance of the widely-used homology-based sequence comparison program PSI-BLAST to detect distant homologous relationships (≤20% sequence identity) was evaluated. The benchmark is based on two sets of sequences from the Structural Classification Of Proteins (SCOP) database for which the homologous relationships are known. About 40% of the test proteome can be annotated via remote homologies. Common sources of errors are identified. PSI-BLAST is applied to assign homologues of known structure and function to proteins of M. genitalium and M. tuberculosis. From the benchmark, the number of missed assignments and the potential extent of new structural and functional families was estimated. An automated proteome annotation system was developed to perform large scale annotations based on analyses such as PSI-BLAST. Computationally intensive analyses can be distributed across several computers. The system is based on a relational database serving as a back-end and a software interface as a front-end. Relational storage of results from different analyses permits straightforward evaluation of results and the comparison of annotations across genomes. The above annotation system was applied to fourteen proteomes including the human proteome. The extent and reliability of structural and functional annotation in these proteomes was evaluated and compared. About 40% of the human proteome can be assigned to protein folds. For 77% of the proteome there is some functional information, but only 26% of the proteome can be assigned to the standard sequence motifs that characterise function. There are substantial differences in the composition of membrane proteins between the proteomes in terms of their globular domains. 
Commonly occurring structural superfamilies are identified and compared across the proteomes. The frequencies of these superfamilies leads to the estimate that 98% of the human proteome evolved by domain duplication, with four of the ten most duplicated superfamilies potentially specific for multi-cellular organisms. Occurrence of domains in repeats is more common in metazoa than in single-cellular organisms. Superfamily pairs co-occurring in the same protein sequence were analysed and compared across the proteomes. Structural superfamilies over- and under-represented in human disease genes were identified.
Preprint
Full-text available
The Hippo signaling pathway has been shown to be involved in the regulation of cellular identity, cell/tissue size maintenance and mechanotransduction. The Hippo pathway consists of a kinase cascade which determines the nucleo-cytoplasmic localization of YAP in the cell. YAP is the effector protein in the Hippo pathway which acts as a transcriptional cofactor for TEAD. Phosphorylation of YAP upon activation of the Hippo pathway prevents it from entering the nucleus and hence abrogates its function in transcription of target genes. In Cnidaria, the information on the regulatory roles of the Hippo pathway is virtually lacking. Here, we report for the first time the existence of a complete set of Hippo pathway core components in Hydra. By studying their phylogeny and domain organization, we report evolutionary conservation of the components of the Hippo pathway. Protein modelling suggested conservation of YAP-TEAD interaction in Hydra. We also characterized the expression pattern of the homologs of yap, hippo, mob and sav in Hydra using whole mount RNA in situ hybridization and report their possible role in stem cell maintenance. Immunofluorescence assay revealed that Hvul_YAP expressing cells occur in clusters in the body column and are excluded in the terminally differentiated regions. The YAP expressing cells are recruited early during head regeneration and budding implicating the Hippo pathway in early response to injury or establishment of oral fate. These cells exhibit a non-clustered existence at the site of regeneration and budding, indicating the involvement of a new population of YAP expressing cells during oral fate specification. Collectively, we posit that the Hippo pathway is an important signaling system in Hydra, its components are ubiquitously expressed in the Hydra body column, and may play crucial role in Hydra oral fate specification.
Chapter
This chapter discusses how cereal grain proteins affect bread quality. It reviews cereal protein classification, before summarizing the role of prolamins, soluble proteins, xylanase inhibitors, and detergent-solubilized proteins on bread quality.
Article
Background The circadian clock not only participates in regulating various stages of plant growth, development and metabolism, but also confers environmental adaptability on plants under stress such as drought. Pseudo-Response Regulators (PRRs) are important components of the central oscillator (the core of the circadian clock) and play a significant role in the plant photoperiod pathway. However, no systematic study of this gene family has been performed in cotton. Methods PRR genes were identified in diploid and tetraploid cotton using bioinformatics methods to investigate their homology, duplication and evolutionary relationships. Differential gene expression, KEGG enrichment analysis and qRT-PCR were conducted to analyze PRR gene expression patterns under diurnal changes and their response to drought stress. Results A total of 44 PRR family members were identified in four Gossypium species, with 16 in G. hirsutum, 10 in G. raimondii, and nine each in G. barbadense and G. arboreum. Phylogenetic analysis indicated that PRR proteins were divided into five subfamilies and that whole genome duplication or segmental duplication contributed to the expansion of the Gossypium PRR gene family. Gene structure analysis revealed that members in the same clade are similar, and multiple cis-elements related to light and drought stress response were enriched in the promoters of GhPRR genes. qRT-PCR results showed that GhPRR gene transcripts presented four expression peaks (6 h, 9 h, 12 h, 15 h) during 24 h and formed an obvious rhythmic expression trend. Transcriptome data with PEG treatment, along with qRT-PCR verification, suggested that members of clade III (GhPRR5a, b, d) and clade V (GhPRR3a and GhPRR3c) may be involved in drought response. This study provides an insight into understanding the function of PRR genes in circadian rhythm and in response to drought stress in cotton.
Article
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies—a whole-genome assembly and a regional chromosome assembly—were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ∼12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. 
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
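The coverage figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below uses only the rounded numbers given above; the variable names are illustrative, and how the authors themselves derived the quoted values is not stated here, so small discrepancies from rounding are expected.

```python
# Rounded figures taken from the abstract above (all names are illustrative).
celera_bases = 14.8e9            # total bases in high-quality reads
celera_reads = 27_271_853        # number of high-quality reads
celera_coverage = 5.11           # quoted fold coverage from the shotgun reads
public_shredded_coverage = 2.9   # quoted fold coverage from shredded public data

# Fold coverage = total sequenced bases / genome size, so the genome size
# implied by the quoted coverage is:
implied_genome_size = celera_bases / celera_coverage

# Mean read length implied by the totals.
mean_read_length = celera_bases / celera_reads

# Summing the two sources gives the "eightfold" effective coverage.
combined_coverage = celera_coverage + public_shredded_coverage

print(f"implied genome size: {implied_genome_size / 1e9:.2f} billion bp")
print(f"mean read length:    {mean_read_length:.0f} bp")
print(f"combined coverage:   {combined_coverage:.2f}-fold")
```

The implied genome size comes out close to the 2.91-billion bp consensus length, and the two coverage figures sum to roughly eightfold, which matches the effective coverage stated in the abstract.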
Article
SWISS-PROT (http://www.expasy.ch/) is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to two additional databases; a variety of new documentation files; and improvements to TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT.
Article
Signature databases are vital tools for identifying distant relationships in novel sequences and hence for inferring protein function. InterPro is an integrated documentation resource for protein families, domains and functional sites, which amalgamates the efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. Each InterPro entry includes a functional description, annotation, literature references and links back to the relevant member database(s). Release 2.0 of InterPro (October 2000) contains over 3000 entries, representing families, domains, repeats and sites of post-translational modification encoded by a total of 6804 different regular expressions, profiles, fingerprints and Hidden Markov Models. Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (more than 1 000 000 hits from 462 500 proteins in SWISS-PROT and TrEMBL). The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. Questions can be emailed to interhelp{at}ebi.ac.uk.
Article
Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 more than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.
Conference Paper
Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 more than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline I; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.
Book
This book deals with sequence analysis on the computer. One of its aims is to serve as a brief survey of what one can do with protein and DNA sequences either directly on a microcomputer or by using one of the main sequence/program data banks such as BioNet or the Wisconsin package. Equally important, the book traces the origins of some of the ideas that have come to be embodied in these programs from both biological and methodological points of view: What do the standard sequence analysis algorithms really analyze, and to what degree can we trust their outputs?
Article
The ability to form selective cell-cell adhesions is an essential property of metazoan cells. Members of the cadherin superfamily are important regulators of this process in both vertebrates and invertebrates. With the advent of genome sequencing projects, determination of the full repertoire of cadherins available to an organism is possible and here we present the identification and analysis of the cadherin repertoires in the genomes of Caenorhabditis elegans and Drosophila melanogaster. Hidden Markov models of cadherin domains were matched to the protein sequences obtained from the translation of the predicted gene sequences. Matches were made to 21 C. elegans and 18 D. melanogaster sequences. Experimental and theoretical work on C. elegans sequences, and data from ESTs, show that three pairs of genes, and two triplets, should be merged to form five single genes. It also produced sequence changes at one or both of the 5′ and 3′ termini of half the sequences. In D. melanogaster it is probable that two of the cadherin genes should also be merged together and that three cadherin genes should be merged with other neighbouring genes.
Book
Probabilistic models are becoming increasingly important in analyzing the huge amount of data being produced by large-scale DNA-sequencing efforts such as the Human Genome Project. For example, hidden Markov models are used for analyzing biological sequences, linguistic-grammar-based probabilistic models for identifying RNA secondary structure, and probabilistic evolutionary models for inferring phylogenies of sequences from different organisms. This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally of probabilistic methods of sequence analysis. Written by an interdisciplinary team of authors, it is accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of the other fields, and at the same time presents the state of the art in this new and important field.
Article
Amino acid sequence alignments are widely used in the analysis of protein structure, function and evolutionary relationships. Proteins within a superfamily usually share the same fold and possess related functions. These structural and functional constraints are reflected in the alignment conservation patterns. Positions of functional and/or structural importance tend to be more conserved. Conserved positions are usually clustered in distinct motifs surrounded by sequence segments of low conservation. Poorly conserved regions might also arise from imperfections in multiple alignment algorithms and thus indicate possible alignment errors. Quantification of conservation by attributing a conservation index to each aligned position makes motif detection more convenient. Mapping these conservation indices onto a protein spatial structure helps to visualize spatial conservation features of the molecule and to predict functionally and/or structurally important sites. Analysis of conservation indices could be a useful tool in the detection of potentially misaligned regions and will aid in the improvement of multiple alignments. We developed a program to calculate a conservation index at each position in a multiple sequence alignment using several methods. Namely, amino acid frequencies at each position are estimated and the conservation index is calculated from these frequencies. We utilize both unweighted frequencies and frequencies weighted using two different strategies. Three conceptually different approaches (entropy-based, variance-based and matrix score-based) are implemented in the algorithm to define the conservation index. By calculating conservation indices for 35522 positions in 284 alignments from the SMART database, we demonstrate that different methods result in highly correlated (correlation coefficient more than 0.85) conservation indices. 
Conservation indices show statistically significant correlation between sequentially adjacent positions i and i + j, where j < 13, and averaging of the indices over a window of three positions is optimal for motif detection. Positions with gaps display substantially lower conservation properties. We compare conservation properties of the SMART alignments or FSSP structural alignments to those of the ClustalW alignments. The results suggest that conservation indices should be a valuable tool for alignment quality assessment and might be used as an objective function for refinement of multiple alignments. The C code of the AL2CO program and its pre-compiled versions for several platforms as well as the details of the analysis are freely available at ftp://iole.swmed.edu/pub/al2co/.
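To make the entropy-based approach described in this abstract more concrete, here is a minimal Python sketch using unweighted frequencies. It is not the AL2CO code: the function names are hypothetical, and the normalisation by ln 20, the treatment of gaps (skipped when estimating frequencies) and the shortened window at the alignment ends are all assumptions made for illustration. The window of three positions follows the averaging mentioned in the abstract.

```python
import math
from collections import Counter

def column_conservation(column):
    """Entropy-based conservation of one alignment column, scaled to [0, 1].

    Assumption: gap characters ('-') are ignored when estimating
    frequencies, and entropy is normalised by ln(20).
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # Shannon entropy of the unweighted residue frequencies.
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return 1.0 - entropy / math.log(20)

def conservation_indices(alignment):
    """Return one conservation index per column of an aligned sequence list."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        column_conservation([seq[i] for seq in alignment])
        for i in range(length)
    ]

def smooth(values, window=3):
    """Average each index over a centred window (shortened at the ends)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Toy alignment: column 0 is invariant, column 2 is fully variable.
alignment = ["ACD", "ACE", "AGF", "A-G"]
indices = conservation_indices(alignment)
print([round(x, 3) for x in indices])
print([round(x, 3) for x in smooth(indices)])
```

In this toy alignment the invariant first column scores highest and the fully variable last column lowest. Weighted frequencies and the variance-based and matrix score-based indices named in the abstract would replace `column_conservation` while leaving the rest of the sketch unchanged.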