BIOINFORMATICS APPLICATIONS NOTE
Vol. 26 no. 19 2010, pages 2493–2495
Databases and ontologies
proTF: a comprehensive data and phylogenomics resource for
prokaryotic transcription factors
Jie Bai1,2,†, Junrong Wang3,†, Feng Xue4,†, Jingsong Li1, Lijing Bu1, Junming Hu4,
Gang Xu1, Qiyu Bao1, Guoping Zhao2, Xiaoming Ding2, Jie Yan4,∗ and Jinyu Wu1,∗
1Institute of Genomic Medicine/Zhejiang Provincial Key Laboratory of Medical Genetics, Wenzhou Medical College,
Wenzhou 325035, 2Department of Microbiology and Microbial Engineering, School of Life Sciences, Fudan
University, Shanghai 200433, 3Maternal and Child Health Hospital of Wenling, Wenling 317500 and 4Department of
Medical Microbiology and Parasitology, College of Medicine, Zhejiang University, Hangzhou 310058, China
Associate Editor: Jonathan Wren
Advance Access publication July 27, 2010
Summary: Investigation of transcription factors (TFs) is of extreme
significance for gleaning more information about the mechanisms
underlying the dynamic transcriptional regulatory network. Herein,
proTF is constructed to serve as a comprehensive data resource
and phylogenomics analysis platform for prokaryotic TFs. It has
many prominent characteristics: (i) detailed annotation information,
including basic sequence features, domain organization, sequence
homolog and sequence composition, was extensively collected, and
then visually displayed for each TF entry in all prokaryotic genomes;
(ii) workset was employed as the basic frame to provide an efficient
way to organize the retrieved data and save intermediate records;
and (iii) a number of phylogenomics analysis tools were
implemented to investigate the evolutionary roles of specific TFs. In
conclusion, proTF is dedicated to prokaryotic TFs and integrates
multiple functions, and should become a valuable resource for studying
prokaryotic transcriptional regulatory networks in the post-genomic era.
Contact: firstname.lastname@example.org; email@example.com
Received on January 20, 2010; revised on June 22, 2010; accepted
on July 22, 2010
∗To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first three authors
should be regarded as joint First Authors.

INTRODUCTION
Classification and phylogenomics analysis of
transcription factors (TFs) within or among specific organisms
can help us to highlight how they have been evolutionarily conserved or
diversified in order to fit the ever-changing environment effectively
(Rodionov, 2007). In the past decade, a number of specialized TF
databases were established. The TRANSFAC and DBD databases
from nine eukaryotic species (Hermoso et al., 2004; Wilson et al.,
2008; Wingender, 2008). Within the plant kingdom, progressively
integrated and comprehensively annotated biological databases have
been constructed, such as PlantTFDB (Guo et al., 2008), DPTF
(Zhu et al., 2007), RARTF (Iida et al., 2005), PlanTAPDB (Richardt
et al., 2007), TOBFAC (Rushton et al., 2008), DATFAP (Fredslund,
2008), GRASSIUS (Yilmaz et al., 2009) and PlnTFDB (Perez-
Rodriguez et al., 2009). Various information of TFs in animal are
(Zheng et al., 2008). In addition, fungal TFs can be accessed from
FTFD (Park et al., 2008). Up to now, however, no comprehensive
computational repository has been available to provide access
to the complete sets of prokaryotic TFs. RegTransBase is a
database of regulatory interactions in prokaryotes, which captures
the latest version (Kazakov et al., 2007). ooTFD is a database aimed
at capturing information regarding the polypeptide interactions,
and TF entries (Ghosh, 2000). ExtraTrain provides integrated and
easily manageable information for 679816 extragenic regions and
for the genes delimiting each of them. As for TFs, it contains only 16 TF
families (Pareja et al., 2006). In addition, the currently developed
and the BacTregulators contains three TF families (TetR, AraC and
IclR) in bacteria and archaea (Martinez-Bueno et al., 2004; Tobes
et al., 1998), TRACTOR_DB (Perez et al., 2007), cTFbase (Wu
et al., 2007) and ArchaeaTF (Wu et al., 2008) only aim to focus on
cyanobacteria and archaea, respectively.
Herein, a new TF database, proTF, is constructed to provide an
integrated resource for TF research and to facilitate further
investigation of transcriptional regulatory networks in prokaryotes.
In comparison with other comprehensive TF databases, proTF
has the following prominent characteristics: (i) it offers
extensively detailed annotation information for each TF entry in
all the completely sequenced prokaryotic genomes; (ii) it employs
the workset as the basic frame to organize the retrieved data
and save intermediate records; and (iii) it implements a number of
phylogenomics analysis tools to investigate the evolutionary roles
of specific TFs within or across different prokaryotic organisms. In
conclusion, proTF is dedicated to prokaryotic TFs and integrates
multiple phylogenomics functions.
© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: firstname.lastname@example.org
IDENTIFICATION AND ANNOTATION OF TFs
To identify the complete putative set of TFs in a given prokaryotic
genome, our previously well-established analysis pipelines in
cTFbase (Wu et al., 2007) and ArchaeaTF (Wu et al., 2008)
were applied to all the fully sequenced proteomes of prokaryotic
species available from KEGG. In brief, we started with the
collection of a set of well-characterized/putative TFs from the Swiss-
Prot/TrEMBL databases (release 15.12) and a number of HMM
(version 24.0) and SUPERFAMILY (release 1.73). A combination
of BLAST-based and HMM-based searches was adopted to obtain
significant hits using the BLASTP and hmmpfam programs with
E-value cutoffs of 1e-10 and 0.01, respectively. All the identified TFs
Once a putative TF was identified and classified, it was
extensively annotated using a number of bioinformatics tools
and databases. In particular, the molecular weight and isoelectric
point of a given TF were calculated using the pepstats program
implemented in the EMBOSS package (http://www.ebi.ac.uk/emboss).
The InterProScan program (http://www.ebi.ac.uk/Tools/InterProScan/)
was used to search its domain architectures. The
Gene Ontology terms were obtained from the InterProScan results
interpro2go). Sequence similarity alignment was performed using
the BLAST program against several major databases, including
PDB (http://www.pdb.org/), UniProt (http://www.uniprot.org/),
KEGG (http://www.genome.jp/kegg/genes.html), Swiss-Prot
(http://www.expasy.ch/sprot/) and RefSeq (http://www.ncbi.nlm.nih.gov/
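The two E-value cutoffs used in the identification step can be illustrated with a minimal Python sketch. It assumes the search results have already been parsed into (protein ID, E-value) pairs; the `Hit` class, the function names and the decision to keep a protein supported by either method (a union) are assumptions made for illustration, since the text does not state how the BLASTP and hmmpfam results were merged.

```python
from dataclasses import dataclass

# Cutoffs quoted in the text; everything else in this sketch is illustrative.
BLASTP_EVALUE_CUTOFF = 1e-10
HMMPFAM_EVALUE_CUTOFF = 0.01


@dataclass(frozen=True)
class Hit:
    protein_id: str  # hypothetical locus tag of the query protein
    evalue: float    # E-value reported by the search program


def significant_ids(hits, cutoff):
    """Return IDs of proteins with at least one hit at or below the cutoff."""
    return {hit.protein_id for hit in hits if hit.evalue <= cutoff}


def candidate_tfs(blast_hits, hmm_hits):
    """Map each candidate TF to the set of methods supporting it.

    Assumption: a protein is kept if either method supports it (union).
    """
    by_blast = significant_ids(blast_hits, BLASTP_EVALUE_CUTOFF)
    by_hmm = significant_ids(hmm_hits, HMMPFAM_EVALUE_CUTOFF)
    support = {}
    for protein_id in sorted(by_blast | by_hmm):
        methods = set()
        if protein_id in by_blast:
            methods.add("blastp")
        if protein_id in by_hmm:
            methods.add("hmmpfam")
        support[protein_id] = methods
    return support
```

Recording which method supports each candidate keeps the stricter intersection available as a simple filter on the returned mapping, should that be the preferred rule.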
proTF is a relational database hosted on an Apache HTTP server
running on the Linux operating system. Data are stored in separate MySQL
tables and retrieved using Structured Query Language (SQL).
PHP is used to connect to the database and to dynamically
generate the user-friendly HTML front end. The web
interface is organized in an operating system-independent way
and has been tested to work properly in Internet Explorer 7.0,
Firefox 2/3 and Opera 10.00 browsers.
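As a rough illustration of how such a relational layout might be queried, the following Python sketch uses the standard-library sqlite3 module as a stand-in for MySQL. The table names, column names and demo rows are hypothetical and are not taken from the proTF schema, which is not described in the text.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE species (species_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE family (family_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE tf (
            locus_tag  TEXT PRIMARY KEY,
            species_id INTEGER NOT NULL REFERENCES species(species_id),
            family_id  INTEGER NOT NULL REFERENCES family(family_id)
        );
        """
    )
    # Made-up demo rows; the locus tags are placeholders, not real entries.
    conn.executemany(
        "INSERT INTO species VALUES (?, ?)",
        [(1, "Escherichia coli K-12"), (2, "Bacillus subtilis 168")],
    )
    conn.executemany("INSERT INTO family VALUES (?, ?)", [(1, "TetR"), (2, "AraC")])
    conn.executemany(
        "INSERT INTO tf VALUES (?, ?, ?)",
        [("demo_0001", 1, 1), ("demo_0002", 1, 2), ("demo_0003", 2, 1)],
    )
    return conn


def tfs_in_family(conn, family_name):
    """Return (locus_tag, species name) pairs for one TF family."""
    query = (
        "SELECT tf.locus_tag, species.name FROM tf "
        "JOIN species ON species.species_id = tf.species_id "
        "JOIN family ON family.family_id = tf.family_id "
        "WHERE family.name = ? ORDER BY tf.locus_tag"
    )
    # The family name is passed as a bound parameter, never spliced into SQL.
    return conn.execute(query, (family_name,)).fetchall()
```

Keeping species and families in their own tables means a family page and a species page can both be produced from the same `tf` table with a single join each.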
proTF presents a user-friendly web interface for researchers
to store and interrogate all the putative TF entries. Users can
easily access the data by clicking a specific TF family or
species in the browse page. In the search page, a multi-layered
query system allows users to retrieve data based on
hierarchical keywords. The search can be performed via locus tag,
family or species. Expressions in separate fields can be combined
with the logical operators AND, OR or NOT. The list of registered
species is also arranged hierarchically according to taxonomy
to allow users to easily access a TF entry. In the BLAST-based
search page, the BLAST program
(http://indra.mullins.microbiol.washington.edu/blast/viroblast.php)
is implemented to identify homologs of the query sequences among
the TFs stored in proTF. Either the full-length sequences or the
DNA-binding domain (DBD) regions of the TFs
can be used as the database for a BLAST search. In addition,
a number of advanced parameters (such as E-value, matrix and
species) are provided to allow users to perform more specific
BLAST searches. In the results page, basic information on each TF
entry matching the query is listed in a table, in which gene IDs
and family IDs are linked to the detailed annotation of the gene and
family. By clicking on an entry, detailed annotation information
is displayed, including basic sequence features, Gene Ontology
terms, gene domain organization and sequence homolog to other
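
The combination of field expressions with AND, OR and NOT described for the search page can be sketched in Python as composable predicates. The `TFEntry` fields mirror the three searchable fields named in the text (locus tag, family and species); the helper names and the case-insensitive substring matching are assumptions made for illustration, not a description of the proTF implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TFEntry:
    locus_tag: str
    family: str
    species: str


def field_contains(field, keyword):
    """Build a predicate: does the named field contain the keyword?"""
    needle = keyword.lower()
    return lambda entry: needle in getattr(entry, field).lower()


def and_(left, right):
    return lambda entry: left(entry) and right(entry)


def or_(left, right):
    return lambda entry: left(entry) or right(entry)


def not_(predicate):
    return lambda entry: not predicate(entry)


def search(entries, predicate):
    """Return the locus tags of all entries accepted by the predicate."""
    return [entry.locus_tag for entry in entries if predicate(entry)]
```

A query such as "family is TetR AND NOT species contains Bacillus" then becomes `and_(field_contains("family", "TetR"), not_(field_contains("species", "Bacillus")))`.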
WORKSET AND PHYLOGENOMIC ANALYSIS
The workset, incorporated into proTF, is an important function
for keeping TF genes and families well organized and for
conducting a succession of phylogenomics analyses to investigate
the evolutionary relationships of a specific TF family. Using the workset,
users can append, remove and configure any retrieved results for
further data manipulation, comparative genomics and molecular
available to customize all the data in the workset through appending
by its corresponding ID to avoid rehandling the retrieved results.
Another prominent feature of proTF is that it can also serve as
a comparative genomics and molecular evolution analysis platform
for prokaryotic TFs. A number of phylogenomics analysis tools are
implemented to allow users to investigate a particular TF within
one prokaryotic genome or a set of TFs across different genomes,
as well as the TF items stored in the workset. In particular, the
ClustalW (http://www.ebi.ac.uk/clustalw/) and MUSCLE
(http://www.drive5.com/muscle/) programs are employed to perform
multiple sequence alignment of the whole TF sequences or just
the DBD sequences of TFs at the amino acid or DNA level. The multiple
sequence alignment result is displayed graphically using the
Jalview program (http://www.jalview.org/). The QuickTree program
(http://www.sanger.ac.uk/resources/software/quicktree/) helps users
to investigate the evolutionary relationships of TF items stored in the
workset. The reliability of the phylogenetic tree can be evaluated
by the bootstrap method (100 replications by default), and the
tree is visualized using the ATV program (http://www.phylosoft.org/
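
The resampling step behind the bootstrap evaluation mentioned above can be sketched in Python as follows. This is only the generic column-resampling procedure under stated assumptions: the alignment is held as a dictionary of equal-length aligned strings, the function name and the `seed` parameter are hypothetical, and tree building on each replicate (done by QuickTree in proTF) is not reproduced here.

```python
import random


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Yield pseudo-alignments made by resampling columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    n_replicates: number of replicates (100 matches the default in the text).
    seed: seed for the random generator, so that runs are repeatable.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    rng = random.Random(seed)
    for _ in range(n_replicates):
        # Draw as many column indices as the alignment has columns.
        columns = [rng.randrange(n_columns) for _ in range(n_columns)]
        # Apply the same column draw to every sequence to keep columns intact.
        yield {
            name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()
        }
```

A tree would then be built from each replicate, and the support for a branch is the fraction of replicate trees in which that branch appears.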
Currently, proTF provides a complete, centralized list of putative
prokaryotic TFs. It contains 127838 TFs from
841 prokaryotic organisms. In future, more prokaryotic TFs from
sequenced organisms will be added to the platform to extend its
functionality. These existing entries will be updated to keep up with
the platform will provide a wealth of information and more robust
and reliable support for the scientific community to decipher and
gain the complete picture of the genetic regulatory networks.
Funding: National Natural Science Foundation of China
(30800643); National Science and Technology Key Program
for Infectious Diseases of China (2008ZX10004-015).
Conflict of Interest: none declared.

REFERENCES
Fredslund,J. (2008) DATFAP: a database of primers and homology alignments for
transcription factors from 13 plant species. BMC Genomics, 9, 140.
Fulton,D.L. et al. (2009) TFCat: the curated catalog of mouse and human transcription
factors. Genome Biol., 10, R29.
Ghosh,D. (2000) Object-oriented transcription factors database (ooTFD). Nucleic Acids
Res., 28, 308–310.
Guo,A.Y. et al. (2008) PlantTFDB: a comprehensive plant transcription factor database.
Nucleic Acids Res., 36, D966–D969.
Hermoso,A. et al. (2004) TrSDB: a proteome database of transcription factors. Nucleic
Acids Res., 32, D171–D173.
Huerta,A.M. et al. (1998) RegulonDB: a database on transcriptional regulation in
Escherichia coli. Nucleic Acids Res., 26, 55–59.
Kazakov,A.E. et al. (2007) RegTransBase—a database of regulatory sequences and
interactions in a wide range of prokaryotic genomes. Nucleic Acids Res., 35,
Iida,K. et al. (2005) RARTF: database and tools for complete sets of Arabidopsis
transcription factors. DNA Res., 12, 247–256.
database. Biochem. Biophys. Res. Commun., 322, 787–793.
Martinez-Bueno,M. et al. (2004) BacTregulators: a database of transcriptional
regulators in bacteria and archaea. Bioinformatics, 20, 2787–2791.
Pareja,E. et al. (2006) ExtraTrain: a database of Extragenic regions and Transcriptional
information in prokaryotic organisms. BMC Microbiol., 6, 29.
Park,J. et al. (2008) FTFD: an informatics pipeline supporting phylogenomic analysis
of fungal transcription factors. Bioinformatics, 24, 1024–1025.
Perez,A.G. et al. (2007) Tractor_DB (version 2.0): a database of regulatory interactions
in gamma-proteobacterial genomes. Nucleic Acids Res., 35, D132–D136.
Perez-Rodriguez,P. et al. (2009) PlnTFDB: updated content and new features of the
plant transcription factor database. Nucleic Acids Res., 38, D822–D827.
Pfreundt,U. et al. (2009) FlyTF: improved annotation and enhanced functionality of the
Drosophila transcription factor database. Nucleic Acids Res., 38, D443–D447.
Richardt,S. et al. (2007) PlanTAPDB, a phylogeny-based resource of plant transcription-
associated proteins. Plant Physiol., 143, 1452–1466.
Rodionov,D.A. (2007) Comparative genomic reconstruction of transcriptional
regulatory networks in bacteria. Chem. Rev., 107, 3467–3497.
Rushton,P.J. et al. (2008) TOBFAC: the database of tobacco transcription factors. BMC
Bioinformatics, 9, 53.
Sierro,N. et al. (2008) DBTBS: a database of transcriptional regulation in Bacillus
subtilis containing upstream intergenic conservation information. Nucleic Acids
Res., 36, D93–D96.
Tobes,R. and Ramos,J.L. (2002) AraC-XylS database: a family of positive
transcriptional regulators in bacteria. Nucleic Acids Res., 30, 318–321.
Wilson,D. et al. (2008) DBD—taxonomically broad transcription factor predictions:
new content and functionality. Nucleic Acids Res., 36, D88–D92.
Wingender,E. (2008) The TRANSFAC project as an example of framework technology
that supports the analysis of genomic regulation. Brief. Bioinform., 9, 326–332.
Wu,J. et al. (2008) ArchaeaTF: an integrated database of putative transcription factors
in Archaea. Genomics, 91, 102–107.
Wu,J. et al. (2007) cTFbase: a database for comparative genomics of transcription
factors in cyanobacteria. BMC Genomics, 8, 104.
Yilmaz,A. et al. (2009) GRASSIUS: a platform for comparative regulatory genomics
across the grasses. Plant Physiol., 149, 171–180.
Zhu,Q. et al. (2007) DPTF: a database of poplar transcription factors. Bioinformatics,