ABSTRACT: DBTBS, first released in 1999, is a reference database on transcriptional regulation in Bacillus subtilis, summarizing the experimentally characterized transcription factors, their recognition sequences and the genes they regulate. Since the previous release, the original content was extended by the addition of the data contained in 569 new publications, the total of which now reaches 947. The number of B. subtilis promoters annotated in the database was more than doubled to 1475. In addition, 463 experimentally validated B. subtilis operons and their terminators have been included. Given the increase in the number of fully sequenced bacterial genomes, we decided to extend the usability of DBTBS in comparative regulatory genomics. We therefore created a new section on the conservation of the upstream regulatory sequences between homologous genes in 40 Gram-positive bacterial species, as well as on the presence of overrepresented hexameric motifs that may have regulatory functions. DBTBS can be accessed at: http://dbtbs.hgc.jp.
Nucleic Acids Research 02/2008; 36(Database issue):D93-6. · 8.28 Impact Factor
ABSTRACT: Interspecies sequence comparison is a powerful tool to extract functional or evolutionary information from the genomes of organisms. A number of studies have compared protein sequences or promoter sequences between mammals, which provided many insights into genomics. However, the correlation between protein conservation and promoter conservation remains controversial.
We examined promoter conservation as well as protein conservation for 6,901 human and mouse orthologous genes, and observed a very weak correlation between them. We further investigated their relationship by decomposing it based on functional categories, and identified categories with significant tendencies. Remarkably, the 'ribosome' category showed significantly low promoter conservation, despite its high protein conservation, and the 'extracellular matrix' category showed significantly high promoter conservation, in spite of its low protein conservation.
Our results show the relation of gene function to protein conservation and promoter conservation, and reveal that there seem to be nonparallel components between protein and promoter sequence evolution.
ABSTRACT: Regulation of transcription is controlled by sets of transcription factors binding specific sites in the regulatory regions of genes. It is therefore believed that regulatory regions driving similar expression profiles share some common structural features. We here introduce a computational approach for finding a small set of rules describing the presence and positioning of motifs in a set of promoter sequences. This rule set is subsequently used for finding promoters that drive similar expression profiles from a genomic set of sequences. We applied our approach to muscle-expressed genes in Caenorhabditis elegans. We obtained a high average performance, and in the best case we found that almost 50% of true positive test genes scored higher than 90% of the true negative test genes. High scoring non-training sequences were enriched for muscle-expressed genes, and predicted motifs fitting the rules showed a significant tendency to be present in experimentally verified regulatory regions. Our model is more general than existing cis-regulatory module models, as rules selected by our model contain a variety of information, including not only proximal but also distal positioning of pairs of motifs, positioning with regard to the translation start site, and simply the presence of motifs. We believe our model can help to increase our understanding of transcription factor cooperation and transcription initiation.
Genome informatics. International Conference on Genome Informatics 02/2008; 21:188-99.
ABSTRACT: Here we report the new features and improvements in our latest release of the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/), a comprehensive annotation resource for human genes and transcripts. H-InvDB, originally developed as an integrated database of the human transcriptome based on extensive annotation of large sets of full-length cDNA (FLcDNA) clones, now provides annotation for 120 558 human mRNAs extracted from the International Nucleotide Sequence Databases (INSD), in addition to 54 978 human FLcDNAs, in the latest release H-InvDB_4.6. We mapped those human transcripts onto the human genome sequences (NCBI build 36.1) and determined 34 699 human gene clusters, which could define 34 057 (98.1%) protein-coding and 642 (1.9%) non-protein-coding loci; 858 (2.5%) transcribed loci overlapped with predicted pseudogenes. For all these transcripts and genes, we provide comprehensive annotation including gene structures, gene functions, alternative splicing variants, functional non-protein-coding RNAs, functional domains, predicted subcellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs, co-localization with orphan diseases, gene expression profiles, orthologous genes, protein-protein interactions (PPI) and annotation for gene families. The current H-InvDB annotation resources consist of two main views: Transcript view and Locus view and eight sub-databases: the DiseaseInfo Viewer, H-ANGEL, the Clustering Viewer, G-integra, the TOPO Viewer, Evola, the PPI view and the Gene family/group.
Nucleic Acids Research 02/2008; 36(Database issue):D793-9. · 8.28 Impact Factor
ABSTRACT: WoLF PSORT is an extension of the PSORT II program for protein subcellular location prediction. WoLF PSORT converts protein amino acid sequences into numerical localization features based on sorting signals, amino acid composition and functional motifs such as DNA-binding motifs. After conversion, a simple k-nearest neighbor classifier is used for prediction. Using HTML, the evidence for each prediction is shown in two ways: (i) a list of proteins of known localization with the most similar localization features to the query, and (ii) tables with detailed information about individual localization features. For convenience, sequence alignments of the query to similar proteins and links to UniProt and Gene Ontology are provided. Taken together, this information allows a user to understand the evidence (or lack thereof) behind the predictions made for particular proteins. WoLF PSORT is available at wolfpsort.org.
Nucleic Acids Research 08/2007; 35(Web Server issue):W585-7. · 8.28 Impact Factor
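The k-nearest neighbor step described in the WoLF PSORT abstract above can be illustrated with a minimal sketch. It is not the WoLF PSORT implementation: it uses amino acid composition as the only feature (the real program also uses sorting signals and functional motifs), plain Euclidean distance, and a toy training set. The names `composition`, `knn_predict` and `TRAINING`, the sequences and the labels are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]

def knn_predict(query, labeled, k=3):
    """Predict a label by majority vote among the k nearest labeled sequences.

    Returns the predicted label and the neighbors used, so the evidence
    behind the prediction can be inspected.
    """
    q = composition(query)
    neighbors = sorted(
        labeled, key=lambda item: math.dist(q, composition(item[0]))
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

# Toy training data (hypothetical sequences and labels, for illustration only).
TRAINING = [
    ("KKKRRKKR", "nuclear"),
    ("KRKRKKRR", "nuclear"),
    ("LLLAAVVL", "membrane"),
    ("AVLLIVAL", "membrane"),
]

label, evidence = knn_predict("RKKKRRKA", TRAINING, k=3)
print(label, evidence)
```

Returning the neighbors alongside the label mirrors the idea, stated in the abstract, of showing the user which proteins of known localization are most similar to the query.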
ABSTRACT: We present the second version of Melina, a web-based tool for promoter analysis. Melina II shows potential DNA motifs in promoter regions with a combination of several available programs, Consensus, MEME, Gibbs sampler, MDscan and Weeder, as well as several parameter settings. It allows running a maximum of four programs simultaneously, and comparing their results with graphical representations. In addition, users can build a weight matrix from a predicted motif and apply it to upstream sequences of several typical genomes (human, mouse, S. cerevisiae, E. coli, B. subtilis or A. thaliana) or to public motif databases (JASPAR or DBTBS) in order to find similar motifs. Melina II is a client/server system developed by using Adobe (Macromedia) Flash and is accessible over the web at http://melina.hgc.jp.
Nucleic Acids Research 08/2007; 35(Web Server issue):W227-31. · 8.28 Impact Factor
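Building a weight matrix from a predicted motif and applying it to an upstream sequence, as mentioned in the Melina II abstract above, can be sketched as follows. This is a generic log-odds position weight matrix under stated assumptions (uniform background of 0.25, a pseudocount of 1, sequences containing only A, C, G and T); it is not Melina II's own code, and the function names and example sites are hypothetical.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned motif instances of equal length."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; returns (position, score) pairs."""
    width = len(pwm)
    return [
        (i, sum(pwm[j][sequence[i + j]] for j in range(width)))
        for i in range(len(sequence) - width + 1)
    ]

def best_hit(sequence, pwm):
    """Return the highest-scoring (position, score) pair."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])

# Hypothetical motif instances resembling a -10 box.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
print(best_hit("GGCGTATAATGCC", pwm))
```

In practice a score threshold would be chosen to decide which windows count as matches; that choice is left out of this sketch.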
ABSTRACT: Although recent studies have revealed that the majority of human genes are subject to regulation of alternative promoters, the biological relevance of this phenomenon remains unclear. We have also demonstrated that roughly half of the human RefSeq genes examined contain putative alternative promoters (PAPs). Here we report large-scale comparative studies of PAPs between human and mouse counterpart genes. Detailed sequence comparison of the 17,245 putative promoter regions (PPRs) in 5463 PAP-containing human genes revealed that PPRs in only a minor fraction of genes (807 genes) showed clear evolutionary conservation as one or more pairs. Also, we found that there were substantial qualitative differences between conserved and non-conserved PPRs, with the latter class being AT-rich PPRs of relatively minor usage, enriched in repetitive elements and sometimes producing transcripts that encode small or no proteins. Systematic luciferase assays of these PPRs revealed that both classes of PPRs did have promoter activity, but that their strength ranges were significantly different. Furthermore, we demonstrate that these characteristic features of the non-conserved PPRs are shared with the PPRs of previously discovered putative non-protein coding transcripts. Taken together, our data suggest that there are two distinct classes of promoters in humans, with the latter class of promoters emerging frequently during evolution.
Genome Research 08/2007; 17(7):1005-14. · 14.40 Impact Factor
ABSTRACT: To obtain an overview of promoter activities intrinsic to primary DNA sequences in the human genome within a particular cell type, we carried out systematic quantitative luciferase assays of DNA fragments corresponding to putative promoters for 472 human genes that are expressed in HEK (human embryonic kidney epithelial) 293 cells. We observed that their promoter activities were distributed in a bimodal manner; putative promoters belonging to the first group (with strong promoter activities) were designated as P1 and those in the second group (with weak promoter activities) as P2. The frequencies of the TATA-boxes, the CpG islands, and the overall G + C-contents were significantly different between these two populations, indicating that there are two separate groups of promoters. Interestingly, a similar analysis using 251 randomly isolated genomic DNA fragments showed that P2-type promoters occasionally occur within the human genome. Furthermore, 35 DNA fragments corresponding to putative promoters of non-protein-coding transcripts (ncRNAs) shared similar features with the P2 in both promoter activities and sequence compositions. At least some of the ncRNAs, which have been identified in large numbers by full-length cDNA projects with no functional relevance inferred, may have originated from those sporadic promoter activities of primary DNA sequences inherent to the human genome.
DNA Research 05/2007; 14(2):71-7. · 4.43 Impact Factor
ABSTRACT: A publicly available database of co-expressed gene sets would be a valuable tool for a wide variety of experimental designs, including targeting of genes for functional identification or for regulatory investigation. Here, we report the construction of an Arabidopsis thaliana trans-factor and cis-element prediction database (ATTED-II) that provides co-regulated gene relationships based on co-expressed genes deduced from microarray data and the predicted cis elements. ATTED-II (http://www.atted.bio.titech.ac.jp) includes the following features: (i) lists and networks of co-expressed genes calculated from 58 publicly available experimental series, which are composed of 1388 GeneChip data in A. thaliana; (ii) prediction of cis-regulatory elements in the 200 bp region upstream of the transcription start site to predict co-regulated genes amongst the co-expressed genes; and (iii) visual representation of expression patterns for individual genes. ATTED-II can thus help researchers to clarify the function and regulation of particular genes and gene networks.
Nucleic Acids Research 02/2007; 35(Database issue):D863-9. · 8.28 Impact Factor
ABSTRACT: It is widely recognized that much of the information for determining the final subcellular localization of proteins is found in their amino acid sequences. Thus the prediction of protein localization sites is of both theoretical and practical interest. In most cases, the prediction has been attempted in two ways: one is based on the knowledge of experimentally characterized targeting signals, while the other utilizes the statistical differences of general sequence characteristics, such as amino acid composition, between localization sites. Both approaches have limitations, and it is recommended to check the results of various prediction methods based on different principles as well as training data. Recently, increased proteomic analyses of localization sites have provided new data to assess the current status of predictive methods. In this chapter we discuss these issues and close with an example illustrating the use of the WoLF PSORT web server for localization prediction.
Methods in molecular biology (Clifton, N.J.) 02/2007; 390:429-66.
ABSTRACT: We characterized the DNA methylation status at 144 tissue-biased and 37 non-tissue-biased alternative promoters of 61 human genes in five normal tissues. Analysis of the collected data revealed that (i) DNA methylation status differed greatly among alternative promoters belonging to the same gene; (ii) DNA methylation status differed between tissues for the majority of the individual promoters, and (iii) 80-90% of CpG-island-containing promoters were not methylated on either allele throughout the five tissues examined. Furthermore, although the statistical significance was not as clear as for the above features, we also found that (iv) the DNA methylation patterns of tissue-biased promoters changed more drastically than those of non-tissue-biased promoters; (v) tissue-biased promoters tended to be less methylated than their respective alternative promoters in the tissues where they were preferentially expressed, and (vi) the 'null' methylation pattern of a given promoter was enriched in the tissues where the transcription was most active. These findings together indicate that there are dynamic physiological changes of DNA methylation. DNA methylation appears to play a significant role in differential usage of alternative promoters and may be related to functional diversification between CpG-island-containing promoters and CpG-island-less promoters.
DNA Research 09/2006; 13(4):155-67. · 4.43 Impact Factor
ABSTRACT: By analyzing 1,780,295 5'-end sequences of human full-length cDNAs derived from 164 kinds of oligo-cap cDNA libraries, we identified 269,774 independent positions of transcriptional start sites (TSSs) for 14,628 human RefSeq genes. These TSSs were clustered into 30,964 clusters that were separated from each other by more than 500 bp and thus are very likely to constitute mutually distinct alternative promoters. To our surprise, at least 7674 (52%) human RefSeq genes were subject to regulation by putative alternative promoters (PAPs). On average, there were 3.1 PAPs per gene, with the composition of one CpG-island-containing promoter per 2.6 CpG-less promoters. In 17% of the PAP-containing loci, tissue-specific use of the PAPs was observed. The richest tissue sources of the tissue-specific PAPs were testis and brain. It was also intriguing that the PAP-containing promoters were enriched in the genes encoding signal transduction-related proteins and were rarer in the genes encoding extracellular proteins, possibly reflecting the varied functional requirement for and the restricted expression of those categories of genes, respectively. The patterns of the first exons were highly diverse as well. On average, there were 7.7 different splicing types of first exons per locus partly produced by the PAPs, suggesting that a wide variety of transcripts can be achieved by this mechanism. Our findings suggest that use of alternate promoters and consequent alternative use of first exons should play a pivotal role in generating the complexity required for the highly elaborated molecular systems in humans.
Genome Research 02/2006; 16(1):55-65. · 14.40 Impact Factor
ABSTRACT: The high similarity of tunicates and vertebrates during their development, coupled with the transparency of tunicate larvae, their well-studied cell lineages and the availability of simple and efficient transgenesis methods, makes this subphylum an ideal system for the investigation of vertebrate physiological and developmental processes. Recently, the sequencing of two different Ciona genomes has led to the identification of numerous genes. In order to better understand the regulation of these genes, a database was created containing information on regulation of tunicate genes collected from the literature. It includes, for instance, information regarding the minimal promoter length, the transcription factors involved and their binding sites, as well as the localization of the gene expression. Additionally, binding sites for characterized transcription factors were predicted based on published in vitro recognition sites. Comparison of the promoters of homologous genes in different species is also provided to allow identification of conserved cis elements. At the time of writing, information about 184 promoters, containing 73 identified binding sites and >2000 newly predicted binding sites is available. This database is accessible at http://dbtgr.hgc.jp.
Nucleic Acids Research 01/2006; 34(Database issue):D552-5. · 8.28 Impact Factor
ABSTRACT: DBTSS was first constructed in 2002 based on precise, experimentally determined 5' end clones. Several major updates and additions have been made since the last report. First, the number of human clones has drastically increased, going from 190,964 to 1,359,000. Second, information about potential alternative promoters is presented because the number of 5' end clones is now sufficient to determine several promoters for one gene. Namely, we defined putative promoter groups by clustering transcription start sites (TSSs) separated by <500 bases. A total of 8308 human genes and 4276 mouse genes were found to have putative multiple promoters. Third, DBTSS provides detailed sequence comparisons of user-specified TSSs. Finally, we have added TSS information for zebrafish, malaria and schyzon (a red alga model organism). DBTSS is accessible at http://dbtss.hgc.jp.
Nucleic Acids Research 01/2006; 34(Database issue):D86-9. · 8.28 Impact Factor
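The grouping rule stated in the DBTSS abstract above (TSSs separated by fewer than 500 bases fall into the same putative promoter group) can be sketched as a simple gap-based clustering of sorted positions. This is one plausible reading of the rule, not the DBTSS code; `cluster_tss` and the example coordinates are hypothetical, and strand and chromosome handling are omitted.

```python
def cluster_tss(positions, max_gap=500):
    """Group TSS positions into putative promoter clusters.

    A TSS joins the current cluster when it lies fewer than max_gap bases
    from the previous TSS; otherwise it starts a new cluster.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

# Hypothetical TSS coordinates for one gene on one strand.
print(cluster_tss([100, 350, 820, 1500, 1900, 5000]))
# -> [[100, 350, 820], [1500, 1900], [5000]]
```

Because each TSS is compared only with its nearest upstream neighbor, a cluster can span more than 500 bases in total; a rule based on distance to the cluster start would give different groups.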
ABSTRACT: We present a new program for predicting protein subcellular localization from amino acid sequence. WoLF PSORT is a major update to the PSORTII program, based on new sequence
Proceedings of 4th Asia-Pacific Bioinformatics Conference. 13-16 February 2006, Taipei, Taiwan; 01/2006
ABSTRACT: In prokaryotes, genes belonging to the same operon are transcribed in a single mRNA molecule. Transcription starts as the RNA polymerase binds to the promoter and continues until it reaches a transcriptional terminator. Some terminators rely on the presence of the Rho protein, whereas others function independently of Rho. Such Rho-independent terminators consist of an inverted repeat followed by a stretch of thymine residues, allowing us to predict their presence directly from the DNA sequence. Unlike in Escherichia coli, the Rho protein is dispensable in Bacillus subtilis, suggesting a limited role for Rho-dependent termination in this organism and possibly in other Firmicutes. We analyzed 463 experimentally known terminating sequences in B. subtilis and found a decision rule to distinguish Rho-independent transcriptional terminators from non-terminating sequences. The decision rule allowed us to find the boundaries of operons in B. subtilis with a sensitivity and specificity of about 94%. Using the same decision rule, we found an average sensitivity of 94% for 57 bacteria belonging to the Firmicutes phylum, and a considerably lower sensitivity for other bacteria. Our analysis shows that Rho-independent termination is dominant for Firmicutes in general, and that the properties of the transcriptional terminators are conserved. Terminator prediction can be used to reliably predict the operon structure in these organisms, even in the absence of experimentally known operons. Genome-wide predictions of Rho-independent terminators for the 57 Firmicutes are available in the Supporting Information section.
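The structural description in the abstract above (an inverted repeat followed by a stretch of thymine residues) can be turned into a naive sequence scan. The sketch below is not the decision rule derived in the paper, which the abstract does not spell out. It assumes a perfectly complementary stem of fixed length, a short loop, and a minimum number of T residues in the following window; the function name and every threshold are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return s.translate(COMPLEMENT)[::-1]

def find_terminator_candidates(seq, stem=6, min_loop=3, max_loop=8,
                               tail=8, min_t=5):
    """Return (start, end) spans of perfect hairpins followed by a T-rich tail.

    All thresholds are illustrative defaults, not values from the paper.
    """
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop - tail + 1):
        left = seq[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            right_start = i + stem + loop
            end = right_start + stem
            if end + tail > len(seq):
                break  # longer loops would run past the sequence end
            pairs = seq[right_start:end] == revcomp(left)
            t_rich = seq[end:end + tail].count("T") >= min_t
            if pairs and t_rich:
                hits.append((i, end + tail))
                break  # report the shortest loop for this start only
    return hits

# Hypothetical sequence: stem GCCGCC, loop GAAA, stem GGCGGC, then a T stretch.
example = "CAC" + "GCCGCC" + "GAAA" + "GGCGGC" + "TTTTTTTT" + "AA"
print(find_terminator_candidates(example))
```

Real terminators tolerate mismatches and bulges in the stem, so a practical predictor would score hairpin stability rather than require exact complementarity.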
ABSTRACT: It has been envisaged that CpG islands are often observed near the transcriptional start sites (TSS) of housekeeping genes. However, neither the precise positions of CpG islands relative to TSS of genes nor the correlation between the presence of the CpG islands and the expression specificity of these genes is well-understood. Using thousands of sequences with known TSS in human and mouse, we found that there is a clear peak in the distribution of CpG islands around TSS in the genes of these two species. Thus, we classified human (mouse) genes into 6600 (2948) CpG+ genes and 2619 (1830) CpG- ones, based on the presence of a CpG island within the -100 to +100 region. We estimated the degree of each gene being a housekeeper by the number of cDNA libraries where its ESTs were detected. Then, the tendency that a gene lacking CpG islands around its TSS is expressed with a higher degree of tissue specificity turned out to be evolutionarily conserved. We also confirmed this tendency by analyzing the gene ontology annotation of classified genes. Since no such clear correlation was found in the control data (mRNAs, pre-mRNAs, and chromosome banding pattern), we concluded that the effect of a CpG island near the TSS should be more important than the global GC content of the region where the gene resides.
ABSTRACT: Gene expression profiling of cancer tissues is expected to contribute to our understanding of cancer biology as well as to the development of new methods of diagnosis and therapy. Our collaborative efforts in Japan have been mainly focused on solid tumors such as breast, colorectal and hepatocellular cancers. The expression data are obtained by a high-throughput RT-PCR technique, and patients are recruited mainly from a single hospital. In the cancer gene expression database (CGED), the expression and clinical data are presented in a way useful for scientists interested in specific genes or biological functions. The data can be retrieved either by gene identifiers or by functional categories defined by Gene Ontology terms or the Swiss-Prot annotation. Expression patterns of multiple genes, selected by names or similarity search of the patterns, can be compared. Visual presentation of the data with sorting function enables users to easily recognize relationships between gene expression and clinical parameters. Data for other cancers such as lung and thyroid cancers will be added in the near future. The URL of CGED is http://cged.hgc.jp.
Nucleic Acids Research 02/2005; 33(Database issue):D533-6. · 8.28 Impact Factor
ABSTRACT: Sigma factors, often in conjunction with other transcription factors, regulate gene expression in prokaryotes at the transcriptional level. Specific transcription factors tend to co-occur with specific sigma factors. To predict new members of the transcription factor regulon, we applied Bayes rule to combine the Bayesian probability of sigma factor prediction calculated from microarray data and the sigma factor binding sequence motif, the motif score of the transcription factor associated with the sigma factor, the empirically determined distance between the transcription start site and the cis-regulatory region, and the tendency for specific sigma factors and transcription factors to co-occur. By combining these information sources, we improve the accuracy of predicting regulation by transcription factors, and also confirm the sigma factor prediction. We applied our proposed method to all genes in Bacillus subtilis to find currently unknown gene regulations by transcription factors and sigma factors.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 02/2005;
ABSTRACT: The identification of regulatory elements as over-represented motifs in the promoters of potentially co-regulated genes is an important and challenging problem in computational biology. Although many motif detection programs have been developed so far, they still seem to be immature in practice. In particular, the choice of tunable parameters is often critical to success. Thus knowledge regarding which parameter settings are most appropriate for various types of target motifs is invaluable, but unfortunately has been scarce. In this paper, we report our parameter landscape analysis of two widely-used programs (the Gibbs Sampler (GS) and MEME). Our results show that GS is relatively sensitive to the changes of some parameter values while MEME is more stable. We present recommended parameter settings for GS optimized for four different motif lengths. Thus, running GS four times with these settings should significantly decrease the risk of overlooking subtle