The word landscape of the non-coding segments of the Arabidopsis thaliana genome

Bioinformatics Laboratory, School of Electrical Engineering and Computer Science, Ohio University, Athens, OH, USA.
BMC Genomics (Impact Factor: 3.99). 10/2009; 10(1):463. DOI: 10.1186/1471-2164-10-463
Source: PubMed

ABSTRACT Genome sequences can be conceptualized as arrangements of motifs or words. The frequencies and positional distributions of these words within particular non-coding genomic segments provide important insights into how the words function in processes such as mRNA stability and regulation of gene expression.
Using an enumerative word discovery approach, we investigated the frequencies and positional distributions of all 65,536 different 8-letter words in the genome of Arabidopsis thaliana. Focusing on promoter regions, introns, and 3' and 5' untranslated regions (3'UTRs and 5'UTRs), we compared word frequencies in these segments to genome-wide frequencies. The statistically interesting words in each segment were clustered with similar words to generate motif logos. We investigated whether words were clustered at particular locations or were distributed randomly within each genomic segment, and we classified the words using gene expression information from public repositories. Finally, we investigated whether particular sets of words appeared together more frequently than others.
Our studies provide a detailed view of the word composition of several segments of the non-coding portion of the Arabidopsis genome. Each segment contains a unique word-based signature. The respective signatures consist of the sets of enriched words, 'unwords', and word pairs within a segment, as well as the preferential locations and functional classifications for the signature words. Additionally, the positional distributions of enriched words within the segments highlight possible functional elements, and the co-associations of words in promoter regions likely represent the formation of higher order regulatory modules. This work is an important step toward fully cataloguing the functional elements of the Arabidopsis genome.
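
The enumerative approach described in the abstract, counting every possible 8-letter word and comparing its frequency in a segment with its genome-wide frequency, can be illustrated with a minimal Python sketch. The abstract does not state which statistic the authors used to score words, so the pseudocount-smoothed frequency ratio below is an assumption chosen for illustration, and the function names `count_words` and `enrichment` are hypothetical.

```python
from collections import Counter
from itertools import product

K = 8  # word length used in the abstract; 4**8 = 65,536 possible words


def count_words(sequences, k=K):
    """Count every overlapping k-letter word over A, C, G, T in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words with ambiguous bases such as N
                counts[word] += 1
    return counts


def enrichment(segment_counts, genome_counts, pseudocount=1.0, k=K):
    """Ratio of each word's relative frequency in a segment to its genome-wide
    relative frequency. The pseudocount is an illustrative assumption that
    keeps the ratio defined for words absent from either set."""
    n_words = 4 ** k
    seg_total = sum(segment_counts.values()) + pseudocount * n_words
    gen_total = sum(genome_counts.values()) + pseudocount * n_words
    ratios = {}
    for letters in product("ACGT", repeat=k):  # enumerate all possible words
        word = "".join(letters)
        seg_freq = (segment_counts[word] + pseudocount) / seg_total
        gen_freq = (genome_counts[word] + pseudocount) / gen_total
        ratios[word] = seg_freq / gen_freq
    return ratios
```

In this sketch, words with a ratio well above 1 would be candidates for segment-enriched words, and words with a zero count in a segment would be candidates for 'unwords'. Deciding which words are statistically interesting would still need a significance test, which is not shown here.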


Available from: Matt John Brian Geisler, Sep 27, 2015
  • Source
    • "However, even in the case of the longer fragments the promoters may be incomplete because recent reports showed that cis-acting elements 58–69 kb upstream of the start codon can influence gene expression in maize [49]. Beyond the insufficient length of the promoter/5'-UTR fragments used, the lack of regulatory elements located in an intron [50] [51] or the 3'-UTR [52] [53] or post-transcriptional mechanisms related to β-glucuronidase accumulation may also explain differences between in situ and GUS patterns. In any case all four isolated upstream fragments provide useful tools for the manipulation of industrially important traits of the maize kernel. "
    Dataset: Paper 6
  • Source
    • "When the sequence CACGTGTC is submitted using the ‘in silico expression analysis’ web tool with default settings, the genes harbouring this sequence within their promoter were found to be most strongly upregulated in the microarray expression data set abscisic acid (Figure 3A). This sequence has previously been associated with abscisic acid-responsive genes (2, 9). "
    ABSTRACT: Using bioinformatics, putative cis-regulatory sequences can be easily identified using pattern recognition programs on promoters of specific gene sets. The abundance of predicted cis-sequences is a major challenge to associate these sequences with a possible function in gene expression regulation. To identify a possible function of the predicted cis-sequences, a novel web tool designated ‘in silico expression analysis’ was developed that correlates submitted cis-sequences with gene expression data from Arabidopsis thaliana. The web tool identifies the A. thaliana genes harbouring the sequence in a defined promoter region and compares the expression of these genes with microarray data. The result is a hierarchy of abiotic and biotic stress conditions to which these genes are most likely responsive. When testing the performance of the web tool, known cis-regulatory sequences were submitted to the ‘in silico expression analysis’ resulting in the correct identification of the associated stress conditions. When using a recently identified novel elicitor-responsive sequence, a WT-box (CGACTTTT), the ‘in silico expression analysis’ predicts that genes harbouring this sequence in their promoter are most likely Botrytis cinerea induced. Consistent with this prediction, the strongest induction of a reporter gene harbouring this sequence in the promoter is observed with B. cinerea in transgenic A. thaliana. Database URL:
    Database The Journal of Biological Databases and Curation 01/2014; 2014:bau030. DOI:10.1093/database/bau030 · 3.37 Impact Factor
  • Source
    • "CpG island predictors cannot be used for plants, since a suitable prediction criterion is unavailable (Rombauts et al., 2003) and they are purported to be absent in plant genomes (Yamamoto et al., 2007b). Sequence-based PPPs for plants are either repositories of TFBSs and cis-regulatory elements reported in individual studies, such as PLACE (Higo et al., 1999), Osiris (Morris et al., 2008), and AGRIS (Davuluri et al., 2003), or in silico analysis of overrepresented k-mers at promoters (Molina and Grotewold, 2005; Yamamoto et al., 2007a; Lichtenberg et al., 2009). EP3 (Abeel et al., 2008a) is the only PPP available currently that predicts extended promoter regions in plant genomes. "
    ABSTRACT: The cis-regulatory regions on DNA serve as binding sites for proteins such as transcription factors and RNA polymerase. The combinatorial interaction of these proteins plays a crucial role in transcription initiation, which is an important point of control in the regulation of gene expression. We present here an analysis of the performance of an in silico method for predicting cis-regulatory regions in the plant genomes of Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) on the basis of free energy of DNA melting. For protein-coding genes, we achieve recall and precision of 96% and 42% for Arabidopsis and 97% and 31% for rice, respectively. For noncoding RNA genes, the program gives recall and precision of 94% and 75% for Arabidopsis and 95% and 90% for rice, respectively. Moreover, 96% of the false-positive predictions were located in noncoding regions of primary transcripts, out of which 20% were found in the first intron alone, indicating possible regulatory roles. The predictions for orthologous genes from the two genomes showed a good correlation with respect to prediction scores and promoter organization. Comparison of our results with an existing program for promoter prediction in plant genomes indicates that our method shows improved prediction capability.
    Plant physiology 05/2011; 156(3):1300-15. DOI:10.1104/pp.110.167809 · 6.84 Impact Factor