The word landscape of the non-coding segments of the Arabidopsis thaliana genome

Bioinformatics Laboratory, School of Electrical Engineering and Computer Science, Ohio University, Athens, OH, USA.
BMC Genomics (Impact Factor: 3.99). 10/2009; 10(1):463. DOI: 10.1186/1471-2164-10-463
Source: PubMed


Genome sequences can be conceptualized as arrangements of motifs or words. The frequencies and positional distributions of these words within particular non-coding genomic segments provide important insights into how the words function in processes such as mRNA stability and regulation of gene expression.
Using an enumerative word discovery approach, we investigated the frequencies and positional distributions of all 65,536 different 8-letter words in the genome of Arabidopsis thaliana. Focusing on promoter regions, introns, and 3' and 5' untranslated regions (3'UTRs and 5'UTRs), we compared word frequencies in these segments to genome-wide frequencies. The statistically interesting words in each segment were clustered with similar words to generate motif logos. We investigated whether words were clustered at particular locations or were distributed randomly within each genomic segment, and we classified the words using gene expression information from public repositories. Finally, we investigated whether particular sets of words appeared together more frequently than others.
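The enumeration step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function names `count_words` and `enrichment` are hypothetical, and the simple frequency ratio with a pseudocount is an assumed stand-in for whatever statistical scoring the study actually used.

```python
from collections import Counter
from itertools import product

K = 8  # word length used in the study (4**8 = 65,536 possible words)

def count_words(sequences, k=K):
    """Count every overlapping k-letter word over A/C/G/T in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing N or other symbols
                counts[word] += 1
    return counts

def enrichment(segment_counts, genome_counts, pseudocount=1.0, k=K):
    """Ratio of each word's relative frequency in a segment to its genome-wide
    relative frequency. The ratio and the pseudocount are assumptions made for
    this sketch, not the statistic used in the paper."""
    n_words = 4 ** k
    seg_total = sum(segment_counts.values()) + pseudocount * n_words
    gen_total = sum(genome_counts.values()) + pseudocount * n_words
    ratios = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        seg_freq = (segment_counts.get(word, 0) + pseudocount) / seg_total
        gen_freq = (genome_counts.get(word, 0) + pseudocount) / gen_total
        ratios[word] = seg_freq / gen_freq
    return ratios
```

Calling `enrichment(count_words(promoter_seqs), count_words(genome_seqs))` would give one ratio per word. Ratios well above 1 suggest enrichment in the segment, and words that never occur at all are candidates for the 'unwords' mentioned below.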
Our studies provide a detailed view of the word composition of several segments of the non-coding portion of the Arabidopsis genome. Each segment contains a unique word-based signature. The respective signatures consist of the sets of enriched words, 'unwords', and word pairs within a segment, as well as the preferential locations and functional classifications for the signature words. Additionally, the positional distributions of enriched words within the segments highlight possible functional elements, and the co-associations of words in promoter regions likely represent the formation of higher order regulatory modules. This work is an important step toward fully cataloguing the functional elements of the Arabidopsis genome.
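One simple way to look for word co-associations in promoters is to compare how often two words occur in the same promoter with the number expected if they occurred independently. The sketch below assumes that approach. The function name `cooccurrence` and the independence-based expectation are illustrative choices, not the method reported in the paper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(promoters, words):
    """For each pair of words, return (observed, expected) counts of promoters
    containing both. The expectation assumes the two words occur independently,
    which is a simplification made for this sketch."""
    unique = sorted(set(words))
    n = len(promoters)
    single = Counter()  # number of promoters containing each word
    pair = Counter()    # number of promoters containing both words of a pair
    for seq in promoters:
        seq = seq.upper()
        present = [w for w in unique if w in seq]  # stays in sorted order
        single.update(present)
        pair.update(combinations(present, 2))
    result = {}
    for a, b in combinations(unique, 2):
        expected = single[a] * single[b] / n if n else 0.0
        result[(a, b)] = (pair[(a, b)], expected)
    return result
```

Pairs whose observed count clearly exceeds the expected count would be candidates for the higher-order regulatory modules mentioned above, although a proper significance test would be needed before drawing that conclusion.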


Available from: Matt John Brian Geisler
  • Source
    • "However, even in the case of the longer fragments the promoters may be incomplete because recent reports showed that cis acting elements 58–69 kb upstream of the start codon can influence gene expression in maize [49]. Beyond the insufficient length of the promoter/5'-UTR fragments used, the lack of regulatory elements located in an intron [50] [51] or the 3'-UTR [52] [53] or post-transcriptional mechanisms related to β-glucuronidase accumulation may also explain differences between in situ and GUS patterns. In any case all four isolated upstream fragments provide useful tools for the manipulation of industrially important traits of the maize kernel."
    Dataset: Paper 6

  • Source
    • "When the sequence CACGTGTC is submitted using the ‘in silico expression analysis’ web tool with default settings, the genes harbouring this sequence within their promoter were found to be most strongly upregulated in the microarray expression data set abscisic acid (Figure 3A). This sequence has previously been associated with abscisic acid-responsive genes (2, 9). "
    ABSTRACT: Using bioinformatics, putative cis-regulatory sequences can be easily identified using pattern recognition programs on promoters of specific gene sets. The abundance of predicted cis-sequences is a major challenge to associate these sequences with a possible function in gene expression regulation. To identify a possible function of the predicted cis-sequences, a novel web tool designated ‘in silico expression analysis’ was developed that correlates submitted cis-sequences with gene expression data from Arabidopsis thaliana. The web tool identifies the A. thaliana genes harbouring the sequence in a defined promoter region and compares the expression of these genes with microarray data. The result is a hierarchy of abiotic and biotic stress conditions to which these genes are most likely responsive. When testing the performance of the web tool, known cis-regulatory sequences were submitted to the ‘in silico expression analysis’ resulting in the correct identification of the associated stress conditions. When using a recently identified novel elicitor-responsive sequence, a WT-box (CGACTTTT), the ‘in silico expression analysis’ predicts that genes harbouring this sequence in their promoter are most likely Botrytis cinerea induced. Consistent with this prediction, the strongest induction of a reporter gene harbouring this sequence in the promoter is observed with B. cinerea in transgenic A. thaliana. Database URL:
    Database The Journal of Biological Databases and Curation 01/2014; 2014:bau030. DOI:10.1093/database/bau030 · 3.37 Impact Factor
  • Source
    • "A number of gene prediction and motif discovery algorithms (Sandve and Drablos, 2006) exist that have been fashioned to this end. Motifs have been analyzed only for a single organism, namely, Arabidopsis thaliana (Lichtenberg et al., 2009a). The goal of this study is to enumerate and catalog all possible octamer motifs in different parts of the japonica rice genome as well as in the genome itself."
    ABSTRACT: Among the different areas of molecular biology concerning the detailed study of different parts of the cell, such as genomics, proteomics, and metabolomics, different new areas of study are emerging which entail the analysis of different parts of the genome, such as the prediction of genes or different kinds of transcription factor binding sites (TFBSs). The goal of this study was to construct and analyze a catalogue of all statistically relevant putative functional octamer words or motifs (which we have termed the "motifome" of a given organism) found within first introns, promoters, the 5' and 3' untranslated regions (UTRs), and the entire genome of japonica rice, and compare them to results attained from a previous analysis performed on the Arabidopsis genome. We found a number of novel motifs in different sets of non-coding rice sequence sets. The diversity of motifs in rice was higher than in Arabidopsis, implicating a higher mutation turnover. While common motifs were found between the two species, motif pairs were missing, showing the difference in regulatory machinery between rice and Arabidopsis.
    Omics: a journal of integrative biology 06/2012; 16(6):334-42. DOI:10.1089/omi.2011.0056 · 2.36 Impact Factor