Leila Taher

National Center for Biotechnology Information, Bethesda, MD, USA

Publications (13) · 120.39 Total impact

  • Article: Systematic elucidation and in vivo validation of sequences enriched in hindbrain transcriptional control.
    ABSTRACT: Illuminating the primary sequence encryption of enhancers is central to understanding the regulatory architecture of genomes. We have developed a machine learning approach to decipher motif patterns of hindbrain enhancers and identify 40,000 sequences in the human genome that we predict display regulatory control that includes the hindbrain. Consistent with their roles in hindbrain patterning, MEIS1, NKX6-1, as well as HOX and POU family binding motifs contributed strongly to this enhancer model. Predicted hindbrain enhancers are overrepresented at genes expressed in hindbrain and associated with nervous system development, and primarily reside in the areas of open chromatin. In addition, 77 (0.2%) of these predictions are identified as hindbrain enhancers on the VISTA Enhancer Browser, and 26,000 (60%) overlap enhancer marks (H3K4me1 or H3K27ac). To validate these putative hindbrain enhancers, we selected 55 elements distributed throughout our predictions and six low scoring controls for evaluation in a zebrafish transgenic assay. When assayed in mosaic transgenic embryos, 51/55 elements directed expression in the central nervous system. Furthermore, 30/34 (88%) predicted enhancers analyzed in stable zebrafish transgenic lines directed expression in the larval zebrafish hindbrain. Subsequent analysis of sequence fragments selected based upon motif clustering further confirmed the critical role of the motifs contributing to the classifier. Our results demonstrate the existence of a primary sequence code characteristic to hindbrain enhancers. This code can be accurately extracted using machine-learning approaches and applied successfully for de novo identification of hindbrain enhancers. This study represents a critical step toward the dissection of regulatory control in specific neuronal subtypes.
    Genome Research 07/2012; · 13.61 Impact Factor
  • Article: CLARE: Cracking the LAnguage of Regulatory Elements.
    ABSTRACT: CLARE is a computational method designed to reveal sequence encryption of tissue-specific regulatory elements. Starting with a set of regulatory elements known to be active in a particular tissue/process, it learns the sequence code of the input set and builds a predictive model from features specific to those elements. The resulting model can then be applied to user-supplied genomic regions to identify novel candidate regulatory elements. CLARE's model also provides a detailed analysis of transcription factors that most likely bind to the elements, making it an invaluable tool for understanding mechanisms of tissue-specific gene regulation. AVAILABILITY: CLARE is freely accessible at http://clare.dcode.org/.
    Bioinformatics 12/2011; 28(4):581-3. · 5.47 Impact Factor
  • Article: Genome-wide identification of conserved regulatory function in diverged sequences.
    ABSTRACT: Plasticity of gene regulatory encryption can permit DNA sequence divergence without loss of function. Functional information is preserved through conservation of the composition of transcription factor binding sites (TFBS) in a regulatory element. We have developed a method that can accurately identify pairs of functional noncoding orthologs at evolutionarily diverged loci by searching for conserved TFBS arrangements. With an estimated 5% false-positive rate (FPR) in approximately 3000 human and zebrafish syntenic loci, we detected approximately 300 pairs of diverged elements that are likely to share common ancestry and have similar regulatory activity. By analyzing a pool of experimentally validated human enhancers, we demonstrated that 7/8 (88%) of their predicted functional orthologs retained in vivo regulatory control. Moreover, in 5/7 (71%) of assayed enhancer pairs, we observed concordant expression patterns. We argue that TFBS composition is often necessary to retain and sufficient to predict regulatory function in the absence of overt sequence conservation, revealing an entire class of functionally conserved, evolutionarily diverged regulatory elements that we term "covert."
    Genome Research 06/2011; 21(7):1139-49. · 13.61 Impact Factor
  • Article: Genome-wide CTCF distribution in vertebrates defines equivalent sites that aid the identification of disease-associated genes.
    ABSTRACT: Many genomic alterations associated with human diseases localize in noncoding regulatory elements located far from the promoters they regulate, making it challenging to link noncoding mutations or risk-associated variants with target genes. The range of action of a given set of enhancers is thought to be defined by insulator elements bound by the 11 zinc-finger nuclear factor CCCTC-binding protein (CTCF). Here we analyzed the genomic distribution of CTCF in various human, mouse and chicken cell types, demonstrating the existence of evolutionarily conserved CTCF-bound sites beyond mammals. These sites preferentially flank transcription factor-encoding genes, often associated with human diseases, and function as enhancer blockers in vivo, suggesting that they act as evolutionarily invariant gene boundaries. We then applied this concept to predict and functionally demonstrate that the polymorphic variants associated with multiple sclerosis located within the EVI5 gene impinge on the adjacent gene GFI1.
    Nature Structural & Molecular Biology 06/2011; 18(6):708-14. · 12.71 Impact Factor
  • Article: Effects of HMGN variants on the cellular transcription profile.
    ABSTRACT: High mobility group N (HMGN) is a family of intrinsically disordered nuclear proteins that bind to nucleosomes, alter the structure of chromatin and affect transcription. A major unresolved question is the extent of functional specificity, or redundancy, between the various members of the HMGN protein family. Here, we analyze the transcriptional profile of cells in which the expression of various HMGN proteins has been either deleted or doubled. We find that both up- and downregulation of HMGN expression altered the cellular transcription profile. Most, but not all, of the changes were variant specific, suggesting limited redundancy in transcriptional regulation. Analysis of point and swap HMGN mutants revealed that the transcriptional specificity is determined by a unique combination of a functional nucleosome-binding domain and C-terminal domain. Doubling the amount of HMGN had a significantly larger effect on the transcription profile than total deletion, suggesting that the intrinsically disordered structure of HMGN proteins plays an important role in their function. The results reveal an HMGN-variant-specific effect on the fidelity of the cellular transcription profile, indicating that functionally the various HMGN subtypes are not fully redundant.
    Nucleic Acids Research 02/2011; 39(10):4076-87. · 8.03 Impact Factor
  • Article: Genome-wide CTCF distribution in vertebrates defines equivalent sites that aid the identification of disease-associated genes.
    Nature Structural & Molecular Biology 01/2011; 18(9):1084. · 12.71 Impact Factor
  • Article: Global gene expression analysis of murine limb development.
    ABSTRACT: Detailed information about stage-specific changes in gene expression is crucial for understanding the gene regulatory networks underlying development and the various signal transduction pathways contributing to morphogenesis. Here we describe the global gene expression dynamics during early murine limb development, when cartilage, tendons, muscle, joints, vasculature and nerves are specified and the musculoskeletal system of limbs is established. We used whole-genome microarrays to identify genes with differential expression at 5 stages of limb development (E9.5 to 13.5), during fore- and hind-limb patterning. We found that the onset of limb formation is characterized by an up-regulation of transcription factors, which is followed by a massive activation of genes during E10.5 and E11.5 which levels off at later time points. Among the 3520 genes identified as significantly up-regulated in the limb, we find ~30% to be novel, dramatically expanding the repertoire of candidate genes likely to function in the limb. Hierarchical and stage-specific clustering identified expression profiles that are likely to correlate with functional programs during limb development and further characterization of these transcripts will provide new insights into specific tissue patterning processes. Here, we provide for the first time a comprehensive analysis of developmentally regulated genes during murine limb development, and provide some novel insights into the expression dynamics governing limb morphogenesis.
    PLoS ONE 01/2011; 6(12):e28358. · 4.09 Impact Factor
  • Article: The genome of the Western clawed frog Xenopus tropicalis.
    ABSTRACT: The western clawed frog Xenopus tropicalis is an important model for vertebrate development that combines experimental advantages of the African clawed frog Xenopus laevis with more tractable genetics. Here we present a draft genome sequence assembly of X. tropicalis. This genome encodes more than 20,000 protein-coding genes, including orthologs of at least 1700 human disease genes. Over 1 million expressed sequence tags validated the annotation. More than one-third of the genome consists of transposable elements, with unusually prevalent DNA transposons. Like that of other tetrapods, the genome of X. tropicalis contains gene deserts enriched for conserved noncoding elements. The genome exhibits substantial shared synteny with human and chicken over major parts of large chromosomes, broken by lineage-specific chromosome fusions and fissions, mainly in the mammalian lineage.
    Science 04/2010; 328(5978):633-6. · 31.20 Impact Factor
  • Article: Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements.
    Leila Taher, Ivan Ovcharenko
    ABSTRACT: MOTIVATION: Several functional gene annotation databases have been developed in recent years, and are widely used to infer the biological function of gene sets by scrutinizing the attributes that appear over- and underrepresented. However, this strategy is not directly applicable to the study of non-coding DNA, as the non-coding sequence span varies greatly among different gene loci in the human genome and longer loci have a higher likelihood of being selected purely by chance. Therefore, conclusions involving the function of non-coding elements that are drawn based on the annotation of neighboring genes are often biased. We assessed the systematic bias in several particular Gene Ontology (GO) categories using the standard hypergeometric test, by randomly sampling non-coding elements from the human genome and inferring their function based on the functional annotation of the closest genes. While no category is expected to occur significantly over- or underrepresented for a random selection of elements, categories such as 'cell adhesion', 'nervous system development' and 'transcription factor activities' appeared to be systematically overrepresented, while others, such as 'olfactory receptor activity', appeared underrepresented. RESULTS: Our results suggest that functional inference for non-coding elements using gene annotation databases requires a special correction. We introduce a set of correction coefficients for the probabilities of the GO categories that accounts for the variability in the length of the non-coding DNA across different loci and effectively eliminates the ascertainment bias from the functional characterization of non-coding elements. Our approach can be easily generalized to any other gene annotation database.
    Bioinformatics 02/2009; 25(5):578-84. · 5.47 Impact Factor
  • Article: On splice site prediction using weight array models: a comparison of smoothing techniques
    ABSTRACT: In most eukaryotic genes, protein-coding exons are separated by non-coding introns which are removed from the primary transcript by a process called "splicing". The positions where introns are cut and exons are spliced together are called "splice sites". Thus, computational prediction of splice sites is crucial for gene finding in eukaryotes. Weight array models are a powerful probabilistic approach to splice site detection. Parameters for these models are usually derived from m-tuple frequencies in trusted training data and subsequently smoothed to avoid zero probabilities. In this study we compare three different ways of parameter estimation for m-tuple frequencies, namely (a) non-smoothed probability estimation, (b) standard pseudo counts and (c) a Gaussian smoothing procedure that we recently developed.
    Journal of Physics Conference Series 12/2007; 90(1):012004.
  • Article: AGenDA: gene prediction by cross-species sequence comparison.
    ABSTRACT: Automatic gene prediction is one of the major challenges in computational sequence analysis. Traditional approaches to gene finding rely on statistical models derived from previously known genes. By contrast, a new class of comparative methods relies on comparing genomic sequences from evolutionary related organisms to each other. These methods are based on the concept of phylogenetic footprinting: they exploit the fact that functionally important regions in genomic sequences are usually more conserved than non-functional regions. We created a WWW-based software program for homology-based gene prediction at BiBiServ (Bielefeld Bioinformatics Server). Our tool takes pairs of evolutionary related genomic sequences as input data, e.g. from human and mouse. The server runs CHAOS and DIALIGN to create an alignment of the input sequences and subsequently searches for conserved splicing signals and start/stop codons near regions of local sequence conservation. Genes are predicted based on local homology information and splice signals. The server returns predicted genes together with a graphical representation of the underlying alignment. The program is available at http://bibiserv.TechFak.Uni-Bielefeld.DE/agenda/.
    Nucleic Acids Research 08/2004; 32(Web Server issue):W305-8. · 8.03 Impact Factor
  • Article: AGenDA: homology-based gene prediction.
    ABSTRACT: We present a www server for homology-based gene prediction. The user enters a pair of evolutionary related genomic sequences, for example from human and mouse. Our software system uses CHAOS and DIALIGN to calculate an alignment of the input sequences and then searches for conserved splicing signals and start/stop codons around regions of local sequence similarity. This way, candidate exons are identified that are used, in turn, to calculate optimal gene models. The server returns the constructed gene model by email, together with a graphical representation of the underlying genomic alignment.
    Bioinformatics 09/2003; 19(12):1575-7. · 5.47 Impact Factor
  • Article: The Importance of Window Length in
    ABSTRACT: Introduction: The performance of gene prediction programs strongly depends on the methods that they use to locate splice sites. Different pattern recognition techniques are available to assess the quality of candidate splice sites; see [1] for an overview and further references. All of these techniques proceed by computing a score derived from the distribution of the nucleotides in the neighbourhood of a splice site consensus sequence. These scores are normally obtained with splice site models that have been estimated from large training sets of exemplary neighbourhoods. The training sets may also include negative examples, i.e. sequences that contain the consensus sequence but are not actually splice sites. Unfortunately, the concept of 'neighbourhood' is rather ambiguous, and there is no general recommendation about the positions of the nucleotides that should be included in the calculation, i.e. the analysis window that should be employed. In principle, the window length is an
    07/2003;