Margulies, E. H. et al. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17, 760-774 (2007).

Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Genome Research (Impact Factor: 14.63). 07/2007; 17(6):760-74. DOI: 10.1101/gr.6034307
Source: PubMed


A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
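The overlap statistic quoted in the abstract (the share of constrained sequence that does not overlap any experimentally annotated element) can be illustrated with a small interval computation. The following is a minimal sketch, not the authors' pipeline: the function names, the half-open `(start, end)` coordinate convention, the single-chromosome simplification, and the toy coordinates are all assumptions made for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(query, reference):
    """Count bases of the query intervals that lie inside reference intervals."""
    query = merge_intervals(query)
    reference = merge_intervals(reference)
    total = 0
    j = 0
    for q_start, q_end in query:
        # Skip reference intervals that end before this query interval begins.
        while j < len(reference) and reference[j][1] <= q_start:
            j += 1
        k = j
        # Accumulate the overlap with every reference interval that starts
        # before this query interval ends.
        while k < len(reference) and reference[k][0] < q_end:
            total += min(q_end, reference[k][1]) - max(q_start, reference[k][0])
            k += 1
    return total


def fraction_unannotated(constrained, functional):
    """Fraction of constrained bases not overlapping any functional element."""
    constrained = merge_intervals(constrained)
    total = sum(end - start for start, end in constrained)
    if total == 0:
        return 0.0
    return 1.0 - covered_bases(constrained, functional) / total


# Toy coordinates (hypothetical, not taken from the ENCODE data).
constrained_regions = [(0, 100), (200, 300)]
functional_elements = [(50, 120), (250, 260)]
unannotated = fraction_unannotated(constrained_regions, functional_elements)
print(f"Unannotated fraction of constrained bases: {unannotated:.2f}")
```

Merging both interval sets first keeps bases from being counted twice when annotations overlap one another, which matters when several classes of functional elements are pooled into one reference set.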

    • "There have been relatively few independent or community organized assessments of WGA pipelines. Notably, as part of the ENCODE Pilot Project (Margulies et al. 2007), four pipelines were assessed across a substantial number of regions, and Chen and Tompa later compared those alignments using the StatSigMA-w tool (Chen and Tompa 2010). The Alignathon is an attempt to perform a larger and more comprehensive evaluation. "
    ABSTRACT: Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark datasets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three datasets were used; two were simulated and based on primate and mammalian phylogenies, and one comprised 20 real fly genomes. In total, 35 submissions were assessed, submitted by ten teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analysed. We found many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all datasets, submissions and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.
    Genome Research 10/2014; 24(12). DOI:10.1101/gr.174920.114 · 14.63 Impact Factor
    • "This totaled 149 BACs. In addition, we picked 186 BACs to cover the dog homologs to the 45 human ENCODE pilot regions [20]. In total we integrated 85 Mb (an ∼0.4% increase) of novel sequence from BAC and primer walks into the canFam2.0 "
    ABSTRACT: The domestic dog, Canis familiaris, is a well-established model system for mapping trait and disease loci. While the original draft sequence was of good quality, gaps were abundant, particularly in promoter regions of the genome, negatively impacting the annotation and study of candidate genes. Here, we present an improved genome build, canFam3.1, which includes 85 Mb of novel sequence and now covers 99.8% of the euchromatic portion of the genome. We also present multiple RNA-Sequencing data sets from 10 different canine tissues to catalog ∼175,000 expressed loci. While about 90% of the coding genes previously annotated by EnsEMBL have measurable expression in at least one sample, the number of transcript isoforms detected by our data expands the EnsEMBL annotations by a factor of four. Syntenic comparison with the human genome revealed an additional ∼3,000 loci that are characterized as protein coding in human and were also expressed in the dog, suggesting that those were previously not annotated in the EnsEMBL canine gene set. In addition to ∼20,700 high-confidence protein coding loci, we found ∼4,600 antisense transcripts overlapping exons of protein coding genes, ∼7,200 intergenic multi-exon transcripts without coding potential, which are likely candidates for long intergenic non-coding RNAs (lincRNAs), and ∼11,000 transcripts that were reported by two different library construction methods but did not fit any of the above categories. Of the lincRNAs, about 6,000 have no annotated orthologs in human or mouse. Functional analysis of two novel transcripts with shRNA in a mouse kidney cell line altered cell morphology and motility. All in all, we provide a much-improved annotation of the canine genome and suggest regulatory functions for several of the novel non-coding transcripts.
    PLoS ONE 03/2014; 9(3):e91172. DOI:10.1371/journal.pone.0091172 · 3.23 Impact Factor
    • "A study in plants has also reported an event of promoter shuffling generated by inter-chromosome and subsequent intra-chromosome recombination (31). Kent et al. (32) noticed an unexpected number of small fragments conserved between non-syntenic regions when analyzing mammalian genomes, and similarly, in the ENCODE pilot project, the presence of small non-syntenic conserved regions was reported (33). Therefore, non-syntenic rearrangements of conserved (hence potentially functional) sequences did happen during evolution, and they are unlikely to be the mere result of assembly errors, but no further elucidation of their evolution and function has been undertaken so far. "
    ABSTRACT: Co-option of cis-regulatory modules has been suggested as a mechanism for the evolution of expression sites during development. However, the extent and mechanisms involved in mobilization of cis-regulatory modules remain elusive. To trace the history of non-coding elements, which may represent candidate ancestral cis-regulatory modules affirmed during chordate evolution, we have searched for conserved elements in tunicate and vertebrate (Olfactores) genomes. We identified, for the first time, 183 non-coding sequences that are highly conserved between the two groups. Our results show that all but one element are conserved in non-syntenic regions between vertebrate and tunicate genomes, while being syntenic among vertebrates. Nevertheless, in all the groups, they are significantly associated with transcription factors showing specific functions fundamental to animal development, such as multicellular organism development and sequence-specific DNA binding. The majority of these regions map onto ultraconserved elements and we demonstrate that they can act as functional enhancers within the organism of origin, as well as in cross-transgenesis experiments, and that they are transcribed in extant species of Olfactores. We refer to the elements as 'Olfactores conserved non-coding elements'.
    Nucleic Acids Research 02/2013; 41(6). DOI:10.1093/nar/gkt030 · 9.11 Impact Factor