Efficient comparison of sets of intervals with NC-lists

URGI, INRA Versailles, Route de Saint-Cyr, 78026 Versailles Cedex, France.
Bioinformatics (Impact Factor: 4.62). 02/2013; 29(7). DOI: 10.1093/bioinformatics/btt070
Source: PubMed

ABSTRACT

MOTIVATION: High-throughput sequencing produces a large amount of data in a short time, and these data are usually difficult to analyze. Mapping the reads to the transcripts they originate from, in order to quantify gene expression, is a simple yet time-consuming example of such an analysis. Fast genomic comparison algorithms are thus crucial for analyzing the ever-expanding number of sequenced reads.

RESULTS: We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing done once and for all, requires time O(#R log(#R) + #Q log(#Q)), where Q and R are the sets of query and reference intervals. The second step, the search phase, requires constant space and time O(#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably to five other algorithms, especially when several comparisons are performed.

AVAILABILITY: The algorithm has been included in S-MART, a versatile toolbox for RNA-Seq analysis, which is freely available. The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily usable for the analysis of next-generation sequencing data.

SUPPLEMENTARY INFORMATION: Complete algorithms, executables used in the benchmark, and the C++ implementation can be found at Bioinformatics online.
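To make the data structure named in the abstract more concrete, here is a minimal Python sketch of a nested containment list (NC-list) over reference intervals, with a simple per-query overlap lookup. It is an illustration only, not the authors' C++ implementation: the function names `build_nclist` and `find_overlaps`, the tuple layout, and the closed-interval convention are all assumptions. In particular, the lookup below answers one query at a time with a binary search and does not reproduce the paper's search phase or its stated O(#R + #Q + #M) bound.

```python
def build_nclist(intervals):
    """Build a nested containment list from (start, end) pairs (closed intervals).

    Each node is (start, end, children). An interval is stored under the
    closest preceding interval that contains it, so that no two siblings
    in the same sublist contain one another.
    """
    # Sort by start ascending, then end descending: containers come first.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    root = []    # top-level sublist
    stack = []   # chain of intervals that may still contain the next one
    for start, end in ordered:
        # Drop candidates that end too early to contain this interval.
        while stack and end > stack[-1][1]:
            stack.pop()
        node = (start, end, [])
        (stack[-1][2] if stack else root).append(node)
        stack.append(node)
    return root


def find_overlaps(sublist, q_start, q_end):
    """Return all stored intervals that overlap the closed query [q_start, q_end]."""
    # Siblings have strictly increasing ends, so binary-search for the
    # first sibling whose end is not left of the query.
    lo, hi = 0, len(sublist)
    while lo < hi:
        mid = (lo + hi) // 2
        if sublist[mid][1] < q_start:
            lo = mid + 1
        else:
            hi = mid
    found = []
    for i in range(lo, len(sublist)):
        start, end, children = sublist[i]
        if start > q_end:   # siblings are sorted by start: nothing further can overlap
            break
        found.append((start, end))
        # Only intervals nested inside an overlapping interval can overlap too.
        found.extend(find_overlaps(children, q_start, q_end))
    return found
```

With the hypothetical reference set `[(1, 10), (2, 5), (3, 4), (6, 9), (12, 20), (15, 18)]`, calling `find_overlaps(build_nclist(refs), 4, 6)` walks only the sublist under `(1, 10)` and never visits `(12, 20)` or its child. The sort in `build_nclist` corresponds to the O(#R log(#R)) term of the pre-processing cost quoted in the abstract.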

  • ABSTRACT: As vertebrate genome sequences near completion and research refocuses on their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided online. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks. Text and sequence-based searches provide quick and precise access to any region of specific interest. Secondary links from individual features lead to sequence details and supplementary off-site databases. One-half of the annotation tracks are computed at the University of California, Santa Cruz from publicly available sequence data; collaborators worldwide provide the rest. Users can stably add their own custom tracks to the browser for educational or research purposes. The conceptual and technical framework of the browser, its underlying MySQL database, and overall use are described. The web site currently serves over 50,000 pages per day to over 3000 different users.
    Genome Research 07/2002; 12(6):996-1006. DOI:10.1101/gr.229102. · 13.85 Impact Factor
  • ABSTRACT: Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy, an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis.
    Genome Biology 08/2010; 11(8):R86. DOI:10.1186/gb-2010-11-8-r86 · 10.47 Impact Factor
  • ABSTRACT: The biochemistry of RNA-Seq library preparation results in cDNA fragments that are not uniformly distributed within the transcripts they represent. This non-uniformity must be accounted for when estimating expression levels, and we show how to perform the needed corrections using a likelihood-based approach. We find improvements in expression estimates as measured by correlation with independently performed qRT-PCR and show that correction of bias leads to improved replicability of results across libraries and sequencing technologies.
    Genome Biology 03/2011; 12(3):R22. DOI:10.1186/gb-2011-12-3-r22 · 10.47 Impact Factor