Efficient comparison of sets of intervals with NC-lists

URGI, INRA Versailles, Route de Saint-Cyr, 78026 Versailles Cedex, France.
Bioinformatics (Impact Factor: 4.98). 02/2013; 29(7). DOI: 10.1093/bioinformatics/btt070
Source: PubMed


High-throughput sequencing produces in a small amount of time a large amount of data, which are usually difficult to analyze. Mapping the reads to the transcripts they originate from, to quantify the expression of the genes, is a simple, yet time demanding, example of analysis. Fast genomic comparison algorithms are thus crucial for the analysis of the ever-expanding number of reads sequenced.

We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing done once for all, requires time O[#R log(#R) + #Q log(#Q)], where Q and R are the sets of query and reference intervals. The search phase requires constant space, and time O[#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably with five other algorithms, especially when several comparisons are performed.

The algorithm has been included to S-MART, a versatile tool box for RNA-Seq analysis, freely available at The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily useable for the analysis of next-generation sequencing data.

Supplementary information:
Supplementary data are available at Bioinformatics online.

12 Reads
  • [Show abstract] [Hide abstract]
    ABSTRACT: Efficient search algorithms for finding genomic-range overlaps are essential for various bioinformatics applications. A majority of fast algorithms for searching the overlaps between a query range (e.g., a genomic variant) and a set of N reference ranges (e.g., exons) has time complexity of O(k + logN), where kdenotes a term related to the length and location of the reference ranges. Here, we present a simple but efficient algorithm that reduces k, based on the maximum reference range length. Specifically, for a given query range and the maximum reference range length, the proposed method divides the reference range set into three subsets: always, potentially, and never overlapping. Therefore, search effort can be reduced by excluding never overlapping subset. We demonstrate that the running time of the proposed algorithm is proportional to potentially overlapping subset size, that is proportional to the maximum reference range length if all the other conditions are the same. Moreover, an implementation of our algorithm was 13.8 to 30.0 percent faster than one of the fastest range search methods available when tested on various genomic-range data sets. The proposed algorithm has been incorporated into a disease-linked variant prioritization pipeline for WGS ( and its implementation is available at
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 01/2014; 12(4):1-1. DOI:10.1109/TCBB.2014.2369042 · 1.44 Impact Factor