2013, pages 1–7
Data and text mining
Advance Access publication February 14, 2013
Efficient comparison of sets of intervals with NC-lists
Matthias Zytnicki*, YuFei Luo and Hadi Quesneville
URGI, INRA Versailles, Plant Biology and Breeding Division, 78026 Versailles Cedex, France
Associate Editor: Michael Brudno
Motivation: High-throughput sequencing produces in a small
amount of time a large amount of data, which are usually difficult to
analyze. Mapping the reads to the transcripts they originate from, to
quantify the expression of the genes, is a simple, yet time demanding,
example of analysis. Fast genomic comparison algorithms are thus
crucial for the analysis of the ever-expanding number of reads.
Results: We used NC-lists to implement an algorithm that compares
a set of query intervals with a set of reference intervals in two steps.
The first step, a pre-processing done once and for all, requires
time $O(\#R \log(\#R) + \#Q \log(\#Q))$, where Q and R are the sets of
query and reference intervals. The search phase requires constant
space, and time $O(\#R + \#Q + \#M)$, where M is the set of
overlaps. We showed that our algorithm compares favorably with
five other algorithms, especially when several comparisons are performed.
Availability: The algorithm has been included in S–MART, a versatile
tool box for RNA-Seq analysis, freely available at http://urgi.versailles.inra.fr/Tools/S-Mart.
The algorithm can be used for many kinds of data
(sequencing reads, annotations, etc.) in many formats (GFF3, BED,
SAM, etc.), on any operating system. It is thus readily useable for
the analysis of next-generation sequencing data.
Supplementary information: Supplementary data are available at
Received on March 27, 2012; revised on December 20, 2012;
accepted on February 7, 2013
With the advent of high-throughput sequencing, bioinformatics
must analyze a large amount of data every day. Modern sequen-
cers can generate several hundred millions of sequences in a week
for a price that is affordable to more and more labs. When a
reference genome is available, the first task is to map the reads on
the genome. Many mapping tools are now available and research
is active on this topic [see Langmead et al. (2009) for instance].
For RNA-Seq, the second step may be the assignment of the
mapped reads to the transcripts they originate from, to estimate
the expression of the genes (Anders, 2011). In general, the gen-
omic comparison of the mapped reads with a reference annota-
tion is the basis of many analyses: comparison of putative
transcription factor binding sites with up-regulated genes
(Blankenberg et al., 2010; Giardine et al., 2005; Goecks et al.,
2010); detection of the single-nucleotide polymorphisms that are
located in coding regions (Renaud et al., 2011); processing
de novo transcript sequences to determine if they represent
known or novel genes (Roberts et al., 2011). These three
examples involve a comparison of two annotations, and the
problem has been addressed often. However, high-throughput
sequencing, because of the amount of data it produces, requires optimized
algorithms for its analysis.
Most tools model the reads or annotation as intervals, or lists
of intervals when different elements are modeled (exons, UTRs,
etc.). These intervals are considered along a reference, which
usually is a chromosome or a scaffold. Thus, comparing RNA-
Seq reads with known transcripts reduces to comparing a set of
query intervals (the reads) with a set of reference intervals (the
exons of the transcripts).
Every efficient algorithm requires a dedicated data structure,
such as an indexed database, an indexed flat file [such as a BAM
file (Li et al., 2009)], an R-tree or NC-lists (nested containment
lists) (Alekseyenko and Lee, 2007). These structures are usually
built once during the pre-processing step, and can be reused
for other analyses. Although these structures may take a considerable
amount of time to build, the balance is usually favorable
to pre-processed structures when several comparisons are
performed, as the time spent for the comparison itself is considerably
reduced. This observation led to the conception of
the BAM format, now widely used in the bioinformatics community.
With the notable exception of the fjoin algorithm (Richardson,
2006), almost all the algorithms previously described only get all
the reference intervals that overlap with one given query interval:
most algorithms have been designed to retrieve all the intervals a
user can see when he selects a given window in a genome browser
(Kent et al., 2002). Whereas these algorithms can be used to
compare two sets by comparing each query interval, one after
the other, with the reference intervals, we will show here how
comparing the whole query set with the reference set can be more efficient.
Among the possible data structures presented to compare
intervals, NC-lists (Alekseyenko and Lee, 2007) are one of the
most promising. NC-lists were first described to retrieve all
the reference intervals that overlap with a single interval. Their
structure is compact (a simple set of two arrays, L and H), the
algorithm is fast in practice and the search phase requires only
constant space, which is essential when handling several hundreds
of millions of reads. The key idea of NC-lists is to perform
a binary (dichotomic) search on the list of reference intervals. But
dichotomic search cannot be performed when some intervals are
contained (or nested) inside other intervals, so NC-lists arrange
intervals into lists—the L array—where no two intervals are
nested. If some intervals are nested inside an ancestor interval,
*To whom correspondence should be addressed.
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: email@example.com
Bioinformatics Advance Access published March 11, 2013
they are stored in a separate sublist using the H array (see Fig. 1).
NC-lists can be built in linearithmic time [i.e. of the form
$O(n \log n)$], using linear space (actually, only five integers are
stored per interval). In their article, the authors presented a
recursive dichotomic algorithm, equivalent to Alg. 1, which
uses NC-lists. It is claimed that getting all the reference intervals
that overlap with a query interval could be done in time
$O(\log(\#R) + \#M)$, where R is the reference set and M the pairs
query/reference that overlap, but this is not accurate for some
cases (see section 3.1).
In this article, we will present an algorithm, which relies on
NC-lists, and provides all the pairs query intervals/reference
intervals that overlap. In a pre-processing step, the algorithm
sorts the query and the reference intervals. It then builds an
NC-list for the reference intervals. In the search phase, the
algorithm compares every query interval with the reference
intervals in time $O(\#R + \#Q + \#M)$. All together, the algorithm
takes $O(\#R \log(\#R) + \#Q \log(\#Q) + \#M)$. Although the
complexity of the whole algorithm is not better than already
known algorithms, the runtime complexity is significantly
lower than other constant-space algorithms. As such, our
algorithm is especially useful when performing multiple comparisons
on large sets of data, such as in an RNA-Seq data analysis.
Algorithm 1 Original algorithm
Algorithm 2 Simplified algorithm
To compare two sets of intervals, we also used an NC-list for the reference
set, and query intervals are simply sorted by their start position. Our aim
is to find all the query intervals that overlap with at least one reference
interval. The main idea of the algorithm is that knowledge from the
comparison between a query interval and a reference interval will be
used for the comparison of the next query interval. A sketch of the algo-
rithm, which provides all the pairs of query/reference intervals that over-
lap, is presented in Alg. 2. The actual algorithm is slightly more complex,
and is described in section 3.2. It uses a special variable, nfo (for next first
overlap), which stores the first reference interval that may overlap with the
next query interval.
We will describe and analyze here the problem of
the comparison of genomic intervals. We will first formally
define our data and the NC-list structure.
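DEFINITION 1. An interval $i = (a, b)$ is an element of $\mathbb{N}^2$ such
that $a \le b$. By convention, we set $i.\mathrm{start} = a$ and $i.\mathrm{end} = b$.
For two intervals i and j, we define:
$$
\begin{aligned}
i < j &\Leftrightarrow i.\mathrm{end} < j.\mathrm{start} && (i \text{ is before } j)\\
i \mathrel{<>} j &\Leftrightarrow (i.\mathrm{start} \le j.\mathrm{end}) \wedge (j.\mathrm{start} \le i.\mathrm{end}) && (i \text{ and } j \text{ overlap})\\
i \subset j &\Leftrightarrow (j.\mathrm{start} \le i.\mathrm{start}) \wedge (i.\mathrm{end} \le j.\mathrm{end}) && (i \text{ is contained in } j)
\end{aligned}
$$
The NC-list construction algorithm supposes that the intervals
have been previously sorted following a total order: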
Fig. 1. Transforming a set of ordered intervals into an NC-list. All the intervals
have been previously sorted according to their increasing start position
and, in case of tie, decreasing end position. Because intervals 3, 4
and 5 are nested inside interval 2, they are removed from the top list
(which consists of intervals 1, 2 and 6) and inserted into another sublist.
Intervals 3 and 5 are moved to the sublist of 2. Similarly, interval 4 is
nested inside interval 3, and thus moved to another sublist. When an
interval is nested into two intervals (as is the case for interval 7,
which is nested in 2 and 6), the right-most interval that contains it is
chosen. Here, it is interval 6. An NC-list is a set of two arrays, L and
H. Each line of L stores the start and end positions of an interval, as well
as an index to the H array. The L data are stored so that the intervals that
are in the same list appear contiguously. For each sublist, a correspond-
ing line of the H array stores the index of its least interval and the size
of the list. As highlighted by the arrows, the sublist of interval 2 (line 1 of
the L array, which is a zero-based structure) is the line 1 of the H array.
The sublist starts at index 3 of the L array and contains 2 intervals (the
intervals 3 and 5)
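DEFINITION 2. [$\preceq_L$, (Alekseyenko and Lee, 2007)]. A total
order $\preceq$ is defined, such that
$$
\forall (r, r') \in R^2,\quad r \preceq r' \Leftrightarrow (r.\mathrm{start} < r'.\mathrm{start}) \vee \big((r.\mathrm{start} = r'.\mathrm{start}) \wedge (r.\mathrm{end} \ge r'.\mathrm{end})\big)
$$
The associated asymmetric relation is defined by
$$
\forall (r, r') \in R^2,\quad (r \prec r') \Leftrightarrow \neg (r' \preceq r)
$$
If two different intervals, r and r', have the same coordinates
[$(r.\mathrm{start} = r'.\mathrm{start}) \wedge (r.\mathrm{end} = r'.\mathrm{end})$], we define either $r \prec r'$ or $r' \prec r$.
To avoid ambiguity, the $\preceq$ relation is subscripted by the set it
relates to (namely $\preceq_Q$ for the query set and $\preceq_R$ for the reference set).
The successor of an element $r \in R$ with respect to the order $\preceq_R$
will be noted succ(r), when it exists.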
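The relations of Definition 1 and the ordering of Definition 2 can be sketched in a few lines of Python. This is only an illustration: the names `Interval`, `is_before`, `overlaps`, `is_contained` and `nc_sort_key` are hypothetical, and the "before" relation is assumed to mean that i ends strictly before j starts, which is what makes B[q], M[q] and A[q] a partition later on.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int
    end: int


def is_before(i: Interval, j: Interval) -> bool:
    """i < j (assumed reading): i ends strictly before j starts."""
    return i.end < j.start


def overlaps(i: Interval, j: Interval) -> bool:
    """i <> j: the two closed intervals share at least one position."""
    return i.start <= j.end and j.start <= i.end


def is_contained(i: Interval, j: Interval) -> bool:
    """i is contained in j (non-strict containment)."""
    return j.start <= i.start and i.end <= j.end


def nc_sort_key(i: Interval):
    """Increasing start, then decreasing end, as in the Fig. 1 caption."""
    return (i.start, -i.end)


intervals = [Interval(5, 9), Interval(1, 10), Interval(1, 4), Interval(3, 6)]
ordered = sorted(intervals, key=nc_sort_key)
```

Because Python's sort is stable, two intervals with identical coordinates simply keep their input order, which is one way of making the arbitrary choice the definition allows.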
The construction phase of the NC-list groups the sorted
intervals into lists, such that an interval that is contained in
another interval is moved into the sublist of the container interval.
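DEFINITION 3. We define the subelements of an interval by
$$
\forall r \in R,\quad r.\mathrm{sub} = \{ s \in R : r \prec_R s \wedge s \prec_R \min_{\preceq_R} \{ r' \in R : (r \prec_R r') \wedge (r' \not\subset r) \} \}
$$
The children of an interval are as follows:
$$
\forall r \in R,\quad r.\mathrm{children} = \{ c \in r.\mathrm{sub} : \forall c' \in r.\mathrm{sub} \setminus \{c\},\ c \not\subset c' \}
$$
r.children is also called the sublist of r.
The parent of an interval is defined as follows:
$$
\forall (c, p) \in R^2,\quad (c.\mathrm{parent} = p) \Leftrightarrow (c \in p.\mathrm{children})
$$
Finally, r.ancestors is the list of ancestors of $r \in R$, i.e. the list
$(r_1, r_2, r_3, \ldots, r_n)$ such that $r_1$ has no parent, $r_k = r_{k+1}.\mathrm{parent}$
and $r_n = r.\mathrm{parent}$.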
The previous definition provides a way to build the nested
containment structure from a sorted list of intervals: given an
interval r, all its successors that are nested into r should be found
in a list under r. They are the subelements of r. The children of r
are the subelements that are right under r (i.e. there is no other
interval nested in r that contains a child of r).
Note that an interval r may have no parent. In this case, we set
$r.\mathrm{parent} = \emptyset$ and all the intervals that have no parent form the top list.
DEFINITION 4. The NC-list of a set of intervals is a tree-like data
structure such that
• each node contains sorted intervals,
• the root node is the list of intervals that have no parent,
• there is an edge between every interval and the list of its children.
Notice that an NC-list is not a tree because an edge connects a
node (the parent interval) to a list of nodes (the children).
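As an illustration of the nesting described by Definitions 3 and 4, the following Python sketch groups sorted intervals into a top list and sublists using a stack of currently open ancestors. It is a minimal sketch, not the L/H array layout of the NC-list itself: `Node` and `build_nested_lists` are hypothetical names, nested Python objects stand in for the two arrays, and the coordinates in the usage example are invented to reproduce the situation described in the Fig. 1 caption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Node:
    start: int
    end: int
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def build_nested_lists(intervals: List[Tuple[int, int]]) -> List[Node]:
    """Return the top list; each node carries its sublist in .children."""
    top: List[Node] = []
    stack: List[Node] = []  # chain of candidate ancestors, innermost last
    # Sort by increasing start, then decreasing end.
    for start, end in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        node = Node(start, end)
        # Discard candidates that do not contain the new interval.
        while stack and not (stack[-1].start <= start and end <= stack[-1].end):
            stack.pop()
        if stack:
            node.parent = stack[-1]
            stack[-1].children.append(node)
        else:
            top.append(node)
        stack.append(node)
    return top


# Invented coordinates for intervals 1..7 of the Fig. 1 scenario.
example = [(0, 5), (10, 40), (12, 25), (14, 20), (28, 35), (30, 60), (32, 38)]
top_list = build_nested_lists(example)
```

In this example the last interval is contained in both (10, 40) and (30, 60); since (10, 40) has already been popped from the stack when (30, 60) arrives, the right-most container is chosen, as in the figure.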
The original algorithm, equivalent to Alg. 1, considers a query
interval q and a set of reference intervals R.
It gives the elements of R that overlap with q. We will show here
that the algorithm presented by Alekseyenko and Lee (2007)
does not have the complexity claimed in the article. In the example
in Figure 2, the announced complexity does not hold.
The example has a nested structure, where each reference interval
has the same number of siblings. For each list of siblings,
none but the last one has children. The query overlaps every last
sibling of each list. The number of layers is equal to the number
of siblings, n. Here, $\#M = n$ and $\#R = n^2$. Executing the algorithm
yields a time complexity of $O(n \log n)$, as n binary searches
are performed (one for each layer). However, the claimed complexity
is $O(\log(n^2) + n) = O(n) < O(n \log n)$.
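To make the gap explicit, the two costs can be written side by side for this construction, with n layers of n siblings each. This is only a worked restatement of the argument above (it assumes the `amsmath` package for `align*`):

```latex
\begin{align*}
\text{claimed: } & O(\log(\#R) + \#M) = O(\log(n^2) + n) = O(2\log n + n) = O(n),\\
\text{observed: } & \underbrace{O(\log n) + \dots + O(\log n)}_{n \text{ layers, one binary search each}} = O(n \log n).
\end{align*}
```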
DEFINITION 5. The problem of the comparison of sets of inter-
vals considers two sets of intervals, Q and R (hereafter named
the query set and the reference set) and finds all the pairs
$(q, r) \in Q \times R$ such that q and r overlap.
Notice that there is no assumption on the two sets: elements from
the query or reference sets may be nested or not, have different
DEFINITION 6. Consider a query read q. Let
• $B[q] = \{ r \in R : r < q \}$ be the set of reference intervals that are before q,
• $M[q] = \{ r \in R : r \mathrel{<>} q \}$ be the set of reference intervals that
overlap with q,
• $A[q] = \{ r \in R : r > q \}$ be the set of reference intervals that are after q.
Because B[q], M[q] and A[q] are disjoint, and cover the entirety
of R, $\{B[q], M[q], A[q]\}$ is a partition of R. Moreover, any
optimized algorithm would of course try to compute M[q] as
fast as possible, while avoiding scanning B[q] and A[q]. The
two following lemmas (their proofs are omitted, for they are
straightforward) will help us skip reading these sets.
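For reference, the partition of Definition 6 can be computed naively as follows. This is a deliberately simple linear scan, i.e. exactly the work an optimized algorithm tries to avoid; `partition_references` is a hypothetical helper, intervals are plain `(start, end)` tuples, and "before" is assumed to mean "ends strictly before q starts".

```python
def partition_references(q, references):
    """Split references into (B[q], M[q], A[q]) by scanning all of them."""
    q_start, q_end = q
    before, matching, after = [], [], []
    for r in references:
        r_start, r_end = r
        if r_end < q_start:        # r is before q
            before.append(r)
        elif q_end < r_start:      # r is after q
            after.append(r)
        else:                      # r and q overlap
            matching.append(r)
    return before, matching, after


B, M, A = partition_references(
    (10, 20), [(1, 5), (4, 10), (12, 15), (20, 30), (21, 25)]
)
```

Every reference falls into exactly one of the three branches, which is the partition property used in the text.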
LEMMA 1. If an interval is in B[q], then all its subelements are also in B[q].
As a consequence, if a reference element is in B[q], then its
children intervals will not be compared with q.
LEMMA 2.
$$
\forall a \in A[q], \forall r \in R,\quad (a \prec_R r) \Rightarrow (r \in A[q])
$$
Fig. 2. Pathological case concerning the algorithm between one query
interval (in black) and several reference reads (white)
The previous lemma implies that if we scan the reference intervals
using the ordering $\preceq_R$, the search can stop when the least
element of A[q] is found. In other words, the greatest element of
M[q], when this set is not empty, is the predecessor of the least
element of A[q]. There is no similar rule concerning the least
element of M[q] and B[q], and characterizing the ‘left frontier’
of M[q] is slightly more complex.
To do so, we will define here nfo. Informally, this variable is
the least (using the ordering $\preceq_R$) lowest (meaning that none of its
children overlaps with q) interval that overlaps with q. Because nfo overlaps
with q, all its ancestors also do. Because it is the least interval
that overlaps with q, the successors of nfo either overlap with q
or are after q. In the algorithm, this variable is set when we
compare a query interval q with the set of reference intervals,
and it is the first interval that will be compared with the successor
of q. We will prove the previous claims here.
DEFINITION 7.
$$
\mathrm{nfo}[q] = \min_{\preceq_R} \big\{ \{ r \in M[q] : r.\mathrm{children} \subseteq B[q] \} \cup A[q] \big\}
$$
nfo[q] may be undefined if M[q] and A[q] are empty. In this case,
we define nfo[q] = None.
To help the reader, different configurations of nfo[q] are
described in Figure 3. We will prove that the predecessors of nfo,
except for its ancestors, are all in B[q].
LEMMA 3. Let $m[q] = \min_{\preceq_R} \{ M[q] \cup A[q] \}$.
If m[q] is undefined, we set m[q] = None.
If m[q] is None, then nfo[q] also is. Otherwise,
$$
m[q] \in \{\mathrm{nfo}[q]\} \cup \mathrm{nfo}[q].\mathrm{ancestors}
$$
PROOF. Let us suppose that m[q] is not None (the proof is clear
otherwise). If $m[q] \in A[q]$, then $m[q] = \mathrm{nfo}[q]$ and the lemma is
proved. Otherwise, let r be a reference element such that
$(m[q] \in r.\mathrm{ancestors}) \wedge (r.\mathrm{children} = \emptyset)$. Such an element exists,
otherwise the number of sublists would be infinite. Clearly,
$r \in \{ r' \in M[q] : r'.\mathrm{children} \subseteq B[q] \}$, so $\mathrm{nfo}[q] \preceq_R r$. We have
thus $m[q] \preceq_R \mathrm{nfo}[q] \preceq_R r$ and $r \in m[q].\mathrm{sub}$, which implies, by definition
of the subelements, that $\mathrm{nfo}[q] \in m[q].\mathrm{sub}$, or $\mathrm{nfo}[q] = m[q]$.
This proves the lemma.
COROLLARY 4.
$$
\forall r \in R,\quad (r \prec_R \mathrm{nfo}[q]) \Rightarrow (r \in B[q] \cup \mathrm{nfo}[q].\mathrm{ancestors})
$$
As a result, suppose that we have found nfo[q] and that we are
looking for nfo[q'], with q' = succ(q). Because $B[q] \subseteq B[q']$, the
previous corollary implies that nfo[q'] is either a parent of nfo[q]
or one of its successors.
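LEMMA 5.
$$
\forall m \in M[q], \forall r \in R,\quad ((m \prec_R r) \wedge (m \notin r.\mathrm{ancestors})) \Rightarrow (r \in M[q] \cup A[q])
$$
PROOF. Let r be a reference interval such that $m \prec_R r \wedge m \notin r.\mathrm{ancestors}$.
The following assertions hold:
(1) $m.\mathrm{start} \le q.\mathrm{end} \wedge q.\mathrm{start} \le m.\mathrm{end}$ (q and m overlap),
(2) $m.\mathrm{start} \le r.\mathrm{start}$ ($m \prec_R r$),
(3) $m.\mathrm{start} > r.\mathrm{start} \vee m.\mathrm{end} < r.\mathrm{end}$ ($m \notin r.\mathrm{ancestors}$).
From 2 and 3, we deduce that $m.\mathrm{end} < r.\mathrm{end}$. Comparing with 1,
we deduce that $q.\mathrm{start} < r.\mathrm{end}$, and so $r \in M[q] \cup A[q]$. This
proves the lemma.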
PROPOSITION 6.
$$
\forall r \in R,\quad \mathrm{nfo}[q] \prec_R r \Rightarrow
\begin{cases}
r \in B[q] & \text{if } (\mathrm{nfo}[q] \in M[q]) \wedge (\mathrm{nfo}[q] \in r.\mathrm{ancestors})\\
r \in (M[q] \cup A[q]) & \text{otherwise}
\end{cases}
$$
PROOF. If $\mathrm{nfo}[q] \in A[q]$, then Lemma 2 proves that r will be in
A[q]. Otherwise, $\mathrm{nfo}[q] \in M[q]$. In this case, by definition of
nfo[q], all its children are in B[q], and by application of
Lemma 1, all the intervals that are under nfo[q] are in B[q].
Finally, by application of Lemma 5, all the reference intervals
that are after nfo[q], but not under it, are in $M[q] \cup A[q]$. This
proves the proposition.
This last proposition implies that in general, all the elements
greater than nfo[q] could overlap with the successor of q. The
only exception is when nfo[q] overlaps with q. In this case, chil-
dren intervals must be skipped. This is why we use a variable
skip, which stores this configuration.
From the previous propositions, we can directly
infer an algorithm, which is completely presented in supplementary
materials. A loop iterating over the query elements is
described in findOverlap. The algorithm that compares a
query interval with the reference intervals is described in
findOverlapIter. A last algorithm, getNext, shows how
to get the successor of a reference interval (considering the ordering $\preceq_R$).
Informally, the main algorithm directly jumps to the nfo reference
element that had been computed by the previous query
interval. It then checks the ancestors. Then, it scans forward. If
the current reference is in B[q], it jumps to the next interval. If the
current reference is in M[q], it goes down to the sublists, except if
the variable skip is true. In such case, it directly jumps to the next
interval. If the current reference is in A[q], it stops. The variable
nfo is updated when necessary.
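The idea of reusing knowledge from one query to the next can be illustrated with a deliberately simplified Python sketch. It is not findOverlap: it only remembers an entry point in the top list (which can never move left when queries are sorted by start) and still uses a binary search inside each sublist, so it does not reach the $O(\#R + \#Q + \#M)$ bound and does not implement nfo, skip or the parent pointers. All function names are hypothetical, and nested Python lists stand in for the L and H arrays.

```python
def build(intervals):
    """Nest sorted intervals; a node is [start, end, children]."""
    top, stack = [], []
    for start, end in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        node = [start, end, []]
        # Pop candidates that end before the new interval does.
        while stack and end > stack[-1][1]:
            stack.pop()
        (stack[-1][2] if stack else top).append(node)
        stack.append(node)
    return top


def first_not_before(sublist, q_start):
    """Binary search: index of the first interval whose end >= q_start."""
    lo, hi = 0, len(sublist)
    while lo < hi:
        mid = (lo + hi) // 2
        if sublist[mid][1] < q_start:
            lo = mid + 1
        else:
            hi = mid
    return lo


def report(sublist, q, i, out):
    """Scan a sublist from index i, descending into overlapping intervals."""
    while i < len(sublist) and sublist[i][0] <= q[1]:
        node = sublist[i]
        out.append((q, (node[0], node[1])))
        report(node[2], q, first_not_before(node[2], q[0]), out)
        i += 1


def compare_sets(queries, references):
    """Return all (query, reference) pairs that overlap."""
    top, out, first = build(references), [], 0
    for q in sorted(queries):
        # Queries are sorted by start, so this entry point only moves right.
        while first < len(top) and top[first][1] < q[0]:
            first += 1
        report(top, q, first, out)
    return out


pairs = compare_sets([(18, 22), (3, 4)], [(1, 10), (2, 5), (20, 30), (40, 50)])
```

The sketch relies on the fact that, inside one sublist, no interval contains another, so both starts and ends are increasing and the overlapping intervals form one contiguous run.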
PROPOSITION 7. nfo[q] is the nfo computed in the algorithm.
PROOF. Let us consider a query interval q' and its successor q.
$(M[q] \cup A[q]) \subseteq (M[q'] \cup A[q'])$. By Corollary 4,
Fig. 3. Different configurations of the interval comparison problem.
In every case, the query interval (q) is in black, and the other colors
refer to the reference intervals. nfo[q] is indicated by the arrow. To help
the reader, reference intervals in B[q] are white; the intervals in M[q]
are light gray; dark gray intervals are in A[q]. Case (A) is the simple
case, the other cases are less intuitive. In case (B), we can observe that
the first overlapping interval is not nfo[q]: it is the bottom-most over-
lapping element. In case (C), all the children of nfo[q] are in B[q]. In case
(D), nfo[q] is in A[q]
$m[q] = \min_{\preceq_R} \{ M[q] \cup A[q] \}$ is either a parent of nfo[q'] or one of
its successors.
Besides, we have previously proved that $m[q] \in \{\mathrm{nfo}[q]\} \cup
\mathrm{nfo}[q].\mathrm{ancestors}$. Thus, starting from the previous nfo, checking
its ancestors, then possibly going right until m[q] is found, and
then finally going down is enough to find nfo[q]. This is what
the algorithm does.
PROPOSITION 8. The time complexity of the algorithm is
$O(\#Q + \#R + \#M)$.
PROOF. Let us consider the reference intervals that will be
compared with q. Let q' be its predecessor. The reference intervals
that are scanned are $B[q] \setminus B[q']$, M[q], and the least element
of A[q]. Because the sets $\{ B[q] \setminus B[q'] : (q, q') \in Q^2, \mathrm{succ}(q') = q \}$
are all disjoint, the total time complexity is
$O(\#Q + \#R + \#M)$.
Notice that the algorithm findOverlap sometimes needs to
go from the child to the parent, and thus be able to visit the tree
from bottom to top, whereas the original algorithm described in
Alg. 1 is a typical top-down algorithm. To be able to go up, we
added in the L table a new cell, which contains the address of the
parent element in the L table.
Transcripts usually are not simple
intervals, but a succession of several intervals, which are the
exons. Similarly, the reads can also be split into several parts
if they overlap an exon/exon junction. In our implementation,
we modeled each query and reference element as a single
interval (the smallest interval that contains all the exons), and
stored these intervals in the NC-list. To avoid reporting the
reads that fall within introns, we also store, for each interval, a
pointer to the memory address where the transcript or read is
completely described. To do so, we simply added a new column
in the L table, which stores the address. When an overlap is
found, the full structure is retrieved and the query and reference
intervals are compared in detail to report only true matches.
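The two-stage check described above can be sketched as follows, assuming each feature is given as a list of `(start, end)` blocks (exons for a transcript, aligned parts for a read). The helper names `span` and `true_match` are hypothetical and the coordinates are invented; the actual implementation retrieves the full feature through the stored address rather than from an in-memory list.

```python
def span(blocks):
    """Smallest single interval that contains every block of a feature."""
    return (min(s for s, _ in blocks), max(e for _, e in blocks))


def spans_overlap(a, b):
    """First stage: compare the single intervals stored in the NC-list."""
    return a[0] <= b[1] and b[0] <= a[1]


def true_match(query_blocks, reference_blocks):
    """Second stage: at least one query block overlaps one reference exon."""
    return any(
        q_start <= r_end and r_start <= q_end
        for q_start, q_end in query_blocks
        for r_start, r_end in reference_blocks
    )


transcript = [(100, 200), (500, 600)]        # two exons, one intron
intronic_read = [(250, 300)]                 # falls inside the intron
spliced_read = [(180, 200), (500, 520)]      # crosses the exon/exon junction
```

Here the intronic read passes the first stage, because its span lies inside the transcript span, but is rejected by the second one.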
Comparison to other implementations. We show here the results
of our algorithm when compared with several other published
methods. The first is a simple NC-list algorithm, as presented by
Alekseyenko and Lee (2007), which does not use any information
between two consecutive query intervals, hereafter called ‘nc’.
The second method implements binning (Kent et al., 2002)
using an indexed SQLite table, hereafter called ‘bin’. We also
implemented another flavor of this algorithm, called ‘has’,
where the database has been replaced by a hash structure, such
that the keys are the bins, and the values are lists of intervals.
A fourth algorithm is a binning table with segment tree, as
described in Segtor (Renaud et al., 2011), called ‘seg’. We also
added FJoin (Richardson, 2006) (‘fj’), which scans the previously
sorted query intervals and reference intervals simultaneously to
find overlaps. Our algorithm will simply be called ‘new’.
Among the presented algorithms, only ‘bin’, ‘nc’ and ‘new’
have constant space complexities. The other algorithms, ‘has’,
‘seg’ (where the trees are stored in memory) and ‘fj’ (which has
a linear space complexity), are thus not likely to work on the
large amount of data modern sequencers generate, with a standard
computer. For instance, in our implementation, the ‘has’
algorithm fills our RAM (4 GB) when the reference dataset
contains 30M intervals. Still, as they rely on in-memory data,
they usually run faster on the sets they can handle.
For a fair comparison of all the algorithms, and to exclude any
bias that would originate from the choice of the programming
language used by the different methods, we re-implemented all
the algorithms carefully as described by the articles. All the algo-
rithms have exactly the same input, output and functionalities,
which reflect a usual mapped reads/annotation comparison
study. First, strand is ignored (as many RNA-Seq data have
no strand information, and most algorithms, when described in
their original articles, do not deal with this case). Second, each
feature (hereafter a read or a transcript) is stored as a single
interval. If an overlap is detected, the transcript is extracted
from the input file (each method keeps track of the memory
address of the features) and a second comparison is performed
to check if the overlap is not located in the introns of the tran-
script, in which case the overlap is not reported. Last, the output
file is a GFF3 file, which contains the query intervals that overlap
with at least one reference element, and the list of the overlapping
elements is added in the tags of the ninth field. These
implementations, as well as the benchmark itself, are available in
the S–MART toolbox. See supplementary materials for more
information about these implementations.
Example on a real dataset. We downloaded three different
publicly available RNA-Seq datasets: on yeast, fly and cress
(available as SRR014335, SRR030228 and SRR346552 datasets
in GEO). We mapped the reads with Bowtie (Langmead
et al., 2009) on the reference genome and we compared the
mapped reads with the annotation (the genome sequence and
the annotations are both available from the Bowtie website).
For each dataset, we reported the number of annotated transcripts
(which are the reference intervals) as well as the
number of reads (the query intervals). We used the six different
algorithms previously mentioned. Run-time results are shown
in Table 1. The first columns give the characteristics of the
datasets: number of reads, number of transcripts and number
of overlaps. The following columns give the run-time spent by
the algorithms when the genes are the reference and the reads
are the query.
As expected, the ‘has’ and ‘fj’ algorithms usually perform
well on this dataset because the intervals are stored in memory.
Table 1. Characteristics of three real datasets, and run-time results (in
thousands of seconds) for the six algorithms
Dataset No. of
a The program aborted because it needed too much memory (>4 GB).
No. of transc., number of annotated transcripts, used as reference; No. of ov.,
number of overlaps.