Representing a static set of integers S, with \(|S| = n\), from a finite universe \(U = [1{..}u]\) is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of \(\mathsf {rank}\) and \(\mathsf {select}\) on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures. Although there is an information-theoretic lower bound of \({\mathcal {B}}(n, u)= \lg {u\atopwithdelims ()n}\) bits on the space needed to represent S, this bound applies to worst-case (random) sets S, whereas sets found in practical applications are compressible. We focus on the case where S contains runs of \(\ell >1\) consecutive elements, a case that occurs in many practical situations. Let \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\) denote the class of \({u\atopwithdelims ()n}\) distinct sets of \(n\) elements over the universe \([1{..}u]\). Let also \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\subset {\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\) contain the sets whose \(n\) elements are arranged in \(g \le n\) runs of \(\ell _i \ge 1\) consecutive elements from U for \(i=1,\ldots , g\), and let \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g,r}\subset {\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\) contain all sets that consist of g runs, of which exactly \(r \le g\) have at least 2 elements. This paper yields the following insights and contributions related to \(\mathsf {rank}\)/\(\mathsf {select}\) succinct data structures:
We introduce new compressibility measures for sets, including:
\({\mathcal {B}}_1(g,n,u)= \lg {|{\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}|} = \lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-1 \atopwithdelims ()g-1}}\), and
\({\mathcal {B}}_2(r, g, n,u)= \lg {|{\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g,r}|} =\lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-g-1 \atopwithdelims ()r-1}} + \lg {{g\atopwithdelims ()r}}\),
such that \({\mathcal {B}}_2(r, g, n,u)\le {\mathcal {B}}_1(g,n,u)\le {\mathcal {B}}(n, u)\).
We give data structures that use space close to the bounds \({\mathcal {B}}_1(g,n,u)\) and \({\mathcal {B}}_2(r, g, n,u)\) and support \(\mathsf {rank}\) and \(\mathsf {select}\) in \(\mathrm {O}(1)\) time.
We provide additional measures involving entropy-coding run lengths and gaps between items, and data structures to support \(\mathsf {rank}\) and \(\mathsf {select}\) using space close to these measures.
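As a sanity check on the closed forms behind \({\mathcal {B}}_1(g,n,u)\) and \({\mathcal {B}}_2(r, g, n,u)\), the following minimal Python sketch enumerates every \(n\)-element subset of a small universe, groups the subsets by their number of runs \(g\) and number of runs of length at least 2 \(r\), and compares the tallies with the binomial products. The function names (`run_lengths`, `count_b1`, `count_b2`, `brute_force_tally`) are hypothetical and chosen only for illustration. The sketch assumes that \(r\) counts exactly the runs with at least 2 elements, and it only evaluates the \({\mathcal {B}}_2\) product for \(r \ge 1\), since for \(r = 0\) the middle binomial has a negative lower index and its convention is not stated here.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2


def run_lengths(s):
    """Lengths of the maximal runs of consecutive integers in a sorted tuple s."""
    lengths = []
    for i, x in enumerate(s):
        if i > 0 and x == s[i - 1] + 1:
            lengths[-1] += 1   # x extends the current run
        else:
            lengths.append(1)  # x starts a new run
    return lengths


def count_b1(g, n, u):
    """Number of n-subsets of [1..u] with exactly g runs: 2 ** B_1(g, n, u)."""
    return comb(u - n + 1, g) * comb(n - 1, g - 1)


def count_b2(r, g, n, u):
    """Number of n-subsets with g runs, exactly r of them of length >= 2.

    Assumption of this sketch: only r >= 1 is handled.
    """
    if r < 1:
        raise ValueError("this sketch only handles r >= 1")
    if n - g < r:
        return 0  # too few elements to give r runs a second element
    return comb(u - n + 1, g) * comb(n - g - 1, r - 1) * comb(g, r)


def brute_force_tally(n, u):
    """Count the n-subsets of [1..u], keyed by (g, r)."""
    tally = Counter()
    for s in combinations(range(1, u + 1), n):
        lengths = run_lengths(s)
        g = len(lengths)
        r = sum(1 for length in lengths if length >= 2)
        tally[(g, r)] += 1
    return tally


if __name__ == "__main__":
    n, u = 5, 10
    tally = brute_force_tally(n, u)
    for g in range(1, n + 1):
        per_g = sum(c for (gg, _), c in tally.items() if gg == g)
        assert per_g == count_b1(g, n, u)
        for r in range(1, g + 1):
            assert tally[(g, r)] == count_b2(r, g, n, u)
    # Compare the three measures (in bits) for one choice of g and r.
    g, r = 2, 1
    print("B  =", log2(comb(u, n)))
    print("B1 =", log2(count_b1(g, n, u)))
    print("B2 =", log2(count_b2(r, g, n, u)))
```

For a fixed \(n\) and \(u\), summing `count_b1(g, n, u)` over all \(g\) recovers \({u\atopwithdelims ()n}\), which is one way to see why \({\mathcal {B}}_1(g,n,u)\le {\mathcal {B}}(n, u)\): each class \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\) is a subset of \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\).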