## About

63 Publications

8,536 Reads

806 Citations

## Publications

Publications (63)

The increase in the size of data repositories has forced the design of new computing paradigms to be able to process large volumes of data in a reasonable amount of time. One of them is in‐memory computing, which advocates storing all the data in main memory to avoid the disk I/O bottleneck. Compression is one of the key technologies for this appro...

The present work describes a statistical model to account for sequencing information of SARS-CoV-2 variants in wastewater samples. The model expresses the joint probability distribution of the number of genomic reads corresponding to mutations and non-mutations in every locus in terms of the variant proportions and the joint mutation distribution w...

MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how...

LiDAR devices are capable of acquiring clouds of 3D points reflecting any object around them, and adding additional attributes to each point such as color, position, time, etc. LiDAR datasets are usually large, and compressed data formats (e.g. LAZ) have been proposed over the years. These formats are capable of transparently decompressing portions...

Real-world point sets tend to be clustered, so using a machine word for each point is wasteful. In this paper we first show how a compact representation of quadtrees using O(1) bits per node can break this bound on clustered point sets, while offering efficient range searches. We then describe a new compact quadtree representation based on heavy-pa...

During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbor multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the...

Computing the product of the (binary) adjacency matrix of a large graph with a real-valued vector is an important operation that lies at the heart of various graph analysis tasks, such as computing PageRank. In this paper, we show that some well-known webgraph and social graph compression formats are computation-friendly, in the sense that they all...
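
The operation this abstract refers to can be sketched on plain adjacency lists. This is only an uncompressed baseline under assumed names (`matvec_rows`, `matvec_cols`, `adj`, `x` are all hypothetical); it does not reproduce the compressed formats the paper studies.

```python
from typing import List


def matvec_rows(adj: List[List[int]], x: List[float]) -> List[float]:
    """y = A x for a binary adjacency matrix A: y[u] sums x[v] over out-neighbours v of u."""
    return [sum(x[v] for v in nbrs) for nbrs in adj]


def matvec_cols(adj: List[List[int]], x: List[float]) -> List[float]:
    """y = A^T x: every edge u -> v adds x[u] to y[v] (the direction PageRank-style iterations use)."""
    y = [0.0] * len(adj)
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            y[v] += x[u]
    return y


# Toy graph (an assumption for illustration): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = [[1, 2], [2], [0]]
x = [1.0, 2.0, 3.0]
print(matvec_rows(adj, x))  # [5.0, 3.0, 1.0]
print(matvec_cols(adj, x))  # [3.0, 1.0, 3.0]
```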

We address the problem of representing dynamic graphs using k²-trees. The k²-tree data structure is one of the succinct data structures proposed for representing static graphs, and binary relations in general. It relies on compact representations of bit vectors. Hence, by relying on compact representations of dynamic bit vectors, we can also repres...

In this article, we present a system that combines indoor positioning with a compression algorithm for trajectories in the context of a nursing home. Our aim is to gather and effectively represent the location of the residents and caregivers over time, while allowing for efficient access to those data. We briefly show the system architecture that...

In the past decade, next-generation sequencing (NGS) enabled the generation of genomic data in a cost-effective, high-throughput manner. The most recent third-generation sequencing technologies produce longer reads; however, their error rates are much higher, which complicates the assembly process. This generates time- and space-demanding long-rea...

Compressing real-world graphs has many benefits such as improving or enabling the visualization in small memory devices, graph query processing, community search, and mining algorithms. This work proposes a novel compact representation for real sparse and clustered undirected graphs. The approach lists all the maximal cliques by using a fast algori...

Motivation:
RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterise the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where r...

We address the problem of representing dynamic graphs using k²-trees.
The k²-tree data structure is one of the succinct data structures proposed for representing static graphs, and binary relations in general.
It relies on compact representations of bit vectors.
Hence, by relying on compact representations of dynamic bit vectors, we can also repres...

Binary relations are commonly used in Computer Science for modeling data. In addition to classical representations using matrices or lists, some compressed data structures have recently been proposed to represent binary relations in compact space, such as the $k^2$-tree and the Binary Relation Wavelet Tree (BRWT). Knowing their storage needs, suppo...

Binary relations are commonly used in Computer Science for modeling data. In addition to classical representations using matrices or lists, some compressed data structures have recently been proposed to represent binary relations in compact space, such as the k²-tree and the Binary Relation Wavelet Tree (BRWT). Knowing their storage needs, supported...

In this work, we propose a framework to store and manage spatial data, which includes new efficient algorithms to perform operations accepting as input a raster dataset and a vector dataset. More concretely, we present algorithms for solving a spatial join between a raster and a vector dataset imposing a restriction on the values of the cells of th...

LiDAR devices obtain a 3D representation of a space. Due to the large size of the resulting datasets, there already exist storage methods that use compression and present some properties that resemble those of compact data structures. Specifically, the LAZ format allows access to a given datum or portion of the data without having to decompress the w...

In the analysis and design of algorithms and data structures, most researchers focus only on the space/time trade-off, and little attention has been paid to energy consumption. Moreover, most of the efforts in the field of Green Computing have been devoted to hardware-related issues, while green software is still in its infancy. Optimizing the...

We address the problem of representing dynamic graphs using $k^2$-trees. The $k^2$-tree data structure is one of the succinct data structures proposed for representing static graphs, and binary relations in general. It relies on compact representations of bit vectors. Hence, by relying on compact representations of dynamic bit vectors, we can also...

In this paper, we propose a compact data structure to store labeled attributed graphs based on the \(k^2\)-tree, which is a very compact data structure designed to represent a simple directed graph. The idea we propose can be seen as an extension of the \(k^2\)-tree to support property graphs. In addition to the static approach, we also propose a d...

Binary relations are commonly used to represent relationships between real-world objects. Classical representations for binary relations can be very space-consuming when the set of elements is large. In these cases, compressed representations, such as the k²-tree, have proven to be a competitive solution, as they are efficient in time while consumi...

BIRDS stands for "Bioinformatics and Information Retrieval Data Structures analysis and design" and is a 4-year project (2016–2019) that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941.
The overall goal of BIRDS is to establish a long term inte...

Author profiling consists in determining some demographic attributes (such as gender, age, nationality, language, religion, and others) of the author of a given document. This task, which has applications in fields such as forensics, security, or marketing, has been approached from different areas, especially from linguistics and natural language...

Compact data structures are storage structures that combine a compressed representation of the data and the access mechanisms for retrieving individual data without the need of decompressing from the beginning. The target is to be able to keep the data always compressed, even in main memory, given that the data can be processed directly in that for...

Computing the product of the adjacency (binary) matrix of a large graph with a real-valued vector is an important operation that lies at the heart of various graph analysis tasks, such as computing PageRank. In this paper we show that some well-known Web and social graph compression formats are computation-friendly, in the sense that they all...

In this paper, we present a methodology for off-line handwritten character recognition. The proposed methodology relies on a new feature extraction technique based on structural characteristics, histograms and profiles. As a novelty, we propose extracting eight new histograms and four profiles from the $32\times 32$ matrices that represent the...

Compact data structures combine in a unique data structure a compressed representation of the data and the structures to access such data. The target is to be able to manage data directly in compressed form, and in this way, to keep data always compressed, even in main memory. With this, we obtain two benefits: we can manage larger datasets in main...

For a given basis of a vector space L over a field K and a multiplication table which is defined by a multilinear map $[-,\dots,-]\colon L^{\times n}\to L$, we present an algorithm and develop a computer program in Mathematica in order to test if the given multiplication table corresponds to a Lie n-algebra or a non-Lie Leibniz n-algebra or neither. The algorithm is bas...

k²-trees have proved successful at representing, in a very compact way, different kinds of binary relations, such as web graphs, RDFs or raster data. In order to be a fully functional succinct representation for these domains, the k²-tree must support all the required operations for binary relations. In their original description, the authors includ...

Real-world point sets tend to be clustered, so using a machine word for each
point is wasteful. In this paper we first bound the number of nodes in the
quadtree for a point set in terms of the points' clustering. We then describe a
quadtree data structure that uses $\mathcal{O} (1)$ bits per node and supports
faster queries than previous structures...

The representation of large subsets of the World Wide Web in the form of a directed graph has been extensively used to analyze structure, behavior, and evolution of those so-called Web graphs. However, interesting Web graphs are very large and their classical representations do not fit into the main memory of typical computers, whereas the required...

The List-Update Problem is a well studied online problem with direct applications in data compression. Although the model proposed by Sleator & Tarjan has become the standard in the field for the problem, its applicability in some domains, and in particular for compression purposes, has been questioned. In this paper, we focus on two alternative mo...

We present a new variable-length encoding scheme for sequences of integers, Directly Addressable Codes (DACs), which enables direct access to any element of the encoded sequence without the need of any sampling method. Our proposal is a kind of implicit data structure that introduces synchronism in the encoded sequence without using asymptotically...
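
The idea of chunked levels linked by bitmaps can be illustrated with a small sketch. The function names (`dac_encode`, `dac_access`), the fixed chunk width `b`, and the linear-time rank via `sum` are all assumptions made for brevity, and values are assumed non-negative; this is not the authors' implementation.

```python
from typing import List, Tuple

Level = Tuple[List[int], List[int]]  # (chunks, continuation bitmap)


def dac_encode(values: List[int], b: int = 2) -> List[Level]:
    """Split each non-negative value into b-bit chunks, least significant first.

    Level k stores the k-th chunk of every value that has one, plus a bitmap
    whose bit is 1 when that value continues on level k + 1.
    """
    mask = (1 << b) - 1
    levels: List[Level] = []
    current = list(values)
    while current:
        chunks = [v & mask for v in current]
        rest = [v >> b for v in current]
        cont = [1 if r > 0 else 0 for r in rest]
        levels.append((chunks, cont))
        current = [r for r in rest if r > 0]
    return levels


def dac_access(levels: List[Level], i: int, b: int = 2) -> int:
    """Decode the i-th value alone, without touching the other values' chunks."""
    value, shift = 0, 0
    for chunks, cont in levels:
        value |= chunks[i] << shift
        if not cont[i]:
            break
        # Rank: the number of continuing values before position i gives the
        # position of this value on the next level (naive linear scan here).
        i = sum(cont[:i])
        shift += b
    return value


values = [5, 1, 12, 0, 300]
levels = dac_encode(values, b=2)
print([dac_access(levels, i, b=2) for i in range(len(values))])  # [5, 1, 12, 0, 300]
```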

Word-based byte-oriented compression has succeeded on large natural language text databases, by providing competitive compression ratios, fast random access, and direct sequential searching. We show that by just rearranging the target symbols of the compressed text into a tree-shaped structure, and using negligible additional space, we obtain a new...

Current processors include instruction set extensions especially designed for improving the performance of media, imaging, and 3D workloads. These instructions are rarely considered when implementing practical solutions for algorithms and compressed data structures, mostly because they are not directly generated by the compiler. In this paper, we p...

We propose an algorithm using Gröbner bases that decides in terms of the existence of a nonsingular matrix P if two Leibniz algebra structures over a finite dimensional $\mathbb{C}$-vector space are representative of the same isomorphism class. We apply this algorithm in order to obtain a reviewed classification of the 3-dimensional Leibniz algebras given by...

In this paper we focus on representing Web and social graphs. Our work is motivated by the need of mining information out of these graphs, thus our representations do not only aim at compressing the graphs, but also at supporting efficient navigation. This allows us to process bigger graphs in main memory, avoiding the slowdown brought by resorting...

Graph databases have emerged as an alternative data model with applications in many complex domains. Typically, the problems to be solved in such domains involve managing and mining huge graphs. The need for efficient processing in such applications has motivated the development of methods for graph compression and indexing. However, most methods...

In this paper, we focus on the problem of preserving data confidentiality when sharing data for clustering. This problem poses new challenges for novel uses of privacy preserving data mining (PPDM) techniques. Specifically, this paper considers synthetic data generation as a way to preserve data privacy. One of the state of the art...

Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of r strings of total length n and an error rate ε, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k = ⌈εℓ⌉, where ℓ is the length of the overlap. We propose new solutions for this probl...

Synthetic data generators are one of the methods used in privacy preserving data mining for ensuring the privacy of the individuals when their data are published. Synthetic data generators construct artificial data from some models obtained from the original data. Such models are mainly based on statistics and, typically, do not take into account o...

We introduce a symbol reordering technique that implicitly synchronizes variable-length codes, such that it is possible to directly access the i-th codeword without need of any sampling method. The technique is practical and has many applications to the representation of ordered sets, sparse bitmaps, partial sums, and compressed data structures f...

This paper presents a Web graph representation based on a compact tree
structure that takes advantage of large empty areas of the adjacency
matrix of the graph. Our results show that our method is competitive
with the best alternatives in the literature, offering a very good
compression ratio (3.3-5.3 bits per link) while permitting fast
navigation...
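
A tree over the adjacency matrix that prunes empty areas can be sketched as follows. The sketch assumes the k²-tree layout named in other abstracts of this list (a level-order bitmap `T` for internal nodes and `L` for the last level), a matrix side that is a power of `k`, and a linear-time rank; `build_k2tree` and `k2tree_get` are hypothetical names, and the paper's actual structure may differ in its details.

```python
from collections import deque
from typing import List, Tuple


def build_k2tree(matrix: List[List[int]], k: int = 2) -> Tuple[List[int], List[int]]:
    """Return (T, L): level-order bits of internal nodes and of last-level cells.

    Assumes a square matrix whose side is a power of k (and at least k).
    Empty submatrices get a single 0 bit and are not subdivided further.
    """
    T: List[int] = []
    L: List[int] = []
    queue = deque([(0, 0, len(matrix))])  # (top row, left column, side) of non-empty areas
    while queue:
        r0, c0, size = queue.popleft()
        child = size // k
        for i in range(k):
            for j in range(k):
                r, c = r0 + i * child, c0 + j * child
                nonempty = any(
                    matrix[y][x]
                    for y in range(r, r + child)
                    for x in range(c, c + child)
                )
                if child == 1:
                    L.append(1 if nonempty else 0)
                else:
                    T.append(1 if nonempty else 0)
                    if nonempty:
                        queue.append((r, c, child))
    return T, L


def k2tree_get(T: List[int], L: List[int], n: int, row: int, col: int, k: int = 2) -> int:
    """Return the cell (row, col) of the n x n matrix by descending from the root."""
    size, base = n, 0  # base: index of the first child of the current node
    while True:
        child = size // k
        pos = base + (row // child) * k + (col // child)
        bit = T[pos] if pos < len(T) else L[pos - len(T)]
        if bit == 0 or child == 1:
            return bit
        row, col, size = row % child, col % child, child
        # Children of the node at T[pos] start at rank1(T, pos) * k^2.
        base = sum(T[: pos + 1]) * k * k


matrix = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
T, L = build_k2tree(matrix)
print(T, L)  # [1, 0, 0, 1] [0, 1, 0, 0, 1, 1, 0, 1]
print(k2tree_get(T, L, 4, 2, 3))  # 1
```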

In this paper, we study different approaches for rank and select on sequences of bytes and propose new implementation strategies. An extensive experimental evaluation comparing the efficiency of the different alternatives is provided. Given a sequence of bits, a rank query counts the number of occurrences of the bit 1 up to a given position, and a sel...
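
The two queries can be pinned down with a naive linear-scan baseline. It assumes 0-based positions, an inclusive rank, and the standard definition of select (position of the j-th occurrence); the names `rank` and `select` and their signatures are chosen here for illustration, and none of the paper's implementation strategies are reproduced.

```python
from typing import Sequence


def rank(seq: Sequence[int], symbol: int, i: int) -> int:
    """Number of occurrences of `symbol` in seq[0..i] (inclusive, 0-based)."""
    return sum(1 for s in seq[: i + 1] if s == symbol)


def select(seq: Sequence[int], symbol: int, j: int) -> int:
    """0-based position of the j-th occurrence of `symbol` (j >= 1), or -1 if absent."""
    seen = 0
    for pos, s in enumerate(seq):
        if s == symbol:
            seen += 1
            if seen == j:
                return pos
    return -1


bits = [1, 0, 1, 1, 0, 1]
print(rank(bits, 1, 3))    # 3
print(select(bits, 1, 4))  # 5

text = b"abracadabra"      # a sequence of bytes works the same way
print(rank(text, ord("a"), 10))   # 5
print(select(text, ord("a"), 2))  # 3
```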

Given a basis of a vector space V over a field $\mathbb{K}$ and a multiplication table which defines a bilinear map on V, we develop a computer program in Mathematica which checks if the bilinear map satisfies the Leibniz identity, that is, if the multiplication table endows V with a Leibniz algebra structure. In case of a positive answer,...

Abstract: Reducing storage space and transfer time has become a fundamental concern in textual databases. This work presents a new compressor, called word-oriented PPM (SWPPM), which applies PPM's own statistical models using as input symbols the words...

In the research field of Geographic Information Systems (GIS), a cooperative effort has been undertaken by several international
organizations to define standards and specifications for interoperable systems. The Web Processing Service (WPS) is one of
the most recent specifications of the Open Geospatial Consortium (OGC). It is designed to standard...

Recent research has demonstrated beyond doubt the benefits of compressing natural language texts using word-based statistical semistatic compression. Not only does it achieve extremely competitive compression rates, but also direct search on the compressed text can be carried out faster than on the original text; indexing based on inverted lists benef...

Data protection mechanisms need to find a trade-off between information loss and disclosure risk. To this end, information loss and disclosure risk measures have been developed. Because it is usually unknown, when data are published, which kinds of analyses a user will pursue with the data, generic information loss measures are used to ana...

Masking methods protect databases prior to their public release. They mask an original data file so that the new file ensures the privacy of data respondents. Information loss measures have been developed to evaluate to what extent the masked file diverges from the corresponding original file, and to what extent the same analyses on both f...

The development of applications that manage large text collections needs indexing methods which allow efficient retrieval over text. Several indexes have been proposed which try to reach a good trade-off between the space needed to store both the text and the index, and its search efficiency. Self-indexes are becoming more and more popular. Not onl...

Abstract: Traditionally, the research communities in Information Retrieval (IR) and Language Engineering (LE) have usually based their research on uncompressed documents indexed, where applicable, with classical indexes such as inverted indexes. On the other hand, the research community in...

## Projects

Project (1)

The overall goal of BIRDS is to establish a long-term international network involving leading researchers in bioinformatics and information retrieval from four different continents, to strengthen the partnership through the exchange of knowledge and expertise, and to develop integrated approaches that improve on current approaches in both fields. It will be implemented through staff exchanges, in addition to summer schools, workshops and conferences to facilitate knowledge sharing between members of the partnership. We will also bring research results to market, thanks to cooperation with an innovative SME software development company based in Europe.