About
55 Publications
10,472 Reads
1,473 Citations
Introduction
Additional affiliations
September 2011 - present
January 2007 - present
Education
September 2004 - November 2008
September 2001 - June 2004
September 1994 - June 1999
Publications (55)
Experimentation is an intrinsic part of research since it allows for collecting quantitative observations, validating hypotheses, and providing evidence for their reformulation. For that reason, experimentation must be coherent with the purposes of the research, properly addressing the relevant questions in each case. Unfortunately, the literature...
In the field of optimization and machine learning, the statistical assessment of results has played a key role in conducting algorithmic performance comparisons. Classically, null hypothesis statistical tests have been used. However, recently, alternatives based on Bayesian statistics have shown great potential in complex scenarios, especially when...
Probabilistic label learning is a challenging task that arises from recent real-world problems within the weakly supervised classification framework. In this task, algorithms have to deal with datasets where each instance has an associated set of probabilities of belonging to different class labels. In this paper, we propose a supervised univariate non-...
Statistical tests are a powerful set of tools when applied correctly, but unfortunately their widespread misuse has caused great concern. Among many other applications, they are used in the detection of biomarkers, so as to use the resulting p-values as a reference with which the candidate biomarkers are ranked. Although statistical tests can b...
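The p-value-based ranking described above can be sketched on synthetic data. Everything here is an illustrative choice of mine (the two-sample Mann-Whitney test, the sample sizes, the injected effect), not taken from the paper:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic expression values: 20 candidate biomarkers measured in two
# patient groups. Biomarker 0 gets a real shift, so it should rank on top.
n_markers, n_per_group = 20, 30
cases = rng.normal(0.0, 1.0, size=(n_markers, n_per_group))
controls = rng.normal(0.0, 1.0, size=(n_markers, n_per_group))
cases[0] += 1.5  # injected group difference for biomarker 0

# Rank candidates by the p-value of a two-sample test, smallest first.
pvals = [mannwhitneyu(cases[i], controls[i]).pvalue for i in range(n_markers)]
ranking = sorted(range(n_markers), key=lambda i: pvals[i])
print(ranking[:5])
```

The point the abstract hints at is that such a ranking treats p-values as scores, which is exactly the kind of use that needs care.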
Increasingly, treatment decisions for cancer patients are being made from next-generation sequencing results generated from formalin-fixed and paraffin-embedded (FFPE) biopsies. However, this material is prone to sequence artefacts that cannot be easily identified. In order to address this issue, we designed a machine learning-based algorithm to id...
The stability of feature subset selection algorithms has become crucial in real-world problems due to the need for consistent experimental results across different replicates. Specifically, in this paper, we analyze the reproducibility of ranking-based feature subset selection algorithms. When applied to data, this family of algorithms builds an or...
Nowadays, machine learning algorithms can be found in many applications where the classifiers play a key role. In this context, discretizing continuous attributes is a common step previous to classification tasks, the main goal being to retain as much discriminative information as possible. In this paper, we propose a supervised univariate non-para...
The most commonly used statistics in Evolutionary Computation (EC) are of the Wilcoxon-Mann-Whitney-test type, in its either paired or non-paired version. However, using such statistics for drawing performance comparisons has several known drawbacks. At the same time, Bayesian inference for performance analysis is an emerging statistical tool, whic...
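For reference, the classical paired comparison the abstract mentions looks roughly like this; the per-instance numbers are made up for illustration, not results from the paper:

```python
from scipy.stats import wilcoxon

# Best objective values reached by two optimizers on 10 benchmark
# instances (made-up numbers; smaller is better).
alg_a = [12.1, 10.4, 11.8, 13.0, 9.7, 12.5, 11.1, 10.9, 12.8, 11.4]
alg_b = [12.9, 11.0, 12.4, 13.8, 10.1, 13.2, 11.9, 11.2, 13.5, 12.0]

# Paired Wilcoxon signed-rank test on the per-instance differences.
stat, p = wilcoxon(alg_a, alg_b)
print(f"W = {stat}, p = {p:.4f}")
```

Since alg_a is better on every instance, the test rejects; the Bayesian alternatives discussed in the abstract aim to say more than this binary reject/accept.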
The Mallows and Generalized Mallows Models are two of the most popular probability models for distributions on permutations. In this paper, we consider both models under the Hamming distance. These models can be seen as models for matchings instead of models for rankings. They cannot be factorized, which contrasts with the popular MM and GMM...
Nowadays, data analysis in high-dimensional settings has become widespread. High-dimensional datasets can be built by gathering different independent data sets. However, each independent set can introduce its own bias. We can cope with this bias by introducing the observation-set structure into our model. The goal of this article is to build the...
In many current problems, the actual class of the instances, the ground truth, is unavailable. Instead, with the intention of learning a model, the labels can be crowdsourced by harvesting them from different annotators. In this work, among those problems we focus on those that are binary classification problems. Specifically, our main objective is...
The statistical assessment of the empirical comparison of algorithms is an essential step in heuristic optimization. Classically, researchers have relied on the use of statistical tests. However, recently, concerns about their use have arisen and, in many fields, other (Bayesian) alternatives are being considered. For a proper analysis, different a...
In evolutionary computation, it is common to use several instances of optimization problems in order to evaluate the performance that algorithms achieve on those problems. Sometimes, instances of real-world problems are available, and so the set of instances for experimentation is built from them. Unfortunately, in general, this is not the case. Instances...
The Mallows and Generalized Mallows models are compact yet powerful and natural ways of representing a probability distribution over the space of permutations. In this paper, we deal with the problems of sampling and learning such distributions when the metric on permutations is the Cayley distance. We propose new methods for both operations, and t...
In the last decade, many works in combinatorial optimisation have shown that, due to the advances in multi-objective optimisation, the algorithms from this field could be used for solving single-objective problems as well. In this sense, a number of papers have proposed multi-objectivising single-objective problems in order to use multiobjective al...
Luminal B breast tumors have aggressive clinical and biological features, and constitute the most heterogeneous molecular subtype, both clinically and molecularly. Unfortunately, the immunohistochemistry correlate of the luminal B subtype remains imprecise, and it has now become of paramount importance to define a classification scheme capabl...
In this paper we present the R package PerMallows, which is a complete toolbox to work with permutations, distances and some of the most popular probability models for permutations: Mallows and the Generalized Mallows models. The Mallows model is an exponential location model, considered as analogous to the Gaussian distribution. It is based on the...
Comparing the results obtained by two or more algorithms in a set of problems is a central task in areas such as machine learning or optimization. Drawing conclusions from these comparisons may require the use of statistical tools such as hypothesis testing. There are some interesting papers that cover this topic. In this manuscript we present scma...
This is a poster presenting the FrogMIS algorithm which was initially published in the proceedings of the ANTS 2014 conference, and later in the journal Swarm Intelligence in 2015.
In the last decade, many works in combinatorial optimisation have shown that, due to the advances in multi-objective optimisation, the algorithms in this field could be used for solving single-objective problems. In this sense, a number of papers have proposed multi-objectivising single-objective problems in order to apply multi-objectivisation sch...
Finding large (and generally maximal) independent sets of vertices in a given graph is a fundamental problem in distributed computing. Applications include, for example, facility location and backbone formation in wireless ad hoc networks. In this paper, we study a decentralized (or distributed) algorithm inspired by the calling behavior of male Ja...
The combinatorial optimization problem tackled in this work is from the family of minimum weight rooted arborescence problems. The problem is NP-hard and has applications, for example, in computer vision and in multistage production planning. We describe an algorithm which makes use of a mathematical programming solver in order to find near-optimal...
The problem of identifying a maximal independent (node) set in a given graph is a fundamental problem in distributed computing. It has numerous applications, for example, in wireless networks in the context of facility location and backbone formation. In this paper we study the ability of a bio-inspired, distributed algorithm, initially proposed fo...
We propose a set of novel methodologies which enable valid statistical hypothesis testing when we have only positive and unlabelled (PU) examples. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we have 3 key contributions:...
Objectives
Accumulating evidence indicates that aberrant DNA methylation is closely related to oral carcinogenesis, and it has been shown that methylation changes might be used as prognostic biomarker in oral squamous cell carcinoma. Oral lichenoid disease (OLD) is the most common oral potentially malignant disorder in our region. The aim of this s...
In this paper we propose a Beam-ACO approach for a combinatorial optimization problem known as the repetition-free longest common subsequence problem. Given two input sequences \(x\) and \(y\) over a finite alphabet \(\varSigma \), this problem concerns to find a longest common subsequence of \(x\) and \(y\) in which no letter is repeated. Beam-ACO...
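The problem statement itself is easy to express in code. Below is a hypothetical brute-force checker of my own, exponential and usable only on tiny inputs; it is emphatically not the Beam-ACO method of the paper, just a way to see what a repetition-free longest common subsequence is:

```python
from itertools import combinations

def is_subsequence(s, t):
    """True if s is a (not necessarily contiguous) subsequence of t."""
    it = iter(t)
    return all(c in it for c in s)

def rf_lcs(x, y):
    """Repetition-free LCS of x and y by brute force: the longest string
    with all-distinct letters that is a subsequence of both inputs."""
    # Enumerate subsequences of x from longest to shortest.
    for k in range(len(x), 0, -1):
        for idx in combinations(range(len(x)), k):
            cand = "".join(x[i] for i in idx)
            if len(set(cand)) == k and is_subsequence(cand, y):
                return cand
    return ""

print(rf_lcs("appee", "apple"))
```

Note that the plain LCS of these two strings ("appe") repeats a letter, which is exactly what the repetition-free variant forbids.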
One of the emerging techniques for the analysis of DNA microarray data, known as biclustering, is the search for subsets of genes and conditions which are coherently expressed. These subgroups provide clues about the main biological processes. Until now, different approaches to this problem have been proposed. Most of them use the mean...
In the information retrieval framework, there are problems where the goal is to recover objects of a particular class from big sets of unlabelled objects. In some of these problems, only examples from the class we want to recover are available. For such problems, the machine learning community has developed algorithms that are able to learn binary...
An increasing number of data mining domains consider data that can be represented as permutations. Therefore, it is important to devise new methods to learn predictive models over datasets of permutations. However, maintaining probability distributions over the space of permutations is a hard task since there are n! permutations of n elements. The...
Haplotype data are especially important in the study of complex diseases since it contains more information than genotype data. However, obtaining haplotype data is technically difficult and costly. Computational methods have proved to be an effective way of inferring haplotype data from genotype data. One of these methods, the haplotype inference...
The increase in the number and complexity of biological databases has raised the need for modern and powerful data analysis tools and techniques. In order to fulfill these requirements, the machine learning discipline has become an everyday tool in bio-laboratories. The use of machine learning techniques has been extended to a wide spectrum of bioi...
The feature subset selection (FSS) problem has a growing importance in many machine learning applications where the number of variables is very high. There is a great number of algorithms that can approach this problem in supervised databases, but when examples from one or more classes are not available, supervised FSS algorithms cannot be...
Taqman probes distribution in the TaqMan Low Density Array (www.appliedbiosystem.com) (0.05 MB XLS)
DCT data from the TLDA analysis, covering the different comparisons: MS (relapse and remitting) vs. controls; relapse (Relap) vs. controls; remitting (Remitt) vs. controls; and relapse vs. remitting (0.32 MB DOC)
Target genes studied with their gene ID, the miRNA that binds to each gene, the group in which these genes are expected to be down-regulated, and the GeneGlobe assay code (0.03 MB DOC)
Summary of the PANTHER software methods (0.03 MB DOC)
Clinical description of the patients. Tev: time of evolution (years). EDSS: Expanded Disability Status Scale. Te: time from relapse onset to blood extraction (days) (0.03 MB DOC)
Complete data from the non-parametric statistical analysis (0.15 MB XLS)
Complete list of the miRNA predicted targets (0.05 MB XLS)
Data from the pathway analysis conducted by PANTHER with the predicted gene-target lists from each miRNA. Two different groups of miRNAs were studied: those coming from the experiment and those coming from the chance group (0.05 MB DOC)
Microarray-based global gene expression profiling, with the use of sophisticated statistical algorithms is providing new insights into the pathogenesis of autoimmune diseases. We have applied a novel statistical technique for gene selection based on machine learning approaches to analyze microarray expression data gathered from patients with system...
Differences in gene expression patterns have been documented not only in Multiple Sclerosis patients versus healthy controls but also in the relapse of the disease. Recently a new gene expression modulator has been identified: the microRNA or miRNA. The aim of this work is to analyze the possible role of miRNAs in multiple sclerosis, focusing on th...
The development of techniques for oncogenomic analyses such as array comparative genomic hybridization, messenger RNA expression arrays and mutational screens has come to the fore in modern cancer research. Studies utilizing these techniques are able to highlight panels of genes that are altered in cancer. However, these candidate cancer genes mus...
Within the wide field of classification in the Machine Learning discipline, Bayesian classifiers are very well established paradigms. They allow the user to work with probabilistic processes, as well as with graphical representations of the relationships among the variables of a problem.
The positive unlabeled learning term refers to the binary classification problem in the absence of negative examples. When only positive and unlabeled instances are available, semi-supervised classification algorithms cannot be directly applied, and thus new algorithms are required. One of these positive unlabeled learning algorithms is the positiv...
The discovery of the genes involved in genetic diseases is a very important step towards the understanding of the nature of these diseases. In-lab identification is a difficult, time-consuming task, where computational methods can be very useful. In silico identification algorithms can be used as a guide in future studies. Previous works in this to...
This article reviews machine learning methods for bioinformatics. It presents modelling methods, such as supervised classification, clustering and probabilistic graphical models for knowledge discovery, as well as deterministic and stochastic heuristics for optimization. Applications in genomics, proteomics, systems biology, evolution and text mini...
Questions
Question (1)
We have recently been involved in the analysis of sequencing data coming from Ion Torrent technology. We had previous experience in biological data analysis such as that coming from microarrays or mass spectrometry, but no experience with sequencing data.
After searching for information and "playing" a little bit with the data, what we have seen is that pre-processing the generated reads is a critical step before any downstream analysis. Although there are many tools and approaches, it is still not clear to me which pre-processing steps are the most appropriate. I guess the 'strictness' of the trimming and filtering should depend on the downstream analysis, but what are the criteria to trim/filter the data? What other steps (such as error correction, maybe) are interesting? To what extent are these steps technology-independent (or, more interestingly, which ones depend on the technology, and how)?
Issues that I know or guess are relevant (in the particular case of Ion Torrent data, though most likely they appear in other technologies too) are:
Quality of the reads - This (obviously) decreases with the length of the read. Trimming the 3' tail is a good approach to increase the average quality of the read, but I'm not quite sure it is enough. Are there other approaches beyond the simplistic one of cutting at the first position whose quality drops below a given threshold? I'm also concerned about the distribution of quality values along the reads, which seems to have high variability (high quality values followed by low quality ones). This may be related to the next issue.
Influence of homopolymers - I assume that all technologies have problems with this issue, but to what extent? What are good criteria to filter according to this issue (total number of homopolymers? maximum homopolymer length?)
Contamination with adapters/barcodes - In my particular data (although I have observed it in public data as well), I would say that in the first 10-15 bases of some reads there is evidence of contamination from incorrectly removed adapters and/or barcodes. I have seen tools that perform this 'cleansing' step for data from other technologies, but what about Ion Torrent data? If no such tool exists, any suggestion on how we could use the adapter sequences to identify and remove this contamination?
To sum up, I need feedback about pre-processing steps (trimming, filtering and any other method) for sequencing data, particularly for data coming from Ion Torrent technology.
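To make concrete the kind of steps I mean, here is a rough sketch in plain Python on a synthetic read. The window size, quality threshold and mismatch tolerance are arbitrary placeholders of mine, not recommendations:

```python
def trim_3prime(seq, quals, window=5, min_mean_q=20):
    """Trim the 3' tail: keep the read up to the end of the last
    sliding window whose mean quality is at least min_mean_q
    (a windowed variant of 'cut at the first low-quality base')."""
    keep = 0
    for i in range(len(seq) - window + 1):
        if sum(quals[i:i + window]) / window >= min_mean_q:
            keep = i + window
    return seq[:keep], quals[:keep]

def strip_adapter(seq, adapter, max_mismatches=1):
    """Remove adapter residue left at the 5' end of a read,
    tolerating a few mismatches in the alignment."""
    n = len(adapter)
    if len(seq) >= n:
        mismatches = sum(a != b for a, b in zip(seq[:n], adapter))
        if mismatches <= max_mismatches:
            return seq[n:]
    return seq

# Synthetic read whose quality decays towards the 3' end.
read = "ACGTACGTTTTT"
quals = [35, 35, 34, 33, 30, 28, 25, 22, 12, 10, 8, 5]
trimmed, _ = trim_3prime(read, quals)
print(trimmed)
print(strip_adapter("GATCGGACGT", "GATCGT"))  # adapter with one mismatch
```

Real trimmers do considerably more than this (partial adapter overlaps, both ends, paired reads), but this is the shape of the two operations I am asking about.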