Borja Calvo

University of the Basque Country | UPV/EHU · Computer Sciences and Artificial Intelligence

PhD

About

55
Publications
10,472
Reads
1,473
Citations
Additional affiliations
September 2011 - present
University of the Basque Country
Position
  • Profesor Adjunto (Lecturer)
January 2007 - present
University of the Basque Country
Position
  • Universidad del País Vasco / Euskal Herriko Unibertsitatea
Education
September 2004 - November 2008
University of the Basque Country
Field of study
  • Computer Science
September 2001 - June 2004
University of the Basque Country
Field of study
  • Computer Science
September 1994 - June 1999
University of the Basque Country
Field of study
  • Biochemistry


Publications (55)
Article
Full-text available
Experimentation is an intrinsic part of research since it allows for collecting quantitative observations, validating hypotheses, and providing evidence for their reformulation. For that reason, experimentation must be coherent with the purposes of the research, properly addressing the relevant questions in each case. Unfortunately, the literature...
Article
In the field of optimization and machine learning, the statistical assessment of results has played a key role in conducting algorithmic performance comparisons. Classically, null hypothesis statistical tests have been used. However, recently, alternatives based on Bayesian statistics have shown great potential in complex scenarios, especially when...
Article
Probabilistic label learning is a challenging task that arises from recent real-world problems within the weakly supervised classification framework. In this task, algorithms have to deal with datasets where each instance is associated with a set of probabilities of belonging to the different class labels. In this paper, we propose a supervised univariate non-...
Article
Full-text available
Statistical tests are a powerful set of tools when applied correctly, but unfortunately their widespread misuse has caused great concern. Among many other applications, they are used in the detection of biomarkers, where the resulting p-values serve as a reference for ranking the candidate biomarkers. Although statistical tests can b...
Article
Full-text available
Increasingly, treatment decisions for cancer patients are being made from next-generation sequencing results generated from formalin-fixed and paraffin-embedded (FFPE) biopsies. However, this material is prone to sequence artefacts that cannot be easily identified. In order to address this issue, we designed a machine learning-based algorithm to id...
Article
Full-text available
The stability of feature subset selection algorithms has become crucial in real-world problems due to the need for consistent experimental results across different replicates. Specifically, in this paper, we analyze the reproducibility of ranking-based feature subset selection algorithms. When applied to data, this family of algorithms builds an or...
Article
Nowadays, machine learning algorithms can be found in many applications where classifiers play a key role. In this context, discretizing continuous attributes is a common step prior to classification tasks, the main goal being to retain as much discriminative information as possible. In this paper, we propose a supervised univariate non-para...
Conference Paper
The most commonly used statistics in Evolutionary Computation (EC) are of the Wilcoxon-Mann-Whitney-test type, in its either paired or non-paired version. However, using such statistics for drawing performance comparisons has several known drawbacks. At the same time, Bayesian inference for performance analysis is an emerging statistical tool, whic...
Article
The Mallows and Generalized Mallows Models are two of the most popular probability models for distributions on permutations. In this paper, we consider both models under the Hamming distance. These models can be seen as models for matchings instead of models for rankings. These models cannot be factorized, which contrasts with the popular MM and GMM...
Article
Full-text available
The analysis of high-dimensional data has become increasingly common. High-dimensional data sets can be built by gathering different independent data sets. However, each independent set can introduce its own bias. We can cope with this bias by introducing the observation set structure into our model. The goal of this article is to build the...
Article
In many current problems, the actual class of the instances, the ground truth, is unavailable. Instead, with the intention of learning a model, the labels can be crowdsourced by harvesting them from different annotators. In this work, among those problems we focus on those that are binary classification problems. Specifically, our main objective is...
Conference Paper
The statistical assessment of the empirical comparison of algorithms is an essential step in heuristic optimization. Classically, researchers have relied on the use of statistical tests. However, recently, concerns about their use have arisen and, in many fields, other (Bayesian) alternatives are being considered. For a proper analysis, different a...
Article
Full-text available
In evolutionary computation, it is common to use several instances of optimization problems to evaluate the performance of algorithms on those problems. Sometimes, instances of real problems are available, and so the set of instances used for experimentation is built from them. Unfortunately, this is generally not the case. Instances...
Article
Full-text available
The Mallows and Generalized Mallows models are compact yet powerful and natural ways of representing a probability distribution over the space of permutations. In this paper, we deal with the problems of sampling and learning such distributions when the metric on permutations is the Cayley distance. We propose new methods for both operations, and t...
Article
In the last decade, many works in combinatorial optimisation have shown that, due to the advances in multi-objective optimisation, the algorithms from this field could be used for solving single-objective problems as well. In this sense, a number of papers have proposed multi-objectivising single-objective problems in order to use multiobjective al...
Article
Full-text available
Luminal B breast tumors have aggressive clinical and biological features, and constitute the most heterogeneous molecular subtype, both clinically and molecularly. Unfortunately, the immunohistochemistry correlate of the luminal B subtype remains imprecise, and it has now become of paramount importance to define a classification scheme capabl...
Article
Full-text available
In this paper we present the R package PerMallows, which is a complete toolbox to work with permutations, distances and some of the most popular probability models for permutations: Mallows and the Generalized Mallows models. The Mallows model is an exponential location model, considered as analogous to the Gaussian distribution. It is based on the...
Article
Full-text available
Comparing the results obtained by two or more algorithms in a set of problems is a central task in areas such as machine learning or optimization. Drawing conclusions from these comparisons may require the use of statistical tools such as hypothesis testing. There are some interesting papers that cover this topic. In this manuscript we present scma...
Poster
Full-text available
This is a poster presenting the FrogMIS algorithm which was initially published in the proceedings of the ANTS 2014 conference, and later in the journal Swarm Intelligence in 2015.
Conference Paper
In the last decade, many works in combinatorial optimisation have shown that, due to the advances in multi-objective optimisation, the algorithms in this field could be used for solving single-objective problems. In this sense, a number of papers have proposed multi-objectivising single-objective problems in order to apply multi-objectivisation sch...
Article
Finding large (and generally maximal) independent sets of vertices in a given graph is a fundamental problem in distributed computing. Applications include, for example, facility location and backbone formation in wireless ad hoc networks. In this paper, we study a decentralized (or distributed) algorithm inspired by the calling behavior of male Ja...
Article
The combinatorial optimization problem tackled in this work is from the family of minimum weight rooted arborescence problems. The problem is NP-hard and has applications, for example, in computer vision and in multistage production planning. We describe an algorithm which makes use of a mathematical programming solver in order to find near-optimal...
Conference Paper
The problem of identifying a maximal independent (node) set in a given graph is a fundamental problem in distributed computing. It has numerous applications, for example, in wireless networks in the context of facility location and backbone formation. In this paper we study the ability of a bio-inspired, distributed algorithm, initially proposed fo...
Conference Paper
Full-text available
We propose a set of novel methodologies which enable valid statistical hypothesis testing when we have only positive and unlabelled (PU) examples. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we have 3 key contributions:...
Article
Objectives: Accumulating evidence indicates that aberrant DNA methylation is closely related to oral carcinogenesis, and it has been shown that methylation changes might be used as a prognostic biomarker in oral squamous cell carcinoma. Oral lichenoid disease (OLD) is the most common oral potentially malignant disorder in our region. The aim of this s...
Conference Paper
In this paper we propose a Beam-ACO approach for a combinatorial optimization problem known as the repetition-free longest common subsequence problem. Given two input sequences \(x\) and \(y\) over a finite alphabet \(\varSigma\), the problem is to find a longest common subsequence of \(x\) and \(y\) in which no letter is repeated. Beam-ACO...
Article
Biclustering, one of the emerging techniques for analyzing DNA microarray data, is the search for subsets of genes and conditions which are coherently expressed. These subgroups provide clues about the main biological processes. Until now, different approaches to this problem have been proposed. Most of them use the mean...
Article
In the information retrieval framework, there are problems where the goal is to recover objects of a particular class from big sets of unlabelled objects. In some of these problems, only examples from the class we want to recover are available. For such problems, the machine learning community has developed algorithms that are able to learn binary...
Conference Paper
An increasing number of data mining domains consider data that can be represented as permutations. Therefore, it is important to devise new methods to learn predictive models over datasets of permutations. However, maintaining probability distributions over the space of permutations is a hard task since there are n! permutations of n elements. The...
Article
Haplotype data are especially important in the study of complex diseases since they contain more information than genotype data. However, obtaining haplotype data is technically difficult and costly. Computational methods have proved to be an effective way of inferring haplotype data from genotype data. One of these methods, the haplotype inference...
Article
The increase in the number and complexity of biological databases has raised the need for modern and powerful data analysis tools and techniques. In order to fulfill these requirements, the machine learning discipline has become an everyday tool in bio-laboratories. The use of machine learning techniques has been extended to a wide spectrum of bioi...
Article
The feature subset selection (FSS) problem has a growing importance in many machine learning applications where the number of variables is very high. There are a great number of algorithms that can approach this problem in supervised databases, but when examples from one or more classes are not available, supervised FSS algorithms cannot be...
Data
Taqman probes distribution in the Taqman Low density array (www.appliedbiosystem.com) (0.05 MB XLS)
Data
DCT data from the TLDA analysis. The data come from the following comparisons: MS (relapse and remitting) vs controls; relapse (Relap) vs controls; remitting (Remitt) vs controls; and relapse vs remitting (0.32 MB DOC)
Data
Target genes studied with their gene ID, the miRNA that binds to the gene, the group in which these genes are expected to be down-regulated and the Geneglobe Assay code. (0.03 MB DOC)
Data
Summary of the Panther software methods (0.03 MB DOC)
Data
Clinical description of the patients. Tev: Time of evolution (years). EDSS: Expanded Disability Status Score. Te: Time from the relapse onset and the blood extraction (in days) (0.03 MB DOC)
Data
Complete data from the non-parametric statistical analysis (0.15 MB XLS)
Data
Complete list of the miRNA predicted targets (0.05 MB XLS)
Data
Data from the pathway analysis conducted by panther with the predicted gene target lists from each miRNA. Two different groups of miRNA were studied; coming from the experiment and coming from the chance group (0.05 MB DOC)
Article
Full-text available
Microarray-based global gene expression profiling, with the use of sophisticated statistical algorithms, is providing new insights into the pathogenesis of autoimmune diseases. We have applied a novel statistical technique for gene selection based on machine learning approaches to analyze microarray expression data gathered from patients with system...
Article
Full-text available
Differences in gene expression patterns have been documented not only in Multiple Sclerosis patients versus healthy controls but also in the relapse of the disease. Recently a new gene expression modulator has been identified: the microRNA or miRNA. The aim of this work is to analyze the possible role of miRNAs in multiple sclerosis, focusing on th...
Article
Full-text available
The development of techniques for oncogenomic analyses such as array comparative genomic hybridization, messenger RNA expression arrays and mutational screens has come to the fore in modern cancer research. Studies utilizing these techniques are able to highlight panels of genes that are altered in cancer. However, these candidate cancer genes mus...
Chapter
Within the wide field of classification in the machine learning discipline, Bayesian classifiers are very well-established paradigms. They allow the user to work with probabilistic processes as well as with graphical representations of the relationships among the variables of a problem.
Article
The positive unlabeled learning term refers to the binary classification problem in the absence of negative examples. When only positive and unlabeled instances are available, semi-supervised classification algorithms cannot be directly applied, and thus new algorithms are required. One of these positive unlabeled learning algorithms is the positiv...
Article
The discovery of the genes involved in genetic diseases is a very important step towards the understanding of the nature of these diseases. In-lab identification is a difficult, time-consuming task, where computational methods can be very useful. In silico identification algorithms can be used as a guide in future studies. Previous works in this to...
Article
This article reviews machine learning methods for bioinformatics. It presents modelling methods, such as supervised classification, clustering and probabilistic graphical models for knowledge discovery, as well as deterministic and stochastic heuristics for optimization. Applications in genomics, proteomics, systems biology, evolution and text mini...

Questions

Question (1)
Question
We have recently been involved in the analysis of sequencing data coming from Ion Torrent technology. We had previous experience in biological data analysis such as that coming from microarrays or mass spectrometry, but no experience with sequencing data.
After searching for information and "playing" a little with the data, we have seen that pre-processing the generated reads is a critical step before any downstream analysis. Although there are many tools and approaches, it is still not clear to me which pre-processing steps are the most appropriate. I guess the 'strictness' of the trimming and filtering should depend on the downstream analysis, but what are the criteria for trimming/filtering the data? What other steps (such as error correction, maybe) are worth considering? To what extent are these steps technology-independent (or, more interestingly, which ones depend on the technology, and how)?
Issues that I know or suspect are relevant (in the particular case of Ion Torrent data, though most likely they appear in other technologies) are:
  • Quality of the reads: this (obviously) decreases along the length of the read. Trimming the 3' tail is a good way to increase the average quality of the read, but I'm not sure it is enough. Are there other approaches beyond the simplistic one of cutting at the first position that crosses a given quality threshold? I'm also concerned about the distribution of quality values along the reads, which seems to be highly variable (high quality values followed by low ones). This may be related to the next issue.
  • Influence of homopolymers: I assume that all technologies have problems with this, but to what extent? What are good criteria for filtering on this issue (total number of homopolymers? maximum length?)
  • Contamination with adapters/barcodes: in my particular data (although I have also observed it in public data), the first 10-15 bases of some reads seem to show evidence of contamination with incorrectly removed adapters and/or barcodes. I have seen tools that perform this 'cleansing' step for data from other technologies, but what about Ion Torrent data? If there is no such tool, are there any suggestions on how we could use the adapter sequences to identify and remove this contamination?
To sum up, I need feedback about pre-processing steps (trimming, filtering and any other method) for sequencing data, particularly for data coming from Ion Torrent technology.
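To make the three issues above more concrete, they could be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions, not an established Ion Torrent pipeline: all function names, the window size, the quality and homopolymer thresholds, and the mismatch settings are hypothetical placeholders. Quality trimming uses a sliding-window mean rather than a single-position cut, so that one isolated low value does not end the read. The adapter step compares suffixes of the adapter with the start of the read and counts substitutions only, so insertions and deletions are not handled, which is a real limitation.

```python
import re


def trim_3prime_sliding_window(seq, quals, window=5, min_mean_q=20):
    """Cut the read at the start of the first window whose mean quality
    falls below min_mean_q (window and threshold are assumed values)."""
    n = len(seq)
    if n < window:
        # Too short for a full window: judge the whole read at once.
        keep = n if n and sum(quals) / n >= min_mean_q else 0
        return seq[:keep], quals[:keep]
    for start in range(n - window + 1):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base (0 for an empty read)."""
    runs = re.finditer(r"(.)\1*", seq)
    return max((len(m.group(0)) for m in runs), default=0)


def strip_5prime_adapter(seq, quals, adapter, max_mismatches=1, min_overlap=8):
    """Remove a leftover adapter tail from the 5' end of the read.

    Tries the longest adapter suffix first and accepts it if it matches the
    read prefix with at most max_mismatches substitutions (no indels).
    """
    for k in range(min(len(adapter), len(seq)), min_overlap - 1, -1):
        tail = adapter[-k:]
        mismatches = sum(a != b for a, b in zip(tail, seq[:k]))
        if mismatches <= max_mismatches:
            return seq[k:], quals[k:]
    return seq, quals


def preprocess_read(seq, quals, adapter, max_homopolymer=8):
    """Apply the three steps in order; return None if the read is discarded."""
    seq, quals = strip_5prime_adapter(seq, quals, adapter)
    seq, quals = trim_3prime_sliding_window(seq, quals)
    if not seq or longest_homopolymer(seq) > max_homopolymer:
        return None
    return seq, quals
```

The order of the steps (adapter removal, then quality trimming, then homopolymer filtering) is itself an assumption. Whether a hard cap on homopolymer length is a sensible filter, and how strict each threshold should be, would depend on the downstream analysis.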
