About
128 publications
105,084 reads
10,576 citations
Publications (128)
This report addresses, from a machine learning perspective, a multi-class classification problem to predict the first deterioration level of a COVID-19 positive patient at the time of hospital admission. Socio-demographic features, laboratory tests and other measures are taken into account to learn the models. Our output is divided into 4 categorie...
Vibration analysis (VA) techniques have aroused great interest in the industrial sector over the last decades. In particular, VA is widely used to detect failures in rotary components such as rolling bearings and gears. In the present work, we propose a novel data-driven methodology to process vibration-related data, in order to detect rotat...
The COVID-19 pandemic is continuously evolving with drastically changing epidemiological situations which are approached with different decisions: from the reduction of fatalities to even the selection of patients with the highest probability of survival in critical clinical situations. Motivated by this, a battery of mortality prediction models wi...
Background
Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Most of the current feature selection methods are based on general univariate descriptors of the data such as the dispersion or the percentage of zeros. Despite the use of correction methods, the generality of these feature selection methods bias...
Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Triku is a feature selection method that favours genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the nearest neighbor graph. Triku efficiently recovers cell populations present in artifi...
In recent years, considerable research effort has been put into modelling the ever-growing data in streaming environments. Some of these efforts are related to streaming novelty detection. In this scenario, new classes may emerge, disappear, or drift over time while others are normally classified. Recent works model these events with non-paramet...
In recent years, a variety of research areas have contributed to a set of related problems with rare event, anomaly, novelty and outlier detection terms as the main actors. These multiple research areas have created a mix-up between terminology and problems. In some research, similar problems have been named differently; while in some other works,...
Plain Language Summary
Deep neural networks have recently demonstrated great versatility and an unprecedented capacity to model complex problems. In weather modeling, these algorithms have been applied to solve different problems. This is a promising area of research, given the availability of large volumes of weather data and increasingly powerful...
In regression, a predictive model which is able to anticipate the output of a new case is learnt from a set of previous examples. The output or response value of these examples used for model training is known. When learning with aggregated outputs, the examples available for model training are individually unlabeled. Collectively, the aggregated o...
Numerical Weather Prediction (NWP) models represent sub-grid processes using parameterizations, which are often complex and a major source of uncertainty in weather forecasting. In this work, we devise a simple machine learning (ML) methodology to learn parameterizations from basic NWP fields. Specifically, we demonstrate how encoder-decoder Convol...
Majority voting is a popular and robust strategy to aggregate different opinions in learning from crowds, where each worker labels examples according to their own criteria. Although it has been extensively studied in the binary case, its behavior with multiple classes is not completely clear, specifically when annotations are biased. This paper att...
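As a minimal illustration of the aggregation strategy this abstract refers to, the sketch below applies plain majority voting to multi-class crowd annotations. The function name, the toy labels, and the tie-breaking rule (first label encountered wins) are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most frequent label among the annotations of one example.

    Ties are resolved in favour of the label seen first, which is an
    arbitrary choice made for this sketch.
    """
    counts = Counter(annotations)
    return counts.most_common(1)[0][0]


# Hypothetical crowd labels: one inner list per training example,
# one entry per worker who annotated that example.
crowd_labels = [
    ["cat", "cat", "dog"],
    ["dog", "bird", "dog", "cat"],
    ["bird", "cat", "bird"],
]

aggregated = [majority_vote(labels) for labels in crowd_labels]
print(aggregated)
```

With more than two classes the winning label may hold only a plurality of the votes, which is one reason the multi-class behaviour is less clear-cut than the binary case.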
Classifying software defects according to any defined taxonomy is not straightforward. In order to be used for automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domain expert, the collected defects were categorized b...
This paper describes a suite of tools and a model for improving the accuracy of airport weather forecasts produced by numerical weather prediction (NWP) products, by learning from the relationships between previously modelled and observed data. This is based on a new machine learning methodology that allows circular variables to be naturally incorp...
In software engineering, associating each reported defect with a category allows, among many other things, for the appropriate allocation of resources. Although this classification task can be automated using standard machine learning techniques, the categorization of defects for model training requires expert knowledge, which is not always availab...
Since many important real-world classification problems involve learning from unbalanced data, the challenging class-imbalance problem has lately received considerable attention in the community. Most of the methodological contributions proposed in the literature carry out a set of experiments over a battery of specific datasets. In these cases, in...
The aim of this working model is to improve the profitability of tuna fishing, based on fish observation and route optimization, by reducing fuel consumption while maintaining catches. The surface fishing method that contributes most to the world's skipjack and yellowfin tuna fisheries is purse seine fishing...
Although a great methodological effort has been invested in proposing competitive solutions to the class-imbalance problem, little effort has been made in pursuing a theoretical understanding of this matter. In order to shed some light on this topic, we perform, through a novel framework, an exhaustive analysis of the adequateness of the most commo...
Probabilistic graphical model (PGM) types; Data format and pre-processing; Bayesian networks (BNs): structure and parameters; Bayesian network classifiers; Applications of Bayesian networks in environmental sciences; Sentiment analysis in social sciences using BNs; Multi-dimensional Bayesian network classifiers; Flexible classifiers; Inference diagrams...
Weakly supervised classification tries to learn from data sets which are not certainly labeled. Many problems, with different natures of partial labeling, fit this description. In this paper, the novel problem of learning from positive-unlabeled proportions is presented. The provided examples are unlabeled, and the only class information available...
Machine learning techniques have been previously used to assist clinicians to select embryos for human-assisted reproduction. This work aims to show how an appropriate modeling of the problem can contribute to improve machine learning techniques for embryo selection. In this study, a dataset of 330 consecutive cycles (and associated embryos) carrie...
During the last decades several learning algorithms have been proposed to learn probability distributions based on decomposable models. Some of these algorithms can be used to search for a maximum likelihood decomposable model with a given maximum clique size, k. Unfortunately, the problem of learning a maximum likelihood decomposable model given a...
In recent years, the performance of semisupervised learning (SSL) has been theoretically investigated. However, most of this theoretical development has focused on binary classification problems. In this paper, we take it a step further by extending the work of Castelli and Cover to the multiclass paradigm. In particular, we consider the key proble...
Standard supervised classification learns a classifier from a set of labeled examples. Alternatively, in the field of weakly supervised classification different frameworks have been presented where the training data cannot be certainly labeled. In this paper, the novel problem of learning from positive-unlabeled proportions is presented. The provid...
Performance assessment of a learning method related to its prediction ability on independent data is extremely important in supervised classification. This process provides the information to evaluate the quality of a classification model and to choose the most appropriate technique to solve the specific supervised classification problem at hand. T...
Learning from crowds is a classification problem where the provided training instances are labeled by multiple (usually conflicting) annotators. In different scenarios of this problem, straightforward strategies show an astonishing performance. In this paper, we characterize the crowd scenarios where these basic strategies show a good behavior. As...
Wind is one of the parameters best predicted by numerical weather models, as it can be directly calculated from the physical equations of pressure that govern its movement. However, local winds are considerably affected by topography, which global numerical weather models, due to their limited resolution, are not able to reproduce. To improve the s...
The effect of different factors (spawning biomass, environmental conditions) on recruitment is a subject of great importance in the management of fisheries, recovery plans and scenarios exploration. In this study, recently proposed supervised classification techniques, tested by the machine-learning community, are applied to forecast the recruitmen...
A fundamental question in the field of approximation algorithms, for a given problem instance, is the selection of the best (or a suitable) algorithm with regard to some performance criteria. A practical strategy for facing this problem is the application of machine learning techniques. However, limited support has been given in the literature to t...
This paper deals with a classification problem known as learning from label proportions. The provided dataset is composed of unlabeled instances and is divided into disjoint groups. General class information is given within the groups: the proportion of instances of the group that belong to each class. We have developed a method based on the Struct...
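To make the problem setting concrete, the sketch below computes the only class information a learner receives in this scenario: the proportion of each class within every group. The individual labels are used here solely to build the toy example; in the actual problem they are hidden. The function name and the data are assumptions for illustration, and the sketch does not implement the learning method developed in the paper.

```python
from collections import Counter


def label_proportions(groups, classes):
    """For each group, return the fraction of its instances in each class."""
    proportions = []
    for labels in groups:
        counts = Counter(labels)
        total = len(labels)
        proportions.append({c: counts[c] / total for c in classes})
    return proportions


# Hypothetical hidden labels for two disjoint groups of instances.
hidden_labels = [
    ["+", "+", "-", "-"],
    ["-", "-", "-", "+"],
]

props = label_proportions(hidden_labels, classes=["+", "-"])
print(props)
```

A method for this setting has to recover an instance-level classifier from the unlabeled instances and these per-group proportions alone.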
A multi-species approach to fisheries management requires taking into account the interactions between species in order to improve recruitment forecasting. Recent advances in Bayesian networks direct the learning of models with several interrelated variables to be forecasted simultaneously. These are known as multi-dimensional Bayesian network clas...
This work presents a multidimensional classifier described in terms of interaction factors called multidimensional k-interaction classifier. The classifier is based on a probabilistic model composed of the product of all the interaction factors of order lower or equal to k and it takes advantage of all the information contained in them. The propose...
Learning from crowds is a recently fashioned supervised classification framework where the true/real labels of the training instances are not available. However, each instance is provided with a set of noisy class labels, each indicating the class-membership of the instance according to the subjective opinion of an annotator. The additional challen...
Biclustering, one of the emerging techniques for analyzing DNA microarray data, is the search for subsets of genes and conditions which are coherently expressed. These subgroups provide clues about the main biological processes. Until now, different approaches to this problem have been proposed. Most of them use the mean...
In the information retrieval framework, there are problems where the goal is to recover objects of a particular class from big sets of unlabelled objects. In some of these problems, only examples from the class we want to recover are available. For such problems, the machine learning community has developed algorithms that are able to learn binary...
Sentiment Analysis is defined as the computational study of opinions, sentiments and emotions expressed in text. Within this broad field, most of the work has been focused on either Sentiment Polarity classification, where a text is classified as having positive or negative sentiment, or Subjectivity classification, in which a text is classified as...
Malignancies arising in the large bowel cause the second largest number of deaths from cancer in the Western world. Despite the progress made during the last decades, colorectal cancer remains one of the most frequent and deadly neoplasias in Western countries.
A genomic study of human colorectal cancer has been carried out on a total of 31 tumor...
Clinical sampled data. M = male; F = female.
New cohort of samples (clinical data).
This paper deals with the problem of multi-instance learning when label proportions are provided. In this classification problem, the instances of the dataset are divided into disjoint groups, where there is no certainty about the labels associated with individual samples. However, in each group the number of instances that belong to each class is...
Progress is continuously being made in the quest for stable biomarkers linked to complex diseases. Mass spectrometers are one of the devices for tackling this problem. The data profiles they produce are noisy and unstable. In these profiles, biomarkers are detected as signal regions (peaks), where control and disease samples behave differently. Mas...
Improving our ability to predict recruitment is a key element in fisheries management. However, the interactions between population dynamics and different environmental factors are complex and often non-linear, making it difficult to produce robust predictions. ‘Machine-learning’ techniques (in particular, supervised classification methods) have be...
The increase in the number and complexity of biological databases has raised the need for modern and powerful data analysis tools and techniques. In order to fulfill these requirements, the machine learning discipline has become an everyday tool in bio-laboratories. The use of machine learning techniques has been extended to a wide spectrum of bioi...
A proposed pipeline of supervised classification methods is applied to seven fish species of commercial interest in the Bay of Biscay.
‘Machine-learning’ techniques have been proposed as a useful tool to produce robust predictions. In this WD we apply to anchovy the methodology proposed in Fernandes et al. (2009) to build a robust classifier of recruitments and to make early predictions using climatic indices. The methodology consists of a ‘pipeline’ of state-of-the-art machine-le...
Distribution of TaqMan probes in the TaqMan Low Density Array (www.appliedbiosystem.com)
(0.05 MB XLS)
DCT data from the TLDA analysis. The data come from the different comparisons: MS (relapse and remitting) vs controls; relapse (Relap) vs controls; remitting (Remitt) vs controls; and relapse vs remitting
(0.32 MB DOC)
Target genes studied with their gene ID, the miRNA that binds to the gene, the group in which these genes are expected to be down-regulated and the Geneglobe Assay code.
(0.03 MB DOC)
Summary of the Panther software methods
(0.03 MB DOC)
Clinical description of the patients. Tev: time of evolution (years). EDSS: Expanded Disability Status Scale. Te: time between relapse onset and blood extraction (in days)
(0.03 MB DOC)
Complete data from the non-parametric statistical analysis
(0.15 MB XLS)
Complete list of the miRNA predicted targets
(0.05 MB XLS)
Data from the pathway analysis conducted by Panther with the predicted gene target lists from each miRNA. Two different groups of miRNAs were studied: those coming from the experiment and those coming from the chance group
(0.05 MB DOC)
Microarray-based global gene expression profiling, with the use of sophisticated statistical algorithms is providing new insights into the pathogenesis of autoimmune diseases. We have applied a novel statistical technique for gene selection based on machine learning approaches to analyze microarray expression data gathered from patients with system...
When learning Bayesian network-based classifiers, continuous variables are usually handled by discretization or assumed to follow a Gaussian distribution. This work introduces the kernel-based Bayesian network paradigm for supervised classification. This paradigm is a Bayesian network which estimates the true density of the continuous variab...
Differences in gene expression patterns have been documented not only in Multiple Sclerosis patients versus healthy controls but also in the relapse of the disease. Recently a new gene expression modulator has been identified: the microRNA or miRNA. The aim of this work is to analyze the possible role of miRNAs in multiple sclerosis, focusing on th...
Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics problems. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estima...
Zooplankton biomass and abundance estimation, based on surveys or time-series, is carried out routinely. Automated or semi-automated image analysis processes, combined with machine-learning techniques for the identification of plankton, have been proposed to assist in sample analysis. A difficulty in automated plankton recognition and classificatio...
The main purpose of a gene interaction network is to map the relationships of the genes that are out of sight when a genomic study is tackled. DNA microarrays allow the expression of thousands of genes to be measured at the same time. These data constitute the numeric seed for the induction of the gene networks. In this paper, we propose a new app...
10.1 Introduction. This chapter presents the paradigm known as the classification tree. In this paradigm, knowledge about the problem is represented by means of a tree structure, built through a recursive partitioning of the domain of definition of the predictor variables. The paradigm presented in this chap...
Limb-girdle muscular dystrophy type 2A (LGMD2A) is a recessive genetic disorder caused by mutations in calpain 3 (CAPN3). Calpain 3 plays different roles in muscular cells, but little is known about its functions or in vivo substrates. The aim of this study was to identify the genes showing an altered expression in LGMD2A patients and the possible...
Within the wide field of classification in the Machine Learning discipline, Bayesian classifiers are very well-established paradigms. They allow the user to work with probabilistic processes, as well as with graphical representations of the relationships among the variables of a problem.
Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques.
In this article, we make the interested...
We present a supervised wrapper approach to discretization. In contrast to many classical approaches, the discretization process is multivariate: all variables are discretized simultaneously, and the proposed discretization is evaluated with the Naive-Bayes classifier. The search for the optimal discretization is carried out as an optimization proc...
This work shows, using bivariate continuous artificial domains, the relation that seems to exist between some measures based on information theory and the expected classification error.
The relations that seem to be found in this work could be applied to the improvement of the classifiers which assign a posteriori probabilities to each class v...
Most of the Bayesian network-based classifiers are usually only able to handle discrete variables. However, most real-world domains involve continuous variables. A common practice to deal with continuous variables is to discretize them, with a subsequent loss of information. This work shows how discrete classifier induction algorithms can be adapte...
This article reviews machine learning methods for bioinformatics. It presents modelling methods, such as supervised classification, clustering and probabilistic graphical models for knowledge discovery, as well as deterministic and stochastic heuristics for optimization. Applications in genomics, proteomics, systems biology, evolution and text mini...
This is a nicely edited volume on Estimation of Distribution Algorithms (EDAs) by leading researchers on this important topic.
It covers a wide range of topics in EDAs, from theoretical analysis to experimental studies, from single objective to multi-objective optimisation, and from parallel EDAs to hybrid EDAs. It is a very useful book for everyon...
The transjugular intrahepatic portosystemic shunt (TIPS) is a treatment for cirrhotic patients with portal hypertension. A subgroup of patients dies within the first 6 months, while another subgroup survives for a long period of time. To date, no risk factors have been identified to determine how long a patient will survive. An empirical study for pred...
Introduction; Genetic Networks; Probabilistic Graphical Models; Inferring Genetic Networks by Means of Probabilistic Graphical Models; Conclusions; Acknowledgements; References