Vincent Lemaire
Orange Labs · Orange Labs Research
PhD - HDR
About
219
Publications
96,418
Reads
1,399
Citations
Introduction
http://www.vincentlemaire-labs.fr/
Main topics: Machine Learning, Data Science, Neural Networks, Time Series Classification, Model Interpretation, ...
Additional affiliations
January 2004 - December 2018
Education
December 2008 - December 2008
October 1996 - September 1999
Publications (219)
The field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of “supervision deficiencies”. In WSL use cases, a variety of situations exists where the collected “information” is imperfect. The paradigm of WSL attempts to list and cover these problems with associated solutions...
In Novel Class Discovery (NCD), the goal is to find new classes in an unlabeled set given a labeled set of known but different classes. While NCD has recently gained attention from the community, no framework has yet been proposed for heterogeneous tabular data, despite being a very common representation of data. In this paper, we propose TabularNC...
Novel Class Discovery (NCD) is a growing field where we are given during training a labeled set of known classes and an unlabeled set of different classes that must be discovered. In recent years, many methods have been proposed to address this problem, and the field has begun to mature. In this paper, we provide a comprehensive survey of the state...
More and more applications require early decisions, i.e. taken as soon as possible from partially observed data. However, the later a decision is made, the more its accuracy tends to improve, since the description of the problem to hand is enriched over time. Such a compromise between the earliness and the accuracy of decisions has been particularl...
(This paper is now published in TMLR 2024). Mislabeled examples are ubiquitous in real-world machine learning datasets, advocating the development of techniques for automatic detection. We show that most mislabeled detection methods can be viewed as probing trained machine learning models using a few core principles. We formalize a modular framewor...
Recent research in machine learning has given rise to a flourishing literature on the quantification and decomposition of model uncertainty. This information can be very useful during interactions with the learner, such as in active learning or adaptive learning, and especially in uncertainty sampling. To allow a simple representation of these tota...
ml_edm is a Python 3 library designed for early decision making in any learning task involving temporal/sequential data. The package is also modular, providing researchers an easy way to implement their own triggering strategy for classification, regression or any machine learning task. As of now, many Early Classification of Time Serie...
Quantitative systems pharmacology (QSP) models of cancer immunity offer a mechanistic understanding of cellular dynamics and drug effects that are often challenging to investigate clinically. Despite their success, these models are limited by their inability to mechanistically represent patient survival as an output, which restricts their utility i...
Recent studies in active learning, particularly in uncertainty sampling, have focused on the decomposition of model uncertainty into reducible and irreducible uncertainties. In this paper, the aim is to simplify the computational process while eliminating the dependence on observations. Crucially, the inherent uncertainty in the labels is considere...
In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known...
The problem of novel class discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number o...
(Paper Accepted at IJCNN 2024) - There are now many comprehension algorithms for understanding the decisions of a machine learning algorithm. Among these are those based on the generation of counterfactual examples. This article proposes to view this generation process as a source of creating a certain amount of knowledge that can be stored to be u...
Time series segmentation (TSS) is a research problem that focuses on dividing long multivariate sensor data into smaller, homogeneous subsequences. This task is critical for various real-world data analysis applications, such as energy consumption monitoring, climate change assessment, and human activity recognition (HAR). Despite its importance, e...
The problem of Novel Class Discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number o...
Recent research in active learning, and more precisely in uncertainty sampling, has focused on the decomposition of model uncertainty into reducible and irreducible uncertainties. In this paper, we propose to simplify the computational phase and remove the dependence on observations, but more importantly to take into account the uncertainty already...
Training machine learning models from data with weak supervision and dataset shifts is still challenging. Designing algorithms when these two situations arise has not been explored much, and existing algorithms cannot always handle the most complex distributional shifts. We think the biquality data setup is a suitable framework for designing such a...
(Preprint version) - Variable selection or importance measurement of input variables to a machine learning model has become the focus of much research. It is no longer enough to have a good model, one also must explain its decisions. This is why there are so many intelligibility algorithms available today. Among them, Shapley value estimation algor...
Novel Class Discovery (NCD) is the problem of trying to discover novel classes in an unlabeled set, given a labeled set of different but related classes. The majority of NCD methods proposed so far only deal with image data, despite tabular data being among the most widely used type of data in practical applications. To interpret the results of clu...
The democratization of Data Mining has been widely successful thanks in part to powerful and easy-to-use Machine Learning libraries. These libraries have been particularly tailored to tackle Supervised Learning. However, strong supervision signals are scarce in practice, and practitioners must resort to weak supervision. In addition to weaknesses o...
This paper has been accepted at the workshop AIMLAI of ECML-PKDD 2023 - "Variable selection or importance measurement of input variables to a machine learning model has become the focus of much research. It is no longer enough to have a good model, one also must explain its decisions. This is why there are so many intelligibility algorithms availab...
In this paper we show that the combination of a Contrastive representation with a label noise-robust classification head requires fine-tuning the representation in order to achieve state-of-the-art performances. Since fine-tuned representations are shown to outperform frozen ones, one can conclude that noise-robust classification heads are indeed a...
This paper has been accepted at IJCNN 2023 - Time Series Classification (TSC) has received much attention in the past two decades and is still a crucial and challenging problem in data science and knowledge engineering. Indeed, along with the increasing availability of time series data, many TSC algorithms have been suggested by the research commun...
Proceedings of the TextMine 2023 workshop. The aim of this workshop is to bring together researchers working on the broad topic of text mining. The workshop aims to offer a meeting opportunity for academics and industry practitioners from the various communities of artificial intelligence, machine learning, processing...
In the field of Novel Class Discovery (NCD), the goal is to find new classes in an unlabeled set when a labeled set of known but different classes is available. Although NCD has recently attracted the attention of the scientific community, no solution has yet been proposed for tabular data,...
Hospitals face high occupancy rates, resulting in longer boarding times and more complex bed management. This task could be facilitated by anticipating unscheduled admissions. We study the capability of information from French electronic health records of an emergency department (ED) to predict patient disposition decisions.
We compare the per...
Presentation: a brief introduction to "weakly supervised learning", followed by a focus on "active learning".
For more details see: https://www.researchgate.net/publication/354719650_From_Weakly_Supervised_Learning_to_Biquality_Learning_an_Introduction
In this article, we propose a framework for seasonal time series probabilistic forecasting. It aims at forecasting (in a probabilistic way) the whole next season of a time series, rather than only the next value. Probabilistic forecasting consists in forecasting a probability distribution function for each future position. The proposed framework is...
This paper has been published in SIGKDD Explorations Newsletter (December 2022). More and more applications require early decisions, i.e. taken as soon as possible from partially observed data. However, the later a decision is made, the more its accuracy tends to improve, since the description of the problem to hand is enriched over time. Suc...
Learning to predict ahead of time events in open time series is challenging. Early Classification of Time Series (ECTS) tackles the problem of balancing online the accuracy of the prediction with the cost of delaying the decision when the individuals are time series of finite length with a unique label for the whole time series. Surprisingly,...
This article proposes an original and global view of Weakly Supervised Learning, leading to the design of generic approaches capable of handling any type of supervision weakness. A new framework called "Biquality Data" is introduced, which assumes that a small trusted dataset of correctly labeled examples is available...
This article proposes a method for automatically creating variables (for regression) that complement the information contained in the initial vector of explanatory variables. Our method works as a preprocessing step in which the continuous values of the variable to be regressed are discretized into a set of in...
Due to an ever-increasing demand for analyzing the large volumes of information issuing from high-speed data streams, multi-label stream classification is replacing the traditional offline multi-label classification system and has thus become a focal point in recent years. In this paper, we propose a new algorithm for multi-label stream classificat...
This paper proposes a method for the automatic creation of variables (in the case of regression) that complement the information contained in the initial input vector. The method works as a pre-processing step in which the continuous values of the variable to be regressed are discretized into a set of intervals which are then used to define value t...
In machine learning, the performance of a supervised model often depends on the volume of labeled data. Training a model on a large amount of data therefore requires labeling many observations and often calls for expertise that is costly in time and money. One solution is then to outsource the labeling work...
Many approaches have been proposed for early classification of time series in light of its significance in a wide range of applications including healthcare, transportation and finance. Until now, the early classification problem has been dealt with by considering only irrevocable decisions. This paper introduces a new problem called early and revoca...
Active learning is a subfield of machine learning that makes it possible to reduce the amount of data necessary to train a classifier. The training set is built iteratively so that only the most significant and informative data are used and labeled by an external person called the oracle. It is furthermore possible to use active learning with the theor...
Science, technology, and commerce increasingly recognise the importance of machine learning approaches for data-intensive, evidence-based decision making. This is accompanied by increasing numbers of machine learning applications and volumes of data. Nevertheless, the capacities of processing systems or human supervisors or domain experts remai...
This paper has been accepted at the IAL@ECML Workshop 2021 (https://www.activeml.net/ial2021/index.html) - "In this paper we show that the combination of a Contrastive representation with a label noise-robust classification head requires fine-tuning the representation in order to achieve state-of-the-art performances. Since fine-tuned repres...
https://arxiv.org/abs/2012.09632
(this paper has been accepted at IJCNN 2021)
The field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of "supervision deficiencies". In WSL use cases, a variety of situations exists where the collected "information" is imperfect. The parad...
Supervised learning of time series data has been extensively studied for the case of a categorical target variable. In some application domains, e.g., energy, environment and health monitoring, it occurs that the target variable is numerical and the problem is known as time series extrinsic regression (TSER). In the literature, some well-k...
Supervised learning of time series data has been extensively studied for the case of a categorical target variable. In some application domains, e.g., energy, environment and health monitoring, it occurs that the target variable is numerical and the problem is known as time series extrinsic regression (TSER). In the literature, some well-known time...
Many approaches have been proposed for early classification of time series in light of its significance in a wide range of applications including healthcare, transportation and finance. However, a preprint recently posted on arXiv claims that all research done for almost 20 years now on the Early Classification of Time Series is useless, or, at the v...
This talk gives a 'brief overview' of Weakly supervised learning. The choice was made to present things in a hierarchical way for simplicity because it is more 'didactic'. But the view via the cube on the last slide is more appropriate, more general.
For more details see: https://www.researchgate.net/publication/354719650_From_Weakly_Supervised_L...
This is a talk giving some insights into technical aspects of Khiops Interpretation, for the Inria team 'Lacodam' - March 2021
You may find other details about this tool at http://vincentlemaire-labs.fr/iki.html
It goes without saying that we have entered an era in which textual data in all its forms overwhelms each of us, whether in our personal or professional environment: the growing number of documents needed by companies or administrations, the profusion of textual data available via the Internet...
This book constitutes the refereed proceedings of the 6th ECML PKDD Workshop on Advanced Analytics and Learning on Temporal Data, AALTD 2021, held during September 13-17, 2021. The workshop was planned to take place in Bilbao, Spain, but was held virtually due to the COVID-19 pandemic.
The 12 full papers presented in this book were carefully review...
Supervised classification can be effective for prediction but sometimes weak on interpretability or explainability (XAI). Clustering, on the other hand, tends to isolate categories or profiles that can be meaningful but there is no guarantee that they are useful for labels prediction. Predictive clustering seeks to obtain the best of the two worlds...
This paper has been published at the Workshop IAL@ECML (http://ceur-ws.org/Vol-2660/ialatecml_paper3.pdf) -- Active learning aims to reduce annotation cost by predicting which samples are useful for a human expert to label. Although this field is quite old, several important challenges to using active learning in real-world settings still remain un...
Paper accepted at the Workshop "Data Quality Assessment for Machine Learning (DQAML)", SIGKDD 2021 ---
In some industrial applications such as fraud detection, common supervision techniques may not be efficient because they rely on the quality of labels. In concrete cases, these labels may be weak in quantity, quality or trustworthiness. We propose a ben...
https://arxiv.org/abs/2010.09621 (this paper has been accepted at IJCNN 2021). The field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of "supervision deficiencies", namely: poor quality, non adaptability, and insufficient quantity of labels. Regarding quality, label n...
Paper available here : https://link.springer.com/chapter/10.1007/978-3-030-59065-9_25 ---
Multivariate Time Series Classification (MTSC) has attracted increasing research attention in the past years due to its wide range of applications, e.g., action/activity recognition, EEG/ECG classification, etc. In this paper, we open a novel path to tackle wi...
Active learning aims to reduce annotation cost by predicting which samples are useful for a human expert to label. Although this field is quite old, several important challenges to using active learning in real-world settings still remain unsolved. In particular, most selection strategies are hand-designed, and it has become clear that there is no...
Multivariate Time Series Classification (MTSC) has attracted increasing research attention in the past years due to its wide range of applications, e.g., action/activity recognition, EEG/ECG classification, etc. In this paper, we open a novel path to tackle MTSC: a relational way. The multiple dimensions of MTS are represented in a relational d...
In some application areas, the ability to understand (describe) the results given by a classifier is as important a condition as its predictive performance. In this case, the classifier is considered important if it can produce comprehensible results with a good predictive performance. This is referred to as the trade-off “interpretation vs. perf...
This article presents a time series classification method that selects alternative representations (such as derivatives, cumulative integrals, the power spectrum) and extracts informative descriptors from them. The proposed approach is decomposed into three steps: i) the original time series are transformed...
We address the problem of event classification for proactive fiber break detection in high-speed optical communication systems. The proposed approach is based on monitoring the State of Polarization (SOP) via digital signal processing in a coherent receiver. We describe in detail the design of a classifier providing interpretable decision rules an...
This book constitutes the refereed proceedings of the 4th ECML PKDD Workshop on Advanced Analytics and Learning on Temporal Data, AALTD 2019, held in Würzburg, Germany, in September 2019.
The 7 full papers presented together with 9 poster papers were carefully reviewed and selected from 31 submissions. The papers cover topics such as temporal data...
This book constitutes the refereed proceedings of the 5th ECML PKDD Workshop on Advanced Analytics and Learning on Temporal Data, AALTD 2020, held in Ghent, Belgium, in September 2020.
The 15 full papers presented in this book were carefully reviewed and selected from 29 submissions. The selected papers are devoted to topics such as Temporal Data C...
Internship title: Comparison of weakly supervised learning methods in the case of fraud - Mission: The general context of the internship is weakly supervised classification [1] in the case of fraud (highly imbalanced classes and label noise). Many services provided by Orange may be subject to attempts...