Fabrice Clerot

Orange Labs · Innovation, Marketing and Technologies

Engineer

About

171
Publications
443,651
Reads
2,160
Citations
Introduction
There is no way, however, in which the individual can avoid the burden of responsibility for his own evaluations. The key cannot be found that will unlock the enchanted garden, wherein, among the fairy-rings and the shrubs of magic wands, beneath the trees laden with monads and noumena, blossom forth the flowers of Probabilitas realis. (Bruno de Finetti, Theory of Probability, Wiley, 1974)
Additional affiliations
January 1988 - present
CNET --> France Télécom R&D --> Orange Labs
Position
  • Orange Labs
Description
  • the partition function is the shortest path from semiconductor physics to data-mining


Publications (171)
Preprint
Full-text available
More and more applications require early decisions, i.e. decisions taken as soon as possible from partially observed data. However, the later a decision is made, the more its accuracy tends to improve, since the description of the problem at hand is enriched over time. Such a compromise between the earliness and the accuracy of decisions has been particularl...
Conference Paper
Paper available here: https://link.springer.com/chapter/10.1007/978-3-030-59065-9_25 --- Multivariate Time Series Classification (MTSC) has attracted increasing research attention in the past years due to its wide range of applications, e.g., action/activity recognition, EEG/ECG classification, etc. In this paper, we open a novel path to tackle wi...
Chapter
Multivariate Time Series Classification (MTSC) has attracted increasing research attention in the past years due to its wide range of applications, e.g., action/activity recognition, EEG/ECG classification, etc. In this paper, we open a novel path to tackle MTSC: a relational way. The multiple dimensions of MTS are represented in a relational d...
Chapter
Full-text available
We address the problem of event classification for proactive fiber break detection in high-speed optical communication systems. The proposed approach is based on monitoring the State of Polarization (SOP) via digital signal processing in a coherent receiver. We describe in detail the design of a classifier providing interpretable decision rules an...
Conference Paper
Full-text available
We address the problem of event classification for proactive fiber break detection in high-speed optical communication systems. The proposed approach is based on monitoring the State of Polarization (SOP) via digital signal processing in a coherent receiver. We describe in detail the design of a classifier providing interpretable decision rules...
Article
Full-text available
Addressing Answer Selection (AS) tasks with complex neural networks typically requires a large amount of annotated data to increase the accuracy of the models. In this work, we are interested in simple models that can potentially give good performance on datasets with no or few annotations. First, we propose new unsupervised baselines that leverage...
Conference Paper
Full-text available
Addressing Question Answering (QA) tasks with complex neural networks typically requires a large amount of annotated data to achieve a satisfactory accuracy of the models. In this work, we are interested in simple models that can potentially give good performance on datasets with no or few annotations. First, we propose new unsupervised baselines t...
Conference Paper
Full-text available
In this article, we detail a methodology for anonymizing mobile call trajectories, with the aim of publishing individual data. The risk we want to protect against is the re-identification risk. To this end, we automatically generate synthetic trajectories from a set ré...
Chapter
Full-text available
Co-clustering is a data mining technique used to extract the underlying block structure between the rows and columns of a data matrix. Many approaches have been studied and have shown their capacity to extract such structures in continuous, binary or contingency tables. However, very little work has been done to perform co-clustering on mixed type...
Chapter
Full-text available
Co-clustering is a class of unsupervised data analysis techniques that extract the existing underlying dependency structure between the instances and variables of a data table as homogeneous blocks. Most of those techniques are limited to variables of the same type. In this paper, we propose a mixed data co-clustering method based on a two-step met...
Preprint
Full-text available
We propose a MAP Bayesian approach to perform and evaluate a co-clustering of mixed-type data tables. The proposed model infers an optimal segmentation of all variables then performs a co-clustering by minimizing a Bayesian model selection cost function. One advantage of this approach is that it is user parameter-free. Another main advantage is the...
Conference Paper
We propose a methodology to anonymize microdata (i.e. a table of n individuals described by d attributes). The goal is to be able to release an anonymized data table built from the original data while meeting the differential privacy requirements. The proposed solution combines co-clustering with synthetic data generation to produce anonymized data...
Conference Paper
Full-text available
We propose a methodology to anonymize microdata (i.e. a table of n individuals described by d attributes). The goal is to be able to release an anonymized data table built from the original data that protects against the re-identification risk. The proposed solution combines co-clustering with synthetic data generation to produce anonymized data. C...
Conference Paper
Full-text available
Sequence classification has become a fundamental problem in data mining and machine learning. Feature based classification is one of the techniques that has been used widely for sequence classification. Mining sequential classification rules plays an important role in feature based classification. Despite the abundant literature in this area, minin...
Article
Full-text available
Sequential data are generated in many domains of science and technology. Although many studies have been carried out for sequence classification in the past decade, the problem is still a challenge, particularly for pattern-based methods. We identify two important issues related to pattern-based sequence classification, which motivate the present w...
Presentation
Full-text available
This presentation is a brief introduction (a short course) to data mining, given in-house at Orange Labs.
Conference Paper
Full-text available
In this article, we propose a methodology for anonymizing a multidimensional data table containing individual data (i.e. n individuals described by m variables). The goal is to publish an anonymous table, built from an initial table, that protects against the re-identification risk. In other words, we do not...
Conference Paper
Full-text available
Co-clustering is an unsupervised analysis technique that extracts the underlying structure existing between the individuals and the variables of a data table in the form of homogeneous blocks. Since this technique is limited to variables of the same type, either numerical or categorical, we propose to extend it by propos...
Conference Paper
Full-text available
We suggest a novel way for exploratory topic segmentation based on data grid models. In this context, a text can be represented as a data set of two-dimensional points; each point is defined by two variables: a word (categorical value) and the placement of the word in the text (numerical value). Instantiating data grid models to the 2D-points turns...
Poster
Full-text available
We formalize the asynchronous multi-armed bandits with known trend problem (AMABKT) and propose a few empirical solutions, the most efficient one being based on finite-horizon Gittins indices.
Conference Paper
Full-text available
This article addresses the construction of a naive Bayes classifier in a context of vertical data partitioning. The explanatory variables are held by one party, the class by another, and neither party wishes to grant access to its individual data for building the classifier. We propose an approach...
Article
Full-text available
We exploit the Minimum Description Length (MDL) principle as a model selection technique for Bernoulli distributions and compare several types of MDL codes. We first present a simplistic crude two-part MDL code and a Normalized Maximum Likelihood (NML) code. We then focus on the enumerative two-part crude MDL code, suggest a Bayesian interpretation...
Conference Paper
Full-text available
To address the contextual bandit problem, we propose an online random forest algorithm. The analysis of the proposed algorithm is based on the sample complexity needed to find the optimal decision stump. Then, the decision stumps are recursively stacked in a random collection of decision trees, BANDIT FOREST. We show that the proposed algorithm is...
Conference Paper
Full-text available
In exploratory analysis, identifying and visualizing interactions between variables in large databases is a challenge (Dhillon et al., 2003; Kolda and Sun, 2008). We present Khiops CoViz, a tool for visually exploring the important relationships between two (or more) variables, whether they are categor...
Presentation
Full-text available
Bandit Forest and use cases
Conference Paper
Full-text available
We study the K-armed dueling bandit problem which is a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose a new algorithm called Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) to handle the adversarial utility-ba...
Conference Paper
Full-text available
This article addresses the construction of a classifier in a context of vertical data partitioning. The explanatory variables are held by one party, the class by another, and neither party wishes to grant access to its individual data for building the classifier. We propose an approach in which...
Article
Full-text available
We suggest a novel method of clustering and exploratory analysis of temporal event sequences data (also known as categorical time series) based on three-dimensional data grid models. A data set of temporal event sequences can be represented as a data set of three-dimensional points, each point is defined by three variables: a sequence identifier, a...
Article
To address the contextual bandit problem, we propose an online decision tree algorithm. We show that the proposed algorithm, KMD-Tree, incurs an expected cumulated regret in the order of O(log T) against the greedy decision tree built knowing the joint distribution of contexts and rewards. We show that this problem dependent regret bound is optimal...
Conference Paper
Full-text available
Call Detail Records (CDRs) are data recorded by telecommunications companies, consisting of basic information related to several dimensions of the calls made through the network: the source, destination, date and time of calls. CDR data analysis has received much attention in recent years since it might reveal valuable information about human...
Conference Paper
Full-text available
In exploratory analysis, identifying and visualizing interactions between variables in large databases is a challenge (Dhillon et al., 2003; Kolda and Sun, 2008). We present Khiops CoViz, a tool for visually exploring the important relationships between two (or more) variables, whether they are categor...
Patent
Full-text available
A method of assisting with the construction of a tree of clusters of electronic documents, the documents being defined by predetermined attributes. The method includes, for a given cluster of documents and a given level of the tree, the following steps: a) obtaining (E300) constraints defined between at least two documents of said cluster and stori...
Conference Paper
Identifying and visually analyzing interesting interactions between variables in large-scale data sets through k-coclustering is of high importance. We present Khiops CoViz, a tool for visual analysis of interesting relationships between two or more variables (categorical and/or numerical). The visualization of k variables coclustering takes the f...
Conference Paper
Full-text available
We describe our submission to the AAIA’14 Data Mining Competition, where the objective was to reach good predictive performance on text mining classification problems while using a small number of variables. Our submission was ranked 6th, less than 1% behind the winner. We also present an empirical study on the trade-off between parsimony of the re...
Conference Paper
Full-text available
We propose a new method for clustering and analyzing temporal sequences based on three-dimensional grid models. The sequences are partitioned into clusters, the time dimension is discretized into intervals, and the event dimension is partitioned into groups. The grid of 3D cells thus forms an estimator that is non...
Article
Full-text available
Measurements from an Internet backbone link carrying TCP traffic towards different ADSL areas are analyzed in this paper. For traffic analysis, we adopt a flow based approach and the popular mice/elephants dichotomy. The originality of the experimental data reported in this paper, when compared with previous measurements from very high speed backbo...
Conference Paper
Full-text available
We suggest a simple yet effective and parameter-free feature construction process for time series classification. Our process is decomposed into three steps: (i) we transform original data into several simple representations; (ii) on each representation, we apply a coclustering method; (iii) we use coclustering results to build new features for time...
Conference Paper
Full-text available
We study a stochastic online learning scheme with partial feedback where the utility of decisions is only observable through an estimation of the environment parameters. We propose a generic pure-exploration algorithm, able to cope with various utility functions from multi-armed bandits settings to dueling bandits. The primary application of this s...
Article
Full-text available
The Orange "Data for Development" (D4D) challenge is an open data challenge on anonymous call patterns of Orange's mobile phone users in Ivory Coast. The goal of the challenge is to help address society development questions in novel ways by contributing to the socio-economic development and well-being of the Ivory Coast population. Participants to...
Conference Paper
Full-text available
In this paper, we propose an approach to analyze the performance and the added value of automatic recommender systems in an industrial context. We show that recommender systems are multifaceted and can be organized around 4 structuring functions: help users to decide, help users to compare, help users to discover, help users to explore. A global of...
Article
Full-text available
When the marketing service has to contact customers to offer them a product, the probability that these customers will buy this product is calculated beforehand. This probability is calculated using a predictive model. The marketing service contacts the clients having the highest probability of buying the product. In parallel and before the com...
Conference Paper
Full-text available
One of the most critical operators for a Data Stream Management System is the join operator. Unfortunately, the join operator between streams A and B is a blocking operator: for each current tuple of stream A, the entire stream B has to be scanned. The usual technique used for unblocking stream operators consists in restricting the processing...
Conference Paper
Full-text available
Data stream mining is becoming ubiquitous because of the increase of the information volumes and the need for real-time access to information and on-line decision-making. Summarizing infinite data streams for fast access to the past information is becoming an active area of research. Such summarization under finite or slowly growing...
Conference Paper
Full-text available
The K Nearest Neighbors (KNN) method is strongly dependent on the quality of the distance metric used. For supervised classification problems, the aim of metric learning is to learn a distance metric for the input data space from a given collection of pairs of similar/dissimilar points. A crucial point is the distance metric used to measure the closeness o...
Conference Paper
Full-text available
In itself, the continuous exponential increase in the size of data warehouses does not necessarily lead to richer and finer-grained information, since processing capabilities do not increase at the same rate. Current state-of-the-art technologies require the user to strike a delicate balance between the processing cost and the information quality...
Conference Paper
Full-text available
Data streams constitute the core of many traditional (e.g. financial) and emerging (e.g. environmental) applications. The sources of streams are ubiquitous in daily life (e.g. web clicks). One feature of these data is the high speed of their arrival. Thus, their processing entails a special constraint. Despite the exponential growth in the capaci...
Presentation
Full-text available
Information-based data stream summary
Conference Paper
Full-text available
The objective of this article is to make the hierarchical self-organizing map (GHSOM) a usable tool for exploratory data analysis. Global visualization is an indispensable tool for making the results of a segmentation intelligible to a user. We therefore propose various tools for vis...
Conference Paper
Full-text available
Computer systems generate a large amount of data that, in terms of space and time, is very expensive - even impossible - to store. Besides this, many applications need to keep a historical view of such data in order to provide historical aggregated information, perform data mining tasks or detect anomalous behavior in computer systems. One solutio...
Presentation
Full-text available
data stream "generic summary" as constrained density estimation