Eric Moulines

MINES ParisTech | ParisTech · Department of Signal and Image Processing

PhD

About

483
Publications
51,336
Reads
23,708
Citations
Citations since 2016
136 Research Items
8,649 Citations
[Chart: citations per year, 2016 to 2022]
Education
September 1984 - September 1986
Ecole Nationale Supérieure des Télécommunications (ENST, Télécom ParisTech)
Field of study
  • Signal Processing
September 1981 - August 1984
Ecole Polytechnique, France, Paris
Field of study
  • Mathematics, Physics

Publications (483)
Preprint
Full-text available
In this paper, we develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights. First, we formulate the training of quantized neural networks (QNNs) as a smoothed sequence of interval-constrained optimization problems. Then, we propose a new first-order stochastic method, AskewSGD, to sol...
Preprint
This paper focuses on Bayesian inference in a federated learning context (FL). While several distributed MCMC algorithms have been proposed, few consider the specific limitations of FL such as communication bottlenecks and statistical heterogeneity. Recently, Federated Averaging Langevin Dynamics (FALD) was introduced, which extends the Federated A...
Preprint
We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states, and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinfor...
Preprint
Full-text available
The particle-based, rapid incremental smoother (PARIS) is a sequential Monte Carlo technique allowing for efficient online approximation of expectations of additive functionals under Feynman--Kac path distributions. Under weak assumptions, the algorithm has linear computational complexity and limited memory requirements. It also comes with a number...
Article
In a recent paper, Muehlebach and Jordan (2021a) proposed a novel algorithm for constrained optimization that uses original ideas from nonsmooth dynamical systems. In this work, we extend Muehlebach and Jordan (2021a) in several important directions: (i) we provide existence and convergence results for continuous-time trajectories under general co...
Preprint
Full-text available
Importance Sampling (IS) is a method for approximating expectations under a target distribution using independent samples from a proposal distribution and the associated importance weights. In many applications, the target distribution is known only up to a normalization constant, in which case self-normalized IS (SNIS) can be used. While the use o...
Preprint
This paper provides a finite-time analysis of linear stochastic approximation (LSA) algorithms with fixed step size, a core method in statistics and machine learning. LSA is used to compute approximate solutions of a $d$-dimensional linear system $\bar{\mathbf{A}} \theta = \bar{\mathbf{b}}$, for which $(\bar{\mathbf{A}}, \bar{\mathbf{b}})$ can only...
Preprint
Full-text available
This paper studies the Variational Inference (VI) used for training Bayesian Neural Networks (BNN) in the overparameterized regime, i.e., when the number of neurons tends to infinity. More specifically, we consider overparameterized two-layer BNN and point out a critical issue in the mean-field VI training. This problem arises from the decompositio...
Preprint
Full-text available
Personalised federated learning (FL) aims at collaboratively learning a machine learning model tailored for each client. Although promising advances have been made in this direction, most existing approaches do not allow for uncertainty quantification, which is crucial in many applications. In addition, personalisation in the cross-device set...
Preprint
Full-text available
We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For B...
Article
Full-text available
In this paper, we propose an efficient variance reduction approach for additive functionals of Markov chains relying on a novel discrete-time martingale representation. Our approach is fully non-asymptotic and does not require the knowledge of the stationary distribution (and even any type of ergodicity) or specific structure of the underlying dens...
Preprint
Full-text available
Vector Quantised-Variational AutoEncoders (VQ-VAE) are generative models based on discrete latent representations of the data, where inputs are mapped to a finite set of learned embeddings. To generate new samples, an autoregressive prior distribution over the discrete states must be trained separately. This prior is generally very complex and leads...
Preprint
Full-text available
While the Metropolis Adjusted Langevin Algorithm (MALA) is a popular and widely used Markov chain Monte Carlo method, very few papers derive conditions that ensure its convergence. In particular, to the authors' knowledge, assumptions that are both easy to verify and guarantee geometric convergence, are still missing. In this work, we establish $V$...
Conference Paper
Real-time detection of the glottal closure from the speech signal, based on convexity disruptions of the accumulated signal.
Preprint
Full-text available
Speech signal carefully processed for extracting glottal closure time
Conference Paper
Full-text available
Performing reliable Bayesian inference on a big data scale is becoming a keystone in the modern era of machine learning. A workhorse class of methods to achieve this task are Markov chain Monte Carlo (MCMC) algorithms and their design to handle distributed datasets has been the subject of many works. However, existing methods are not completely eit...
Preprint
Full-text available
We develop an Explore-Exploit Markov chain Monte Carlo algorithm ($\operatorname{Ex^2MCMC}$) that combines multiple global proposals and local moves. The proposed method is massively parallelizable and extremely computationally efficient. We prove $V$-uniform geometric ergodicity of $\operatorname{Ex^2MCMC}$ under realistic conditions and compute e...
Preprint
The Expectation Maximization (EM) algorithm is the default algorithm for inference in latent variable models. As in any other field of machine learning, applications of latent variable models to very large datasets make the use of advanced parallel and distributed architectures mandatory. This paper introduces FedEM, which is the first extension of...
Preprint
In this paper, we establish moment and Bernstein-type inequalities for additive functionals of geometrically ergodic Markov chains. These inequalities extend the corresponding inequalities for independent random variables. Our conditions cover Markov chains converging geometrically to the stationary distribution either in $V$-norms or in weighted W...
Preprint
We study the convergence in total variation and $V$-norm of discretization schemes of the underdamped Langevin dynamics. Such algorithms are very popular and commonly used in molecular dynamics and computational statistics to approximately sample from a target distribution of interest. We show first that, for a very large class of schemes, a mino...
Conference Paper
Full-text available
Variational auto-encoders (VAE) are popular deep latent variable models which are trained by maximizing an Evidence Lower Bound (ELBO). To obtain tighter ELBO and hence better variational approximations, it has been proposed to use importance sampling to get a lower variance estimate of the evidence. However, importance sampling is known to perform...
Article
Full-text available
Fast incremental expectation maximization (FIEM) is a version of the EM framework for large datasets. In this paper, we first recast FIEM and other incremental EM type algorithms in the Stochastic Approximation within EM framework. Then, we provide nonasymptotic bounds for the convergence in expectation as a function of the number of examples n and...
Preprint
Full-text available
Variational auto-encoders (VAE) are popular deep latent variable models which are trained by maximizing an Evidence Lower Bound (ELBO). To obtain tighter ELBO and hence better variational approximations, it has been proposed to use importance sampling to get a lower variance estimate of the evidence. However, importance sampling is known to perform...
Preprint
Full-text available
Performing reliable Bayesian inference on a big data scale is becoming a keystone in the modern era of machine learning. A workhorse class of methods to achieve this task are Markov chain Monte Carlo (MCMC) algorithms and their design to handle distributed datasets has been the subject of many works. However, existing methods are not completely eit...
Preprint
Full-text available
This paper provides a non-asymptotic analysis of linear stochastic approximation (LSA) algorithms with fixed stepsize. This family of methods arises in many machine learning tasks and is used to obtain approximate solutions of a linear system $\bar{A}\theta = \bar{b}$ for which $\bar{A}$ and $\bar{b}$ can only be accessed through random estimates $...
Preprint
Full-text available
Federated learning aims at conducting inference when data are decentralised and locally stored on several clients, under two main constraints: data ownership and communication overhead. In this paper, we address these issues under the Bayesian paradigm. To this end, we propose a novel Markov chain Monte Carlo algorithm coined \texttt{QLSD} built up...
Preprint
Incremental Expectation Maximization (EM) algorithms were introduced to design EM for the large scale learning framework by avoiding the full data set to be processed at each iteration. Nevertheless, these algorithms all assume that the conditional expectations of the sufficient statistics are explicit. In this paper, we propose a novel algorithm n...
Preprint
A novel algorithm named Perturbed Prox-Preconditioned SPIDER (3P-SPIDER) is introduced. It is a stochastic variance-reduced proximal-gradient type algorithm built on Stochastic Path Integral Differential EstimatoR (SPIDER), an algorithm known to achieve near-optimal first-order oracle inequality for nonconvex and nonsmooth optimization. Compared to...
Preprint
Full-text available
In this work we present an approach for building tight model-free confidence intervals for the optimal value function $V^\star$ in general infinite horizon MDPs via the upper solutions. We suggest a novel upper value iterative procedure (UVIP) to construct upper solutions for a given agent's policy. UVIP leads to a model free method of policy evalu...
Article
In this paper we propose a novel and practical variance reduction approach for additive functionals of dependent sequences. Our approach combines the use of control variates with the minimization of an empirical variance estimate. We analyze finite sample properties of the proposed method and derive finite-time bounds of the excess asymptotic varia...
Preprint
Full-text available
Simultaneously sampling from a complex distribution with intractable normalizing constant and approximating expectations under this distribution is a notoriously challenging problem. We introduce a novel scheme, Invertible Flow Non Equilibrium Sampling (InFine), which departs from classical Sequential Monte Carlo (SMC) and Markov chain Monte Carlo...
Preprint
Full-text available
This paper studies fixed step-size stochastic approximation (SA) schemes, including stochastic gradient schemes, in a Riemannian framework. It is motivated by several applications, where geodesics can be computed explicitly, and their use accelerates crude Euclidean methods. A fixed step-size scheme defines a family of time-homogeneous Markov chain...
Article
We study the problem of sampling from a probability distribution π on $\mathbb{R}^d$ which has a density w.r.t. the Lebesgue measure known up to a normalization factor $x \mapsto \mathrm{e}^{-U(x)} / \int_{\mathbb{R}^d} \mathrm{e}^{-U(y)} \, \mathrm{d}y$. We analyze a sampling method based on the Euler discretization of the Langevin stochastic differential equations under the assumptions that the potential U is continuou...
Preprint
This paper studies the exponential stability of random matrix products driven by a general (possibly unbounded) state space Markov chain. It is a cornerstone in the analysis of stochastic algorithms in machine learning (e.g. for parameter tracking in online learning or reinforcement learning). The existing results impose strong conditions such as u...
Preprint
Full-text available
We undertake a precise study of the non-asymptotic properties of vanilla generative adversarial networks (GANs) and derive theoretical guarantees in the problem of estimating an unknown $d$-dimensional density $p^*$ under a proper choice of the class of generators and discriminators. We prove that the resulting density estimate converges to $p^*$ i...
Preprint
Full-text available
Markov Chain Monte Carlo (MCMC) is a class of algorithms to sample complex and high-dimensional probability distributions. The Metropolis-Hastings (MH) algorithm, the workhorse of MCMC, provides a simple recipe to construct reversible Markov kernels. Reversibility is a tractable property which implies a less tractable but essential property here, i...
Preprint
Fast Incremental Expectation Maximization (FIEM) is a version of the EM framework for large datasets. In this paper, we first recast FIEM and other incremental EM type algorithms in the {\em Stochastic Approximation within EM} framework. Then, we provide nonasymptotic bounds for the convergence in expectation as a function of the number of examples...
Preprint
The Expectation Maximization (EM) algorithm is of key importance for inference in latent variable models including mixture of regressors and experts, missing observations. This paper introduces a novel EM algorithm, called \texttt{SPIDER-EM}, for inference from a training set of size $n$, $n \gg 1$. At the core of our algorithm is an estimator of t...
Preprint
The Expectation Maximization (EM) algorithm is a key reference for inference in latent variable models; unfortunately, its computational cost is prohibitive in the large scale learning setting. In this paper, we propose an extension of the Stochastic Path-Integrated Differential EstimatoR EM (SPIDER-EM) and derive complexity bounds for this novel a...
Preprint
In this paper we propose a novel and practical variance reduction approach for additive functionals of dependent sequences. Our approach combines the use of control variates with the minimisation of an empirical variance estimate. We analyse finite sample properties of the proposed method and derive finite-time bounds of the excess asymptotic varia...
Article
Full-text available
In this paper, we propose a novel variance reduction approach for additive functionals of Markov chains based on minimization of an estimate for the asymptotic variance of these functionals over suitable classes of control variates. A distinctive feature of the proposed approach is its ability to significantly reduce the overall finite sample varia...
Preprint
Full-text available
This paper analyzes the convergence for a large class of Riemannian stochastic approximation (SA) schemes, which aim at tackling stochastic optimization problems. In particular, the recursions we study use either the exponential map of the considered manifold (geodesic schemes) or more general retraction functions (retraction schemes) used as a pro...
Preprint
Full-text available
In this contribution, we propose a new computationally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to \textit{MetFlow}, a novel family of MCMC algorithms we introduce, in which proposals are obtained using Normalizing...
Preprint
Linear two-timescale stochastic approximation (SA) scheme is an important class of algorithms which has become popular in reinforcement learning (RL), particularly for the policy evaluation problem. Recently, a number of works have been devoted to establishing the finite time analysis of the scheme, especially under the Markovian (non-i.i.d.) noise...
Preprint
Full-text available
Uncertainty quantification for deep learning is a challenging open problem. Bayesian statistics offer a mathematically grounded framework to reason about uncertainties; however, approximate posteriors for modern neural networks still require prohibitive computational costs. We propose a family of algorithms which split the classification task into...
Article
We consider in this paper the problem of sampling a high-dimensional probability distribution π having a density w.r.t. the Lebesgue measure on $\mathbb{R}^d$, known up to a normalization constant $x \mapsto \pi(x) = \mathrm{e}^{-U(x)} / \int_{\mathbb{R}^d} \mathrm{e}^{-U(y)} \, \mathrm{d}y$. Such a problem naturally occurs, for example, in Bayesian inference and machine learning. Under the assumption that U is continuously d...
Preprint
Full-text available
The EM algorithm is one of the most popular algorithms for inference in latent data models. The original formulation of the EM algorithm does not scale to large data sets, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton have proposed an incremental version of the EM (iEM) in which...
Preprint
The ability to generate samples of the random effects from their conditional distributions is fundamental for inference in mixed effects models. Random walk Metropolis is widely used to perform such sampling, but this method is known to converge slowly for medium dimensional problems, or when the joint structure of the distributions to sample is sp...
Preprint
In this paper we propose a novel variance reduction approach for additive functionals of Markov chains based on minimization of an estimate for the asymptotic variance of these functionals over suitable classes of control variates. A distinctive feature of the proposed approach is its ability to significantly reduce the overall finite sample varian...
Article
Full-text available
The ability to generate samples of the random effects from their conditional distributions is fundamental for inference in mixed effects models. Random walk Metropolis is widely used to perform such sampling, but this method is known to converge slowly for medium dimensional problems, or when the joint structure of the distributions to sample is sp...
Preprint
We state and prove a quantitative version of the bounded difference inequality for geometrically ergodic Markov chains. Our proof uses the same martingale decomposition as \cite{MR3407208} but, compared to this paper, the exact coupling argument is modified to fill a gap between the strongly aperiodic case and the general aperiodic case.
Preprint
Full-text available
We consider the problem of sampling from a target distribution which is \emph{not necessarily logconcave}. Non-asymptotic analysis results are established in a suitable Wasserstein-type distance of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, when the gradient is driven by even \emph{dependent} data streams. Our estimates are sharper...
Article
A mixed data frame (MDF) is a table collecting categorical, numerical and count observations. The use of MDF is widespread in statistics and the applications are numerous from abundance data in ecology to recommender systems. In many cases, an MDF exhibits simultaneously main effects, such as row, column or group effects and interactions, for which...
Article
A complete methodology called LORI (Low-Rank Interaction), including a Poisson model, an algorithm, and an automatic selection of the regularization parameter, is proposed for the analysis of frequency tables with covariates, including an upper bound on the estimation error. A simulation study with synthetic data suggests that LORI improves empiric...
Preprint
In this paper we propose an efficient variance reduction approach for MCMC algorithms relying on a novel discrete time martingale representation for Markov chains. Our approach is fully non-asymptotic and does not require any type of ergodicity or special product structure of the underlying density. By rigorously analyzing the convergence of the pr...
Preprint
Full-text available
Stochastic approximation (SA) is a key method used in statistical learning. Recently, its non-asymptotic convergence analysis has been considered in many papers. However, most of the prior analyses are made under restrictive assumptions such as unbiased gradient estimates and convex objective function, which significantly limit their applications t...
Article
We consider the problem of nonparametric density estimation of a random environment from the observation of a single trajectory of a random walk in this environment. We build several density estimators using the beta-moments of this distribution. Then we apply the Goldenshluger-Lepski method to select an estimator satisfying an oracle type inequal...
Preprint
Full-text available
Many applications of machine learning involve the analysis of large data frames, i.e., matrices collecting heterogeneous measurements (binary, numerical, counts, etc.) across samples, with missing values. Low-rank models, as studied by Udell et al. [30], are popular in this framework for tasks such as visualization, clustering and missing value imputation....
Preprint
Full-text available
Stochastic Gradient Langevin Dynamics (SGLD) is a combination of a Robbins-Monro type algorithm with Langevin dynamics in order to perform data-driven stochastic optimization. In this paper, the SGLD method with fixed step size $\lambda$ is considered in order to sample from a logconcave target distribution $\pi$, known up to a normalisation factor...
Preprint
Full-text available
Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated successes in machine learning tasks. The curren...
Preprint
Full-text available
A new methodology is presented for the construction of control variates to reduce the variance of additive functionals of Markov Chain Monte Carlo (MCMC) samplers. Our control variates are defined as linear combinations of functions whose coefficients are obtained by minimizing a proxy for the asymptotic variance. The construction is theoretically...
Preprint
Full-text available
A mixed data frame (MDF) is a table collecting categorical, numerical and count observations. The use of MDF is widespread in statistics and the applications are numerous from abundance data in ecology to recommender systems. In many cases, an MDF exhibits simultaneously main effects, such as row, column or group effects and interactions, for which...
Preprint
Full-text available
We consider the problem of non-parametric density estimation of a random environment from the observation of a single trajectory of a random walk in this environment. We first construct a density estimator using the beta-moments. We then show that the Goldenshluger-Lepski method can be used to select the beta-moment. We prove non-asymptotic bounds...
Conference Paper
Full-text available
This paper proposes a new methodology to reduce energy consumption in large buildings while simultaneously optimizing thermal comfort. The model designed with an energy simulation program is calibrated by the Covariance Matrix Adaptation Evolutionary Strategy using observations including consumptions, inside temperatures and comfort measurements s...
Preprint
Full-text available
In this paper we address the convergence of stochastic approximation when the functions to be minimized are nonconvex and nonsmooth. We show that the "mean-limit" approach to the convergence which leads, for smooth problems, to the ODE approach can be adapted to the non-smooth case. The limiting dynamical system may be shown to be, under appropria...
Article
Full-text available
Mean embeddings provide an extremely flexible and powerful tool in machine learning and statistics to represent probability distributions and define a semi-metric (MMD, maximum mean discrepancy; also called N-distance or energy distance), with numerous successful applications. The representation is constructed as the expectation of the feature map...
Chapter
In this chapter we will discuss the case in which the state space \(\mathsf {X}\) is discrete, which means either finite or countably infinite. In this case, it will always be assumed that \(\mathscr {X}= \mathscr {P}(\mathsf {X})\), the set of all subsets of \(\mathsf {X}\). Since every state is an atom, we will first apply the results of Chapter...