Michael Gastpar

Swiss Federal Institute of Technology in Lausanne | EPFL · School of Computer and Communication Sciences

Dr. ès sc. (EPFL, 2002)

About

353
Publications
13,347
Reads
13,665
Citations
Additional affiliations
July 2009 - March 2015
Delft University of Technology
Position
  • Professor (Full)
January 2003 - July 2015
University of California, Berkeley
Position
  • Professor (Associate)

Publications (353)
Preprint
Batch normalization is a successful building block of neural network architectures. Yet, it is not well understood. A neural network layer with batch normalization comprises three components that affect the representation induced by the network: recentering the mean of the representation to zero, rescaling the variance of the r...
Preprint
High-order phenomena play crucial roles in many systems of interest, but their analysis is often highly nontrivial. There is a rich literature providing a number of alternative information-theoretic quantities capturing high-order phenomena, but their interpretation and relationship with each other is not well understood. The lack of principles uni...
Preprint
Full-text available
We study which machine learning algorithms have tight generalization bounds. First, we present conditions that preclude the existence of tight generalization bounds. Specifically, we show that algorithms that have certain inductive biases that cause them to be unstable do not admit tight generalization bounds. Next, we show that algorithms that are...
Preprint
Full-text available
The Shannon lower bound has been the subject of several important contributions by Berger. This paper surveys Shannon bounds on rate-distortion problems under mean-squared error distortion with a particular emphasis on Berger's techniques. Moreover, as a new result, the Gray-Wyner network is added to the canon of settings for which such bounds are...
Preprint
Full-text available
Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k$-th order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. We obser...
Preprint
Full-text available
We formalize the problem of prompt compression for large language models (LLMs) and present a framework to unify token-level prompt compression methods which create hard prompts for black-box models. We derive the distortion-rate function for this setup as a linear program, and provide an efficient algorithm to compute this fundamental limit via th...
Preprint
Full-text available
In recent years, transformer-based models have revolutionized deep learning, particularly in sequence modeling. To better understand this phenomenon, there is a growing interest in using Markov input processes to study transformers. However, our current understanding in this regard remains limited with many fundamental questions about how transform...
Article
The Shannon lower bound has been the subject of several important contributions by Berger. This paper surveys Shannon bounds on rate-distortion problems under mean-squared error distortion with a particular emphasis on Berger’s techniques. Moreover, as a new result, the Gray-Wyner network is added to the canon of settings for which such bounds are...
Article
Inspired by the connection between classical regret measures employed in universal prediction and Rényi divergence, we introduce a new class of universal predictors that depend on a real parameter α ≥ 1. This class interpolates two well-known predictors, the mixture estimators, that include the Laplace and the Krichevsky-Trofimov predictors, and th...
Preprint
Full-text available
This paper focuses on parameter estimation and introduces a new method for lower bounding the Bayesian risk. The method allows for the use of virtually any information measure, including Rényi's $\alpha$, $\varphi$-Divergences, and Sibson's $\alpha$-Mutual Information. The approach considers divergences as functionals of measures and explo...
Preprint
Full-text available
We adopt an information-theoretic framework to analyze the generalization behavior of the class of iterative, noisy learning algorithms. This class is particularly suitable for study under information-theoretic metrics as the algorithms are inherently randomized, and it includes commonly used algorithms such as Stochastic Gradient Langevin Dynamics...
Preprint
Full-text available
We prove that every online learnable class of functions of Littlestone dimension $d$ admits a learning algorithm with finite information complexity. Towards this end, we use the notion of a globally stable algorithm. Generally, the information complexity of such a globally stable algorithm is large yet finite, roughly exponential in $d$. We also sh...
Article
Most of our lives are conducted in the cyberspace. The human notion of privacy translates into a cyber notion of privacy on many functions that take place in the cyberspace. This article focuses on three such functions: how to privately retrieve information from cyberspace (privacy in information retrieval), how to privately leverage large-scale di...
Article
The increasing prevalence of massive datasets makes the outsourcing of storage and computation tasks to distributed servers a necessity. This raises a number of concerns regarding the security and integrity of stored information, the privacy of accessing desired information, the communication overhead of distributed systems, the latency, reliabilit...
Preprint
Full-text available
Inspired by Sibson's alpha-mutual information, we introduce a new class of universal predictors that depend on a real parameter greater than one. This class interpolates two well-known predictors, the mixture estimator, that includes the Laplace and the Krichevsky-Trofimov predictors, and the Normalized Maximum Likelihood (NML) estimator. We point...
Preprint
Full-text available
In this work, we connect the problem of bounding the expected generalisation error with transportation-cost inequalities. Exposing the underlying pattern behind both approaches we are able to generalise them and go beyond Kullback-Leibler Divergences/Mutual Information and sub-Gaussian measures. In particular, we are able to provide a result showin...
Preprint
Full-text available
We explore a family of information measures that stems from Rényi's $\alpha$-Divergences with $\alpha<0$. In particular, we extend the definition of Sibson's $\alpha$-Mutual Information to negative values of $\alpha$ and show several properties of these objects. Moreover, we highlight how this family of information measures is related to function...
Preprint
Full-text available
We consider the problem of parameter estimation in a Bayesian setting and propose a general lower-bound that includes part of the family of $f$-Divergences. The results are then applied to specific settings of interest and compared to other notable results in the literature. In particular, we show that the known bounds using Mutual Information can...
Preprint
Full-text available
In the distributed remote (CEO) source coding problem, many separate encoders observe independently noisy copies of an underlying source. The rate loss is the difference between the rate required in this distributed setting and the rate that would be required in a setting where the encoders can fully cooperate. In this sense, the rate loss characte...
Preprint
Full-text available
The Gray-Wyner network subject to a fidelity criterion is studied. Upper and lower bounds for the trade-offs between the private sum-rate and the common rate are obtained for arbitrary sources subject to mean-squared error distortion. The bounds meet exactly, leading to the computation of the rate region, when the source is jointly Gaussian. They m...
Article
The conditional mean is a fundamental and important quantity whose applications include the theories of estimation and rate-distortion. It is also notoriously difficult to work with. This paper establishes novel bounds on the differential entropy of the conditional mean in the case of finite-variance input signals and additive Gaussian noise. The m...
Article
This paper presents explicit solutions for two related non-convex information extremization problems due to Gray and Wyner in the Gaussian case. The first problem is the Gray-Wyner network subject to a sum-rate constraint on the two private links. Here, our argument establishes the optimality of Gaussian codebooks and hence, a closed-form formula f...
Preprint
Full-text available
Compute-forward is a coding technique that enables receiver(s) in a network to directly decode one or more linear combinations of the transmitted codewords. Initial efforts focused on Gaussian channels and derived achievable rate regions via nested lattice codes and single-user (lattice) decoding as well as sequential (lattice) decoding. Recently,...
Preprint
Full-text available
Most of our lives are conducted in the cyberspace. The human notion of privacy translates into a cyber notion of privacy on many functions that take place in the cyberspace. This article focuses on three such functions: how to privately retrieve information from cyberspace (privacy in information retrieval), how to privately leverage large-scale di...
Article
In this work, the probability of an event under some joint distribution is bounded by measuring it with the product of the marginals instead (which is typically easier to analyze) together with a measure of the dependence between the two random variables. These results find applications in adaptive data analysis, where multiple dependencies are int...
Preprint
Full-text available
An important notion of common information between two random variables is due to Wyner. In this paper, we derive a lower bound on Wyner's common information for continuous random variables. The new bound improves on the only other general lower bound on Wyner's common information, which is the mutual information. We also show that the new lower bou...
Article
Full-text available
Wyner’s common information is a measure that quantifies and assesses the commonality between two random variables. Based on this, we introduce a novel two-step procedure to construct features from data, referred to as Common Information Components Analysis (CICA). The first step can be interpreted as an extraction of Wyner’s common information. The...
Preprint
Full-text available
Learning and compression are driven by the common aim of identifying and exploiting statistical regularities in data, which opens the door for fertile collaboration between these areas. A promising group of compression techniques for learning scenarios is normalised maximum likelihood (NML) coding, which provides strong guarantees for compression o...
Article
Algebraic network information theory is an emerging facet of network information theory, studying the achievable rates of random code ensembles that have algebraic structure, such as random linear codes. A distinguishing feature is that linear combinations of codewords can sometimes be decoded more efficiently than codewords themselves. The present...
Preprint
We give an information-theoretic interpretation of Canonical Correlation Analysis (CCA) via (relaxed) Wyner's common information. CCA makes it possible to extract from two high-dimensional data sets low-dimensional descriptions (features) that capture the commonalities between the data sets, using a framework of correlations and linear transforms. Our interp...
Preprint
We consider the problem of source coding subject to a fidelity criterion for the Gray-Wyner network that connects a single source with two receivers via a common channel and two private channels. General lower bounds are derived for jointly Gaussian sources subject to the mean-squared error criterion, leveraging convex duality and an argument invol...
Preprint
Full-text available
The aim of this work is to provide bounds connecting two probability measures of the same event using Rényi $\alpha$-Divergences and Sibson's $\alpha$-Mutual Information, generalizations of the Kullback-Leibler Divergence and Shannon's Mutual Information, respectively. A particular case of interest can be found when the two probability measures c...
Preprint
A natural relaxation of Wyner's Common Information is studied. Specifically, the constraint of conditional independence is replaced by an upper bound on the conditional mutual information. While of interest in its own right, this relaxation has operational significance in a source coding problem that models coded caching. For the special case of jo...
Article
The feedback sum-rate capacity is established for the symmetric J-user Gaussian multiple-access channel (GMAC). The main contribution is a converse bound that combines the dependence-balance argument of Hekstra and Willems (1989) with a variant of the factorization of a convex envelope of Geng and Nair (2014). The converse bound matches the achieva...
Preprint
Full-text available
In this work, the probability of an event under some joint distribution is bounded by measuring it with the product of the marginals instead (which is typically easier to analyze) together with a measure of the dependence between the two random variables. These results find applications in adaptive data analysis, where multiple dependencies are int...
Article
A new scheme for the problem of centralized coded caching with non-uniform demands is proposed. The distinguishing feature of the proposed placement strategy is that it admits equal sub-packetization for all files while allowing the users to allocate more cache to the files which are more popular. This creates natural broadcasting opportunities in...
Preprint
Full-text available
The following problem is considered: given a joint distribution $P_{XY}$ and an event $E$, bound $P_{XY}(E)$ in terms of $P_XP_Y(E)$ (where $P_XP_Y$ is the product of the marginals of $P_{XY}$) and a measure of dependence of $X$ and $Y$. Such bounds have direct applications in the analysis of the generalization error of learning algorithms, where $...
Preprint
Full-text available
There is an increasing concern that most current published research findings are false. The main cause seems to lie in the fundamental disconnection between theory and practice in data analysis. While the former typically relies on statistical independence, the latter is an inherently adaptive process: new hypotheses are formulated based on the out...
Article
The distributed remote source coding (so-called CEO) problem is studied in the case where the underlying source, not necessarily Gaussian, has finite differential entropy and the observation noise is Gaussian. The main result is a new lower bound for the sum-rate-distortion function under arbitrary distortion measures. When specialized to the case...
Preprint
Full-text available
Consider a receiver in a multi-user network that wishes to decode several messages. Simultaneous joint typicality decoding is one of the most powerful techniques for determining the fundamental limits at which reliable decoding is possible. This technique has historically been used in conjunction with random i.i.d. codebooks to establish achievable...
Preprint
We consider a setup in which confidential i.i.d. samples $X_1,\dotsc,X_n$ from an unknown finite-support distribution $\boldsymbol{p}$ are passed through $n$ copies of a discrete privatization channel (a.k.a. mechanism) producing outputs $Y_1,\dotsc,Y_n$. The channel law guarantees a local differential privacy of $\epsilon$. Subject to a prescribed...
Preprint
The feedback sum-rate capacity is established for the symmetric $J$-user Gaussian multiple-access channel (GMAC). The main contribution is a converse bound that combines the dependence-balance argument of Hekstra and Willems (1989) with a variant of the factorization of a convex envelope of Geng and Nair (2014). The converse bound matches the achie...
Article
We present a practical strategy that aims to attain rate points on the dominant face of the multiple access channel capacity using a standard low complexity decoder. This technique is built upon recent theoretical developments of Zhu and Gastpar on compute-forward multiple access (CFMA) which achieves the capacity of the multiple access channel usi...
Preprint
Multi-server single-message private information retrieval is studied in the presence of side information. In this problem, $K$ independent messages are replicatively stored at $N$ non-colluding servers. The user wants to privately download one message from the servers without revealing the index of the message to any of the servers, leveraging its...
Article
Full-text available
This paper presents a joint typicality framework for encoding and decoding nested linear codes in multi-user networks. This framework provides a new perspective on compute-forward within the context of discrete memoryless networks. In particular, it establishes an achievable rate region for computing a linear combination over a discrete memoryless...
Preprint
Full-text available
We propose a novel caching strategy for the problem of centralized coded caching with non-uniform demands. Our placement strategy can be applied to an arbitrary number of users and files, and can be easily adapted to the scenario where file popularities are user-specific. The distinguishing feature of the proposed placement strategy is that it allo...
Preprint
We study the problem of single-server multi-message private information retrieval with side information. One user wants to recover $N$ out of $K$ independent messages which are stored at a single server. The user initially possesses a subset of $M$ messages as side information. The goal of the user is to download the $N$ demand messages while not l...
Preprint
Full-text available
The distributed remote source coding (so-called CEO) problem is studied in the case where the underlying source has finite differential entropy and the observation noise is Gaussian. The main result is a new lower bound for the sum-rate-distortion function under arbitrary distortion measures. When specialized to the case of mean-squared error, it i...
Article
Full-text available
Despite significant progress in the caching literature concerning the worst case and uniform average case regimes, the algorithms for caching with nonuniform demands are still at a basic stage and mostly rely on simple grouping and memory-sharing techniques. In this work we introduce a novel centralized caching strategy for caching with nonuniform...
Article
Full-text available
We present a practical strategy that aims to attain rate points on the dominant face of the multiple access channel capacity using a standard low complexity decoder. This technique is built upon recent theoretical developments of Zhu and Gastpar on compute-forward multiple access (CFMA) which achieves the capacity of the multiple access channel usi...
Article
The cooperative data exchange problem is studied for the fully connected network. In this problem, each node initially only possesses a subset of the $K$ packets making up the file. Nodes make broadcast transmissions that are received by all other nodes. The goal is for each node to recover the full file. In this paper, we present a polynomial-time...
Article
Full-text available
We introduce the Fixed Cluster Repair System (FCRS) as a novel architecture for Distributed Storage Systems (DSS) that achieves a small repair bandwidth while guaranteeing a high availability. Specifically we partition the set of servers in a DSS into $s$ clusters and allow a failed server to choose any cluster other than its own as its repair grou...
Article
Full-text available
In this paper, we consider a cache aided network in which each user is assumed to have individual caches, while upon users’ requests, an update message is sent through a common link to all users. First, we formulate a general information theoretic setting that represents the database as a discrete memoryless source, and the users’ requests as side...
Article
Full-text available
Computation codes in network information theory are designed for the scenarios where the decoder is not interested in recovering the information sources themselves, but only a function thereof. Körner and Marton showed for distributed source coding that such function decoding can be achieved more efficiently than decoding the full information sou...
Preprint
Computation codes in network information theory are designed for the scenarios where the decoder is not interested in recovering the information sources themselves, but only a function thereof. Körner and Marton showed for distributed source coding that such function decoding can be achieved more efficiently than decoding the full information sou...
Article
Full-text available
The classical distributed storage problem can be modeled by a k-uniform complete hyper-graph where vertices represent servers and hyper-edges represent users. Hence each hyper-edge should be able to recover the full file using only the memories of the vertices associated with it. This paper considers the generalization of this problem to...
Conference Paper
From a subset of the n-dimensional integer lattice, we independently pick two points uniformly at random. A sumset is formed by adding these two points component-wise and a sumset is called typical, if the sum falls inside this set with high probability. In this note we characterize the asymptotic size of the typical sumsets for large n, and show t...
Conference Paper
Full-text available
We study the problem of coded caching when the server has access to several libraries and each user makes independent requests from every library. The single-library scenario has been well studied and it has been proved that coded caching can significantly improve the delivery rate compared to uncoded caching. In this work we show that when all the...
Conference Paper
We study a generalization of Wyner's Common Information to Watanabe's Total Correlation. The first minimizes the description size required for a variable that can make two other random variables conditionally independent. If independence is unattainable, Watanabe's total (conditional) correlation is a measure to check just how independent they have be...
Conference Paper
Given two identical linear codes C with rate R over $F_q$ of length n, we independently pick one codeword from each codebook uniformly at random. A sumset is formed by adding these two codewords entry-wise as integer vectors and a sumset is called typical, if the sum falls inside this set with high probability. In this paper we show that the asymptot...
