About
74 Publications
7,826 Reads
791 Citations
Introduction
I am a professor of Mathematics of Machine Learning at KU Eichstätt-Ingolstadt.
My research focuses on mathematically understanding the potential and limitations of machine learning methods, in particular deep learning. This includes:
- Approximation properties of deep neural networks
- Complexity of sets of neural networks (VC dimension, entropy numbers, sampling numbers)
- Robustness of deep neural networks
I am also interested in harmonic analysis.
Additional affiliations
June 2021 - October 2021
June 2020 - May 2021
February 2018 - present
Publications (74)
In this paper we show that the Fourier transform induces an isomorphism between the coorbit spaces defined by Feichtinger and Gröchenig of the mixed, weighted Lebesgue spaces $L_{v}^{p,q}$ with respect to the quasi-regular representation of a semi-direct product $\mathbb{R}^{d}\rtimes H$ with suitably chosen dilation group $H$, and certain decomp...
Many smoothness spaces in harmonic analysis are decomposition spaces. In this paper we ask: Given two decomposition spaces, is there an embedding between the two? A decomposition space $\mathcal{D}(\mathcal{Q}, L^p, Y)$ can be described using a covering $\mathcal{Q}=(Q_{i})_{i\in I}$ of the frequency domain, an exponent $p$, and a sequence space...
We present a framework for the construction of structured, possibly compactly supported Banach frames and atomic decompositions for decomposition spaces. Such a space $\mathcal{D}(\mathcal{Q}, L^p, \ell_w^q)$ is defined using a frequency covering $\mathcal{Q}=(Q_i)_{i\in I}$: If $(\varphi_i)_{i}$ is a suitable...
We generalize the classical universal approximation theorem for neural networks to the case of complex-valued neural networks. Precisely, we consider feedforward networks with a complex activation function $\sigma : \mathbb{C} \to \mathbb{C}$ in which each neuron performs the operation $\mathbb{C}^N \to \mathbb{C}, z \mapsto \sigma(b + w^T z)$ with...
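As a quick illustration (my sketch, not the paper's construction), the neuron operation $z \mapsto \sigma(b + w^T z)$ can be written out in NumPy; the particular activation below is an arbitrary placeholder, while the paper characterizes which activations yield universal approximation:

```python
import numpy as np

def complex_layer(z, W, b, sigma):
    """One layer of a complex-valued feedforward network: each neuron
    computes sigma(b_k + w_k^T z) with complex weights, bias, and input."""
    return sigma(b + W @ z)

# Placeholder activation C -> C (an illustrative assumption).
sigma = np.tanh

rng = np.random.default_rng(0)
z = rng.standard_normal(4) + 1j * rng.standard_normal(4)            # input in C^4
W = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))  # complex weights
b = rng.standard_normal(3) + 1j * rng.standard_normal(3)            # complex biases
print(complex_layer(z, W, b, sigma))                                # output in C^3
```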
In this paper, we consider Barron functions $f : [0,1]^d \to \mathbb{R}$ of smoothness $\sigma > 0$, which are functions that can be written as \[ f(x) = \int_{\mathbb{R}^d} F(\xi) \, e^{2 \pi i \langle x, \xi \rangle} \, d \xi \quad \text{with} \quad \int_{\mathbb{R}^d} |F(\xi)| \cdot (1 + |\xi|)^{\sigma} \, d \xi < \infty. \] For $\sigma = 1$, th...
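For a concrete example (mine, not from the paper): with the normalization above, the Gaussian is a Barron function of every smoothness $\sigma > 0$, since it equals its own Fourier transform under this convention:
\[
f(x) = e^{-\pi |x|^2}
\quad\Longrightarrow\quad
F(\xi) = e^{-\pi |\xi|^2},
\qquad
\int_{\mathbb{R}^d} e^{-\pi |\xi|^2} \, (1 + |\xi|)^{\sigma} \, d\xi < \infty
\quad \text{for every } \sigma > 0 .
\]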
Motivated by results of Dyatlov on Fourier uncertainty principles for Cantor sets, and of Knutsen for joint time-frequency representations (the STFT with a Gaussian window, equivalent to Fock spaces), we suggest a general setting relating localization and uncertainty and prove, within this context, an uncertainty principle for Cantor sets in Bergman spaces of t...
This paper provides maximal function characterizations of anisotropic Triebel–Lizorkin spaces associated to general expansive matrices for the full range of parameters $p \in (0,\infty)$...
Continuing previous work, this paper provides maximal characterizations of anisotropic Triebel–Lizorkin spaces $\dot{\mathbf{F}}^{\alpha}_{p,q}$...
Gabor systems are used in fields ranging from audio processing to digital communication. Such a Gabor system $(g,\Lambda)$ consists of all time-frequency shifts $\pi(\lambda)g$ of a window function $g \in L^2(\mathbb{R})$ along a lattice $\Lambda \subset \mathbb{R}^2$. We focus on Gabor systems that are also Riesz sequences, meaning that one can stably reconstruct the coefficients $c=(c_\lambda)_{\lambda\in\Lambda}$ from the functio...
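To make the notation concrete, here is a small sketch (my illustration; grid, window, and shift parameters are arbitrary toy choices) of the time-frequency shifts $\pi(\lambda)g$ on a sample grid:

```python
import numpy as np

def tf_shift(g, x, t, omega):
    """Time-frequency shift (pi(t, omega) g)(x) = e^{2 pi i omega x} g(x - t),
    evaluated on the grid x by linear interpolation (one common convention;
    sign and phase conventions vary across the literature)."""
    return np.exp(2j * np.pi * omega * x) * np.interp(x - t, x, g, left=0.0, right=0.0)

x = np.linspace(-10.0, 10.0, 2001)       # sample grid
g = np.exp(-np.pi * x**2)                # Gaussian window
atom = tf_shift(g, x, t=1.0, omega=2.0)  # one element pi(lambda) g of the system
```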
Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications --- f...
We consider Gabor frames generated by a general lattice and a window function that belongs to one of the following spaces: the Sobolev space $V_1 = H^1(\mathbb{R}^d)$, the weighted $L^2$-space $V_2 = L_{1 + |x|}^2(\mathbb{R}^d)$, and the space $V_3 = \mathbb{H}^1(\mathbb{R}^d) = V_1 \cap V_2$...
Using techniques developed recently in the field of compressed sensing we prove new upper bounds for general (non-linear) sampling numbers of (quasi-)Banach smoothness spaces in $L^2$. In relevant cases such as mixed and isotropic weighted Wiener classes or Sobolev spaces with mixed smoothness, sampling numbers in $L^2$ can be upper bounded by best...
This paper provides a classification theorem for expansive matrices $A \in \mathrm{GL}(d, \mathbb{R})$ generating the same anisotropic homogeneous Triebel-Lizorkin space $\dot{\mathbf{F}}^{\alpha}_{p, q}(A)$ for $\alpha \in \mathbb{R}$ and $p,q \in (0,\infty]$. It is shown that $\dot{\mathbf{F}}^{\alpha}_{p, q}(A) = \dot{\mathbf{F}}^{\alpha}_{p, q}...
Warped time-frequency systems have recently been introduced as a class of structured continuous frames for functions on the real line. Herein, we generalize this framework to the setting of functions of arbitrary dimensionality. After showing that the basic properties of warped time-frequency representations carry over to higher dimensions, we dete...
We derive an extension of the Walnut–Daubechies criterion for the invertibility of frame operators. The criterion concerns general reproducing systems and Besov-type spaces. As an application, we conclude that $L^2$ frame expansions associated with smooth and fast-decaying reproducing systems on sufficiently fine lattices extend to Besov-type...
This paper is a continuation of [arXiv:2104.14361]. It concerns maximal characterizations of anisotropic Triebel-Lizorkin spaces $\dot{\mathbf{F}}^{\alpha}_{p,q}$ for the endpoint case of $p = \infty$ and the full scale of parameters $\alpha \in \mathbb{R}$ and $q \in (0,\infty]$. In particular, a Peetre-type characterization of the anisotropic Bes...
This paper provides a self-contained exposition of coorbit spaces associated with integrable group representations and quasi-Banach function spaces. It extends the theory in [Studia Math., 180(3):237-253, 2007] to locally compact groups that do not necessarily possess a compact, conjugation-invariant unit neighborhood. Furthermore, the present pape...
We study the problem of learning classification functions from noiseless training samples, under the assumption that the decision boundary is of a certain regularity. We establish universal lower bounds for this estimation problem, for general classes of continuous decision boundaries. For the class of locally Barron-regular decision boundaries, we...
Rate distortion theory is concerned with optimally encoding signals from a given signal class $\mathcal{S}$ using a budget of $R$ bits, as $R \to \infty$. We say that $\mathcal{S}$ can be compressed at rate $s$ if we can achieve an error of at most $\mathcal{O}(R^{-s})$ for encoding the given signal class; the supremal compression rate is denoted by $s^{\ast}(\mathcal{S})$. Given a fixed coding scheme, there usually are...
We consider neural network approximation spaces that classify functions according to the rate at which they can be approximated (with error measured in $L^p$) by ReLU neural networks with an increasing number of coefficients, subject to bounds on the magnitude of the coefficients and the number of hidden layers. We prove embedding theorems between...
We consider neural network approximation spaces that classify functions according to the rate at which they can be approximated (with error measured in $L^p$) by ReLU neural networks with an increasing number of coefficients, subject to bounds on the coefficients and the number of hidden layers. We prove embedding theorems between these spaces for...
Assume that $X_{\Sigma} \in \mathbb{R}^{n}$ is a centered random vector following a multivariate normal distribution with positive definite covariance matrix $\Sigma$. Let $g : \mathbb{R}^{n} \to \mathbb{C}$ be measurable and of moderate growth, say $|g(x)| \lesssim (1 + |x|)^{N}$. We show that the map $\Sigma \mapsto \mathbb{E}\left[g(X_{\Sigma})\right]$...
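As a quick numerical probe of this map (my sketch; $g$, $\Sigma$, and the sample size are toy choices), one can compare a Monte Carlo estimate with the closed form $\mathbb{E}[|X_\Sigma|^2] = \operatorname{tr}(\Sigma)$:

```python
import numpy as np

def expected_g(Sigma, g, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[g(X_Sigma)] for X_Sigma ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n_samples)
    return np.mean(g(X))

g = lambda X: np.sum(X**2, axis=1)            # moderate growth: g(x) = |x|^2
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(expected_g(Sigma, g), np.trace(Sigma))  # both approximately 3.0
```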
Schur's test for integral operators states that if a kernel $K : X \times Y \to \mathbb{C}$ satisfies $\int_Y |K(x,y)| \, d\nu(y) \leq C$ and $\int_X |K(x,y)| \, d\mu(x) \leq C$, then the associated integral operator is bounded from $L^p(\nu)$ into $L^p(\mu)$, simultaneously for all $p \in [1,\infty]$. We derive a variant of this result which ensures that the integral operator acts boundedly on the (weighted) mixed-norm Lebesgu...
We consider Gabor frames generated by a general lattice and a window function that belongs to one of the following spaces: the Sobolev space $V_1 = H^1(\mathbb R^d)$, the weighted $L^2$-space $V_2 = L_{1 + |x|}^2(\mathbb R^d)$, and the space $V_3 = \mathbb H^1(\mathbb R^d) = V_1 \cap V_2$ consisting of all functions with finite uncertainty product;...
Let $\mathcal{G} \subset L^2(\mathbb{R})$ be the subspace spanned by a Gabor Riesz sequence $(g,\Lambda)$ with $g \in L^2(\mathbb{R})$ and a lattice $\Lambda \subset \mathbb{R}^2$ of rational density. It was shown recently that if $g$ is well-localized both in time and frequency, then $\mathcal{G}$ cannot contain any time-frequency shift $\pi(z)g$ of $g$ with $z \in \mathbb{R}^2 \setminus \Lambda$. In this paper, we improve the result to the quantitative statement that the $L^2$-d...
We study the expressivity of deep neural networks. Measuring a network’s complexity by its number of connections or by its number of neurons, we consider the class of functions for which the error of best approximation with networks of a given complexity decays at a certain rate when increasing the complexity budget. Using results from classical ap...
This paper provides maximal function characterizations of anisotropic Triebel-Lizorkin spaces associated to general expansive matrices for the full range of parameters $p \in (0,\infty)$, $q \in (0,\infty]$ and $\alpha \in \mathbb{R}$. The equivalent norm is defined in terms of the decay of wavelet coefficients, quantified by a Peetre-type space ov...
We study the computational complexity of (deterministic or randomized) algorithms based on point samples for approximating or integrating functions that can be well approximated by neural networks. Such algorithms (most prominently stochastic gradient descent and its variants) are used extensively in the field of deep learning. One of the most impo...
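For orientation (a toy instance of my own, not the paper's setting): the prototypical point-sample algorithm is stochastic gradient descent fitting a small one-hidden-layer ReLU network to samples of a target function:

```python
import numpy as np

rng = np.random.default_rng(0)
target = lambda x: np.abs(x) - 0.5            # toy target on [-1, 1]
n, width, lr = 256, 16, 1e-2

x = rng.uniform(-1.0, 1.0, n)                 # point samples of the target
y = target(x)
w, b, c = rng.standard_normal((3, width))     # hidden weights, biases, output weights

for step in range(20_000):                    # plain SGD on the squared loss
    i = rng.integers(n)                       # one random sample per step
    pre = w * x[i] + b
    act = np.maximum(pre, 0.0)                # ReLU activations
    err = c @ act - y[i]
    mask = (pre > 0).astype(float)            # ReLU derivative
    gc, gw, gb = err * act, err * c * mask * x[i], err * c * mask
    c, w, b = c - lr * gc, w - lr * gw, b - lr * gb

mse = np.mean((np.maximum(np.outer(x, w) + b, 0.0) @ c - y) ** 2)
print(mse)                                    # small after training
```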
We show that complex-valued neural networks with the modReLU activation function $\sigma(z) = \mathrm{ReLU}(|z| - 1) \cdot z / |z|$ can uniformly approximate complex-valued functions of regularity $C^n$ on compact subsets of $\mathbb{C}^d$, giving explicit bounds on the approximation rate.
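The activation itself is easy to state in code; a minimal NumPy version (the offset $b = 1$ matches the formula above, and $\sigma(0) := 0$):

```python
import numpy as np

def modrelu(z, b=1.0):
    """modReLU: sigma(z) = ReLU(|z| - b) * z / |z|, with sigma(0) = 0.
    The phase of z is preserved; the modulus is shrunk by b."""
    r = np.abs(z)
    return np.where(r > b, (r - b) * z / np.where(r > 0, r, 1.0), 0.0)

print(modrelu(np.array([0.0, 0.5 + 0.5j, 2.0 - 1.0j])))
```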
We prove a negative result for the approximation of functions defined on compact subsets of $\mathbb{R}^d$ (where $d \geq 2$) using feedforward neural networks with one hidden layer and arbitrary continuous activation function. In a nutshell, this result claims the existence of target functions that are as difficult to approximate using these neural networks as one m...
We show that sampling or interpolation formulas in reproducing kernel Hilbert spaces can be obtained by reproducing kernels whose dual systems form molecules, ensuring that the size profile of a function is fully reflected by the size profile of its sampled values. The main tool is a local holomorphic calculus for convolution-dominated operators, v...
We prove bounds for the approximation and estimation of certain classification functions using ReLU neural networks. Our estimation bounds provide a priori performance guarantees for empirical risk minimization using networks of a suitable size, depending on the number of training samples available. The obtained approximation and estimation rates a...
Rate distortion theory is concerned with optimally encoding a given signal class $\mathcal{S}$ using a budget of $R$ bits, as $R\to\infty$. We say that $\mathcal{S}$ can be compressed at rate $s$ if we can achieve an error of $\mathcal{O}(R^{-s})$ for encoding $\mathcal{S}$; the supremal compression rate is denoted $s^\ast(\mathcal{S})$. Given a fi...
Schur's test states that if $K:X\times Y\to\mathbb{C}$ satisfies $\int_Y |K(x,y)|d\nu(y)\leq C$ and $\int_X |K(x,y)|d\mu(x)\leq C$, then the associated integral operator acts boundedly on $L^p$ for all $p\in [1,\infty]$. We derive a variant of this result ensuring boundedness on the (weighted) mixed-norm Lebesgue spaces $L_w^{p,q}$ for all $p,q\in...
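A discrete analogue is easy to check numerically (my illustration, not from the paper): for a matrix, the two conditions become bounds on the column and row sums of $|K|$, and they control the operator norm on every $\ell^p$:

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.random((50, 50))
# Normalize so that all row sums and column sums of |K| are at most C = 1.
K /= max(np.abs(K).sum(axis=0).max(), np.abs(K).sum(axis=1).max())

x = rng.standard_normal(50)
for p in (1, 2, np.inf):
    # Each ratio ||Kx||_p / ||x||_p stays below C = 1.
    print(p, np.linalg.norm(K @ x, p) / np.linalg.norm(x, p))
```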
We analyze the topological properties of the set of functions that can be implemented by neural networks of a fixed size. Surprisingly, this set has many undesirable properties. It is highly non-convex, except possibly for a few exotic activation functions. Moreover, the set is not closed with respect to \(L^p\)-norms, \(0< p < \infty \), for all p...
We derive an extension of the Walnut-Daubechies criterion for the invertibility of frame operators. The criterion concerns general reproducing systems and Besov-type spaces. As an application, we conclude that $L^2$ frame expansions associated with smooth and fast-decaying reproducing systems on sufficiently fine lattices extend to Besov-type space...
We introduce a family of quasi-Banach spaces — which we call wave packet smoothness spaces — that includes those function spaces which can be characterised by the sparsity of their expansions in Gabor frames, wave atoms, and many other frame constructions. We construct Banach frames for and atomic decompositions of the wave packet smoothness spaces...
We consider non-complete Gabor frame sequences generated by an $S_0$-function and a lattice $\Lambda$ and prove that there is $m \in \mathbb{N}$ such that all time-frequency shifts leaving the corresponding Gabor space invariant have their parameters in $\tfrac{1}{m}\Lambda$. We also investigate time-frequency shift invariance under duality aspects...
Let $\mathcal G \subset L^2(\mathbb R)$ be the subspace spanned by a Gabor Riesz sequence $(g,\Lambda)$ with $g \in L^2(\mathbb R)$ and a lattice $\Lambda \subset \mathbb R^2$ of rational density. It was shown recently that if $g$ is well-localized both in time and frequency, then $\mathcal G$ cannot contain any time-frequency shift $\pi(z) g$ of $...
We discuss the expressive power of neural networks which use the non-smooth ReLU activation function $\varrho(x) = \max\{0,x\}$ by analyzing the approximation theoretic properties of such networks. The existing results mainly fall into two categories: approximation using ReLU networks with a fixed depth, or using ReLU networks whose depth increases...
This chapter is concerned with recent progress in the context of coorbit space theory. Based on a square-integrable group representation, the coorbit theory provides new families of associated smoothness spaces, where the smoothness of a function is measured by the decay of the associated voice transform. Moreover, by discretizing the representatio...
Many representation systems on the sphere have been proposed in the past, such as spherical harmonics, wavelets, or curvelets. Each of these data representations is designed to extract a specific set of features, and choosing the best fixed representation system for a given scientific application is challenging. In this paper, we show that we can l...
Convolutional neural networks are the most widely used type of neural networks in applications. In mathematical analysis, however, mostly fully-connected networks are studied. In this paper, we establish a connection between both network architectures. Using this connection, we show that all upper and lower bounds concerning approximation rates of...
This paper is concerned with recent progress in the context of coorbit space theory. Based on a square-integrable group representation, the coorbit theory provides new families of associated smoothness spaces, where the smoothness of a function is measured by the decay of the associated voice transform. Moreover, by discretizing the representation...
We analyze the topological properties of the set of functions that can be implemented by neural networks of a fixed size. Surprisingly, this set has many undesirable properties: It is highly non-convex, except possibly for a few exotic activation functions. Moreover, the set is not closed with respect to $L^p$-norms, $0<p<\infty$, for all practical...
Assume that $X_{\Sigma}\in\mathbb{R}^{n}$ is a random vector following a multivariate normal distribution with zero mean and positive definite covariance matrix $\Sigma$. Let $g:\mathbb{R}^{n}\to\mathbb{C}$ be measurable and of moderate growth, e.g., $|g(x)| \lesssim (1+|x|)^{N}$. We show that the map $\Sigma\mapsto\mathbb{E}\left[g(X_{\Sigma})\right]$...
We study the necessary and sufficient complexity of ReLU neural networks---in terms of depth and number of weights---which is required for approximating classifier functions in an $L^2$-sense. As a model class, we consider the set $\mathcal{E}^\beta (\mathbb R^d)$ of possibly discontinuous piecewise $C^\beta$ functions $f : [-1/2, 1/2]^d \to \mathb...
There are two notions of sparsity associated to a frame $\Psi=(\psi_i)_{i\in I}$: Analysis sparsity of $f$ means that the analysis coefficients $(\langle f,\psi_i\rangle)_i$ are sparse, while synthesis sparsity means that $f=\sum_i c_i\psi_i$ with sparse coefficients $(c_i)_i$. Here, sparsity of $c=(c_i)_i$ means $c\in\ell^p(I)$ for a given $p<2$....
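A tiny finite-dimensional contrast (my example) between the two notions: take $\Psi$ to be the union of two orthonormal bases of $\mathbb{R}^4$; then an atom of the second basis is 1-sparse in the synthesis sense, while its analysis coefficients need not be:

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
H4 = np.kron(H, H)                  # 4x4 orthonormal (normalized Hadamard) basis
Psi = np.hstack([np.eye(4), H4])    # frame: 8 atoms as columns of a 4x8 matrix

f = H4[:, 0]                        # one atom of the second basis
c = np.zeros(8); c[4] = 1.0         # 1-sparse synthesis coefficients
print(np.allclose(Psi @ c, f))      # True: f = sum_i c_i psi_i
analysis = Psi.T @ f                # analysis coefficients <f, psi_i>
print(np.count_nonzero(np.abs(analysis) > 1e-12))  # 5 nonzero entries, not 1
```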
This article describes how the ideas promoted by the fundamental papers published by M. Frazier and B. Jawerth in the eighties have influenced subsequent developments related to the theory of atomic decompositions and Banach frames for function spaces such as the modulation spaces and Besov-Triebel-Lizorkin spaces. Both of these classes of spaces a...
We consider the problem of characterizing the wavefront set of a tempered distribution $u \in \mathcal{S}'(\mathbb{R}^{d})$...
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may...
In the present paper, we investigate whether an embedding of a decomposition space $\mathcal{D}\left(\mathcal{Q},L^{p},Y\right)$ into a given Sobolev space $W^{k,q}(\mathbb{R}^{d})$ exists. As special cases, this includes embeddings into Sobolev spaces of (homogeneous and inhomogeneous) Besov spaces, ($\alpha$)-modulation spaces, shearlet smoothnes...
We consider the problem of characterizing the wavefront set of a tempered distribution $u\in\mathcal{S}'(\mathbb{R}^{d})$ in terms of its continuous wavelet transform, where the latter is defined with respect to a suitably chosen dilation group $H\subset{\rm GL}(\mathbb{R}^{d})$. In this paper we develop a comprehensive and unified approach that al...
Performance analysis is very important to understand the applications' behavior and to identify bottlenecks. Performance-analysis tools should facilitate the exploration of the data collected and help to identify where the analyst has to look. While this functionality can promote the tools' usage on small and medium-size environments, it becomes ma...
Programming and optimising large parallel applications for multi-core systems is an ambitious and time-consuming challenge. Therefore, a number of software tools have been developed in the past to assist the programmer in optimising their codes. Scalasca and Vampir are two of these performance-analysis tools that are already well established and re...
The performance of parallel applications is often affected by wait states occurring when processes fail to reach synchronization points simultaneously. In the KOJAK project, we have shown that these wait states and other performance properties can be diagnosed by searching event traces for characteristic patterns and quantifying their severity, i.e.,...