We show that standard multilayer feedforward networks with as few as a single hidden layer and arbitrary bounded and nonconstant activation function are universal approximators with respect to Lp(μ) performance criteria, for arbitrary finite input environment measures μ, provided only that sufficiently many hidden units are available. If the activation function is continuous, bounded and nonconstant, then continuous mappings can be learned uniformly over compact input sets. We also give very general conditions ensuring that networks with sufficiently smooth activation functions are capable of arbitrarily accurate approximation to a function and its derivatives.
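As a concrete (if simplified) illustration of the theorem, the sketch below approximates a smooth function on a compact set with a single-hidden-layer tanh network. Note the shortcut: the hidden weights and biases are drawn at random and only the output weights are fitted by least squares, which is not the training scheme the cited result assumes — the point is only that enough hidden units make the sup-norm error small.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_shallow_net(f, m=200, n=400):
    """One hidden tanh layer with m units; output weights fitted by least squares."""
    x = np.linspace(0.0, 1.0, n)[:, None]               # training inputs on [0, 1]
    W = rng.normal(scale=10.0, size=(1, m))             # random hidden weights
    b = rng.normal(scale=10.0, size=m)                  # random hidden biases
    H = np.tanh(x @ W + b)                              # hidden activations
    c, *_ = np.linalg.lstsq(H, f(x[:, 0]), rcond=None)  # fit output weights only
    return lambda t: np.tanh(t[:, None] @ W + b) @ c

target = lambda t: np.cos(3.0 * t)
net = fit_shallow_net(target)
grid = np.linspace(0.0, 1.0, 1000)
err = np.max(np.abs(net(grid) - target(grid)))          # sup-norm error on [0, 1]
```

With 200 hidden units the error is already far below visual resolution; shrinking `m` degrades it, in line with the "sufficiently many hidden units" proviso.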


... These methods have shown numerous successes in solving a large variety of PDEs empirically. Their successes are partly due to the provable universal approximation power of DNNs [34,49,84]. On the other hand, these methods aim at solving specific instances of PDEs, and as a consequence, they need to start from scratch for the same PDE whenever the initial and/or boundary value changes. ...

... , W_1, b_1) \in \mathbb{R}^m, and training the network u_\theta refers to finding the minimizer \theta of some properly designed loss function. DNNs have been shown to be very powerful in approximating high-dimensional functions in a vast number of studies in recent years; see, e.g., [28,29,34,35,46,49,64,84]. For example, it is shown in [28] that for any M, \varepsilon > 0, k \in \mathbb{N}, p \in [1,\infty], and \Omega = (0,1)^d \subset \mathbb{R}^d, if we denote F := \{ f \in W^{k,p}(\Omega;\mathbb{R}) : \|f\|_{W^{k,p}(\Omega)} \le M \}, then there exists a DNN structure u_\theta of form (3.2) with sufficiently large m and L (which depend on M, \varepsilon, d and p only), such that for any f \in F there is \|u_\theta - f\|_{W^{k,p}(\Omega)} \le \varepsilon for some \theta \in \mathbb{R}^m. ...

... To see this, we first repeat the procedure above but with \varepsilon replaced by \varepsilon/2. Then the universal approximation theorem [34,84] and the continuity of DNNs in their parameters imply that there exist DNNs \{\varphi_j : 1 \le j \le m\}, whose network parameters are collectively denoted by \eta \in \mathbb{R}^m, satisfying \|\varphi_j - \phi_j\|_\infty \le \varepsilon/(2mC_0|\Omega|) and ...

We develop a novel computational framework to approximate solution operators of evolution partial differential equations (PDEs). By employing a general nonlinear reduced-order model, such as a deep neural network, to approximate the solution of a given PDE, we realize that the evolution of the model parameter is a control problem in the parameter space. Based on this observation, we propose to approximate the solution operator of the PDE by learning the control vector field in the parameter space. From any initial value, this control field can steer the parameter to generate a trajectory such that the corresponding reduced-order model solves the PDE. This allows for substantially reduced computational cost to solve the evolution PDE with arbitrary initial conditions. We also develop comprehensive error analysis for the proposed method when solving a large class of semilinear parabolic PDEs. Numerical experiments on different high-dimensional evolution PDEs with various initial conditions demonstrate the promising results of the proposed method.
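The core idea of the abstract above — once a control vector field V(θ) in parameter space is learned, solving the PDE for a new initial condition reduces to integrating an ODE for the parameters — can be sketched as follows. Here V is a hypothetical stand-in (it generates simple exponential decay, d θ/dt = −θ), not the learned field of the paper, so the mechanics of steering parameters along a trajectory are what the example shows.

```python
import numpy as np

def V(theta):
    # Stand-in control field: d theta / dt = -theta, so theta(t) = theta0 * exp(-t).
    return -theta

def evolve(theta0, T=1.0, steps=1000):
    """Steer the model parameters from theta0 along the control field V."""
    theta = np.array(theta0, dtype=float)
    dt = T / steps
    for _ in range(steps):
        theta = theta + dt * V(theta)   # explicit Euler step in parameter space
    return theta

theta_T = evolve([1.0, -2.0], T=1.0)    # parameters at time T for this initial value
```

Any new initial value only requires re-integrating this ODE, which is the source of the reduced computational cost the abstract describes.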

... The representation problem has been studied in theory for a long time for classical function classes such as Sobolev spaces [15,29,42,53,54]. Error bounds are obtained for ReLU deep neural networks. ...

... In the computational setting, we can only represent finitely many symbols, i.e., there are only finitely many terms in the sum Eq. (29). Hence the discrete version of the data is a sampling of Eq. (29). Here, N_s is the number of symbols in the data, which is also referred to as the length of the data. It is convenient to set τ such that T_s/τ is an integer. ...

In this work, we use Bradbury's fiber equation and an explainable convolutional deep learning network (NLS-Net) to solve an inverse problem of the nonlinear Schrödinger equation, which is widely used in fiber-optic communications. The landscape and minimizers of the non-convex loss function of the learning problem are studied empirically, which provides guidance for choosing the hyper-parameters of the method. The estimation error of the optimal solution is discussed in terms of the expressive power of the NLS-Net and the data. In addition, we compare the performance of several training algorithms that are popular in deep learning. It is shown that one can obtain a relatively accurate estimate of the considered parameters using the proposed method. The study provides a natural framework for solving inverse problems of nonlinear partial differential equations with deep learning.

... In recent years, developments in artificial neural networks have prompted new research into their capacity to be excellent differential equation solvers [1][2][3][4][5]. They are universal approximators [6]; they can circumvent the curse of dimensionality [7]; and they are continuous. However, practically, their construction and optimisation costs are enough to deter the discerning user. ...

... These state that, under certain conditions, neural networks are able to approximate any function to arbitrary closeness. We recall one of these theorems by Hornik [6]. ...

... 2. Initiate and train new neural networks {N_k}_{k=1}^{K} in sequence to satisfy differential equations given by (6), via loss functions (7). Once the loss has converged, stop training, freeze the parameters of N_k, and proceed with N_{k+1}. ...
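The freeze-then-correct loop in the snippet above can be illustrated with a toy stand-in: fit a first model, freeze it, then fit a second model only to the remaining residual. Plain polynomial least squares replaces the neural networks N_k here — an assumption for brevity, not the paper's method — but the error-correction structure is the same.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 200)
f = np.sin(np.pi * x)                # target the "network" should approximate

c1 = np.polyfit(x, f, 3)             # N_1: coarse fit, then "frozen"
r1 = f - np.polyval(c1, x)           # residual error left by N_1
c2 = np.polyfit(x, r1, 9)            # N_2: trained only on the residual
approx = np.polyval(c1, x) + np.polyval(c2, x)

err1 = np.max(np.abs(r1))            # error after stage 1
err2 = np.max(np.abs(f - approx))    # error after the correction stage
```

The correction stage drives the error down by orders of magnitude without ever revisiting the frozen first model, mirroring the sequential training described in the snippet.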

We motivate the use of neural networks for the construction of numerical solutions to differential equations. We prove that there exists a feed-forward neural network that can arbitrarily minimise an objective function that is zero at the solution of Poisson's equation, allowing us to guarantee that neural network solution estimates can get arbitrarily close to the exact solutions. We also show how these estimates can be appreciably enhanced through various strategies, in particular through the construction of error correction networks, for which we propose a general method. We conclude by providing numerical experiments that attest to the validity of all such strategies for variants of Poisson's equation.

... Here, (l) can be a fixed or a learnable parameter. It has been demonstrated that the expressiveness follows from the sum aggregator and the injectiveness of the transformation function, for which a multilayer perceptron with at least one hidden layer was proposed, motivated by the universal approximation theorem Hornik et al. [1989], Hornik [1991]. ...

... Remark B.1. From an empirical perspective, a multilayer perceptron with at least one hidden layer was proposed for the function f(l), motivated by the universal approximation theorem Hornik et al. [1989], Hornik [1991]. Analogously, it has been established that SVMs using the RBF kernel are universal approximators Burges [1998], Hammer and Gersmann [2003]. ...

We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. First, we introduce an unsupervised kernel machine propagating the node features in a one-hop neighbourhood. Then, we specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. The deep graph convolutional kernel machine is obtained by stacking multiple shallow kernel machines. After showing that the unsupervised and semi-supervised layers correspond to an eigenvalue problem and a linear system on the aggregated node features, respectively, we derive an efficient end-to-end training algorithm in the dual variables. Numerical experiments demonstrate that our approach is competitive with state-of-the-art graph neural networks for homophilous and heterophilous benchmark datasets. Notably, GCKM achieves superior performance when very few labels are available.

... When advocating NNs, a feature that is often emphasised is that they are universal approximators [195]. It is not quite so well known that BDTs are also universal approximators, for more or less the same reasons as NNs are [196], so in that respect they are on equal footing. ...

... This is a step-like function, which meets the requirements for a squashing function as set out in [195]. Therefore, a linear superposition of these neurons is a universal approximator. ...

This thesis investigates possible parameter values of, and optimum jet reconstruction for the signals from, the two Higgs doublet model (2HDM). Possible parameter values are investigated by way of recasting a parameter scan published by ATLAS. The original analysis, performed with 36.1 fb⁻¹ of run 2 data, investigated the possibility of observing cascade decays from the 2HDM. The study considered the process A → ZH → ℓ⁺ℓ⁻bb̄ (where ℓ = e, µ) in the context of the standard four Yukawa types. A parameter space in the physical basis of the 2HDM is explored, seeking parameter combinations that are not forbidden by theoretical constraints or existing observations, and to which a detector would be sensitive. The existing study is recast in two directions: firstly, the possibility of exchanging A and H is investigated; secondly, the extrapolation to run 3 is calculated. Under exchange of H and A all detectable parameter combinations are forbidden. More promisingly, however, it is seen that run 3 will offer sensitivity to considerable areas of permissible parameter space. It is clear that these decay channels already offer potential for finding the 2HDM at the LHC. Another line of investigation that might complement this is the potential to improve sensitivity through better signal reconstruction techniques. In particular, jet reconstruction techniques that might expand sensitivity to cascade decays from the 2HDM ending in a four-b-quark final state are sought. Firstly, the challenges of reconstructing these states with existing algorithms are evaluated, and the limitations posed by cuts in the trigger are illustrated. A comparison is made between the prevalent anti-kT algorithm and a somewhat unusual algorithm termed variable-R. This finds that variable-R performs the task best, both in terms of mass peak reconstruction and jet multiplicity. The second investigation into optimum jet reconstruction aims to apply a novel method, spectral clustering, to the jet formation problem.
Again, it is driven by an interest in reconstructing cascade decays from the 2HDM. This method proves to be insensitive to infrared singularities in a practical sense. It is also shown to be very flexible, capable of clustering a range of signal types without requiring alterations to its parameter settings.

... Then, the parameters in the feedforward neural network are optimized via minimization of the cost function. The largest advantage of using the feedforward neural network in the path optimization method is the universal approximation theorem: a neural network with even a single hidden layer can approximate any continuous function on a compact subset, as long as we prepare a sufficient number of units in the hidden layer [42,43]. ...

The path optimization method is applied to a QCD effective model with a Polyakov loop and a repulsive vector-type interaction at finite temperature and density to circumvent the model sign problem. We show how the path optimization method can increase the average phase factor and control the model sign problem. This is the first study which correctly treats the repulsive vector-type interaction in the QCD effective model with a Polyakov loop via the Markov-chain Monte Carlo approach. It is shown that we can evade the model sign problem within the standard path-integral formulation by complexifying the temporal component of the gluon field and the vector-type auxiliary field.

... Most ANNs are based on the so-called elementary Rosenblatt perceptron [6], with the traditional sigmoid or hyperbolic tangent used as the activation function. A three-layer perceptron provides high-quality approximation of fairly complex functions defined on a bounded domain [8]. Attempts to improve the quality of the obtained solution by increasing the number of hidden layers of the ANN run into the problem of the so-called vanishing and exploding gradients [9], whose appearance is related to the shape of the sigmoidal activation functions. ...

Optimizing the learning speed of deep neural networks is an extremely important issue. Modern approaches focus on the use of neural networks based on the Rosenblatt perceptron. But the results obtained are not satisfactory for industrial and scientific needs in the context of the speed of learning of neural networks. Also, this approach stumbles upon the problems of a vanishing and exploding gradient. To solve the problem, the paper proposed using a neo-fuzzy neuron, whose properties are based on the F-transform. The article discusses the use of the neo-fuzzy neuron as the main component of the neural network. The architecture of a deep neo-fuzzy neural network is shown, as well as a backpropagation algorithm for this architecture with a triangular membership function for the neo-fuzzy neuron. The main advantages of using the neo-fuzzy neuron as the main component of the neural network are given. The article describes the properties of a neo-fuzzy neuron that address the issues of improving speed and vanishing or exploding gradients. The proposed neo-fuzzy deep neural network architecture is compared with standard deep networks based on the Rosenblatt perceptron.
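A minimal sketch of the neo-fuzzy neuron with triangular membership functions, assuming a uniform grid of membership centers (the grid, weights, and function names below are illustrative, not the paper's notation): each input is passed through a bank of triangular memberships, and the output is the membership-weighted sum of tunable weights. With this partition the neuron reduces to piecewise-linear interpolation of the weights over the grid nodes, which is part of what makes it cheap to train.

```python
import numpy as np

def triangular_memberships(x, centers):
    """Triangular membership degrees of each x for each center on a uniform grid."""
    h = centers[1] - centers[0]                       # uniform grid spacing
    mu = np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers) / h)
    return mu                                         # each row sums to 1 on the grid

def neo_fuzzy_neuron(x, centers, w):
    """Zero-order Sugeno output: membership-weighted sum of the weights w."""
    return triangular_memberships(x, centers) @ w

centers = np.linspace(0.0, 1.0, 11)                   # membership centers
w = np.sin(2.0 * np.pi * centers)                     # weights = node values here
x = np.linspace(0.0, 1.0, 101)
y = neo_fuzzy_neuron(x, centers, w)                   # piecewise-linear output
```

Because at most two memberships are active per input and they sum to one, the gradient with respect to each weight is bounded, which is the informal link to the vanishing/exploding-gradient discussion above.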

... If the network is seen as the implementation, then a fundamental part of the work of 'integration' with grammatical formalisms consists in solving the mapping problem. In principle, feed forward networks can implement any grammar, or any formal representation, or approximate it with arbitrary precision, with the basic units of representation implemented as vectors in some numerical parameter space, and with operations manipulating these representations corresponding to information flowing through the network, changing its state (Hornik 1991, Leshno et al. 1993, Lin et al. 2017). This makes them attractive for simulating grammatical learning. ...

For six decades the generative concept has dominated syntactic theory. Work on generative rules does not simplify the concepts underlying the normative principles of the rules; rather, it requires that the subject matter of the rules be considered normative. Grammar is a way to express phrases in their correct form. Grammar rules are made precise by the way they are formulated, in a specific style that does not include generative grammar. The term "generative" is directly tied to Noam Chomsky's tradition of grammatical research; the term, its formalisms, and its terminology have been studied extensively within that tradition.

... BNNs, capitalising on this success, randomise the NN architecture to yield Bayesian priors for functions. BNNs are popular as they empirically show good results, scale well in the dimension of the function's domain, and more progress is being made on the supporting theory, e.g., on their approximation quality, infinite-width behaviour, etc. (Hornik, 1991; Matthews et al., 2018). A drawback of standard BNNs is currently the limited interpretability of the posterior distributions on the parameter space, as the distribution on each weight degenerates due to the scaling of the variance with the number of nodes. ...

This paper introduces a new neural network based prior for real valued functions. Each weight and bias of the neural network has an independent Gaussian prior, with the key novelty that the variances decrease in the width of the network in such a way that the resulting function is well defined in the limit of an infinite width network. We show that the induced posterior over functions is amenable to Monte Carlo sampling using Hilbert space Markov chain Monte Carlo (MCMC) methods. This type of MCMC is stable under mesh refinement, i.e. the acceptance probability does not degenerate as more parameters of the function's prior are introduced, even ad infinitum. We demonstrate these advantages over other function space priors, for example in Bayesian Reinforcement Learning.
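The width scaling of the prior variance discussed above can be checked empirically with a toy one-hidden-layer network: if the output weights have variance 1/m, the prior variance of f(x) stays O(1) as the width m grows. This is a hedged sketch of the scaling idea only, not the paper's exact prior or its MCMC machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_f(x, m, n_samples=2000):
    """Draw n_samples network functions of width m and evaluate them at x."""
    w = rng.normal(size=(n_samples, m))                 # hidden weights
    b = rng.normal(size=(n_samples, m))                 # hidden biases
    v = rng.normal(size=(n_samples, m)) / np.sqrt(m)    # output weights, var 1/m
    return np.sum(v * np.tanh(w * x + b), axis=1)       # f(x) for each prior draw

# Prior variance of f(0.5) is (up to Monte Carlo noise) independent of width.
var_narrow = np.var(sample_f(0.5, m=10))
var_wide = np.var(sample_f(0.5, m=1000))
```

Without the 1/sqrt(m) factor, `var_wide` would be roughly 100x `var_narrow`; with it, the two widths give matching prior variances, which is what makes the infinite-width limit well defined.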

... E, Han, and Jentzen (2017) or Huré, Pham, and Warin (2020) (to quote only two). There are good reasons for this evolution, starting with universal approximation theorems, or density results justifying the use of neural networks as a versatile parameterization: see (Kidger and Lyons, 2020) for the fixed-width case and (Hornik, 1991; Cybenko, 1989) for the fixed-depth case. Other incentives for using neural networks are the ability to automatically infer a linear regression basis (provided by the trained hidden layers), or the ease of transfer learning (Pan and Yang, 2009; Bozinovski, 2020). ...

... Neural networks can learn and process data in parallel to produce output values. Furthermore, the universal approximation theorem states that an NN with a single or several hidden layers can accurately approximate any nonlinear behavior [30]. In this respect, an RNN is a deep learning network topology designed to improve network performance on current and future inputs by using knowledge from the past. ...

Over the last few years, improving power extraction from the wind energy conversion system (WECS) under varying wind speeds has become a complex task. The current study presents an optimum maximum power point tracking (MPPT) control approach integrated with neural network (NN)-based rotor speed control and pitch angle control to extract the maximum power from the WECS. To this end, this study presents a reference model adaptive control (RMAC) for a direct-drive (DD) permanent magnet vernier generator (PMVG)-based WECS under real wind speed conditions. Initially, RMAC-based rotor speed tracking control is presented with adaptive terms, tracking a reference model that guarantees the expected exponential decay of the rotor speed error trajectory. Then, to reduce wind speed measurement errors, a recurrent neural network (RNN)-based training model is presented. Moreover, the asymptotic stability of the proposed control method is mathematically proven by Lyapunov theory. In addition, pitch angle control is presented to efficiently operate the rotor speed within the allowable operating range. Finally, the proposed control system demonstrates its effectiveness through simulation and experimentation using a prototype 5 kW DD PMVG-based WECS, and the comparative results affirm the superiority of the proposed control method over existing control methods.

... Therefore, deciding on the number of hidden layers and the number of neurons in each is very effective in increasing the overall efficiency of the model and should be done carefully. Selecting too few neurons in the hidden layers may lead to underfitting, while selecting too many will lead to overfitting and increase the model training time [14,24]. For each model, values between 64 and 4096 were set as the acceptable range for the number of neurons. ...

The most dangerous type of skin cancer in the world is melanoma. Early diagnosis of this cancer at a primary stage can increase the chance of survival. In recent years, automatic skin cancer detection systems have played a significant role in increasing the rate of cancer diagnosis. Although deep convolutional neural networks have presented advantages over traditional methods and brought tremendous breakthroughs in many image classification tasks, accurate classification of skin lesions remains challenging due to the complexity of choosing an appropriate architecture for deep neural networks and of hyper-parameter tuning. The aim of this paper is to increase the performance of a skin lesion classification system by optimizing the hyper-parameters and architecture of a deep neural network using metaheuristic optimization algorithms. For this purpose, three optimization algorithms are employed to find an optimal configuration for the convolutional neural network, either for pre-trained models or for models trained from scratch. Then the deep features extracted from the optimized models are fused together in pairs and used to train a KNN classifier. The effect of applying hyper-parameter optimization is evaluated on the ISIC 2017 and ISIC 2018 datasets. The accuracy of the deep neural network produced by our method reaches 81.6% with an F1-score of 80.9% on the ISIC 2017 dataset, and an accuracy of 90.1% with an F1-score of 89.8% on ISIC 2018. The results of the present study indicate that the proposed method outperforms similar methods in classifying seven and three classes of images, without requiring heavy preprocessing and segmentation steps.

... There are a number of appealing reasons to choose ψ_θ(s) in the form of an artificial neural network (ANN), rendering the ansatz an NQS. Most importantly, rigorous representation theorems guarantee that any possible wave function can be approximated by an ANN in the limit of large network sizes [48][49][50][51]. This means that the approach is numerically exact in the sense that the accuracy of results can be certified self-consistently by convergence checks. ...

Programmable quantum devices are now able to probe wave functions at unprecedented levels. This is based on the ability to project the many-body state of atom and qubit arrays onto a measurement basis which produces snapshots of the system wave function. Extracting and processing information from such observations remains, however, an open quest. One often resorts to analyzing low-order correlation functions - i.e., discarding most of the available information content. Here, we introduce wave function networks - a mathematical framework to describe wave function snapshots based on network theory. For many-body systems, these networks can become scale free - a mathematical structure that has found tremendous success in a broad set of fields, ranging from biology to epidemics to internet science. We demonstrate the potential of applying these techniques to quantum science by introducing protocols to extract the Kolmogorov complexity corresponding to the output of a quantum simulator, and implementing tools for fully scalable cross-platform certification based on similarity tests between networks. We demonstrate the emergence of scale-free networks analyzing data from Rydberg quantum simulators manipulating up to 100 atoms. We illustrate how, upon crossing a phase transition, the system complexity decreases while correlation length increases - a direct signature of build up of universal behavior in data space. Comparing experiments with numerical simulations, we achieve cross-certification at the wave-function level up to timescales of 4 $\mu$ s with a confidence level of 90%, and determine experimental calibration intervals with unprecedented accuracy. Our framework is generically applicable to the output of quantum computers and simulators with in situ access to the system wave function, and requires probing accuracy and repetition rates accessible to most currently available platforms.

... CNN-based models have been used very successfully for many supervised learning tasks, particularly image classification (e.g., Krizhevsky et al. 2017; He et al. 2015; Simonyan & Zisserman 2014; Goodfellow et al. 2014; Ronneberger et al. 2015). They are built from a hierarchy of artificial neural networks, known as "universal function approximators" (Hornik et al. 1990; Hornik 1991), which learn increasingly abstract representations of the input data, X, by nonlinearly transforming the data through a series of hidden layers that relate X to an output prediction Y. CNNs are a special class of neural network architecture that differ from fully connected neural networks by their inclusion of only partially connected, so-called convolutional, layers, which detect the topological structure of the input data, capturing how neighboring image pixels are related spatially, or how adjacent time-series measurements are related temporally. ...

Stellar variability is driven by a multitude of internal physical processes that depend on fundamental stellar properties. These properties are our bridge to reconciling stellar observations with stellar physics and to understanding the distribution of stellar populations within the context of galaxy formation. Numerous ongoing and upcoming missions are charting brightness fluctuations of stars over time, which encode information about physical processes such as the rotation period, evolutionary state (such as effective temperature and surface gravity), and mass (via asteroseismic parameters). Here, we explore how well we can predict these stellar properties, across different evolutionary states, using only photometric time-series data. To do this, we implement a convolutional neural network, and with data-driven modeling we predict stellar properties from light curves of various baselines and cadences. Based on a single quarter of Kepler data, we recover the stellar properties, including the surface gravity for red giant stars (with an uncertainty of ≲0.06 dex) and the rotation period for main-sequence stars (with an uncertainty of ≲5.2 days, unbiased from ≈5 to 40 days). Shortening the Kepler data to a 27-day Transiting Exoplanet Survey Satellite-like baseline, we recover the stellar properties with a small decrease in precision, ∼0.07 for log g and ∼5.5 days for P_rot, unbiased from ≈5 to 35 days. Our flexible data-driven approach leverages the full information content of the data, requires minimal or no feature engineering, and can be generalized to other surveys and data sets. This has the potential to provide stellar property estimates for many millions of stars in current and future surveys.

... While fully-connected networks, including shallow networks, are universal approximators (Cybenko, 1989; Hornik, 1991) of continuous functions, they are largely limited in theory and in practice. Classic results (Mhaskar, 1996; Maiorov & Pinkus, 1999; Maiorov, 1999; Hanin & Sellke, 2017) show that, in the worst case, approximating r-times continuously differentiable target functions (with bounded derivatives) using fully-connected networks requires Θ(ε^{−d/r}) parameters, where d is the input dimension and ε is the approximation rate. ...

In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones, as they consider the norms of the convolutional filters instead of the norms of the associated Toeplitz matrices, independently of weight sharing between neurons. As we show theoretically, these bounds may be orders of magnitude better than standard norm-based generalization bounds and empirically, they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.

... Neural networks approximate functions to arbitrary precision [10,14,16] by learning from sample points during training [9]. A feedforward neural network is a directed acyclic graph with the edges having weights and the vertices (neurons) having biases and being structured in layers. ...

Verification of neural networks relies on activation functions being piecewise affine (pwa) -- enabling an encoding of the verification problem for theorem provers. In this paper, we present the first formalization of pwa activation functions for an interactive theorem prover tailored to verifying neural networks within Coq using the library Coquelicot for real analysis. As a proof-of-concept, we construct the popular pwa activation function ReLU. We integrate our formalization into a Coq model of neural networks, and devise a verified transformation from a neural network N to a pwa function representing N by composing pwa functions that we construct for each layer. This representation enables encodings for proof automation, e.g. Coq's tactic lra -- a decision procedure for linear real arithmetic. Further, our formalization paves the way for integrating Coq in frameworks of neural network verification as a fallback prover when automated proving fails.
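The layer-by-layer pwa representation described above can be sketched numerically (this is a hedged stand-in, not the Coq development): a ReLU network is the composition of per-layer piecewise-affine maps x ↦ relu(Wx + b), so composing the per-layer maps must agree with direct evaluation of the whole network.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)     # the pwa activation formalized in the paper

def layer(W, b):
    """One piecewise-affine layer: x -> relu(W @ x + b)."""
    return lambda x: relu(W @ x + b)

def compose(fs):
    """Compose the per-layer pwa maps into a single function."""
    def g(x):
        for f in fs:
            x = f(x)
        return x
    return g

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]   # a tiny 3 -> 4 -> 2 net
bs = [rng.normal(size=4), rng.normal(size=2)]
net = compose([layer(W, b) for W, b in zip(Ws, bs)])      # pwa representation

x = rng.normal(size=3)
direct = relu(Ws[1] @ relu(Ws[0] @ x + bs[0]) + bs[1])    # direct evaluation
```

The equality of `net(x)` and `direct` is exactly the correctness property the verified transformation establishes, layer by layer, inside Coq.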

... Despite the empirical success of deep operator learning in many applications, the statistical learning theory is very limited, especially when the ambient space is infinite-dimensional. Generally speaking, the learning theory consists of three parts: the approximation theory that quantifies the expressibility of various DNNs as surrogates for a class of operators, the optimization theory that analyzes various optimization schemes and the non-convex nature of the optimization task, and the generalization theory that assesses the discrepancy when only finitely many training samples are available. The approximation theory of DNNs originates from the universal approximation theory for certain classes of functions [18,19], and was then extended to other classes such as continuous functions [20,21,22], certain smooth functions [23,24,25], and functions with integral representations [26,27]. In comparison to the many theoretical works on approximation theory for high-dimensional functions, the approximation theory for operators, especially between infinite-dimensional spaces, is, however, very limited. ...

Deep neural networks (DNNs) have seen tremendous success in many fields and their developments in PDE-related problems are rapidly growing. This paper provides an estimate for the generalization error of learning Lipschitz operators over Banach spaces using DNNs with applications to various PDE solution operators. The goal is to specify DNN width, depth, and the number of training samples needed to guarantee a certain testing error. Under mild assumptions on data distributions or operator structures, our analysis shows that deep operator learning can have a relaxed dependence on the discretization resolution of PDEs and, hence, lessen the curse of dimensionality in many PDE-related problems. We apply our results to various PDEs, including elliptic equations, parabolic equations, and Burgers equations.

... Artificial neural networks (ANN) are nonlinear mathematical models inspired by the function of the biological nervous system and reported to be universal approximators (Hornik et al., 1989; Hornik, 1991). From a computational point of view, an ANN model is a network of highly interconnected neurons structured in a suite of parallel layers, each possessing a fixed number of neurons with a specific role (Amiri et al., 2013). ...

... It is often the case that, in addition to multiplication by weights v_ij or w_jk, there is a translation by adding a_ij or b_jk, called a 'bias'. MLPs are practically seen as universal function approximators [20], given a non-linear activation function and a large number of hidden units. They are used in the experiments in Chapters 4 and 5. ...

Robustness of predictive deep models is a challenging problem with many implications. It is of particular importance when models are used in safety-critical applications, such as healthcare. However, there is yet to be agreement on a comprehensive definition on what it means for a model to be robust, and a theory on why these issues arise. Given the general nature of the problem, existing work related to robustness is spread across different areas of research. Existing research has considered a range of robustness aspects, for instance robustness to small input perturbations, which arise from the study of adversarial examples, but there is also robustness to different domains for the same task, and robustness issues which arise from object placement, transplanting, lighting, weather conditions, or object style, as some examples. This thesis explores a formulation of robustness in terms of the assumed structural causal model (SCM) which generates the observed data. The SCM allows these different types of robustness issues to be viewed in a unifying way. Using this view, this work furthers the connection between prediction robustness and the assumed structural causal model by suggesting that optimising for prediction performance across a diverse set of distributions from the same SCM will move the model closer to the causal predictor of the target variable, providing a theoretical foundation to optimise purely for prediction in the setting where training and testing data are not independently and identically distributed. Formulating robustness in this way suggests that large deep models should, in general, be more susceptible to robustness issues; while some of these issues have been observed in applications such as computer vision, it has been less discussed in others.
We investigate the robustness of state-of-the-art (SotA) deep classifiers in human activity recognition using a newly proposed benchmark informed by the causal formulation, and show that a simpler model is at least as robust as SotA deep models while being at least ten times faster to train. The causal view of robustness additionally hints at the idea that less data can be beneficial for robustness, contrary to the popular belief that more data is always better. To test this idea, a data selection algorithm is proposed based on inverting the idea of a popular causal inference procedure for tabular data. The robustness of a model trained on the selected subset of data is evaluated through synthetic and semi-synthetic data experiments. Under certain conditions, the data subset improves robustness and, subsequently, data efficiency.

... It is the case for models such as SVM [44] and ARIMA [35], as well as probabilistic ones involving hidden Markov models [54] or fuzzy logic [7]. Finally, Artificial Neural Networks (ANNs) have recently been widely adopted to provide fast and reliable approximations of PDE solutions, thanks to the universal approximation theorem [27], which has led to different proposals on the topic [31] [1] [8]. Specifically, ANNs provided with internal recurrence mechanisms have gradually become the standard for time series prediction when dealing with large amounts of training data [15] [43] [46]. ...

Parametric time-dependent systems are of crucial importance in modeling real phenomena, which are often characterized by non-linear behavior. Their solutions are typically difficult to generalize across a sufficiently wide parameter space with the limited computational resources available. We therefore present a general two-stage deep learning framework able to perform this generalization with low computational effort. It consists of the separate training of two pipelined predictive models. First, a certain number of independent neural networks are trained with datasets taken from different subsets of the parameter space. Subsequently, a second predictive model is specialized to properly combine the first-stage guesses and compute the final predictions. Promising results are obtained by applying the framework to the incompressible Navier-Stokes equations in a cavity (Rayleigh-Bénard cavity), obtaining a 97% reduction in computational time, compared with its numerical resolution, for a new value of the Grashof number.

... The proof of Theorem 1 relies on the fact that a piecewise-linear function cannot be equal to a function exhibiting smooth curves. However, it is known that neural networks, provided with enough capacity, can approximate any function with arbitrary precision (Hornik, 1991). We address this point in Section 3, where we discuss the implications of Theorem 1 regarding the capacity requirements of tightly certifiable networks. ...

Certified defenses against small-norm adversarial examples have received growing attention in recent years, though certified accuracies of state-of-the-art methods remain far below their non-robust counterparts, despite the fact that benchmark datasets have been shown to be well-separated at far larger radii than the literature generally attempts to certify. In this work, we offer insights that identify potential factors in this performance gap. Specifically, our analysis reveals that piecewise linearity imposes fundamental limitations on the tightness of leading certification techniques. These limitations are felt in practical terms as a greater need for capacity in models hoped to be certified efficiently; this is in addition to the capacity necessary to learn a robust boundary, studied in prior work. However, we argue that addressing the limitations of piecewise linearity by scaling up model capacity may give rise to difficulties, particularly regarding robust generalization. We therefore conclude by suggesting that developing smooth activation functions may be the way forward for advancing the performance of certified neural networks.

... Recently, with the rapid development of machine learning (ML) and deep learning (DL) [22][23][24][25], these methods have gradually been used to explore the forward problems of ordinary differential equations (ODEs) and partial differential equations (PDEs) in the field of scientific computing [26]. It is the existence of the universal approximation theorem [27][28][29][30][31] that makes it natural to use neural networks as general-purpose function approximators. Neural network solvers for ODEs/PDEs can be broadly divided into two main categories. ...

... The flexibility of FermiNet hinges on the universal approximation property of neural networks [17,18], which makes them a versatile tool for approximating highdimensional functions and has led to radical advances in many computational fields [19][20][21][22]. This success has motivated the application of neural networks to solving problems across the physical sciences, including quantum mechanics [12,[23][24][25]. ...

Deep neural networks have been very successful as highly accurate wave function Ansätze for variational Monte Carlo calculations of molecular ground states. We present an extension of one such Ansatz, FermiNet, to calculations of the ground states of periodic Hamiltonians, and study the homogeneous electron gas. FermiNet calculations of the ground-state energies of small electron gas systems are in excellent agreement with previous initiator full configuration interaction quantum Monte Carlo and diffusion Monte Carlo calculations. We investigate the spin-polarized homogeneous electron gas and demonstrate that the same neural network architecture is capable of accurately representing both the delocalized Fermi liquid state and the localized Wigner crystal state. The network converges on the translationally invariant ground state at high density and spontaneously breaks the symmetry to produce the crystalline ground state at low density, despite being given no a priori knowledge that a phase transition exists.

... Deep learning is a subfield of machine learning in which algorithms based on neural networks are trained to act as powerful function approximators; it can be shown that under certain conditions neural networks are universal approximators [Hornik, 1991], making them powerful tools in machine learning. ...

Latent variable models (LVMs) provide an elegant, efficient, and interpretable approach to learning the generation process of observed data. Latent variables can capture salient features within often highly-correlated data, forming powerful tools in machine learning. For high-dimensional data, LVMs are typically parameterised by deep neural networks, and trained by maximising a variational lower bound on the data log likelihood. These models often suffer from poor use of their latent variable, with ad-hoc annealing factors used to encourage retention of information in the latent variable. In this work, we first introduce a novel approach to latent variable modelling, based on an objective that encourages both data reconstruction and generation. This ensures by design that the latent representations capture information about the data. Second, we consider a novel approach to inducing local differential privacy (LDP) in high dimensions with a specifically-designed LVM. LDP offers a rigorous approach to preserving one’s privacy against both adversaries and the database administrator. Existing LDP mechanisms struggle to retain data utility in high dimensions owing to prohibitive noise requirements. We circumvent this by inducing LDP on the low-dimensional manifold underlying the data. Further, we introduce a novel approach for downstream model learning using LDP training data, enabling the training of performant machine learning models. We achieve significant performance gains over current state-of-the-art LDP mechanisms, demonstrating far-reaching implications for the widespread practice of data collection and sharing. Finally, we scale up this approach, adapting current state-of-the-art representation learning models to induce LDP in even higher-dimensions, further widening the scope of LDP mechanisms for high-dimensional data collection.
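The noise-addition idea behind LDP can be illustrated with the classic scalar Laplace mechanism. This is a generic sketch of differentially private noise addition, not the thesis's manifold-based mechanism; the query value, sensitivity, and epsilon below are hypothetical.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Add Laplace noise with scale = sensitivity / epsilon to a scalar query."""
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale)

rng = np.random.default_rng(0)
# Hypothetical scalar query with sensitivity 1 released under epsilon = 0.5.
private = laplace_mechanism(0.7, sensitivity=1.0, epsilon=0.5, rng=rng)
```

The noise scale grows with the query's sensitivity, and releasing every coordinate of a high-dimensional record this way compounds the distortion, which is the utility problem the thesis circumvents by adding noise on a low-dimensional manifold instead.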

... However, several approaches have been investigated to guarantee stability for closed-loop systems using neural networks. The first approach is to exploit the theorems by [38] and [64] about the universal approximation ability of multi-layer feed-forward neural networks. They have shown that any continuous mapping over a compact domain can be approximated as accurately as necessary by a feed-forward neural network with one hidden layer. ...
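The cited approximation ability can be demonstrated numerically: fix random hidden-layer parameters of a one-hidden-layer tanh network and fit only the linear output weights to a continuous target on a compact interval. All sizes and scales below are illustrative assumptions.

```python
import numpy as np

# A single hidden layer with a bounded, nonconstant activation (tanh) can
# approximate a continuous mapping on a compact set. Here the hidden
# parameters are drawn at random and only the output weights are fitted.
rng = np.random.default_rng(42)
n_hidden = 50
a = rng.normal(scale=2.0, size=n_hidden)    # hidden weights (illustrative scale)
b = rng.uniform(-3.0, 3.0, size=n_hidden)   # hidden biases

x = np.linspace(-np.pi, np.pi, 200)         # compact domain
H = np.tanh(np.outer(x, a) + b)             # hidden-layer outputs, shape (200, 50)
target = np.sin(x)                          # continuous target mapping

w, *_ = np.linalg.lstsq(H, target, rcond=None)  # fit linear output weights
approx = H @ w
max_err = float(np.max(np.abs(approx - target)))
```

Increasing `n_hidden` drives the achievable error down, which is the practical content of the universal approximation theorems referenced here.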

The control of manufacturing processes must satisfy high quality and efficiency requirements while meeting safety requirements. A broad spectrum of monitoring and control strategies, such as model- and optimization-based controllers, are utilized to address these issues. Driven by rising demand for flexible yet energy- and resource-efficient operations, existing approaches are challenged by high uncertainties and changes. Machine learning algorithms are becoming increasingly important in tackling these challenges, especially due to the growing amount of available data. The ability to adapt automatically and to learn from human operators offers new opportunities to increase efficiency while providing flexible operation. Combining machine learning algorithms with safe or robust control offers novel reliable operation methods. This chapter highlights ways to fuse machine learning and control for the safe and improved operation of chemical and biochemical processes. We outline and summarize both learning models for control and learning the control components. We offer a general overview, including a literature review, to provide a guideline for utilizing machine learning techniques in control structures.

... Fully-connected neural networks (FCNNs) have been very successful for high-dimensional complex functions, due to their universal approximation property [Cybenko 1989; Hassoun et al. 1995; Hornik 1991; Kubat 1999]. Despite their success in approximation, their complicated deep layers with massively connected neurons prevent us from interpreting the results. ...

Algorithm fairness in the application of artificial intelligence (AI) is essential for a better society. As the foundational axiom of social mechanisms, fairness consists of multiple facets. Although the machine learning (ML) community has focused on intersectionality as a matter of statistical parity, especially in discrimination issues, an emerging body of literature addresses another facet: monotonicity. Based on domain expertise, monotonicity plays a vital role in numerous fairness-related areas, where violations could misguide human decisions and lead to disastrous consequences. In this paper, we first systematically evaluate the significance of applying monotonic neural additive models (MNAMs), which use a fairness-aware ML algorithm to enforce both individual and pairwise monotonicity principles, for the fairness of AI ethics and society. We have found, through a hybrid method of theoretical reasoning, simulation, and extensive empirical analysis, that considering monotonicity axioms is essential in all areas of fairness, including criminology, education, health care, and finance. Our research contributes to interdisciplinary research at the interface of AI ethics, explainable AI (XAI), and human-computer interaction (HCI). By demonstrating the catastrophic consequences when monotonicity is not met, we address the significance of monotonicity requirements in AI applications. Furthermore, we demonstrate that MNAMs are an effective fairness-aware ML approach, imposing monotonicity constraints that integrate human intelligence.

... For all our investigated colour samples, which mainly include natural colours, the reflectances are smooth continuous functions. It was demonstrated by Hornik [25] that multilayer feedforward networks are, under very general conditions on the hidden unit activation function (continuous, bounded, and nonconstant), universal approximators provided that sufficiently many hidden units are available. The single-layer ANN RS reconstruction performance has also been explored in our previous experiment, whereas the question of possible efficiency improvement by adding HLs of neurons remained open; an upgrade could be expected with even one additional HL. ...

Knowledge of surface reflection of an object is essential in many technological fields, including graphics and cultural heritage. Compared to direct multi- or hyper-spectral capturing approaches, commercial RGB cameras allow for a high resolution and fast acquisition, so the idea of mapping this information into a reflectance spectrum (RS) is promising. This study compared two modelling approaches based on a training set of RGB-reflectance pairs, one implementing artificial neural networks (ANN) and the other one using multivariate polynomial approximation (PA). The effect of various parameters was investigated: the ANN learning algorithm—standard backpropagation (BP) or Levenberg-Marquardt (LM), the number of hidden layers (HLs) and neurons, the degree of multivariate polynomials in PA, the number of inputs, and the training set size on both models. In the two-layer ANN with significantly fewer inputs than outputs, a better MSE performance was found where the number of neurons in the first HL was smaller than in the second one. For ANNs with one and two HLs with the same number of neurons in the first layer, the RS reconstruction performance depends on the choice of BP or LM learning algorithm. RS reconstruction methods based on ANN and PA are comparable, but the ANN models’ better fine-tuning capabilities enable, under realistic constraints, finding ANNs that outperform PA models. A profiling approach was proposed to determine the initial number of neurons in HLs—the search centre of ANN models for different training set sizes.

... The theoretical basis of artificial neural networks is the universal approximation theorem [42,43]. Simply put, a carefully designed feedforward neural network can accurately approximate the mapping from input to output, which is consistent with the original intent. The PointNet neural network is used to construct a specific mapping relationship to replace the CFD solver. ...

A PointNet-based data-driven neural network model is proposed, which takes the film hole geometry variables and flow conditions as inputs to reconstruct the adiabatic cooling effectiveness distribution. The model aims to realize rapid reconstruction of the film cooling effectiveness field under complex and variable working conditions with a more flexible data organizational form. The dataset is derived from numerical simulations of the jet under crossflow. Select unstructured grid nodes are used to form point clouds for network training. The PointNet architecture includes two modules to extract the global features of the input point cloud and calculate the adiabatic efficiency. The responsiveness of the model to different variables is evaluated from the effectiveness contours, centerline, and laterally-averaged effectiveness plots. Further, correlation analysis is used to evaluate the accuracy of model predictions. Over the entire dataset, the mean correlation coefficient is 0.99, indicating that the model has a satisfactory ability to reconstruct and predict the effectiveness field. The main contribution from the area around the film holes to the cooling effectiveness distribution is further confirmed via critical point analysis.

... Now that deep learning applications at TotalEnergies have been introduced, the following chapter will describe the deep learning workload in more detail. The next one is chapter 4, Deep Learning. In the mathematical theory of artificial neural networks, the universal approximation theorem [65] states that "a neural network of a single hidden layer containing a finite number of neurons can approximate continuous functions over compact subsets of R^n". In other words, a neural network can learn any continuous mapping from data samples x and their corresponding labels y by fitting its thousands of internal parameters. ...

Ensembles of Deep Neural Networks (DNNs) combine the predictions of multiple DNNs to improve over a single DNN by reducing generalization error. They have the potential to support the most complex and strategic applications, such as ML applications for the energy industry. However, ensembles of DNNs raise multiple open questions, such as how to run and scale the steps of their life cycle (training and inference) efficiently on HPC resources, how to train multiple independent DNNs, how to select a performant ensemble, and how to serve the built ensemble's predictions to the client application. In this work, we propose procedures to build an accurate ensemble of DNNs while accelerating several stages of its life cycle. Our procedures automatically find good ensembles by searching for complementary DNNs, balancing individual accuracy and computing cost. With regard to the two objectives of accuracy and inference speed, we found that our procedure for building ensembles of DNNs brings advantages in computing speed, accuracy, reproducibility, and resource consumption. These techniques have been demonstrated on practical applications in both supervised image classification and optimal facility control by deep reinforcement learning. Experimental validation has been performed on multiple energy applications towards a low-carbon future.

... Deep learning is capable of optimizing a non-linear function from large amounts of training data. Theoretically, deep neural networks can approximate non-linear mappings of any complexity (Cybenko, 1989; Hornik, 1991), and deep learning has been widely used in oceanography, e.g., automatic detection and prediction of mesoscale eddies (Zeng et al., 2015; Xu et al., 2019), prediction of the El Niño-Southern Oscillation, studies of climate model parameter sensitivity, and parameterization of unresolved atmospheric processes (Ham et al., 2019; Esteves et al., 2019; Anderson and Lucas, 2018). Bolton and Zanna (2019) demonstrated the powerful potential of deep learning for estimating ocean currents using satellite observations. ...

The Indonesian Throughflow (ITF) connects the tropical Pacific and Indian Oceans and is critical to the regional to global climate system. Previous research indicates that the Indo-Pacific pressure gradient is a major driver of the ITF, implying the possibility of forecasting ITF transport with sea surface height (SSH) of the Indo-Pacific Ocean. Here we use a deep learning approach with a Convolutional Neural Network (CNN) model to reproduce the ITF transport. The CNN model is trained with a random selection of the Coupled Model Intercomparison Project Phase 6 (CMIP6) simulations and verified with the residual components of the CMIP6 simulations. A test of the training results shows that the CNN model with SSH is able to reproduce about 90% of the total variance of ITF transport. The CNN model trained with CMIP6 is then transferred to the Simple Ocean Data Assimilation (SODA) dataset, and we find that the transferred model reproduces about 80% of the total variance of ITF transport in SODA. A time series of ITF transport, verified by the Monitoring the ITF (MITF) and International Nusantara Stratification and Transport (INSTANT) measurements of ITF, is then produced by the model using satellite observations from 1993 to 2021. We discover that the CNN model can make a valid prediction with a lead time of seven months, implying that the ITF transport can be predicted using the deep learning approach with SSH data.

... A neural network is a nonlinear function composed of simple, interconnected computational elements, or nodes, defined by learned weight parameters and an activation function. A key advantage of neural networks over physical approaches is that they are fast and accurate, and, as universal function approximators (Hornik, 1991), they can empirically learn complex, often indirect and nonlinear dependencies embedded in the data that may be difficult to model physically (W. J. Blackwell & Chen, 2009). Neural networks have attracted increasingly wide use in the sounding and remote sensing community in recent years (F. ...

In recent decades, spaceborne microwave and hyperspectral infrared sounding instruments have significantly benefited weather forecasting and climate science. However, existing retrievals of lower troposphere temperature and humidity profiles have limitations in vertical resolution, and often cannot accurately represent key features such as the mixed layer thermodynamic structure and the inversion at the planetary boundary layer (PBL) top. Because of the existing limitations in PBL remote sensing from space, there is a compelling need to improve routine, global observations of the PBL and enable advances in scientific understanding and weather and climate prediction. To address this, we have developed a new 3D deep neural network (DNN) which enhances detail and reduces noise in Level 2 granules of temperature and humidity profiles from the Atmospheric Infrared Sounder (AIRS)/Advanced Microwave Sounding Unit (AMSU) sounder instruments aboard NASA’s Aqua spacecraft. We show that the enhancement improves accuracy and detail including key features such as capping inversions at the top of the PBL over land, resulting in improved accuracy in estimations of PBL height.

We investigate using deep learning, a type of machine-learning algorithm employing multiple layers of artificial neurons, for the mathematical representation of multigroup cross sections for use in the Griffin reactor multiphysics code for two-step deterministic neutronics calculations. A three-dimensional fuel element typical of a high-temperature gas reactor as well as a two-dimensional sodium-cooled fast reactor lattice are modeled using the Serpent Monte Carlo code, and multigroup macroscopic cross sections are generated for various state parameters to produce a training data set and a separate validation data set. A fully connected, feedforward neural network is trained using the open-source PyTorch machine-learning framework, and its accuracy is compared against the standard piecewise linear interpolation model.
Additionally, we provide in this work a generic technique for propagating the cross-section model errors up to the keff using sensitivity coefficients with the first-order uncertainty propagation rule. Quantifying the eigenvalue error due to the cross-section regression errors is especially practical for appropriately selecting the mathematical representation of the cross sections. We demonstrate that the artificial neural network model produces lower errors and therefore enables better accuracy relative to the piecewise linear model when the cross sections exhibit nonlinear dependencies; especially when a coarse grid is employed, where the errors can be halved by the artificial neural network. However, for linearly dependent multigroup cross sections as found for the sodium-cooled fast reactor case, a simpler linear regression outperforms deeper networks.
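The first-order uncertainty propagation rule mentioned above amounts to a weighted sum of relative cross-section errors and sensitivity coefficients, delta_k/k ≈ sum_i S_i * (delta_sigma_i / sigma_i). The sketch below illustrates the arithmetic only; all coefficient values are hypothetical, not taken from the paper.

```python
import numpy as np

# First-order propagation of cross-section regression errors to k_eff:
# delta_k/k ≈ sum_i S_i * (delta_sigma_i / sigma_i), with S_i the
# sensitivity coefficients. All values are illustrative.
S = np.array([0.45, -0.12, 0.30])            # hypothetical sensitivity coefficients
rel_err = np.array([0.002, 0.010, 0.005])    # hypothetical relative model errors

delta_k_over_k = float(np.sum(S * rel_err))  # signed first-order estimate
```

Because the rule is linear in the regression errors, it gives a cheap screening tool for deciding whether a given cross-section representation is accurate enough for a target eigenvalue tolerance.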

Multi-carrier code-division multiple access (MC-CDMA) systems support multiple users at the same time over the same frequency band. MC-CDMA is a multiple access scheme used in OFDM-based telecommunication systems. Though it is a promising wireless communication technology with high spectral efficiency and system performance, it is prone to multiple access interference (MAI). This paper therefore aims to design an MC-CDMA receiver that mitigates MAI. Classical receivers such as maximal ratio combining (MRC), equal gain combining (EGC), and minimum mean square error (MMSE) fail to cancel MAI when the MC-CDMA system is subject to nonlinear degradations. In this case, neural network (NN) receivers could be a better alternative. The efficiency and effectiveness of the proposed Extreme Learning Machine (ELM)-based receiver are studied thoroughly and explained in detail for MC-CDMA under nonlinear degradations.
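The ELM algorithm named in this abstract trains in closed form: hidden-layer weights are drawn at random and kept fixed, and only the linear output weights are solved by least squares. Below is a generic ELM sketch fitted to a toy nonlinear mapping, not the paper's MC-CDMA receiver; the target function and sizes are illustrative.

```python
import numpy as np

class ELM:
    """Minimal Extreme Learning Machine: random fixed hidden layer,
    output weights solved in closed form via the pseudoinverse."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_in, n_hidden))  # random, never trained
        self.b = rng.normal(size=n_hidden)
        self.beta = None                            # learned output weights

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ y           # least-squares solution
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Toy usage: learn a hypothetical nonlinear distortion of a 2-D input.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.tanh(X[:, 0] + 0.5 * X[:, 1] ** 2)
model = ELM(2, 100).fit(X, y)
mse = float(np.mean((model.predict(X) - y) ** 2))
```

The single least-squares solve is why ELM training is orders of magnitude faster than gradient-based training, which is attractive for a receiver that must be retrained as channel conditions change.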

Since their resurgence in 2012, Deep Neural Networks have become ubiquitous in most disciplines of Artificial Intelligence, such as image recognition, speech processing, and Natural Language Processing. However, over the last few years, neural networks have grown exponentially deeper, involving more and more parameters. Nowadays, it is not unusual to encounter architectures involving several billion parameters, whereas less than ten years ago they mostly contained thousands. This generalized increase in the number of parameters makes such large models compute-intensive and essentially energy-inefficient. This makes deployed models costly to maintain and their use in resource-constrained environments very challenging. For these reasons, much research has been conducted to provide techniques reducing the amount of storage and computation required by neural networks. Among those techniques, neural network pruning, which consists in creating sparsely connected models, has recently been at the forefront of research. However, although pruning is a prevalent compression technique, there is currently no standard way of implementing or evaluating novel pruning techniques, making comparison with previous research challenging. Our first contribution thus concerns a novel description of pruning techniques, developed along four axes, allowing us to unequivocally and completely define currently existing pruning techniques. Those components are: the granularity, the context, the criteria, and the schedule. Defining the pruning problem according to those components allows us to subdivide the problem into four mostly independent subproblems and to better identify potential research lines. Moreover, pruning methods are still in an early development stage, and primarily designed for the research community.
Indeed, most pruning works are usually implemented in a self-contained and sophisticated way, making it troublesome for non-researchers to apply such techniques without having to learn all the intricacies of the field. To fill this gap, we propose the FasterAI toolbox, intended to be helpful to researchers eager to create and experiment with different compression techniques, but also to newcomers who wish to compress their neural networks for concrete applications. In particular, the sparsification capabilities of FasterAI have been built according to the previously defined pruning components, allowing for a seamless mapping between research ideas and their implementation. We then propose four theoretical contributions, each one aiming to provide new insights and improve on state-of-the-art methods along one of the four identified description axes. Those contributions have been realized using the previously developed toolbox, thus validating its scientific utility. Finally, to validate the practical applicability of pruning techniques, we selected a use case: the detection of facial manipulation, also called DeepFake detection. The goal is to demonstrate that the developed tool, as well as the different proposed scientific contributions, is applicable to a complex, real-world problem. This last contribution is accompanied by a proof-of-concept application providing DeepFake detection capabilities in a web-based environment, thus allowing anyone to perform detection on an image or video of their choice. This Deep Learning era has emerged thanks to considerable improvements in high-performance hardware and access to large amounts of data. However, since the decline of Moore's Law, experts suggest that we might observe a shift in how we conceptualize hardware, going from task-agnostic to domain-specialized computations, thus leading to a new era of collaboration between the software, hardware, and machine learning communities.
This new quest for more efficiency will thus undeniably go through neural network compression techniques, and particularly sparse computations.

Five algorithms of Gaussian process regression, artificial neural network (ANN), support vector machine, boosted trees, and genetic algorithm artificial neural networks (GAANN) are used to model high manganese steel's processing parameters, chemical composition, and mechanical properties. The results show that the ANN model optimized by applying the GAANN with topology [25, 25] has the highest prediction accuracy. Based on the network calculated using the GAANN, the price optimization of the target performance is achieved by introducing the price factor and the target performance in the fitness function. The NSGA-II algorithm is applied to design ultra-high manganese steel's processes and chemical composition. The predicted performance is much higher than the highest value in the original data, and the calculation results all have an accuracy of about 94%. The developed material design model is applicable to high manganese steel and can be used to design other alloys, which provides a good direction for machine learning to design multi-component alloy materials.

Perceptual Confidence is the view that our conscious perceptual experiences assign confidence. In previous papers, I motivated it using first‐personal evidence (Morrison, 2016), and Jessie Munton motivated it using normative evidence (Munton, 2016). In this paper, I will consider the extent to which it is motivated by third‐personal evidence. I will argue that the current evidence is supportive but not decisive. I will then describe experiments that might provide stronger evidence. I hope to thereby provide a roadmap for future research.

This textbook shows how to bring theoretical concepts from finance and econometrics to the data. Focusing on coding and data analysis with R, we show how to conduct research in empirical finance from scratch. We start by introducing the concepts of tidy data and coding principles using the tidyverse family of R packages. We then provide the code to prepare common open source and proprietary financial data sources (CRSP, Compustat, Mergent FISD, TRACE) and organize them in a database. We reuse these data in all the subsequent chapters, which we keep as self-contained as possible. The empirical applications range from key concepts of empirical asset pricing (beta estimation, portfolio sorts, performance analysis, Fama-French factors) to modeling and machine learning applications (fixed effects estimation, clustering standard errors, difference-in-difference estimators, ridge regression, Lasso, Elastic net, random forests, neural networks) and portfolio optimization techniques.

Brand image forms an important information basis for strategic decisions in sports marketing, as it reflects the stakeholders' perspective on the brand. Analysing brand image, however, is methodologically complex, which is why the use of artificial neural networks for this purpose is examined in more detail. This artificial intelligence technique makes it possible to model multilayered and non-linear relationships. The conceptual approach is illustrated with the empirical, practical example of the sporting goods manufacturer adidas, by modelling a multilayer artificial neural network between the ratings of specific brand attributes and the overall brand. By analysing the connection weights of the network, the influence of the various brand attributes is measured, yielding concrete implications for sports marketing practice.

Activation functions play critical roles in neural networks, yet current off-the-shelf neural networks pay little attention to the specific choice of activation functions used. Here we show that data-aware customization of activation functions can result in striking reductions in neural network error. We first give a simple linear algebraic explanation of the role of activation functions in neural networks; then, through connection with the Diaconis-Shahshahani Approximation Theorem, we propose a set of criteria for good activation functions. As a case study, we consider regression tasks with a partially exchangeable target function, i.e. $f(u,v,w)=f(v,u,w)$ for $u,v\in \mathbb{R}^d$ and $w\in \mathbb{R}^k$, and prove that for such a target function, using an even activation function in at least one of the layers guarantees that the prediction preserves partial exchangeability for best performance. Since even activation functions are seldom used in practice, we designed the "seagull" even activation function $\log(1+x^2)$ according to our criteria. Empirical testing on over two dozen 9-25 dimensional examples with different local smoothness, curvature, and degree of exchangeability revealed that a simple substitution with the "seagull" activation function in an already-refined neural network can lead to an order-of-magnitude reduction in error. This improvement was most pronounced when the activation function substitution was applied to the layer in which the exchangeable variables are connected for the first time. While the improvement is greatest for low-dimensional data, experiments on the CIFAR10 image classification dataset showed that use of "seagull" can reduce error even for high-dimensional cases. These results collectively highlight the potential of customizing activation functions as a general approach to improve neural network performance.
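The "seagull" activation given in the abstract, log(1 + x^2), is a one-liner; the sketch below implements it and checks the even symmetry f(-x) = f(x) that the abstract's exchangeability argument relies on.

```python
import numpy as np

def seagull(x):
    """The even, smooth "seagull" activation log(1 + x^2) from the abstract."""
    return np.log1p(np.square(x))  # log1p is numerically stable near x = 0

x = np.linspace(-5.0, 5.0, 101)
assert np.allclose(seagull(x), seagull(-x))  # even symmetry: f(-x) == f(x)
```

Using `np.log1p` rather than `np.log(1 + ...)` preserves precision for small inputs, where 1 + x^2 rounds to 1 in floating point.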

Machine learning techniques such as neural networks have the potential to improve the performance and applicability of model predictive control in real-world systems.
However, they also carry the risk of erratic, unpredictable behavior and malfunction of the machine learning component.
Neural networks may fail to predict system behavior, since it is often impossible to provide strict performance or uncertainty bounds. While this challenge can be tackled using robust model predictive control approaches that span a safety net around the machine-learning-supported predictions, doing so can lead to significant performance degradation and infeasibility.
To address this, a safe neural-network-supported learning tube model predictive control scheme is proposed, which bounds the worst-case performance in case of a malfunctioning machine learning component while reducing conservatism. The basic idea is to constrain the neural network to stay in the vicinity of a given nominal model, with the error dynamics incorporated directly into the neural network output function. The error dynamics therefore do not require an additional control input, so input constraint tightening can be omitted. Constraint fulfillment is guaranteed, robust set stability is established for a particular class of learning functions, and an upper bound on the performance under a malfunctioning neural network is given. The method is evaluated in simulations of a rover operating in an uncertain environment.

We give conditions ensuring that multilayer feedforward networks with as few as a single hidden layer and an appropriately smooth hidden layer activation function are capable of arbitrarily accurate approximation to an arbitrary function and its derivatives. In fact, these networks can approximate functions that are not differentiable in the classical sense, but possess only a generalized derivative, as is the case for certain piecewise differentiable functions. The conditions imposed on the hidden layer activation function are relatively mild; the conditions imposed on the domain of the function to be approximated have practical implications. Our approximation results provide a previously missing theoretical justification for the use of multilayer feedforward networks in applications requiring simultaneous approximation of a function and its derivatives.

Contents: Haar Measure and Convolution; The Dual Group and the Fourier Transform; Fourier-Stieltjes Transforms; Positive-Definite Functions; The Inversion Theorem; The Plancherel Theorem; The Pontryagin Duality Theorem; The Bohr Compactification; A Characterization of B(Γ).

In this paper we demonstrate that finite linear combinations of compositions of a fixed, univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables with support in the unit hypercube; only mild conditions are imposed on the univariate function. Our results settle an open question about representability in the class of single hidden layer neural networks. In particular, we show that arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity. The paper discusses approximation properties of other possible types of nonlinearities that might be implemented by artificial neural networks.
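The theorem can be illustrated with a toy 1-D fit. In this sketch the affine parameters of the sigmoidal units are drawn at random (hypothetical values, not from the paper) and only the linear combination weights are fitted by least squares; even so, the uniform error on a continuous target becomes small once enough units are used:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target: a continuous function on the unit interval.
def f(x):
    return np.cos(2 * np.pi * x)

x = np.linspace(0.0, 1.0, 200)

# Single hidden layer of units sigma(a * (x - c)).  Slopes a and
# transition points c are random and fixed (illustrative choices);
# the output weights w are fitted by least squares.
n_hidden = 50
a = rng.uniform(-15.0, 15.0, n_hidden)   # slopes
c = rng.uniform(0.0, 1.0, n_hidden)      # transition points in [0, 1]
H = sigmoid(np.outer(x, a) - a * c)      # (200, n_hidden) design matrix
w, *_ = np.linalg.lstsq(H, f(x), rcond=None)

max_err = np.max(np.abs(H @ w - f(x)))
print(f"max error with {n_hidden} hidden units: {max_err:.2e}")
```

The point is qualitative: finite linear combinations sigma(a x + b) are already expressive enough to drive the uniform error down, exactly as the abstract claims for the general n-dimensional case.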

It is shown that feedforward networks having bounded weights are not unduly restricted, but are in fact universal approximators, provided that the hidden-layer activation function belongs to one of several suitably broad classes of functions: polygonal functions, certain piecewise polynomial functions, or a class of functions analytic on some open interval. These results are obtained by trading bounds on network weights for possible increments to network complexity, as indexed by the number of hidden nodes. The hidden-layer activation functions used include functions not admitted by previous universal approximation results, so the present results also extend the already broad class of activation functions for which universal approximation results are available. A theorem establishing the ability of these networks to learn arbitrary mappings when examples are generated by a stationary ergodic process is also given.

K.M. Hornik, M. Stinchcombe, and H. White (Univ. of California at San Diego, Dept. of Economics Discussion Paper, June 1988; to appear in Neural Networks) showed that multilayer feedforward networks with as few as one hidden layer, no squashing at the output layer, and arbitrary sigmoid activation function at the hidden layer are universal approximators: they are capable of arbitrarily accurate approximation to arbitrary mappings, provided sufficiently many hidden units are available. The present authors obtain identical conclusions but do not require the hidden-unit activation to be sigmoid. Instead, it can be a rather general nonlinear function. Thus, multilayer feedforward networks possess universal approximation capabilities by virtue of the presence of intermediate layers with sufficiently many parallel processors; the properties of the intermediate-layer activation function are not so crucial. In particular, sigmoid activation functions are not necessary for universal approximation.

The authors present a method for constructing a feedforward neural net implementing an arbitrarily good approximation to any L^2 function over (-1, 1)^n. The net uses n input nodes, a single hidden layer whose width is determined by the function to be implemented and the allowable mean square error, and a linear output neuron. Error bounds and an example are given for the method.

Recently, multiple input, single output, single hidden-layer feedforward neural networks have been shown to be capable of approximating a nonlinear map and its partial derivatives. Specifically, neural nets have been shown to be dense in various Sobolev spaces. Building upon this result, we show that a net can be trained so that the map and its derivatives are learned. Specifically, we use a result of Gallant's to show that least squares and similar estimates are strongly consistent in Sobolev norm provided the number of hidden units and the size of the training set increase together. We illustrate these results by an application to the inverse problem of chaotic dynamics: recovery of a nonlinear map from a time series of iterates. These results extend automatically to nets that embed the single hidden layer, feedforward network as a special case.
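A toy version of this Sobolev-style estimation can be sketched as follows. As a simplification of the estimators in the abstract (which train all network weights), the hidden tanh units here have random fixed weights — hypothetical values chosen for illustration — and only the linear output weights are fitted, by least squares on function and derivative values jointly:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 300)
f = np.sin(np.pi * x)            # target map ...
df = np.pi * np.cos(np.pi * x)   # ... and its derivative

# Random fixed tanh hidden units; only the output weights w are
# estimated, by least squares over BOTH f and f' (a Sobolev-norm
# criterion: the fit is penalized on function and derivative values).
n_hidden = 60
a = rng.uniform(-5.0, 5.0, n_hidden)
b = rng.uniform(-3.0, 3.0, n_hidden)
H = np.tanh(np.outer(x, a) + b)   # unit outputs
dH = (1.0 - H ** 2) * a           # their derivatives d/dx tanh(a x + b)
w, *_ = np.linalg.lstsq(np.vstack([H, dH]),
                        np.concatenate([f, df]), rcond=None)

err_f = np.max(np.abs(H @ w - f))
err_df = np.max(np.abs(dH @ w - df))
print(err_f, err_df)   # both the map and its derivative are recovered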

In this paper, we prove that any continuous mapping can be approximately realized by Rumelhart-Hinton-Williams' multilayer neural networks with at least one hidden layer whose output functions are sigmoid functions. The starting point of the proof for the one hidden layer case is an integral formula recently proposed by Irie-Miyake and from this, the general case (for any number of hidden layers) can be proved by induction. The two hidden layers case is proved also by using the Kolmogorov-Arnold-Sprecher theorem and this proof also gives non-trivial realizations.

A theorem is proved to the effect that three-layered perceptrons with an infinite number of computing units can represent arbitrary mappings if the desired mapping and the input-output characteristics of the computing units satisfy some constraints. The proof is constructive, and each coefficient is explicitly presented. The theorem theoretically guarantees a kind of universality for three-layered perceptrons. Although two-layered perceptrons (simple perceptrons) cannot represent arbitrary functions, three layers prove necessary and sufficient. The relationship between the model used in the proof and the distributed storage and processing of information is also discussed.

The authors show that a multiple-input, single-output, single-hidden-layer feedforward network with (known) hardwired connections from input to hidden layer, monotone squashing at the hidden layer and no squashing at the output embeds as a special case a so-called Fourier network, which yields a Fourier series approximation to a given function as its output. By virtue of this embedding, such networks inherit the approximation properties of Fourier series representations. In particular, approximation to any desired accuracy of any square integrable function can be achieved by such a network, using sufficiently many hidden units. In this sense, such networks do not make avoidable mistakes.
Lapedes, A., & Farber, R. (1988). How neural networks work. Technical Report LA-UR-88-418. Los Alamos, NM: Los Alamos National Laboratory.

Maz'ja, V. G. (1985). Sobolev spaces. New York: Springer Verlag.