Article

The Local Minima of the Error Surface of the 2-2-1 XOR Network.


Abstract

All local minima of the error surface of the 2-2-1 XOR network are described. A local minimum is defined as a point such that all points in a neighbourhood have an error value greater than or equal to the error value at that point. It is proved that the error surface of the two-layer XOR network with two hidden units has a number of regions with local minima. These regions of local minima occur for combinations of the weights from the inputs to the hidden nodes such that one or both hidden nodes are saturated for at least two patterns. However, boundary points of these regions of local minima are saddle points. It is concluded that from each finite point in weight space a strictly decreasing path exists to a point with error zero. This also explains why experiments using higher numerical precision find fewer “local minima”.
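To make the setting concrete, the following is a minimal sketch of the 2-2-1 XOR network and its error surface, assuming logistic sigmoid activations, bias weights on the hidden and output units, and the sum-squared error over the four XOR patterns; the weight layout and names are illustrative, not the paper's notation.

```python
# Minimal sketch of the 2-2-1 XOR network (assumptions: logistic sigmoids,
# biases on hidden and output units, sum-squared error over the 4 patterns).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
T = np.array([0, 1, 1, 0], dtype=float)                      # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(w):
    """Sum-squared error E(w) of the 2-2-1 network; w has 9 components:
    W1 (2x2 input->hidden), b1 (2 hidden biases), w2 (2 hidden->output), b2."""
    W1 = w[:4].reshape(2, 2)
    b1 = w[4:6]
    w2 = w[6:8]
    b2 = w[8]
    H = sigmoid(X @ W1 + b1)          # hidden activations, shape (4, 2)
    y = sigmoid(H @ w2 + b2)          # network outputs, shape (4,)
    return 0.5 * np.sum((y - T) ** 2)

# A hidden node is "saturated" for a pattern when its activation is driven
# numerically to 0 or 1, i.e. its net input is very large in magnitude.
w_saturated = np.array([20.0, 20.0, 20.0, 20.0, -10.0, -10.0, 1.0, 1.0, 0.0])
print(error(w_saturated))  # error at a point where both hidden nodes saturate
```

The example point `w_saturated` illustrates the kind of configuration the abstract refers to: the incoming hidden weights are large enough that both hidden nodes output values numerically close to 0 or 1 for every pattern.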


... Theoretical analysis, however, often relies on unrealistic assumptions, sometimes causing erroneous conclusions. For example, papers were published claiming that XOR has no local minima [9], to be subsequently followed by other publications that explicitly listed all local minima of the XOR problem [10]. Sprinkhuizen-Kuyper et al. [10] have also stated that the listed local minima are in fact saddle points [10]. ...
... For example, papers were published claiming that XOR has no local minima [9], to be subsequently followed by other publications that explicitly listed all local minima of the XOR problem [10]. Sprinkhuizen-Kuyper et al. [10] have also stated that the listed local minima are in fact saddle points [10]. More recent studies confirm that local optima are indeed present in the NN error landscapes [11], although saddle points are likely to become more prevalent as the dimensionality of the problem increases [6,12]. ...
... Hamey [9] claimed that the NN error surface associated with XOR has no local minima. A year later, Sprinkhuizen-Kuyper et al. [10,17] showed that stationary points are present in the XOR NN search space, but that the stationary points are in fact saddle points. A more recent study of the XOR error surface was published by Mehta et al. [18], where techniques developed for potential energy landscapes were used to quantify local minima of the XOR problem under a varied number of hidden neurons and regularisation coefficient values. ...
Article
Quantification of the stationary points and the associated basins of attraction of neural network loss surfaces is an important step towards a better understanding of neural network loss surfaces at large. This work proposes a novel method to visualise basins of attraction together with the associated stationary points via gradient-based stochastic sampling. The proposed technique is used to perform an empirical study of the loss surfaces generated by two different error metrics: quadratic loss and entropic loss. The empirical observations confirm the theoretical hypothesis regarding the nature of neural network attraction basins. Entropic loss is shown to exhibit stronger gradients and fewer stationary points than quadratic loss, indicating that entropic loss has a more searchable landscape. Quadratic loss is shown to be more resilient to overfitting than entropic loss. Both losses are shown to exhibit local minima, but the number of local minima is shown to decrease with an increase in dimensionality. Thus, the proposed visualisation technique successfully captures the local minima properties exhibited by the neural network loss surfaces, and can be used for the purpose of fitness landscape analysis of neural networks.
... Hamey [9] has claimed that the NN error surface associated with XOR has no local minima. A year later, Sprinkhuizen-Kuyper et al. [10,17] have shown that stationary points are present in the XOR NN search space, but that the stationary points are in fact saddle points. A more recent study of the XOR error surface was published by Mehta et al. [18], where techniques developed for potential energy landscapes were used to quantify local minima of the XOR problem under a varied number of hidden neurons and regularisation coefficient values. ...
... indicates that SSE exhibited a much weaker correlation between the gradient and the error when sampled by gradient walks initialised in the [−10, 10] interval. For CE, the positive correlation was still clearly manifested. Thus, CE exhibited a more searchable landscape when sampled by the [−10, 10] ...
Preprint
Full-text available
Quantification of the stationary points and the associated basins of attraction of neural network loss surfaces is an important step towards a better understanding of neural network loss surfaces at large. This work proposes a novel method to visualise basins of attraction together with the associated stationary points via gradient-based random sampling. The proposed technique is used to perform an empirical study of the loss surfaces generated by two different error metrics: quadratic loss and entropic loss. The empirical observations confirm the theoretical hypothesis regarding the nature of neural network attraction basins. Entropic loss is shown to exhibit stronger gradients and fewer stationary points than quadratic loss, indicating that entropic loss has a more searchable landscape. Quadratic loss is shown to be more resilient to overfitting than entropic loss. Both losses are shown to exhibit local minima, but the number of local minima is shown to decrease with an increase in dimensionality. Thus, the proposed visualisation technique successfully captures the local minima properties exhibited by the neural network loss surfaces, and can be used for the purpose of fitness landscape analysis of neural networks.
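The observation that entropic (cross-entropy) loss exhibits stronger gradients than quadratic loss can be illustrated with a standard identity for a sigmoid output unit: under cross-entropy the derivative of the loss with respect to the net input is simply y − t, whereas under sum-squared error it carries an extra factor y(1 − y) that vanishes as the unit saturates. A small sketch follows; the function names are illustrative, not from the papers above.

```python
# Sketch: gradient of quadratic vs. entropic loss w.r.t. the net input z of a
# single sigmoid output unit with target t (standard textbook identities).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_quadratic(z, t):
    # E = 0.5*(y - t)^2  =>  dE/dz = (y - t) * y * (1 - y)
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)

def grad_entropic(z, t):
    # E = -(t*log y + (1-t)*log(1-y))  =>  dE/dz = y - t
    return sigmoid(z) - t

for z in [0.0, 2.0, 6.0, 12.0]:   # increasingly saturated unit, target t = 0
    print(z, grad_quadratic(z, 0.0), grad_entropic(z, 0.0))
# The quadratic-loss gradient collapses towards 0 as the unit saturates,
# while the entropic-loss gradient stays close to 1: a "stronger" gradient.
```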
... The XOR problem: Let us consider a 2-2-1 network used for the XOR problem. The purpose of this experiment is to allow us to verify the results of the proposed approach against previous literature (see Hamey, 1995, 1998; Sprinkhuizen-Kuyper and Boers, 1999), which has investigated analytically the local and global minima in the XOR problem. All network nodes use the logistic sigmoid activation function, inputs to all nodes are in the interval [0, 1] and the desired outputs are the bounds of the interval [0, 1]. ...
... Finally, if we make the assumption that the outputs of the sigmoids are in the interval [0.1, 0.9], the initial box is considered to be [−4.4, 4.4]. Overall, the outcomes of the three experiments are in line with previous research (Hamey, 1995, 1998; Sprinkhuizen-Kuyper and Boers, 1999). Actually, Hamey (1998) applied to the XOR problem a new methodology for the analysis of the error surface and showed that starting from any point with finite weight values, there exists a finite non-ascending trajectory to a point with error equal to zero. ...
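For orientation, one plausible arithmetic reading of the [−4.4, 4.4] box, stated here as an assumption rather than the authors' actual derivation: if each sigmoid output is restricted to [0.1, 0.9], its net input is restricted to [−ln 9, ln 9], and

$$\sigma^{-1}(0.9)=\ln\frac{0.9}{0.1}=\ln 9\approx 2.197,\qquad 2\ln 9\approx 4.39\approx 4.4 .$$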
... Actually, Hamey (1998) applied to the XOR problem a new methodology for the analysis of the error surface and showed that starting from any point with finite weight values, there exists a finite non-ascending trajectory to a point with error equal to zero. Moreover, Sprinkhuizen-Kuyper and Boers (1999) proved that "the error surface of the two-layer XOR network with two hidden units has a number of regions with local minima" and concluded that "from each finite point in weight space, a strictly decreasing path exists to a point with error zero". The conclusions of these papers resulted following an exhaustive analysis of the error surface of the network output. ...
Article
Full-text available
Training a multilayer perceptron (MLP) with algorithms employing global search strategies has been an important research direction in the field of neural networks. Despite a number of significant results, an important matter concerning the bounds of the search region---typically defined as a box---where a global optimization method has to search for a potential global minimizer seems to be unresolved. The approach presented in this paper builds on interval analysis and attempts to define guaranteed bounds in the search space prior to applying a global search algorithm for training an MLP. These bounds depend on the machine precision and the term guaranteed denotes that the region defined surely encloses weight sets that are global minimizers of the neural network's error function. Although the solution set to the bounding problem for an MLP is in general non-convex, the paper presents the theoretical results that help deriving a box which is a convex set. This box is an outer approximation of the algebraic solutions to the interval equations resulting from the function implemented by the network nodes. An experimental study using well known benchmarks is presented in accordance with the theoretical results.
... The entire dataset can be seen in Table 1. Despite being a seemingly trivial problem, the XOR is not linearly separable, and generates a complex error landscape that is still not fully understood [11,28]. ...
... Given the classic XOR problem, a corresponding fully-connected feed-forward NN architecture was chosen. The NN comprised two input units, two hidden units, and one output unit [28]. Bias weights were associated with the hidden and the output units. ...
Conference Paper
Understanding the properties of neural network error landscapes is an important problem faced by the neural network research community. A few attempts have been made in the past to gather insight about neural network error landscapes using fitness landscape analysis techniques. However, most fitness landscape metrics rely on the analysis of random samples, which may not represent the high-dimensional neural network search spaces well. If the random samples do not include areas of good fitness, then the presence of local optima and/or saddle points cannot be quantified. This paper proposes a progressive gradient walk as an alternative sampling algorithm for neural network error landscape analysis. Experiments show that the proposed walk captures areas of good fitness significantly better than the random walks.
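A rough sketch of what such a progressive gradient walk might look like on a neural network error surface follows; the per-dimension step rule, the step-size bound, and the weight bounds are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch of a progressive gradient-guided walk: starting from a random point,
# repeatedly take bounded steps against the sign of the (numerical) gradient
# and record the errors visited. Step sizes and bounds are assumptions.
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w); d[i] = eps
        g[i] = (f(w + d) - f(w - d)) / (2 * eps)
    return g

def progressive_gradient_walk(f, dim, n_steps=200, max_step=0.1, bounds=(-10, 10),
                              rng=np.random.default_rng(0)):
    w = rng.uniform(*bounds, size=dim)          # random starting point
    trace = [f(w)]
    for _ in range(n_steps):
        g = numerical_gradient(f, w)
        step = max_step * rng.uniform(0.0, 1.0, size=dim)   # randomised step per dimension
        w = np.clip(w - np.sign(g) * step, *bounds)         # move against the gradient sign
        trace.append(f(w))
    return np.array(trace)

# Usage (with the `error` function of the 2-2-1 XOR sketch above, 9 weights):
# errors = progressive_gradient_walk(error, dim=9)
```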
... The exclusive OR (XOR) between two boolean variables is a logical operation whose output is true only when the two inputs have different values. This function has been the subject of several studies aimed at gaining insight into properties of loss functions for small neural networks, because of the nonlinear separability of the data [46][47][48][49][50][51]. ...
... It was initially shown that one particular formulation of the loss function for the simplest network required to solve the XOR problem has only one minimum [46][47][48]. Ref. [50] demonstrated the absence of higher local minima in more complex networks (networks with two hidden layers and two units in each layer) as long as the activation units are not saturated [49]; in contrast, when the activation units saturate due to some weights having effectively infinite values, local minima start to appear. The existence of suboptimal local minima in landscapes of more complicated networks (networks with two hidden layers, two units in the first layer and three units in the second layer) was demonstrated in [51], directly contradicting earlier assertions that two-layer neural networks with sigmoid activation functions and $N_h - 1$ hidden nodes do not have suboptimal local minima when the learning is performed with $N_h$ training samples [55]. ...
Article
Full-text available
Training an artificial neural network involves an optimization process over the landscape defined by the cost (loss) as a function of the network parameters. We explore these landscapes using optimisation tools developed for potential energy landscapes in molecular science. The number of local minima and transition states (saddle points of index one), as well as the ratio of transition states to minima, grow rapidly with the number of nodes in the network. There is also a strong dependence on the regularisation parameter, with the landscape becoming more convex (fewer minima) as the regularisation term increases. We demonstrate that in our formulation, stationary points for networks with $N_h$ hidden nodes, including the minimal network required to fit the XOR data, are also stationary points for networks with $N_{h} +1$ hidden nodes when all the weights involving the additional nodes are zero. Hence, smaller networks optimized to train the XOR data are embedded in the landscapes of larger networks. Our results clarify certain aspects of the classification and sensitivity (to perturbations in the input data) of minima and saddle points for this system, and may provide insight into dropout and network compression.
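One hedged way to see the embedding property for a sigmoid network that includes an output bias (a sketch of the argument, not necessarily the authors' formulation): append a hidden node whose incoming weights, bias and outgoing weight $v_{\text{new}}$ are all zero. Its activation is then the constant $\sigma(0)=\tfrac12$ for every pattern, the network output is unchanged, and the new gradient components are

$$\frac{\partial E}{\partial w_{\text{new},i}}=\sum_p \delta_p\, v_{\text{new}}\,\sigma'(0)\,x_{p,i}=0,\qquad \frac{\partial E}{\partial v_{\text{new}}}=\sum_p \delta_p\,\sigma(0)=\tfrac12\,\frac{\partial E}{\partial b_{\text{out}}}=0,$$

where $\delta_p$ is the output-layer error term for pattern $p$; the last equality uses the fact that the output-bias gradient already vanishes at a stationary point of the smaller network. Hence the embedded point is stationary for the larger network as well.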
... One major instrument in such a study is error/loss surface analysis, cf. [11,12,13]. However, even for a very simple problem, such as the classic XOR problem, the error surface analysis is very complicated and it remains hard to draw firm conclusions from the results. ...
... However, even for a very simple problem, such as the classic XOR problem, the error surface analysis is very complicated and it remains hard to draw firm conclusions from the results. Specifically, the works in [14,15,12,16] demonstrate the difficulty of characterising the error surface of an FNN. On the other hand, although the BP algorithm is suspected to be sensitive to initialisations, e.g. ...
Article
Despite the recent great success of deep neural networks in various applications, designing and training a deep neural network is still among the greatest challenges in the field. In this work, we present a smooth optimisation perspective on designing and training multilayer Feedforward Neural Networks (FNNs) in the supervised learning setting. By characterising the critical point conditions of an FNN based optimisation problem, we identify the conditions to eliminate local optima of the corresponding cost function. Moreover, by studying the Hessian structure of the cost function at the global minima, we develop an approximate Newton FNN algorithm, which is capable of alleviating the vanishing gradient problem. Finally, our results are numerically verified on two classic benchmarks, i.e., the XOR problem and the four region classification problem.
... The main problem of machine learning is to minimize the cost function E with a suitable choice of weights W. A gradient method, described above and called backpropagation in the context of neural network training, can get stuck in local minima or take a very long time to run in order to optimize E. This is due to the fact that general properties of the cost surface are usually unknown and only trial-and-error numerical methods are available (see [4], [12], [16], [9], [17], [18], [5]). No theoretical approach is known to provide the exact initial weights in backpropagation with guaranteed convergence to the global minima of E. One of the most powerful techniques used in backpropagation is adaptive learning rate selection [8], where the step size of iterations is gradually raised in order to escape a local minimum. ...
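As a generic illustration of the adaptive-learning-rate idea mentioned above, here is a sketch in which the step size is gradually raised whenever progress stalls, as might happen near a local minimum or plateau; the constants and the stall test are assumptions, not the scheme of reference [8].

```python
# Generic sketch: gradient descent whose learning rate is gradually raised
# when the error stops improving (a possible local minimum or plateau).
# The growth factor and stall tolerance are illustrative assumptions.
import numpy as np

def train_adaptive(f, grad_f, w0, lr=0.01, grow=1.1, tol=1e-8, n_iter=1000):
    w = np.asarray(w0, dtype=float).copy()
    e = f(w)
    for _ in range(n_iter):
        w = w - lr * grad_f(w)        # plain gradient (backpropagation) step
        e_new = f(w)
        if abs(e - e_new) < tol:
            lr *= grow                # progress has stalled: raise the step size
        e = e_new
    return w, e

# Usage (with the `error` sketch above and a numerical or analytic gradient):
# w_final, e_final = train_adaptive(error, lambda w: numerical_gradient(error, w),
#                                   np.zeros(9))
```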
Article
Full-text available
In this paper, we investigate the supervised backpropagation training of multilayer neural networks from a dynamical systems point of view. We discuss some links with the qualitative theory of differential equations and introduce the overfly algorithm to tackle the local minima problem. Our approach is based on the existence of first integrals of the generalised gradient system with build-in dissipation.
... Later on, it has been shown that the feedforward multilayer neural network with enhanced and extended versions of the backpropagation learning algorithm is more suitable for handling complex pattern classification or recognition tasks, in spite of its inherent problems of local minima, slow rate of convergence and no guarantee of convergence, e.g. Sprinkhuizen-Kuyper and Boers (1999), Abarbanel, Talathi, Gibb, and Rabinovich (2005), and Shrivastava and Singh (2011). It has been found that, to overcome the problems of descent-gradient search in a large search space, as in the case of complex pattern recognition tasks tackled with a multilayer feedforward neural network, an evolutionary search algorithm, i.e. the genetic algorithm (GA), is a better alternative, e.g. ...
Article
Full-text available
In this work, the performance of a feedforward neural network with a descent gradient of distributed error and the genetic algorithm (GA) is evaluated for the recognition of handwritten ‘SWARS’ of Hindi curve script. The performance index for the feedforward multilayer neural network is considered here with a distributed instantaneous unknown error, i.e. a different error for each layer. The objective of the GA is to make the search process more efficient in determining the optimal weight vectors from the population. The GA is applied with the distributed error. The fitness function of the GA is taken as the mean of the squared distributed error, which is different for each layer. Hence, convergence is obtained only when the minimum of the different errors is determined. It has been found that the proposed method of a descent gradient of distributed error with the GA, known as the hybrid distributed evolutionary technique for the multilayer feedforward neural network, performs better in terms of accuracy, epochs and the number of optimal solutions for the given training and test pattern sets of the pattern recognition problem.
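For concreteness, a generic sketch of a GA searching for network weight vectors with an error-based fitness follows; the selection, crossover and mutation choices are illustrative assumptions and do not reproduce the paper's hybrid distributed-error scheme.

```python
# Generic sketch of a genetic algorithm searching for network weight vectors.
# Fitness here is simply the network error (lower is better); elitism, uniform
# crossover and Gaussian mutation are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def ga_train(fitness, dim, pop_size=50, n_gen=200, mut_std=0.1, elite=2):
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, dim))
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])       # lower is better
        pop = pop[np.argsort(scores)]                           # sort by fitness
        next_pop = [pop[i].copy() for i in range(elite)]        # keep the elite
        while len(next_pop) < pop_size:
            a, b = pop[rng.integers(0, pop_size // 2, size=2)]  # parents from the better half
            mask = rng.random(dim) < 0.5                        # uniform crossover
            child = np.where(mask, a, b) + rng.normal(0.0, mut_std, size=dim)  # mutation
            next_pop.append(child)
        pop = np.array(next_pop)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)]

# Usage (with the `error` function from the 2-2-1 XOR sketch earlier):
# best_w = ga_train(error, dim=9)
```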
... The major drawbacks of the two-term BP learning algorithm are the problems of local minima and slow convergence speeds, which limit the scope for real-time applications [17]. A local minimum is defined as a point such that all points in a neighborhood have an error value greater than or equal to the error value at that point [18]. However, the GA is particularly good at searching large and complex spaces efficiently to find the global optimum and to achieve convergence. ...
Article
This paper describes the performance evaluation of a feedforward neural network with three different soft computing techniques for the recognition of handwritten English alphabets. Evolutionary algorithms for hybrid neural networks have shown considerable potential in the field of pattern recognition. We have taken five trials and two networks for each of the algorithms: backpropagation, the evolutionary algorithm and the hybrid evolutionary algorithm, respectively. These algorithms have taken a definite lead over the conventional neural network approaches to pattern recognition. It has been found that the feedforward neural network trained by the two evolutionary algorithms achieves better generalization accuracy on character recognition problems. The problem of weights not converging in conventional backpropagation has also been eliminated by using the soft computing techniques. It has been observed that there is more than one converged weight matrix in character recognition for every training set. The results of the experiments show that the hybrid evolutionary algorithm can solve challenging problems most reliably and efficiently. These algorithms have also been compared on the basis of time and space complexity for the training set.
Book
Full-text available
The proceedings of the 4th International Conference on Management, Engineering, Science, Social Science and Humanities (iCon-MESSSH’19), held in Phuket, Thailand during 26-27 July 2019 contains refereed papers that were presented in the conference.
Conference Paper
The classical B-bit parity two-class pattern classification problem can be solved by a multilayer perceptron (MLP) with only H = [(B+1)/2] hidden nodes. By our "symmetric" arguments, we show the following: (1) the H hidden nodes may consist of one "linear" node and (H-1) binary-valued step-functions (i.e., McCulloch-Pitts neurons) for a facile solution (of integer-valued weights); (2) the posed model can be transformed to a well-known triangular-shaped network with (H-1) hidden nodes having direct connections from the B inputs to the terminal node; (3) "weight-sharing" simplifies the structure and thus makes it easier to find a solution; (4) it is possible to find in O(H) time an optimal set of weights for an MLP with H "tanh" hidden nodes; and (5) a new scheme is designed to supply desired outputs to at most (H-1) hidden nodes for developing insensitivity to initial weights. These findings concerning how sigmoid hidden nodes get saturated for a solution are closely related to the plateau phenomena that often occur in MLP-learning on parity. Since the phenomena are related to the saddle-point issue, we investigated the indefiniteness of the Hessian matrix H of the sum-squared-error measure: when H is indefinite, a learning algorithm that exploits negative curvature can efficiently find a separating hyperplane by moving away from the nearest saddle point.
Article
Multi-stage feed-forward neural network (NN) learning with sigmoidal-shaped hidden-node functions is implicitly constrained optimization featuring negative curvature. Our analyses on the Hessian matrix H of the sum-squared-error measure highlight the following intriguing findings: At an early stage of learning, H tends to be indefinite and much better-conditioned than the Gauss-Newton Hessian J^T J. The NN structure influences the indefiniteness and rank of H. Exploiting negative curvature leads to effective learning. All these can be numerically confirmed owing to our stagewise second-order backpropagation; the systematic procedure exploits the NN's "layered symmetry" to compute H efficiently, making exact Hessian evaluation feasible for fairly large practical problems.
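A quick numerical check of these observations is possible on a tiny 2-2-1 sigmoid network: the finite-difference Hessian of the sum-squared error at a random early point typically has eigenvalues of both signs (indefinite), while the Gauss-Newton matrix J^T J is positive semidefinite by construction. This is a hedged sketch using finite differences, not the authors' stagewise second-order backpropagation; the network layout is an illustrative assumption.

```python
# Hedged numerical check: full Hessian H of the sum-squared error vs. the
# Gauss-Newton matrix J^T J for a tiny 2-2-1 sigmoid network (XOR data).
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([0., 1., 1., 0.])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def outputs(w):                 # w = (W1 2x2, b1 2, w2 2, b2 1) -> 9 weights
    H1 = sigmoid(X @ w[:4].reshape(2, 2) + w[4:6])
    return sigmoid(H1 @ w[6:8] + w[8])

sse = lambda w: 0.5 * np.sum((outputs(w) - T) ** 2)

def num_hessian(f, w, h=1e-4):
    n = w.size
    Hm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            Hm[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                        - f(w - ei + ej) + f(w - ei - ej)) / (4 * h * h)
    return Hm

def num_jacobian(f, w, h=1e-6):
    cols = [(f(w + h * e) - f(w - h * e)) / (2 * h) for e in np.eye(w.size)]
    return np.stack(cols, axis=1)

w0 = np.random.default_rng(1).normal(0.0, 0.5, size=9)   # "early" random point
Hfull = num_hessian(sse, w0)
J = num_jacobian(lambda w: outputs(w) - T, w0)
print(np.linalg.eigvalsh(Hfull))    # typically mixed signs: indefinite
print(np.linalg.eigvalsh(J.T @ J))  # all >= 0: positive semidefinite
```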
Article
It was assumed to have been proven that two-layer feedforward neural networks with t-1 hidden nodes, when presented with t input patterns, cannot have any suboptimal local minima on the error surface. In this paper, however, we shall give a counterexample to this assumption. This counterexample consists of a region of local minima with nonzero error on the error surface of a neural network with three hidden nodes when presented with four patterns (the XOR problem). We will also show that the original proof is valid only when an unusual definition of local minimum is used.
Technical Report
Full-text available
The exclusive-or learning task in a feed-forward neural network with two hidden nodes is investigated. Constraint equations are derived which fully describe the finite stationary points of the error surface. It is shown that the stationary points occur in a single connected union of eighteen manifolds. A Taylor series expansion is applied to the network error surface and it is shown that all points within the enumerated manifolds are arbitrarily close to points of lower error. It follows that the finite stationary points of the exclusive-or task are not relative minima. This result is surprising in view of the commonly held belief that the exclusive-or task exhibits local minima. The present result complements a recent result of the author's which proves the absence of regional local minima in the exclusive-or task. 1 Introduction It is well known that back-propagation learning can become trapped when being trained on the exclusive-or task with two hidden nodes (figure 1). However, the...
Article
190 articles about neural network learning algorithms published in 1993 and 1994 are examined for the amount of experimental evaluation they contain. 29% of them employ not even a single realistic or real learning problem. Only 8% of the articles present results for more than one problem using real world data. Furthermore, one third of all articles do not present any quantitative comparison with a previously known algorithm. These results suggest that we should strive for better assessment practices in neural network learning algorithm research. For the long-term benefit of the field, the publication standards should be raised in this respect and easily accessible collections of benchmark problems should be built. Keywords: algorithm evaluation, science, experiment 1 Introduction A large body of research in artificial neural networks is concerned with finding good learning algorithms to solve practical application problems. Such work tries to improve for instance the quality of solu...
Article
A complete solution of the excitation values which may occur at the local minima of the XOR problem is obtained analytically for two-layered networks in the two most commonly quoted configurations, using the gradient backpropagation algorithm. The role of direct connections which bypass the two-layered system is discussed in connection to the XOR problem and other related training tasks.
Article
The artificial neural network with one hidden unit and the input units connected to the output unit is considered. It is proven that the error surface of this network for the patterns of the XOR problem has minimum values with zero error and that all other stationary points of the error surface are saddle points. Also, the volume of the regions in weight space with saddle points is zero; hence, when training this network on the four patterns of the XOR problem using, e.g., backpropagation with momentum, the correct solution with error zero will be reached in the limit with probability one.
Article
We investigate the error surface of the XOR problem for a 2-2-1 network with sigmoid transfer functions. It is proved that all stationary points with finite weights are saddle points with positive error or absolute minima with error zero. So, for finite weights no local minima occur. The proof results from a careful analysis of the Taylor series expansion around the stationary points. For some points coefficients of third or even fourth order in the Taylor series expansion are used to complete the proof. The proofs give a deeper insight into the complexity of the error surface in the neighbourhood of saddle points. These results can guide the research in finding learning algorithms that can handle these kinds of saddle points.
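The shape of the argument can be written schematically (a paraphrase, not the paper's notation): expand the error around a stationary point $w^*$ along a direction $u$,

$$E(w^*+\varepsilon u)=E(w^*)+\tfrac12\varepsilon^2\,u^\top H u+\tfrac16\varepsilon^3\,D^3E(w^*)[u,u,u]+\tfrac1{24}\varepsilon^4\,D^4E(w^*)[u,u,u,u]+O(\varepsilon^5),$$

so that along directions with $u^\top H u = 0$ the sign of the cubic term, or of the quartic term when the cubic also vanishes, determines whether points of lower error exist arbitrarily close to $w^*$.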
Article
This paper presents a case study of the analysis of local minima in feedforward neural networks. Firstly, a new methodology for analysis is presented, based upon consideration of trajectories through weight space by which a training algorithm might escape a hypothesized local minimum. This analysis method is then applied to the well known XOR (exclusive-or) problem, which has previously been considered to exhibit local minima. The analysis proves the absence of local minima, eliciting significant aspects of the structure of the error surface. The present work is important for the study of the existence of local minima in feedforward neural networks, and also for the development of training algorithms which avoid or escape entrapment in local minima.
Article
In this paper it is proved that the error surface of the two-layer XOR network with two hidden units has a number of regions with local minima. These regions of local minima occur for combinations of the weights from the inputs to the hidden nodes such that one or both hidden nodes are saturated (give output 0 or 1) for at least two patterns. However, boundary points of these regions of local minima are saddle points. From these results it can be concluded that from each finite point in weight space a strictly decreasing path exists to a point with error zero. Furthermore we give proofs that points with error zero exist, and that points with the output unit saturated are either saddle points or (local) maxima. In [10] it is proved that stationary points with finite weights are either saddle points or absolute minima. 1 Introduction To investigate the error surfaces of XOR networks thoroughly is important, since Prechelt [5] found in his investigation of articles on learning algorithms...