Preprint

Quantifying the effect of representations on task complexity

Authors:
Preprints and early-stage research may not have been peer reviewed yet.
To read the file of this research, you can request a copy directly from the authors.

Abstract

We examine the influence of input data representations on learning complexity. For learning, we posit that each model implicitly uses a candidate model distribution for unexplained variations in the data, its noise model. If the model distribution is not well aligned to the true distribution, then even relevant variations will be treated as noise. Crucially however, the alignment of model and true distribution can be changed, albeit implicitly, by changing data representations. "Better" representations can better align the model to the true distribution, making it easier to approximate the input-output relationship in the data without discarding useful data variations. To quantify this alignment effect of data representations on the difficulty of a learning task, we make use of an existing task complexity score and show its connection to the representation-dependent information coding length of the input. Empirically we extract the necessary statistics from a linear regression approximation and show that these are sufficient to predict relative learning performance outcomes of different data representations and neural network types obtained when utilizing an extensive neural network architecture search. We conclude that to ensure better learning outcomes, representations may need to be tailored to both task and model to align with the implicit distribution of model and task.

No file available

Request Full-text Paper PDF

To read the file of this research,
you can request a copy directly from the authors.

ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
In this paper we propose an approach to avoiding catastrophic forgetting in sequential task learning scenarios. Our technique is based on a network reparameterization that approximately diagonalizes the Fisher Information Matrix of the network parameters. This reparameterization takes the form of a factorized rotation of parameter space which, when used in conjunction with Elastic Weight Consolidation (which assumes a diagonal Fisher Information Matrix), leads to significantly better performance on lifelong learning of sequential tasks. Experimental results on the MNIST, CIFAR-100, CUB-200 and Stanford-40 datasets demonstrate that we significantly improve the results of standard elastic weight consolidation, and that we obtain competitive results when compared to other state-of-the-art in lifelong learning without forgetting.
Conference Paper
Full-text available
The Multipoint Expected Improvement criterion (or q-EI) has recently been stud-ied in batch-sequential Bayesian Optimization. This paper deals with the new way of computing q-EI, without using Monte-Carlo simulations, through a new closed-form formula. The latter allows a very fast computation of q-EI for reasonably low values of q (typically, less than 10). New parallel kriging-based optimization strategies, tested on a 6-dimensional toy example, show promising results.
Article
Full-text available
Some ancient Greek philosophers and thinkers questioned the geocentric system and proposed instead a heliocentric system. The main proponents of this view - which was seen as heretical at the time - are believed to have been the Pythagoreans Philolaos, Heraclides, Hicetas, and Ecphantos, but mainly Aristarchos of Samos, who placed the Sun in the position of the "central fire" of the Pythagoreans. The geocentric system, reworked by Claudius Ptolemaeus (Ptolemy), was the dominant one for centuries, and it was only during the sixteenth century that the Polish monk-astronomer, Copernicus, revisited the ancient Greek heliocentric views and became the new champion of the theory that we all accept today.
Article
Full-text available
Variable and feature selection have become the focus of much research in areas of application for which datasets with tells or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Article
Full-text available
An accurate and robust face recognition system was developed and tested. This system exploits the feature extraction capabilities of the discrete cosine transform (DCT) and invokes certain normalization techniques that increase its robustness to variations in facial geometry and illumination. The method was tested on a variety of available face databases, including one collected at McGill University. The system was shown to perform very well when compared to other approaches.
Conference Paper
Full-text available
High information redundancy and strong correlations in face images result in inefficiencies when such images are used directly in recognition tasks. In this paper, discrete cosine transforms (DCT) are used to reduce image information redundancy because only a subset of the transform coefficients are necessary to preserve the most important facial features, such as hair outline, eyes and mouth. We demonstrate experimentally that when DCT coefficients are fed into a backpropagation neural network for classification, high recognition rates can be achieved using only a small proportion (0.19%) of available transform components. This makes DCT-based face recognition more than two orders of magnitude faster than other approaches
Article
Full-text available
Human face detection plays an important role in applications such as video surveillance, human computer interface, face recognition, and face image database management. We propose a face detection algorithm for color images in the presence of varying lighting conditions as well as complex backgrounds. Based on a novel lighting compensation technique and a nonlinear color transformation, our method detects skin regions over the entire image and then generates face candidates based on the spatial arrangement of these skin patches. The algorithm constructs eye, mouth, and boundary maps for verifying each face candidate. Experimental results demonstrate successful face detection over a wide range of facial variations in color, position, scale, orientation, 3D pose, and expression in images from several photo collections (both indoors and outdoors)
Article
Full-text available
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day
Article
Full-text available
It is widely believed that to efficiently represent an otherwise smooth object with discontinuities along edges, one must use an adaptive representation that in some sense `tracks' the shape of the discontinuity set. This folk-belief -- some would say folk-theorem -- is incorrect. At the very least, the possible quantitative advantage of such adaptation is vastly smaller than commonly believed. We have recently constructed a tight frame of curvelets which provides stable, efficient, and near-optimal representation of otherwise smooth objects having discontinuities along smooth curves. By applying naive thresholding to the curvelet transform of such an object, one can form m-term approximations with rate of L 2 approximation rivaling the rate obtainable by complex adaptive schemes which attempt to `track' the discontinuity set. In this article we explain the basic issues of efficient m-term approximation, the construction of efficient adaptive representation, the construction ...
Article
Full-text available
We examine in some detail Mel Frequency Cepstral Coefficients (MFCCs) - the dominant features used for speech recognition - and investigate their applicability to modeling music. In particular, we examine two of the main assumptions of the process of forming MFCCs: the use of the Mel frequency scale to model the spectra; and the use of the Discrete Cosine Transform (DCT) to decorrelate the Mel-spectral vectors.
Conference Paper
Despite impressive results in visual-inertial state estimation in recent years, high speed trajectories with six degree of freedom motion remain challenging for existing estimation algorithms. Aggressive trajectories feature large accelerations and rapid rotational motions, and when they pass close to objects in the environment, this induces large apparent motions in the vision sensors, all of which increase the difficulty in estimation. Existing benchmark datasets do not address these types of trajectories, instead focusing on slow speed or constrained trajectories, targeting other tasks such as inspection or driving. We introduce the UZH-FPV Drone Racing dataset, consisting of over 27 sequences, with more than 10 km of flight distance, captured on a first-person-view (FPV) racing quadrotor flown by an expert pilot. The dataset features camera images, inertial measurements, event-camera data, and precise ground truth poses. These sequences are faster and more challenging, in terms of apparent scene motion, than any existing dataset. Our goal is to enable advancement of the state of the art in aggressive motion estimation by providing a dataset that is beyond the capabilities of existing state estimation algorithms.
Article
Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. We then decompose the cross-entropy loss used during training and highlight the presence of an inherent overfitting term. We propose regularizing the loss by bounding such a term in two equivalent ways: One with a Kullbach-Leibler term, which relates to a PAC-Bayes perspective; the other using the information in the weights as a measure of complexity of a learned model, yielding a novel Information Bottleneck for the weights. Finally, we show that invariance and independence of the components of the representation learned by the network are bounded above and below by the information in the weights, and therefore are implicitly optimized during training. The theory enables us to quantify and predict sharp phase transitions between underfitting and overfitting of random labels when using our regularized loss, which we verify in experiments, and sheds light on the relation between the geometry of the loss function, invariance properties of the learned representation, and generalization error.
Book
Mallat's book is the undisputed reference in this field - it is the only one that covers the essential material in such breadth and depth. - Laurent Demanet, Stanford University The new edition of this classic book gives all the major concepts, techniques and applications of sparse representation, reflecting the key role the subject plays in today's signal processing. The book clearly presents the standard representations with Fourier, wavelet and time-frequency transforms, and the construction of orthogonal bases with fast algorithms. The central concept of sparsity is explained and applied to signal compression, noise reduction, and inverse problems, while coverage is given to sparse representations in redundant dictionaries, super-resolution and compressive sensing applications. Features: * Balances presentation of the mathematics with applications to signal processing * Algorithms and numerical examples are implemented in WaveLab, a MATLAB toolbox * Companion website for instructors and selected solutions and code available for students New in this edition * Sparse signal representations in dictionaries * Compressive sensing, super-resolution and source separation * Geometric image processing with curvelets and bandlets * Wavelets for computer graphics with lifting on surfaces * Time-frequency audio processing and denoising * Image compression with JPEG-2000 * New and updated exercises A Wavelet Tour of Signal Processing: The Sparse Way, third edition, is an invaluable resource for researchers and R&D engineers wishing to apply the theory in fields such as image processing, video processing and compression, bio-sensing, medical imaging, machine vision and communications engineering. Stephane Mallat is Professor in Applied Mathematics at École Polytechnique, Paris, France. From 1986 to 1996 he was a Professor at the Courant Institute of Mathematical Sciences at New York University, and between 2001 and 2007, he co-founded and became CEO of an image processing semiconductor company. Includes all the latest developments since the book was published in 1999, including its application to JPEG 2000 and MPEG-4 Algorithms and numerical examples are implemented in Wavelab, a MATLAB toolbox Balances presentation of the mathematics with applications to signal processing.
Book
This revised and expanded monograph presents the general theory for frames and Riesz bases in Hilbert spaces as well as its concrete realizations within Gabor analysis, wavelet analysis, and generalized shift-invariant systems. Compared with the first edition, more emphasis is put on explicit constructions with attractive properties. Based on the exiting development of frame theory over the last decade, this second edition now includes new sections on the rapidly growing fields of LCA groups, generalized shift-invariant systems, duality theory for as well Gabor frames as wavelet frames, and open problems in the field. Key features include: *Elementary introduction to frame theory in finite-dimensional spaces * Basic results presented in an accessible way for both pure and applied mathematicians * Extensive exercises make the work suitable as a textbook for use in graduate courses * Full proofs includ ed in introductory chapters; only basic knowledge of functional analysis required * Explicit constructions of frames and dual pairs of frames, with applications and connections to time-frequency analysis, wavelets, and generalized shift-invariant systems * Discussion of frames on LCA groups and the concrete realizations in terms of Gabor systems on the elementary groups; connections to sampling theory * Selected research topics presented with recommendations for more advanced topics and further readin g * Open problems to stimulate further research An Introduction to Frames and Riesz Bases will be of interest to graduate students and researchers working in pure and applied mathematics, mathematical physics, and engineering. Professionals working in digital signal processing who wish to understand the theory behind many modern signal processing tools may also find this book a useful self-study reference. Review of the first edition: "Ole Christensen’s An Introduction to Frames and Riesz Bases is a first-rate introduction to the field … . The book provides an excellent exposition of these topics. The material is broad enough to pique the interest of many readers, the included exercises supply some interesting challenges, and the coverage provides enough background for those new to the subject to begin conducting original research." — Eric S. Weber, American Mathematical Monthly, Vol. 112, February, 2005
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
Article
The success of hand-crafted machine learning systems in many applications raises the question of making machine learning algorithms more autonomous, i.e., to reduce the requirement of expert input to a minimum. We discuss two strategies towards this goal: (1) automated optimization of hyperparameters (including mechanisms for feature selection, preprocessing, model selection, etc) and (2) the development of algorithms with reduced sets of hyperparameters. Since many research directions (e.g., deep learning), show a tendency towards increasingly complex algorithms with more and more hyperparamters, the demand for both of these strategies continuously increases. We review recent hyperparameter optimization methods and discuss data-driven approaches to avoid the introduction of hyperparameters using unsupervised learning. We end in discussing how these complementary strategies can work hand-in-hand, representing a very promising approach towards autonomous machine learning.
Article
In this paper we investigate the performance of different types of rectified activation functions in convolutional neural network: standard rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), parametric rectified linear unit (PReLU) and a new randomized leaky rectified linear units (RReLU). We evaluate these activation function on standard image classification task. Our experiments suggest that incorporating a non-zero slope for negative part in rectified activation units could consistently improve the results. Thus our findings are negative on the common belief that sparsity is the key of good performance in ReLU. Moreover, on small scale dataset, using deterministic negative slope or learning it are both prone to overfitting. They are not as effective as using their randomized counterpart.
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based an adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also ap- propriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
Article
Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent "High Throughput" methods achieve surprising success--they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
Article
Existing complexity measures from contemporary learning theory cannot be conveniently applied to specific learning problems (e.g. training sets). Moreover, they are typically non-generic, i.e. they necessitate making assumptions about the way in which the learner will operate. The lack of a satisfactory, generic complexity measure for learning problems poses difficulties for researchers in various areas; the present paper puts forward an idea which may help to alleviate these. It shows that supervised learning problems fall into two generic complexity classes, only one of which is associated with computational tractability. By determining which class a particular problem belongs to, we can thus effectively evaluate its degree of generic difficulty.
Article
Estimation of structure, such as in variable selection, graphical modelling or cluster analysis is notoriously difficult, especially for high-dimensional data. We introduce stability selection. It is based on subsampling in combination with (high-dimensional) selection algorithms. As such, the method is extremely general and has a very wide range of applicability. Stability selection provides finite sample control for some error rates of false discoveries and hence a transparent principle to choose a proper amount of regularisation for structure estimation. Variable selection and structure estimation improve markedly for a range of selection methods if stability selection is applied. We prove for randomised Lasso that stability selection will be variable selection consistent even if the necessary conditions needed for consistency of the original Lasso method are violated. We demonstrate stability selection for variable selection and Gaussian graphical modelling, using real and simulated data. Comment: 30 pages, 7 figures
Article
We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cycles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.
Article
Machine learning algorithms frequently require careful tuning of model hyperparameters, regularization terms, and optimization parameters. Unfortunately, this tuning is often a "black art" that requires expert experience, unwritten rules of thumb, or sometimes brute-force search. Much more appealing is the idea of developing automatic approaches which can optimize the performance of a given learning algorithm to the task at hand. In this work, we consider the automatic tuning problem within the framework of Bayesian optimization, in which a learning algorithm's generalization performance is modeled as a sample from a Gaussian process (GP). The tractable posterior distribution induced by the GP leads to efficient use of the information gathered by previous experiments, enabling optimal choices about what parameters to try next. Here we show how the effects of the Gaussian process prior and the associated inference procedure can have a large impact on the success or failure of Bayesian optimization. We show that thoughtful choices can lead to results that exceed expert-level performance in tuning machine learning algorithms. We also describe new algorithms that take into account the variable cost (duration) of learning experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization on a diverse set of contemporary algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.
Article
Variance estimation is a fundamental problem in statistical modelling. In ultrahigh dimensional linear regression where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable. Recent advances in variable selection in ultrahigh dimensional linear regression make this problem accessible. One of the major problems in ultrahigh dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noises are actually predicted when extra irrelevant variables are selected, leading to serious underestimate of the level of noise. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows in advance the mean regression function. The simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared. Their performances can be improved by the reffitted cross-validation method proposed.
Article
We studied a number of measures that characterize the difficulty of a classification problem, focusing on the geometrical complexity of the class boundary. We compared a set of real-world problems to random labelings of points and found that real problems contain structures in this measurement space that are significantly different from the random sets. Distributions of problems in this space show that there exist at least two independent factors affecting a problem's difficulty. We suggest using this space to describe a classifier's domain of competence. This can guide static and dynamic selection of classifiers for specific problems as well as subproblems formed by confinement, projection, and transformations of the feature vectors.
Chapter
This chapter introduces the concept of differential entropy, which is the entropy of a continuous random variable. Differential entropy is also related to the shortest description length, and is similar in many ways to the entropy of a discrete random variable. But there are some important differences, and there is need for some care in using the concept.
Article
This thesis summarizes four of my research projects in machine learning. One of them is on a theoretical challenge of defining and exploring complexity measures for data sets; the others are about new and improved classification algorithms. We first investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by that data set. It is closely related to several existing principles used in machine learning such as Occam's razor, the minimum description length, and the Bayesian approach. We demonstrate the application of the data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Experiments were carried out with a practical complexity measure on several toy problems. We then propose a family of novel learning algorithms to directly minimize the 0/1 loss for perceptrons. A perceptron is a linear threshold classifier that separates examples with a hyperplane. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the 0/1 loss, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost. We also study ensemble methods that aggregate many base hypotheses in order to achieve better performance. AdaBoost is one such method for binary classification problems. The superior out-of-sample performance of AdaBoost has been attributed to the fact that it minimizes a cost function based on the margin, in that it can be viewed as a special case of AnyBoost, an abstract gradient descent algorithm. We provide a more sophisticated abstract boosting algorithm, CGBoost, based on conjugate gradient in function space. When the AdaBoost exponential cost function is optimized, CGBoost generally yields much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power of CGBoost, we can adopt more "regularized" cost functions that have better out-of-sample performance but are difficult to optimize. Our experiments demonstrate that CGBoost generally outperforms AnyBoost in cost reduction. With suitable cost functions, CGBoost can have better out-of-sample performance. A multiclass classification problem can be reduced to a collection of binary problems with the aid of a coding matrix. The quality of the final solution, which is an ensemble of base classifiers learned on the binary problems, is affected by both the performance of the base learner and the error-correcting ability of the coding matrix. A coding matrix with strong error-correcting ability may not be overall optimal if the binary problems are too hard for the base learner. Thus a trade-off between error-correcting and base learning should be sought. In this paper, we propose a new multiclass boosting algorithm that modifies the coding matrix according to the learning ability of the base learner. We show experimentally that our algorithm is very efficient in optimizing the multiclass margin cost, and outperforms existing multiclass algorithms such as AdaBoost.ECC and one-vs-one. The improvement is especially significant when the base learner is not very powerful.
Article
In this paper, an efficient method for high-speed face recognition based on the discrete cosine transform (DCT), the Fisher's linear discriminant (FLD) and radial basis function (RBF) neural networks is presented. First, the dimensionality of the original face image is reduced by using the DCT and the large area illumination variations are alleviated by discarding the first few low-frequency DCT coefficients. Next, the truncated DCT coefficient vectors are clustered using the proposed clustering algorithm. This process makes the subsequent FLD more efficient. After implementing the FLD, the most discriminating and invariant facial features are maintained and the training samples are clustered well. As a consequence, further parameter estimation for the RBF neural networks is fulfilled easily which facilitates fast training in the RBF neural networks. Simulation results show that the proposed system achieves excellent performance with high training and recognition speed, high recognition rate as well as very good illumination robustness.
Article
PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a "posterior distribution" on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.
Face recognition using dct-based feature vectors
  • Christine Irene Podilchuk
  • Xiaoyu Zhang
Christine Irene Podilchuk and Xiaoyu Zhang. Face recognition using dct-based feature vectors, September 1 1998. US Patent 5,802,208.
Faster neural networks straight from jpeg
  • Lionel Gueguen
  • Alex Sergeev
  • Ben Kadlec
  • Rosanne Liu
  • Jason Yosinski
Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu, and Jason Yosinski. Faster neural networks straight from jpeg. In Advances in Neural Information Processing Systems, pages 3933-3944, 2018.
Where is the information in a deep neural network? CoRR, abs/1905.12213
  • Alessandro Achille
  • Stefano Soatto
Alessandro Achille and Stefano Soatto. Where is the information in a deep neural network? CoRR, abs/1905.12213, 2019. URL http://arxiv.org/abs/1905.12213.
Taking the human out of the loop: A review of bayesian optimization
  • Bobak Shahriari
  • Kevin Swersky
  • Ziyu Wang
  • P Ryan
  • Nando De Adams
  • Freitas
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104 (1):148-175, 2015.
  • Adam Gaier
  • David Ha
Adam Gaier and David Ha. Weight agnostic neural networks. arXiv preprint arXiv:1906.04358, 2019.
  • J Ian
  • Goodfellow
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Philipp Jund
  • Nichola Abdo
  • Andreas Eitel
  • Wolfram Burgard
Philipp Jund, Nichola Abdo, Andreas Eitel, and Wolfram Burgard. The freiburg groceries dataset. arXiv preprint arXiv:1611.05799, 2016.