Conference Paper

Statistical Inference Problems and Their Rigorous Solutions


Abstract

This paper presents direct settings and rigorous solutions of Statistical Inference problems. It shows that rigorous solutions require solving ill-posed Fredholm integral equations of the first kind in the situation where not only the right-hand side of the equation is an approximation, but the operator in the equation is also defined approximately. Using the Stefanyuk-Vapnik theory for solving such operator equations, constructive methods of empirical inference are introduced. These methods are based on a new concept called the V-matrix. This matrix captures geometric properties of the observation data that are ignored by classical statistical methods.
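To make the setting concrete, here is a minimal sketch (an illustration based on the abstract, not quoted from the unavailable full text) of how density estimation becomes a Fredholm integral equation of the first kind with an approximately known right-hand side:

$$\int \theta(t - x)\, p(x)\, dx = F(t), \qquad \theta(u) = 1 \text{ if } u \ge 0, \; 0 \text{ otherwise},$$

where the unknown is the density $p$ and the right-hand side $F$ is available only through the empirical distribution function $F_\ell(t) = \frac{1}{\ell}\sum_{i=1}^{\ell}\theta(t - x_i)$. In related problems such as conditional probability or density ratio estimation, the measure defining the integral operator is itself unknown and must be approximated from data as well, which is what makes the operator, and not just the right-hand side, approximate.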


... Related Work: In recent work, Liu et al. [13] proposed a direct change estimator for graphical models based on the ratio of the probability densities of the two models [9,10,25,26,31]. They focused on the special case of the $\ell_1$ norm, i.e., $\delta\theta^* \in \mathbb{R}^{p^2}$ is sparse, and provided non-asymptotic error bounds for the estimator along with a sample complexity of $n_1 = O(s^2 \log p)$ and $n_2 = O(n_1^2)$ for an unbounded density ratio model, where $s$ is the number of changed edges and $p$ is the number of variables. ...
... In this Section, we first give a brief background on Ising model selection. Then, we explain how to develop the loss function $L(\delta\theta; X_1^{n_1}, X_2^{n_2})$ based on the density ratio [9,10,25,31] to directly estimate $\delta\theta = \theta_1 - \theta_2$, and finally we describe how to solve the optimization problem (1) for any norm $R(\delta\theta)$. ...
Article
We consider the problem of estimating change in the dependency structure between two p-dimensional Ising models, based on respectively $n_1$ and $n_2$ samples drawn from the models. The change is assumed to be structured, e.g., sparse, block sparse, node-perturbed sparse, etc., such that it can be characterized by a suitable (atomic) norm. We present and analyze a norm-regularized estimator for directly estimating the change in structure, without having to estimate the structures of the individual Ising models. The estimator can work with any norm, and can be generalized to other graphical models under mild assumptions. We show that only one set of samples, say $n_2$, needs to satisfy the sample complexity requirement for the estimator to work, and the estimation error decreases as $\frac{c}{\sqrt{\min(n_1,n_2)}}$, where $c$ depends on the Gaussian width of the unit norm ball. For example, for the $\ell_1$ norm applied to an s-sparse change, the change can be accurately estimated with $\min(n_1,n_2) = O(s \log p)$, which is sharper than an existing result of $n_1 = O(s^2 \log p)$ and $n_2 = O(n_1^2)$. Experimental results illustrating the effectiveness of the proposed estimator are presented.
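As a rough sketch of the estimator described in this abstract (the exact loss and the choice of $\lambda_n$ are given in the paper and are treated as assumptions here), the change is recovered from a single regularized convex program:

$$\widehat{\delta\theta} \in \arg\min_{\delta\theta}\; L\big(\delta\theta;\, X_1^{n_1}, X_2^{n_2}\big) + \lambda_n\, R(\delta\theta),$$

where $L$ is a loss built from the density ratio of the two models, $R$ is any suitable (atomic) norm encoding the assumed structure of the change (e.g., $\ell_1$ for sparsity), and $\delta\theta = \theta_1 - \theta_2$ is estimated directly, without estimating $\theta_1$ and $\theta_2$ separately.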
Chapter
In this chapter, an overview of probability theory, statistics and machine learning is given, covering the main ideas and the most popular and widely used methods in this area. As a starting point, randomness and determinism as well as the nature of real-world problems are discussed. Then, the basic and well-known topics of traditional probability theory and statistics, including the probability mass and distribution functions, probability density and moments, density estimation, Bayesian and other branches of probability theory, are recalled, followed by an analysis. The well-known data pre-processing techniques and unsupervised and supervised machine learning methods are covered. These include a brief introduction to distance metrics, normalization and standardization, feature selection, and orthogonalization, as well as a review of the most representative clustering, classification, regression and prediction approaches of various types. In the end, the topic of image processing is also briefly covered, including popular image transformation techniques and a number of image feature extraction techniques at three different levels.
Chapter
In this chapter, we will describe the fundamentals of the proposed new “empirical” approach as a systematic methodology with its nonparametric quantities derived entirely from the actual data, with no subjective and/or problem-specific assumptions made. It has the potential to be a powerful extension of (and/or alternative to) the traditional probability theory, statistical learning and computational intelligence methods. The nonparametric quantities of the proposed new empirical approach include: (1) the cumulative proximity; (2) the eccentricity and the standardized eccentricity; (3) the data density; and (4) the typicality. They can be recursively updated on a sample-by-sample basis, and they have unimodal and multimodal, discrete and continuous forms/versions. The nonparametric quantities are based on ensemble properties of the data and are not limited by prior restrictive assumptions. The discrete version of the typicality resembles the unimodal probability density function, but is in a discrete form. The discrete multimodal typicality resembles the probability mass function.
Article
The article is dedicated to the development of algorithms for recovering the unknown distribution density of experimental data on the basis of nonparametric statistics, i.e. Parzen-Rosenblatt estimates. The results of the algorithms are illustrated by examples of processing gas pipeline pressures and the number of cycles prior to sample failure during durability tests.
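For context, the Parzen-Rosenblatt estimator mentioned above has a simple closed form; a minimal sketch follows, with a Gaussian kernel and a fixed, hand-picked bandwidth, both of which are illustrative choices rather than the article's settings:

```python
import numpy as np

def parzen_rosenblatt_pdf(x_grid, samples, bandwidth=0.5):
    """Parzen-Rosenblatt (kernel) density estimate with a Gaussian kernel.

    p_hat(x) = (1 / (n * h)) * sum_i K((x - x_i) / h)
    """
    samples = np.asarray(samples, dtype=float)
    n, h = samples.size, bandwidth
    # Scaled differences between every evaluation point and every sample.
    u = (x_grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel K(u)
    return kernel.sum(axis=1) / (n * h)

# Example: estimate the density of a bimodal sample on a grid.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])
grid = np.linspace(-6, 6, 200)
density = parzen_rosenblatt_pdf(grid, data, bandwidth=0.3)
print(density.max())
```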
Conference Paper
The objective of this paper is to consider some properties of decisions produced by classifiers that are in consensus. Consensus allows strong classifiers to obtain very reliable classification of the objects on which consensus has been reached. For those objects where consensus is not reached, a reclassification procedure based on other classification algorithms should be applied. Properties of different consensuses are described using an algebraic approach and a performance evaluation routine.
Article
In this paper, we propose an approach to data analysis which is based entirely on the empirical observations of discrete data samples and the relative proximity of these points in the data space. At the core of the proposed new approach is the typicality, an empirically derived quantity that resembles probability. This nonparametric measure is a normalized form of the square centrality (centrality is a measure of closeness used in graph theory). It is also closely linked to the cumulative proximity and eccentricity (a measure of the tail of the distributions that is very useful for anomaly detection and analysis of extreme values). In this paper, we introduce and study two types of typicality, namely its local and global versions. The local typicality resembles the well-known probability density function (pdf), probability mass function, and fuzzy set membership, but differs from all of them. The global typicality, on the other hand, resembles well-known histograms but also differs from them. A distinctive feature of the proposed new approach, empirical data analysis (EDA), is that it is not limited by restrictive and impractical prior assumptions about the data generation model, as the traditional probability theory and statistical learning approaches are. Moreover, it does not require an explicit and binary assumption of either randomness or determinism of the empirically observed data, their independence, or even their number (it can be as low as a couple of data samples). The typicality is considered as a fundamental quantity in pattern analysis, which is derived directly from data and is stated in a discrete form, in contrast to the traditional approach where a continuous pdf is assumed a priori and estimated from data afterward. The typicality introduced in this paper is free from the paradoxes of the pdf. Typicality is objectivist, while the fuzzy sets and the belief-based branch of the probability theory are subjectivist. The local typicality is expressed in a closed analytical form and can be calculated recursively and, thus, computationally very efficiently. The other nonparametric ensemble properties of the data introduced and studied in this paper, namely the square centrality, cumulative proximity, and eccentricity, can also be updated recursively for various types of distance metrics. Finally, a new type of classifier called the naïve typicality-based EDA class is introduced, which is based on the newly introduced global typicality. This is only one of a wide range of possible applications of EDA, including but not limited to anomaly detection, clustering, classification, control, prediction, and rare events analysis, which will be the subject of further research.
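A minimal numerical sketch of the ensemble quantities named above, under the definitions commonly used in the EDA literature (cumulative proximity as the sum of squared distances, eccentricity as its normalized form); the final normalization of the typicality here is an illustrative choice, not necessarily the exact definition from this paper:

```python
import numpy as np
from scipy.spatial.distance import cdist

def eda_quantities(X):
    """Compute EDA-style ensemble quantities for a data matrix X (k x d).

    - cumulative proximity pi_i: sum of squared distances from x_i to all samples
    - eccentricity xi_i:         2 * pi_i / sum_j pi_j (sums to 2 over the data)
    - typicality tau_i:          normalized inverse eccentricity (illustrative
                                 normalization; sums to 1 over the data)
    """
    d2 = cdist(X, X, metric="sqeuclidean")    # pairwise squared distances
    cum_prox = d2.sum(axis=1)                 # cumulative proximity
    ecc = 2.0 * cum_prox / cum_prox.sum()     # eccentricity
    inv = 1.0 / ecc
    typ = inv / inv.sum()                     # discrete typicality (sums to 1)
    return cum_prox, ecc, typ

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[0] = [8.0, 8.0]                             # an obvious anomaly
pi, xi, tau = eda_quantities(X)
print(xi[0], tau[0])                          # large eccentricity, small typicality
```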
Conference Paper
This paper introduces an advanced setting of the machine learning problem in which an Intelligent Teacher is involved. During the training stage, the Intelligent Teacher provides the Student with information that contains, along with the classification of each example, additional privileged information (an explanation) of this example. The paper describes two mechanisms that can be used to significantly accelerate the Student's training: (1) correction of the Student's concepts of similarity between examples, and (2) direct Teacher-Student knowledge transfer.
Article
We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. Since a theorem of Devroye states that no learning algorithm can learn with a uniform rate to the Bayes risk for all probability distributions, we have to restrict the class of considered distributions: in order to obtain fast rates, we assume a noise condition recently proposed by Tsybakov and an approximation condition in terms of the distribution and the reproducing kernel Hilbert space used by the SVM. For Gaussian RBF kernels with varying widths, we propose a geometric noise assumption on the distribution which ensures the approximation condition. This geometric assumption is not in terms of smoothness but describes the concentration of the marginal distribution near the decision boundary. In particular, we are able to describe nontrivial classes of distributions for which SVMs using a Gaussian kernel can learn with almost linear rate.
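For reference, the noise condition mentioned above is usually stated as follows (this is the standard formulation from the literature, not a quote from the paper itself): writing $\eta(x) = P(Y = 1 \mid X = x)$, the distribution has Tsybakov noise exponent $q \ge 0$ if there exists a constant $C > 0$ such that

$$P_X\big(\{x : |2\eta(x) - 1| \le t\}\big) \le C\, t^{q} \quad \text{for all } t > 0.$$

A large $q$ means the posterior probability rarely comes close to $1/2$, i.e., the two classes are well separated near the decision boundary, which is what makes rates faster than $n^{-1/2}$ possible.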
Article
In this letter we discuss a least squares version of support vector machine (SVM) classifiers. Due to equality-type constraints in the formulation, the solution follows from solving a set of linear equations, instead of quadratic programming as for classical SVMs. The approach is illustrated on a two-spiral benchmark classification problem.
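As a rough sketch of why equality constraints lead to a linear system: in the standard least squares SVM dual, training amounts to solving one symmetric linear system. The RBF kernel and the hyperparameter values below are illustrative choices, not the letter's experimental setup.

```python
import numpy as np

def rbf_kernel(A, B, gamma_k=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma_k * d2)

def lssvm_train(X, y, C=10.0, gamma_k=1.0):
    """Solve the LS-SVM classifier dual: one linear system instead of a QP.

    [[0, y^T], [y, Omega + I/C]] @ [b, alpha] = [0, 1]
    with Omega_ij = y_i y_j K(x_i, x_j).
    """
    n = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, gamma_k)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / C
    rhs = np.concatenate([[0.0], np.ones(n)])
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return b, alpha

def lssvm_predict(X_train, y_train, b, alpha, X_new, gamma_k=1.0):
    K = rbf_kernel(X_new, X_train, gamma_k)
    return np.sign(K @ (alpha * y_train) + b)

# Tiny usage example on a linearly inseparable toy problem (XOR-like).
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1., 1., -1., -1.])
b, alpha = lssvm_train(X, y)
print(lssvm_predict(X, y, b, alpha, X))   # expected: [ 1.  1. -1. -1.]
```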
Article
Let $\hat{F}_n$ denote the empirical distribution function for a sample of n i.i.d. random variables with distribution function F. In 1956 Dvoretzky, Kiefer and Wolfowitz proved that $P\big(\sqrt{n} \sup_x(\hat{F}_n(x) - F(x)) > \lambda\big) \leq C \exp(-2\lambda^2)$, where C is some unspecified constant. We show that C can be taken as 1 (as conjectured by Birnbaum and McCarty in 1958), provided that $\exp(-2\lambda^2) \leq \frac{1}{2}$. In particular, the two-sided inequality $P\big(\sqrt{n} \sup_x|\hat{F}_n(x) - F(x)| > \lambda\big) \leq 2 \exp(-2\lambda^2)$ holds without any restriction on $\lambda$. In the one-sided as well as in the two-sided case, the constants cannot be further improved.
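The two-sided inequality with constant 2 is what gives the standard distribution-free confidence band for the empirical CDF; a small sketch of that consequence (the band half-width follows from the inequality by solving $2\exp(-2n\varepsilon^2) = \alpha$):

```python
import numpy as np

def dkw_band(samples, alpha=0.05):
    """Distribution-free confidence band for the CDF from the two-sided
    Dvoretzky-Kiefer-Wolfowitz-Massart inequality:
        P(sup_x |F_n(x) - F(x)| > eps) <= 2 * exp(-2 * n * eps^2),
    so eps = sqrt(log(2 / alpha) / (2 n)) gives a (1 - alpha) band.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    F_n = np.arange(1, n + 1) / n              # empirical CDF at the order statistics
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    lower = np.clip(F_n - eps, 0.0, 1.0)
    upper = np.clip(F_n + eps, 0.0, 1.0)
    return x, F_n, lower, upper

rng = np.random.default_rng(0)
x, F_n, lo, hi = dkw_band(rng.normal(size=1000))
print(F_n[499], lo[499], hi[499])              # empirical CDF near the median with its band
```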
Article
In this paper we study a dual version of the Ridge Regression procedure. It allows us to perform non-linear regression by constructing a linear regression function in a high-dimensional feature space. The feature space representation can result in a large increase in the number of parameters used by the algorithm. In order to combat this "curse of dimensionality", the algorithm allows the use of kernel functions, as used in Support Vector methods. We also discuss a powerful family of kernel functions which is constructed using the ANOVA decomposition method from the kernel corresponding to splines with an infinite number of nodes. This paper introduces a regression estimation algorithm which is a combination of these two elements: the dual version of Ridge Regression is applied to the ANOVA enhancement of the infinite-node splines. Experimental results are then presented (based on the Boston Housing data set) which indicate the performance of this algorithm relative to other algorithms.
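A minimal sketch of the dual form described above; the ANOVA spline kernel from the paper is replaced here by a generic RBF kernel, and the regularization constant is an illustrative choice:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def dual_ridge_fit(X, y, lam=1.0, gamma=0.5):
    """Dual (kernel) ridge regression: alpha = (K + lam * I)^(-1) y,
    so the regression function is f(x) = sum_i alpha_i k(x, x_i)."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return alpha

def dual_ridge_predict(X_train, alpha, X_new, gamma=0.5):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Usage: fit a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
alpha = dual_ridge_fit(X, y, lam=0.1)
print(dual_ridge_predict(X, alpha, np.array([[0.0], [1.5]])))
```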
Article
The Support Vector Machine (SVM) is a new and promising technique for classification and regression, developed by V. Vapnik and his group at AT&T Bell Labs [2, 9]. The technique can be seen as a new training algorithm for Polynomial, Radial Basis Function and Multi-Layer Perceptron networks. SVMs are currently considered slower at runtime than other techniques with similar generalization performance. In this paper we focus on SVM for classification and investigate the problem of reducing its runtime complexity. We present two relevant results: a) the use of SVM itself as a regression tool to approximate the decision surface with a user-specified accuracy; and b) a reformulation of the training problem that yields the exact same decision surface using a smaller number of basis functions. We believe that this reformulation offers great potential for future improvements of the technique. For most of the selected problems, both approaches give reductions of run-time in the 50-95% range ...
Chapter
The Support Vector Machine is a powerful new learning algorithm for solving a variety of learning and function estimation problems, such as pattern recognition, regression estimation, and operator inversion. The impetus for this collection was a workshop on Support Vector Machines held at the 1997 NIPS conference. The contributors, both university researchers and engineers developing applications for the corporate world, form a Who's Who of this exciting new area. Contributors Peter Bartlett, Kristin P. Bennett, Christopher J.C. Burges, Nello Cristianini, Alex Gammerman, Federico Girosi, Simon Haykin, Thorsten Joachims, Linda Kaufman, Jens Kohlmorgen, Ulrich Kreßel, Davide Mattera, Klaus-Robert Müller, Manfred Opper, Edgar E. Osuna, John C. Platt, Gunnar Rätsch, Bernhard Schölkopf, John Shawe-Taylor, Alexander J. Smola, Mark O. Stitson, Vladimir Vapnik, Volodya Vovk, Grace Wahba, Chris Watkins, Jason Weston, Robert C. Williamson
Article
We introduce a constructive setting for the problem of density ratio estimation through the solution of a multidimensional integral equation. In this equation, not only is its right-hand side known approximately, but the integral operator is also defined approximately. We show that this ill-posed problem has a rigorous solution and obtain the solution in a closed form. The key element of this solution is the novel V-matrix, which captures the geometry of the observed samples. We compare our method with previously proposed ones, using both synthetic and real data. Our experimental results demonstrate the good potential of the new approach.
Article
Machine learning is an interdisciplinary field of science and engineering that studies mathematical theories and practical applications of systems that learn. This book introduces theories, methods, and applications of density ratio estimation, a newly emerging paradigm in the machine learning community. Various machine learning problems, such as non-stationarity adaptation, outlier detection, dimensionality reduction, independent component analysis, clustering, classification, and conditional density estimation, can be systematically solved via the estimation of probability density ratios. The authors offer a comprehensive introduction to various density ratio estimators, including methods via density estimation, moment matching, probabilistic classification, density fitting, and density ratio fitting, as well as describing how these can be applied to machine learning. The book also provides mathematical theories for density ratio estimation, including parametric and non-parametric convergence analysis and numerical stability analysis, to complete the first and definitive treatment of the entire framework of density ratio estimation in machine learning.
Article
This report derives explicit solutions to problems involving Tchebycheffian spline functions. We use a reproducing kernel Hilbert space which depends on the smoothness criterion, but not on the form of the data, to solve explicitly Hermite-Birkhoff interpolation and smoothing problems. Sard's best approximation to linear functionals and smoothing with respect to linear inequality constraints are also discussed. Some of the results are used to show that spline interpolation and smoothing is equivalent to prediction and filtering on realizations of certain stochastic processes.
Book
Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Article
We introduce a general constructive setting of the density ratio estimation problem as the solution of a (multidimensional) integral equation. In this equation, not only is its right-hand side known approximately, but the integral operator is also defined approximately. We show that this ill-posed problem has a rigorous solution and obtain the solution in a closed form. The key element of this solution is the novel V-matrix, which captures the geometry of the observed samples. We compare our method with three well-known previously proposed ones. Our experimental results demonstrate the good potential of the new approach.
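For intuition about the V-matrix, a minimal sketch of one commonly quoted form is given below: $V_{ij} = \int \theta(t - x_i)\,\theta(t - x_j)\, d\sigma(t)$, which for data scaled to $[0,1]^d$ and a uniform measure $\sigma$ factorizes into a product over coordinates. The choice of measure here is an illustrative assumption; the papers discuss several options.

```python
import numpy as np

def v_matrix(X):
    """V-matrix under a uniform measure on [0, 1]^d (illustrative choice).

    V_ij = prod_k (1 - max(x_i[k], x_j[k]))
    i.e. the volume of the region lying "above" both x_i and x_j in every
    coordinate, which is how the matrix encodes the geometry of the sample.
    """
    X = np.asarray(X, dtype=float)
    # Scale each coordinate to [0, 1] so the uniform-measure formula applies.
    X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)
    # Pairwise coordinate-wise maxima: shape (n, n, d).
    pair_max = np.maximum(X[:, None, :], X[None, :, :])
    return np.prod(1.0 - pair_max, axis=2)

rng = np.random.default_rng(0)
V = v_matrix(rng.uniform(size=(50, 3)))
print(V.shape, np.allclose(V, V.T))   # (50, 50) True; V is symmetric by construction
```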
Conference Paper
Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.
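A compact statement of the generalized result (paraphrased from the standard formulation; the notation is chosen here for brevity): let $H_k$ be the RKHS of a kernel $k$, let $c$ be an arbitrary loss that depends on $f$ only through its values at the training points, and let $g$ be a strictly monotonically increasing function. Then every minimizer over $f \in H_k$ of

$$c\big((x_1, y_1, f(x_1)), \dots, (x_n, y_n, f(x_n))\big) + g\big(\|f\|_{H_k}\big)$$

admits a representation $f(\cdot) = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i)$, i.e., it lies in the span of the kernel functions centered at the training examples, regardless of the dimensionality of the feature space.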
V. Vapnik, A. Stefanyuk. Nonparametric Methods for Estimating Probability Densities. Automation and Remote Control.
A. Stefanyuk. Estimation of the Likelihood Ratio Function in the “Disorder” Problem of Random Processes. Automation and Remote Control.