Online Learning of Noisy Data

Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Milan, Italy
IEEE Transactions on Information Theory 01/2012; 57(12):7907-7931. DOI: 10.1109/TIT.2011.2164053

ABSTRACT: We study online learning of linear and kernel-based predictors, when individual examples are corrupted by random noise, and both examples and noise type can be chosen adversarially and change over time. We begin with the setting where some auxiliary information on the noise distribution is provided, and we wish to learn predictors with respect to the squared loss. Depending on the auxiliary information, we show how one can learn linear and kernel-based predictors using just one or two noisy copies of each example. We then turn to a general setting where virtually nothing is known about the noise distribution, and one wishes to learn with respect to general losses using linear and kernel-based predictors. We show how this can be achieved using a random, essentially constant number of noisy copies of each example. Allowing multiple copies cannot be avoided: indeed, we show that the setting becomes impossible when only one noisy copy of each instance can be accessed. To obtain our results we introduce several novel techniques, some of which might be of independent interest.
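The two-copies idea for the squared loss can be illustrated with a toy simulation. This is a minimal sketch, not the paper's algorithm: the hidden linear target, the Gaussian noise model, the step size, and all parameter values below are invented for illustration. The key observation is that with two independent zero-mean noisy copies x1, x2 of an instance x, using one copy inside the residual and the other as the gradient direction yields an unbiased estimate of the clean squared-loss gradient, since E[(w·x1 - y) x2] = (w·x - y) x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy setup: a hidden linear target, with each instance revealed
# to the learner only through two independent noisy copies per round.
d, T, sigma, eta = 5, 20000, 0.5, 0.01
w_true = rng.normal(size=d)
w = np.zeros(d)

for t in range(T):
    x = rng.normal(size=d)                 # clean instance (never shown to the learner)
    y = w_true @ x                         # clean label
    x1 = x + sigma * rng.normal(size=d)    # first noisy copy
    x2 = x + sigma * rng.normal(size=d)    # second, independent noisy copy
    # One copy in the residual, the other as the direction: because the two
    # noise draws are independent and zero-mean, E[(w@x1 - y) * x2] equals
    # (w@x - y) * x, an unbiased estimate of the clean squared-loss gradient.
    w -= eta * 2.0 * (w @ x1 - y) * x2
```

Plugging a single noisy copy into both factors would instead introduce a bias term proportional to the noise covariance, which is one reason the one-copy results in the paper require auxiliary information about the noise distribution.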

Available from: Nicolò Cesa-Bianchi, Jul 11, 2015
  • ABSTRACT: We consider the most common variants of linear regression, including Ridge, Lasso, and support-vector regression, in a setting where the learner may observe only a fixed number of attributes of each example at training time. We present simple and efficient algorithms for these problems: to reach a given accuracy, our Lasso and Ridge regression algorithms need the same total number of attributes (up to constant factors) as full-information algorithms. For support-vector regression, we require exponentially fewer attributes than the state of the art. This resolves an open problem recently posed by Cesa-Bianchi et al. (2010). Experiments support the theoretical bounds, showing performance superior to the state of the art.
  • ABSTRACT: We consider the problem of online estimation of an arbitrary real-valued signal corrupted by zero-mean noise using linear estimators. The estimator must iteratively predict the underlying signal based on the current and several past noisy observations, and its performance is measured by the mean-square error. We design and analyze an algorithm for this task whose total square error on any interval of the signal equals that of the best fixed filter in hindsight for that interval, plus an additional term whose dependence on the total signal length is only logarithmic. This bound is asymptotically tight, and resolves the question of Moon and Weissman ["Universal FIR MMSE filtering," IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1068-1083, 2009]. Furthermore, the algorithm runs in time linear in the number of filter coefficients; previous constructions required at least quadratic time.
    IEEE Transactions on Signal Processing 04/2013; 61(7):1595-1604. DOI: 10.1109/TSP.2012.2234742
  • ABSTRACT: The discovery, extraction, and analysis of knowledge from data generally rely on unsupervised learning methods, in particular clustering approaches. Much recent research in clustering and data engineering has focused on finite mixture models, which allow one to reason in the face of uncertainty and to learn by example. Adopting these models becomes challenging in the presence of outliers and in the case of high-dimensional data, which calls for feature selection techniques. In this paper we simultaneously tackle cluster validation (i.e., model selection), feature selection, and outlier rejection when clustering positive data. The proposed statistical framework is based on the generalized inverted Dirichlet distribution, which offers a more practical and flexible alternative to the inverted Dirichlet and its very restrictive covariance structure. The parameters of the resulting model are learned by minimizing a message-length objective that incorporates prior knowledge. We use synthetic data and real data from challenging applications, namely the clustering of visual scenes and objects, to demonstrate the feasibility and advantages of the proposed method.
    Knowledge-Based Systems 03/2014; 59. DOI: 10.1016/j.knosys.2014.01.007
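The budgeted-attribute regression setting in the first abstract above can be sketched with a generic importance-weighting trick (this is an illustrative toy, not the paper's algorithm, and every parameter value below is invented): sampling an attribute index j uniformly and rescaling by the dimension d gives unbiased estimates of inner products, so a learner that pays for only two attributes per example can still form an unbiased squared-loss gradient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented toy setup: the learner may inspect only TWO attributes of each
# training example. Uniform sampling with rescaling is unbiased:
# E[d * w[j] * x[j]] = w @ x  and  E[d * x[jp] * e_jp] = x.
d, T, eta = 5, 100000, 0.0005
w_true = rng.normal(size=d)
w = np.zeros(d)
w_sum = np.zeros(d)            # accumulates the second half of the iterates

for t in range(T):
    x = rng.normal(size=d)
    y = w_true @ x
    j = rng.integers(d)        # attribute paid for to estimate the residual
    jp = rng.integers(d)       # independent attribute used as gradient direction
    resid_est = d * w[j] * x[j] - y        # unbiased estimate of w@x - y
    # The gradient estimate 2 * resid_est * d * x[jp] * e_jp is supported on
    # coordinate jp only, so the update touches a single weight per round.
    w[jp] -= eta * 2.0 * resid_est * d * x[jp]
    if t >= T // 2:
        w_sum += w

w_avg = w_sum / (T - T // 2)   # iterate averaging tames the estimator's variance
```

Because the two sampled indices are independent, the product of the two estimates is unbiased for the full gradient; the price is much higher variance, which is why the step size is small and the iterates are averaged.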
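The online filtering problem in the second abstract above can also be sketched with a plain LMS-style gradient filter. This is a baseline illustration only, not the paper's algorithm (which attains the logarithmic regret bound); the sinusoidal signal, filter order, noise level, and step size are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented toy setup: a smooth signal observed through zero-mean noise; an
# order-k linear filter predicts each value from the k previous observations.
T, k, noise_std, eta = 5000, 4, 0.3, 0.05
t_axis = np.arange(T)
signal = np.sin(2 * np.pi * t_axis / 200.0)      # underlying clean signal
obs = signal + noise_std * rng.normal(size=T)    # what the filter actually sees

w = np.zeros(k)
preds = np.zeros(T)
for t in range(k, T):
    window = obs[t - k:t][::-1]    # most recent observation first
    preds[t] = w @ window
    err = preds[t] - obs[t]        # error against the noisy observation
    w -= eta * err * window        # LMS-style stochastic gradient step

# performance is judged, as in the abstract, against the clean signal
mse = np.mean((preds[k:] - signal[k:]) ** 2)
```

Even though the filter only ever sees noisy values, minimizing error against the observations drives the mean-square error against the clean signal well below that of the trivial zero predictor; the paper's contribution is an algorithm whose excess error over the best fixed filter grows only logarithmically in the signal length, in linear time per coefficient.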