Online Learning of Noisy Data

Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Milan, Italy
IEEE Transactions on Information Theory, 57(12):7907-7931, December 2011. DOI: 10.1109/TIT.2011.2164053
Source: IEEE Xplore


We study online learning of linear and kernel-based predictors, when individual examples are corrupted by random noise, and both examples and noise type can be chosen adversarially and change over time. We begin with the setting where some auxiliary information on the noise distribution is provided, and we wish to learn predictors with respect to the squared loss. Depending on the auxiliary information, we show how one can learn linear and kernel-based predictors, using just 1 or 2 noisy copies of each example. We then turn to discuss a general setting where virtually nothing is known about the noise distribution, and one wishes to learn with respect to general losses and using linear and kernel-based predictors. We show how this can be achieved using a random, essentially constant number of noisy copies of each example. Allowing multiple copies cannot be avoided: Indeed, we show that the setting becomes impossible when only one noisy copy of each instance can be accessed. To obtain our results we introduce several novel techniques, some of which might be of independent interest.
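As a concrete illustration of the two-copy case for the squared loss with a linear predictor: the gradient of (⟨w,x⟩ - y)² is 2(⟨w,x⟩ - y)x, and feeding one noisy copy of x into the inner product and an independent noisy copy into the direction vector keeps the estimate unbiased, because the expectation factorizes over the two copies. The Python sketch below only illustrates that estimator inside plain online gradient descent under zero-mean noise assumptions; the function names and the 1/√t step size are our choices, not taken from the paper.

```python
import numpy as np

def two_copy_gradient(w, x1, x2, y_noisy):
    """Unbiased estimate of the squared-loss gradient 2*(<w,x> - y)*x.
    x1 and x2 are two independent noisy copies of the clean instance x
    (E[x1] = E[x2] = x), and y_noisy is a noisy label with E[y_noisy] = y.
    Independence of the two copies makes the product unbiased."""
    return 2.0 * (np.dot(w, x1) - y_noisy) * x2

def online_sgd_two_copies(stream, dim, eta=0.1):
    """Plain online gradient descent run on the noisy stream.
    `stream` yields one (x1, x2, y_noisy) triple per round."""
    w = np.zeros(dim)
    for t, (x1, x2, y_noisy) in enumerate(stream, start=1):
        g = two_copy_gradient(w, x1, x2, y_noisy)
        w -= (eta / np.sqrt(t)) * g  # 1/sqrt(t) step size, a standard choice
    return w
```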


Available from: Nicolò Cesa-Bianchi
  • Source
    • "Further details about the MN model can be found in [22]. Cesa-Bianchi et al. [23] considered a more complicated model in which the features and labels are both added with zero-mean and variance-bounded noise. They used unbiased estimates of the gradient of the surrogate loss function to learn from the noisy sample in an online learning setting. "

  • Source
    • "This is the basic learning mechanism in the first two settings we consider, in Section 4. This technique already appears in [3] (as well as previous work in other settings, e.g., [2]), and our main contribution for these two settings is the observation that it can be done with one or two noisy copies of each example, under appropriate distributional assumptions. The third setting we consider, for kernel-based predictors (Section 5), is where the main technical novelty of this paper lies, as it requires a rather different approach than that of [3]. This approach is discussed below. "
    ABSTRACT: We study the framework of online learning, when individual examples are corrupted by random noise, and both examples and noise type can be chosen adversarially. Previous work has shown that without knowledge of the noise distribution, it is possible to learn using a random, potentially unbounded number of independent noisy copies of each example. Moreover, it is generally impossible to learn with just one noisy copy per example. In this paper, we explore the consequences of being given some side information on the noise distribution. We consider several settings, and show how one can learn linear and kernel-based predictors using just one or two noisy views of each example, depending on the side information provided.
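One natural reading of "side information" in the abstract just quoted is that the learner knows the covariance matrix of the feature noise. In that case a single noisy copy suffices for the squared loss, because the bias of the naive gradient estimate can be computed and subtracted. The sketch below assumes zero-mean feature noise with known covariance Sigma, independent of zero-mean label noise; it is a minimal illustration of this kind of correction, not necessarily the exact estimator of the cited work.

```python
import numpy as np

def one_copy_gradient(w, x_noisy, y_noisy, Sigma):
    """Debiased squared-loss gradient from a single noisy copy.
    Assumes x_noisy = x + n with E[n] = 0 and known covariance Sigma,
    y_noisy = y + m with E[m] = 0, and n independent of m.  Then
    E[(<w, x_noisy> - y_noisy) * x_noisy] = (<w, x> - y) * x + Sigma @ w,
    so subtracting Sigma @ w removes the bias introduced by the noise."""
    return 2.0 * ((np.dot(w, x_noisy) - y_noisy) * x_noisy - Sigma @ w)
```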
  • Source
    ABSTRACT: We consider the most common variants of linear regression, including Ridge, Lasso, and Support-vector regression, in a setting where the learner may observe only a fixed number of attributes of each example at training time. We present simple and efficient algorithms for these problems: for Lasso and Ridge regression, they need the same total number of attributes (up to constants) as full-information algorithms to reach a given accuracy. For Support-vector regression, they require exponentially fewer attributes than the state of the art. This resolves an open problem recently posed by Cesa-Bianchi et al. (2010). Experiments support the theoretical bounds, with performance superior to the state of the art.
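To make the limited-attribute constraint above concrete: even if only k coordinates of each training vector may be inspected, uniform attribute sampling with importance weighting still yields an unbiased squared-loss gradient for a linear predictor. The sketch below spends k-1 queries on estimating the prediction and one independent query on the update direction; this split and the function name are our own illustrative choices and not the algorithms of the cited paper.

```python
import numpy as np

def attribute_budget_gradient(w, x, y, k, rng):
    """Unbiased squared-loss gradient that inspects only k attributes of x (k >= 2).
    k-1 uniformly sampled attributes estimate the prediction <w, x>; one more,
    sampled independently, gives an unbiased estimate of the vector x itself.
    Independence of the two parts keeps the product unbiased in expectation."""
    d = x.shape[0]
    # Estimate <w, x> from k-1 attribute queries (importance weight d).
    idx = rng.integers(0, d, size=k - 1)
    pred = d * np.mean(w[idx] * x[idx])
    # Estimate the vector x from one independent attribute query.
    j = rng.integers(0, d)
    x_hat = np.zeros(d)
    x_hat[j] = d * x[j]
    return 2.0 * (pred - y) * x_hat
```

Here `rng` is a NumPy random generator, e.g. `np.random.default_rng()`.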