
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 12, DECEMBER 2011 7907
Online Learning of Noisy Data
Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, and Ohad Shamir
Abstract: We study online learning of linear and kernel-based predictors, when individual examples are corrupted by random noise, and both examples and noise type can be chosen adversarially and change over time. We begin with the setting where some auxiliary information on the noise distribution is provided, and we wish to learn predictors with respect to the squared loss. Depending on the auxiliary information, we show how one can learn linear and kernel-based predictors, using just 1 or 2 noisy copies of each example. We then turn to discuss a general setting where virtually nothing is known about the noise distribution, and one wishes to learn with respect to general losses and using linear and kernel-based predictors. We show how this can be achieved using a random, essentially constant number of noisy copies of each example. Allowing multiple copies cannot be avoided: indeed, we show that the setting becomes impossible when only one noisy copy of each instance can be accessed. To obtain our results we introduce several novel techniques, some of which might be of independent interest.
I. INTRODUCTION
In many machine learning applications training data are typically collected by measuring certain physical quantities. Examples include bioinformatics, medical tests, robotics, and remote sensing. These measurements have errors that may be due to several reasons: low-cost sensors, communication and power constraints, or intrinsic physical limitations. In all such cases, the learner trains on a distorted version of the actual "target" data, which is where the learner's predictive ability is eventually evaluated. A concrete scenario matching this setting is an automated diagnosis system based on computed-tomography (CT) scans. In order to build a large dataset for training the system, we might use low-dose CT scans: although the images are noisier than those obtained through a standard-radiation CT scan, lower exposure to radiation will persuade more people to get a scan. On the other hand, at test time, a patient suspected of having a serious disease will agree to undergo a standard scan.
Manuscript received September 02, 2010; revised December 29, 2010; accepted July 08, 2011. Date of publication September 08, 2011; date of current version December 07, 2011. The material in this paper was presented at the COLT 2010 conference. This work was supported in part by the Israeli Science Foundation under Grant 59010 and in part by the PASCAL2 Network of Excellence under EC Grant 216886.
N. Cesa-Bianchi is with the Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Milano 20135, Italy (email: nicolo.cesabianchi@unimi.it).
S. Shalev-Shwartz is with the Computer Science and Engineering Department, The Hebrew University, Jerusalem 91904, Israel (email: shais@cs.huji.ac.il).
O. Shamir is with Microsoft Research New England, Cambridge, MA 02142 USA (email: ohadsh@microsoft.com).
Communicated by T. Weissman, Associate Editor for Shannon Theory.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2011.2164053
In this work, we investigate the extent to which a learning algorithm for training linear and kernel-based predictors can achieve a good performance when the features and/or target values of the training data are corrupted by noise. Note that, although in the noise-free case learning with kernels is generally not harder than linear learning, in the noisy case the situation is different due to the potentially complex interaction between the kernel and the noise distribution.
We prove upper and lower bounds on the learner's cumulative loss in the framework of online learning, where examples are generated by an arbitrary and possibly adversarial source. We model the measurement error via a random zero-mean perturbation which affects each example observed by the learner. The noise distribution may also be chosen adversarially, and change for each example.
In the first part of the paper, we discuss the consequences of being given some auxiliary information on the noise distribution. This is relevant in many applications, where the noise can be explicitly modeled, or even intentionally introduced. For example, in order to comply with privacy issues certain datasets can be published only after being "sanitized", which corresponds to perturbing each data item with enough Gaussian noise (see, e.g., [1]). In this work we show how to learn from such sanitized data.
Focusing on the squared loss, we discuss three different settings, reflecting different levels of knowledge about the noise distribution: known variance bound, known covariance structure, and Gaussian noise with known covariance matrix. Our results for these three settings can be summarized as follows:
• Known variance bound: Linear predictors can be learnt with two independent noisy copies of each instance $\mathbf{x}_t$ (that is, two independent realizations of the example corrupted by random noise), and one noisy copy of each target value $y_t$.
• Known covariance structure: Linear predictors can be learnt with only one noisy copy of $\mathbf{x}_t$ and $y_t$.
• Gaussian distribution with known covariance matrix: Kernel-based (and therefore linear) predictors can be learnt using two independent noisy copies of each $\mathbf{x}_t$, and one noisy copy of $y_t$. (Although we focus on Gaussian kernels, we show how this result can be extended, in a certain sense, to general radial kernels.)
Thus, the positive learning results get stronger the more we can assume about the noise distribution. To obtain our results, we use online gradient descent techniques of increasing sophistication. The first two settings are based on constructing unbiased gradient estimates, while the third setting involves a novel technique based on constructing surrogate Hilbert spaces. The surrogate space is built such that gradient descent on the noisy examples in that space corresponds, in an appropriately defined manner, to gradient descent on the noise-free examples in the original space.
In the second part of the paper we consider linear and kernel-based learning with respect to general loss functions (and not just the squared loss as before). Our positive results are quite general: by assuming just a variance bound on the noise we show how it is possible to learn functions in any dot-product (e.g., polynomial) or radial kernel Hilbert space, under any analytic convex loss function. Our techniques, which are readily extendable to other kernel types as well, require querying a random number of independently perturbed copies of each example. We show that this number is bounded by a constant with high probability. This is in sharp contrast with standard averaging techniques, which attempt to directly estimate the noisy instance, as these require a sample whose size depends on the scale of the problem. Moreover, the number of queries is controlled by the user, and can be reduced at the cost of receiving more examples overall.
Finally, we formally show in this setting that learning is impossible when only one perturbed copy of each example can be accessed. This holds even without kernels, and for any reasonable loss function.
A. Related Work
In the machine learning literature, the problem of learning from noisy examples, and, in particular, from noisy training instances, has traditionally received a lot of attention (see, for example, the recent survey [2]). On the other hand, there are comparably few theoretically-principled studies on this topic. Two of them focus on models quite different from the one studied here: random attribute noise in PAC boolean learning [3], [4], and malicious noise [5], [6]. In the first case learning is restricted to classes of boolean functions, and the noise must be independent across each boolean coordinate. In the second case an adversary is allowed to perturb a small fraction of the training examples in an arbitrary way, making learning impossible in a strong information-theoretic sense unless this perturbed fraction is very small (of the order of the desired accuracy for the predictor).
The previous work perhaps closest to the one presented here is [7], where binary classification mistake bounds are proven for the online Winnow algorithm in the presence of attribute errors. Similarly to our setting, the sequence of instances observed by the learner is chosen by an adversary. However, in [7] the noise process is deterministic and also controlled by the adversary, who may change the value of each attribute in an arbitrary way. The final mistake bound, which only applies when the noiseless data sequence is linearly separable without kernels, depends on the sum of all adversarial perturbations.
II. FRAMEWORK AND NOTATION
We consider a setting where the goal is to predict values $y \in \mathbb{R}$ based on instances $\mathbf{x} \in \mathbb{R}^d$. We focus on predictors which are either linear, i.e., of the form $\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle$ for some vector $\mathbf{w}$, or kernel-based, i.e., of the form $\mathbf{x} \mapsto \langle \mathbf{w}, \Psi(\mathbf{x}) \rangle$, where $\Psi$ is a feature mapping into some reproducing kernel Hilbert space (RKHS)$^1$. In the latter case, we assume there exists a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that efficiently implements inner products in that space, i.e., $k(\mathbf{x}, \mathbf{x}') = \langle \Psi(\mathbf{x}), \Psi(\mathbf{x}') \rangle$. Note that in fact, linear predictors are just a special case of kernel-based predictors: we can take $\Psi(\cdot)$ to be the identity mapping and let $k(\mathbf{x}, \mathbf{x}') = \langle \mathbf{x}, \mathbf{x}' \rangle$. Other choices of the kernel allow us to learn nonlinear predictors over $\mathbb{R}^d$, while retaining much of the computational convenience and theoretical guarantees of learning linear predictors (see [8] for more details). In the remainder of this section, our discussion will use the notation of kernel-based predictors, but everything will apply to linear predictors as well.
$^1$ Recall that a Hilbert space is a natural generalization of Euclidean space to possibly infinite dimensions. More formally, it is an inner product space which is complete with respect to the norm induced by the inner product.
The standard online learning protocol is defined as the following repeated game between the learner and an adversary: at each round $t = 1, 2, \ldots$, the learner picks a hypothesis $\mathbf{w}_t$. The adversary then picks an example $(\mathbf{x}_t, y_t)$, composed of a feature vector $\mathbf{x}_t$ and target value $y_t$, and reveals it to the learner. The loss suffered by the learner is $\ell(\langle \mathbf{w}_t, \Psi(\mathbf{x}_t) \rangle, y_t)$, where $\ell$ is a known and fixed loss function. The goal of the learner is to minimize regret with respect to a fixed convex set of hypotheses $\mathcal{W}$, defined as
$$\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \Psi(\mathbf{x}_t) \rangle, y_t) - \min_{\mathbf{w} \in \mathcal{W}} \sum_{t=1}^{T} \ell(\langle \mathbf{w}, \Psi(\mathbf{x}_t) \rangle, y_t).$$
Typically, we wish to find a strategy for the learner, such that no matter what the adversary's strategy of choosing the sequence of examples is, the expression above is sublinear in $T$. In this paper, we will focus for simplicity on a finite-horizon setting, where the number of online rounds $T$ is fixed and known to the learner. All our results can easily be modified to deal with the infinite horizon setting, where the learner needs to achieve sublinear regret for all $T$ simultaneously.
We now make the following modification, which limits the information available to the learner: in each round, the adversary also selects a vector-valued random variable $\mathbf{n}_t^x$ and a random variable $n_t^y$. Instead of receiving $(\mathbf{x}_t, y_t)$, the learner is given access to an oracle $A_t$, which can return independent realizations of $\tilde{\mathbf{x}}_t = \mathbf{x}_t + \mathbf{n}_t^x$ and $\tilde{y}_t = y_t + n_t^y$. In other words, the adversary forces the learner to see only a noisy version of the data, where the noise distribution can be changed by the adversary after each round. We will assume throughout the paper that $\mathbf{n}_t^x$ and $n_t^y$ are zero-mean, independent, and there is some fixed known upper bound on $\mathbb{E}[\|\mathbf{n}_t^x\|^2]$ and $\mathbb{E}[(n_t^y)^2]$ for all $t$. Note that if $\mathbf{n}_t^x$ or $n_t^y$ are not zero-mean, but the mean is known to the learner, we can always deduct those means from $\tilde{\mathbf{x}}_t$ and $\tilde{y}_t$, thus reducing to the zero-mean setting. The assumption that $\mathbf{n}_t^x$ is independent of $n_t^y$ can be relaxed to uncorrelation or even disposed of entirely in some of the discussed settings, at the cost of some added technical complexity in the algorithms and proofs.
The learner may call the oracle $A_t$ more than once. In fact, as we discuss later on, being able to call $A_t$ more than once can be necessary for the learner to have any hope to succeed, when nothing more is known about the noise distribution. On the other hand, if the learner calls $A_t$ an unlimited number of times, $(\mathbf{x}_t, y_t)$ can be reconstructed arbitrarily well by averaging, and we are back to the standard learning setting. In this paper we focus on learning algorithms that call $A_t$ only a small, essentially constant number of times, which depends only on our choice of loss function and kernel (rather than the horizon $T$, the norm of $\mathbf{x}_t$, or the variance of $\mathbf{n}_t^x$, $n_t^y$, which happens with naïve averaging techniques).
In this setting, we wish to minimize the regret in hindsight for any sequence of unperturbed data, and in expectation with respect to the noise introduced by the oracle, namely
$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \Psi(\mathbf{x}_t) \rangle, y_t)\right] - \min_{\mathbf{w} \in \mathcal{W}} \sum_{t=1}^{T} \ell(\langle \mathbf{w}, \Psi(\mathbf{x}_t) \rangle, y_t). \qquad (1)$$
Note that the stochastic quantities in the above expression are just $\mathbf{w}_1, \mathbf{w}_2, \ldots$, where each $\mathbf{w}_t$ is a measurable function of the previous perturbed examples $(\tilde{\mathbf{x}}_s, \tilde{y}_s)$ for $s = 1, \ldots, t-1$. When the noise distribution is bounded or has sub-Gaussian tails, our techniques can also be used to bound the actual regret with high probability, by relying on Azuma's inequality or variants thereof (see for example [9]). However, for simplicity here we focus on the expected regret in (1).
The regret form in (1) is relevant where we actually wish to learn from data, without the noise causing a hindrance. In particular, consider the batch setting, where the examples $\{(\mathbf{x}_t, y_t)\}_{t=1}^{T}$ are actually sampled i.i.d. from some unknown distribution, and we wish to find a predictor which minimizes the expected loss $\mathbb{E}[\ell(\langle \mathbf{w}, \Psi(\mathbf{x}) \rangle, y)]$ with respect to new examples $(\mathbf{x}, y)$. Using standard online-to-batch conversion techniques [9], if we can find an online algorithm with a sublinear bound on (1), then it is possible to construct learning algorithms for the batch setting which are robust to noise. That is, algorithms generating a predictor $\mathbf{w}$ with close to minimal expected loss $\mathbb{E}[\ell(\langle \mathbf{w}, \Psi(\mathbf{x}) \rangle, y)]$ among all $\mathbf{w} \in \mathcal{W}$, despite getting only noisy access to the data. In Appendix A, we briefly discuss alternative regret measures.
In the first part of our paper, we assume that the loss function $\ell(\langle \mathbf{w}, \Psi(\mathbf{x}) \rangle, y)$ is the squared loss $(\langle \mathbf{w}, \Psi(\mathbf{x}) \rangle - y)^2$. In the second part of the paper, we deal with more general loss functions, which are convex in $\mathbf{w}$ and analytic, in the sense that $\ell(a, y)$ for a fixed $y$ can be written as $\sum_{n=0}^{\infty} \gamma_n a^n$, for any $a$ in its domain. This assumption holds for instance for the squared loss $\ell(a, y) = (a - y)^2$, the exponential loss $\ell(a, y) = \exp(-ya)$, and "smoothed" versions of loss functions such as the absolute loss $\ell(a, y) = |a - y|$ and the hinge loss $\ell(a, y) = \max\{1 - ya, 0\}$ (we discuss examples in more detail in Section V-B). This assumption can be relaxed under certain conditions, and this is further discussed in Section III-C.
Turning to the issue of kernels, we note that the general presentation of our approach is somewhat hampered by the fact that it needs to be tailored to the kernel we use. In this paper, we focus on two important families of kernels:
• Dot Product Kernels: the kernel $k(\mathbf{x}, \mathbf{x}')$ can be written as a function of $\langle \mathbf{x}, \mathbf{x}' \rangle$. Examples of such kernels are linear kernels $\langle \mathbf{x}, \mathbf{x}' \rangle$; homogeneous polynomial kernels $(\langle \mathbf{x}, \mathbf{x}' \rangle)^n$; inhomogeneous polynomial kernels $(1 + \langle \mathbf{x}, \mathbf{x}' \rangle)^n$; exponential kernels $e^{\langle \mathbf{x}, \mathbf{x}' \rangle}$; binomial kernels $(1 + \langle \mathbf{x}, \mathbf{x}' \rangle)^{-\alpha}$, and more (see for instance [8], [10]).
• Radial Kernels: $k(\mathbf{x}, \mathbf{x}')$ can be written as a function of $\|\mathbf{x} - \mathbf{x}'\|$. A central and widely used member of this family is the Gaussian kernel, $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / \sigma^2)$ for some $\sigma^2 > 0$.
We emphasize that many of our techniques are extendable to other kernel types as well.
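To make the noisy-oracle protocol of this section concrete, the following is a minimal Python sketch. It is not taken from the paper: the names `make_oracle` and `average_calls` are hypothetical, and the Gaussian noise is only an illustrative choice, since the framework allows any zero-mean noise with bounded variance.

```python
import random

def make_oracle(x, y, noise_std_x, noise_std_y, rng):
    """Return a hypothetical oracle for the example (x, y).

    Each call yields an independent noisy copy (x + n_x, y + n_y).
    Gaussian noise is an assumption made for illustration only.
    """
    def oracle():
        x_noisy = [xi + rng.gauss(0.0, noise_std_x) for xi in x]
        y_noisy = y + rng.gauss(0.0, noise_std_y)
        return x_noisy, y_noisy
    return oracle

def average_calls(oracle, num_calls):
    """Naive averaging: estimate (x, y) from num_calls oracle calls."""
    sum_x, sum_y = oracle()
    sum_x = list(sum_x)
    for _ in range(num_calls - 1):
        x_noisy, y_noisy = oracle()
        sum_x = [a + b for a, b in zip(sum_x, x_noisy)]
        sum_y += y_noisy
    return [s / num_calls for s in sum_x], sum_y / num_calls
```

With naive averaging, the number of calls needed for a fixed accuracy grows with the noise variance, which is the dependence the algorithms discussed in this paper are designed to avoid.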
III. TECHNIQUES
We begin by presenting a high-level and mostly informal overview of the techniques we use to overcome the noise present in the data. The first technique we discuss ("stochastic" online gradient descent) is folklore, and forms a basis for our learning algorithms. The rest of the techniques are designed to overcome the noise in the data, and to the best of our knowledge, are novel to the machine learning community. Hence, they might be of independent interest and applicable to other learning problems with partial information on the examples.
A. "Stochastic" Online Gradient Descent
There exists a well-developed theory, as well as efficient algorithms, for dealing with the standard online learning setting, where the example $(\mathbf{x}_t, y_t)$ is revealed after each round, and for general convex loss functions. One of the simplest and most well known ones is the online gradient descent algorithm due to Zinkevich [11]. This algorithm, and its "stochastic" extension, form a basis for our results, and we briefly survey it below.
At the heart of the standard online gradient descent algorithm is the following observation: for any set of vectors $\nabla_1, \ldots, \nabla_T$ in some Hilbert space, suppose we define $\mathbf{w}_1 = 0$ and $\mathbf{w}_{t+1} = P(\mathbf{w}_t - \eta_t \nabla_t)$, where $P(\cdot)$ is a projection operator on a convex set $\mathcal{W}$, and $\eta_t$ is a suitably chosen step size. Then for any $\mathbf{u} \in \mathcal{W}$, it holds that
$$\sum_{t=1}^{T} \langle \mathbf{w}_t - \mathbf{u}, \nabla_t \rangle = O(\sqrt{T}) \qquad (2)$$
where the $O(\cdot)$ notation hides dependencies on the norm of $\mathbf{u}$ and the norms of $\nabla_t$. In particular, suppose that we let $\nabla_t$ be the gradient of $\ell(\langle \mathbf{w}_t, \mathbf{x}_t \rangle, y_t)$ with respect to $\mathbf{w}_t$ (we focus on linear predictors here for simplicity). Then by convexity, the left-hand side (LHS) of (2) is lower bounded by $\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \mathbf{x}_t \rangle, y_t) - \sum_{t=1}^{T} \ell(\langle \mathbf{u}, \mathbf{x}_t \rangle, y_t)$. Thus, if we are provided with $(\mathbf{x}_t, y_t)$ after each round, we can compute $\nabla_t$, perform the update as above, and get an algorithm with sublinear regret with respect to any predictor $\mathbf{u}$ of bounded norm.
In our setting of noisy data, the algorithm described above is inapplicable, because $(\mathbf{x}_t, y_t)$ is unknown and we cannot compute $\nabla_t$. However, suppose that instead of $\nabla_t$, we pick random vectors $\tilde{\nabla}_t$ with bounded variance, such that $\mathbb{E}[\tilde{\nabla}_t \mid \mathbf{w}_t] = \nabla_t$, and use them to update $\mathbf{w}_t$. It turns out that based on (2), one can still show that
$$\mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}_t - \mathbf{u}, \tilde{\nabla}_t \rangle\right] = O(\sqrt{T}). \qquad (3)$$
In our setting of noisy data, we cannot compute $\nabla_t$, but suppose we can use the noisy data that we do have, in order to construct a random bounded-variance vector $\tilde{\nabla}_t$, such that $\mathbb{E}[\tilde{\nabla}_t \mid \mathbf{w}_t] = \nabla_t$. In that case, the LHS of (3) can be shown to equal $\mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}_t - \mathbf{u}, \nabla_t \rangle\right]$. The expectation here is again with respect to the noisy examples (recall that $\mathbf{w}_t$ is a random vector that depends on the noisy examples). Applying the same convexity argument as before, we get an $O(\sqrt{T})$ upper bound on the expected regret $\mathbb{E}\left[\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \mathbf{x}_t \rangle, y_t) - \sum_{t=1}^{T} \ell(\langle \mathbf{u}, \mathbf{x}_t \rangle, y_t)\right]$. Thus, by doing updates using $\tilde{\nabla}_t$, we get an algorithm with a bound on the regret which scales sublinearly with $T$.
The idea that one can work with random unbiased estimates of $\nabla_t$ is not new, and has been used in previous work, such as online bandit learning (see for instance [12]–[14]). Here, we use this property in a new way, in order to devise algorithms which are robust to noise.
For linear kernels and losses such as the squared loss, constructing such unbiased estimates based on 1 or 2 noisy copies of each example is not too hard. However, when we discuss nonlinear kernels, constructing an unbiased estimate becomes much more tricky: rather than a finite-dimensional vector, $\nabla_t$ might exist in a high- or infinite-dimensional Hilbert space. Even worse, due to the nonlinearity of virtually all feature mappings, the unbiased perturbation $\tilde{\mathbf{x}}_t$ of each instance $\mathbf{x}_t$ is mapped to a biased and complicated perturbation $\Psi(\tilde{\mathbf{x}}_t)$ of $\Psi(\mathbf{x}_t)$. This leads us to the next technique.
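As an illustration of the "stochastic" update described above, here is a minimal Python sketch for linear predictors and the squared loss. It is a sketch under stated assumptions rather than the paper's algorithm: the function names, the constant step size, and the choice of a Euclidean ball as the convex set are all hypothetical. The gradient estimate multiplies a residual computed from one noisy copy by a second, independent noisy copy, so that its conditional expectation equals the noise-free gradient $2(\langle \mathbf{w}, \mathbf{x} \rangle - y)\mathbf{x}$ when the noise is zero-mean and independent.

```python
import math
import random

def project_to_ball(w, radius):
    """Project w onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= radius:
        return w
    return [wi * radius / norm for wi in w]

def unbiased_squared_loss_gradient(w, x_copy_a, x_copy_b, y_noisy):
    """Estimate the gradient of (<w, x> - y)^2 from noisy data.

    x_copy_a and x_copy_b must be independent noisy copies of x, so the
    expectation factorizes into 2 * (<w, x> - y) * x.
    """
    residual = sum(wi * xi for wi, xi in zip(w, x_copy_a)) - y_noisy
    return [2.0 * residual * xi for xi in x_copy_b]

def noisy_online_gradient_descent(oracles, dim, radius, step_size):
    """Run projected gradient descent; oracles[t]() returns (x_noisy, y_noisy)."""
    w = [0.0] * dim
    predictors = []
    for oracle in oracles:
        predictors.append(list(w))   # hypothesis played in this round
        x_a, y_noisy = oracle()      # first noisy copy of x, one noisy y
        x_b, _ = oracle()            # second noisy copy of x (its y is unused)
        grad = unbiased_squared_loss_gradient(w, x_a, x_b, y_noisy)
        w = project_to_ball([wi - step_size * gi for wi, gi in zip(w, grad)], radius)
    return predictors
```

A step size of order $1/\sqrt{T}$ is the usual choice for this kind of update; the constant is left to the caller here.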
B. "Parallel Worlds" Online Gradient Descent
The technique described here is the central one we use to learn with kernel-based predictors and squared loss, in the case where the noise distribution is fixed and known to be a Gaussian. In the next subsections, we will describe our techniques for dealing with unknown noise distribution and more general loss functions, at the cost of more noisy copies per example.
Unlike the "stochastic" online gradient descent approach discussed in the previous subsection, the approach we discuss here does not rely directly on constructing unbiased estimates of $\nabla_t$. In a nutshell, we construct a surrogate RKHS, with a surrogate feature mapping $\hat{\Psi}$, such that for any noisy copy $\tilde{\mathbf{x}}_t$ of $\mathbf{x}_t$, and any fixed instance $\mathbf{a}$, it holds that
$$\mathbb{E}\left[\langle \hat{\Psi}(\tilde{\mathbf{x}}_t), \hat{\Psi}(\mathbf{a}) \rangle\right] = \langle \Psi(\mathbf{x}_t), \Psi(\mathbf{a}) \rangle \qquad (4)$$
where the expectation is with respect to the noise. Thus, "noisy" inner products in the surrogate RKHS correspond (in expectation) to "noise-free" inner products in the original RKHS. This allows us to use the noisy data in order to construct vectors $\tilde{\nabla}_t$ in the surrogate RKHS with the following interesting property: if we apply online gradient descent on $\tilde{\nabla}_1, \ldots, \tilde{\nabla}_T$ (using kernels), to get predictors $\tilde{\mathbf{w}}_1, \ldots, \tilde{\mathbf{w}}_T$ in the RKHS of $\hat{\Psi}$, then for any $\tilde{\mathbf{u}}$,
$$\mathbb{E}\left[\sum_{t=1}^{T} \langle \tilde{\mathbf{w}}_t - \tilde{\mathbf{u}}, \tilde{\nabla}_t \rangle\right] = \mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}_t - \mathbf{u}, \nabla_t \rangle\right]$$
where $\mathbf{w}_t$ and $\mathbf{u}$ are the images of $\tilde{\mathbf{w}}_t$ and $\tilde{\mathbf{u}}$ according to a certain mapping to the RKHS of $\Psi$, and $\nabla_t$ are the gradients with respect to the unperturbed examples $(\mathbf{x}_t, y_t)$. Since we applied online gradient descent in the surrogate RKHS, the LHS is $O(\sqrt{T})$ by (3). Thus, we get that $\mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}_t - \mathbf{u}, \nabla_t \rangle\right]$ is $O(\sqrt{T})$, which implies a sublinear regret bound for $\mathbf{w}_1, \ldots, \mathbf{w}_T$. We emphasize that unlike the previous approaches, the expectation of $\tilde{\nabla}_t$ is not equal to $\nabla_t$. Indeed, they live in different mathematical spaces!
A technical issue which needs addressing is that the norm of $\tilde{\mathbf{u}}$ has to be related to the norm of the actual predictor $\mathbf{u}$ we compare ourselves with. While this cannot be always done, such a relation does hold if $\mathbf{u}$ is reasonably "nice", in a sense which will be formalized later on.
Constructing a surrogate RKHS as in (4) can be done when the original RKHS corresponds to a Gaussian kernel. Nevertheless, we can extend our results, in a certain sense, to more general radial kernels. The basic tool we use is Schoenberg's theorem, which implies that any radial kernel can be written as an integral of Gaussian kernels of different widths. Using this result, we can show that one can still construct a surrogate RKHS, which has the property of (4) with respect to an approximate version of our original radial kernel.
C. Unbiased Estimators for Nonlinear Functions
We now turn to discuss our techniques for dealing with the most general setting: learning kernel-based predictors, with general loss functions, and with only a variance bound known on the noise distribution. At the heart of these techniques lies an apparently little-known method from sequential estimation theory to construct unbiased estimates of nonlinear and possibly complex functions.
Suppose that we are given access to independent copies of a real random variable $X$, with expectation $\mathbb{E}[X]$, and some real function $f$, and we wish to construct an unbiased estimate of $f(\mathbb{E}[X])$. If $f$ is a linear function, then this is easy: just sample $x$ from $X$, and return $f(x)$. By linearity, $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$ and we are done. The problem becomes less trivial when $f$ is a general, nonlinear function, since usually $\mathbb{E}[f(X)] \neq f(\mathbb{E}[X])$. In fact, when $X$ takes finitely many values and $f$ is not a polynomial function, one can prove that no unbiased estimator can exist (see [15], Proposition 8 and its proof). Nevertheless, we show how in many cases one can construct an unbiased estimator of $f(\mathbb{E}[X])$, including cases covered by the impossibility result. There is no contradiction, because we do not construct a "standard" estimator. Usually, an estimator is a function from a given sample to the range of the parameter we wish to estimate. An implicit assumption is that the size of the sample given to it is fixed, and this is also a crucial ingredient in the impossibility result. We circumvent this by constructing an estimator based on a random number of samples.
Here is the key idea: suppose $f : \mathbb{R} \to \mathbb{R}$ is any function continuous on a bounded interval. It is well known that one can construct a sequence of polynomials $(Q_n(\cdot))_{n=1}^{\infty}$, where $Q_n(\cdot)$ is a polynomial of degree $n$, which converges uniformly to $f$ on the interval. If $Q_n(x) = \sum_{i=0}^{n} \gamma_{n,i} x^i$, let $Q'_n(x_1, \ldots, x_n) = \sum_{i=0}^{n} \gamma_{n,i} \prod_{j=1}^{i} x_j$. Now, consider the estimator which draws a positive integer $N$ according to some distribution $\Pr(N = n) = p_n$, samples $X$ for $N$ times to get $x_1, x_2, \ldots, x_N$, and returns $\frac{1}{p_N}\left(Q'_N(x_1, \ldots, x_N) - Q'_{N-1}(x_1, \ldots, x_{N-1})\right)$, where we assume $Q'_0 = 0$. The expected value of this estimator is equal to
$$\mathbb{E}_{N, x_1, \ldots, x_N}\left[\frac{1}{p_N}\left(Q'_N(x_1, \ldots, x_N) - Q'_{N-1}(x_1, \ldots, x_{N-1})\right)\right] = \sum_{n=1}^{\infty} \frac{p_n}{p_n}\left(Q_n(\mathbb{E}[X]) - Q_{n-1}(\mathbb{E}[X])\right) = f(\mathbb{E}[X]).$$
Thus, we have an unbiased estimator of $f(\mathbb{E}[X])$.
This technique was introduced in a rather obscure early 1960's paper [16] from sequential estimation theory, and appears to be little known. However, we believe this technique is interesting, and expect it to have useful applications for other problems as well.
While this may seem at first like a very general result, the variance of this estimator must be bounded for it to be useful. Unfortunately, this is not true for general continuous functions. More precisely, let $N$ be distributed according to $p_n$, and let $\theta$ be the value returned by the estimator of $f(\mathbb{E}[X])$. In [17], it is shown that if $X$ is a Bernoulli random variable, and if $\mathbb{E}[\theta N^k] < \infty$ for some integer $k \geq 1$, then $f$ must be $k$ times continuously differentiable. Since $\mathbb{E}[\theta N^k] \leq \left(\mathbb{E}[\theta^2] + \mathbb{E}[N^{2k}]\right)/2$, this means that functions $f$ which yield an estimator with finite variance, while using a number of queries with bounded variance, must be continuously differentiable. Moreover, in case we desire the number of queries to be essentially constant (e.g., choose a distribution for $N$ with exponentially decaying tails), we must have $\mathbb{E}[N^k] < \infty$ for all $k$, which implies that $f$ should be infinitely differentiable (in fact, in [17] it is conjectured that $f$ must be analytic in such cases).
Thus, we focus in this paper on functions $f$ which are analytic, i.e., they can be written as $f(x) = \sum_{i=0}^{\infty} \gamma_i x^i$ for appropriate constants $\gamma_0, \gamma_1, \ldots$. In that case, $Q_n$ can simply be the truncated Taylor expansion of $f$ to order $n$, i.e., $Q_n(x) = \sum_{i=0}^{n} \gamma_i x^i$. Moreover, we can pick $p_n \propto 1/p^n$ for any $p > 1$. So the estimator works as follows: we sample a nonnegative integer $N$ according to $\Pr(N = n) = (p-1)/p^{n+1}$, sample $X$ independently $N$ times to get $x_1, x_2, \ldots, x_N$, and return $\theta = \gamma_N \frac{p^{N+1}}{p-1} x_1 x_2 \cdots x_N$, where we set $\theta = \frac{p}{p-1} \gamma_0$ if $N = 0$.$^2$ We have the following:
$^2$ Admittedly, the event $N = 0$ should receive zero probability, as it amounts to "skipping" the sampling altogether. However, setting $\Pr(N = 0) = 0$ appears to improve the bound in this paper only in the smaller order terms, while making the analysis in the paper more complicated.
Lemma 1: For the above estimator, it holds that $\mathbb{E}[\theta] = f(\mathbb{E}[X])$. The expected number of samples used by the estimator is $1/(p-1)$, and the probability of it being at least $z$ is $p^{-z}$. Moreover, if we assume that $f_+(x) = \sum_{n=0}^{\infty} |\gamma_n| x^n$ exists for any $x$ in the domain of interest, then
$$\mathbb{E}[\theta^2] \leq \frac{p}{p-1} f_+^2\left(\sqrt{p\,\mathbb{E}[X^2]}\right).$$
Proof: The fact that $\mathbb{E}[\theta] = f(\mathbb{E}[X])$ follows from the discussion above. The results about the number of samples follow directly from properties of the geometric distribution. As for the second moment, $\mathbb{E}[\theta^2]$ equals
$$\sum_{n=0}^{\infty} \frac{p-1}{p^{n+1}} \gamma_n^2 \frac{p^{2(n+1)}}{(p-1)^2} \left(\mathbb{E}[X^2]\right)^n = \frac{p}{p-1} \sum_{n=0}^{\infty} \left(|\gamma_n| \left(\sqrt{p\,\mathbb{E}[X^2]}\right)^n\right)^2 \leq \frac{p}{p-1} \left(\sum_{n=0}^{\infty} |\gamma_n| \left(\sqrt{p\,\mathbb{E}[X^2]}\right)^n\right)^2 = \frac{p}{p-1} f_+^2\left(\sqrt{p\,\mathbb{E}[X^2]}\right).$$
The parameter $p$ provides a tradeoff between the variance of the estimator and the number of samples needed: the larger $p$ is, the fewer samples we need, but the estimator has more variance. In any case, the sample size distribution decays exponentially fast.
It should be emphasized that the estimator associated with Lemma 1 is tailored for generality, and is suboptimal in some cases. For example, if $f$ is a polynomial function, then $\gamma_n = 0$ for sufficiently large $n$, and there is no reason to sample $N$ from a distribution supported on all nonnegative integers: it just increases the variance. Nevertheless, in order to keep the presentation uniform and general, we always use this type of estimator. If needed, the estimator can be optimized for specific cases.
We also note that this technique can be improved in various directions, if more is known about the distribution of $X$. For instance, if we have some estimate of the expectation and variance of $X$, then we can perform a Taylor expansion around the estimated $\mathbb{E}[X]$ rather than 0, and tune the probability distribution of $N$ to be different than the one we used above. These modifications can allow us to make the variance of the estimator arbitrarily small, if the variance of $X$ is small enough. Moreover, one can take polynomial approximations to $f$ which are perhaps better than truncated Taylor expansions. In this paper, for simplicity, we ignore these potential improvements.
Finally, we note that a related result in [17] implies that it is impossible to estimate $f(\mathbb{E}[X])$ in an unbiased manner when $f$ is discontinuous, even if we allow a number of queries and estimator values which are infinite in expectation. Since the derivatives of some well-known loss functions (such as the hinge loss) are discontinuous, estimating their gradient in an unbiased manner under arbitrary noise appears to be impossible. While our techniques allow us to work with "smoothed" approximate versions of such losses, the regret guarantees degrade with the quality of approximation, and this prevents us from saying anything nontrivial about learning with respect to the original losses. Thus, if online learning with noise and such loss functions is at all feasible, a rather different approach than ours needs to be taken.
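The random-sample-size estimator for analytic functions described in Section III-C can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `sample_num_queries` and `unbiased_estimate` are hypothetical, `sample_x` stands for any routine returning an independent copy of the random variable, and `gamma(n)` is assumed to return the $n$-th Taylor coefficient of the target function.

```python
import math
import random

def sample_num_queries(p, rng):
    """Draw N with Pr(N = n) = (p - 1) / p**(n + 1) for n = 0, 1, 2, ..."""
    n = 0
    while rng.random() < 1.0 / p:   # continue with probability 1/p
        n += 1
    return n

def unbiased_estimate(sample_x, gamma, p, rng):
    """One draw of an unbiased estimate of f(E[X]) = sum_n gamma(n) * E[X]**n.

    Uses N independent copies of X, where N is geometrically distributed,
    and reweights the degree-N Taylor term by 1 / Pr(N).
    """
    n = sample_num_queries(p, rng)
    product = 1.0
    for _ in range(n):
        product *= sample_x()       # product of N independent copies
    return gamma(n) * p ** (n + 1) / (p - 1.0) * product
```

For example, with `gamma = lambda n: 1.0 / math.factorial(n)` the average of many draws approaches $\exp(\mathbb{E}[X])$, whereas averaging $\exp(x)$ over samples would converge to $\mathbb{E}[\exp(X)]$ instead. A larger `p` reduces the expected number of samples per draw at the price of a larger variance.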
D. Unbiasing Noise in the RKHS
The second component in our approach to deal with unknown
noise in the kernel setting involves the unbiased estimation of
$\Psi(x_t)$, when we only have unbiased noisy copies of $x_t$. Here
again, we have a nontrivial problem, because the feature mapping
$\Psi$ is usually highly nonlinear, so
$\mathbb{E}[\Psi(\tilde{x}_t)] \neq \Psi(x_t)$ in general. Moreover,
$\Psi$ is not a scalar function, so the technique
of Section III-C will not work as-is.
To tackle this problem, we construct an explicit feature mapping,
which needs to be tailored to the kernel we want to use.
To give a very simple example, suppose we use the homogeneous
second-degree polynomial kernel $k(x,x') = (\langle x,x'\rangle)^2$.
It is not hard to verify that the function
$\Psi : \mathbb{R}^d \to \mathbb{R}^{d^2}$, defined via
$\Psi(x) = (x_1x_1, x_1x_2, \ldots, x_dx_d)$, is an explicit feature
mapping for this kernel. Now, if we query two independent
noisy copies $\tilde{x}, \tilde{x}'$ of $x$, we have that the
expectation of the random vector
$(\tilde{x}_1\tilde{x}'_1, \tilde{x}_1\tilde{x}'_2, \ldots, \tilde{x}_d\tilde{x}'_d)$
is nothing more than $\Psi(x)$. Thus, we can construct unbiased
estimates of $\Psi(x)$ in the
RKHS. Of course, this example pertains to a very simple RKHS
with a finite-dimensional representation. By a randomization
technique somewhat similar to the one in Section III-C, we can
adapt this approach to infinite-dimensional RKHS as well. In a
nutshell, we represent $\Psi(x)$ as an infinite-dimensional vector,
and its noisy unbiased estimate is a vector which is nonzero on
only finitely many entries, using finitely many noisy queries.
Moreover, inner products between these estimates can be computed
efficiently, allowing us to implement the learning algorithms,
and use the resulting predictors on test instances.
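The second-degree polynomial example can be checked numerically. The following is a minimal sketch, not the paper's implementation: the noise model (independent uniform noise per coordinate) and the names `feature_map`, `noisy_copy`, `estimate_feature_map`, and `average_estimate` are assumptions made purely for illustration.

```python
import random

def feature_map(x):
    """Explicit feature map for k(x, x') = <x, x'>^2: all pairwise products."""
    return [xi * xj for xi in x for xj in x]

def noisy_copy(x, rng, scale=0.5):
    """Hypothetical oracle: adds independent zero-mean uniform noise per coordinate."""
    return [xi + rng.uniform(-scale, scale) for xi in x]

def estimate_feature_map(x, rng):
    """Estimate of feature_map(x) built from two independent noisy copies."""
    a = noisy_copy(x, rng)
    b = noisy_copy(x, rng)
    # a_i and b_j carry independent zero-mean noise, so E[a_i * b_j] = x_i * x_j.
    return [ai * bj for ai in a for bj in b]

def average_estimate(x, n_samples=20000, seed=0):
    """Average many estimates; the result should approach feature_map(x)."""
    rng = random.Random(seed)
    total = [0.0] * (len(x) ** 2)
    for _ in range(n_samples):
        est = estimate_feature_map(x, rng)
        total = [t + e for t, e in zip(total, est)]
    return [t / n_samples for t in total]
```

Using a single noisy copy for both factors would instead bias the diagonal entries by the noise variance, which is why two independent copies are queried.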
IV. AUXILIARY INFORMATION ON THE NOISE DISTRIBUTION
In the first part of the paper, we focus on the squared loss,
and discuss the implication of being provided different levels of
auxiliary information on the noise distribution in each round.
The first setting assumes just a known upper bound on the
variance of the noise. For the specific case of linear predictors,
we show one can learn using two noisy copies of each $x_t$ and
one noisy copy of each $y_t$.
The second setting assumes that the covariance structure of
the noise is known. In that case, we show that one can learn
linear predictors with only one noisy copy of both $x_t$ and $y_t$.
The third and most complex setting we consider is when the
noise has a fixed Gaussian distribution with known covariance
matrix. We show that one can even learn kernel-based predictors,
using two independent noisy copies of each $x_t$, and one
noisy copy of $y_t$. We focus on Gaussian kernels, but also show
how the result can be extended, in a certain sense, to general
radial kernels.
Throughout the rest of the paper, we let $\mathbb{E}_t[\cdot]$ be a
shorthand for expectation over $(\tilde{x}_t, \tilde{x}'_t, \tilde{y}_t)$
conditioned on $(\tilde{x}_1, \tilde{x}'_1, \tilde{y}_1), \ldots, (\tilde{x}_{t-1}, \tilde{x}'_{t-1}, \tilde{y}_{t-1})$.
A. Setting 1: Upper Bound on the Variance
We begin with the simplest setting, which is when we only
know that $\mathbb{E}_t[\|\tilde{x}_t\|^2] \le B_{\tilde{x}}^2$ and
$\mathbb{E}_t[\tilde{y}_t^2] \le B_{\tilde{y}}^2$ for some known
constants $B_{\tilde{x}}, B_{\tilde{y}}$. Conditional expectation is used here because
we are assuming the adversary can change the noise distribution
after each round, depending on the realizations of the past noisy
examples. We present an algorithm for learning linear predictors,
using exactly two independent noisy copies of the instance $x_t$
and one noisy copy of the target value $y_t$. As discussed in
Section III, the algorithm is based on an adaptation of online
gradient descent, and the main requirement is to construct an
unbiased estimate of the gradient $\nabla_t$. This follows from the
following lemma.
Lemma 2: Let
be the gradient of
at . Letbe an additional independent
copy of
above assumptions, if
, and denote . Under the
, thenand
, where.
Proof: Because of the independence assumption, we have
For the second claim, we have by the independence assumption
that
The following theorem provides a bound on the regret for
Algorithm 1. The proof is provided in Section VIII-A.
Algorithm 1 Learning with Upper Bound on Noise Variance
PARAMETERS: ,.
INITIALIZE:.
For
Receive
Receive another independent copy
Theorem 1: Let
all
assume that
,are mutually independent. If we run Algorithm 1 with
parameters
,
2), then
be the squared loss. For
, and that,,
(whereis defined in Lemma
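The two-copy idea behind Algorithm 1 can be sketched in code. This is an illustrative sketch only, not a transcription of the algorithm: it assumes the gradient estimate pairs the residual computed on one noisy copy with the direction given by the other independent copy, a fixed step size `eta`, Gaussian noise for the simulated oracle, and projection onto a ball of radius `radius`. All function and parameter names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_to_ball(w, radius):
    """Euclidean projection onto the origin-centered ball of the given radius."""
    norm = math.sqrt(dot(w, w))
    if norm <= radius:
        return w
    return [radius / norm * wi for wi in w]

def gradient_estimate(w, x_noisy, x_noisy_indep, y_noisy):
    """2(<w, x~> - y~) x~': residual from one copy, direction from the other."""
    residual = dot(w, x_noisy) - y_noisy
    return [2.0 * residual * xi for xi in x_noisy_indep]

def noisy_ogd(examples, eta, radius, noise_std, seed=0):
    """Projected online gradient descent driven only by noisy copies."""
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    iterates = []
    for x, y in examples:
        iterates.append(w)
        # Simulated oracle (assumption): independent Gaussian perturbations.
        x1 = [xi + rng.gauss(0.0, noise_std) for xi in x]
        x2 = [xi + rng.gauss(0.0, noise_std) for xi in x]
        y1 = y + rng.gauss(0.0, noise_std)
        g = gradient_estimate(w, x1, x2, y1)
        w = project_to_ball([wi - eta * gi for wi, gi in zip(w, g)], radius)
    return iterates
```

Because the noise in the two copies is independent and zero-mean, the expectation of the product factorizes, which is what makes the estimate unbiased for the squared-loss gradient.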
B. Setting 2: Known Covariance
We now turn to the case where rather than an upper bound
on the variance, we actually know the covariance matrix of the
noise at each round, which we denote as $\Sigma_t$. We assume that
CESA-BIANCHI et al.: ONLINE LEARNING OF NOISY DATA 7913
$\|\Sigma_t\| \le B_\Sigma$ for all $t$, where $\|\cdot\|$ denotes the
spectral norm. As to $\tilde{y}_t$, we can still assume we only have
an upper bound $B_{\tilde{y}}^2$ on $\mathbb{E}_t[\tilde{y}_t^2]$ (with our
algorithmic approach, knowing $\mathbb{E}_t[\tilde{y}_t^2]$ does not
help much).
In this setting, we show it is possible to learn linear predictors,
using just a single noisy copy $(\tilde{x}_t, \tilde{y}_t)$. This is
opposed to the previous subsection, where we needed an additional
independent copy of $\tilde{x}_t$. The idea is that if we use just one
noisy copy in our gradient estimate, we need to deal with bias terms. When
the covariance structure is known, we can calculate and remove
these bias terms, allowing an online gradient descent similar to
Algorithm 1 to work. As in Algorithm 1, the basic building block
is a construction of an unbiased estimate of the gradient $\nabla_t$ at
each iteration. See Algorithm 2 for the pseudocode.
Algorithm 2 Learning with Known Noise Covariance
PARAMETERS: ,.
INITIALIZE:
.
For
Receive
Lemma 3: Letbe the gradient of
at
is the covariance matrix of
. Denote
, where . Then under
the assumptions above, if
,, and
, where,thenand
.
Proof: Using the zeromean and independence assump
tions on
,, we have
which implies that
the wellknown inequality
have
. As to the second claim, using
, we
Theorem 2: Let
all
assume that
such that the known covariance matrix
satisfies
, and
with parameters
Lemma 3, then
be the squared loss. For
are perturbed by independent noise
of the noise added to
. Assume further that
. If we run Algorithm 2
and , where
and
,
is defined in
The proof is similar to the proof of Theorem 1, with Lemma 3
replacing Lemma 2. We note that if $G$ is known (which requires
knowing a bound on the fourth moment of $\tilde{x}_t$), then by picking
$\eta = B_w/\sqrt{GT}$ one can improve the bound to $B_w\sqrt{GT}$.
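To make the bias term concrete, the following worked calculation shows what a single-copy gradient estimate gives in expectation. It is a sketch under stated assumptions: the noise on $x_t$ is zero-mean with covariance $\Sigma_t$ and is independent of the zero-mean noise on $y_t$; the symbols $w_t$, $\tilde{x}_t$, $\tilde{y}_t$ follow the notation of the linear setting.

```latex
\begin{align*}
\mathbb{E}_t\!\left[2\,(\langle w_t,\tilde{x}_t\rangle-\tilde{y}_t)\,\tilde{x}_t\right]
 &= 2\,\mathbb{E}_t\!\left[\tilde{x}_t\tilde{x}_t^{\top}\right] w_t
    - 2\,\mathbb{E}_t[\tilde{y}_t]\,\mathbb{E}_t[\tilde{x}_t] \\
 &= 2\,\left(x_t x_t^{\top}+\Sigma_t\right) w_t - 2\,y_t\,x_t \\
 &= 2\,(\langle w_t,x_t\rangle-y_t)\,x_t + 2\,\Sigma_t w_t .
\end{align*}
```

Under these assumptions the bias is exactly $2\Sigma_t w_t$, so subtracting it from the single-copy estimate yields an unbiased estimate of the squared-loss gradient, which is the kind of correction a known covariance makes possible.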
C. Setting 3: Gaussian Distribution
The third and most complex setting we consider in this section
is when the noise is assumed to have a Gaussian distribution
$\mathcal{N}(0,\Sigma)$. Clearly, if we know the distribution, then we can
derive upper bounds on the moments of $\tilde{x}_t$ (assuming bounds
are known on the original instances $x_t$). Thus, the results of
Section IV-B carry through to our setting, and we can learn
linear predictors. However, when we also know the noise has
a specific Gaussian distribution, we can learn the much more
powerful hypothesis class of kernel-based predictors.
Recall that the basic premise of kernel-based learning is that
the data (originally in $\mathbb{R}^d$) is mapped to some reproducing kernel
Hilbert space (RKHS), via a feature mapping $\Psi(x)$, and a linear
predictor is learned in that space. In our original space, this
corresponds to learning a nonlinear function. Using the well-known
kernel trick, inner products $\langle\Psi(x),\Psi(x')\rangle$ in the RKHS (which
might be infinite-dimensional) can be easily computed via a
kernel function $k(x,x')$.
While there are many possible kernel functions, perhaps the
most popular one is the Gaussian kernel, defined as
$k(x,x') = \exp(-\|x-x'\|^2/\sigma^2)$ for some $\sigma^2 > 0$ (the kernel width). This
corresponds to the inner product $\langle\Psi(x),\Psi(x')\rangle$ in an appropriate
RKHS. We will show below how to learn from noisy
data with Gaussian kernels. In Section IV-D, we show how this
can be extended, in a certain sense, to general radial kernels,
i.e., kernels of the form $k(x,x') = f(\|x-x'\|)$ for an appropriate
real function $f$.
In this subsection, we assume that the noise distribution is
fixed for all $t$. Hence, we may assume w.l.o.g. that $\Sigma$ is a
diagonal matrix, with element $s_i$ at row/column $i$. To see why,
notice that there always exists a rotation matrix $R$, such that
$R\tilde{x}_t$ has a Gaussian distribution with diagonal covariance
matrix. Therefore, instead of learning with respect to $\{(\tilde{x}_t,\tilde{y}_t)\}$,
we can just learn with respect to $\{(R\tilde{x}_t,\tilde{y}_t)\}$, and predict
on any instance $x$ by prerotating it using $R$. Since we focus
here on rotationally invariant kernels, which depend just on the
Euclidean distance between instances, we have that
$k(x,x') = k(Rx,Rx')$ for any $x$, $x'$. Therefore, the data structure remains
the same in the kernel space, and all our guarantees will still
hold. As to $\tilde{y}_t$, similar to the previous settings, we will only need
to assume that $\mathbb{E}_t[\tilde{y}_t^2] \le B_{\tilde{y}}^2$ for some known parameter $B_{\tilde{y}}$.
The algorithm that we present (Algorithm 3) is based on being
able to receive two independent copies of each instance $x_t$, as
well as a single independent copy of $y_t$. As in the linear case, the
learning algorithm that we use relies upon the online gradient
descent technique due to [11], with the main difference being
that instead of using a Gaussian kernel of width $\sigma^2$, we use a
surrogate kernel, as discussed in Section III.
Algorithm 3 Kernel Learning Algorithm with Gaussian Noise
PARAMETERS:,
INITIALIZE:for all
For:
Define
Define
Receive,, and independent copy
Let
//is gradient length with respect toat
Let
Let
If // If, then project
Let
for all
In order to define the surrogate kernel that we use, consider
the RKHS corresponding to the kernel
(5)
where we assume that
is less thanand
This can be shown to be a kernel by standard results (see for in
stance[8]).Notethat
canbeboundedbyaconstantwhen
for all (constant noise) and
when the feature values of observed instances
. Let be the feature mapping corresponding to this
RKHS.
The pseudocode of our algorithm is presented in Algorithm 3.
Formally speaking, it is just applying online gradient descent, using
kernels, in the surrogate RKHS that we constructed. However, it
is crucial to note that the actual outputs are elements $w_1, w_2, \ldots$
in the RKHS corresponding to $\Psi$.
BeforestatingtheboundforAlgorithm3weneedanauxiliary
definition. Suppose that
is any element in the RKHS of
—plausible
are of order
.
,
which can be written as
. For example, this includes
for some
for any
the angle between
In other words, this is the angle between the component due to
positive support vectors, and the component due to the negative
support vectors. If one of the components is zero, define
to be . The main theorem of this section, whose proof is
presented in Section VIIIB, is the following.
by the representer theorem. Define to be
and.
Theorem 3: Let
all assume that
distribution
by arbitrary independent noise with
andbe fixed.If werun Algorithm3withthekernel
(5) such that
, and input parameters
be the squared loss. For
is perturbed by Gaussian noise with known
, whereis diagonal, andis perturbed
. Let
and
then
where
feature mapping induced by the Gaussian kernel with width
In particular, if
,
the above bound is
and is the
.
, and, then
.
The intuition for this angle is that it measures how well separated
the training examples are: if the “positive” and “negative”
example groups are not too close together, then the angle between
the component due to the positive support vectors and the
component due to the negative support vectors will be large, and the
bound will be small. Note that in the RKHS corresponding to
a Gaussian kernel, the angle is always between $0$ and $\pi/2$, since the
inner product between any two elements $\Psi(x)$ and $\Psi(x')$ is
positive. In addition, the angle can be shown to be exactly zero if and only
if the positive and negative examples exactly coincide. Overall,
on realistic datasets, assuming there exists some good predictor
for which this angle is not too small is a pretty mild assumption, if
something interesting can be learned even on the unperturbed data.
D. Extension to General Radial Kernels
The Gaussian kernel we discussed previously is a member of
the family of radial kernels, that is kernels on ,
be written as a function of
kernel is the most popular member of this family, there are
many other radial kernels, such as
which can
. Although the Gaussian
and
for appropriate parameters. Thus, a
reasonable question is whether Algorithm 3 and its analysis can
be extended to general radial kernels. The extension we are able
to show is in the following sense: for any radial kernel $k$,
there exists another radial kernel $k_c$, which approximates $k$
arbitrarily well, for which one can extend Algorithm
3 and its analysis. Although the approximation parameter $c$ is
user-defined, the bound on the regret depends on this parameter
and deteriorates as the approximation gets better.
Recall from Section III-B that the heart of our approach is
constructing a surrogate RKHS, with surrogate kernel $\tilde{k}$, such
that $\mathbb{E}[\tilde{k}(\tilde{x},\tilde{x}')] = k(x,x')$. In the Gaussian kernel case, the
required surrogate RKHS corresponds to the kernel defined in (5).
To deal with other kernels, constructing an appropriate surrogate
kernel becomes trickier. Luckily, we can still reduce the
problem, in some sense, to the case of Gaussian kernels. The
key technical result is the following theorem due to Schoenberg
([18], see also [19]), slightly paraphrased and adapted to our
purposes (see footnote 3):
Theorem 4 (Schoenberg’s Theorem): A function
$k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a
radial kernel corresponding to a valid RKHS, if and only if there
exists a finite nonnegative measure $\mu$ on $[0,\infty)$, such that for any
$x, x' \in \mathbb{R}^d$,
$$k(x,x') = \int_{u=0}^{\infty} \exp\left(-u\|x-x'\|^2\right) d\mu(u).$$
This result asserts that, up to normalization factors, radial kernels
can be characterized as Laplace transforms of probability
measures on the positive reals. Schoenberg’s Theorem has been
used by Micchelli et al. [20] to prove universality of radial kernels
and by Scovel et al. [21] to establish approximation error
bounds. A related result is Bochner’s theorem (see, e.g., [22]),
which characterizes the more general class of shift-invariant kernels
as Fourier transforms of multivariate distributions on $\mathbb{R}^d$.
The above theorem implies that we can write inner products
in our RKHS using the approximate kernel
$$k_c(x,x') = \int_{u=0}^{c} k_u(x,x')\, d\mu(u) \qquad (6)$$
where $c > 0$ is a parameter and $k_u(x,x') = \exp(-u\|x-x'\|^2)$ is the
Gaussian kernel with kernel width $1/u$. Note
that this is a valid kernel by the reverse direction of Theorem 4.
If $c$ is chosen not too small, then $k_c$ is an excellent approximation
to $k$ for all $x$, $x'$. The reason why we must settle for
approximations of the radial kernel, rather than the kernel itself,
is the following: for each $k_u$ in the above integral, we construct a
surrogate kernel $\tilde{k}_u$ such that
$\mathbb{E}[\tilde{k}_u(\tilde{x},\tilde{x}')] = k_u(x,x')$. The surrogate kernel
$\tilde{k}_u$ is based on subtracting certain constants from
the kernel width $1/u$ along each dimension, and this cannot be
done if $u$ is larger than those constants.
By Fubini’s theorem, we can write (6) as
$$k_c(x,x') = \mathbb{E}\left[\int_{u=0}^{c} \tilde{k}_u(\tilde{x},\tilde{x}')\, d\mu(u)\right].$$
Footnote 3: To be precise, the theorem here is a corollary of Schoenberg’s theorem,
which discusses necessary and sufficient conditions for $k(\cdot,\cdot)$ to be positive
definite, and Mercer’s theorem (see [8]), which asserts that such a function is a
kernel of a valid RKHS.
It turns out that the integral inside the expectation corresponds
to an inner product, in a valid RKHS, between the noisy instances
$\tilde{x}$ and $\tilde{x}'$. This will be our surrogate kernel for $k_c$.
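As a worked instance of the integral representation and of the effect of truncating it at $c$, consider the inverse-quadratic radial kernel $1/(1+r^2)$ with $r = \|x-x'\|$. This particular kernel and the measure $d\mu(u) = e^{-u}\,du$ are chosen here only for illustration and are assumptions of the sketch, not taken from the text.

```latex
\begin{align*}
\frac{1}{1+r^2}
  &= \int_{0}^{\infty} e^{-u r^2}\, e^{-u}\, du ,
  \qquad r = \|x-x'\| , \\
\int_{0}^{c} e^{-u r^2}\, e^{-u}\, du
  &= \frac{1 - e^{-c(1+r^2)}}{1+r^2} .
\end{align*}
```

The truncated kernel therefore equals the original one multiplied by a factor lying in $[1-e^{-c},\,1)$, so in this example the approximation error decays exponentially in the truncation parameter $c$.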
To provide a concrete case study, we will outline the results
for the specific radial kernel4
postponing
Section VIIIC. Just to make our analysis simpler to present,
we assume here that
(this is a reasonable assumption to make when the
feature values of the original data is
The approximate kernel we will consider is
the full technical details and proofs to
for some parameter, where
).
(7)
where
off the quality of the bound on the regret and the similarity of
to . This is a valid kernel by the reverse direc
tion of Theorem 4 since
is a userdefined parameter, which trades
Note thatis always between 0 and 1, so
Therefore,
values of not too small (see Fig. 1 for a graphical illustration).
As before, we let
denote the feature mapping associated with
the kernel .
The surrogate kernel that we will pick is defined as follows:
isanexcellentapproximationof for
(8)
As before, we let
this kernel. This is a valid kernel by the reverse direction of
Theorem 4.
Our algorithm looks exactly like Algorithm 3, only that now
we use the new definitions of
recall that for any
we define
to be the angle between
. The bound takes the following form.
denote the feature mapping associated with
,above. To state the bound,
for some,
and
Theorem 5: Let
all assume that
distribution
pendent noise with
be the squared loss. For
is perturbed by Gaussian noise with known
, andis perturbed by arbitrary inde
. Letand
4Note that the scaling factor ??? is the reasonable one to take, when we as
sume that the attribute values in the instances are on the order of ????.
Fig. 1. Comparison of ????? ? (solid line) and ????? ? (dashed line) as a function of ?? ? ? ?, for ? ? ? (left) and ? ? ? (right). Note that for ? ? ?, the two
graphs are visually indistinguishable.
be fixed. If we run Algorithm 3 with the kernel (7) where
, and input parameters
and
then
where
mapping induced by the kernel (7).
The proof of the theorem is provided in Section VIIIC.
andis the feature
V. UNKNOWN NOISE DISTRIBUTION
In this part of the paper, we turn to study the setting where we
wish to learn kernel-based predictors, while having no information
about the noise distribution other than an upper bound on
its variance. This is relevant in cases where the noise is hard to
model, or if it might change in an unexpected or even adversarial
manner. Moreover, we provide results with respect to general
analytic loss functions, which go beyond the squared loss on
which we focused in Section IV. We emphasize that the techniques
here are substantially different from those of Section IV,
and do not rely on surrogate kernels. Instead, the techniques
focus on construction of unbiased gradient estimates directly in
the RKHS.
A. Algorithm
We present our algorithmic approach in a modular form. We
start by introducing the main algorithm, which contains several
subroutines. Then we prove our two main results, which bound
the regret of the algorithm, the number of queries to the oracle,
and the running time for two types of kernels: dot product and
Gaussian (our results can be extended to other kernel types as
well). In itself, the algorithm is nothing more than a standard online
gradient descent algorithm with a standard $O(\sqrt{T})$ regret
bound. Thus, most of the proofs are devoted to a detailed discussion
of how the subroutines are implemented (including explicit
pseudocode). In this subsection, we describe just one subroutine,
based on the techniques discussed in Section III. The other
subroutines require a more detailed and technical discussion,
and thus their implementation is described as part of the proofs
in Section VIII. In any case, the intuition behind the implementations
and the techniques used are described in Section III.
For the remainder of this subsection, we assume for simplicity
that $\ell$ is a classification loss; namely, it can be written as a function
of $y\langle w,\Psi(x)\rangle$. It is not hard to adapt the results below
to the case where $\ell$ is a regression loss (where $\ell$ is a function of
$\langle w,\Psi(x)\rangle - y$). Another simplifying assumption we will make,
purely in the interest of clarity, is that the noise will be restricted
just to the instance $x_t$, and not to the target value $y_t$. In other
words, we assume that the learner is given access to $y_t$, and to
an oracle which provides noisy copies of $x_t$. This does not
make our lives easier, since the hard estimation problems relate
to $x_t$ and not $y_t$ (e.g., estimating $\langle w_t,\Psi(x_t)\rangle$ in an unbiased
manner, despite the nonlinearity of the feature mapping $\Psi$). On
the other hand, it will help to make our results more transparent,
and reduce tedious bookkeeping.
At each round, the algorithm below constructs an object
which we denote as
(note that it has no relationship
to
used in the previous section). This object has two
interpretations here: formally, it is an element of a reproducing
kernel Hilbert space (RKHS) corresponding to the kernel we
use, and is equal in expectation to
regret
is a function of
. In other
, and to
. This does not
in an unbiased
). On
. However, in terms
of implementation, it is simply a data structure consisting of a
finite set of vectors from
. Thus, it can be efficiently stored
in memory and handled even for infinitedimensional RKHS.
Like
,has also two interpretations: formally, it
is an element in the RKHS, as defined in the pseudocode. In
terms of implementation, it is defined via the data structures
and the values of
To apply this hypothesis on a given instance
, where
routine which returns the inner product
pseudocode is provided as part of the proofs in Section VIII).
We start by considering dotproduct kernels; that is, kernels
that can be written as
has a Taylor expansion
for all —see theorem 4.19 in [8]. Our first result shows what
regret bound is achievable by the algorithm for any dotproduct
kernel, as well as characterize the number of oracle queries per
instance, and the overall running time of the algorithm. The
proof is provided in Section VIIIE.
at round .
, we compute
is a sub
(a
, where
such that
Theorem 6: Assume that the loss function
derivative
let
dotproduct kernel
for any
for all
. Then, for all
possible to implement the subroutines of Algorithm 4 such that:
1) The expected number of queries to each oracle
has an analytic
in its domain, and for all
(assuming it exists). Pick any
. Finally, assume that
returned by the oracle at round ,
and , it is
is
2) The expected running time of the algorithm is
3) If we run Algorithm 4 with
where
then
Algorithm 4 Kernel Learning Algorithm with Noisy Input
: Learning rate, number of rounds,
sample parameter
:
for all.
for all
//is a data structure which can store a
// variable number of vectors in
Define
Receive oracleand
Let
// Get unbiased estimates of in the RKHS
Let
// Get unbiased estimate of
Let // Perform gradient step
Let
// Compute squared norm, where
returns
If
Letfor all
//If squared norm is larger than , then project
We note that the distribution of the number of oracle queries
can be specified explicitly, and it decays very rapidly (see the
proof for details).
The parameter $p$ is user-defined, and allows one to perform a
tradeoff between the number of noisy copies required for each
example, and the total number of examples. In other words, the
regret bound will be similar whether many noisy measurements
are provided on a few examples, or just a few noisy measurements
are provided on many different examples.
The result pertaining to radial kernels is very similar, and uses
essentially the same techniques. For the sake of clarity, we provide
a more concrete result which pertains specifically to the
most important and popular radial kernel, namely the Gaussian
kernel. The proof is provided in Section VIII-F.
Theorem 7: Assume that the loss function
derivative
has an analytic
in its domain, and letfor all
(assuming it exists). Pick any Gaussian
for some
for any
. Then for all
it is possible to implement the subroutines of Algorithm
4 such that
kernel
nally, assume that
oracle at round , for all
. Fi
returned by the
and
1) The expected number of queries to each oracleis
2) The expected running time of the algorithm is
3) If we run Algorithm 4 with
where
then
As in Theorem 6, note that the number of oracle queries has a
fast-decaying distribution. Also, note that with Gaussian kernels,
$\sigma^2$ is usually chosen to be on the order of the example’s squared
norms. Thus, if the noise added to the examples is proportional
to their original norm, we can assume that
thus
appearing in the bound is also bounded by a constant.
As previously mentioned, most of the subroutines are
described in the proofs section, as part of the proof of
Theorem 6. Here, we only show how to implement the
subroutine, which returns the gra
dient length estimate
. The idea is based on the technique
described in Section IIIC. We prove that
estimate of , and bound
earlier, we assume that
is analytic and can be written as
.
, and
is an unbiased
. As discussed
Subroutine 1
Sample nonnegative integeraccording to
Let
// Get unbiased estimate of in the RKHS
Return
Lemma 4: Assume that, and that
for allreturns,.
Fig.2. Absoluteloss,hingeloss,andanalyticapproximations.Fortheabsolute
loss, the line represents the loss as a function of ???????? ? ?. For the hinge
loss, the lines represent the loss as a function of ? ????????.
Denote the output of the subroutine above as
. Then for any given
, and define
it holds that and
where the expectation is with respect to the randomness of Sub
routine 1.
Proof: The result follows from Lemma 1, where
responds to the estimator , the function
and therandom variable
corresponds to
is random andis held fixed). The term
Lemma 1 can be upper bounded as
cor
corresponds to,
(where
in
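The randomization idea behind the subroutine can be sketched in code: to estimate an analytic function of a mean without bias, draw a random power $n$, multiply $n$ independent noisy copies, and reweight by the Taylor coefficient over the sampling probability. This is a minimal sketch under stated assumptions, not the paper's subroutine: the geometric sampling law $\Pr(n) = (p-1)/p^{n+1}$, the scalar oracle, and all names (`sample_order`, `unbiased_analytic_estimate`, `demo`) are illustrative choices.

```python
import math
import random

def sample_order(p, rng):
    """Draw n >= 0 with Pr(n) = (p - 1) / p**(n + 1); requires p > 1."""
    n = 0
    while rng.random() < 1.0 / p:
        n += 1
    return n

def unbiased_analytic_estimate(coeff, oracle, p, rng):
    """Estimate sum_n coeff(n) * m**n, where m is the mean of oracle()."""
    n = sample_order(p, rng)
    prob_n = (p - 1.0) / p ** (n + 1)
    product = 1.0
    for _ in range(n):  # n fresh, independent noisy queries
        product *= oracle()
    # E[product | n] = m**n, so averaging over n recovers the full series.
    return coeff(n) / prob_n * product

def demo(num_trials=200000, seed=0):
    """Monte Carlo check with the exponential function as the analytic target."""
    rng = random.Random(seed)
    true_mean = 0.5
    oracle = lambda: true_mean + rng.uniform(-0.5, 0.5)  # hypothetical noisy oracle
    coeff = lambda n: 1.0 / math.factorial(n)            # Taylor coefficients of exp
    total = 0.0
    for _ in range(num_trials):
        total += unbiased_analytic_estimate(coeff, oracle, 2.0, rng)
    return total / num_trials, math.exp(true_mean)
```

In this sketch a larger `p` means fewer queries per call on average but larger reweighting factors, which mirrors the tradeoff between the number of noisy copies per example and the variance of the estimate.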
B. Loss Function Examples
Theorems 6 and 7 both deal with generic loss functions $\ell$
whose derivative can be written as
$\ell'(a) = \sum_{n=0}^{\infty} \gamma_n a^n$, and the regret
bounds involve the functions
$\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$. Below, we
present a few examples of loss functions and their corresponding
$\ell'_+$. As mentioned earlier, while the theorems in the previous
subsection are in terms of classification losses (i.e., $\ell$ is a
function of $y\langle w,\Psi(x)\rangle$), virtually identical results can be proven for
regression losses (i.e., $\ell$ is a function of $\langle w,\Psi(x)\rangle - y$), so we
will give examples from both families. Working out the first two
examples is straightforward. The proofs of the other two appear
in Section VIII-G. The loss functions in the last two examples
are illustrated graphically in Fig. 2.
Example1: Forthesquaredlossfunction,
, we have.
Example 2: For the exponential loss function,
we have
.
Example3: Recallthatthestandardabsolutelossisdefinedas
.Considera“smoothed”ab
, defined as an antideriva
(see proof for exact analytic
solute loss function
tive of
form). Then we have that
for some
Example 4: Recall that the standard hinge loss is defined
as
a “smoothed” hinge loss
tiderivative of
for exact analytic form). Then we have that
. Consider
, defined as an an
for some(see proof
For any $c$, the loss functions in the last two examples are convex,
and, respectively, approximate the absolute loss
$|\langle w,\Psi(x)\rangle - y|$ and the hinge loss
$\max\{0, 1 - y\langle w,\Psi(x)\rangle\}$ arbitrarily well for
large enough $c$. Fig. 2 shows these loss functions graphically
for a fixed value of $c$. Note that $c$ need not be large in order to get a good
approximation. Also, we note that both the loss itself and its
gradient are computationally easy to evaluate.
Finally, we remind the reader that as discussed in
Section III-C, performing an unbiased estimate of the gradient
for nondifferentiable losses directly (such as the hinge
loss or absolute loss) appears to be impossible in general. On
the flip side, if one is willing to use a random number of queries
with polynomially decaying rather than exponentially decaying
tails, then one can achieve much better sample complexity results,
by focusing on loss functions (or approximations thereof)
which are only differentiable to a bounded order, rather than
fully analytic. This again demonstrates the tradeoff between the
number of examples, and the amount of information that needs
to be gathered on each example.
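To illustrate how an analytic loss can approximate the absolute loss, here is a small sketch. The exact smoothing used in the examples is not recoverable from the text above, so this sketch assumes one standard choice: a loss whose derivative is the error function $\mathrm{erf}(ca)$, which is analytic and tends to the sign function as $c$ grows. The names `smoothed_abs_loss` and `smoothed_abs_grad` are hypothetical.

```python
import math

def smoothed_abs_loss(a, c):
    """An antiderivative of erf(c * a); approaches |a| as c grows (assumed form)."""
    return a * math.erf(c * a) + math.exp(-(c * a) ** 2) / (c * math.sqrt(math.pi))

def smoothed_abs_grad(a, c):
    """Derivative of smoothed_abs_loss with respect to a."""
    return math.erf(c * a)

def max_gap(c):
    """Largest distance to |a|, attained at a = 0, equals 1 / (c * sqrt(pi))."""
    return smoothed_abs_loss(0.0, c)
```

Under this assumed form the loss is convex (its second derivative is a positive Gaussian bump) and its distance from the absolute loss is at most $1/(c\sqrt{\pi})$, so moderate values of $c$ already give a close approximation.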
VI. ARE MULTIPLE NOISY COPIES NECESSARY?
The positive results discussed so far are mostly based on
getting more than one noisy copy per example. However, one
might wonder if this is really necessary. In some applications
this is inconvenient, and one would prefer a method which
works when just a single noisy copy of each example is made
available. Moreover, in the setting of known noise covariance
(Section IV-B), for linear predictors and squared loss, we
needed just one noisy copy of each example $(x_t, y_t)$ in order to
learn. Perhaps a similar result can be obtained even when the
noise distribution is unknown?
In this section we show that, unfortunately, such a method
cannot be found. Specifically, we prove that if the noise distribution
is unknown, then under very mild assumptions, no method
can achieve sublinear regret, when it has access to just a single
noisy copy of each instance $x_t$ (even when $y_t$ is known). On
the other hand, for the case of squared loss and linear kernels,
we know that we can learn based on two noisy copies of each
instance (see Section IV-A). So without further assumptions,
the lower bound that we prove here is indeed tight. It is an
interesting open problem to show improved lower bounds when
nonlinear kernels are used, or when the loss function is more
complex.
Theorem 8: Let
be a compact convex subset of
satisfies the following: (1) it is bounded from
below; (2) it is differentiable at 0 with
learning algorithm which selects hypotheses from
lowedaccesstoasinglenoisycopyoftheinstanceateachround
, there existsa strategy for the adversary such that the sequence
of predictors output by the algorithm satisfies
, and let
. For any
and is al
with probability 1.
Note that condition (1) is satisfied by virtually any loss function
other than the linear loss, while condition (2) is satisfied
by most regression losses, and by all classification calibrated
losses, which include all reasonable losses for classification (see
[23]).
The intuition of the proof is very simple: the adversary
chooses beforehand whether the examples are drawn i.i.d.
from a distribution $D$, and then perturbed by noise, or drawn
i.i.d. from some other distribution $D'$ without adding noise.
The distributions $D$, $D'$ and the noise are designed so that the
examples observed by the learner are distributed in the same
way irrespective of which of the two sampling strategies the
adversary chooses. Therefore, it is impossible for the learner
accessing a single copy of each instance to be statistically
consistent with respect to both distributions simultaneously.
As a result, the adversary can always choose a distribution on
which the algorithm will be inconsistent, leading to constant
regret.
To prove the theorem, we use a more general result which
leads to nonvanishing regret, and then show that under the
assumptions of Theorem 8, the result holds. The proof of the result
is given in Section VIII-I.
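The indistinguishability argument can be made concrete with a toy one-dimensional example. The specific distributions below are assumptions chosen for illustration and are not the construction used in the proof: instances drawn uniformly from {-1, +1} and perturbed by zero-mean noise uniform on {-1, +1} look exactly like noise-free instances drawn from a three-point distribution.

```python
from collections import Counter
from fractions import Fraction

def convolve(dist_a, dist_b):
    """Exact distribution of the sum of two independent discrete variables."""
    out = Counter()
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[a + b] += pa * pb
    return dict(out)

half = Fraction(1, 2)
quarter = Fraction(1, 4)

# Scenario 1: clean instances from D, then perturbed by zero-mean noise.
clean_d = {-1: half, 1: half}
noise = {-1: half, 1: half}
observed_1 = convolve(clean_d, noise)

# Scenario 2: clean instances from D', with no noise added.
clean_d_prime = {-2: quarter, 0: half, 2: quarter}
observed_2 = convolve(clean_d_prime, {0: Fraction(1)})
```

Both scenarios induce the same law on what the learner observes, yet the clean data differ, so a learner that sees only one noisy copy per round cannot tell which clean distribution it should be competing on.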
Theorem 9: Let
pick any learning algorithm which selects hypotheses from
and is allowed access to a single noisy copy of the instance at
each round . If there exists a distribution over a compact subset
of
such that
be a compact convex subset ofand
(9)
are disjoint, then there exists a strategy for the adversary such
that the sequence
algorithm satisfies
of predictors output by the
with probability 1.
Another way to phrase this theorem is that the regret cannot
vanish, if given examples sampled i.i.d. from a distribution, the
learning problem is more complicated than just finding the mean
of the data. Indeed, the adversary’s strategy we choose later on
is simply drawing and presenting examples from such a distribution.
Below, we sketch how we use Theorem 9 in order to
prove Theorem 8. A full proof is provided in Section VIII-H.
We construct a very simple onedimensional distribution,
which satisfies the conditions of Theorem 9: it is simply
the uniform distribution on
. Thus, it is enough to show that
, whereis the vector
(10)
are disjoint, for some appropriately chosen
contrary, then under the assumptions on , we show that the first
set in (10) is inside a bounded ball around the origin, in a way
which is independent of
, no matter how large it is. Thus, if
we pick
to be large enough, and assume that the two sets in
(10) are not disjoint, then there must be some
and
. However, this can be shown to contradict the assumptions on
, leading to the desired result.
. Assuming the
such that both
have a subgradient of zero at
VII. CONCLUSIONS AND FUTURE WORK
We have investigated the problem of learning, in an online
fashion, linear and kernel-based predictors when the observed
examples are corrupted by noise. We have shown bounds
on the expected regret of learning algorithms under various
assumptions on the noise distribution and the loss function
(squared loss, analytic losses). A key ingredient of our results is
the derivation of unbiased estimates for the loss gradients based
on the possibility of obtaining a small but random number
of independent copies of each noisy example. We also show
that accessing more than one copy of each noisy example is a
necessary condition for learning with sublinear regret.
There are several interesting research directions worth pursuing
in the noisy learning framework introduced here. One is
doing away with unbiasedness, which could lead to the
design of estimators that are applicable to more types of loss
functions, for which unbiased estimators may not even exist.
Biased estimates may also help in designing improved estimates
for kernel learning when the noise distribution is known, but
not necessarily Gaussian. Another open question is whether our
lower bound (Theorem 8) can be improved when nonlinear kernels
are used.
VIII. PROOFS
A. Proof of Theorem 1
First, we use the following lemma that can be easily adapted
from [11].
Lemma5: Let
and for
jection operator on an origincentered ball of radius
for all
such that
beasequenceofvectors.Let
, whereletis the pro
. Then,
we have
Applying Lemma 5 with
obtain
as defined in Lemma 2, we
Taking expectation of both sides and using again Lemma 2, we
obtain that
Now, using convexity we get that
which gives
Picking as in the theorem statement concludes our proof.
B. Proof of Theorem 3
To prove the theorem, we will need a few auxiliary lemmas. In
particular, Lemma 6 is a key technical lemma, which will prove
crucial in connecting the RKHS with respect to $\tilde{\Psi}$
and the RKHS with respect to $\Psi$. Lemma 8 connects
between the norms of elements in the two RKHS’s.
To state the lemmas and proofs conveniently, recall the shorthand
,,
Lemma 6: For any ,
is a Gaussian randomvectorwith covariancematrix
then it holds that
, if we let where
,
Proof: The expectation in the lemma can be written as
(11)
A purely technical integration exercise reveals that each element
in this product equals
equals
. Therefore, (11)
which is exactly.
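The kind of one-dimensional integral behind this lemma can be checked numerically. The sketch below is an illustration under stated assumptions rather than the lemma's exact statement: it assumes a Gaussian factor of the form $\exp(-u^2/\sigma^2)$, a surrogate whose width is reduced by twice the noise variance and rescaled by $(1 - 2s/\sigma^2)^{-1/2}$, and Gaussian noise of variance $s$ added to the argument $u$. All function names are hypothetical.

```python
import math

def gaussian_pdf(n, var):
    return math.exp(-n * n / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gaussian_factor(u, width):
    """Target 1-D factor exp(-u^2 / width)."""
    return math.exp(-u * u / width)

def surrogate_factor(u, width, noise_var):
    """Rescaled Gaussian factor whose width is reduced by 2 * noise_var."""
    reduced = width - 2.0 * noise_var
    if reduced <= 0:
        raise ValueError("kernel width must exceed twice the noise variance")
    scale = (1.0 - 2.0 * noise_var / width) ** -0.5
    return scale * math.exp(-u * u / reduced)

def expected_surrogate(u, width, noise_var, steps=20001, span=12.0):
    """Trapezoid-rule value of E[surrogate_factor(u + n)], n ~ N(0, noise_var)."""
    std = math.sqrt(noise_var)
    lo = -span * std
    h = 2.0 * span * std / (steps - 1)
    total = 0.0
    for i in range(steps):
        n = lo + i * h
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * surrogate_factor(u + n, width, noise_var) * gaussian_pdf(n, noise_var)
    return total * h
```

Averaging the surrogate factor over the noise recovers the target factor, because convolving two Gaussians adds their variances. The width reduction is only possible when the width exceeds twice the noise variance, which is why the sketch raises an error otherwise.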
Lemma 7: Let
RKHS. Let
scalars, such that
it holds that
denote a feature mapping to an arbitrary
be vectors in
for some
, and
. Then
whereis theanglebetweenand
in the RKHS (orif one of these
elements is zero).
We remark that this bound is designed for readability—it is
not the tightest upper bound possible.
Proof: The bound trivially holds if
are zero, so we will assume w.l.o.g. that they
are both nonzero.
To simplify notation, let
or
and notice that
fact that
. By the cosine theorem and the
, we have that
Solving for
quadratic equation, we have that
and taking the larger root in the resulting
(12)
(it is easy to verify that the term in the squared root is always
nonnegative). Therefore
From straightforward geometric arguments, we must have
(this is the same reason the term in
the squared root in (12) is nonnegative). Plugging this into the
righthand side (RHS) of the inequality above, we get an upper
bound of the form
where we used the fact that
upper bounding leads to the lemma statement.
. A straightforward
The following lemma is basically a corollary of Lemma 7.
Lemma 8:
Let
scalars, such that
is an element in the RKHS with respect to
, whose norm squared is at most
be vectors in, and
.
Then
Here,isthe angle
in the RKHS (or
betweenand
if one of the
elements is zero).
Proof: Picking some
lemma statement, we have
andas in the
(13)
where the last transition is by the fact that
Now, by definition of
is always positive.
,, it holds for any,that
which is at most. Therefore, we can upper bound (13) by
The lemma follows by noting that according to Lemma 7
With these lemmas in hand, we are now ready to prove the
main theorem.
To make the proof clearer, let
algorithm 3 at the beginning of round .
denote the value ofin
 Available from microsoft.com
 Available from Nicolò CesaBianchi · May 29, 2014