# Nicolò Cesa-BianchiUniversity of Milan | UNIMI · Department of Computer Science

Nicolò Cesa-Bianchi

## About

213

Publications

34,950

Reads

**How we measure 'reads'**

A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more

20,598

Citations

Citations since 2017

## Publications

Publications (213)

Online learning methods yield sequential regret bounds under minimal assumptions and provide in-expectation risk bounds for statistical learning. However, despite the apparent advantage of online guarantees over their statistical counterparts, recent findings indicate that in many important cases, regret bounds may not guarantee tight high-probabil...

Multitask learning is a powerful framework that enables one to simultaneously learn multiple related tasks by sharing information between them. Quantifying uncertainty in the estimated tasks is of pivotal importance for many downstream applications, such as online or active learning. In this work, we provide novel multitask confidence intervals in...

We study the problem of regret minimization for a single bidder in a sequence of first-price auctions where the bidder knows the item's value only if the auction is won. Our main contribution is a complete characterization, up to logarithmic factors, of the minimax regret in terms of the auction's transparency, which regulates the amount of informa...

We investigate online classification with paid stochastic experts. Here, before making their prediction, each expert must be paid. The amount that we pay each expert directly influences the accuracy of their prediction through some unknown Lipschitz "productivity" function. In each round, the learner must decide how much to pay each expert and then...

We study a $K$-armed bandit with delayed feedback and intermediate observations. We consider a model where intermediate observations have a form of a finite state, which is observed immediately after taking an action, whereas the loss is observed after an adversarially chosen delay. We show that the regime of the mapping of states to losses determi...

In this work, we improve on the upper and lower bounds for the regret of online learning with strongly observable undirected feedback graphs. The best known upper bound for this problem is $\mathcal{O}\bigl(\sqrt{\alpha T\ln K}\bigr)$, where $K$ is the number of actions, $\alpha$ is the independence number of the graph, and $T$ is the time horizon....

We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds f...

We investigate the problem of bandits with expert advice when the experts are fixed and known distributions over the actions. Improving on previous analyses, we show that the regret in this setting is controlled by information-theoretic quantities that measure the similarity between experts. In some natural special cases, this allows us to obtain t...

We study repeated bilateral trade where an adaptive $\sigma$-smooth adversary generates the valuations of sellers and buyers. We provide a complete characterization of the regret regimes for fixed-price mechanisms under different feedback models in the two cases where the learner can post either the same or different prices to buyers and sellers. W...

Bilateral trade, a fundamental topic in economics, models the problem of intermediating between two strategic agents, a seller and a buyer, willing to trade a good for which they hold private valuations. In this paper, we cast the bilateral trade problem in a regret minimization framework over T rounds of seller/buyer interactions, with no prior kn...

Nonstationary phenomena, such as satiation effects in recommendation, are a common feature of sequential decision-making problems. While these phenomena have been mostly studied in the framework of bandits with finitely many arms, in many practically relevant cases linear bandits provide a more effective modeling choice. In this work, we introduce...

The framework of feedback graphs is a generalization of sequential decision-making with bandit or full information feedback. In this work, we study an extension where the directed feedback graph is stochastic, following a distribution similar to the classical Erd\H{o}s-R\'enyi model. Specifically, in each round every edge in the graph is either rea...

We study a repeated game between a supplier and a retailer who want to maximize their respective profits without full knowledge of the problem parameters. After characterizing the uniqueness of the Stackelberg equilibrium of the stage game with complete information, we show that even with partial knowledge of the joint distribution of demand and pr...

We consider prediction with expert advice for strongly convex and bounded losses, and investigate trade-offs between regret and "variance" (i.e., squared difference of learner's predictions and best expert predictions). With $K$ experts, the Exponentially Weighted Average (EWA) algorithm is known to achieve $O(\log K)$ regret. We prove that a varia...

We consider online learning with feedback graphs, a sequential decision-making framework where the learner's feedback is determined by a directed graph over the action set. We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial enviro...

We introduce and analyze AdaTask, a multitask online learning algorithm that adapts to the unknown structure of the tasks. When the $N$ tasks are stochastically activated, we show that the regret of AdaTask is better, by a factor that can be as large as $\sqrt{N}$, than the regret achieved by running $N$ independent algorithms, one for each task. A...

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over the subsequent rounds in an adversarial way. The instantaneous loss observed by the player at the end of each round is then a sum of many loss components of previously played actions. This setting encompasses...

We study nonstochastic bandits and experts in a delayed setting where delays depend on both time and arms. While the setting in which delays only depend on time has been extensively studied, the arm-dependent delay setting better captures real-world applications at the cost of introducing new technical challenges. In the full information (experts)...

Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our mo...

Bilateral trade, a fundamental topic in economics, models the problem of intermediating between two strategic agents, a seller and a buyer, willing to trade a good for which they hold private valuations. In this paper, we cast the bilateral trade problem in a regret minimization framework over $T$ rounds of seller/buyer interactions, with no prior...

We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, t...

We study the problem of online multiclass classification in a setting where the learner's feedback is determined by an arbitrary directed graph. While including bandit feedback as a special case, feedback graphs allow a much richer set of applications, including filtering and label efficient classification. We introduce Gappletron, the first online...

We introduce and analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks. We prove that the regret of MT-OMD is of order $\sqrt{1 + \sigma^2(N-1)}\sqrt{T}$, where $\sigma^2$ is the task variance according to the geometry induced by the regularizer, $N$ is the number of tasks, and $T$...

Online learning is a framework for the design and analysis of algorithms that build predictive models by processing data one at the time. Besides being computationally efficient, online algorithms enjoy theoretical performance guarantees that do not rely on statistical assumptions on the data source. In this review, we describe some of the most imp...

We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played. Our algorithm is based on adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon. In the o...

Bilateral trade, a fundamental topic in economics, models the problem of intermediating between two strategic agents, a seller and a buyer, willing to trade a good for which they hold private valuations. Despite the simplicity of this problem, a classical result by Myerson and Satterthwaite (1983) affirms the impossibility of designing a mechanism...

We investigate the problem of exact cluster recovery using oracle queries. Previous results show that clusters in Euclidean spaces that are convex and separated with a margin can be reconstructed exactly using only $O(\log n)$ same-cluster queries, where $n$ is the number of input points. In this work, we study this problem in the more challenging...

We study the problem of recovering distorted clusters in the semi-supervised active clustering framework. Given an oracle revealing whether any two points lie in the same cluster, we are interested in designing algorithms that recover all clusters exactly, in polynomial time, and using as few queries as possible. Towards this end, we extend the not...

Motivated by recommendation problems in music streaming platforms, we propose a non-stationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a...

One of the main strengths of online algorithms is their ability to adapt to arbitrary data sequences. This is especially important in nonparametric settings, where regret is measured against rich classes of comparator functions that are able to fit complex environments. Although such hard comparators and complex environments may exhibit local regul...

Motivated by recommendation problems in music streaming platforms, we propose a non-stationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a...

Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a...

We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that the "delayed" Exp3 achieves the $O(\sqrt{(KT + D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016], in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total de...

We study a setting in which a learner faces a sequence of A/B tests and has to make as many good decisions as possible within a given amount of time. Each A/B test $n$ is associated with an unknown (and potentially negative) reward $\mu_n \in [-1,1]$, drawn i.i.d. from an unknown and fixed distribution. For each A/B test $n$, the learner sequential...

We investigate learning algorithms that use similarity queries to approximately solve correlation clustering problems. The input consists of $n$ objects; each pair of objects has a hidden binary similarity score that we can learn through a query. The goal is to use as few queries as possible to partition the objects into clusters so to achieve the...

Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as Stochastic Gradient Langevin Dynamics and ---to some extent--- Stochastic Gradient Descent), while it also arises in other contexts, including PAC-Bayesian theory, and sampling mechanisms. In this work we study the excess risk suffered by a...

We study an asynchronous online learning setting with a network of agents. At each time step, some of the agents are activated, requested to make a prediction, and pay the corresponding loss. The loss function is then revealed to these agents and also to their neighbors in the network. When activations are stochastic, we show that the regret achiev...

We prove that two popular linear contextual bandit algorithms, OFUL and Thompson Sampling, can be made efficient using Frequent Directions, a deterministic online sketching technique. More precisely, we show that a sketch of size $m$ allows a $\mathcal{O}(md)$ update time for both algorithms, as opposed to $\Omega(d^2)$ required by their non-sketch...

Motivated by posted price auctions where buyers are grouped in an unknown number of latent types characterized by their private values for the good on sale, we investigate revenue maximization in stochastic dynamic pricing when the distribution of buyers' private values is supported on an unknown set of points in [0,1] of unknown cardinality K. Thi...

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enoug...

Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meanin...

We study algorithms for online nonparametric regression that learn the directions along which the regression function is smoother. Our algorithm learns the Mahalanobis metric based on the gradient outer product matrix $\boldsymbol{G}$ of the regression function (automatically adapting to the effective rank of this matrix), while simultaneously boun...

We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availa...

Recognising human activities from streaming sources poses unique challenges to learning algorithms. Predictive models need to be scalable, incrementally trainable, and must remain bounded in size even when the data stream is arbitrarily long. In order to achieve high accuracy even in complex and dynamic environments methods should be also nonparame...

We investigate contextual online learning with nonparametric (Lipschitz) comparison classes under different assumptions on losses and feedback information. For full information feedback and Lipschitz losses, we characterize the minimax regret up to log factors by proving an upper bound matching a previously known lower bound. In a partial feedback...

Automated protein function prediction is a challenging problem with distinctive features, such as the hierarchical organization of protein functions and the scarcity of annotated proteins for most biological functions. We propose a multitask learning algorithm addressing both issues. Unlike standard multitask algorithms, which use task (protein fun...

Multi-task algorithms typically use task similarity information as a bias to speed up learning. We argue that, when the classification problem is unbalanced, task dissimilarity information provides a more effective bias, as rare class labels tend to be better separated from the frequent class labels. In particular, we show that a multi-task extensi...

In the problem of edge sign classification, we are given a directed graph (representing an online social network), and our task is to predict the binary labels of the edges (i.e., the positive or negative nature of the social relationships). Many successful heuristics for this problem are based on the troll-trust features, estimating on each node t...

Recognising human activities from streaming videos poses unique challenges to learning algorithms: predictive models need to be scalable, incrementally trainable, and must remain bounded in size even when the data stream is arbitrarily long. Furthermore, as parameter tuning is problematic in a streaming setting, suitable approaches should be parame...

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce \textsc{Exp3-Coop}, a cooperat...

We propose Sketched Online Newton (SON), an online second order learning algorithm that enjoys substantially improved regret guarantees for ill-conditioned data. SON is an enhanced version of the Online Newton Step, which, via sketching techniques enjoys a linear running time. We further improve the computational complexity to linear in the number...

Stream mining poses unique challenges to machine learning: predictive models
are required to be scalable, incrementally trainable, must remain bounded in
size (even when the data stream is arbitrarily long), and be nonparametric in
order to achieve high accuracy even in complex and dynamic environments.
Moreover, the learning system must be paramet...

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enoug...

We study a general class of online learning problems where the feedback is
specified by a graph. This class includes online prediction with expert advice
and the multi-armed bandit problem, but also several learning problems where
the online player does not necessarily observe his own loss. We analyze how the
structure of the feedback graph control...

A well-recognized limitation of kernel learning is the requirement to handle
a kernel matrix, whose size is quadratic in the number of training examples.
Many methods have been proposed to reduce this computational cost, mostly by
using a subset of the kernel matrix entries, or some form of low-rank matrix
approximation, or a random projection meth...

We present and study a partial-information model of online learning, where a
decision maker repeatedly chooses from a finite set of actions, and observes
some subset of the associated losses. This naturally models several situations
where the losses of different actions are related, and knowing the loss of one
action provides information on the los...

We introduce an online action recognition system that can be combined with any set of frame-by-frame feature descriptors. Our system covers the frame feature space with classifiers whose distribution adapts to the hardness of locally approximating the Bayes optimal classifier. An efficient nearest neighbour search is used to find and combine the lo...

We consider the partial observability model for multi-armed bandits,
introduced by Mannor and Shamir. Our main result is a characterization of
regret in the directed observability model in terms of the dominating and
independence numbers of the observability graph. We also show that in the
undirected case, the learner can achieve optimal regret wit...

Multi-armed bandit problems are receiving a great deal of attention because
they adequately formalize the exploration-exploitation trade-offs arising in
several industrially relevant applications, such as online advertisement and,
more generally, recommendation systems. In many cases, however, these
applications have a strong social component, whos...

Online learning algorithms are fast, memory-efficient, easy to implement, and
applicable to many prediction problems, including classification, regression,
and ranking. Several online algorithms were proposed in the past few decades,
some based on additive updates, like the Perceptron, and some on multiplicative
updates, like Winnow. A unified view...

We study the power of different types of adaptive (nonoblivious) adversaries
in the setting of prediction with expert advice, under both full-information
and bandit feedback. We measure the player's performance using a new notion of
regret, also known as policy regret, which better captures the adversary's
adaptiveness to the player's behavior. In...

We investigate the problem of active learning on a given tree whose nodes are
assigned binary labels in an adversarial way. Inspired by recent results by
Guillory and Bilmes, we characterize (up to constant factors) the optimal
placement of queries so to minimize the mistakes made on the non-queried nodes.
Our query selection algorithm is extremely...

Predicting the nodes of a given graph is a fascinating theoretical problem
with applications in several domains. Since graph sparsification via spanning
trees retains enough information while making the task much easier, trees are
an important special case of this problem. Although it is known how to predict
the nodes of an unweighted tree in a nea...

We present very efficient active learning algorithms for link classification
in signed networks. Our algorithms are motivated by a stochastic model in which
edge labels are obtained through perturbations of a initial sign assignment
consistent with a two-clustering of the nodes. We provide a theoretical
analysis within this model, showing that we c...

Motivated by social balance theory, we develop a theory of link
classification in signed networks using the correlation clustering index as
measure of label regularity. We derive learning bounds in terms of correlation
clustering within three fundamental transductive learning settings: online,
batch and active. Our main algorithmic contribution is...

We show a regret minimization algorithm for setting the reserve price in a sequence of second-price auctions, under the assumption that all bids are independently drawn from the same unknown and arbitrary distribution. Our algorithm is computationally efficient, and achieves a regret of Õ(√T) in a sequence of T auctions. This holds even when the nu...

We study regret minimization bounds in which the dependence on the number of experts is replaced by measures of the realized complexity of the expert class. The measures we consider are defined in retrospect given the realized losses. We concentrate on two interesting cases. In the first, our measure of complexity is the number of different "leadin...

We investigate the problem of sequentially predicting the binary labels on
the nodes of an arbitrary weighted graph. We show that, under a suitable
parametrization of the problem, the optimal number of prediction mistakes can
be characterized (up to logarithmic factors) by the cutsize of a random
spanning tree of the graph. The cutsize is induced b...

We investigate online kernel algorithms which simultaneously process multiple
classification tasks while a fixed constraint is imposed on the size of their
active sets. We focus in particular on the design of algorithms that can
efficiently deal with problems where the number of tasks is extremely high and
the task data are large scale. Two new pro...

The stochastic multi-armed bandit problem is well understood when the
reward distributions are sub-Gaussian. In this paper we examine the
bandit problem under the weaker assumption that the distributions have
moments of order 1+\epsilon, for some $\epsilon \in (0,1]$.
Surprisingly, moments of order 2 (i.e., finite variance) are sufficient
to obtain...

Multi-armed bandit problems are the most basic examples of sequential
decision problems with an exploration-exploitation trade-off. This is the
balance between staying with the option that gave highest payoffs in the past
and exploring new options that might give higher payoffs in the future.
Although the study of bandit problems dates back to the...

Mirror descent with an entropic regularizer is known to achieve shifting
regret bounds that are logarithmic in the dimension. This is done using either
a carefully designed projection or by a weight sharing technique. Via a novel
unified analysis, we show that these two approaches deliver essentially
equivalent bounds on a notion of regret generali...

We address the online linear optimization problem with bandit feedback. Our
contribution is twofold. First, we provide an algorithm (based on exponential
weights) with a regret of order $\sqrt{d n \log N}$ for any finite action set
with $N$ actions, under the assumption that the instantaneous loss is bounded
by 1. This shaves off an extraneous $\sq...

We study online learning of linear and kernel-based predictors, when individual examples are corrupted by random noise, and both examples and noise type can be chosen adversarially and change over time. We begin with the setting where some auxiliary information on the noise distribution is provided, and we wish to learn predictors with respect to t...

Predicting the nodes of a given graph is a fascinating theoretical problem with applications in several domains. Since graph sparsification via spanning trees retains enough information while making the task much easier, trees are an important special case of this problem. Although it is known how to predict the nodes of an unweighted tree in a nea...

We develop a new tool for data-dependent analysis of the
exploration-exploitation trade-off in learning under limited feedback. Our tool
is based on two main ingredients. The first ingredient is a new concentration
inequality that makes it possible to control the concentration of weighted
averages of multiple (possibly uncountably many) simultaneou...

We present a set of high-probability inequalities that control the
concentration of weighted averages of multiple (possibly uncountably many)
simultaneously evolving and interdependent martingales. Our results extend the
PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales
opening the way for its application to importance...

We provide the first algorithm for online bandit linear optimization whose
regret after T rounds is of order sqrt{Td ln N} on any finite class X of N
actions in d dimensions, and of order d*sqrt{T} (up to log factors) when X is
infinite. These bounds are not improvable in general. The basic idea utilizes
tools from convex geometry to construct what...

In many real world applications, the number of examples to learn from is plentiful, but we can only obtain limited information on each individual example. We study the possibilities of efficient, provably correct, large-scale learning in such settings. The main theme we would like to establish is that large amounts of examples can compensate for th...

Gene function prediction is a complex multilabel classification problem with several distinctive features: the hierarchical relationships between functional classes, the presence of multiple sources of biomolecular data, the unbalance between positive and negative examples for each class, the complexity of the whole-ontology and genome-wide dimensi...

Most traditional online learning algorithms are based on variants of mirror
descent or follow-the-leader. In this paper, we present an online algorithm
based on a completely different approach, tailored for transductive settings,
which combines "random playout" and randomized rounding of loss subgradients.
As an application of our approach, we pres...