## About

307 Publications · 46,616 Reads · 5,804 Citations

## Publications

Using a martingale concentration inequality, concentration bounds “from time $n_0$ on” are derived for stochastic approximation algorithms with contractive maps and both martingale difference and Markov noises. These are applied to reinforcement learning algorithms, in particular to asynchronous Q-learning and TD(0).

We consider the problem of scheduling packet transmissions in a wireless network of users while minimizing the energy consumed and the transmission delay. A challenge is that transmissions of users that are close to each other mutually interfere, while users that are far apart can transmit simultaneously without much interference. Each user has a q...

A novel reinforcement learning algorithm is introduced for multiarmed restless bandits with average reward, using the paradigms of Q-learning and Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provid...

In this work, we show that for the martingale problem for a class of degenerate diffusions with bounded continuous drift and diffusion coefficients, the small noise limit of non-degenerate approximations leads to a unique Feller limit. The proof uses the theory of viscosity solutions applied to the associated backward Kolmogorov equations. Under ap...

The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as multi-armed restless bandits. In this paper we develop QWI, an algorithm based on Q-learning in order to learn the Whittle indices. The key feature is the deployment of two timescale...
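
A minimal sketch of the two-timescale structure described above, with dynamics, step sizes, and targets all chosen purely for illustration (this is not the QWI algorithm itself): a fast iterate tracks the current value of a slowly moving iterate, mirroring the fast Q-update / slow index-update split.

```python
import numpy as np

def two_timescale(n_iter=20000, seed=0):
    """Toy two-timescale stochastic approximation (illustrative only).

    The fast iterate x tracks the current slow iterate y; the slow
    iterate y drifts toward 1.  With step sizes a_n >> b_n, x "sees" y
    as quasi-static, mirroring the fast Q-update / slow Whittle-index
    update split in two-timescale schemes such as QWI.
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    for n in range(1, n_iter + 1):
        a = 1.0 / n**0.6            # fast step size
        b = 1.0 / n                 # slow step size (b/a -> 0)
        noise = rng.normal(scale=0.1)
        x += a * ((y - x) + noise)  # fast loop: track y
        y += b * (1.0 - y)          # slow loop: drift toward 1
    return x, y

x, y = two_timescale()
print(round(x, 2), round(y, 2))     # both end up near 1.0
```

Because the ratio of step sizes vanishes, the fast iterate effectively equilibrates for each frozen value of the slow one, which is what makes the coupled analysis tractable.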

We correct an error in the statement of the main result of Borkar (2021) and also point out some improvements.

We propose a scheme for accelerating Markov Chain Monte Carlo by introducing random resets that become increasingly rare in a precise sense. We show that this still leads to the desired asymptotic average and establish an associated concentration bound. We show by numerical experiments that this scheme can be used to advantage in order to accelerat...
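
As a hedged illustration of the idea of MCMC with increasingly rare resets, the toy sketch below runs a Metropolis chain on four states and injects a uniform reset with probability of order $c/n$ at step $n$; the target distribution, proposal, and reset schedule are all made up for this example and are not the paper's scheme.

```python
import numpy as np

def metropolis_with_resets(n_steps=200000, seed=0):
    """Metropolis chain on {0,1,2,3} targeting pi proportional to
    (1,2,3,4), with random resets to a uniform state at times that
    become increasingly rare (reset probability ~ c/n).
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    w = np.array([1.0, 2.0, 3.0, 4.0])   # unnormalized target weights
    x = 0
    total = 0.0
    for n in range(1, n_steps + 1):
        if rng.random() < min(1.0, 5.0 / n):   # increasingly rare reset
            x = rng.integers(0, 4)
        y = x + rng.choice([-1, 1])            # symmetric +/-1 proposal
        if 0 <= y <= 3 and rng.random() < min(1.0, w[y] / w[x]):
            x = y
        total += x
    return total / n_steps

print(metropolis_with_resets())   # near 2.0 = sum_k k*w[k] / sum(w)
```

Since the reset probability is summable up to a logarithmic factor, the resets become negligible in density and the ergodic average still converges to the target mean, consistent with the claim in the abstract.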

The popular LSPE($\lambda$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high probability performance guarantees from some time on.

We consider a prospect theoretic version of the classical Q-learning algorithm for discounted reward Markov decision processes, wherein the controller perceives a distorted and noisy future reward, modeled by a nonlinearity that accentuates gains and under-represents losses relative to a reference point. We analyze the asymptotic behavior of the sc...
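
To make the distortion concrete, here is a toy one-state, two-action example with an invented piecewise-linear utility that accentuates gains and under-represents losses relative to a zero reference point (the paper's nonlinearity and MDP are not specified here): the distorted learner ends up preferring a risky arm that an undistorted learner would reject.

```python
import numpy as np

def distorted_q_values(n_samples=20000, seed=2):
    """Q-learning on a one-state, two-action problem where rewards are
    perceived through a prospect-theoretic distortion u (toy choice).

    Arm 0 is risky: reward +/-1 with equal probability (mean 0).
    Arm 1 is safe: reward 0.2 deterministically.
    """
    rng = np.random.default_rng(seed)
    u = lambda r: 2.0 * r if r > 0 else 0.5 * r   # illustrative distortion
    Q = np.zeros(2)
    counts = np.zeros(2)
    for _ in range(n_samples):
        for a in (0, 1):
            r = rng.choice([1.0, -1.0]) if a == 0 else 0.2
            counts[a] += 1
            Q[a] += (u(r) - Q[a]) / counts[a]     # step size 1/n(a)
    return Q

Q = distorted_q_values()
print(np.round(Q, 2))   # distorted agent prefers the risky arm: Q[0] > Q[1]
```

Here $Q[0]$ tracks $E[u(r)] = 0.5\cdot 2 - 0.5\cdot 0.5 = 0.75$ while $Q[1]$ tracks $u(0.2) = 0.4$, so the distortion flips the preference order relative to the raw expected rewards.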

We introduce a model of graph-constrained dynamic choice with reinforcement modeled by positively $\alpha$-homogeneous rewards. We show that its empirical process, which can be written as a stochastic approximation recursion with Markov noise, has the same probability law as a certain vertex reinforced random walk. We use this equivalence to sho...

To overcome the curse of dimensionality and the curse of modeling in Dynamic Programming (DP) methods for solving Markov Decision Process (MDP) problems, Reinforcement Learning (RL) methods are adopted in practice. Contrary to traditional RL algorithms which do not consider the structural properties of the optimal policy, we propose a structure-aware learning alg...

We study the problem of user association, i.e., determining which base station (BS) a user should associate with, in a dense millimeter wave (mmWave) network. In our system model, in each time slot, a user arrives with some probability in a region with a relatively small geographical area served by a dense mmWave network. Our goal is to devise an a...

We consider a simultaneous small noise limit for a singularly perturbed coupled diffusion described by $$dX^{\varepsilon}_t = b(X^{\varepsilon}_t, Y^{\varepsilon}_t)\,dt + \varepsilon^{\alpha}\,dB_t, \qquad dY^{\varepsilon}_t = -\frac{1}{\varepsilon}\nabla_y U(X^{\varepsilon}_t, Y^{\varepsilon}_t)\,dt + \frac{s(\varepsilon)}{\sqrt{\varepsilon}}\,\dots$$

We analyze the DQN reinforcement learning algorithm as a stochastic approximation scheme using the o.d.e. (for ‘ordinary differential equation’) approach and point out certain theoretical issues. We then propose a modified scheme called Full Gradient DQN (FG-DQN, for short) that has a sound theoretical basis and compare it with the original scheme...

We study the problem of scheduling packet transmissions with the aim of minimizing the energy consumption and data transmission delay of users in a wireless network in which spatial reuse of spectrum is employed. We approach this problem using the theory of Whittle index for cost minimizing restless bandits, which has been used to effectively solve...

We derive equivalent linear and dynamic programs for infinite horizon risk-sensitive control for minimization of the asymptotic growth rate of the cumulative cost.

We argue that graph-constrained dynamic choice with reinforcement can be viewed as a scaled version of a special instance of replicator dynamics. The latter also arises as the limiting differential equation for the empirical measures of a vertex reinforced random walk on a directed graph. We use this equivalence to show that for a class of positive...

The main result in this paper is a variational formula for the exit rate from a bounded domain for a diffusion process in terms of the stationary law of the diffusion constrained to remain in this domain forever. Related results on the geometric ergodicity of the controlled Q-process are also presented.

A multiplicative relative value iteration algorithm for solving the dynamic programming equation for the risk-sensitive control problem is studied for discrete time controlled Markov chains with a compact Polish state space, and for controlled diffusions on the whole Euclidean space. The main result is a proof of convergence to the desired limit in...

This work revisits the constant stepsize stochastic approximation algorithm for tracking a slowly moving target and obtains a bound for the tracking error that is valid for the entire time axis, using the Alekseev nonlinear variation of constants formula. It is the first non-asymptotic bound for the entire time axis in the sense that it is not base...

Recent studies have demonstrated that the decisions of agents in society are shaped by their own intrinsic motivation and also by compliance with the social norm. In other words, the decision to act in a particular manner will be affected by the opinion of society. This social comparison mechanism can lead to imitation behavior, where an ag...

In this paper, we study how to shape opinions in social networks when the matrix of interactions is unknown. We consider classical opinion dynamics with some stubborn agents and the possibility of continuously influencing the opinions of a few selected agents, albeit under resource constraints. We map the opinion dynamics to a value iteration schem...

We consider the long run average or ‘ergodic’ control of a discrete time Markov process with a probabilistic constraint in terms of a bound on the exit rate from a bounded subset of the state space. This is a natural counterpart of the more common probabilistic constraints in the finite horizon control problems. Using a recent characterization by A...

We consider a novel model of stochastic replicator dynamics for potential games that converts to a Langevin equation on a sphere after a change of variables. This is distinct from the models studied earlier. In particular, it is ill-posed due to non-uniqueness of solutions, but is amenable to a natural selection principle that picks a unique soluti...

We formulate and study the infinite-dimensional linear programming problem associated with the deterministic long-run average cost control problem. Along with its dual, it allows one to characterize the optimal value of this control problem. The novelty of our approach is that we focus on the general case wherein the optimal value may depend on the...

We revisit the classical problem of identifying the source of a rumor in a network, assumed unique, treating as given the network topology and the set of rumor-infected nodes. In addition, it is assumed that some partial information about the order in which the nodes were infected is also available. Such information is commonly available to...

This is an overview of the work of the authors and their collaborators on the characterization of risk sensitive costs and rewards in terms of an abstract Collatz-Wielandt formula and in case of rewards, also a controlled version of the Donsker-Varadhan formula. For the finite state and action case, this leads to useful linear and dynamic programmi...

We address the variational formulation of the risk-sensitive reward problem for non-degenerate diffusions on $\mathbb{R}^d$ controlled through the drift. We establish a variational formula on the whole space and also show that the risk-sensitive value equals the generalized principal eigenvalue of the semilinear operator. This can be viewed as a co...

We study the well-posedness of the Bellman equation for the ergodic control problem for a controlled Markov process in $\mathbb{R}^d$ for a near-monotone cost and establish convergence results for the associated 'relative value iteration' algorithm which computes its solution recursively. In addition, we present some results concerning the stability and asy...

We formulate and study the infinite dimensional linear programming (LP) problem associated with the deterministic discrete time long-run average criterion optimal control problem. Along with its dual, this LP problem allows one to characterize the optimal value of the optimal control problem. The novelty of our approach is that we focus on the gene...

To overcome the curse of dimensionality and curse of modeling in Dynamic Programming (DP) methods for solving classical Markov Decision Process (MDP) problems, Reinforcement Learning (RL) algorithms are popular. In this paper, we consider an infinite-horizon average reward MDP problem and prove the optimality of the threshold policy under certain c...

We consider a dynamical system with finitely many equilibria and perturbed by small noise, in addition to being controlled by an 'expensive' control. We study the invariant distribution of the controlled process as the variance of the noise becomes vanishingly small. It is shown that depending on the relative magnitudes of the noise variance and th...

This article proposes a novel vector field based guidance scheme for tracking and surveillance by an aerial agent of a convoy moving along a possibly non-linear trajectory on the ground. The scheme first computes a time varying ellipse that encompasses all the targets in the convoy using a simple regression based algorithm. It then ensures converge...

Given an ODE and its perturbation, the Alekseev formula expresses the solutions of the latter in terms related to the former. By exploiting this formula and a new concentration inequality for martingale differences, we develop a novel approach for analyzing nonlinear Stochastic Approximation (SA). This approach is useful for studying an SA's behavi...

Using the expression for the unnormalized nonlinear filter for a hidden Markov model, we develop a dynamic-programming-like backward recursion for the filter. This is combined with some ideas from reinforcement learning and a conditional version of importance sampling in order to develop a scheme based on stochastic approximation for estimating the...

Viewing a two time scale stochastic approximation scheme as a noisy discretization of a singularly perturbed differential equation, we obtain a concentration bound for its iterates that captures its behavior with quantifiable high probability. This uses Alekseev's nonlinear variation of constants formula and a martingale concentration inequality an...

We introduce and study the infinite dimensional linear programming problem which along with its dual allows one to characterize the optimal value of the deterministic long-run average optimal control problem in the general case when the latter may depend on the initial conditions of the system.

**Background:** In the framework of network sampling, random walk (RW) based estimation techniques provide many pragmatic solutions while uncovering the unknown network as little as possible. Despite several theoretical advances in this area, RW based sampling techniques usually make a strong assumption that the samples are in stationary regime, and hen...

A reinforcement learning algorithm is proposed in order to solve a multi-criterion Markov decision process, i.e., an MDP with a vector running cost. Specifically, it combines a Q-learning scheme for a weighted linear combination of the prescribed running costs with an incremental version of replicator dynamics that updates the weights. The objectiv...
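
As a hedged illustration of the replicator-dynamics weight update mentioned above: the snippet below iterates the discrete replicator map with a fixed, made-up fitness vector standing in for the Q-learning feedback on each criterion, and shows the weights concentrating on the fittest entry.

```python
import numpy as np

# Discrete replicator update for the criterion weights: each weight grows
# in proportion to its 'fitness' relative to the population average.  The
# fitness vector f is held fixed here purely for illustration; in the
# actual algorithm it would be driven by the Q-learning iterates for the
# individual running costs.
f = np.array([1.0, 2.0, 1.5])
w = np.ones(3) / 3                 # start from uniform weights
for _ in range(100):
    w = w * f / (w @ f)            # replicator map; stays on the simplex
print(np.round(w, 3))              # mass concentrates on the fittest entry
```

The normalization by the average fitness `w @ f` keeps the weights on the probability simplex at every step, which is the invariance the incremental version of the update must also preserve.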

This work revisits the constant stepsize stochastic approximation algorithm for tracking a slowly moving target and obtains a bound for the tracking error that is valid for all time, using the Alekseev non-linear variation of constants formula.

In this work, we examine the problem of rumor source inference on a network whose topology is known, given infected nodes and pairwise information in the form of pairwise partial orders on the set of nodes of the underlying graph based on the order in which they were infected. We analyze the Maximum Likelihood Estimator (MLE) of the rumor source, a...

We propose a dynamic formulation of file-sharing networks in terms of an average cost Markov decision process with constraints. By analyzing a Whittle-like relaxation thereof, we propose an index policy in the spirit of Whittle and compare it by simulations with other natural heuristics.

We propose a novel vector field based guidance scheme for tracking and surveillance of a convoy moving along a possibly nonlinear trajectory on the ground, by an aerial agent. The scheme first computes a time varying ellipse that encompasses all the targets in the convoy using a simple regression based algorithm. It then ensures convergence of the...

The notion of approachability was introduced by Blackwell (Pac J Math 6(1):1–8, 1956) in the context of vector-valued repeated games. The famous ‘Blackwell’s approachability theorem’ prescribes a strategy for approachability, i.e., for ‘steering’ the average vector cost of a given agent toward a given target set, irrespective of the strategies of t...

The egalitarian processor sharing model is viewed as a restless bandit and its Whittle indexability is established. A numerical scheme for computing the Whittle indices is provided, along with supporting numerical experiments.

We propose a distributed version of a stochastic approximation scheme constrained to remain in the intersection of a finite family of convex sets. The projection to the intersection of these sets is also computed in a distributed manner and a `nonlinear gossip' mechanism is employed to blend the projection iterations with the stochastic approximati...

We consider a controlled stochastic dynamics on a connected graph with gossip-like nearest neighbor affine interactions on a faster time scale. In the limit as the time scale separation diverges, followed by a limit as the graph grows to an infinite graph, we recover a mean field dynamics.

In this paper we consider energy efficient scheduling in a multiuser setting where each user has a finite sized queue and there is a cost associated with holding packets (jobs) in each queue (modeling the delay constraints). The packets of each user need to be sent over a common channel. The channel qualities seen by the users are time-varying and...

We consider the problem of dynamically scheduling $M$ out of $N$ binary Markov chains when only noisy observations of state are available, with ergodic (equivalently, long run average) reward. By passing on to the equivalent problem of controlling the conditional distribution of state given observations and controls, it is cast as a restless bandit...

In A Relative Value Iteration Algorithm for Nondegenerate Controlled Diffusions, [SIAM J. Control Optim., 50 (2012), pp. 1886-1902], convergence of the relative value iteration for the ergodic control problem for a nondegenerate diffusion controlled through its drift was established, under the assumption of geometric ergodicity, using two methods:...

We revisit the problem of inferring the source of a rumor on a network given a snapshot of the extent of its spread. We differ from prior work in two aspects, i) we consider settings where additional relative information about the infection times of a fraction of node pairs is also available to the estimator; and ii) instead of only considering the...

The ‘value’ of infinite horizon risk-sensitive control is the principal eigenvalue of a certain positive operator. For the case of compact domain, Chang has built upon a nonlinear version of the Krein-Rutman theorem to give a ‘min-max’ characterization of this eigenvalue which may be viewed as a generalization of the classical Collatz-Wielandt form...

We propose two asynchronously distributed approaches for graph-based semi-supervised learning. The first approach is based on stochastic approximation, whereas the second approach is based on randomized Kaczmarz algorithm. In addition to the possibility of distributed implementation, both approaches can be naturally applied online to streaming data...

We propose an extremely simple window adaptation scheme for backoff in 802.11 MAC protocol. The scheme uses constant stepsize stochastic approximation to adjust collision probabilities to set values, using an approximate analytic relationship between this probability and the back-off window. A further variation of this scheme also adapts the set po...
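
The constant-stepsize adaptation can be sketched as follows, under a deliberately toy assumption: the collision probability at window $W$ is taken to be $1/(1+W)$, which is NOT the 802.11 relation used in the paper, merely a stand-in so the loop has something to drive to a set point.

```python
import numpy as np

def adapt_window(p_target=0.1, step=1.0, n_iter=100000, seed=1):
    """Constant-stepsize SA sketch for backoff-window adaptation.

    ASSUMPTION: the 'channel' is a toy model in which the collision
    probability given window W is 1/(1+W); the real 802.11 relationship
    is different.  Each step observes a Bernoulli collision indicator
    and nudges W so the observed collision rate matches p_target.
    """
    rng = np.random.default_rng(seed)
    W = 4.0
    trace = []
    for _ in range(n_iter):
        collided = rng.random() < 1.0 / (1.0 + W)  # noisy observation
        W += step * (float(collided) - p_target)   # drive p(W) -> p_target
        W = max(W, 1.0)
        trace.append(W)
    return float(np.mean(trace[n_iter // 2:]))

print(adapt_window())  # hovers near 9.0, the root of 1/(1+W) = 0.1
```

Because the step size is constant, the window does not converge but fluctuates around the set point, which is exactly what allows the scheme to keep tracking a drifting operating point.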

We consider the task of scheduling a crawler to retrieve from several sites their ephemeral content. This is content, such as news or posts at social network groups, for which a user typically loses interest after some days or hours. Thus development of a timely crawling policy for ephemeral information sources is very important. We first formulate...

We propose an iterative and distributed Markov Chain Monte Carlo scheme for estimation of effective edge conductances in a graph. A sample complexity analysis is provided. The theoretical guarantees on the performance of the proposed algorithm are weak compared to those of existing algorithms. But numerical experiments suggest that the algorithm mi...

Consider a finite irreducible Markov chain with invariant probability $\pi$. Define its inverse communication speed as the expected time to go from $x$ to $y$, when $x, y$ are sampled independently according to $\pi$. In the discrete time setting and when $\pi$ is the uniform distribution $\upsilon$, Litvak and Ejov have shown that the permutation matrices...
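
The quantity above can be computed directly for a small chain. The sketch below solves the standard linear system for expected hitting times $E_x[T_y]$ and averages them over independent $\pi$-samples of $x$ and $y$; the two-state chain is an invented example for illustration.

```python
import numpy as np

def hitting_times(P, y):
    """Expected hitting times E_x[T_y] for a finite Markov chain with
    transition matrix P: solve (I - P restricted to states != y) h = 1."""
    n = P.shape[0]
    S = [i for i in range(n) if i != y]
    A = np.eye(n - 1) - P[np.ix_(S, S)]
    h = np.linalg.solve(A, np.ones(n - 1))
    out = np.zeros(n)
    out[S] = h                      # E_y[T_y] = 0 by convention here
    return out

def inverse_speed(P, pi):
    """sum_{x,y} pi(x) pi(y) E_x[T_y]: the 'inverse communication speed'."""
    n = P.shape[0]
    H = np.column_stack([hitting_times(P, y) for y in range(n)])
    return float(pi @ H @ pi)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # toy two-state chain
pi = np.array([2/3, 1/3])           # its invariant distribution
print(inverse_speed(P, pi))         # 10/3 for this chain
```

For this chain $E_0[T_1] = 1/0.1 = 10$ and $E_1[T_0] = 1/0.2 = 5$, so the $\pi$-average is $\frac{2}{9}\cdot 10 + \frac{2}{9}\cdot 5 = \frac{10}{3}$, matching the printed value.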

We consider the problem of inferring the source of a rumor in a given large network. We assume that the rumor propagates in the network through a discrete time susceptible-infected model. Input to our problem includes information regarding the entire network, an infected subgraph of the network observed at some known time instant, and the probabili...

Function estimation on Online Social Networks (OSN) is an important field of study in complex network analysis. An efficient way to do function estimation on large networks is to use random walks. We can then defer to the extensive theory of Markov chains to do error analysis of these estimators. In this work we compare two existing techniques, Met...
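
A minimal sketch of the random-walk approach on an invented four-node graph: a Metropolis-Hastings walk whose stationary law is uniform over nodes (accept a proposed neighbor with probability $\min(1, d_x/d_y)$), used to estimate the average of a node function without enumerating the graph.

```python
import random

def mh_uniform_walk(adj, f, n_steps=200000, seed=3):
    """Metropolis-Hastings random walk with uniform stationary law over
    the nodes of a graph, used to estimate the average of f.

    Proposal: uniform neighbor.  Acceptance min(1, deg(x)/deg(y))
    corrects the degree bias of the simple random walk."""
    rng = random.Random(seed)
    x = 0
    total = 0.0
    for _ in range(n_steps):
        y = rng.choice(adj[x])
        if rng.random() < min(1.0, len(adj[x]) / len(adj[y])):
            x = y
        total += f[x]
    return total / n_steps

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # toy path graph
f = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
print(mh_uniform_walk(adj, f))                   # near 1.5, uniform average
```

The same loop with the acceptance test removed would instead converge to the degree-weighted average, which is the bias the respondent-driven-sampling-style estimators correct by reweighting.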

Several estimation techniques assume validity of Gaussian approximations for estimation purposes. Interestingly, these ensemble methods have proven to work very well for high-dimensional data even when the distributions involved are not necessarily Gaussian. We attempt to bridge the gap between this oft-used computational assumption and the theoret...

We consider a Robbins-Monro type iteration wherein noisy measurements are event-driven and therefore arrive asynchronously. We propose a modification of step-sizes that ensures desired asymptotic behaviour regardless of this aspect. This generalizes earlier results on asynchronous stochastic approximation wherein the asynchronous behaviour is acros...

We revisit the problem of inferring the overall ranking among entities in the framework of Bradley-Terry-Luce (BTL) model, based on available empirical data on pairwise preferences. By a simple transformation, we can cast the problem as that of solving a noisy linear system, for which a ready algorithm is available in the form of the randomized Kac...
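
The randomized Kaczmarz step itself is easy to state: project the current iterate onto the hyperplane of a randomly chosen row, sampling rows with probability proportional to their squared norm. The sketch below applies it to an invented consistent system whose rows encode pairwise skill differences $s_i - s_j$, in the spirit of the BTL reduction (the paper's actual system and noise model are not reproduced here).

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iter=500, seed=0):
    """Randomized Kaczmarz for Ax = b: at each step project onto the
    hyperplane of one row, chosen with probability ~ ||row||^2."""
    rng = np.random.default_rng(seed)
    norms2 = (A * A).sum(axis=1)
    probs = norms2 / norms2.sum()
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        i = rng.choice(len(b), p=probs)
        x += (b[i] - A[i] @ x) / norms2[i] * A[i]   # projection step
    return x

# Toy 'ranking' system: each row encodes a pairwise difference s_i - s_j
# for true skills s = (0, 1, 2).
A = np.array([[1.0, -1.0,  0.0],
              [0.0,  1.0, -1.0],
              [1.0,  0.0, -1.0]])
b = np.array([-1.0, -1.0, -2.0])
print(np.round(randomized_kaczmarz(A, b), 4))  # near (-1, 0, 1)
```

Since skills are only identified up to an additive constant, starting from zero keeps the iterates orthogonal to the all-ones vector, and the method returns the centered (minimum-norm) solution.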

We develop two new online actor-critic control algorithms with adaptive feature tuning for Markov Decision Processes (MDPs). One of our algorithms is proposed for the long-run average cost objective, while the other works for discounted cost MDPs. Our actor-critic architecture incorporates parameterization both in the policy and the value function....

We consider a gossip-based distributed stochastic approximation scheme wherein processors situated at the nodes of a connected graph perform stochastic approximation algorithms, modified further by an additive interaction term equal to a weighted average of iterates at neighboring nodes along the lines of "gossip" algorithms. We allow these averagi...


This paper aims at achieving a "good" estimator for the gradient of a function on a high-dimensional space. Often such functions are not sensitive in all coordinates and the gradient of the function is almost sparse. We propose a method for gradient estimation that combines ideas from Spall's Simultaneous Perturbation Stochastic Approximation with...
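
The SPSA ingredient named above can be sketched as follows: each estimate perturbs all coordinates at once with a random $\pm 1$ vector, so only two function evaluations are needed per sample regardless of dimension. The test function and point are made up for this illustration; the sparsity-exploiting part of the paper's method is not reproduced.

```python
import numpy as np

def spsa_gradient(f, x, c=1e-3, n_samples=20000, seed=7):
    """Average of SPSA two-point gradient estimates of f at x.

    Each sample draws a random +/-1 perturbation vector delta and uses
    the symmetric difference (f(x + c*delta) - f(x - c*delta)) divided
    coordinate-wise by 2*c*delta as an (unbiased, for smooth f up to
    O(c^2)) gradient estimate."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(x)
    for _ in range(n_samples):
        delta = rng.choice([-1.0, 1.0], size=x.shape)
        diff = f(x + c * delta) - f(x - c * delta)
        g += diff / (2.0 * c * delta)
    return g / n_samples

f = lambda z: float(z @ z)               # gradient is 2z
x = np.array([1.0, 0.0, 0.0, 2.0, 0.0]) # nearly sparse, as in the abstract
print(np.round(spsa_gradient(f, x), 2)) # near (2, 0, 0, 4, 0)
```

For a nearly sparse gradient, most of the averaged cross-terms vanish in expectation, which is the structure a sparsity-aware refinement of SPSA can exploit.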