
We propose new multi-objective reinforcement learning algorithms that aim to find a globally Pareto-optimal deterministic policy that uniformly (in all states) maximizes a reward subject to a uniform probabilistic constraint over reaching forbidden states of a Markov decision process. Our requirements arise naturally in the context of safety-critical systems, but pose a significant unmet challenge. This class of learning problem is known to be hard and there are no off-the-shelf solutions that fully address the combined requirements of determinism and uniform optimality. Having formalized our requirements and highlighted the specific challenge of learning instability, using a simple counterexample, we define from first principles a stable Bellman operator that we prove partially respects our requirements. This operator is therefore a partial solution to our problem, but produces conservative policies in comparison to our previous approach, which was not designed to satisfy the same requirements. We thus propose a relaxation of the stable operator, using adaptive hysteresis, that forms the basis of a heuristic approach that is stable w.r.t. our counterexample and learns policies that are less conservative than those of the stable operator and our previous algorithm. In comparison to our previous approach, the policies of our adaptive hysteresis algorithm demonstrate improved monotonicity with increasing constraint probabilities, which is one of the characteristics we desire. We demonstrate that adaptive hysteresis works well with dynamic programming and reinforcement learning, and can be adapted to function approximation.

Next generation industrial plants will feature mobile robots (e.g., autonomous forklifts) moving side by side with humans. In these scenarios, robots must not only maximize efficiency, but must also mitigate risks. In this paper we study the problem of risk-aware path planning, i.e., the problem of computing shortest paths in stochastic environments while ensuring that average risk is bounded. Our method is based on the framework of constrained Markov Decision Processes (CMDP). To counterbalance the intrinsic computational complexity of CMDPs, we propose a hierarchical method that is suboptimal but obtains significant speedups. Simulation results in factory-like environments illustrate how the hierarchical method compares with the non-hierarchical one.

Human movement differs from robot control because of its flexibility in unknown environments, robustness to perturbation, and tolerance of unknown parameters and unpredictable variability. We propose a new theory, risk-aware control, in which movement is governed by estimates of risk based on uncertainty about the current state and knowledge of the cost of errors. We demonstrate the existence of a feedback control law that implements risk-aware control and show that this control law can be directly implemented by populations of spiking neurons. Simulated examples of risk-aware control for time-varying cost functions as well as learning of unknown dynamics in a stochastic risky environment are provided.

We derive a family of risk-sensitive reinforcement learning methods for agents who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.
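The core idea above is to apply a utility function to the TD error itself rather than to the return. A minimal Python sketch of one such update is shown below; the piecewise-linear utility, the parameter `kappa`, and all function names are our illustrative assumptions, not the paper's exact construction:

```python
def utility(td, kappa=0.5):
    # Piecewise-linear utility applied to the TD error: negative errors
    # (losses) are weighted more heavily than positive ones (gains),
    # echoing prospect-theory-style risk sensitivity. The linear form
    # and kappa = 0.5 are illustrative choices.
    return (1 - kappa) * td if td >= 0 else (1 + kappa) * td

def risk_sensitive_q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    # Standard Q-learning target, but the update step is driven by the
    # transformed error u(TD) instead of the raw TD error.
    td = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * utility(td)
    return Q
```

With `kappa > 0`, repeated negative surprises depress Q-values faster than positive surprises raise them, so the greedy policy becomes risk-averse.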

Predicting turn and stop maneuvers of potentially errant drivers is a basic requirement for advanced driver assistance systems for urban intersections. Previous work has shown that an early estimate of the driver's intent can be inferred by evaluating the vehicle's speed during the intersection approach. In the presence of a preceding vehicle, however, the velocity profile might be dictated by car-following behavior rather than by the need to slow down before doing a left or right turn. To infer the driver's intent under such circumstances, a simple, real-time capable approach using a parametric model to represent both car-following and turning behavior is proposed. The performance of two alternative parameterizations based on observations at an individual intersection and a generic curvature-based model is evaluated in combination with two different Bayes net classification algorithms. In addition, the driver model is shown to be capable of predicting the future trajectory of the vehicle.

A prototype of a longitudinal driving-assistance system, which is adaptive to driver behavior, is developed. Its functions include adaptive cruise control and forward collision warning/avoidance. The research data came from driver car-following tests in real traffic environments. Based on the data analysis, a driver model imitating the driver's operation is established to generate the desired throttle depression and braking pressure. Algorithms for collision warning and automatic braking activation are designed based on the driver's pedal deflection timing during approach (gap closing). A self-learning algorithm for driver characteristics is proposed based on the recursive least-square method with a forgetting factor. Using this algorithm, the parameters of the driver model can be identified from the data in the manual operation phase, and the identification result is applied during the automatic control phase in real time. A test bed with an electronic throttle and an electrohydraulic brake actuator is developed for system validation. The experimental results show that the self-learning algorithm is effective and that the system can, to some extent, adapt to individual characteristics.
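The self-learning step above rests on recursive least squares with a forgetting factor. A plain-Python sketch of one RLS update follows; the paper's exact driver-model regressors are not specified here, so the regressor/output interface and all names are our assumptions:

```python
def rls_update(theta, P, x, y, lam=0.98):
    """One recursive least-squares step with forgetting factor lam.

    theta: current parameter estimate (list), P: covariance matrix
    (list of lists, assumed symmetric), x: regressor vector,
    y: observed output. Returns updated (theta, P).
    """
    n = len(x)
    # P x (for symmetric P this also serves as x^T P)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    k = [v / denom for v in Px]                      # gain vector
    err = y - sum(theta[i] * x[i] for i in range(n)) # prediction error
    theta = [theta[i] + k[i] * err for i in range(n)]
    P = [[(P[i][j] - k[i] * Px[j]) / lam for j in range(n)] for i in range(n)]
    return theta, P
```

The forgetting factor `lam < 1` discounts old observations geometrically, which is what lets the identified parameters track a driver whose behavior drifts over time.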

Ph.D. thesis, King's College, Cambridge, 1989 (photocopy supplied by the British Library).

In this paper, we examine the problem of throughput maximization in an energy-harvesting two-hop amplify-and-forward relay network. This problem is investigated over a finite time horizon and in an online setting, where the causal knowledge of the harvested energy and that of fading are available. We use a Markov decision process (MDP) formulation to present a mathematically tractable solution to the throughput maximization problem. In this solution, the optimal power-use policy is obtained using the backward induction algorithm of the corresponding discrete dynamic programming problem. We also present properties of the optimal policy for an important special case, where the power control at transmitters is limited to on-off switching. These properties facilitate the implementation of the MDP based solution. Our numerical simulations show that the proposed method outperforms existing solutions to this problem.
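Backward induction over a finite horizon, as used above, can be sketched generically in a few lines; the state/action interface, function names, and the toy reward structure below are our assumptions, not the paper's model:

```python
def backward_induction(T, states, actions, reward, p):
    """Finite-horizon dynamic programming.

    Computes V[t][s] = max_a reward(s, a) + sum_{s'} p(s', s, a) * V[t+1][s']
    by sweeping backward from the terminal stage (V[T] = 0).
    reward(s, a) and transition kernel p(s2, s, a) are user-supplied.
    """
    V = [{s: 0.0 for s in states} for _ in range(T + 1)]
    policy = [dict() for _ in range(T)]
    for t in range(T - 1, -1, -1):
        for s in states:
            best_a, best_v = None, float('-inf')
            for a in actions:
                v = reward(s, a) + sum(p(s2, s, a) * V[t + 1][s2] for s2 in states)
                if v > best_v:
                    best_a, best_v = a, v
            V[t][s] = best_v
            policy[t][s] = best_a
    return V, policy
```

For an on-off power-control special case, `actions` would reduce to two elements, which is exactly what makes the structural properties mentioned in the abstract useful for implementation.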

In this paper, we propose a robust dynamic point clustering method for detecting moving objects in stereo image sequences, which is essential for collision detection in driver assistance systems. If multiple objects with similar motions are located in close proximity, dynamic points from different moving objects may be clustered together when using the position and velocity as clustering criteria. To solve this problem, we apply a geometric constraint between dynamic points using line segments. Based on this constraint, we propose a variable K-nearest neighbor clustering method and three cost functions that are defined between line segments and points. The proposed method is verified experimentally in terms of its accuracy, and comparisons are also made with conventional methods that only utilize the positions and velocities of dynamic points.

Precrash systems have the potential for preventing or mitigating the results of an accident. However, optimal precrash activation can only be achieved by a driver-individual parameterization of the activation function. In this paper, an adaptation model is proposed, which calculates a driver-adapted activation threshold for the considered precrash algorithm. The model analyzes past situations to calculate a driver-individual activation threshold that achieves a desired activation frequency. The advantage of the proposed model is that the distribution is estimated using a distribution model. This has the result that an activation threshold can already be determined using a small data set. In addition, the confidence interval that has to be considered is decreased. The proposed model was applied in a study with test subjects. Results of this paper confirm the usability of the model. In comparison with an empirical approach, the proposed model achieves a significantly lower threshold and, thus, a higher safety effect of the system.
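The threshold-from-distribution idea above can be sketched compactly: fit a parametric distribution to past criticality values, then pick the threshold whose exceedance probability matches the desired activation frequency. The choice of a normal distribution and all names here are our assumptions; the paper only states that a distribution model is used:

```python
from statistics import NormalDist

def adapted_threshold(criticality_samples, target_rate):
    """Fit a normal distribution to past criticality values and return the
    threshold exceeded with probability target_rate, i.e. the
    (1 - target_rate) quantile of the fitted model. Works from a small
    sample because only two parameters are estimated."""
    n = len(criticality_samples)
    mu = sum(criticality_samples) / n
    var = sum((x - mu) ** 2 for x in criticality_samples) / (n - 1)
    dist = NormalDist(mu, var ** 0.5)
    return dist.inv_cdf(1 - target_rate)
```

Lowering `target_rate` raises the threshold, so the activation frequency can be tuned directly instead of being read off an empirical histogram.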

This paper proposes distributed multiuser multiband spectrum sensing policies for cognitive radio networks based on multiagent reinforcement learning. The spectrum sensing problem is formulated as a partially observable stochastic game and multiagent reinforcement learning is employed to find a solution. In the proposed reinforcement learning based sensing policies the secondary users (SUs) collaborate to improve the sensing reliability and to distribute the sensing tasks among the network nodes. The SU collaboration is carried out through local interactions in which the SUs share their local test statistics or decisions as well as information on the frequency bands sensed with their neighbors. As a result, a map of spectrum occupancy in a local neighborhood is created. The goal of the proposed sensing policies is to maximize the amount of free spectrum found given a constraint on the probability of missed detection. This is addressed by obtaining a balance between sensing more spectrum and the reliability of sensing results. Simulation results show that the proposed sensing policies provide an efficient way to find available spectrum in multiuser multiband cognitive radio scenarios.

Most reinforcement learning algorithms optimize the expected return of a Markov Decision Problem. Practice has taught us the lesson that this criterion is not always the most suitable because many applications require robust control strategies which also take into account the variance of the return. Classical control literature provides several techniques to deal with risk-sensitive optimization goals like the so-called worst-case optimality criterion exclusively focusing on risk-avoiding policies or classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm. Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy. Instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, we avoid its serious drawbacks for learning. Based on an extended set of optimality equations we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms which converge with probability one under the usual conditions.

We propose for risk-sensitive control of finite Markov chains a counterpart of the popular Q-learning algorithm for classical Markov decision processes. The algorithm is shown to converge with probability one to the desired solution. The proof technique is an adaptation of the o.d.e. approach for the analysis of stochastic approximation algorithms, with most of the work involved used for the analysis of the specific o.d.e.s that arise.

This paper is concerned with dynamic quantizer design for state estimation of hidden Markov models (HMM) using multiple sensors under a sum power constraint at the sensor transmitters. The sensor nodes communicate with a fusion center over temporally correlated flat fading channels modelled by finite state Markov chains. Motivated by energy limitations in sensor nodes, we develop optimal quantizers by minimizing the long term average of the mean square estimation error with a constraint on the long term average of total transmission power across the sensors. Instead of introducing a cost function as a weighted sum of our two objectives, we propose a constrained Markov decision formulation as an average cost problem and employ a linear programming technique to obtain the optimal policy for the constrained problem. Our experimental results assert that the constrained approach is quite efficient in terms of computational complexity and memory requirements for our average cost problem and leads to the same optimal deterministic policies and optimal cost as the unconstrained approach under an irreducibility assumption on the underlying Markov chain and some mild regularity assumptions on the sensor measurement noise processes. We illustrate via numerical studies the performance results for the dynamic quantization scheme. We also study the effect of varying degrees of channel and measurement noise on the performance of the proposed scheme.

We consider the problem of energy-efficient point-to-point transmission of delay-sensitive data (e.g., multimedia data) over a fading channel. We propose a rigorous and unified framework for simultaneously utilizing both physical-layer and system-level techniques to minimize energy consumption, under delay constraints, in the presence of stochastic and unknown traffic and channel conditions. We formulate the problem as a Markov decision process and solve it online using reinforcement learning. The advantages of the proposed online method are that i) it does not require a priori knowledge of the traffic arrival and channel statistics to determine the jointly optimal physical-layer and system-level power management strategies; ii) it exploits partial information about the system so that less information needs to be learned than when using conventional reinforcement learning algorithms; and iii) it obviates the need for action exploration, which severely limits the adaptation speed and run-time performance of conventional reinforcement learning algorithms.

In this paper, we consider Markov Decision Processes (MDPs) with error states, i.e., states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
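The weighting-and-adaptation heuristic described above can be sketched in a few lines. The abstract specifies only that value and risk are weighted and that the weight is adapted toward feasibility; the concrete combination and update rule below are our assumptions:

```python
def combined_value(Q_val, Q_risk, xi):
    # Act greedily on value minus xi times risk, where Q_risk estimates
    # the probability of eventually entering an error state from (s, a).
    return {sa: Q_val[sa] - xi * Q_risk[sa] for sa in Q_val}

def adapt_weight(xi, estimated_risk, omega, step=0.1):
    # Heuristic adaptation (our assumption, not the paper's exact rule):
    # raise the risk weight while the current policy's estimated risk
    # exceeds the user-specified threshold omega, lower it otherwise.
    return max(0.0, xi + step * (estimated_risk - omega))
```

Interleaving these two steps with ordinary Q-learning drives the policy toward the least conservative behavior that still satisfies the risk constraint.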

We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision-making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.
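The differential Sharpe ratio mentioned above is an online, differentiable surrogate for the Sharpe ratio, maintained from exponential moving estimates of the first and second moments of returns. A sketch of one update step, following the published recursion (variable names are ours):

```python
def differential_sharpe(R, A, B, eta=0.01):
    """One step of the differential Sharpe ratio.

    R: current period's trading return; A, B: exponential moving estimates
    of the first and second moments of past returns (B - A*A must be
    positive for the ratio to be defined). Returns (D, A_new, B_new),
    where D is the differential Sharpe ratio used as the instantaneous
    reward in direct reinforcement.
    """
    dA = R - A
    dB = R * R - B
    D = (B * dA - 0.5 * A * dB) / (B - A * A) ** 1.5
    return D, A + eta * dA, B + eta * dB
```

Because `D` is a simple differentiable function of the latest return, it can be backpropagated through a recurrent trading policy, which is what removes the need for a separate value-function estimate.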

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911) The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have witnessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algorithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome. The book is divided into three parts. In the first part, the authors introduce and elaborate on the essential characteristics of the reinforcement learning problem, namely, the problem of learning "policies" or mappings from environmental states to actions so as to maximize the amount of "reward"

The portfolio management for trading in the stock market poses a challenging stochastic control problem of significant commercial interest to the finance industry. To date, many researchers have proposed various methods to build an intelligent portfolio management system that can recommend financial decisions for daily stock trading. Many promising results have been reported from the supervised learning community on the possibility of building a profitable trading system. More recently, several studies have shown that even the problem of integrating stock price prediction results with trading strategies can be successfully addressed by applying reinforcement learning algorithms. Motivated by this, we present a new stock trading framework that attempts to further enhance the performance of reinforcement learning-based systems. The proposed approach incorporates multiple Q-learning agents, allowing them to effectively divide and conquer the stock trading problem by defining necessary roles for cooperatively carrying out stock pricing and selection decisions. Furthermore, in an attempt to address the complexity issue when considering a large amount of data to obtain long-term dependence among the stock prices, we present a representation scheme that can succinctly summarize the history of price changes. Experimental results on the Korean stock market show that the proposed trading framework outperforms those trained by other alternative approaches both in terms of profit and risk management.

This paper presents novel Q-learning based stochastic control algorithms for rate and power control in V-BLAST transmission systems. The algorithms exploit the supermodularity and monotonic structure results derived in the companion paper. The rate and power control problem is posed as a stochastic optimization problem with the goal of minimizing the average transmission power under a constraint on the average delay that can be interpreted as the quality of service requirement of a given application. The standard Q-learning algorithm is modified to handle the constraints so that it can adaptively learn the structured optimal policy for unknown channel/traffic statistics. We discuss the convergence of the proposed algorithms and explore their properties in simulations. To address the issue of unknown transmission costs in an unknown time-varying environment, we propose a variant of the Q-learning algorithm in which power costs are estimated in an online fashion, and we show that this algorithm converges to the optimal solution as long as the power cost estimates are asymptotically unbiased.

This report presents a unified approach for the study of constrained Markov decision processes with a countable state space and unbounded costs. We consider a single controller having several objectives; it is desirable to design a controller that minimizes one cost objective, subject to inequality constraints on the other cost objectives. The objectives that we study are both the expected average cost, as well as the expected total cost (of which the discounted cost is a special case). We provide two frameworks: the case where costs are bounded below, as well as the contracting framework. We characterize the set of achievable expected occupation measures as well as performance vectors. This allows us to reduce the original dynamic control problem to an infinite linear program. We present a Lagrangian approach that enables us to obtain sensitivity analysis. In particular, we obtain asymptotic results for the constrained control problem: convergence of both the value and the pol...
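For a finite state and action space, the occupation-measure reduction above becomes an ordinary linear program. The sketch below assembles its data (objective, flow-balance and normalization equalities, and the constraint-cost inequality) for the average-cost case; the interface and all names are our assumptions, and any LP solver can consume the result:

```python
def cmdp_lp_data(states, actions, P, c, d, bound):
    """Build the LP behind the occupation-measure reduction of an
    average-cost constrained MDP.

    Variables are rho(s, a) >= 0, one per state-action pair. We minimize
    sum rho*c subject to (i) flow balance for each state s':
    sum_a rho(s', a) - sum_{s,a} rho(s, a) P(s'|s, a) = 0,
    (ii) normalization sum rho = 1, and (iii) sum rho*d <= bound.
    P maps (s, a) to a dict {s2: prob}; c and d map (s, a) to costs.
    Returns (objective, eq_rows, eq_rhs, ineq_row, ineq_rhs).
    """
    pairs = [(s, a) for s in states for a in actions]
    obj = [c[sa] for sa in pairs]
    eq_rows, eq_rhs = [], []
    for s2 in states:  # flow-balance row per state
        row = [(1.0 if sa[0] == s2 else 0.0) - P[sa].get(s2, 0.0) for sa in pairs]
        eq_rows.append(row)
        eq_rhs.append(0.0)
    eq_rows.append([1.0] * len(pairs))  # normalization
    eq_rhs.append(1.0)
    ineq_row = [d[sa] for sa in pairs]
    return obj, eq_rows, eq_rhs, ineq_row, bound
```

An optimal basic solution of this LP yields a (possibly randomized) stationary policy via rho(s, a) / sum_a rho(s, a), which is the sense in which the dynamic problem reduces to linear programming.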