Comparing the proposed algorithm with the J-based heuristic for the same scenario as Fig. 5b (parameter u ∈ [4, 10]). Proposed virtual is solid blue; proposed actual is dashed red; J-based is dashed yellow.

Source publication
Preprint
This paper presents an online method that learns optimal decisions for a discrete time Markov decision problem with an opportunistic structure. The state at time $t$ is a pair $(S(t),W(t))$ where $S(t)$ takes values in a finite set $\mathcal{S}$ of basic states, and $\{W(t)\}_{t=0}^{\infty}$ is an i.i.d. sequence of random vectors that affect the s...
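As a hedged illustration of the state structure the abstract describes, the snippet below generates a short trajectory of pairs (S(t), W(t)), with S(t) drawn from a finite set and W(t) an i.i.d. random vector. The transition kernel and the distribution of W are invented placeholders; the abstract is truncated, so exactly how W enters the dynamics is not shown here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal sketch of the state structure from the abstract:
# S(t) takes values in a finite set; W(t) is an i.i.d. random vector.
# The Markov kernel P and the law of W below are hypothetical placeholders.
S_set = [0, 1, 2]
P = rng.dirichlet(np.ones(3), size=3)   # hypothetical transition kernel over S

def sample_W():
    return rng.normal(size=2)           # i.i.d. vector W(t) (assumption)

s = 0
trajectory = []
for t in range(5):
    w = sample_W()                      # observe the opportunistic component
    trajectory.append((s, w))
    s = rng.choice(S_set, p=P[s])       # basic state evolves as a Markov chain

print(len(trajectory))  # 5
```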

Contexts in source publication

Context 1
... to (63), which is itself a discounted approximation to the infinite horizon time average problem of interest. Nevertheless, the iteration (64) resembles a classic Robbins-Monro stochastic approximation (see [41]) and can likely be analyzed using such techniques. Further, this value-function-based method (called the J-based method in Fig. 9) simulates remarkably well for the robot problem with no additional time average constraints (that is, no average power constraint). Indeed, using γ = 1/1000 and duplicating the scenario of Fig. 5b, we simulate this J-based method and compare to the actual and virtual rewards of the proposed algorithm (where actual rewards use the ...
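The iteration (64) itself is not reproduced in this excerpt. As a hedged illustration of the flavor of method described, the sketch below runs a generic Robbins-Monro-style stochastic approximation of a value function J on a toy discounted problem: each slot, an action is chosen greedily against the current estimate, and the visited state's value is nudged toward the observed one-step return with a decaying step size. The state space, rewards, transition kernel, and step-size schedule are all hypothetical placeholders, not the paper's robot scenario.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem; not the paper's iteration (64).
n_states, n_actions = 4, 3
gamma = 1 / 1000          # discount-like parameter, as in the excerpt

# Hypothetical transition kernel P[a, s] = distribution of next state.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))

def reward(s, w, a):
    # w is the i.i.d. opportunistic component scaling the reward (assumption)
    return w * (1.0 + 0.5 * a) / (1.0 + s)

J = np.zeros(n_states)    # value-function estimate (the "J-based" object)
s = 0
for t in range(1, 20001):
    w = rng.exponential(1.0)                      # i.i.d. W(t)
    # Greedy action against the current estimate J
    q = [reward(s, w, a) + (1 - gamma) * P[a, s] @ J for a in range(n_actions)]
    a = int(np.argmax(q))
    # Robbins-Monro step: decaying step size, noisy fixed-point update
    eta = 1.0 / t**0.6
    J[s] = (1 - eta) * J[s] + eta * q[a]
    s = rng.choice(n_states, p=P[a, s])

print(np.round(J, 2))
```

The decaying step size eta_t satisfies the usual Robbins-Monro conditions (sum eta_t diverges, sum eta_t^2 converges), which is what makes analysis by such techniques plausible.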
Context 2
... for the robot problem with no additional time average constraints (that is, no average power constraint). Indeed, using γ = 1/1000 and duplicating the scenario of Fig. 5b, we simulate this J-based method and compare to the actual and virtual rewards of the proposed algorithm (where actual rewards use the redirect mode). Fig. 9 shows three curves that look very similar, all appearing to reach near optimality (where optimality is defined by Theorems 1 and 2). Of course, this J-based heuristic cannot handle extended problems with time average inequality constraints, while our proposed algorithm handles these easily. Proposed virtual is solid blue; ...
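The excerpt notes that the proposed algorithm handles time-average inequality constraints, which the J-based heuristic cannot. One standard way to enforce such a constraint in this literature is a virtual queue combined with a drift-plus-penalty action rule. The sketch below is a hedged illustration on an invented one-slot power-allocation problem; the constraint level p_bar, the tradeoff parameter V, the power levels, and the reward model are all assumptions, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Virtual-queue sketch for a time-average power constraint: average p(t) <= p_bar.
# Queue update: Q(t+1) = max(Q(t) + p(t) - p_bar, 0); if Q(t)/t -> 0, the
# time-average constraint is met. Each slot pick the action maximizing
# V*reward - Q*power (drift-plus-penalty).
p_bar, V = 0.5, 50.0
powers = np.array([0.0, 0.4, 1.0])      # hypothetical power levels
Q = 0.0
rewards_hist, power_hist = [], []
for t in range(20000):
    w = rng.uniform()                    # i.i.d. opportunistic state
    r = w * np.sqrt(powers)              # hypothetical concave reward in power
    a = int(np.argmax(V * r - Q * powers))
    Q = max(Q + powers[a] - p_bar, 0.0)  # virtual queue absorbs excess power
    rewards_hist.append(r[a])
    power_hist.append(powers[a])

print(round(np.mean(power_hist), 3))     # time-average power, near or below p_bar
```

Larger V favors reward over constraint slack, at the cost of a larger virtual queue backlog; this is the usual O(1/V) optimality-gap versus O(V) queue-size tradeoff.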