Figure A3: LSTM memory cell without peepholes. z is the vector of cell input activations, i is the vector of input gate activations, f is the vector of forget gate activations, c is the vector of memory cell states, o is the vector of output gate activations, and y is the vector of cell output activations. The activation functions are g for the cell input, h for the cell state, and σ for the gates. Data flow is either "feed-forward" without delay or "recurrent" with a one-step delay. "Input" connections are from the external input to the LSTM network, while "recurrent" connections take inputs from other memory cells and hidden units of the LSTM network with a delay of one time step. 
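For reference, the forward pass of this peephole-free LSTM cell can be written with the symbols defined in the caption. This is a sketch of the standard vanilla LSTM update; the input weight matrices W, recurrent weight matrices R, and bias vectors b are assumptions not named in the figure, x_t denotes the external input at time step t, and ⊙ is the elementwise product:

z_t = g(W_z x_t + R_z y_{t-1} + b_z)   (cell input)
i_t = σ(W_i x_t + R_i y_{t-1} + b_i)   (input gate)
f_t = σ(W_f x_t + R_f y_{t-1} + b_f)   (forget gate)
c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}        (cell state)
o_t = σ(W_o x_t + R_o y_{t-1} + b_o)   (output gate)
y_t = o_t ⊙ h(c_t)                     (cell output)

The one-step delay on y_{t-1} and c_{t-1} corresponds to the "recurrent" connections in the figure, while the W x_t terms correspond to the "input" connections from the external input.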

Source publication
Preprint
Full-text available
We propose a novel reinforcement learning approach for finite Markov decision processes (MDPs) with delayed rewards. In this work, we prove that biases of temporal difference (TD) estimates are corrected only exponentially slowly in the number of delay steps. Furthermore, we prove that variances of Monte Carlo (MC) estimates increase the variance of...
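The two effects stated here can be illustrated on a toy chain MDP whose reward arrives only after n delay steps. The sketch below is an illustrative assumption, not the paper's construction or method: the chain layout, noise level, and hyperparameters are chosen purely for demonstration. TD(0) with zero-initialized values propagates reward information backward roughly one state per episode, so its estimate at the start state stays biased for long delays, while the variance of MC returns grows with the delay because each step's reward noise accumulates into the return.

import numpy as np

def episode_rewards(n, noise_std, rng):
    # Deterministic chain s0 -> s1 -> ... -> sn; zero-mean noise on every
    # step's reward, plus a delayed reward of +1 on the final transition.
    r = rng.normal(0.0, noise_std, size=n)
    r[-1] += 1.0
    return r

def td0_start_estimate(n, episodes, alpha, noise_std, rng):
    # TD(0) with gamma = 1 and values initialized to zero (V[n] is terminal).
    V = np.zeros(n + 1)
    for _ in range(episodes):
        r = episode_rewards(n, noise_std, rng)
        for s in range(n):
            V[s] += alpha * (r[s] + V[s + 1] - V[s])
    return V[0]  # true value of the start state is 1.0

def mc_returns(n, episodes, noise_std, rng):
    # One MC return per episode: the sum of all rewards from the start state.
    return np.array([episode_rewards(n, noise_std, rng).sum()
                     for _ in range(episodes)])

rng = np.random.default_rng(0)
for n in (5, 20, 50):
    v0 = td0_start_estimate(n, episodes=200, alpha=0.1, noise_std=0.5, rng=rng)
    g = mc_returns(n, episodes=200, noise_std=0.5, rng=rng)
    print(f"delay n={n:2d}: TD V(s0)={v0:5.2f} (true 1.00), "
          f"MC return variance={g.var():5.2f}")

With a fixed budget of episodes, the TD estimate at the start state falls increasingly short of the true value 1.0 as n grows (slow bias correction), while the per-return variance of MC grows roughly linearly in n, at about n times the per-step noise variance.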

Similar publications

Preprint
Full-text available
Decision-making for engineering systems can be efficiently formulated as a Markov Decision Process (MDP) or a Partially Observable MDP (POMDP). Typical MDP and POMDP solution procedures utilize offline knowledge about the environment and provide detailed policies for relatively small systems with tractable state and action spaces. However, in large...